Putting the World Service radio archive online with machine-generated and crowd-sourced metadata.
Project from 2011 to 2014
What we are doing
We built a prototype website containing the whole of the BBC World Service English-language radio archive. We did this by developing algorithms that listen to the radio programmes and create new descriptive metadata automatically, and we then provided the ability for people to correct or add to this data. The video above shows the prototype in action.
Why it matters
We want to make it easier to catalogue and cross-reference large video and audio collections such as the BBC's archive, and therefore create enjoyable and useful ways to explore our wealth of programmes and discover hidden gems when the archives are made public. To do this we need metadata about these programmes, and often it doesn't exist in a useful form.
Manually tagging programmes with metadata is expensive and time-consuming, so we are researching advanced algorithms and machine-learning techniques that can do it automatically. And where these methods aren't good enough, we want to harness the power of the crowd to improve the metadata.
Our Goals
- To develop automated methods to create metadata for audio-visual archives where none, or not much, exists
- To develop features that encourage people to add to this automated metadata, and to understand if this leads to increased accuracy
- To determine if it is acceptable to launch an archive where the metadata hasn't been comprehensively checked by hand
- To explore the features required to make such an archive proposition work
- To understand what kind of metadata and tags are good and useful
Outcomes
This project started as part of the ABC-IP workstream and is a follow-up to KiWi, a project aimed at using Amazon Web Services to process the large amount of audio in the World Service archive. Some components of this have been made available on GitHub, including Ruby parsers for Wikipedia's infoboxes.
As well as giving us the chance to explore KiWi and cloud processing further, this project resulted in a prototype for the World Service audio archive.
Following this project, BBC World Service worked to transfer many of the programmes into iPlayer, making over 20,000 additional archive programmes available to the public.
How it works
Our starting point is the massive audio archive of the World Service in English, dating back six decades and covering over 70,000 radio programmes, or more than three years' worth of continuous audio. Metadata for this archive is currently sparse or non-existent.
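A quick scale check on those figures (the ~25-minute average programme length used here is an illustrative assumption, not stated in the source):

```python
# Rough sanity check for the archive figures quoted above.
programmes = 70_000
avg_minutes = 25  # assumed average duration per programme (illustrative)

total_hours = programmes * avg_minutes / 60
total_years = total_hours / (24 * 365)
print(f"{total_hours:,.0f} hours, roughly {total_years:.1f} years of continuous audio")
```

At that assumed average length, 70,000 programmes come to a little over three years of continuous audio, consistent with the figure above.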
To counter this, we first used speech-to-text technology to create transcripts, albeit "noisy" ones. We then built a "semantic tagger" called KiWi, specially designed to work on these noisy transcripts, that automatically assigns topics, drawn from Wikipedia's store of structured data, to the radio programmes.
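A much-simplified sketch of that tagging step, assuming a tiny hand-made topic lexicon in place of the Wikipedia-derived data (the names, lexicon, and threshold here are illustrative, not the project's actual KiWi implementation):

```python
import re
from collections import Counter

# Illustrative topic lexicon mapping surface forms to topic identifiers;
# the real tagger drew its topics from Wikipedia-derived structured data.
TOPIC_LEXICON = {
    "nelson mandela": "Nelson_Mandela",
    "apartheid": "Apartheid",
    "south africa": "South_Africa",
}

def tag_transcript(transcript, min_mentions=2):
    """Assign topics whose surface forms appear often enough in a
    noisy speech-to-text transcript."""
    text = re.sub(r"[^a-z ]", " ", transcript.lower())
    counts = Counter()
    for surface, topic in TOPIC_LEXICON.items():
        counts[topic] = len(re.findall(surface, text))
    # Require several mentions so that one-off speech-recognition
    # errors do not become tags.
    return [t for t, n in counts.most_common() if n >= min_mentions]

transcript = ("nelson mandela spoke about apartheid... apartheid laws "
              "in south africa... mandela said south africa...")
print(tag_transcript(transcript))
```

Working from mention counts rather than single matches is one simple way to tolerate noisy transcripts: a topic only becomes a tag if it recurs.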
From this data we have built a prototype website that lets people explore this archive. And while doing so they can approve, correct, or add to this machine-generated metadata to make the whole thing better for all. You can read more about the project on our blog.
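One simple way to fold audience feedback into machine-generated tags is to keep a running score per tag and drop tags the crowd rejects. This is an illustrative aggregation scheme, not the prototype's actual logic:

```python
# Hypothetical vote aggregation: each machine-generated tag starts with a
# small prior score, and listener approvals/rejections shift it up or down.
def aggregate_tags(machine_tags, votes, prior=1, threshold=0):
    """Return tags whose score (prior + approvals - rejections) stays
    above the threshold; tags added by users start from a zero prior."""
    scores = {tag: prior for tag in machine_tags}
    for tag, approve in votes:
        scores.setdefault(tag, 0)  # a brand-new tag added by a user
        scores[tag] += 1 if approve else -1
    return sorted(t for t, s in scores.items() if s > threshold)

votes = [("Apartheid", True), ("Cricket", False),
         ("Cricket", False), ("Nelson_Mandela", True)]
print(aggregate_tags(["Apartheid", "Cricket"], votes))
```

Here the machine-suggested "Cricket" tag is voted out by two rejections, while a user-added "Nelson_Mandela" tag survives with one approval.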
Project Partners
- Metadata specialists