
KiWi

Tagging speech radio programmes

Published: 1 January 2011

In this project we have investigated the possibility of automatically assigning topics to large programme archives in a reasonable time.

Project from 2011 - 2012

What we are doing

KiWi is a framework aimed at automatically identifying topics in speech radio programmes, with topic identifiers drawn from Linked Open Data sources such as DBpedia.

In order to generate such topics in a reasonable time for large programme archives, we built a processing infrastructure distributing computations on cloud resources (e.g. Amazon EC2). We used this infrastructure to automatically tag the entire BBC World Service archive (70,000 programmes) in around two weeks.

Why it matters

The BBC manually tags recent programmes on its website. Editors draw and assign these tags from open datasets made available within the Linked Data cloud, but this is a time-consuming process. Aside from recent programming, which is tagged, the BBC has a very large radio archive that is currently untagged.

Tags enable a wide variety of use cases, such as the dynamic building of topical aggregations, retrieval through topic-based search, or cross-domain navigation. Automatic tagging of archive content would ensure archive programmes are as findable as recent programmes. It would mean that topic-based collections of archive content could easily be built, for example to find archive content that relates to current news events. KiWi provides an algorithm and an infrastructure to automatically tag very large programme archives in a cost-effective and scalable manner.

Outcomes

We used this algorithm and infrastructure to automatically tag the entire BBC World Service archive (around 70,000 programmes, or three years of continuous audio), for which we have very few annotations. The resulting tags were used to unlock this archive and make it available through the World Service Archive prototype. Some of the tools created have been made available as open-source software on GitHub.

How it works

We built an automated tagging algorithm using speech audio as an input.

We use open-source speech recognition software, with an acoustic model and a language model extracted from the Gigaword corpus. The resulting transcripts are very noisy and have no punctuation or capitalisation, which means off-the-shelf concept tagging tools perform badly on them. We therefore designed an alternative concept tagging algorithm.

We start by generating a list of web identifiers used by BBC editors to tag programmes. Those web identifiers identify people, places, subjects and organisations within DBpedia. This list of identifiers constitutes our target vocabulary. For each of those identifiers, we aggregate a number of textual labels and look for them in the automated transcripts. The output of this process is a list of candidate terms found in the transcripts, and a list of possible corresponding DBpedia web identifiers for each of them. For example, if ‘paris’ was found in the transcripts, it could correspond to at least two possible DBpedia identifiers: "Paris" and "Paris, Texas".
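The lookup step above can be sketched as follows. This is a minimal illustration, not the project's code: the label index here is a hypothetical hand-built dictionary, whereas the real vocabulary is aggregated from the identifiers BBC editors use.

```python
# Map lower-cased labels to the DBpedia identifiers they may denote
# (hypothetical data for illustration).
label_index = {
    "paris": ["http://dbpedia.org/resource/Paris",
              "http://dbpedia.org/resource/Paris,_Texas"],
    "eiffel tower": ["http://dbpedia.org/resource/Eiffel_Tower"],
}

def find_candidates(transcript, index, max_ngram=3):
    """Scan a token stream for labels from the target vocabulary,
    returning {term: [candidate DBpedia identifiers]}."""
    tokens = transcript.lower().split()
    found = {}
    for n in range(1, max_ngram + 1):
        for i in range(len(tokens) - n + 1):
            term = " ".join(tokens[i:i + n])
            if term in index:
                found[term] = index[term]
    return found

candidates = find_candidates(
    "tonight we report from paris near the eiffel tower", label_index)
# 'paris' is ambiguous: two candidate identifiers are returned for it
```

Because automated transcripts lack capitalisation, everything is matched in lower case; disambiguating among the returned candidates is handled by the next step.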

In order to disambiguate and rank candidate terms, we consider the subject classification in DBpedia, derived from Wikipedia categories and encoded as a hierarchy. We start by constructing a vector space for those SKOS categories, capturing hierarchical relationships between them. Two categories that are siblings will have a high cosine similarity. Two categories that do not share any ancestor will have a null cosine similarity. The further away a common ancestor between two categories is, the lower the cosine similarity between those two categories will be. We implemented such a vector space model within our RDF-Sim project.

We consider a vector in the same space for each DBpedia web identifier, corresponding to a weighted sum of all the categories attached to it. We then construct a vector modelling the whole programme, by summing the vectors of all possible corresponding DBpedia web identifiers for all candidate terms. Web identifiers corresponding to wrong disambiguations of specific terms will account for very little in the resulting vector, while web identifiers related to the main topics of the programme will overlap and add up. For each ambiguous term, we pick the corresponding DBpedia web identifier that is closest to that programme vector.
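The disambiguation step can be illustrated with a toy sketch. The category vectors below are hypothetical stand-ins: in the project they come from the RDF-Sim model over the DBpedia category hierarchy.

```python
import math

# Each candidate identifier gets a vector over SKOS categories
# (toy weights; real vectors come from the RDF-Sim model).
concept_vectors = {
    "d:Paris":        {"France": 1.0, "Cities_in_France": 1.0},
    "d:Paris,_Texas": {"Texas": 1.0, "Cities_in_Texas": 1.0},
    "d:France":       {"France": 1.0, "Countries_in_Europe": 1.0},
    "d:Eiffel_Tower": {"France": 1.0, "Towers_in_Paris": 1.0},
}

def add(u, v):
    out = dict(u)
    for k, w in v.items():
        out[k] = out.get(k, 0.0) + w
    return out

def cosine(u, v):
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Programme vector: sum of the vectors of every candidate identifier.
programme = {}
for vec in concept_vectors.values():
    programme = add(programme, vec)

# Disambiguate 'paris': pick the candidate closest to the programme vector.
best = max(["d:Paris", "d:Paris,_Texas"],
           key=lambda c: cosine(concept_vectors[c], programme))
# best == "d:Paris": the France-related categories dominate the sum
```

The key property is that wrong disambiguations (here "d:Paris,_Texas") contribute categories that barely overlap with the rest of the programme, so they score a low cosine similarity against the summed vector.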

For example, if the automated transcripts mention ‘paris’, ‘france’ and ‘tour eiffel’ a lot, the resulting programme vector will point towards France-related categories, e.g. "France" or "Cities in France". The correct disambiguation of ‘paris’ will be the one that is closest to that programme vector, hence d:Paris. We then rank the resulting web identifiers by considering their TF-IDF score and their distance to the programme vector. We end up with a ranked list of DBpedia web identifiers for each programme.
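The final ranking can be sketched as below. The exact way TF-IDF and programme-vector similarity are combined is not specified here, so a simple product is assumed, and all the counts and similarity scores are hypothetical.

```python
import math

def tf_idf(term, doc_counts, total_docs, docs_with_term):
    """Standard TF-IDF: term frequency in this programme, weighted by
    how rare the term is across the whole archive."""
    tf = doc_counts[term]
    idf = math.log(total_docs / (1 + docs_with_term[term]))
    return tf * idf

# Hypothetical numbers for one programme in a 70,000-programme archive.
doc_counts = {"paris": 12, "france": 8, "radio": 25}       # counts in this programme
docs_with_term = {"paris": 900, "france": 2500, "radio": 60000}  # archive-wide
similarity = {"paris": 0.76, "france": 0.71, "radio": 0.05}      # vs programme vector

# Assumed combination: product of TF-IDF and vector similarity.
scores = {t: tf_idf(t, doc_counts, 70000, docs_with_term) * similarity[t]
          for t in doc_counts}
ranked = sorted(scores, key=scores.get, reverse=True)
# Frequent but off-topic terms like 'radio' fall to the bottom of the list
```

The IDF factor demotes terms that occur in most programmes, while the similarity factor demotes terms unrelated to the programme's dominant topics, so both effects push generic vocabulary down the ranking.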

We separated each step of this workflow into independent, self-contained applications, or "workers". Each worker takes as input the results of the previous step of the workflow, and produces output to be given to the next step. We also configured a message-queueing system using RabbitMQ to allow workers to pick up new tasks and assign tasks to one another. We built an HTTP interface centralising all intermediate and final results, as well as keeping track of the status of each worker. We deployed a number of workers on a cloud infrastructure in order to process as much data as possible in parallel.
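The worker pattern can be sketched as follows. This is a minimal stand-in using Python's standard-library queues and threads in place of RabbitMQ and cloud instances; the worker names and payloads are illustrative only.

```python
import queue
import threading

transcribe_q = queue.Queue()   # tasks for the transcription worker
tag_q = queue.Queue()          # tasks for the tagging worker
results = {}                   # stand-in for the central HTTP results store

def transcription_worker():
    while True:
        prog_id = transcribe_q.get()
        if prog_id is None:          # sentinel: shut down
            break
        # ... run speech recognition here; hand the transcript downstream
        tag_q.put((prog_id, "transcript of " + prog_id))
        transcribe_q.task_done()

def tagging_worker():
    while True:
        task = tag_q.get()
        if task is None:             # sentinel: shut down
            break
        prog_id, transcript = task
        # ... run concept tagging here; report to the central interface
        results[prog_id] = ["d:Paris"]
        tag_q.task_done()

t1 = threading.Thread(target=transcription_worker)
t2 = threading.Thread(target=tagging_worker)
t1.start(); t2.start()

for pid in ["prog1", "prog2"]:       # enqueue two programmes
    transcribe_q.put(pid)

transcribe_q.join(); tag_q.join()    # wait for all tasks to flow through
transcribe_q.put(None); tag_q.put(None)
t1.join(); t2.join()
```

The point of the design is that each queue decouples its producer from its consumer, so more copies of any worker can be started on extra machines to clear whichever stage is the bottleneck.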

Project Team

  • Yves Raimond (PhD)
    Senior R&D Engineer
