Quick skim of "Automated Metadata in Multimedia Information Systems"

I just skimmed "Automated Metadata in Multimedia Information Systems - Creation, Refinement, Use in Surrogates, and Evaluation". Turns out the lecture is by Mike Christel of CMU, which I hadn't noticed ... it's a quick read, and worth the hour it takes. Here are my quick notes:

Some interesting tidbits he cites re ASR results:

  • 1994 Sphinx 2 ASR on broadcast news showed error rates of 65%; Sphinx 3, *trained* on broadcast news, got the error rate down to 24%. Adding in general news vocabularies based on "news of the day" from crawling CNN, AP, and Reuters brought the error rate down further, to 19% ... evidence of the general value of using better vocabularies.
  • ASR word error rates vary by conditions: 5% to 10% in a lab environment, 20% in a TV studio, 30% in broadcast news, 40% in TV dialog, and up to 90% in advertisements/commercials (due to music and inter-word silence compression to fit in 30-second spots)
  • Search performance (as opposed to transcription performance) seems to be robust against the problems of limited vocabulary: "Other experiments showed that effects of words missing in the recognizer's lexicon could be mitigated. Specifically, word error rates up to 25% did not significantly impact information retrieval and error rates of 50% still provided 85%-90% of the recall and precision relative to fully accurate transcripts in the same retrieval system" ... "The TREC Spoken Document Retrieval (SDR) track ended in 1997 with the conclusion that retrieval of excerpts from broadcast news using ASR for transcription permits relatively effective information retrieval, even with word error rates of 30%"
  • Mike notes one use case that doesn't depend on ASR transcription accuracy at all: taking a human-generated transcription and automatically adding timing info, to support rich UI such as jump-to navigation. (We've run into that use case with commercial ASR providers, too.)
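
Since nearly all of the numbers above are word error rates, here's a minimal sketch of how I understand WER is usually computed: word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference transcript. The function name and the lowercase/whitespace normalization are my own assumptions, not anything from the lecture or from Sphinx's scoring tools.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length.

    Assumption: words are compared case-insensitively after splitting on
    whitespace; real scoring tools apply more careful normalization.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Note that insertions count as errors too, so WER can exceed 100% when the recognizer emits a lot of extra words.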

Some non-ASR bits:

  • terminology he uses:
      • "storyboards" and "thumbnails" and "abstracts" are examples of "document surrogates" - surrogates stand in to represent the full document
      • On pg 41: "... semantic compression [is] regulating the contents of multimedia presentation based on personal interests"
  • user interface effectiveness: significant improvements in performance and satisfaction on a fact-finding task when interleaving text into a storyboard (mixed-analytic output)
  • phrase-length surrogates work better than word-length surrogates. This was also true with playable video summaries (video skimming accompanied by audio chunking) - users do better with 5-second audio chunks at each play-point than with twice as many 2.5-second chunks.
  • There seems to be active interest in playable summaries as a replacement for storyboards; interesting work done in the TRECVID BBC Rushes Summarization task seems to show "indicative and informative summaries" can be as short as 2% of the source material.
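
To get a feel for the last two numbers together, here's a back-of-the-envelope sketch of what a 2% skim budget buys you at the two chunk lengths. Everything here is my own assumption for illustration: the function name is made up, and spacing the play-points evenly across the source is a placeholder (real skims would presumably pick play-points based on content).

```python
def skim_play_points(source_seconds: float, ratio: float, chunk_seconds: float) -> list[float]:
    """Return start times (in seconds) for fixed-length chunks in a skim.

    Assumption: play-points are spread evenly over the source; this only
    illustrates the time budget, not how a real system selects content.
    """
    budget = source_seconds * ratio              # total skim duration allowed
    count = int(budget // chunk_seconds)         # whole chunks that fit in the budget
    if count == 0:
        return []
    spacing = source_seconds / count             # distance between play-points
    return [round(i * spacing, 2) for i in range(count)]


if __name__ == "__main__":
    one_hour = 3600
    long_chunks = skim_play_points(one_hour, 0.02, 5.0)
    short_chunks = skim_play_points(one_hour, 0.02, 2.5)
    print(len(long_chunks), len(short_chunks))
```

For a one-hour source at 2%, the budget is 72 seconds: 14 chunks of 5 seconds, or 28 chunks of 2.5 seconds. So the comparison in the study is roughly "fewer, longer stops" versus "twice as many, shorter stops" for about the same total playing time.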