The Problems We're Solving


Video and audio (AV) data is being generated and consumed at exponentially growing rates. Though the data is incredibly rich, it is also essentially unstructured and orders of magnitude larger than conventional structured data. Its unstructured nature, combined with its size, makes multimedia data unwieldy to manage using conventional access control, search, policy enforcement, storage control and transmission software technologies.

Simple compression algorithms can eliminate redundancy and reduce the overall size of audio and video files. But managing their contents requires unlocking the meaning and structure within these files, i.e. the semantics. Semantics can be expressed as structured metadata describing "events" that occur at specific points in time within an AV file. These events can be as simple as motion beginning or ending or a person starting to speak, or as complex as the text of the speech in an audio track or the recognition of a person or a suspicious behavior.
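As a rough illustration of "structured metadata describing events", the sketch below models a timestamped event as a small Python record and selects the events that fall within a time window. The class name `Event`, its fields, and the helper `events_between` are hypothetical names chosen for this example; they are not part of any existing format or Appscio API.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """One semantic event at a point in time (hypothetical schema)."""
    kind: str        # e.g. "motion_start", "speech_start"
    time_s: float    # offset from the start of the AV file, in seconds
    channel: str     # "audio" or "video"
    details: dict = field(default_factory=dict)  # optional extra metadata


def events_between(events, start_s, end_s):
    """Return events whose timestamp falls in [start_s, end_s], in time order."""
    selected = [e for e in events if start_s <= e.time_s <= end_s]
    return sorted(selected, key=lambda e: e.time_s)


events = [
    Event("speech_start", 12.4, "audio", {"speaker": "unknown"}),
    Event("motion_start", 3.0, "video"),
    Event("motion_end", 47.5, "video"),
]

# Events in the first 15 seconds of the file, ordered by time.
first_15 = events_between(events, 0.0, 15.0)
```

Once events are plain records like these, ordinary query and filtering code can operate on them without touching the underlying media.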

Combinations of these semantic event descriptions define logical "segments" that correspond to meaningful activities, e.g. a meeting, a transaction, a PowerPoint slide being displayed, a particular speaker talking or a certain language being spoken. These logical segments can in turn be (indirectly) managed with conventional software. For instance, access to the segment of surveillance video recorded in a conference room during a Board meeting can be restricted using existing identity management and access control software.
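The sketch below shows one way paired start and end events could be turned into a logical segment, and how a simple role check could then restrict access to it. The event names, the `Segment` class, the `policy` dictionary and the role names are all assumptions made for illustration; a real deployment would delegate the check to existing identity management software.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A labeled time span within an AV file (hypothetical schema)."""
    label: str
    start_s: float
    end_s: float


def segments_from_events(events, start_kind, end_kind, label):
    """Pair each start event with the next end event to form a segment.

    `events` is an iterable of (kind, time_in_seconds) tuples.
    """
    segments = []
    open_start = None
    for kind, t in sorted(events, key=lambda e: e[1]):
        if kind == start_kind and open_start is None:
            open_start = t
        elif kind == end_kind and open_start is not None:
            segments.append(Segment(label, open_start, t))
            open_start = None
    return segments


def may_view(user_roles, segment, policy):
    """Allow access only if the user holds every role the policy requires."""
    required = policy.get(segment.label, set())
    return required <= set(user_roles)


events = [
    ("meeting_start", 600.0),
    ("motion_start", 30.0),
    ("meeting_end", 4200.0),
]
segments = segments_from_events(
    events, "meeting_start", "meeting_end", "board_meeting"
)

# Hypothetical policy: only board members may view board meeting footage.
policy = {"board_meeting": {"board_member"}}
allowed = may_view({"board_member", "employee"}, segments[0], policy)
denied = may_view({"employee"}, segments[0], policy)
```

Segments with no policy entry are treated as unrestricted here; that default is a design choice of this sketch, not something the text prescribes.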

Temporal semantic data is available from two generic sources:

  1. Intelligent devices associated with the multimedia file (e.g. RFID tag readers in the vicinity of video conference equipment, POS systems within the field of view of security cameras, etc.), or
  2. The media files themselves (e.g. algorithms that detect, classify, track, or recognize objects within the AV channels).

Well-understood, standard mechanisms exist for capturing, integrating and consuming the structured data generated by intelligent devices. Appscio is developing a new class of software to address the issues involved in capturing, integrating, synchronizing, evaluating and consuming semantic data generated by media algorithms.

Integrating Audio/Video Algorithms

The availability of professional and consumer digital AV devices (camcorders, audio recorders, etc.) several decades ago triggered research resulting in algorithms capable of harvesting structured data from the files these devices generate. The early efforts focused on targeting weapons and supporting surveillance. To that end, the US government invested in computer vision and speech recognition research through programs such as VACE and CALO. That research produced many powerful, specific algorithms capable of recognizing objects, people, behaviors and words in digital media. Funding continues to advance the state of the art (albeit more slowly) with new approaches and faster hardware.

However, these research efforts were typically funded with a narrow focus on either audio or video algorithms to solve one specific problem. Each was funded with different requirements for media formats, evaluation criteria, test frameworks, etc. To date, no technology has been developed to serve as a standard development and deployment environment for new algorithmic work. The result is that each time researchers begin work on a new problem, they first must build a supporting environment for their algorithms. Worse yet, if and when newly developed algorithms are commercialized (many remain proofs-of-concept or published only in academic papers), they're typically embedded in vertical applications and/or custom hardware. These "stove-pipe" systems are rigid, proprietary deliverables that are extremely expensive to extend and upgrade as new and better algorithms emerge.

In addition, the absence of a standard "media algorithms" software framework causes other problems:

  1. Reuse: Building upon algorithms developed by others is difficult because they're typically packaged within different, proprietary environments, which encourages researchers to re-invent rather than re-use.
  2. Multi-modal recognition: Developing algorithms that utilize metadata from all audio and video channels within AV files requires an environment supporting the communication of metadata across and between audio and video algorithms; none exists today.
  3. Evaluation: Comparing algorithms and combinations of algorithms (e.g. combining signal conditioners with object detectors) becomes a significant engineering project when the algorithms were developed for different frameworks.
  4. Distribution/deployment:  The lack of a standard framework inhibits the rapid deployment of new algorithms.
  5. Integration:  Integrating "best-of-breed" algorithms from independent researchers requires porting, repackaging and refactoring, a significant development effort on its own.

Why is Integration Important?

Progress in the development of AV algorithms has been widespread. Thousands of developers in hundreds of academic institutions, research institutes and commercial organizations worldwide continue to invent new algorithms. But these "centers of excellence" typically focus on difficult problems with limited scope.

Exploiting the full power of these algorithms requires combining them in a variety of ways.  Within an audio and/or video channel, obtaining optimal results requires integrating signal conditioners, signal transforms, primitive object detectors and trackers, object recognizers, metadata analysis and synthesis algorithms and (potentially) CODECs.  The following examples illustrate.

Virtually all speech-to-text applications combine a variety of algorithms to produce their results.  First, a variety of decoders are deployed to decompress the data in the target audio channel.  Typically, the decompressed audio is then filtered and enhanced to reduce noise and isolate the part of the signal most likely to correspond to human speech.  Increasingly standard algorithms then harvest numeric "features" from segments of the audio signal.  Next, phoneme recognizers generate candidates which are then analyzed and combined by word recognizer algorithms.  Word candidates are then processed through algorithms that apply language models to select appropriate words and apply punctuation.  Finally, text analytics perform such functions as topic recognition.
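The chain of stages described above can be sketched as a simple composition of functions that share one calling convention. Every stage here is a placeholder that only records that it ran; the stage names mirror the steps in the paragraph, and `compose`, `placeholder_stage` and the `state` dictionary layout are hypothetical names for this example rather than any real speech recognition API.

```python
from functools import reduce


def compose(stages):
    """Chain stages so each one receives the previous stage's output."""
    def run(data):
        return reduce(lambda acc, stage: stage(acc), stages, data)
    return run


def placeholder_stage(name):
    """Stand-in for a real algorithm: it only records that it ran."""
    def stage(state):
        return {"audio": state["audio"], "trace": state["trace"] + [name]}
    return stage


# Stage order follows the speech-to-text description in the text.
STAGE_NAMES = [
    "decode",
    "filter_and_enhance",
    "extract_features",
    "recognize_phonemes",
    "recognize_words",
    "apply_language_model",
    "analyze_text",
]

speech_to_text = compose([placeholder_stage(n) for n in STAGE_NAMES])
result = speech_to_text({"audio": b"\x00\x01", "trace": []})
```

Because every stage accepts and returns the same kind of value, swapping in a better phoneme recognizer would mean replacing one entry in the list. That uniform interface is the assumption doing the work here; without it, each substitution becomes a porting effort.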

Similar combinations are employed in the video domain to recognize text, faces, license plates, etc. Future improvements will increasingly combine cross-domain (multi-modal) algorithms to produce even higher quality results. For example, recognizing the identity of a particular speaker from the text superimposed on broadcast video should improve the quality of speech recognition.

Standard Metadata Vocabularies

Integrating elementary algorithms requires that they communicate, and effective communication between independently developed algorithms requires standards. Though there have been attempts to define such standards, to date no standard multi-modal metadata formats and vocabularies have emerged. This presents another significant obstacle to exploiting the capabilities of the full spectrum of AV algorithms.

Specifically, many of the important algorithms are, at their core, pattern recognizers.  These algorithms detect "events" in a channel within an AV stream and express confidence levels associated with the recognition of those events.  For example, a "face detector" recognizes objects in a video channel that it believes (with a certain degree of confidence) are human faces.  Typically, such an algorithm will report its results as "regions of interest" within a frame containing faces and associated numeric confidence levels.  If this face detector's output is to be used as input to a face recognizer from another source, they must share a common definition of a "region of interest" and how "confidence" is measured.
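To make the interoperability problem concrete, the sketch below defines one possible shared "region of interest" record, with confidence fixed to the range 0.0 to 1.0, plus an adapter for a detector that reports corner coordinates and a percentage score. The record layout, the confidence scale and the function names are assumptions for illustration only; no such standard exists, which is the point of this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionOfInterest:
    """A rectangular region in one video frame (hypothetical shared format)."""
    frame: int         # frame index within the video channel
    x: int             # left edge, in pixels
    y: int             # top edge, in pixels
    width: int         # in pixels
    height: int        # in pixels
    confidence: float  # agreed scale: 0.0 (none) to 1.0 (certain)

    def __post_init__(self):
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")
        if self.width <= 0 or self.height <= 0:
            raise ValueError("width and height must be positive")


def from_corner_detector(frame, left, top, right, bottom, score_percent):
    """Adapt a detector that reports corners and a 0-100 score."""
    return RegionOfInterest(
        frame, left, top, right - left, bottom - top, score_percent / 100.0
    )


def confident_regions(regions, threshold):
    """Keep only regions at or above the recognizer's confidence threshold."""
    return [r for r in regions if r.confidence >= threshold]


roi = from_corner_detector(12, 100, 50, 180, 150, 87.0)
for_recognizer = confident_regions([roi], 0.8)
```

Without an agreed record like this, every detector and recognizer pairing needs its own adapter, and a score of 87 from one source cannot be compared with 0.87 from another.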

A standard format for representing the metadata generated by these algorithms must emerge, followed over time by standard vocabularies for describing "events" (e.g. a "region of interest") and agreement on how to express confidence.
