Speech/Nonspeech detector

Here are some notes on the current version of the speech/nonspeech detector known as "spnsp".

The speech/nonspeech detector takes rasta features as input and outputs "mpf-audio-features/speech-nonspeech-scores". These are two floating point numbers per 10ms of audio representing the speech score and the nonspeech score. If the speech score is greater than the nonspeech score, that 10ms chunk is (probably) speech. The actual values of the scores are not terribly meaningful, being dependent on how the models were trained. The ratio between the scores should be a good measure of confidence. The scores may be negative.

Note that the component is a detector not a segmenter. It outputs scores on each frame. The results could be used raw for something like a speech-VU Meter, indicating how speech-like (vs. noise, laughter, silence, etc.) the audio currently is, or processed by a downstream component to generate a segmentation.

The current implementation is a very fast Gaussian mixture system. Its accuracy is actually fairly good on matched data (e.g. head-worn microphones from meetings). The component requires a model for speech and a model for nonspeech. We've included such a model trained on a tiny amount of data from the ICSI Meeting Corpus.

The code was originally written in C++. We've written some bridge routines to allow easy calling from C (mpf handles C++ gracefully, but vanilla gstreamer doesn't). The C++ routines are unchanged from the command line version we use at ICSI. The file spnsp.c contains the gstreamer handling code; gmmbridge.cpp and gmmbridge.h contain the glue to the original C++ code.

We have included an excerpt from the ICSI Meeting Corpus as an example. There's a README.icsi with some installation instructions (which are more for our internal use, since you guys are quite familiar with the procedure!).

Meaning of negative values

Hi Adam,

Thanks for the notes.

To clarify my understanding, which of the following statements is correct?

float speech, nonspeech;

boolean isSpeech = speech > nonspeech;

or

isSpeech = absolute_value(speech) > absolute_value(nonspeech);

Thanks.

Shawn

isSpeech

boolean isSpeech = speech > nonspeech is correct.

However, isSpeech = ((speech/nonspeech) > c) might be better, since some apps might want to NEVER miss speech, while some might want to NEVER miss nonspeech. Tuning c lets you trade off between the two.

 

ratio

Note that if you have values of speech = 1 and nonspeech = -2 with a default value of c = 1, you will get different results with the ratio and the direct comparison.

So the ratio doesn't seem valid.

Thanks.

Shawn

Negative scores

You're right, of course.

The metric that's usually used is the log of the likelihood ratio. The scores as output by the component are the logs of the likelihood of speech and nonspeech. So I (incorrectly) suggested the ratio of the log likelihoods rather than the log of the likelihood ratio! So it's just isSpeech = ((speech-nonspeech) > c)
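As a minimal sketch of the corrected rule, assuming the two floats for a frame have already been read into `speech` and `nonspeech` (the class and method names here are hypothetical, not part of spnsp):

```java
// Hypothetical helper, not part of spnsp: thresholds the log likelihood ratio.
final class SpeechDecision {
    // speech and nonspeech are the two log-likelihood scores for one 10 ms frame.
    // c = 0 reproduces the plain comparison speech > nonspeech. Raising c labels
    // fewer frames as speech; lowering it (below zero) misses less speech.
    static boolean isSpeech(float speech, float nonspeech, float c) {
        return (speech - nonspeech) > c;
    }

    public static void main(String[] args) {
        // The example values from the thread: the difference is 3.
        float speech = 1.0f, nonspeech = -2.0f;
        System.out.println(isSpeech(speech, nonspeech, 0.0f));
        System.out.println(isSpeech(speech, nonspeech, 5.0f));
    }
}
```

Unlike the ratio of the two scores, the difference behaves sensibly when one or both scores are negative.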

Since the scores as output by the detector are log likelihoods, you can interpret the scores as probabilities with speechProb = exp(speech)/(exp(speech)+exp(nonspeech)) and nonspeechProb = exp(nonspeech)/(exp(speech)+exp(nonspeech)), but the log likelihood ratio is generally better.
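The two views are closely related: dividing the numerator and denominator of speechProb by exp(speech) gives 1/(1+exp(-(speech-nonspeech))), so the probability is just a function of the log likelihood ratio. A sketch of that form follows; the class and method names are illustrative assumptions, not part of the component.

```java
// Illustrative only: maps the two log-likelihood scores to values in [0, 1].
final class SpeechProbability {
    // Algebraically equal to exp(speech) / (exp(speech) + exp(nonspeech)),
    // but written in terms of the difference so that each score is never
    // exponentiated on its own.
    static double speechProb(double speech, double nonspeech) {
        double llr = speech - nonspeech; // log likelihood ratio
        return 1.0 / (1.0 + Math.exp(-llr));
    }

    static double nonspeechProb(double speech, double nonspeech) {
        return 1.0 - speechProb(speech, nonspeech);
    }
}
```

Equal scores map to 0.5, and the value moves toward 1 as the speech score pulls ahead of the nonspeech score.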

   Adam

P.S. Is there any way to get an email when a comment is added?

probabilities and test file

Thanks Adam.

Using the speech probability function you suggest, I get values like below for the first part of the Bed003-excerpt.wav file.

Is this correct?  It seems to jump back and forth a lot. To be useful for someone trying to identify sections of speech should this data be interpreted with some sort of smoothing function?

What is the meaning of the Not-A-Number values?

Can they be mapped to 0 or 1?

Thanks.

Shawn

First 1/4 second of Bed003-excerpt.wav.spnsp.txt:

0.0
NaN
0.0
NaN
1.0
0.0
0.0
0.2052099
0.5
0.0
NaN
NaN
NaN
0.5001138
0.5000061
NaN
0.5
0.0
0.5
0.5
NaN
0.5
1.0
NaN
0.95008856

For reference, here's my code that is generating these values from the binary result file:

          speech = din.readFloat();
          nonspeech = din.readFloat();
          speechProbability = Math.exp(speech) / (Math.exp(speech) + Math.exp(nonspeech));
          speechProbabilityString = "" + speechProbability + "\n";
          fout.write(speechProbabilityString.getBytes());

Jumping around and NaNs

The jumping around is normal and expected - a downstream component would have to do smoothing. I'm working on an HMM-based smoother now that will work well for speech recognition, but other applications might use other methods. The raw values will only be useful for something like a VU-meter that provides a rough measure of the "speechiness" of each 10ms frame.
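As one example of an "other method", here is a sketch of a centered moving average over per-frame values (for instance speech minus nonspeech). This is not the HMM-based smoother mentioned above; the class name, method name, and the `halfWidth` parameter are assumptions made for illustration.

```java
// Illustrative smoother, not the HMM-based one: averages each frame with its neighbours.
final class FrameSmoother {
    // Returns a new array where element i is the mean of values[i - halfWidth .. i + halfWidth].
    // Near the start and end the window is clipped to the frames that exist.
    static double[] movingAverage(double[] values, int halfWidth) {
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            int lo = Math.max(0, i - halfWidth);
            int hi = Math.min(values.length - 1, i + halfWidth);
            double sum = 0.0;
            for (int j = lo; j <= hi; j++) {
                sum += values[j];
            }
            out[i] = sum / (hi - lo + 1);
        }
        return out;
    }
}
```

With 10ms frames, a `halfWidth` of 10 averages over roughly 0.2 seconds; the right width depends on the application.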

I'm not sure about the NaNs. I didn't get them in my debugging, but I'll take a look again tomorrow and see if I can figure out what might cause them.

 

NaNs

The NaNs appear to be a result of the division in your code when both likelihoods are very low (large negative). This indicates that the particular frame is not a good match to the speech OR the nonspeech models from the training data. A fix would be to set the probability of speech to 0.5 if both the likelihoods are very small.
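A sketch of that fix, under the assumption that the NaNs come from both exponentials underflowing to zero (so the division is 0.0/0.0). The class and method names are hypothetical, and this covers only the underflow case described here.

```java
// Hypothetical guard around the probability formula used in the thread.
final class GuardedSpeechProbability {
    static double speechProb(float speech, float nonspeech) {
        double es = Math.exp(speech);
        double en = Math.exp(nonspeech);
        double denom = es + en;
        // Both scores are so negative that exp() returned 0.0 for each;
        // es / denom would be 0.0 / 0.0, which is NaN in Java.
        if (denom == 0.0) {
            return 0.5; // frame matches neither model: report "don't know"
        }
        return es / denom;
    }
}
```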

   Adam

P.S. Just a reminder that we spent very little time training these models, so the accuracy is not likely to be too good.

 

Models

Do you have any better models that you can release?

Can you provide any information on how to create our own models?

Do you have any test data with expected (perfect) results or other means of evaluating the effectiveness of the models?

Also, I noticed that the output data seems to be slightly more than would be expected based on the length of the input.

For example, an input with length of 47.71 seconds (as reported by ffmpeg) produces an output file with 5070 pairs of floats rather than the expected 4771 pairs. Have you seen this? Can you explain it?

Thanks!

Shawn

Training, etc.

In our last meeting with Appscio, we agreed that implementation is most important for this phase rather than training, so we have not concentrated on training. If training has become higher priority, we can switch. Note that for high accuracy, it's best to train on matched data.

We can provide software for training, but as we discussed early on, it uses software that is free but not open. Specifically, we use HTK, which we are not allowed to redistribute. As a result, packaging of training software is somewhat problematic. Again, if this has become higher priority, we can switch to working on this.

We typically benchmark vs. transcripts on one of the NIST test sets. Given the machine learning approach, there's no real "perfect" result other than the hand-annotated transcripts.

Regarding the length, we would actually expect it to be slightly shorter because of windowing. How many features are output by the feature generator? The speech/nonspeech detector should output exactly the same number of frames, so the problem is likely in the rasta component rather than the speech/nonspeech component. I'll take a look.

No change in priority

It's more important to have a complete pipeline soon, rather than improved performance on the speech/non-speech component. But, it will be good (IMO) for MySTT to provide a short bullet-point sort of cheat sheet on how to do relevant training for their own needs, and how to attach their own models to the components.

BTW, Thom and Shawn have been hooking up the MySTT speech/non-speech component to one of our existing customer apps just to demo a first use of MySTT ... it's pretty cool ;)

mapping to a confidence value?

Rather than 'c' in [0, positive-infinity] is there a mapping f(speech,nonspeech) into [0..1] that would behave sort of like a probability-of-speech?