Keyword spotting
Detect a set of predefined keywords in an audio clip.
The end-to-end framework for the keyword spotting task consists of four stages. A 1-second sliding window splits the audio into frames. A Voice Activity Detection (VAD) module or a silence filter keeps only the frames that contain human voice. A feature extraction module then computes the Mel spectrogram or the Mel-frequency cepstral coefficients (MFCCs) of those frames, and these features are the input to the model. Finally, a fusion rule aggregates the per-frame predictions into a single prediction.
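The pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the 0.5-second hop, the RMS energy threshold used as a stand-in for the VAD or silence filter, and probability averaging as the fusion rule are all assumptions, and `extract_features` and `model` are hypothetical callables supplied by the caller (for example, a Mel spectrogram function and a trained classifier).

```python
import numpy as np

def sliding_windows(signal, sr, win_s=1.0, hop_s=0.5):
    """Split a 1-D signal into 1-second windows (the hop size is an assumption)."""
    win, hop = int(win_s * sr), int(hop_s * sr)
    if len(signal) < win:
        # Zero-pad clips shorter than one window.
        signal = np.pad(signal, (0, win - len(signal)))
    starts = range(0, len(signal) - win + 1, hop)
    return np.stack([signal[s:s + win] for s in starts])

def silence_filter(windows, threshold=1e-3):
    """Keep windows whose RMS energy exceeds a threshold (simple stand-in for a VAD)."""
    rms = np.sqrt(np.mean(windows ** 2, axis=1))
    return windows[rms > threshold]

def fuse(probs):
    """Assumed fusion rule: average per-window class probabilities, then argmax."""
    return int(np.argmax(probs.mean(axis=0)))

def spot_keyword(signal, sr, extract_features, model):
    """Run the full pipeline; returns a class index, or None if no voiced window is found."""
    windows = silence_filter(sliding_windows(signal, sr))
    if len(windows) == 0:
        return None
    # extract_features and model are hypothetical, caller-supplied callables.
    probs = np.stack([model(extract_features(w, sr)) for w in windows])
    return fuse(probs)
```

Other fusion rules, such as majority voting over per-window argmax labels, fit the same interface by swapping out `fuse`.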
We tried 4 different methods: