Stefano Ivancich

Keyword Spotting

Detect a set of predefined keywords in an audio clip.


The end-to-end framework for the keyword spotting task starts with a sliding window of 1 second over the audio. A Voice Activity Detection module or a silence filter then selects only the frames that contain human voice. From those frames, a feature extraction module computes the Mel spectrogram or the Mel-frequency cepstral coefficients (MFCCs), which form the input to the model. Finally, a fusion rule aggregates the predictions for all frames into a single one.
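The pipeline described above can be sketched in a few lines of plain Python. This is only an illustrative sketch under several assumptions, not the project's actual code: the 0.5 second hop, the RMS energy threshold used as a silence filter, and the mean-probability fusion rule are all choices made here for the example, and `predict` is a hypothetical callable standing in for feature extraction plus the model.

```python
import math
from typing import Callable, Iterator, List, Optional, Sequence


def sliding_windows(samples: Sequence[float], sample_rate: int,
                    window_s: float = 1.0, hop_s: float = 0.5) -> Iterator[Sequence[float]]:
    """Yield fixed-length windows. The 1 s length follows the text;
    the 0.5 s hop is an assumption made for this sketch."""
    win = int(window_s * sample_rate)
    hop = int(hop_s * sample_rate)
    for start in range(0, len(samples) - win + 1, hop):
        yield samples[start:start + win]


def rms(window: Sequence[float]) -> float:
    """Root-mean-square energy of one window."""
    return math.sqrt(sum(x * x for x in window) / len(window))


def is_voiced(window: Sequence[float], threshold: float = 0.01) -> bool:
    """Crude energy-based silence filter (assumed stand-in for a real VAD)."""
    return rms(window) >= threshold


def fuse_mean(frame_probs: List[Sequence[float]]) -> int:
    """Assumed fusion rule: average the class probabilities over all
    frames and return the index of the highest average."""
    num_classes = len(frame_probs[0])
    fused = [sum(p[c] for p in frame_probs) / len(frame_probs)
             for c in range(num_classes)]
    return max(range(num_classes), key=fused.__getitem__)


def spot_keyword(samples: Sequence[float], sample_rate: int,
                 predict: Callable[[Sequence[float]], Sequence[float]],
                 threshold: float = 0.01) -> Optional[int]:
    """Run the whole chain; return a class index, or None if every
    window was rejected as silence."""
    voiced = [w for w in sliding_windows(samples, sample_rate)
              if is_voiced(w, threshold)]
    if not voiced:
        return None
    return fuse_mean([predict(w) for w in voiced])
```

In a real system `predict` would compute the Mel spectrogram or MFCCs of the window and pass them to the trained model, returning one probability per keyword class.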
We tried 4 different methods:
