Project author: JaesungBae

Project description:
Speech command recognition with capsule network & various NNs / KWS on Google Speech Command Dataset.
Language: Python
Project URL: git://github.com/JaesungBae/Speech-Command-Recognition-with-Capsule-Network.git

End-to-End Speech Command Recognition with Capsule Network

INTERSPEECH 2018 paper: link

We apply the capsule network to capture the spatial relationships and pose information of speech spectrogram features along both the frequency and time axes. On a dataset of one-second speech commands, our proposed end-to-end SR system with capsule networks achieves better results than baseline CNN models on both clean and noise-added test sets.

  • 20 JAN 2019: Other baseline keyword spotting (KWS) models are also provided in the CNN code.

Getting Started

The code is written for Python 2 (2.7.12).


You should be able to import the following libraries:

    tqdm, numpy(1.14.1), termcolor, scipy, sklearn, scikits
    tensorflow(1.6.0), keras(2.1.4)

    pip install numpy
    pip install termcolor
    pip install scipy
    pip install sklearn
    pip install scikit-learn
    pip install tensorflow-gpu==1.6.0
    pip install keras==2.1.4
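
Before running anything, it can help to confirm that the libraries above are importable in the current environment. The following is a minimal sketch and not part of the repository: the helper name `check_imports` is hypothetical, and it only reports whether each module can be imported and which `__version__` string it exposes, if any.

```python
import importlib


def check_imports(module_names):
    """Map each module name to its version string, "unknown", or None if missing."""
    report = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = None  # not installed, or not importable here
            continue
        # Many packages expose __version__; fall back to "unknown" otherwise.
        report[name] = str(getattr(module, "__version__", "unknown"))
    return report


# Example call (import names, not pip names; scikit-learn is imported as "sklearn"):
#   check_imports(["tqdm", "numpy", "termcolor", "scipy", "sklearn",
#                  "tensorflow", "keras"])
```

Comparing the reported versions with the ones listed above (numpy 1.14.1, tensorflow 1.6.0, keras 2.1.4) is left to the reader.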

Speech Feature Generation


We use the ‘Google Speech Command Dataset’. For details, refer to the blog and the Download Link.

  • Download the dataset from the above link and unzip it. (In our case, we unzip it into a folder named ‘Google_Speech_Command’.)

Adding noise

To add noise to the original dataset, we use MATLAB with voicebox, a MATLAB library. We run the MATLAB code on a local Windows machine and upload the output to a Linux server.

  1. Unzip the downloaded Google Speech Command dataset.

  2. Create a new folder named ‘Google_Speech_Command’ and move the command folders into it. The folder structure will then look like this:

    speech_commands_v0.01.tar
    |-- [_background_noise_]
    |-- Google_Speech_Command
    |   |-- bed
    |   |-- bird
    :   :
    |   '-- zero
    |-- testing_list
    |-- validation_list
    '-- etc.
  3. Change ‘data_path’ in the MATLAB code and run it. It will create a new folder and save the noise-added audio files there.

    noise_wave_generate.m
  4. You can also change ‘SNR’ in the code to generate noisy audio files at the level you want.
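
For readers without MATLAB, the idea of mixing noise into speech at a chosen SNR can be sketched in Python. This is only an illustration under stated assumptions: it scales the noise using the mean power of the whole signal, whereas `noise_wave_generate.m` and voicebox may measure levels differently; the function names are hypothetical, and plain lists are used to keep the sketch free of dependencies.

```python
import math


def mean_power(samples):
    """Average squared amplitude of a sequence of samples."""
    return sum(x * x for x in samples) / float(len(samples))


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech` so the speech-to-noise power ratio is `snr_db` dB.

    Both inputs are equal-length sequences of floats. The noise is rescaled by
    a gain g chosen so that 10*log10(P_speech / (g^2 * P_noise)) == snr_db.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_speech = mean_power(speech)
    p_noise = mean_power(noise)
    if p_noise == 0.0:
        raise ValueError("noise is silent; cannot scale it to a target SNR")
    gain = math.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]


# Synthetic signals, only to show the call pattern.
speech = [math.sin(0.1 * i) for i in range(1000)]
noise = [0.3 * (-1) ** i for i in range(1000)]
noisy = mix_at_snr(speech, noise, snr_db=5)
```

A lower `snr_db` gives a larger gain on the noise, so the mixed signal is noisier.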

Feature Generation

Extract speech features from the raw audio files and save them as .npy files. Adjust the ‘--noise_name’ argument as needed.

    cd core
    python feature_generation.py
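
The dataset ships with `validation_list` and `testing_list` files naming the clips reserved for validation and testing; every other clip is available for training. A minimal sketch of that assignment follows. Whether `feature_generation.py` splits the data in exactly this way is an assumption, the helper names are hypothetical, and the file names in the example are made up.

```python
def read_list(lines):
    """Turn the lines of a list file into a set of clip paths, skipping blanks."""
    return set(line.strip() for line in lines if line.strip())


def assign_split(relative_path, validation_set, testing_set):
    """Return 'VALID', 'TEST' or 'TRAIN' for a path like 'bed/xxx_nohash_0.wav'."""
    if relative_path in validation_set:
        return "VALID"
    if relative_path in testing_set:
        return "TEST"
    return "TRAIN"


# Made-up entries, only to show the call pattern.
validation = read_list(["bed/aaaa0001_nohash_0.wav\n", "\n"])
testing = read_list(["zero/bbbb0002_nohash_1.wav\n"])
clips = [
    "bed/aaaa0001_nohash_0.wav",
    "zero/bbbb0002_nohash_1.wav",
    "bird/cccc0003_nohash_0.wav",
]
for clip in clips:
    print("%s -> %s" % (clip, assign_split(clip, validation, testing)))
```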

Data folder structure

    feature_saved
    |-- TEST
    |   |-- fbank
    |   |   |-- clean
    |   |   '-- [noise names]_SNR5
    |   '-- label
    |-- TRAIN
    |   |-- fbank
    |   |   |-- clean
    |   |   '-- [noise names]_SNR5
    |   '-- label
    '-- VALID
        |-- fbank
        |   |-- clean
        |   '-- [noise names]_SNR5
        '-- label
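
The layout above can be expressed as a small path helper, which may be handy when loading the saved .npy files. This is a sketch only: `feature_dir` and `label_dir` are hypothetical functions, `some_noise` is a placeholder name, and the `<noise name>_SNR<value>` pattern is read off the tree rather than taken from the repository's code.

```python
import os

SPLITS = ("TRAIN", "VALID", "TEST")


def feature_dir(root, split, noise_name=None, snr=5):
    """Directory holding fbank features for one split and one noise condition.

    noise_name=None selects the 'clean' folder; otherwise the folder name is
    '<noise_name>_SNR<snr>', following the tree shown above.
    """
    if split not in SPLITS:
        raise ValueError("split must be one of %s" % (SPLITS,))
    condition = "clean" if noise_name is None else "%s_SNR%d" % (noise_name, snr)
    return os.path.join(root, split, "fbank", condition)


def label_dir(root, split):
    """Directory holding the labels for one split."""
    if split not in SPLITS:
        raise ValueError("split must be one of %s" % (SPLITS,))
    return os.path.join(root, split, "label")


print(feature_dir("feature_saved", "TEST"))
print(feature_dir("feature_saved", "TRAIN", noise_name="some_noise", snr=5))
print(label_dir("feature_saved", "VALID"))
```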

Training & Testing

For training and testing, go into the ‘CNN’ or ‘CapsNet’ folder and run the code. You can switch modes with the ‘--is_training’ argument.


    cd CapsNet
    python main.py -m=CapsNet --is_training='TRAIN' -ex='0320_digitvec4' -d=0 --kernel=19 --primary_channel=32 --primary_veclen=4 --digit_veclen=4


Note that for testing you should set the ‘--keep’ argument to the epoch number that you want to test.

    cd CapsNet
    python main.py -m=CapsNet --is_training='TEST' -ex='0320_digitvec4' -d=0 --kernel=19 --primary_channel=32 --primary_veclen=4 --digit_veclen=4 --SNR=5 --keep=?

KWS Models Based on Various Neural Networks

KWS models based on various kinds of neural networks (NNs) are also provided in CNN/model.py.

1. Deep Neural Network (DNN) based KWS model from

  • G. Chen, C. Parada, and G. Heigold, “Small-footprint keyword spotting using deep neural networks.” in ICASSP, vol. 14. Citeseer, 2014, pp. 4087–4091.

    Include ‘ref_2014icassp_dnn’ in ex_name to use the DNN model. For example:

    python main.py --model='CNN' --ex_name='ref_2014icassp_dnn512' --is_training='TRAIN' --model_size_info 512 512 512

2. CNN based KWS model from

  • T. N. Sainath and C. Parada, “Convolutional neural networks for small-footprint keyword spotting,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.

    Include ‘ref_2015is_cnn’ in ex_name to use the CNN model. For example:

    python main.py --model='CNN' --ex_name='ref_2015is_cnn' --is_training='TRAIN' --model_size_info 21 8 94 1 1 2 3 6 4 94 1 1 1 1 32

3. Long Short-Term Memory (LSTM) based KWS model from

  • M. Sun, A. Raju, G. Tucker, S. Panchapagesan, G. Fu, A. Mandal, S. Matsoukas, N. Strom, and S. Vitaladevuni, “Max-pooling loss training of long short-term memory networks for small-footprint keyword spotting,” in Spoken Language Technology Workshop (SLT), 2016 IEEE. IEEE, 2016, pp. 474–480.

    Include ‘ref_rnn’ in ex_name to use the LSTM model. For example:

    python main.py --model='CNN' --ex_name=ref_rnn_lstm --is_training='TRAIN' --model_size_info 64 32 0

4. Convolutional Recurrent Neural Network (CRNN) based KWS model from

  • S. O. Arik, M. Kliegl, R. Child, J. Hestness, A. Gibiansky, C. Fougner, R. Prenger, and A. Coates, “Convolutional recurrent neural networks for small-footprint keyword spotting,” arXiv preprint arXiv:1703.05390, 2017.

    Include ‘ref_crnn’ in ex_name to use the CRNN model. For example:

    python main.py --model='CNN' --ex_name=ref_crnn --is_training='TRAIN' --model_size_info 32 20 5 8 2 2 32 1 64
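
The naming rule used in the four cases above, where a substring of ex_name picks the architecture, could look roughly like the following. This is a guess at the dispatch logic for illustration, not the code in CNN/model.py: the function name, the returned labels, and the behavior when no key matches are all assumptions.

```python
# Substrings quoted in the text above, mapped to illustrative labels.
MODEL_KEYS = [
    ("ref_2014icassp_dnn", "DNN"),
    ("ref_2015is_cnn", "CNN"),
    ("ref_rnn", "LSTM"),
    ("ref_crnn", "CRNN"),
]


def select_model(ex_name):
    """Return the label of the first key contained in `ex_name`, or None.

    None stands for "no reference model requested"; what the repository does
    in that case is not covered by this sketch.
    """
    for key, label in MODEL_KEYS:
        if key in ex_name:
            return label
    return None


for name in ("ref_2014icassp_dnn512", "ref_rnn_lstm", "ref_crnn", "0320_digitvec4"):
    print("%s -> %s" % (name, select_model(name)))
```

Note that ‘ref_crnn’ does not contain ‘ref_rnn’ as a substring, so the order of the LSTM and CRNN keys does not matter here.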


Preprocessing source code from https://github.com/zzw922cn/Automatic_Speech_Recognition.

Base capsule network keras source code from https://github.com/XifengGuo/CapsNet-Keras.


Jaesung Bae - Korea Advanced Institute of Science and Technology (KAIST)

contact: bjs2279@gmail.com