Project description:
A TF2 implementation of WaveNet
Language: Python
Project URL: git://github.com/CODEJIN/WaveNet.git
WaveNet in TF2
This code is an implementation of WaveNet. The algorithm is based on the following papers:
Oord, A. V. D., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., ... & Kavukcuoglu, K. (2016). Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
Paine, T. L., Khorrami, P., Chang, S., Zhang, Y., Ramachandran, P., Hasegawa-Johnson, M. A., & Huang, T. S. (2016). Fast wavenet generation algorithm. arXiv preprint arXiv:1611.09482.
Oord, A. V. D., Li, Y., Babuschkin, I., Simonyan, K., Vinyals, O., Kavukcuoglu, K., ... & Casagrande, N. (2017). Parallel wavenet: Fast high-fidelity speech synthesis. arXiv preprint arXiv:1711.10433.
Salimans, T., Karpathy, A., Chen, X., & Kingma, D. P. (2017). Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517.
- This code applies the Fast WaveNet generation algorithm and the discretized mixture of logistics (MoL).
Many parts of the modules are adapted from r9y9’s WaveNet GitHub repository, and some parts of the UpsampleNet and MoL modules are adapted from fatchord’s WaveRNN GitHub repository.
Please see ‘requirements.txt’ for the required packages.

Used dataset
The currently uploaded code is compatible with the following datasets. An [O] mark to the left of a dataset name means that the dataset was actually used for the uploaded result. WaveNet requires a speaker id for multi-speaker processing, so if you want to add another dataset, you must decide how its speaker ids are labelled.
[O] LJSpeech: https://keithito.com/LJ-Speech-Dataset/
[X] Blizzard Challenge 2013: http://www.cstr.ed.ac.uk/projects/blizzard/
[O] FastVox: http://www.festvox.org/cmu_arctic/index.html
Hyper parameters
Before proceeding, please set the pattern, inference, and checkpoint paths in ‘Hyper_Parameter.json’ according to your environment.
- Setting basic sound parameters.
- Setting the parameters of WaveNet.
- In upsample, the product of all upsample scales must equal the frame shift size of the sound.
- MoL size must be a multiple of 3.
- Setting the parameters of training.
- Wav length must be a multiple of the frame shift size of the sound.
- Currently, a NaN loss occurs when this option is true, so I don’t recommend using it.
- Setting the usage of mixed precision.
- If used, the tensors are stored in 16 bits instead of 32 bits.
- The weights are stored in 32 bits, so the model is compatible with checkpoints trained under a different mixed precision setting if the rest of the parameters are the same.
- Usually, this parameter makes it possible to use a larger batch size.
- On machines that do not support mixed precision, training is extremely slow.
- When using it, I recommend increasing the epsilon of Adam to 1e-4 to prevent underflow.
- See the following reference for details.
- Setting the inference path
- Setting the checkpoint path
- Setting which GPU device is used in a multi-GPU environment.
- If using only the CPU, please set ‘-1’.
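The three numeric constraints listed above (upsample scales, MoL size, wav length) can be checked before training starts. The following is a minimal sketch; the function and argument names are hypothetical and are not the actual keys of ‘Hyper_Parameter.json’.

```python
import math


def check_hyper_parameters(upsample_scales, frame_shift, mol_size, wav_length):
    """Return a list of violated constraints (an empty list means all hold).

    All argument names are hypothetical stand-ins for the values set in the
    hyper parameter file.
    """
    problems = []

    # The upsample network must expand one mel frame into exactly
    # frame_shift audio samples.
    if math.prod(upsample_scales) != frame_shift:
        problems.append("product of upsample scales != frame shift")

    # Assumption: the multiple of 3 comes from each mixture component needing
    # three values (mixture logit, mean, log scale), as in the discretized
    # logistic mixture of Salimans et al. for one-dimensional data.
    if mol_size % 3 != 0:
        problems.append("MoL size is not a multiple of 3")

    # A training wav segment must cover a whole number of mel frames.
    if wav_length % frame_shift != 0:
        problems.append("wav length is not a multiple of frame shift")

    return problems


ok_result = check_hyper_parameters([4, 4, 4, 4], 256, 30, 256 * 7)
bad_result = check_hyper_parameters([4, 4, 4], 256, 10, 1000)
print(ok_result)
print(bad_result)
```

Running such a check once at start-up is cheaper than discovering a shape mismatch after the patterns have been loaded.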
Generate pattern
python Pattern_Generate.py [parameters]
At least one dataset must be used.
- -lj
- Set the path of LJSpeech. LJSpeech’s patterns are generated.
- -bc2013
- Set the path of Blizzard Challenge 2013. Blizzard Challenge 2013’s patterns are generated.
- -fv
- Set the path of FastVox. FastVox’s patterns are generated.
- -mc
- Set the maximum number of patterns per dataset. Patterns beyond this number are ignored.
- -mw
- The number of threads used to create the patterns.
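To make the option list above concrete, here is a minimal sketch of how flags of this shape could be parsed with `argparse`. It is an illustration only, not the actual code of Pattern_Generate.py: the defaults, help texts, and the `parse` helper are assumptions, and the dataset path in the usage line is a placeholder.

```python
import argparse


def build_parser():
    # Hypothetical parser mirroring the documented flags.
    parser = argparse.ArgumentParser(description="Pattern generation options (sketch)")
    parser.add_argument("-lj", default=None, help="Path of LJSpeech")
    parser.add_argument("-bc2013", default=None, help="Path of Blizzard Challenge 2013")
    parser.add_argument("-fv", default=None, help="Path of FastVox")
    parser.add_argument("-mc", type=int, default=None,
                        help="Maximum number of patterns per dataset")
    # The default of 1 worker is an assumption, not taken from the repository.
    parser.add_argument("-mw", type=int, default=1,
                        help="Number of threads used to create the patterns")
    return parser


def parse(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Enforce the rule that at least one dataset must be given.
    if not any([args.lj, args.bc2013, args.fv]):
        parser.error("at least one of -lj, -bc2013, -fv is required")
    return args


# Equivalent to: python Pattern_Generate.py -lj /data/LJSpeech-1.1 -mw 4
args = parse(["-lj", "/data/LJSpeech-1.1", "-mw", "4"])
print(args.lj, args.mw)
```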
※If you want to generate your own dataset
- In this implementation, the patterns of a dataset are created in two steps.
Pattern generate
- Each pattern file is a pickle file of a dict object that contains several pieces of information.
- The dict contains the following values, stored as key: value pairs.
- wav signal with range -1 to 1
- Mel: Mel spectrogram converted from the signal
- Speaker id: Speaker id of the pattern
- Speaker id is an int variable.
- Speaker id must start from 0 and increase by 1.
- Dataset label
- Please refer to here for details of the function used in pattern file generation.
- After all pattern files have been generated, they are all loaded once and their basic information is saved.
- Metadata contains three pieces of information.
- Hyper parameters related to the pattern generating at the time the patterns were created
- The shape of the signal and Mel spectrogram, the speaker id, and the dataset of each pattern file
- A file name list of pattern files
- The file name of metadata is always ‘METADATA.PICKLE’.
- Please refer to here for details of the function used in metadata file generation.
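The two steps described above (one pickle per pattern, then a single metadata file) can be sketched as follows. This is an illustration under stated assumptions, not the repository's code: the dict keys (`Signal`, `Mel`, `Speaker_ID`, `Dataset`), the metadata keys, and the pattern file naming are all hypothetical, and plain Python lists stand in for the numpy arrays a real pattern would hold. Only the file name ‘METADATA.PICKLE’ comes from the description above.

```python
import os
import pickle
import tempfile


def save_patterns(patterns, pattern_dir, hyper_parameters):
    """Write one pickle per pattern, then a METADATA.PICKLE summary."""
    # Speaker ids must start from 0 and increase by 1 without gaps.
    speaker_ids = sorted({p["Speaker_ID"] for p in patterns})
    if speaker_ids != list(range(len(speaker_ids))):
        raise ValueError("speaker ids must be 0, 1, 2, ... without gaps")

    file_names = []
    info = {}
    for index, pattern in enumerate(patterns):
        file_name = "PATTERN.{:05d}.PICKLE".format(index)  # hypothetical naming
        with open(os.path.join(pattern_dir, file_name), "wb") as f:
            pickle.dump(pattern, f, protocol=pickle.HIGHEST_PROTOCOL)
        file_names.append(file_name)
        info[file_name] = {
            "Signal_Length": len(pattern["Signal"]),
            "Mel_Shape": (len(pattern["Mel"]), len(pattern["Mel"][0])),
            "Speaker_ID": pattern["Speaker_ID"],
            "Dataset": pattern["Dataset"],
        }

    # The three pieces of information the metadata is described as holding.
    metadata = {
        "Hyper_Parameters": hyper_parameters,
        "Pattern_Info": info,
        "File_List": file_names,
    }
    with open(os.path.join(pattern_dir, "METADATA.PICKLE"), "wb") as f:
        pickle.dump(metadata, f, protocol=pickle.HIGHEST_PROTOCOL)
    return metadata


patterns = [
    {"Signal": [0.0, 0.5, -0.5, 0.25], "Mel": [[0.1, 0.2]], "Speaker_ID": 0, "Dataset": "LJ"},
    {"Signal": [0.1, -0.1, 0.3, -0.3], "Mel": [[0.3, 0.4]], "Speaker_ID": 1, "Dataset": "FV"},
]
with tempfile.TemporaryDirectory() as pattern_dir:
    metadata = save_patterns(patterns, pattern_dir, {"Frame_Shift": 4})
    saved_files = sorted(os.listdir(pattern_dir))
print(saved_files)
```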
Inference file path used for verification during training.
- Inference_Wav_for_Training.txt
- Wav paths and speaker ids that are used for inference during training.
python Model.py
- Run ‘ipython’ in the model’s directory.
- Run the following commands:
from Model import WaveNet
new_Model = WaveNet(is_Training= False)
- There are two ways to insert mels.
- Make two lists of Mel patterns and speaker ids. Each mel must be a numpy array of shape ‘[Time, Mel_dim]’, and each speaker id must be a scalar.
mel_List = [mel1, mel2, mel3, ...]
mel_Speaker_List = [speaker_id1, speaker_id2, speaker_id3, ...]
- Insert a list of wav file paths.
path_List = [
path_Speaker_List = [speaker_id1, speaker_id2, speaker_id3, ...]
- Run the following command:
mel_List= mel_List,
mel_Speaker_List= mel_Speaker_List,
wav_List= path_List,
wav_Speaker_List= path_Speaker_List,
label= None,
split_Mel_Window= 7,
overlap_Window= 1,
batch_Size= 16
- Parameters
mel_List and wav_List
- The lists you set at section 3.
mel_Speaker_List and wav_Speaker_List
- The speaker id lists you set at section 3.
label
- A label for the inferred file.
- If None, the current datetime is assigned.
split_Mel_Window
- The mel length that is calculated as a single sequence.
- The higher the value, the higher the quality, but the slower the inference.
overlap_Window
- The length of the part of each split mel that is calculated by overlapping the previous sequence.
- The larger this value, the smaller the quality degradation during deployment.
batch_Size
- How many split mel sequences are calculated at one time.
- Larger is faster, but it can cause out-of-memory problems.
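The interaction between the window, overlap, and batch parameters described above can be sketched as follows. This shows one plausible splitting scheme and is not necessarily how Model.py does it: the function names are hypothetical, and the assumption is that each window is extended backwards by the overlap so the model has some context, with the overlapped part discarded before the generated pieces are concatenated.

```python
def split_mel_frames(total_frames, split_mel_window, overlap_window):
    """Return (context_start, start, end) frame indices for each window.

    Frames in [start, end) are kept in the output. Frames in
    [context_start, start) are recomputed only as context and dropped.
    """
    spans = []
    for start in range(0, total_frames, split_mel_window):
        context_start = max(0, start - overlap_window)
        end = min(start + split_mel_window, total_frames)
        spans.append((context_start, start, end))
    return spans


def make_batches(items, batch_size):
    """Group the split windows so that batch_size of them run at one time."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# A 20-frame mel with the defaults shown in the command above (7 and 1).
spans = split_mel_frames(total_frames=20, split_mel_window=7, overlap_window=1)
groups = make_batches(spans, batch_size=2)
for span in spans:
    print(span)
```

With a frame shift of 256 samples (an assumption, consistent with 1792 samples for a window of 7), each kept window of 7 frames corresponds to 7 * 256 = 1792 generated samples.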
- The following results are based on the checkpoint at 500,000 steps.
- 100,000 steps with batch size 16 (76.15 epochs) on my desktop (Nvidia GTX 1050 Ti).
- After that, 400,000 steps with batch size 64 (1218.35 epochs) on Google Colaboratory.
- 8 speakers (1 LJSpeech + 7 FastVox) were trained.
- The results are based on the original wav files. Integration with a voice synthesizer has not been done yet.

Because of a Colab connection problem, some log information is missing.
Split mel window 7 (the concatenation of 1792-sample batches)
Split mel window 64 (the concatenation of 16384-sample batches)
Trained checkpoint
Checkpoint here
- This is the checkpoint at 500,000 steps.
- 100,000 steps with batch size 16 (76.15 epochs) on my desktop (Nvidia GTX 1050 Ti).
- After that, 400,000 steps with batch size 64 (1218.35 epochs) on Google Colaboratory.
- 8 speakers (1 LJSpeech + 7 FastVox) were trained.
Future works
- Integrating GST Tacotron