A PyTorch-based LSTM Punctuation Restoration Implementation / A Simple Tutorial for Learning PyTorch and NLP
[1] LSTM for Punctuation Restoration in Speech Transcripts (interspeech2015-paper-punct.pdf)
@inproceedings{tilk2015,
  author    = {Ottokar Tilk and Tanel Alum{\"a}e},
  title     = {{LSTM} for Punctuation Restoration in Speech Transcripts},
  booktitle = {Interspeech 2015},
  year      = {2015},
  address   = {Dresden, Germany}
}
Training
1. In Windows: execute `start run.bat` in cmd in this dir.
2. In Linux: execute `./run.sh` in bash in this dir. (You need `chmod` to grant execute permission.)

Or (deprecated), at the root dir of this project:

`python punctuator.py -tr > ./log/yourLogName &` (train)

`python punctuator.py -t > ./log/yourLogName &` (test)

——————————————— en_version ———————————————
In general, a DNN system is built with the help of a framework such as TensorFlow, PyTorch, or MXNet. I chose PyTorch for its simplicity.
Most DNN systems have two main modules: a training module and an inference module.
As described below, a training module consists of four parts: data processing, data input, the net structure, and the training code.
Of these four parts, the net structure can be built later, because it follows common, easily reproduced patterns.
As a beginner, you should concentrate more on the data processing and data input parts. A modern machine-learning algorithm is useless without data. No data, no magic.
Consider the following assumptions:
Always be careful with your data. There is no out-of-the-box dataset that fits your own needs.
Processing:
When feeding data into the net, PyTorch demands a more uniform data format. PyTorch supplies a `Dataset` class for packing the raw data, and a `DataLoader` for customizing how the dataset is sampled.
All of the above serves the following needs (for an LSTM, whose data has strong sequential dependencies, the input must stay continuous):
To meet these demands, we customize the `Dataset` and `DataLoader` classes:

- In `Dataset`, we divide the data into sequences of length 100 (or any other length). This can be customized in `Dataset`'s `__init__()`.
- In `DataLoader`, we mainly change the sampling method so that each batch slot sees a continuous stream of sequences (the `DataLoader` accepts a custom `sampler=SeqBatchSampler` for this). What we need is to customize the sampler class's `__iter__()` to produce that interleaved order, which finally makes training use continuous data (see `SeqSampler.py`, and the sketch after this list).
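A minimal sketch of both customizations, assuming flat `word_ids`/`punct_ids` id arrays and a hypothetical `SeqDataset` class; the repo's own `Dataset` subclass and `SeqSampler.py` may differ in detail:

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset, Sampler

class SeqDataset(Dataset):
    """Cuts one long id stream into fixed-length sequences in __init__()."""
    def __init__(self, word_ids, punct_ids, seq_len=100):
        n = len(word_ids) // seq_len                  # drop the ragged tail
        self.x = np.reshape(np.asarray(word_ids[:n * seq_len]), (n, seq_len))
        self.y = np.reshape(np.asarray(punct_ids[:n * seq_len]), (n, seq_len))

    def __len__(self):
        return len(self.x)

    def __getitem__(self, i):
        return (torch.from_numpy(self.x[i]).long(),
                torch.from_numpy(self.y[i]).long())

class SeqBatchSampler(Sampler):
    """Yields indices so that slot j of successive batches walks the corpus
    in order, keeping every batch slot temporally continuous for the LSTM."""
    def __init__(self, num_seqs, batch_size):
        self.num_batches = num_seqs // batch_size
        self.batch_size = batch_size

    def __iter__(self):
        for b in range(self.num_batches):        # batch index
            for j in range(self.batch_size):     # slot within the batch
                yield j * self.num_batches + b   # slot j continues its own text

    def __len__(self):
        return self.num_batches * self.batch_size

# Usage: consecutive batches now continue each slot's text stream.
# dataset = SeqDataset(word_ids, punct_ids)
# loader = DataLoader(dataset, batch_size=32,
#                     sampler=SeqBatchSampler(len(dataset), 32))
```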
Net training involves several techniques:
An inference module should consist of four parts: data preprocessing, data standardization with `Dataset`, model loading, and running inference.
Data preprocessing: the methods are identical to section 1.1.
Data standardization: we subclass the `Dataset` class and modify it to standardize the data from section 2.1. As with `Dataset`'s processing of raw data in section 1.2, we let one sequence unit be the input of the net. Note: unlike when training the net, we now feed only one sequence at a time instead of a `batch_size` of sequences, but PyTorch still requires the batch dimension of that single sequence to be explicit, so use `np.reshape(id_seq, (1, -1))`. More details are in `./inference/inference.py`.
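For illustration, a sketch of that reshape step with a hypothetical `id_seq`; the variable names here are assumptions, not the repo's own:

```python
import numpy as np
import torch

# Hypothetical single sequence of word ids taken from the Dataset.
id_seq = np.array([12, 7, 305, 42, 9])    # shape: (seq_len,)

batch = np.reshape(id_seq, (1, -1))       # shape: (1, seq_len), i.e. batch_size = 1
inputs = torch.from_numpy(batch).long()   # LongTensor, as an embedding layer expects
# logits = model(inputs)                  # model: the LSTM loaded in the next step
```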
Loading a model is done with a custom class method `load_model()` defined on the `nn.Module` subclass:
- `torch.load` loads the saved model information.
- `load_model` extracts `embedding_size`, `hidden_size`, `num_layers`, `num_class`, and the `state_dict`.
- Input the `Dataset` to the loaded model; then inference starts.
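A minimal sketch of how such a `load_model()` could look; the class name `PunctuationLSTM`, the checkpoint keys, and the architecture details are assumptions based on the list above, not the repo's exact code:

```python
import torch
import torch.nn as nn

class PunctuationLSTM(nn.Module):
    # Hypothetical reconstruction of the model class; names are assumptions.
    def __init__(self, embedding_size, hidden_size, num_layers, num_class,
                 vocab_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_size)
        self.lstm = nn.LSTM(embedding_size, hidden_size, num_layers,
                            batch_first=True)
        self.fc = nn.Linear(hidden_size, num_class)

    def forward(self, x):
        out, _ = self.lstm(self.embedding(x))   # (batch, seq_len, hidden)
        return self.fc(out)                     # (batch, seq_len, num_class)

    @classmethod
    def load_model(cls, path):
        # torch.load restores the dict saved at training time; the
        # hyperparameters rebuild the net, state_dict restores the weights.
        ckpt = torch.load(path, map_location="cpu")
        state = ckpt["state_dict"]
        # The vocab size is recoverable from the saved embedding weights.
        vocab_size = state["embedding.weight"].shape[0]
        model = cls(ckpt["embedding_size"], ckpt["hidden_size"],
                    ckpt["num_layers"], ckpt["num_class"], vocab_size)
        model.load_state_dict(state)
        model.eval()                             # inference mode
        return model

# Usage: model = PunctuationLSTM.load_model("./model/punct.pt")
```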
————————————————————-zh_version————————————-
Building a DNN system requires a deep-learning framework. PyTorch is an excellent choice: its usage logic is simple and easy to understand.
A DNN system usually consists of two main parts: a training module and an inference module.
Building the training module of a DNN model should include several parts:
Among these, the network model and the training code are fairly formulaic to build and do not require excessive effort.
We need to pay close attention to data processing and input: data is to the model what fuel is to a car, and this is especially true in natural language processing.
We make the following assumptions:
Of the two assumptions above, one detaches the model from reality, while the other makes training impossible to even start, like trying to cook a meal without rice.
Even if you find an open-source standardized dataset, you will discover that it still needs processing before it can enter your own DNN system.
Preprocessing:
When data is fed into the network, it must be organized into the format PyTorch expects. PyTorch provides the `Dataset` module for packing raw data and the `DataLoader` for loading the packed data.
For an LSTM (whose data has strong sequential dependencies), the model input has the following requirements:
By customizing `Dataset` and `DataLoader`, we can satisfy these data requirements:
Customize the training code and choose an update strategy: backpropagate and update per mini-batch, or update once per epoch (still to be decided). A sketch of the mini-batch option follows.
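A minimal sketch of the mini-batch strategy, assuming `model` is the `PunctuationLSTM` sketched earlier and `loader` is the `DataLoader` from the training section; the optimizer, learning rate, and loss are assumptions, not the repo's settings:

```python
import torch
import torch.nn as nn

# Mini-batch strategy: backpropagate and update once per batch,
# instead of accumulating gradients over a whole epoch.
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # model: see load_model sketch

for epoch in range(10):
    for inputs, labels in loader:                # loader: see SeqBatchSampler sketch
        logits = model(inputs)                   # (batch, seq_len, num_class)
        loss = criterion(logits.reshape(-1, logits.size(-1)),
                         labels.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                         # update immediately, per mini-batch
```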
The inference module of a DNN model should include several parts:
The data preprocessing module is identical to the process in section 1.1.
To feed data into the PyTorch network, we subclass the `Dataset` class, adapt it, and use it to standardize the preprocessed data.
With a `Dataset` construction similar to section 1.2, we obtain a standardized dataset, then feed it into the network one sequence at a time.
Note that, unlike during training, we infer only one unit of text at a time: after taking a sequence out of the `Dataset`, use `np.reshape(id_seq, (1, -1))` to reshape it to a batch size of 1.
Loading the network model is implemented through a custom class method `load_model` defined on the `nn.Module` subclass.
After `torch.load` restores the saved model, feed the data into the network model obtained in section 2.3 to run inference.