Project author: akjindal53244
Project description:
Text classification using char-CNN + word-CNN
Language: Python
Project URL: git://github.com/akjindal53244/text_classification_hierarchical_cnn.git
text_classification_hierarchical_cnn
Text classification using char-CNN + word-CNN
Reference:
Convolutional Neural Networks for Sentence Classification
http://www.aclweb.org/anthology/D14-1181
Character-Aware Neural Language Models
https://arxiv.org/pdf/1508.06615.pdf
File configurations:
- Assign a suitable, dataset-specific name to the variable ‘dataset_identity’ in class DataConfig. Create a folder named “dump” + dataset_identity under the ‘data’ folder. All dataset-specific files (vocab, embedding_matrix) are stored in this folder automatically, which keeps experiments on multiple datasets consistent.
- Likewise, copy the train, valid and test files into a folder under ‘data’ named after the value of ‘dataset_identity’, and update the file/folder paths in class DataConfig under ‘utils/feature_extraction.py’ accordingly.
- Data format: Label TAB Utterance (no spaces around “\t”)
- Model and architecture settings (number of layers, filters, whether fully connected layers are enabled, dropout, epochs, batch_size, lr, etc.) can be adjusted via class ‘ModelConfig’ under ‘utils/feature_extraction.py’.
- The current code uses SENNA 50d embeddings (https://ronan.collobert.com/senna/), but you can use your own embeddings: just change the filename via the variable ‘embedding_file’ in class ‘DataConfig’.
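The Label TAB Utterance format listed above can be read with a few lines of Python. This is a minimal sketch for checking your own data files, not the repository's loader: the function name `parse_labeled_lines` and the sample labels are assumptions made for illustration.

```python
from typing import Iterable, List, Tuple


def parse_labeled_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Split each 'Label<TAB>Utterance' line into a (label, utterance) pair.

    Hypothetical helper; it only illustrates the expected file format.
    """
    pairs = []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # Split on the first TAB only, so the utterance may itself contain tabs.
        label, sep, utterance = line.partition("\t")
        if not sep:
            raise ValueError(f"line {line_no}: missing TAB separator")
        pairs.append((label, utterance))
    return pairs


# Made-up sample lines in the documented format.
sample = [
    "positive\tgreat movie , loved it\n",
    "negative\tnot worth the time\n",
]
pairs = parse_labeled_lines(sample)
```

Running a check like this over the train, valid and test files before training can catch lines that use spaces instead of a TAB.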
Training/Testing => train.py
For Training:
main(Flags.TRAIN, load_existing_dump=False)
‘load_existing_dump’:
If set to False, vocabs, embedding_matrix etc. are created from the input dataset and saved into ‘dump_dir’ as described above.
If set to True, they are loaded from the existing ‘dump_dir’ without being rebuilt, which is slightly faster. This is useful when training multiple times on the same dataset:
main(Flags.TRAIN, load_existing_dump=True)
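The build-or-load behaviour of ‘load_existing_dump’ can be sketched as below. This is an illustration under assumptions, not the repository's code: the helper names `build_vocab` and `get_vocab`, the file name `word_vocab.pkl`, and the use of pickle are all hypothetical.

```python
import os
import pickle
import tempfile
from collections import Counter


def build_vocab(utterances):
    """Map each whitespace-separated token to an integer id (0 = unknown)."""
    counts = Counter(tok for u in utterances for tok in u.split())
    vocab = {"<unk>": 0}
    for tok, _ in counts.most_common():
        vocab[tok] = len(vocab)
    return vocab


def get_vocab(utterances, dump_dir, load_existing_dump):
    """Load the vocab from dump_dir if requested, otherwise build and save it."""
    path = os.path.join(dump_dir, "word_vocab.pkl")  # hypothetical file name
    if load_existing_dump:
        with open(path, "rb") as f:
            return pickle.load(f)
    os.makedirs(dump_dir, exist_ok=True)
    vocab = build_vocab(utterances)
    with open(path, "wb") as f:
        pickle.dump(vocab, f)
    return vocab


# First run builds and saves; second run loads without touching the dataset.
with tempfile.TemporaryDirectory() as tmp:
    dump_dir = os.path.join(tmp, "dump_demo")
    built = get_vocab(["a b a", "b c"], dump_dir, load_existing_dump=False)
    loaded = get_vocab([], dump_dir, load_existing_dump=True)
```

Note that in this sketch, loading with True fails if the dump does not exist yet, so the first run on a new dataset needs False.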
For Testing:
main(Flags.TEST, load_existing_dump=True)
Error Analysis:
Incorrect test predictions are automatically written under the model_saver directory.
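For readers who want a similar report from their own evaluation loop, writing out misclassified examples can look like the sketch below. The function name `write_misclassified`, the output file name and the column layout are assumptions; the file the repository actually writes may differ.

```python
import os
import tempfile


def write_misclassified(utterances, gold, predicted, out_path):
    """Write one TAB-separated row per wrong prediction; return the row count."""
    wrong = 0
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("gold\tpredicted\tutterance\n")
        for utt, g, p in zip(utterances, gold, predicted):
            if g != p:
                f.write(f"{g}\t{p}\t{utt}\n")
                wrong += 1
    return wrong


# Made-up predictions: only the second example is misclassified.
with tempfile.TemporaryDirectory() as tmp:
    report = os.path.join(tmp, "incorrect_predictions.tsv")  # hypothetical name
    n_wrong = write_misclassified(
        ["great movie", "not worth the time"],
        ["positive", "negative"],
        ["positive", "positive"],
        report,
    )
    with open(report, encoding="utf-8") as f:
        report_lines = f.read().splitlines()
```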