Project author: akurniawan

Project description: char-rnn implementation for sentiment analysis on twitter data
Language: Python
Repository: git://github.com/akurniawan/pytorch-sentiment-analysis.git
Created: 2017-08-31T09:54:08Z
Project home: https://github.com/akurniawan/pytorch-sentiment-analysis

License:


pytorch-rnn-sentiment-analysis

Description

Just assume this is my toy project for learning PyTorch for the first time (it's easy and definitely awesome!). In this repo you can find implementations of both char-rnn and word-rnn for sentiment analysis on Twitter data.

Not only sentiment analysis: you can also use this project for sentence classification with multiple classes. Just put your class ids in the csv and you're good to go!

Implementation Details

  1. You can choose between an LSTM and a CNN-LSTM for the character decoder (see the sketch after this list)
  2. Batches are grouped according to their sequence lengths
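
To make point 1 concrete, here is a minimal character-level CNN-LSTM classifier in PyTorch. It is only a sketch, not the model defined in this repo: the embedding size, kernel width, use of the last hidden state, and single linear head are assumptions chosen for brevity.

```python
import torch
import torch.nn as nn

class CharCNNLSTM(nn.Module):
    """Illustrative char-level CNN-LSTM classifier (not the repo's exact model)."""

    def __init__(self, vocab_size, emb_dim=32, conv_channels=64,
                 hidden_dim=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # 1-D convolution over the character embeddings.
        self.conv = nn.Conv1d(emb_dim, conv_channels, kernel_size=5, padding=2)
        self.lstm = nn.LSTM(conv_channels, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, char_ids):                  # (batch, seq_len)
        x = self.embedding(char_ids)              # (batch, seq_len, emb_dim)
        x = self.conv(x.transpose(1, 2))          # (batch, channels, seq_len)
        x = torch.relu(x).transpose(1, 2)         # (batch, seq_len, channels)
        _, (h_n, _) = self.lstm(x)                # h_n: (1, batch, hidden_dim)
        return self.fc(h_n[-1])                   # (batch, num_classes)

# Example: classify a batch of 8 sequences of 140 character ids.
model = CharCNNLSTM(vocab_size=100)
logits = model(torch.randint(0, 100, (8, 140)))
print(logits.shape)  # torch.Size([8, 2])
```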

Implementation Limitations

  1. The current implementation tokenizes the input at the character level; there is still no way to change the tokenization via the config.
  2. There is still no way to change the RNN cell from the config.
  3. There is still no way to change the optimizer from the config.

How to run?

  1. Install PyTorch
  2. Run pip install -r requirements.txt

Run python run.py with the following options

```
optional arguments:
  -h, --help            show this help message and exit
  --epochs EPOCHS       Number of epochs
  --dataset DATASET     Path for your training, validation and test dataset.
                        As this package uses torchtext to load the data,
                        please follow the format by providing the path and
                        filename without its extension
  --batch_size BATCH_SIZE
                        The number of batch size for every step
  --log_interval LOG_INTERVAL
  --save_interval SAVE_INTERVAL
  --validation_interval VALIDATION_INTERVAL
  --char_level CHAR_LEVEL
                        Whether to use the model with character level or word
                        level embedding. Specify the option if you want to use
                        character level embedding
  --model_config MODEL_CONFIG
                        Location of model config
  --model_dir MODEL_DIR
                        Location to save the model
```

This is an example of how you can run it:

```
python run.py --model_config config/cnn_rnn.yml --epochs 50 --model_dir models --dataset data/sentiment
```

Dataset

You can download the raw data from [1]. It contains 1,578,627 classified tweets; each row is labeled 1 for positive sentiment and 0 for negative sentiment. Kudos to [2] for providing the link to the data! However, the data provided by [1] has four columns, while this code only needs the text and the sentiment, so you should convert the data first by grabbing the first and the last columns before feeding it into the algorithm (see the sketch below).
Alternatively, you can download the data from [4]; it contains the same number of rows as the original, but I have already cleaned it up a bit, so you can run the code without any further modification.
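
A minimal sketch of that column conversion using pandas; the input file name, encoding, and output path are assumptions, so adjust them to wherever you unpacked the archive from [1]:

```python
import pandas as pd

# Read the raw dataset; the file name is an assumption based on the archive
# in [1]. on_bad_lines requires pandas >= 1.3; drop it on older versions.
raw = pd.read_csv("Sentiment Analysis Dataset.csv",
                  encoding="latin-1", on_bad_lines="skip")

# Keep only the first and the last columns, as described above.
slim = raw.iloc[:, [0, -1]]
slim.to_csv("sentiment_two_columns.csv", index=False, header=False)
```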

Want to run with your own data? No problem: create csv files for training and testing with two columns, the first being the sentiment and the second being the text. Don't forget to use the same name for both files and differentiate them with the suffixes .train and .test, as in the sketch below.
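
For example, here is a minimal sketch of preparing your own data under those conventions. The file prefix data/sentiment, the example rows, and the 90/10 split are all placeholders, not requirements of the code.

```python
import csv
import os
import random

# Hypothetical (label, text) pairs; replace with your own data.
rows = [
    (1, "loving the new pytorch release"),
    (0, "my flight got delayed again"),
]

random.seed(0)
random.shuffle(rows)
split = int(0.9 * len(rows))

os.makedirs("data", exist_ok=True)
# Same base name for both files, differentiated only by the suffix, so the
# example run above can point at --dataset data/sentiment.
for suffix, part in (("train", rows[:split]), ("test", rows[split:])):
    with open(f"data/sentiment.{suffix}", "w", newline="") as f:
        csv.writer(f).writerows(part)  # two columns: sentiment, then text
```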

Reference

[1] http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip

[2] http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/

[3] https://karpathy.github.io/2015/05/21/rnn-effectiveness/

[4] https://drive.google.com/file/d/1-1QNrYebNxge9vMP7YJJceehQkZtqAcO