Project author: jungwhank

Project description:
Transformer Implementation for NMT using PyTorch Lightning (Korean to English)
Language: Python
Repository: git://github.com/jungwhank/transformer-pl.git
Created: 2020-09-13T13:11:54Z
Project page: https://github.com/jungwhank/transformer-pl

License: MIT License



transformer-pl

This repository is an implementation of the Transformer in :zap: PyTorch Lightning
for translating Korean to English.

:zap: PyTorch Lightning is an open-source Python library that provides a high-level interface for PyTorch.
This was my first time using PyTorch Lightning, and I found it flexible and easy to organize code with :smile:

Requirements

    pytorch-lightning>=0.9.0
    sentencepiece==0.1.91
    torchtext==0.7.0
    torch>=1.5.0

Dataset

For this project, I used 1,100,000 sentences from AI HUB Korean-English AI Training Text Corpus.

| DATASET | SENTENCES |
|---------|-----------|
| TRAIN   | 1,000,000 |
| VALID   | 5,000     |
| TEST    | 5,000     |

To use torchtext with this repo, check sample.tsv in the ./data folder for the expected data format.
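As a rough illustration of working with such a file, the sketch below reads sentence pairs from a tab-separated file using only the standard library. The column layout (Korean first, English second, no header row) and the function name `read_parallel_tsv` are assumptions made for this example, not something confirmed by the repository; check sample.tsv for the actual layout.

```python
import csv
from typing import Iterator, Tuple


def read_parallel_tsv(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (korean, english) pairs from a tab-separated file.

    Assumption: one sentence pair per line, Korean in the first column,
    English in the second, and no header row.
    """
    with open(path, encoding="utf-8", newline="") as f:
        # QUOTE_NONE keeps quotation marks inside sentences untouched.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if len(row) < 2:
                continue  # skip blank or malformed lines
            yield row[0].strip(), row[1].strip()
```

Calling `list(read_parallel_tsv("data/sample.tsv"))` would then give a list of string pairs, which is handy for a quick sanity check before handing the file to torchtext.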

Training

To train:

    python main.py --epochs 30

To train on a GPU:

    python main.py --gpus 1 --epochs 30

Optional (Train tokenizer)

I uploaded my pretrained sentencepiece tokenizer files, but if you want to train a tokenizer on your own corpus, run code like the following.

    # Korean tokenizer
    import sentencepiece as spm
    input_file = 'kor.txt'
    vocab_size = 32000 # Choose your vocab size
    model_name = 'kor'
    model_type = 'bpe'
    character_coverage = 0.9995
    input_argument = '--input=%s --model_prefix=%s --vocab_size=%s --model_type=%s --character_coverage=%s --pad_id=0 --unk_id=1 --bos_id=2 --eos_id=3 '
    cmd = input_argument%(input_file, model_name, vocab_size, model_type, character_coverage)
    spm.SentencePieceTrainer.Train(cmd)

    # English tokenizer
    import sentencepiece as spm
    input_file = 'eng.txt'
    vocab_size = 32000 # Choose your vocab size
    model_name = 'eng'
    model_type = 'bpe'
    character_coverage = 1
    input_argument = '--input=%s --model_prefix=%s --vocab_size=%s --model_type=%s --character_coverage=%s --pad_id=0 --unk_id=1 --bos_id=2 --eos_id=3 '
    cmd = input_argument%(input_file, model_name, vocab_size, model_type, character_coverage)
    spm.SentencePieceTrainer.Train(cmd)
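Since the Korean and English snippets differ only in the input file, model prefix, and character coverage, the flag string could be built by a small helper. This is a sketch only: `build_spm_args` is a hypothetical name introduced here and is not part of the repository or of sentencepiece.

```python
def build_spm_args(input_file, model_prefix, vocab_size=32000,
                   model_type="bpe", character_coverage=1.0):
    """Build the flag string passed to spm.SentencePieceTrainer.Train.

    Hypothetical helper: it only formats the same flags used above and
    keeps the special-token ids fixed (pad=0, unk=1, bos=2, eos=3).
    """
    return (
        f"--input={input_file} --model_prefix={model_prefix} "
        f"--vocab_size={vocab_size} --model_type={model_type} "
        f"--character_coverage={character_coverage} "
        "--pad_id=0 --unk_id=1 --bos_id=2 --eos_id=3"
    )


kor_args = build_spm_args("kor.txt", "kor", character_coverage=0.9995)
eng_args = build_spm_args("eng.txt", "eng", character_coverage=1.0)

# With sentencepiece installed, each string can be passed to the trainer:
#   import sentencepiece as spm
#   spm.SentencePieceTrainer.Train(kor_args)
#   spm.SentencePieceTrainer.Train(eng_args)
```

Keeping the special-token ids in one place makes it harder for the two tokenizers to drift apart, which matters because the model expects the same padding, unknown, BOS, and EOS ids on both sides.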

Result

With :zap: PyTorch Lightning, you can easily monitor training with TensorBoard or other loggers.

    %load_ext tensorboard
    %tensorboard --logdir lightning_logs/

Train Loss Curve

Valid Loss Curve

Test BLEU Score

| BLEU  | BLEU1 | BLEU2 | BLEU3 | BLEU4 |
|-------|-------|-------|-------|-------|
| 26.28 | 56.7  | 33.3  | 21.2  | 14.0  |
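To see how the overall score could relate to the four per-order values, the sketch below assumes that BLEU1 to BLEU4 are the 1- to 4-gram precisions (in percent) and that the overall score follows the standard formula, a brevity penalty times the uniformly weighted geometric mean of those precisions. Both points are assumptions about how the numbers were produced, not something the table states.

```python
import math

# Values copied from the table above.
bleu = 26.28
precisions = [56.7, 33.3, 21.2, 14.0]  # assumed: 1- to 4-gram precisions (%)

# Geometric mean of the n-gram precisions with uniform weights.
geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))

# Under the assumption above, BLEU = brevity_penalty * geo_mean,
# so the brevity penalty can be backed out from the reported score.
implied_bp = bleu / geo_mean

print(f"geometric mean of precisions: {geo_mean:.2f}")
print(f"implied brevity penalty:      {implied_bp:.3f}")
```

Under these assumptions the geometric mean comes to roughly 27.4, which would imply a brevity penalty of about 0.96, meaning the system output would be slightly shorter than the references overall.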

Translate

To translate, set the checkpoint path in translate.py once training has finished, then run the script.

    python translate.py

Examples:

    kor : 안녕! 내일 뭐해?
    eng : Hi! What are you doing tomorrow?

    kor : 어제 무슨 영화봤어?
    eng : What movie did you watch yesterday?

    kor : 인공지능 공부는 재밌어요!
    eng : Artificial intelligence studies are fun!

References