Project author: lt3

Project description:
Data augmentation pipeline that can add fuzzy target translations to an input sentence to improve the performance of an MT system. Fuzzy matches can be found with edit distance, set similarity and semantic matching.

Language: Python
Repository: git://github.com/lt3/nfr.git
Created: 2020-12-30T10:28:34Z
Community: https://github.com/lt3/nfr

License: Apache License 2.0


Neural fuzzy repair

Installation

For basic usage, you can simply install the library by cloning the repository and running pip install:

    git clone https://github.com/lt3/nfr.git
    cd nfr
    pip install .

By default, semantic matching with sent2vec and Sentence Transformers is not enabled because the dependencies are
considerably large. If you want to enable semantic matching, you need to install FAISS and either
Sentence Transformers or Sent2Vec.

  • FAISS (pip install faiss-cpu or pip install faiss-gpu)
  • Sentence Transformers (pip install sentence-transformers)
    • Sentence Transformers relies on PyTorch. Depending on your OS, a CPU-only version of torch may be
      installed by default. If you want better performance and you have a CUDA-enabled device available, it is
      recommended to install a CUDA-enabled version of torch before
      installing sentence-transformers.
  • Sent2Vec (clone and install from GitHub; do not use pip as that is a
    different version)

Usage

After installation, four commands are exposed. In all cases, you can run <command> -h to print the usage instructions below.

  1. nfr-create-faiss-index: Creates a FAISS index for semantic matches with Sent2Vec or Sentence Transformers.
     This is a necessary step if you want to extract semantic fuzzy matches later on.

        usage: nfr-create-faiss-index [-h] -c CORPUS_F -p MODEL_NAME_OR_PATH -o OUTPUT_F
                                      [-m {sent2vec,stransformers}]
                                      [-b BATCH_SIZE] [--use_cuda]

        Create a FAISS index based on the semantic representation of an existing text
        corpus. To do so, the text will be embedded by means of a sent2vec model or a
        sentence-transformers model. The index is (basically) an efficient list that
        contains all the representations of the training corpus sentences (the TM). As
        such, this index can later be used to find those entries that are most similar
        to a given representation of a sentence. The index is saved to a binary file
        so that it can be reused later on to calculate cosine similarity scores and to
        retrieve the most resembling entries.

        optional arguments:
          -h, --help            show this help message and exit
          -c CORPUS_F, --corpus_f CORPUS_F
                                Path to the corpus to turn into vectors and add to the
                                index. This is typically your TM or training file for
                                an MT system containing text, one sentence per line
          -p MODEL_NAME_OR_PATH, --model_name_or_path MODEL_NAME_OR_PATH
                                Path to sent2vec model (when `method` is sent2vec) or
                                sentence-transformers model name when method is
                                stransformers (see
                                https://www.sbert.net/docs/pretrained_models.html)
          -o OUTPUT_F, --output_f OUTPUT_F
                                Path to the output file to write the FAISS index to
          -m {sent2vec,stransformers}, --mode {sent2vec,stransformers}
                                Whether to use 'sent2vec' or 'stransformers'
                                (sentence-transformers)
          -b BATCH_SIZE, --batch_size BATCH_SIZE
                                Batch size to use to create sent2vec embeddings or
                                sentence-transformers embeddings. A larger value will
                                result in faster creation, but may lead to an out-of-
                                memory error. If you get such an error, lower the
                                value.
          --use_cuda            Whether to use GPU when using sentence-transformers.
                                Requires PyTorch installation with CUDA support and a
                                CUDA-enabled device
  2. nfr-extract-fuzzy-matches: Extracts fuzzy matches from the training set. A variety of options are
     available, including semantic fuzzy matching, set similarity and edit distance.
        usage: nfr-extract-fuzzy-matches [-h] --tmsrc TMSRC --tmtgt TMTGT --insrc INSRC
                                         --method {editdist,setsim,setsimeditdist,sent2vec,stransformers}
                                         --minscore MINSCORE --maxmatch MAXMATCH
                                         [--model_name_or_path MODEL_NAME_OR_PATH]
                                         [--faiss FAISS] [--threads THREADS]
                                         [--n_setsim_candidates N_SETSIM_CANDIDATES]
                                         [--setsim_function SETSIM_FUNCTION]
                                         [--use_cuda] [-q QUERY_MULTIPLIER]
                                         [-v {info,debug}]

        Given source and target TM files, extract fuzzy matches for a new input file
        by using a variety of methods. You can use formal matching methods such as
        edit distance and set similarity, as well as semantic fuzzy matching with
        sent2vec and Sentence Transformers.

        optional arguments:
          -h, --help            show this help message and exit
          --tmsrc TMSRC         Source text of the TM from which fuzzy matches will be
                                extracted
          --tmtgt TMTGT         Target text of the TM from which fuzzy matches will be
                                extracted
          --insrc INSRC         Input source file to extract matches for (insrc is
                                queried against tmsrc)
          --method {editdist,setsim,setsimeditdist,sent2vec,stransformers}
                                Method to find fuzzy matches
          --minscore MINSCORE   Min fuzzy match score. Only matches with a similarity
                                score of at least 'minscore' will be included
          --maxmatch MAXMATCH   Max number of fuzzy matches kept per source segment
          --model_name_or_path MODEL_NAME_OR_PATH
                                Path to sent2vec model (when `method` is sent2vec) or
                                sentence-transformers model name when method is
                                stransformers (see
                                https://www.sbert.net/docs/pretrained_models.html)
          --faiss FAISS         Path to faiss index. Must be provided when `method` is
                                sent2vec or stransformers
          --threads THREADS     Number of threads. Must be 0 or 1 when using
                                `use_cuda`
          --n_setsim_candidates N_SETSIM_CANDIDATES
                                Number of fuzzy match candidates extracted by setsim
          --setsim_function SETSIM_FUNCTION
                                Similarity function used by setsimsearch
          --use_cuda            Whether to use GPU for FAISS indexing and sentence-
                                transformers. For this to work properly `threads`
                                should be 0 or 1.
          -q QUERY_MULTIPLIER, --query_multiplier QUERY_MULTIPLIER
                                (applies only to FAISS) Initially look for
                                `query_multiplier * maxmatch` matches to ensure that
                                we find enough hits after filtering. If still not
                                enough matches, search the whole index
          -v {info,debug}, --logging_level {info,debug}
                                Set the information level of the logger. 'info' shows
                                trivial information about the process. 'debug' also
                                notifies you when fewer matches are found than
                                requested during semantic matching
  3. nfr-add-training-features: Adds features to the input. These indicate the side of a token (source token or fuzzy
     target) and whether or not a token was matched.
        usage: nfr-add-training-features [-h] [-o OUT] [-l] [-v] fin falign

        Given a file containing source, fuzzy source and fuzzy target columns, finds
        the tokens in fuzzy_src that match with src according to the edit distance
        metric. Then the indices of those matches are used together with the word
        alignments (GIZA) between fuzzy_src and fuzzy_tgt to mark fuzzy target tokens
        with m (match) or nm (no match). This feature indicates whether or not the
        fuzzy_src token that is aligned with said fuzzy target token has a match in
        the original source sentence. The feature is also added to source tokens when
        a match was found according to the methodology described above. In addition, a
        "side" feature is added. This indicates which side the token is from, S
        (source) or T (target). So, in sum, every source and fuzzy target token will
        have two features: match/no-match and its side. These features can be filtered
        in the next processing step, nfr-augment-data.

        positional arguments:
          fin                   Input file
          falign                Alignment file

        optional arguments:
          -h, --help            show this help message and exit
          -o OUT, --out OUT     Output file. If not given, will use the input file with
                                '.trainfeats' before the suffix
          -l, --lazy            Whether to use lazy processing. Useful for very large
                                files
          -v, --verbose         Whether to print intermediate results to stdout
  4. nfr-augment-data: Prepares the dataset to be used in an MT system. Allows you to combine fuzzy matches and choose
     which features to use.
        usage: nfr-augment-data [-h] --src SRC --tgt TGT --fm FM --outdir OUTDIR
                                --minscore MINSCORE --n_matches N_MATCHES
                                --combine {nbest,max_coverage} [--is_trainset] [--out_ranges]
                                [-sf {side,matched} [{side,matched} ...]]
                                [-ftf {side,matched} [{side,matched} ...]]

        Prepares your data for training an MT system. The script creates combinations
        of source and (possibly multiple) fuzzy target sentences, based on the
        initially created matches (cf. nfr-extract-fuzzy-matches). The script can
        also filter the features that should be retained in the final files.
        Corresponding translations are also saved, as well as those entries for which
        no matches were found.

        optional arguments:
          -h, --help            show this help message and exit
          --src SRC             Input source file
          --tgt TGT             Input target file
          --fm FM               File containing fuzzy matches for the input source
          --outdir OUTDIR       Output directory
          --minscore MINSCORE   Min. fuzzy match score threshold
          --n_matches N_MATCHES
                                Number of fuzzy targets to be used in the augmented
                                source
          --combine {nbest,max_coverage}
                                Method of combining fuzzy matches
          --is_trainset         Whether the input file is the training set for the MT
                                system
          --out_ranges          Whether to save augmented data for different fuzzy
                                match range categories (considering the best fuzzy
                                match score)
          -sf {side,matched} [{side,matched} ...], --src_feats {side,matched} [{side,matched} ...]
                                Features to retain in the source tokens
          -ftf {side,matched} [{side,matched} ...], --fuzzy_tgt_feats {side,matched} [{side,matched} ...]
                                Features to retain in the fuzzy target tokens
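To make the semantics of --minscore and --maxmatch concrete, here is a minimal sketch of edit-distance fuzzy matching. The normalisation shown (1 minus edit distance over the length of the longer segment) is a common TM convention; nfr's exact scoring may differ, so treat this purely as an illustration.

```python
def levenshtein(a, b):
    """Word-level edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def fuzzy_score(src, cand):
    """Score in [0, 1]: 1 - editdist / length of the longer segment."""
    a, b = src.split(), cand.split()
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def best_matches(insrc, tm_src, minscore=0.5, maxmatch=3):
    """Keep at most `maxmatch` TM entries scoring at least `minscore`."""
    scored = [(fuzzy_score(insrc, s), i) for i, s in enumerate(tm_src)]
    return sorted((p for p in scored if p[0] >= minscore), reverse=True)[:maxmatch]
```

For example, querying "the cat sat on the mat" against a toy TM keeps the identical entry (score 1.0) and a one-word variant, while an unrelated sentence falls below the threshold and is dropped.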

Best Configuration: Step-by-step guide

The best configuration (for the majority of the language pairs tested in Tezcan, Bulté & Vanroy, 2021) consists of the following parameters/properties:

  • Preprocessing: tokenization, truecasing and sub-word segmentation using Byte-Pair Encoding (BPE) with a merged 32K vocabulary of the source and target languages (these preprocessing steps are performed before Step 1 below). We used the Moses toolkit for tokenization and truecasing and OpenNMT for BPE, which relies on the original BPE implementation.
  • Fuzzy matching using cosine similarity between segment embeddings obtained with sent2vec (with a min. fuzzy match score of 0.5 and a max. of 40 fuzzy matches per source segment)
  • Data augmentation using “maximum coverage” with 2 fuzzy matches per segment and with features (a “side” feature on the source tokens, a “match/no-match” feature on fuzzy-match target tokens)

Note: You can follow along with this step-by-step guide by making use of a dummy data set that we provide on the
release page. This dataset and the supplementary models and indices
are purposefully kept small in size, so their performance is not great. They are merely intended to show you how
to use our code.

Step 1. Extract Fuzzy Matches (preprocessed data)

Fuzzy matches need to be extracted for the training, test and development sets separately.

Fuzzy matching using sent2vec requires a sent2vec model built on the source side (source language) of the (preprocessed) training set and a FAISS index for this model. Please see sent2vec documentation on how to build a sent2vec model and our paper for the parameters we used in our experiments.

To generate a FAISS index for the source side of the training data using the sent2vec model:

    nfr-create-faiss-index -c ./0_preprocessing/bpe_merged/train.tok.truec.bpe.en --model_name_or_path ./sent2vec/sent2vec.train.tok.truec.10dim.bpe.bin -o ./sent2vec/sent2vec.faiss.en

To extract fuzzy matches (for the training set):

    nfr-extract-fuzzy-matches --tmsrc ./0_preprocessing/bpe_merged/train.tok.truec.bpe.en --tmtgt ./0_preprocessing/bpe_merged/train.tok.truec.bpe.nl --insrc ./0_preprocessing/bpe_merged/train.tok.truec.bpe.en --method sent2vec --faiss ./sent2vec/sent2vec.faiss.en --model_name_or_path ./sent2vec/sent2vec.train.tok.truec.10dim.bpe.bin --maxmatch 40 --minscore 0.5 --threads 1

Note 1: This command generates the ‘fuzzy match file’ (train.tok.truec.bpe.en.matches.mins0.5.maxm40.sent2vec.txt) in the same folder as the original file (--insrc).
Note 2: Modify the --insrc parameter to extract fuzzy matches for the development or test sets separately.
Note 3: To run the process on GPU, remove the --threads parameter and use --use_cuda instead.
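What the FAISS index does in this step can be pictured with a plain-numpy sketch: store L2-normalised sentence vectors and retrieve the most similar TM entries by cosine similarity. The toy 2-dimensional vectors below stand in for real sent2vec embeddings, which have hundreds of dimensions; nfr uses FAISS to do this search efficiently at scale.

```python
import numpy as np

def build_index(embeddings):
    """Store L2-normalised vectors so a dot product equals cosine similarity."""
    emb = np.asarray(embeddings, dtype=np.float32)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    return emb / np.clip(norms, 1e-12, None)

def search(index, query, k=3):
    """Return (scores, ids) of the k most similar TM entries."""
    q = np.asarray(query, dtype=np.float32)
    q = q / max(np.linalg.norm(q), 1e-12)
    scores = index @ q                   # cosine similarities with every TM entry
    ids = np.argsort(-scores)[:k]        # indices of the k highest scores
    return scores[ids], ids

# Toy "embeddings" for four TM sentences:
tm = build_index([[1, 0], [0, 1], [1, 1], [-1, 0]])
scores, ids = search(tm, [1, 0.1], k=2)
```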

Step 2. Add (NMT) Training Features

Features are added to fuzzy match files for the training, test and development sets separately.

This step requires a word alignment file in Pharaoh format (only for the training set!), where the source and target token indices are separated by a dash. For instance: 0-0 1-1 2-2 2-3 3-4 4-5
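Parsing this format is straightforward; a small illustrative helper:

```python
def parse_pharaoh(line):
    """Parse a Pharaoh-format alignment line into (src_idx, tgt_idx) pairs."""
    return [tuple(int(i) for i in pair.split("-")) for pair in line.split()]

# parse_pharaoh("0-0 1-1 2-2 2-3 3-4 4-5")
# -> [(0, 0), (1, 1), (2, 2), (2, 3), (3, 4), (4, 5)]
```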

We used GIZA++ to obtain alignments for the training set (tokenized, truecased, byte-pair encoded). Please see the paper for the parameters we used for GIZA++.

To add features to the fuzzy match file of the training set:

    nfr-add-training-features ./1_fuzzy_matches/train.tok.truec.bpe.en.matches.mins0.5.maxm40.sent2vec.txt ./word_alignments/train.bpe.alignments -l

Note 1: This command generates a new ‘fuzzy match file’ with features (train.tok.truec.bpe.en.matches.mins0.5.maxm40.sent2vec.trainfeats.txt) in the same folder as the input file (fin).
Note 2: Change the first positional argument to add features to the fuzzy match files for the development or test set separately.
Note 3: Only the word alignment file obtained on the training set is used (also for adding features to test or dev sets)!
Note 4: In this step, both “side (source/target)” and “match/no-match” features are added to the fuzzy match files. While the final augmented data does not use the “match/no-match” feature on the source tokens, this feature is still necessary to apply “max. coverage” during the data augmentation step (Step 3).
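The feature projection performed in this step can be sketched as follows. For simplicity, the sketch marks a fuzzy_src token as matched when it occurs verbatim in src (nfr uses edit-distance matching instead), and the `token|feature` layout is only illustrative, not nfr's actual output format.

```python
def mark_fuzzy_target(src, fuzzy_src, fuzzy_tgt, alignments):
    """Append an m/nm (match/no-match) feature to every fuzzy target token.

    `alignments` are (fuzzy_src_idx, fuzzy_tgt_idx) pairs, Pharaoh style.
    Simplification: a fuzzy_src token "matches" if it also occurs in src.
    """
    src_tokens = set(src.split())
    matched_src = {i for i, tok in enumerate(fuzzy_src.split()) if tok in src_tokens}
    # A fuzzy target token is a match if any fuzzy source token aligned to it matched.
    tgt_matched = {t for s, t in alignments if s in matched_src}
    return [f"{tok}|{'m' if i in tgt_matched else 'nm'}"
            for i, tok in enumerate(fuzzy_tgt.split())]

# mark_fuzzy_target("the black cat", "the white cat", "de witte kat",
#                   [(0, 0), (1, 1), (2, 2)])
# -> ['de|m', 'witte|nm', 'kat|m']
```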

Step 3. Augment Data

To augment training set:

    nfr-augment-data --src ./0_preprocessing/bpe_merged/train.tok.truec.bpe.en --tgt ./0_preprocessing/bpe_merged/train.tok.truec.bpe.nl --fm ./2_fuzzy_matches_w_features/train.tok.truec.bpe.en.matches.mins0.5.maxm40.sent2vec.trainfeats.txt --minscore 0.5 --n_matches 2 -sf side -ftf matched --outdir ./3_augment_data/train/ --combine max_coverage --is_trainset

Note 1: When augmenting the training set, we need to use the --is_trainset parameter, as the training set is augmented in a different way compared to the test and dev sets. Please see the paper for details on data augmentation.
Note 2: This command creates the augmented training set (source/target) in the output directory, which can be used to train the NMT model.

To augment test (or dev) set:

    nfr-augment-data --src ./0_preprocessing/bpe_merged/test.tok.truec.bpe.en --tgt ./0_preprocessing/bpe_merged/test.tok.truec.bpe.nl --fm ./2_fuzzy_matches_w_features/test.tok.truec.bpe.en.matches.mins0.5.maxm40.sent2vec.trainfeats.txt --minscore 0.5 --n_matches 2 -sf side -ftf matched --outdir ./3_augment_data/test/ --combine max_coverage

Note 1: Modify --src, --tgt, --fm and the output directory (--outdir) to augment the dev set.
Note 2: This command creates the augmented test/dev set (source/target) in the output directory, which can be used for translation with the NMT model trained on the augmented training set.
Note 3: If you want to generate documents containing sentences per fuzzy match range, add the --out_ranges parameter.
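Conceptually, the augmented source produced in this step is the original source sentence concatenated with its selected fuzzy target translations. The separator token below is purely illustrative; the real output format, including the token features selected with -sf/-ftf, is produced by nfr-augment-data.

```python
def augment_source(src, fuzzy_tgts, sep="@@@"):
    """Join the source with its fuzzy target translations (illustrative format)."""
    return f" {sep} ".join([src] + list(fuzzy_tgts))

# augment_source("the cat", ["de kat", "een kat"])
# -> "the cat @@@ de kat @@@ een kat"
```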

Step 4. Train NMT model

In our experiments we used OpenNMT-py (version 1.x) to train NMT models, but the augmented data sets can be used to train models with any toolkit, provided that it supports source-side features (OpenNMT-py version 2 does not support source-side features at the time of writing this guide).
Some parameters to pay attention to during the “preprocessing” step in OpenNMT:

  • onmt_preprocess: src_seq_length should be increased based on the number of fuzzy matches used for data augmentation (x2 when a single fuzzy match is used, x3 when 2 fuzzy matches are used, etc.)
  • onmt_train: word_vec_size + feat_vec_size must equal rnn_size. For example, word_vec_size = 506, feat_vec_size = 6 and rnn_size = 512. If the sum of word_vec_size and feat_vec_size is not equal to rnn_size, training fails with an error (source).
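Both constraints above can be checked mechanically before launching training. A small sketch (the parameter names mirror the OpenNMT-py 1.x options discussed above; the base sequence length of 50 in the example is an arbitrary illustration):

```python
def check_onmt_settings(word_vec_size, feat_vec_size, rnn_size,
                        base_seq_length, n_fuzzy_matches):
    """Sanity-check the OpenNMT-py 1.x settings described above."""
    # word_vec_size + feat_vec_size must equal rnn_size (e.g. 506 + 6 = 512),
    # otherwise onmt_train raises an error.
    if word_vec_size + feat_vec_size != rnn_size:
        raise ValueError("word_vec_size + feat_vec_size must equal rnn_size")
    # src_seq_length grows with the number of fuzzy matches appended to the
    # source: x2 for one fuzzy match, x3 for two fuzzy matches, etc.
    return base_seq_length * (n_fuzzy_matches + 1)

# check_onmt_settings(506, 6, 512, base_seq_length=50, n_fuzzy_matches=2) -> 150
```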

Please see the paper for a list of all the parameter values we used in our experiments.

Citation

Please cite our paper(s) when you use this library.

If you perform automatic or manual evaluations or analyse how similar translations influence the MT output:

Tezcan, A., Bulté, B. (2022). Evaluating the Impact of Integrating Similar Translations into Neural Machine Translation. Information, 13(1). https://www.mdpi.com/2078-2489/13/1/19

    @Article{info13010019,
      AUTHOR = {Tezcan, Arda and Bulté, Bram},
      TITLE = {Evaluating the Impact of Integrating Similar Translations into Neural Machine Translation},
      JOURNAL = {Information},
      VOLUME = {13},
      YEAR = {2022},
      NUMBER = {1},
      ARTICLE-NUMBER = {19},
      URL = {https://www.mdpi.com/2078-2489/13/1/19},
      ISSN = {2078-2489},
      ABSTRACT = {Previous research has shown that simple methods of augmenting machine translation training data and input sentences with translations of similar sentences (or fuzzy matches), retrieved from a translation memory or bilingual corpus, lead to considerable improvements in translation quality, as assessed by a limited set of automatic evaluation metrics. In this study, we extend this evaluation by calculating a wider range of automated quality metrics that tap into different aspects of translation quality and by performing manual MT error analysis. Moreover, we investigate in more detail how fuzzy matches influence translations and where potential quality improvements could still be made by carrying out a series of quantitative analyses that focus on different characteristics of the retrieved fuzzy matches. The automated evaluation shows that the quality of NFR translations is higher than the NMT baseline in terms of all metrics. However, the manual error analysis did not reveal a difference between the two systems in terms of total number of translation errors; yet, different profiles emerged when considering the types of errors made. Finally, in our analysis of how fuzzy matches influence NFR translations, we identified a number of features that could be used to improve the selection of fuzzy matches for NFR data augmentation.},
      DOI = {10.3390/info13010019}
    }

If you use semantic fuzzy matching (sent2vec, sentence-transformers), sub-word segmentation, max. coverage for combining fuzzy matches, source-side features for training NMT models:

Tezcan, A., Bulté, B., & Vanroy, B. (2021). Towards a better integration of fuzzy matches in neural machine
translation through data augmentation. Informatics, 8(1). https://doi.org/10.3390/informatics8010007

    @article{tezcan2021integration,
      AUTHOR = {Tezcan, Arda and Bulté, Bram and Vanroy, Bram},
      TITLE = {Towards a Better Integration of Fuzzy Matches in Neural Machine Translation through Data Augmentation},
      JOURNAL = {Informatics},
      VOLUME = {8},
      YEAR = {2021},
      NUMBER = {1},
      ARTICLE-NUMBER = {7},
      URL = {https://www.mdpi.com/2227-9709/8/1/7},
      ISSN = {2227-9709},
      ABSTRACT = {We identify a number of aspects that can boost the performance of Neural Fuzzy Repair (NFR), an easy-to-implement method to integrate translation memory matches and neural machine translation (NMT). We explore various ways of maximising the added value of retrieved matches within the NFR paradigm for eight language combinations, using Transformer NMT systems. In particular, we test the impact of different fuzzy matching techniques, sub-word-level segmentation methods and alignment-based features on overall translation quality. Furthermore, we propose a fuzzy match combination technique that aims to maximise the coverage of source words. This is supplemented with an analysis of how translation quality is affected by input sentence length and fuzzy match score. The results show that applying a combination of the tested modifications leads to a significant increase in estimated translation quality over all baselines for all language combinations.},
      DOI = {10.3390/informatics8010007}
    }

If you use lexical fuzzy matching (editdist, setsim, setsimeditdist):

Bulte, B., & Tezcan, A. (2019). Neural Fuzzy Repair: Integrating Fuzzy Matches into Neural Machine Translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 1800–1809). Association for Computational Linguistics. https://www.aclweb.org/anthology/P19-1175

    @inproceedings{bulte2019neural,
      AUTHOR = {Bulte, Bram and Tezcan, Arda},
      TITLE = {Neural Fuzzy Repair: Integrating Fuzzy Matches into Neural Machine Translation},
      BOOKTITLE = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
      MONTH = jul,
      YEAR = {2019},
      ADDRESS = {Florence, Italy},
      PUBLISHER = {Association for Computational Linguistics},
      URL = {https://www.aclweb.org/anthology/P19-1175},
      PAGES = {1800--1809},
      ABSTRACT = {We present a simple yet powerful data augmentation method for boosting Neural Machine Translation (NMT) performance by leveraging information retrieved from a Translation Memory (TM). We propose and test two methods for augmenting NMT training data with fuzzy TM matches. Tests on the DGT-TM data set for two language pairs show consistent and substantial improvements over a range of baseline systems. The results suggest that this method is promising for any translation environment in which a sizeable TM is available and a certain amount of repetition across translations is to be expected, especially considering its ease of implementation.},
      DOI = {10.18653/v1/P19-1175},
    }

Development

After larger refactors and before new releases, always run the following commands in the root of the repository
to ensure a consistent code style. We use additional plugins to help with that. They can be installed
automatically via the dev extra:

    pip install .[dev]

For a consistent coding style, the following command will reformat the files in nfr/.

    make style

We use black-style conventions alongside isort.

For quality checking we use flake8. Run the following command and make sure to fix all warnings. Only
publish a new release when no more warnings or errors are present.

    make quality