Implementation of Phrase Based Model to translate sentences from English to German and vice versa
This repository contains a project done as part of the course Natural Language Processing - Advanced, Spring 2014. The course was taught by Dr. Dipti Misra Sharma, Dr. Ravi Jampani and Mr. Akula Arjun Reddy.
A detailed report is available here
In this project, a phrase-based model is implemented. A phrase-based model is a simple model for machine translation that is based solely on lexical translation, that is, the translation of phrases. This requires a dictionary that maps phrases from one language to the other. We first find the word alignments. Next, using the bi-text corpus, we train the model and calculate the translation probabilities. Along with the translation probabilities, we use a language model to reflect fluency in English.
The source folder contains the scripts described below.
Run the following command to create a random set of x sentences:
python preprocess.py sourceCorpus targetCorpus numberOfSentencesForTraining
It will generate four files:
trainingSource.txt trainingTarget.txt testingSource.txt testingTarget.txt
trainingSource.txt, trainingTarget.txt: contain the given number of sentences
testingSource.txt, testingTarget.txt: contain 5 test sentences, which we use later
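The random split can be pictured with the following minimal sketch. It is not the code of preprocess.py: the function name split_corpus, the seed parameter, and the choice to keep test sentences disjoint from the training sentences are all assumptions made for illustration.

```python
import random


def split_corpus(source_lines, target_lines, n_train, n_test=5, seed=0):
    """Randomly pick aligned sentence pairs for training and testing.

    Hypothetical helper: name, signature and the disjoint train/test split
    are assumptions, not taken from preprocess.py.
    """
    if len(source_lines) != len(target_lines):
        raise ValueError("source and target corpora must have the same number of lines")
    if n_train + n_test > len(source_lines):
        raise ValueError("corpus is too small for the requested split")
    rng = random.Random(seed)
    # Sample line indices once so every source sentence stays paired
    # with its own translation.
    indices = rng.sample(range(len(source_lines)), n_train + n_test)
    train_idx, test_idx = indices[:n_train], indices[n_train:]
    train = [(source_lines[i], target_lines[i]) for i in train_idx]
    test = [(source_lines[i], target_lines[i]) for i in test_idx]
    return train, test
```

Sampling indices rather than sentences is what keeps the two sides of the corpus in step; shuffling each file separately would destroy the sentence pairing that GIZA++ relies on.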
Next, run the word alignment tool GIZA++ to obtain the alignments.
To run GIZA++, do the following:
./plain2snt.out trainingSource.txt trainingTarget.txt
./GIZA++ -s trainingSource.vcb -t trainingTarget.vcb -c trainingSource_trainingTarget.snt
If the previous step gives an error, run:
./snt2cooc.out trainingSource.vcb trainingTarget.vcb trainingSource_trainingTarget.snt > cooc.cooc
./GIZA++ -s trainingSource.vcb -t trainingTarget.vcb -c trainingSource_trainingTarget.snt -CoocurrenceFile cooc.cooc
This will generate several files. The word alignments are in the A3 file. Repeat this step with trainingSource.txt and trainingTarget.txt swapped to get the alignment in the other direction. Let sourceAlignment.txt and targetAlignment.txt be the two resulting files. Then we obtain the phrases as follows:
python phraseExtraction.py sourceAlignment.txt targetAlignment.txt
The phrases are generated in the file phrases.txt. Next we calculate the translation probability.
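The extraction step that produces phrases.txt can be pictured with the sketch below. It is an illustration under stated assumptions, not the code of phraseExtraction.py: it assumes the two directional alignments have already been parsed into sets of index pairs, combines them by intersection, and then keeps every phrase pair that is consistent with the alignment (no word inside the pair is aligned to a word outside it). The names symmetrize, extract_phrases and max_len are hypothetical.

```python
def symmetrize(src_to_tgt, tgt_to_src):
    """Keep only links found in both directions (intersection; an assumption).

    src_to_tgt holds (src_idx, tgt_idx) pairs, tgt_to_src holds (tgt_idx, src_idx).
    """
    return src_to_tgt & {(i, j) for (j, i) in tgt_to_src}


def extract_phrases(src_words, tgt_words, alignment, max_len=4):
    """Return phrase pairs consistent with `alignment`, a set of (src_idx, tgt_idx)."""
    phrases = set()
    for s_start in range(len(src_words)):
        for s_end in range(s_start, min(s_start + max_len, len(src_words))):
            # Target positions linked to any word in the source span.
            linked = [j for (i, j) in alignment if s_start <= i <= s_end]
            if not linked:
                continue
            t_start, t_end = min(linked), max(linked)
            if t_end - t_start + 1 > max_len:
                continue
            # Consistency check: no word in the target span may be aligned
            # to a source word outside the source span.
            consistent = all(
                s_start <= i <= s_end
                for (i, j) in alignment
                if t_start <= j <= t_end
            )
            if consistent:
                phrases.add((
                    " ".join(src_words[s_start:s_end + 1]),
                    " ".join(tgt_words[t_start:t_end + 1]),
                ))
    return phrases
```

For example, with the words "the house is small" and "das Haus ist klein" aligned one to one, the sketch yields pairs such as ("the house", "das Haus") and ("is small", "ist klein") in addition to the single-word pairs.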
Run the following command:
python findTranslationProbability.py phrases.txt
It will generate two files:
translationProbabilitySourceGivenTarget.txt
translationProbabilityTargetGivenSource.txt
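A common way to estimate both directions is relative frequency over the extracted phrase pairs. The sketch below assumes this estimator and an in-memory list of (source phrase, target phrase) pairs; the function name and data layout are hypothetical and may differ from what findTranslationProbability.py actually does.

```python
from collections import Counter


def translation_probabilities(phrase_pairs):
    """Estimate p(target | source) and p(source | target) by relative frequency.

    phrase_pairs: list of (source_phrase, target_phrase) tuples, one entry
    per extracted occurrence (assumed layout, not the format of phrases.txt).
    """
    pair_counts = Counter(phrase_pairs)
    src_counts = Counter(src for src, _ in phrase_pairs)
    tgt_counts = Counter(tgt for _, tgt in phrase_pairs)
    # p(t | s) = count(s, t) / count(s)
    tgt_given_src = {(s, t): c / src_counts[s] for (s, t), c in pair_counts.items()}
    # p(s | t) = count(s, t) / count(t)
    src_given_tgt = {(s, t): c / tgt_counts[t] for (s, t), c in pair_counts.items()}
    return tgt_given_src, src_given_tgt
```

If "house" was extracted twice with "Haus" and once with "Gebaeude", the sketch gives p(Haus | house) = 2/3 and p(Gebaeude | house) = 1/3.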
Next, prepare the input files for the language model:
python languageModelInput.py trainSource.txt trainS.txt
python languageModelInput.py trainTarget.txt trainT.txt
Compress each output file with gzip (giving trainS.gz and trainT.gz), since the language model tool reads its input through gunzip. It is run as follows:
./ngt -i="gunzip -c trainS.gz" -n=3 -o=train.www -b=yes
./tlm -tr=train.www -n=3 -lm=wb -o=trainS.lm
./ngt -i="gunzip -c trainT.gz" -n=3 -o=train.www -b=yes
./tlm -tr=train.www -n=3 -lm=wb -o=trainT.lm
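The -n=3 and -lm=wb options suggest a trigram model with Witten-Bell smoothing; that reading of the flags is an assumption here. The sketch below shows the idea in plain Python so the role of the language model is easier to follow. It does not reproduce the tool's file formats or exact estimates, and the class name WittenBellLM is hypothetical.

```python
from collections import Counter, defaultdict


class WittenBellLM:
    """Toy n-gram model with interpolated Witten-Bell smoothing (illustrative only)."""

    def __init__(self, sentences, order=3):
        self.order = order
        self.counts = Counter()            # n-gram tuple -> count, for n = 1..order
        self.context_counts = Counter()    # context tuple -> number of tokens seen after it
        self.followers = defaultdict(set)  # context tuple -> distinct words seen after it
        vocab = set()
        for sentence in sentences:
            tokens = ["<s>"] * (order - 1) + sentence.split() + ["</s>"]
            vocab.update(tokens[order - 1:])
            for pos in range(order - 1, len(tokens)):
                word = tokens[pos]
                for n in range(1, order + 1):
                    context = tuple(tokens[pos - n + 1:pos])
                    self.counts[context + (word,)] += 1
                    self.context_counts[context] += 1
                    self.followers[context].add(word)
        self.vocab_size = len(vocab)

    def prob(self, word, context):
        """P(word | context), using at most the last order-1 context words."""
        context = tuple(context)
        if len(context) > self.order - 1:
            context = context[len(context) - (self.order - 1):]
        return self._prob(word, context)

    def _prob(self, word, context):
        # Lower-order estimate: drop the oldest context word,
        # or fall back to a uniform distribution over the vocabulary.
        if context:
            lower = self._prob(word, context[1:])
        else:
            lower = 1.0 / self.vocab_size
        total = self.context_counts[context]
        if total == 0:
            return lower
        distinct = len(self.followers[context])
        # Witten-Bell: the more distinct continuations a context has,
        # the more weight goes to the lower-order estimate.
        return (self.counts[context + (word,)] + distinct * lower) / (total + distinct)
```

Trained on "the house is small" and "the house is big", the sketch assigns a much higher probability to "is" after "the house" than to "small", which is the fluency signal the translation step relies on.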
After the translation probabilities are obtained from the alignments, finalScore.py combines them with the probabilities from the language model and returns the final translation probability.
Run the following command for both directions:
python finalScore.py translationProbabilityTargetGivenSource.txt trainSource.lm finalTranslationProbabilityTargetGivenSource.txt
python finalScore.py translationProbabilitySourceGivenTarget.txt trainTarget.lm finalTranslationProbabilitySourceGivenTarget.txt
It returns the final translation probability files, which are then used on the test sentences:
python finalScore.py finalTranslationProbabilityTargetGivenSource.txt testingTarget.txt
python finalScore.py finalTranslationProbabilitySourceGivenTarget.txt testingSource.txt
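One simple way to combine a translation probability with a language model probability is to multiply them, which is the same as adding their logarithms. The sketch below assumes that combination; the rule actually used by finalScore.py may differ, and the names final_scores and best_translation are hypothetical.

```python
import math


def final_scores(translation_probs, lm_prob):
    """Combine p(target | source) with a language model score for the target phrase.

    translation_probs: dict mapping (source_phrase, target_phrase) -> probability
    lm_prob: callable returning the language model probability of a phrase
    The product rule (sum of logs) is an assumption, not taken from finalScore.py.
    """
    scores = {}
    for (src, tgt), p_tm in translation_probs.items():
        p_lm = lm_prob(tgt)
        # Skip pairs with zero probability, whose logarithm is undefined.
        if p_tm > 0 and p_lm > 0:
            scores[(src, tgt)] = math.log(p_tm) + math.log(p_lm)
    return scores


def best_translation(source_phrase, scores):
    """Return the highest-scoring target phrase for a source phrase, or None."""
    candidates = {tgt: s for (src, tgt), s in scores.items() if src == source_phrase}
    return max(candidates, key=candidates.get) if candidates else None
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and it lets the language model overrule a likely but disfluent phrase translation.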
The script errorAnalysis.py takes input in a very specific format: the source sentence, the translated sentence and the actual translation, separated by newlines. It writes the precision and recall for the input file to evalution.txt.
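The evaluation can be sketched as follows, assuming word-level matching with clipped counts and a simple average over all records. The matching unit and the averaging are assumptions, and the function names are hypothetical; errorAnalysis.py may compute the figures differently.

```python
from collections import Counter


def precision_recall(translated, reference):
    """Word-level precision and recall of a translation against its reference."""
    hyp, ref = Counter(translated.split()), Counter(reference.split())
    # Counter intersection clips each word to the number of times
    # it appears on both sides.
    overlap = sum((hyp & ref).values())
    precision = overlap / sum(hyp.values()) if hyp else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return precision, recall


def evaluate(lines):
    """Average precision and recall over records of three lines each:
    source sentence, translated sentence, actual translation."""
    results = [
        precision_recall(lines[i + 1], lines[i + 2])
        for i in range(0, len(lines) - 2, 3)
    ]
    if not results:
        return 0.0, 0.0
    avg_precision = sum(p for p, _ in results) / len(results)
    avg_recall = sum(r for _, r in results) / len(results)
    return avg_precision, avg_recall
```

For the translation "das Haus ist klein" and the reference "das Haus ist sehr klein", all four translated words appear in the reference (precision 1.0), while four of the five reference words are covered (recall 0.8).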