code-mixing

This repository contains the code for the project “Code mixing patterns in Celebrities” / “Quantifying Sense Deviation in Twitter” (Project 16), carried out by the students of Group 6 as part of the Speech & Natural Language Processing course (CS60057), Autumn 2017.

Students involved

Task distribution

Our work was largely classified into 3 tasks:

Task 1: Tagging and Formatting

For every tweet in our dataset, we assign each word a word-level tag and a phrase (matrix)-level tag.
In addition, each tweet as a whole is tagged as En / Hi / Code-switched / Code-mix-En / Code-mix-Hi / Code-mix-Equal / Other.
These tags are used in the later tasks. Please refer to Task_1_Formatting for further details.
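
As an illustration of how a tweet-level label could be derived from word-level tags, here is a minimal sketch. The function name `tweet_label` and the majority-count rule are assumptions made for this example and are not taken from Task_1_Formatting; the actual rules used in the project may differ. The Code-switched label is deliberately left out, since deciding it would need the phrase (matrix)-level tags as well.

```python
from collections import Counter


def tweet_label(word_tags):
    """Return a tweet-level label from a list of word-level tags.

    Hypothetical rule (an assumption, not the project's documented logic):
    only "En" and "Hi" tags are counted, and the majority language decides
    between the Code-mix-* labels. "Code-switched" is not handled here.
    """
    counts = Counter(tag for tag in word_tags if tag in ("En", "Hi"))
    en, hi = counts["En"], counts["Hi"]

    if en == 0 and hi == 0:
        return "Other"          # no English or Hindi words at all
    if hi == 0:
        return "En"             # purely English
    if en == 0:
        return "Hi"             # purely Hindi
    if en > hi:
        return "Code-mix-En"    # mixed, English dominant
    if hi > en:
        return "Code-mix-Hi"    # mixed, Hindi dominant
    return "Code-mix-Equal"     # mixed, equal counts


if __name__ == "__main__":
    print(tweet_label(["Hi", "En", "Hi", "Other"]))
```

Tags other than "En" and "Hi" (for example named entities or symbols) are ignored by this sketch when counting.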

Task 2: Dataset Analysis

Following the paper “All that is English may be Hindi: Enhancing language identification through automatic ranking of likeliness of word borrowing in social media”, we compute the UUR, UTR and UPR values.
To check the effectiveness of our tagging, we use the Jaccard coefficient and the Spearman rank correlation coefficient as evaluation measures.
Refer to Task_2_DatasetAnalysis for further details.
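
For readers unfamiliar with the two evaluation measures, the following is a small dependency-free sketch of both. The function names are chosen for this example only, and how the project applies the measures (for instance, which word sets or rankings are compared) is described in Task_2_DatasetAnalysis, not here.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(a | b)


def average_ranks(values):
    """1-based ranks of the values; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(rx, ry))
    vx = sum((p - mx) ** 2 for p in rx)
    vy = sum((q - my) ** 2 for q in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


if __name__ == "__main__":
    print(jaccard({"party", "school"}, {"school", "exam"}))
    print(spearman([0.9, 0.5, 0.1], [0.8, 0.6, 0.2]))
```

Computing Spearman as a Pearson correlation over average ranks keeps the result well defined when scores are tied, which the shortcut formula based on squared rank differences does not.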

Task 3: Sense Deviation

Following the techniques used in Hamilton et al. (2016), “Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change”, and in “Analyzing Semantic Changes in Japanese Loanwords”, we study which senses an English word takes on when it is used in a Hindi context on social media.
Further details are mentioned in Task_3_SenseDeviation.
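
Hamilton et al. (2016) quantify semantic change as the cosine distance between a word's vectors in two aligned embedding spaces. The sketch below shows only that final comparison step, assuming the two sets of vectors are already in a shared space; the names `cosine_distance`, `rank_by_deviation`, `en_vectors` and `hi_context_vectors` are hypothetical, and the toy vectors are illustrative, not project data.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def rank_by_deviation(en_vectors, hi_context_vectors):
    """Rank the shared words by cosine distance, largest deviation first.

    Both arguments map word -> vector and are assumed (not verified here)
    to live in the same, already aligned embedding space.
    """
    shared = en_vectors.keys() & hi_context_vectors.keys()
    scores = {
        word: cosine_distance(en_vectors[word], hi_context_vectors[word])
        for word in shared
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy two-dimensional vectors, purely for illustration.
    en_vectors = {"party": [1.0, 0.0], "school": [1.0, 1.0]}
    hi_context_vectors = {"party": [0.0, 1.0], "school": [1.0, 1.0]}
    for word, score in rank_by_deviation(en_vectors, hi_context_vectors):
        print(word, round(score, 3))
```

If the two embedding spaces are trained separately, they must first be aligned (Hamilton et al. use orthogonal Procrustes); that step is outside the scope of this sketch.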

The results for each part are given under the corresponding task. For a more comprehensive discussion of the results,
please refer to Group6_report.pdf.

Report and Slides

Please refer to Group6__sense_deviation_slides.pdf for the presentation slides
and Group6_report.pdf for the report.

NOTE: Due to space restrictions, we could not upload everything to GitHub. The complete code and data can
be found on our mentor Jasabanta Patro’s server, in the CelebrityCodeMixingTermProject directory.