Project author: Jash-2000

Project description:
This repository contains additional features, extended to the traditional Word2Vec library, launched in 2013
Primary language: Jupyter Notebook
Clone URL: git://github.com/Jash-2000/Optimized-Word2Vec.git
Created: 2020-10-30T03:55:22Z
Project homepage: https://github.com/Jash-2000/Optimized-Word2Vec

License: Creative Commons Zero v1.0 Universal

Download


Customized-Word2Vec

This repository contains additional features that extend the traditional Word2Vec library, launched in 2013.
This work was part of a larger project involving sentiment analysis and sarcasm detection; the project details are given in the repository-description section.

Directly clone the repository and start using it. The details of how the files are named and stored are given below:

-> The model can be built using any of 3 options: the Skip-gram model, Continuous Bag of Words (CBOW), and Negative Sampling.

-> You can change the following model parameters:

  • Window size
  • Vector size
  • Learning rate
  • Sub-sampling
  • Number of training epochs

-> The optimizer used is SGD.
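As a rough illustration of the tunable parameters and the SGD update, here is a minimal skip-gram sketch. The variable names, the plain-softmax loss, and the toy vocabulary are my assumptions for illustration, not the repository's actual code:

```python
import numpy as np

# Hypothetical hyperparameters mirroring the tunable options listed above;
# the names are illustrative, not the repository's actual variables.
params = {
    "window_size": 2,
    "vector_size": 5,
    "learning_rate": 0.025,
    "subsample_threshold": 1e-3,
    "epochs": 1,
}

vocab_size = 4  # toy vocabulary for the sketch
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(vocab_size, params["vector_size"]))  # input weights
W2 = rng.normal(scale=0.1, size=(params["vector_size"], vocab_size))  # output weights

def sgd_step(center_idx, context_idx):
    """One skip-gram SGD update for a (center, context) pair, plain-softmax loss."""
    h = W1[center_idx]                    # hidden layer = embedding of the center word
    scores = h @ W2
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                  # softmax over the vocabulary
    loss = -np.log(probs[context_idx])    # cross-entropy loss for this pair
    err = probs.copy()
    err[context_idx] -= 1.0               # dL/dscores
    grad_W2 = np.outer(h, err)            # gradient w.r.t. output weights
    grad_h = W2 @ err                     # gradient w.r.t. the center embedding
    W2[...] -= params["learning_rate"] * grad_W2
    W1[center_idx] -= params["learning_rate"] * grad_h
    return loss
```

Repeated calls on the same (center, context) pair should drive the returned loss down, which is a quick sanity check for the update direction.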

-> Other than the main .ipynb code and the entire training and testing corpora (NumPy arrays) of the Reuters dataset (which can be downloaded directly from here), I have included an Example directory where I have applied my code base.

For more information, please refer to the project report uploaded.


The Example folder contains:

-> NumPy arrays of the training and testing datasets.

-> A pickle file containing the dictionary:
key - word (string format)
value - one-hot encoding (NumPy array)
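The dictionary can be read back with Python's pickle module. A minimal sketch, assuming a placeholder filename and a tiny stand-in vocabulary (the actual pickle filename in the Example folder may differ):

```python
import pickle
import numpy as np

# Build a tiny stand-in dictionary in the same format as the repo's pickle:
# key = word (string), value = one-hot encoding (NumPy array).
vocab = ["cat", "dog", "fish"]
word_to_onehot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# "word_onehot.pkl" is a placeholder name, not the repository's actual file.
with open("word_onehot.pkl", "wb") as f:
    pickle.dump(word_to_onehot, f)

# Loading works the same way for the real file.
with open("word_onehot.pkl", "rb") as f:
    loaded = pickle.load(f)

print(loaded["dog"])  # [0. 1. 0.]
```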

-> The naming convention of the weight1 files is:

../Window{window size}/{choice of model}/{learning rate}{vector size}weight_numpy.npy
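For example, a path could be assembled from the template like this. The parameter values below are hypothetical, chosen only to illustrate the pattern; the repository's actual folder and model-choice names may differ:

```python
# Hypothetical parameter values, for illustration only.
window_size = 2
model_choice = "Skipgram"
learning_rate = 0.01
vector_size = 100

# Fill the template from the naming convention above.
path = f"../Window{window_size}/{model_choice}/{learning_rate}{vector_size}weight_numpy.npy"
print(path)  # ../Window2/Skipgram/0.01100weight_numpy.npy
```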

These weight matrices must be dot-producted with the one-hot vectors to get the actual embeddings, in the given format: W’X {W - weight matrix; X - one-hot vector}.
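The W’X product amounts to selecting one row of the weight matrix. A minimal sketch with made-up shapes; a real weight file would instead be loaded with np.load() from a path following the convention above:

```python
import numpy as np

# Made-up shapes for illustration; a real matrix would come from np.load().
vocab_size, vector_size = 6, 4
rng = np.random.default_rng(42)
W = rng.normal(size=(vocab_size, vector_size))  # weight matrix W

x = np.zeros(vocab_size)
x[3] = 1.0                                      # one-hot vector X for word index 3

embedding = W.T @ x  # W'X: equals row 3 of W, i.e. that word's embedding
```

Because X is one-hot, the product simply extracts the corresponding row of W, so `embedding` equals `W[3]`.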