Project author: NathanBrunet
Project description:
First-year (1A) project at ENSAM aimed at developing a word prediction algorithm in Python, using the TensorFlow platform and the Word2Vec family of models.
Language: Python
Project URL: git://github.com/NathanBrunet/Algorithme-de-prediction-des-mots---Projet-IA.git
Word Prediction Model using Neural Networks and Word Embeddings
This project implements a simple neural network model that predicts words from the context of a given word in a text corpus. Inspired by the principles of the Word2Vec algorithm, the model learns word embeddings that capture semantic relationships between words based on the words that appear near them in the text.
Project Overview
The goal of this project is to build a word prediction model that can predict the neighbors of a given input word. The model learns the contextual relationships between words in an input text by encoding words as one-hot vectors and training a neural network to predict words that appear in proximity. During training, the network produces 2-dimensional word embeddings that can later be used to measure distances and similarities between words.
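As an illustration of the one-hot encoding step, here is a minimal sketch; the helper names `build_vocab` and `one_hot` and the example sentence are illustrative and do not come from the project code.

```python
import numpy as np

def build_vocab(text):
    # Map each unique word of the text to an integer index.
    words = text.lower().split()
    vocab = sorted(set(words))
    word2int = {w: i for i, w in enumerate(vocab)}
    return words, word2int

def one_hot(index, vocab_size):
    # One-hot vector: all zeros except a 1 at the word's index.
    v = np.zeros(vocab_size)
    v[index] = 1.0
    return v

words, word2int = build_vocab("the king loves the queen")
print(one_hot(word2int["king"], len(word2int)))
```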
Key Features
- One-hot encoding of input and neighboring words.
- Neural network architecture with an embedding layer and softmax output.
- Training using gradient descent to learn word embeddings.
- Prediction of context words based on the distance between learned embeddings.
- Customizable window size around a target word to select context words.
How It Works
Data Preparation:
- The input text is tokenized into individual words, and unique words are mapped to integers.
- Pairs of words within a fixed context window are generated for training (word, context-word).
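A minimal sketch of this pair-generation step follows; the function name `generate_pairs` and the example sentence are illustrative, not taken from the project.

```python
def generate_pairs(words, word2int, window_size=2):
    # For each position, pair the centre word with every word inside
    # the surrounding window, producing (word, context-word) pairs.
    pairs = []
    for i, word in enumerate(words):
        start = max(0, i - window_size)
        end = min(len(words), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((word2int[word], word2int[words[j]]))
    return pairs

words = "the quick brown fox jumps over the lazy dog".split()
word2int = {w: i for i, w in enumerate(sorted(set(words)))}
pairs = generate_pairs(words, word2int, window_size=2)
```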
Model:
- A simple neural network with one hidden layer is used to learn 2-dimensional word embeddings.
- The network is trained using gradient descent to minimize the error between the predicted and actual neighboring words.
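Since the project relies on TensorFlow, a network of this shape could look like the Keras sketch below; the vocabulary size, optimizer settings, and cross-entropy loss are assumptions for illustration rather than the project's exact configuration.

```python
import tensorflow as tf

vocab_size = 9          # size of the vocabulary (example value)
embedding_dim = 2       # 2-dimensional embeddings, as in the project

# One hidden layer of size 2 acts as the embedding layer; the softmax
# output gives a probability over the vocabulary for the context word.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(vocab_size,)),
    tf.keras.layers.Dense(embedding_dim, use_bias=False, name="embedding"),
    tf.keras.layers.Dense(vocab_size, activation="softmax"),
])

# Plain gradient descent; cross-entropy is assumed as the error measure.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
              loss="categorical_crossentropy")

# x: one-hot centre words, y: one-hot context words, both of shape
# (num_pairs, vocab_size).
# model.fit(x, y, epochs=1000, verbose=0)
```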
Prediction:
- After training, the model can predict the most probable neighbors of a word by calculating the Euclidean distance between embeddings.
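A sketch of this nearest-neighbor lookup, using placeholder embeddings in place of the trained hidden-layer weights (in the trained model they could be read from the hidden layer's weight matrix):

```python
import numpy as np

# Placeholder 2-dimensional embeddings, one row per vocabulary word.
word2int = {"king": 0, "queen": 1, "loves": 2, "the": 3}
embeddings = np.array([[0.9, 0.8], [1.0, 0.7], [-0.5, 0.2], [0.1, -0.9]])

def nearest_words(word, word2int, embeddings, k=3):
    # Rank every other word by Euclidean distance to the query word.
    int2word = {i: w for w, i in word2int.items()}
    target = embeddings[word2int[word]]
    distances = np.linalg.norm(embeddings - target, axis=1)
    order = np.argsort(distances)
    return [int2word[i] for i in order if int2word[i] != word][:k]

print(nearest_words("king", word2int, embeddings, k=2))
```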
Future Improvements
- Use larger embedding dimensions for more accurate word predictions.
- Implement additional optimization techniques like Negative Sampling or Hierarchical Softmax.
- Train the model on a larger and more diverse corpus of text.
- Integrate pre-trained embeddings like Word2Vec, GloVe, or FastText.
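As a starting point for the Negative Sampling idea, TensorFlow's Keras API provides a `skipgrams` helper that draws random negative pairs alongside true context pairs; the sketch below uses placeholder values and only shows the call, not the project's pipeline.

```python
from tensorflow.keras.preprocessing.sequence import skipgrams

# Integer-encoded text (placeholder values) and vocabulary size.
sequence = [1, 2, 3, 4, 5, 1, 6]
vocab_size = 7

# `couples` holds (word, context) pairs; `labels` is 1 for true context
# pairs and 0 for randomly sampled negative pairs.
couples, labels = skipgrams(sequence, vocabulary_size=vocab_size,
                            window_size=2, negative_samples=1.0)
```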