Project author: Aashish-1008

Project description: This repo introduces word embeddings. It contains complete code to train word embeddings from scratch on a small dataset.

Primary language: Jupyter Notebook
Project address: git://github.com/Aashish-1008/tf-word-embedding.git
Created: 2019-04-16T07:44:51Z
Project community: https://github.com/Aashish-1008/tf-word-embedding

License: MIT License


tf-word-embedding

This repo introduces word embeddings. It contains complete code to train word embeddings from scratch on a small dataset.

Representing text as numbers

Machine learning models take vectors (arrays of numbers) as input.
When working with text, the first thing we must do is come up with a strategy to
convert strings to numbers (or to “vectorize” the text) before feeding it
to the model. In this section, we will look at three strategies for doing so.

1. One-hot encodings

This approach is inefficient: a one-hot encoded vector is sparse (meaning most indices are zero).
Imagine we have 10,000 words in the vocabulary.
To one-hot encode each word, we would create a vector where 99.99% of the
elements are zero.
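
As a minimal sketch (the tiny five-word vocabulary stands in for the 10,000-word one):

```
import tensorflow as tf

# Illustrative five-word vocabulary standing in for a 10,000-word one.
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_index = {word: i for i, word in enumerate(vocab)}

# Each word becomes a vector as long as the vocabulary, with a single 1 in it.
one_hot = tf.one_hot(word_to_index["cat"], depth=len(vocab))
print(one_hot.numpy())  # [0. 1. 0. 0. 0.]
```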

2. Encode each word with a unique number

A second approach is to encode each word with a unique integer. This is more compact than a one-hot vector, but the assignment is arbitrary: the integer values capture no relationship between similar words. A minimal sketch follows.

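A minimal sketch of this idea in plain Python (the example sentence is illustrative):

```
# Assign each new word the next free integer; repeated words reuse their id.
sentence = "the cat sat on the mat"

word_to_index = {}
for word in sentence.split():
    word_to_index.setdefault(word, len(word_to_index))

encoded = [word_to_index[word] for word in sentence.split()]
print(word_to_index)  # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4}
print(encoded)        # [0, 1, 2, 3, 0, 4]
```
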
3. Word embeddings

Word embeddings give us an efficient, dense representation in which similar words have a similar encoding. An embedding is a dense vector of floating-point values whose entries are trainable parameters, learned by the model during training.

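A minimal sketch of a trainable embedding layer in Keras (the vocabulary size and embedding dimension are illustrative choices):

```
import numpy as np
import tensorflow as tf

vocab_size = 10_000   # number of distinct words
embedding_dim = 8     # each word maps to a dense 8-dimensional vector

embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)

# Looking up integer-encoded words returns trainable dense vectors.
word_ids = np.array([[4, 7, 1]])   # shape (batch, sequence_length)
vectors = embedding(word_ids)      # shape (1, 3, 8)
print(vectors.shape)
```
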
Installation

```
# on ubuntu
sudo apt install python3
# on mac
brew install python3
# install ludwig
pip install ludwig
python -m spacy download en
```

In this repo, I will try to build a simple text classifier using Ludwig.

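As a rough sketch of what that could look like with Ludwig's Python API (the column names `text` and `class` and the CSV path are assumptions, and argument names have changed between Ludwig versions):

```
from ludwig.api import LudwigModel

# Hypothetical column names and data path; adjust to the actual dataset.
config = {
    "input_features": [{"name": "text", "type": "text"}],
    "output_features": [{"name": "class", "type": "category"}],
}

model = LudwigModel(config)
results = model.train(dataset="data/train.csv")
```
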
Using local Python

You can run the training code locally:

```
JOB_DIR=jobDir
TRAIN_FILE=./data/train/
EVAL_FILE=./data/eval/

TRAIN_STEPS=2000

cd tf-ludwig-google-cloud-ml-engine/

python3.6 -m trainer.task --train-files $TRAIN_FILE \
--eval-files $EVAL_FILE \
--job-dir $JOB_DIR \
--train-steps $TRAIN_STEPS
```