Project author: jalajthanaki

Project description:
Understanding of POS tags and build a POS tagger from scratch

Language: Jupyter Notebook
Project URL: git://github.com/jalajthanaki/POS-tag-workshop.git
Created: 2018-06-05T11:52:15Z
Project community: https://github.com/jalajthanaki/POS-tag-workshop


All about Part of Speech (POS) tags

Understanding of POS tags and build a POS tagger from scratch

This repository provides a basic understanding of POS tags and shows how to build a custom POS tagger using the Penn Treebank dataset.


Workshop Outline

There are three main sections.

Section 1. Introduction to Part of Speech tags

  • 1.1 What is Part of Speech?
  • 1.2 What is Part of Speech tagging?
  • 1.3 What is a Part of Speech tagger?
  • 1.4 What are the various types of Part of Speech tags?
  • 1.5 Which applications use POS tagging?

Section 2. Generate Part of Speech tags using various Python libraries

  • 2.1 Generating POS tags using the Polyglot library
  • 2.2 Generating POS tags using Stanford CoreNLP
  • 2.3 Generating POS tags using the spaCy library
  • 2.4 Why do we need to develop our own POS tagger?

Section 3. Build our own statistical POS tagger from scratch

  • 3.1 Import dependencies
  • 3.2 Explore dataset
    • 3.2.1 Explore the Brown corpus
    • 3.2.2 Explore the Penn Treebank corpus
  • 3.3 Generate features
  • 3.4 Transform dataset
  • 3.5 Build training and testing datasets
  • 3.6 Train model
  • 3.7 Measure accuracy
  • 3.8 Generate POS tags for a given sentence
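
As a preview of Sections 3.3 to 3.8, a minimal sketch of the kind of per-word feature function such a statistical tagger is built on (the notebook's actual feature set may differ):

```python
def features(sentence, index):
    """Return a feature dict for the word at sentence[index].

    A simplified sketch of per-word POS features; the notebook's
    actual feature set may be richer.
    """
    word = sentence[index]
    return {
        'word': word,
        'is_first': index == 0,                     # sentence-initial?
        'is_last': index == len(sentence) - 1,      # sentence-final?
        'is_capitalized': word[:1].isupper(),
        'is_all_caps': word.isupper(),
        'is_numeric': word.isdigit(),
        'prefix-1': word[:1],
        'suffix-2': word[-2:],
        'prev_word': '' if index == 0 else sentence[index - 1],
        'next_word': '' if index == len(sentence) - 1 else sentence[index + 1],
    }

feats = features(['The', 'dog', 'barked'], 1)
print(feats['prev_word'], feats['suffix-2'])  # → The og
```

Feature dicts like this are later vectorized and fed to a classifier, which is what Sections 3.4 to 3.6 do.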

Dependencies

  • Python 3.3+

  • Polyglot

  • Spacy

  • Py-CoreNLP (uses Stanford CoreNLP)

  • NLTK

  • Scikit-learn

  • Jupyter Notebook

Installation Instructions

General instructions

For section 1:

  • No dependencies are required for this section.

For section 2:

Three dependencies are required:

2.1. Polyglot

2.2. Stanford CoreNLP and Py-CoreNLP

2.3. Spacy POS tagger


Windows OS

2.1. Polyglot

For installation, refer to this link

  1. Step 1: $ git clone https://github.com/aboSamoor/polyglot.git
  2. Step 2: $ python setup.py install
  3. Step 3: Download the prebuilt wheels and install them with pip
  4. $ pip install pycld2-0.31-cp36-cp36m-win_amd64.whl
  5. $ pip install PyICU-1.9.8-cp36-cp36m-win_amd64.whl

2.2. Stanford POS tagger

Step 1: Install JDK 1.8 using this link

Step 2: Download Stanford CoreNLP from this link

  1. Step 2.1: Download and extract the Stanford CoreNLP

Step 3: Start service of Stanford CoreNLP

  1. Step 3.1: cd to the directory where you extracted Stanford CoreNLP
  2. Step 3.2: Run the server using all jars in the current directory (e.g., the CoreNLP home directory)
  3. $ java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000

Step 4: Set up Py-CoreNLP

  1. $ pip install pycorenlp

2.3. Spacy POS tagger

Step 1: Documentation for Spacy is here

Step 2: Run the installation commands

  1. $ sudo pip install spacy
  2. $ sudo python3 -m spacy download en

Linux OS

2.1. Polyglot

  1. Step 1: sudo apt-get update
  2. Step 2: sudo apt-get install python-pyicu
  3. Step 3: sudo pip install pycld2
  4. Step 4: sudo pip install Morfessor
  5. Step 5: sudo apt-get install python-numpy libicu-dev
  6. Step 6: sudo pip install PyICU
  7. Step 7: sudo pip install polyglot

2.2. Stanford POS tagger

Step 1: Install JDK 1.8

  1. Step 1.1: $ sudo mkdir /usr/lib/jvm
  2. Step 1.2: $ sudo tar xzvf jdk1.8.0_172.tar.gz -C /usr/lib/jvm
  3. Step 1.3: Set environment variable for java in .bashrc file
  4. $ sudo vi ~/.bashrc or sudo gedit ~/.bashrc
  5. Step 1.4: Set path at the end of the bashrc file
  6. JAVA_HOME=/usr/lib/jvm/jdk1.8.0_172
  7. PATH=$PATH:$HOME/bin:$JAVA_HOME/bin
  8. JRE_HOME=$JAVA_HOME/jre
  9. export JAVA_HOME
  10. export JRE_HOME
  11. export PATH

Step 2: Download Stanford CoreNLP from this link

  1. Step 2.1: Download and extract the Stanford CoreNLP

Step 3: Start service of Stanford CoreNLP

  1. Step 3.1: cd to the directory where you extracted Stanford CoreNLP
  2. Step 3.2: Run the server using all jars in the current directory (e.g., the CoreNLP home directory)
  3. $ java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000

Step 4: Set up Py-CoreNLP

  1. $ sudo pip install pycorenlp

Step 5 (alternative to Steps 1 to 4): run CoreNLP as a Docker image

  1. docker run -p 9000:9000 --rm -it motiz88/corenlp

2.3. Spacy POS tagger

Step 1: Documentation for Spacy is here

Step 2: Run the installation commands

  1. $ sudo pip install spacy
  2. $ sudo python3 -m spacy download en

Mac-OS

2.1. Polyglot

  1. Step 1: brew update
  2. Step 2: brew install icu4c
  3. Step 3: sudo pip install pycld2
  4. Step 4: sudo pip install Morfessor
  5. Step 5: sudo pip install numpy
  6. Step 6: sudo pip install PyICU
  7. Step 7: sudo pip install polyglot

2.2. Stanford POS tagger

Set up Stanford CoreNLP

Step 1: Install JDK 1.8 following these steps

Step 2: Download Stanford CoreNLP from this link

  1. Step 2.1: Download and extract the Stanford CoreNLP

Step 3: Start service of Stanford CoreNLP

  1. Step 3.1: cd to the directory where you extracted Stanford CoreNLP
  2. Step 3.2: Run the server using all jars in the current directory (e.g., the CoreNLP home directory)
  3. $ java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000

Step 4: Set up Py-CoreNLP

  1. Step 4.1: $ sudo pip install pycorenlp

2.3. Spacy POS tagger

Step 1: Documentation for Spacy is here

Step 2: Run the installation commands

  1. $ sudo pip install spacy
  2. $ sudo python3 -m spacy download en

For section 3:

Two dependencies are required:


3.1. NLTK

3.2. Scikit-learn


Windows OS

3.1 NLTK

  1. Step 1: $ sudo pip install numpy scipy nltk
  2. Step 2: Start a Python interpreter
  3. $ python3 (or python2)
  4. Step 3: Download the NLTK data inside the Python shell
  5. >>> import nltk
  6. >>> nltk.download()

3.2 Scikit-learn

  1. $ sudo pip install scikit-learn

Linux OS

3.1 NLTK

  1. Step 1: $ sudo pip install numpy scipy nltk
  2. Step 2: Start a Python interpreter
  3. $ python3 (or python2)
  4. Step 3: Download the NLTK data inside the Python shell
  5. >>> import nltk
  6. >>> nltk.download()

3.2 Scikit-learn

  1. $ sudo pip install scikit-learn

Mac OS

3.1 NLTK

  1. Step 1: $ sudo pip install numpy scipy nltk
  2. Step 2: Start a Python interpreter
  3. $ python3 (or python2)
  4. Step 3: Download the NLTK data inside the Python shell
  5. >>> import nltk
  6. >>> nltk.download()

3.2 Scikit-learn

  1. $ sudo pip install scikit-learn
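
Section 3's pipeline (features → DictVectorizer → classifier) can be sketched on a toy corpus. The feature set, the tiny tagged corpus, and the choice of DecisionTreeClassifier here are illustrative, not necessarily what the notebook uses:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy tagged corpus standing in for Penn Treebank (illustrative only).
tagged_sentences = [
    [('The', 'DT'), ('dog', 'NN'), ('barked', 'VBD')],
    [('A', 'DT'), ('cat', 'NN'), ('slept', 'VBD')],
]

def word_features(words, i):
    """Minimal per-word features; the notebook's set is richer."""
    word = words[i]
    return {'word': word.lower(), 'suffix-2': word[-2:],
            'is_capitalized': word[:1].isupper(), 'is_first': i == 0}

# Flatten sentences into one feature dict + one tag per word.
X, y = [], []
for sent in tagged_sentences:
    words = [w for w, _ in sent]
    for i, (_, tag) in enumerate(sent):
        X.append(word_features(words, i))
        y.append(tag)

# DictVectorizer one-hot encodes the feature dicts for the classifier.
clf = Pipeline([('vec', DictVectorizer(sparse=False)),
                ('tree', DecisionTreeClassifier(criterion='entropy'))])
clf.fit(X, y)

# Tag a new sentence with the trained model.
test = ['The', 'cat', 'barked']
pred = clf.predict([word_features(test, i) for i in range(len(test))])
print(list(pred))
```

In the notebook the same idea is applied at scale: features come from the real Treebank sentences, and accuracy is measured on a held-out split (Sections 3.5 to 3.7).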

Install jupyter notebook

  • For installation you can refer this link

  • Jupyter Notebook ships with Anaconda by default.

  • You can install Jupyter Notebook using the following command

    $ sudo pip install jupyter

  • To start Jupyter Notebook, run the following command in cmd/terminal

    $ jupyter notebook

Usage

  • For Section 1: use the Introduction_to_POS IPython notebook

  • For Section 2: use the POS_tagger_Demo IPython notebook

  • For Section 3: use the POS_from_scratch IPython notebook

Share this Git-Pitch Presentation

See the Git-Pitch presentation using this link

Special Thanks

Thanks to DataGiri/GreyAtom for hosting this event.