Project author: yourh

Project description:
Implementation for "AttentionXML: Label Tree-based Attention-Aware Deep Model for High-Performance Extreme Multi-Label Text Classification"
Language: Python
Repository: git://github.com/yourh/AttentionXML.git
Created: 2019-10-25T07:41:18Z
Project community: https://github.com/yourh/AttentionXML

AttentionXML

AttentionXML: Label Tree-based Attention-Aware Deep Model for High-Performance Extreme Multi-Label Text Classification

Requirements

  • python==3.7.4
  • click==7.0
  • ruamel.yaml==0.16.5
  • numpy==1.16.2
  • scipy==1.3.1
  • scikit-learn==0.21.2
  • gensim==3.4.0
  • torch==1.0.1
  • nltk==3.4
  • tqdm==4.31.1
  • joblib==0.13.2
  • logzero==1.5.0
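
One way to install the pinned dependencies above (a sketch assuming pip inside a Python 3.7.4 environment; adapt to your setup):

  pip install click==7.0 ruamel.yaml==0.16.5 numpy==1.16.2 scipy==1.3.1 \
      scikit-learn==0.21.2 gensim==3.4.0 torch==1.0.1 nltk==3.4 \
      tqdm==4.31.1 joblib==0.13.2 logzero==1.5.0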

Datasets

Download the GloVe embedding (840B, 300d) and convert it to the gensim format (i.e., a format that can be loaded by gensim.models.KeyedVectors.load).

A converted GloVe embedding is also provided here.
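
A minimal conversion sketch, assuming gensim 3.x and the raw GloVe text file downloaded to data/glove.840B.300d.txt (paths are illustrative):

  from gensim.models import KeyedVectors
  from gensim.scripts.glove2word2vec import glove2word2vec

  # Prepend the word2vec header that gensim expects, then re-save in
  # gensim's native format so it can later be read via KeyedVectors.load.
  glove2word2vec('data/glove.840B.300d.txt', 'data/glove.840B.300d.w2v.txt')
  vectors = KeyedVectors.load_word2vec_format('data/glove.840B.300d.w2v.txt')
  vectors.save('data/glove.840B.300d.gensim')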

XML Experiments

The XML experiments in the paper can be run directly, for example:

  ./scripts/run_eurlex.sh

Preprocess

Run preprocess.py on the train and test datasets with already-tokenized texts as follows:

  python preprocess.py \
      --text-path data/EUR-Lex/train_texts.txt \
      --label-path data/EUR-Lex/train_labels.txt \
      --vocab-path data/EUR-Lex/vocab.npy \
      --emb-path data/EUR-Lex/emb_init.npy \
      --w2v-model data/glove.840B.300d.gensim
  python preprocess.py \
      --text-path data/EUR-Lex/test_texts.txt \
      --label-path data/EUR-Lex/test_labels.txt \
      --vocab-path data/EUR-Lex/vocab.npy

Or run preprocess.py so that it also tokenizes the raw texts with NLTK, as follows:

  python preprocess.py \
      --text-path data/Wiki10-31K/train_raw_texts.txt \
      --tokenized-path data/Wiki10-31K/train_texts.txt \
      --label-path data/Wiki10-31K/train_labels.txt \
      --vocab-path data/Wiki10-31K/vocab.npy \
      --emb-path data/Wiki10-31K/emb_init.npy \
      --w2v-model data/glove.840B.300d.gensim
  python preprocess.py \
      --text-path data/Wiki10-31K/test_raw_texts.txt \
      --tokenized-path data/Wiki10-31K/test_texts.txt \
      --label-path data/Wiki10-31K/test_labels.txt \
      --vocab-path data/Wiki10-31K/vocab.npy
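
As a quick sanity check, the generated artifacts can be inspected as NumPy arrays (a hypothetical snippet; file names follow the --vocab-path and --emb-path flags above, and allow_pickle may be needed depending on how the vocabulary is saved):

  import numpy as np

  # Illustrative check only; exact array contents depend on preprocess.py.
  vocab = np.load('data/EUR-Lex/vocab.npy', allow_pickle=True)
  emb_init = np.load('data/EUR-Lex/emb_init.npy')
  print(len(vocab), emb_init.shape)  # vocabulary size, embedding matrix shape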

Train and Predict

Train and predict as follows:

  python main.py --data-cnf configure/datasets/EUR-Lex.yaml --model-cnf configure/models/AttentionXML-EUR-Lex.yaml

Or run prediction only with the option "--mode eval".
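
For example, reusing the EUR-Lex configuration from above:

  python main.py --data-cnf configure/datasets/EUR-Lex.yaml --model-cnf configure/models/AttentionXML-EUR-Lex.yaml --mode eval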

Ensemble

Train and predict with an ensemble of three trees:

  python main.py --data-cnf configure/datasets/Wiki-500K.yaml --model-cnf configure/models/FastAttentionXML-Wiki-500K.yaml -t 0
  python main.py --data-cnf configure/datasets/Wiki-500K.yaml --model-cnf configure/models/FastAttentionXML-Wiki-500K.yaml -t 1
  python main.py --data-cnf configure/datasets/Wiki-500K.yaml --model-cnf configure/models/FastAttentionXML-Wiki-500K.yaml -t 2
  python ensemble.py -p results/FastAttentionXML-Wiki-500K -t 3
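
Conceptually, ensembling pools each tree's top-scored labels per sample and re-ranks them by combined score. A minimal sketch of that idea for a single sample (the helper below is illustrative only, not the actual ensemble.py API):

  import numpy as np

  def ensemble_topk(labels_per_tree, scores_per_tree, top=100):
      """Merge one sample's top-k (label, score) lists from several trees
      by summing scores per label, then keep the highest-scoring labels."""
      pooled = {}
      for labels, scores in zip(labels_per_tree, scores_per_tree):
          for label, score in zip(labels, scores):
              pooled[label] = pooled.get(label, 0.0) + float(score)
      ranked = sorted(pooled.items(), key=lambda kv: -kv[1])[:top]
      return (np.array([label for label, _ in ranked]),
              np.array([score for _, score in ranked]))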

Evaluation

  python evaluation.py --results results/AttentionXML-EUR-Lex-labels.npy --targets data/EUR-Lex/test_labels.npy

Or additionally compute propensity-scored metrics:

  python evaluation.py \
      --results results/FastAttentionXML-Amazon-670K-labels.npy \
      --targets data/Amazon-670K/test_labels.npy \
      --train-labels data/Amazon-670K/train_labels.npy \
      -a 0.6 \
      -b 2.6
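
The -a and -b flags presumably set the A and B parameters of the empirical propensity model of Jain et al. (2016), standard in extreme classification (the exact formula used by evaluation.py may differ). Under that model, the propensity of a label with n_l positive instances among N training points is estimated as:

  p_l = \frac{1}{1 + C e^{-A \log(n_l + B)}}, \qquad C = (\log N - 1)(B + 1)^{A}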

Reference

You et al., AttentionXML: Label Tree-based Attention-Aware Deep Model for High-Performance Extreme Multi-Label Text Classification, NeurIPS 2019

Declaration

It is free for non-commercial use. For commercial use, please contact Mr. Ronghui You and Prof. Shanfeng Zhu (zhusf@fudan.edu.cn).