Project author: tkukurin

Project description:
University work. Master's seminar based on Volkova et al. "Separating Facts from Fiction: Linguistic Models to Classify Suspicious and Trusted News Posts on Twitter"
Primary language: Jupyter Notebook
Repository URL: git://github.com/tkukurin/Lab.SeparatingFactFromFiction.git
Created: 2018-03-18T19:37:00Z
Project page: https://github.com/tkukurin/Lab.SeparatingFactFromFiction



Scraper

The multiclass and binary datasets contain overlapping Tweets (the multiclass set is a subset of
the binary set). Not all Tweets are available because some accounts have since been suspended. In
total, our scraper collected 119,136 Tweets.
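
The subset relationship can be checked directly once both datasets are reduced to Tweet IDs. The following is a minimal sketch, assuming each dataset is available as a collection of IDs; the function name `overlap_report` and the toy IDs are hypothetical and not taken from this repository.

```python
def overlap_report(multiclass_ids, binary_ids):
    """Summarize how two collections of Tweet IDs overlap."""
    multiclass, binary = set(multiclass_ids), set(binary_ids)
    return {
        "shared": len(multiclass & binary),
        "only_multiclass": len(multiclass - binary),
        "only_binary": len(binary - multiclass),
        # True when every multiclass ID also appears in the binary set
        "multiclass_is_subset": multiclass <= binary,
    }


# Toy IDs for illustration only (assumption: real IDs come from the dataset files)
report = overlap_report([1, 2, 3], [1, 2, 3, 4, 5])
print(report)
```

If `only_multiclass` is nonzero, the multiclass set is not a strict subset of the binary set, and the two need to be merged by ID before scraping.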

SyntaxNet

Volkova et al. mention SyntaxNet preprocessing, but I’m not sure where its output is used.

SyntaxNet is installed via Docker; convenience sh scripts are in the syntaxnet directory.

The file wrapper.py is meant to be run from within a Docker container created from the
nardeas/tensorflow-syntaxnet image. Usage:

  # install
  $ docker pull nardeas/tensorflow-syntaxnet
  # run it
  $ ./syntaxnet/run.sh
  $ ./syntaxnet/exec.sh

Processing

Some tweets turned out to be empty. After preprocessing we retain a total of 116,882 (the
current number of saved items in the parsed tweets file). The count drops again to ~98k after
removing duplicates.
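
The two filtering steps can be sketched as follows. This is only an illustration under stated assumptions: the `clean` function (URL stripping plus whitespace normalization) is a hypothetical stand-in for the actual preprocessing, and duplicates are assumed to mean exact matches on the cleaned text.

```python
import re


def clean(text):
    """Hypothetical preprocessing: strip URLs and collapse whitespace."""
    text = re.sub(r"https?://\S+", "", text)
    return " ".join(text.split())


def filter_tweets(tweets):
    """Drop tweets that are empty after cleaning, then drop exact duplicates."""
    seen = set()
    kept = []
    for tweet in tweets:
        text = clean(tweet)
        if not text:
            continue  # empty after preprocessing
        if text in seen:
            continue  # duplicate of an earlier tweet
        seen.add(text)
        kept.append(text)
    return kept


# Toy input for illustration only
sample = ["hello  world", "https://t.co/abc", "hello world", ""]
kept = filter_tweets(sample)
print(len(sample), "->", len(kept))
```

Keeping the first occurrence preserves the original order, so the surviving tweets can still be aligned with their labels by position.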

Better model (source)

“parsey_mcparseface was trained on a significantly larger dataset. Further, it was optimized to
maximize both POS tagging accuracy and parsing accuracy.”

Lexicons and resources

TODO

Resources