University work: a Master's seminar project based on Volkova et al., "Separating Facts from Fiction: Linguistic Models to Classify Suspicious and Trusted News Posts on Twitter".
The multiclass and binary datasets contain overlapping tweets (the multiclass set is a subset of the binary set). Not all tweets are still available, because some accounts have been suspended in the meantime. In total, our scraper collected 119,136 tweets.
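The scraper itself is not shown here, but the general approach is tweet-ID hydration: request tweets in batches of up to 100 IDs and keep whatever the API still returns. A minimal sketch, assuming tweepy 3.x; the file name ids.txt and the credential placeholders are hypothetical:

import tweepy

# hypothetical placeholder credentials; substitute your own app keys
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

with open("ids.txt") as f:  # one tweet ID per line (hypothetical file)
    ids = [line.strip() for line in f if line.strip()]

tweets = []
for i in range(0, len(ids), 100):  # statuses_lookup accepts at most 100 IDs per call
    # tweets from suspended or deleted accounts are silently missing from
    # the response, which is why fewer tweets come back than IDs exist
    tweets.extend(api.statuses_lookup(ids[i:i + 100]))

print("hydrated", len(tweets), "tweets")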
The paper mentions SyntaxNet preprocessing; however, I'm not sure where its output is consumed downstream.
SyntaxNet is installed via Docker; convenience shell scripts can be found in the syntaxnet directory.
The file wrapper.py is to be run from within a Docker container obtained from nardeas/tensorflow-syntaxnet. Usage:
# install the SyntaxNet image
$ docker pull nardeas/tensorflow-syntaxnet

# start the container, then open a shell inside it
$ ./syntaxnet/run.sh
$ ./syntaxnet/exec.sh
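wrapper.py's exact interface is not documented here, but SyntaxNet emits CoNLL-style parses (one token per line, tab-separated, with a blank line between sentences). A minimal sketch of a consumer for such output, assuming the standard 10-column CoNLL layout; the file name parsed.conll is hypothetical:

def read_conll(path):
    # parse CoNLL-style SyntaxNet output into per-sentence token dicts
    sentences, current = [], []
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:  # a blank line closes the current sentence
                if current:
                    sentences.append(current)
                    current = []
                continue
            cols = line.split("\t")
            current.append({
                "form": cols[1],       # surface token
                "pos": cols[4],        # fine-grained POS tag
                "head": int(cols[6]),  # index of the dependency head
                "deprel": cols[7],     # dependency relation label
            })
    if current:
        sentences.append(current)
    return sentences

sentences = read_conll("parsed.conll")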
Some tweets turned out to be empty; after preprocessing we retain a total of 116,882 (the current number of saved items in the parsed-tweets file). This drops again to roughly 98k after removing duplicates.
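A minimal sketch of that filtering step; the list name tweets and the "text" field are assumptions about the data layout:

seen = set()
cleaned = []
for tweet in tweets:
    text = tweet["text"].strip()
    if not text:      # drop tweets that are empty after preprocessing
        continue
    if text in seen:  # drop exact duplicates by text
        continue
    seen.add(text)
    cleaned.append(tweet)

print(len(cleaned), "tweets remain after filtering")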
“parsey_mcparseface was trained on a significantly larger dataset. Further, it was optimized to maximize both POS tagging accuracy and parsing accuracy.”