项目作者: MaartenGr

项目描述 :
Tool for extracting and analyzing IMDB reviews
高级语言: Jupyter Notebook
项目地址: git://github.com/MaartenGr/Reviewer.git
创建时间: 2020-08-03T05:19:26Z
项目社区:https://github.com/MaartenGr/Reviewer

开源协议:MIT License

下载




Code for scraping IMDB reviews and applying NER and Sentiment Analyses
to extract Character popularity

Reviewer can be used to scrape user reviews from IMDB, generate word clouds based
on a custom class-based TF-IDF, and extract popular characters/actors from reviews
using a combination of Named Entity Recognition and Sentiment Analyses.

Table of Contents

  1. Instructions

    a. Scrape

    b. Word Cloud

    c. Character Analysis

  2. c-TF-IDF

  3. Disney

  4. Sources

1. Instructions

Back to ToC

I would advise you to start with the notebooks/Overview.ipynb notebook for a good
introduction before going to the command line. Moreover, scraping multiple movies
is actually preferred as it allows you to use the class-based TF-IDF.

Instead, you can dowload or fork this repo and start with the
instructions below.

1.a Scrape

To scrape a single movie (e.g., Aladdin), simply run from the command line:

  1. python scraper.py --prefix aladdin --url https://www.imdb.com/title/tt0103639/reviews?ref_=tt_ov_rt

Make sure to select the url of the review page of the movie you want to scrape.
The prefix variable is the name used for saving the resulting .json file.

Not only is the movie scraped, count data is also extracted if it is a single movie. If you want
to apply the class-based TF-IDF, I would suggest to follow the instructions at notebooks/Overview.ipynb.

1.b Word Cloud

Make sure that you save an image that you want to be used as a mask.
It is important the background is white and the file saved as a .jpg.

Then, after scraping the reviews, run the following from the command line:

  1. python scrape.py --path "data/aladdin_count.json" --mask your_mask.jpg --pixels 1200

The data/aladdin_count.json is the file saved after running the scraper. The name
thus depends on the prefix at the scrape stage.

The result is something like this:






1.c Character Analysis

We want to extract, from the reviews, which characters and actors are often talked about.
We start by using Named Entity Recognition to extract the entity “Person” from reviews.
Then, in the sentence where the entity “Person” is found, we apply sentiment analysis to
understand the sentiment about that character. In other words, we extract often talked about characters
combined with how positive those characters are regarded.

To do this, I made use of Named Entity Recognition and Sentiment Analysis using pre-trained BERT models.

After having scraped the review data, run the following from the command line:

  1. python char.py --movie Aladdin --extract True --fast True --prefix disney --rpath disney_reviews.json

The result will be the following visualization:



2. Class-based TF-IDF

Back to ToC

This project uses a custom TF-IDF used for exploring words that are interesting based
on the differences between classes. In other words, words are only important if they are often
mentioned in one class, but not so much in all other classes. I call it a class-based TF-IDF (c-TF-IDF):



The above formula can best be explained as a TF-IDF formula adopted for
multiple classes by joining all documents per class. Thus, each class is converted
to a single document instead of set of documents. Then, the frequency of words t
are extracted for each class i and divided by the total number of words w.

Next, the total, unjoined, number of documents across all classes m is divided by
the total sum of word i across all classes.

3. Disney

Back to ToC

Initially, this project was meant for me to be used only for Disney and Pixar movies
(as I enjoy those very much), but eventually I generalized the code to be used for,
in principle, all movies.

However, I also analyzed the most popular characters (by relative frequency in reviews)
and created a visualization of it below:



Moreover, you will find some Disney snippets here and there that I purposefully did not remove as
there were some manual fixes to get the visualizations working and the pipeline running. See
notebooks/Overview.ipynb for more information on how to run that code.

4. Sources

Back to ToC

An overview of all sources used in this package (mainly images for masks).
All rights on these images belong to Disney, Pixar, and Marvel.


Mask Images

Aladdin - https://www.amazon.com/Aladdin-Official-Lifesize-Cardboard-Fan/dp/B07QSZ5GC9 Coco - https://www.jing.fm/iclipt/mJTmmi/
Avengers - https://besthqwallpapers.com/films/ironman-4k-superheroes-iron-man-white-background-38148 Up - https://www.hiclipart.com/free-transparent-background-png-clipart-semci
Toy Story 3 - https://pixar.fandom.com/wiki/Lots-o%27-Huggin%27_Bear Frozen - https://tvtropes.org/pmwiki/pmwiki.php/Characters/FrozenElsa
Moana - https://brooklynactivemama.com/11/2016/obsessed-disneys-moana-free-moana-movie-printables.html Tangled - https://www.pngfuel.com/free-png/nfjxg
Toy Story (Woody) - http://www.allocine.fr/evenements/pixar/chapitre2/ Toy Story (Buzz) - https://heroes-and-villians.fandom.com/wiki/Buzz_Lightyear


Disney Popularity Images

Simba - https://lionking.fandom.com/wiki/Simba Basil - https://disney.fandom.com/wiki/Basil_of_Baker_Street
Mowgli - https://disney.fandom.com/wiki/Mowgli Woody - https://www.vhv.rs/viewpic/hbRoomw_woody-toy-story-png-png-download-toy-story/
Woody - https://nl.disney.be/films/toy-story-4 Carl - https://pixar.fandom.com/wiki/Carl_Fredricksen
Mike - https://www.pngegg.com/en/png-emxle Belle - https://i.pinimg.com/550x/89/e6/29/89e629b622a929e9b2e1b825c34a3c71.jpg
Rapunzel - https://picsart.com/i/318196865223211 Ariel - https://princess.disney.com/ariel