Compute statistics using IMDb dataset files.
To download the IMDb dataset files:
download.sh <S3 key> <S3 password>
To compute the statistics, generate the plots / PNG images, and save the results into README.md:
generate.sh
As of February 6, 2022, there are 8,671,097 titles in the IMDb dataset files.
Titles can be partitioned into 10 different types:
7,557,081 titles (87.15%) have a start and/or end year defined:
The earliest title in IMDb is The Passage of Venus (1874). And, yes, 100 Years is planned for release in 2115!
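For illustration, here is a minimal sketch of how counts like the ones above could be computed. It is not the code behind `generate.sh`. It assumes the tab-separated layout of `title.basics.tsv`, with `titleType`, `startYear` and `endYear` columns and `\N` marking a missing value. The sample rows and the `summarize` helper are made up for the example.

```python
import csv
import io
from collections import Counter

MISSING = r"\N"  # assumption: missing values are a literal backslash-N

# Made-up sample rows using a subset of the title.basics.tsv columns.
SAMPLE = (
    "tconst\ttitleType\tstartYear\tendYear\n"
    "tt9000001\tshort\t1894\t\\N\n"
    "tt9000002\tmovie\t1999\t\\N\n"
    "tt9000003\ttvEpisode\t\\N\t\\N\n"
    "tt9000004\tmovie\t\\N\t\\N\n"
)


def summarize(tsv_file):
    """Count titles per type and titles with a start and/or end year."""
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    total = 0
    with_year = 0
    type_counts = Counter()
    for row in reader:
        total += 1
        type_counts[row["titleType"]] += 1
        if row["startYear"] != MISSING or row["endYear"] != MISSING:
            with_year += 1
    return total, type_counts, with_year


total, type_counts, with_year = summarize(io.StringIO(SAMPLE))
print(f"{total} titles, {len(type_counts)} types")
print(f"{with_year} titles ({100 * with_year / total:.2f}%) have a year defined")
```

With the real file, the `io.StringIO(SAMPLE)` argument would be replaced by an open file handle (for example one returned by `gzip.open(path, "rt", encoding="utf-8")`).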
2,350,973 titles (27.11%) have a runtime duration defined:
Most durations above 1,000 minutes are experimental videos, total durations for series, mistakes, etc.
Here are the statistics and frequency plot for feature films only:
For short films only:
There is some overlap between the short films and feature films. I’m not sure it totally makes sense (e.g. a feature film shorter than 10 minutes or a short film longer than 100 minutes?).
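One way to look at that overlap is to list the outliers directly. The sketch below assumes that feature films carry the `titleType` value `movie` and short films the value `short`, that `runtimeMinutes` holds the duration, and that `\N` marks a missing value. The `runtime_outliers` helper, its thresholds and the sample rows are hypothetical.

```python
MISSING = r"\N"  # assumption: missing values are a literal backslash-N


def runtime_outliers(rows, feature_below=10, short_above=100):
    """Return (short_features, long_shorts) as lists of (tconst, minutes)."""
    short_features = []
    long_shorts = []
    for row in rows:
        if row["runtimeMinutes"] == MISSING:
            continue  # no duration defined for this title
        minutes = int(row["runtimeMinutes"])
        if row["titleType"] == "movie" and minutes < feature_below:
            short_features.append((row["tconst"], minutes))
        elif row["titleType"] == "short" and minutes > short_above:
            long_shorts.append((row["tconst"], minutes))
    return short_features, long_shorts


# Made-up rows, for illustration only.
rows = [
    {"tconst": "tt9000001", "titleType": "movie", "runtimeMinutes": "8"},
    {"tconst": "tt9000002", "titleType": "movie", "runtimeMinutes": "95"},
    {"tconst": "tt9000003", "titleType": "short", "runtimeMinutes": "120"},
    {"tconst": "tt9000004", "titleType": "short", "runtimeMinutes": MISSING},
]
short_features, long_shorts = runtime_outliers(rows)
print(short_features)
print(long_shorts)
```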
1,212,002 titles (13.98%) have ratings.
Each title with a rating has at least 5 votes (this is a limit enforced by IMDb).
Most titles don’t have many votes. The full frequency plot is not very useful:
If we zoom in to 1,100 votes or fewer, we can see what’s happening a little better:
95% of the titles with votes are in that area (i.e. about 1,100 votes or fewer):
Here is a list of the titles with more than 1,000,000 votes:
Most (but not all) of those titles are feature films. The mean and median numbers of votes for feature films are greater than those for all titles:
But the plot still doesn’t look like a bell curve:
Question: what’s the minimum IMDb rating for a feature film that you should watch if you can only watch N feature films in your life?
Here’s the plot if you take into account all feature films with ratings:
If you take into account only feature films with 100 votes or more:
And now with 10,000 votes or more:
All the plots have the same shape: the more films you watch, the less strict/conservative you have to be about the minimum rating. It makes complete sense.
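The question boils down to: sort the ratings from best to worst and read off the rating of the N-th film. Here is a minimal sketch of that idea, not necessarily how the plots were generated. The `minimum_rating` helper and the sample data are made up, and returning `None` when fewer than N films qualify is an assumption of this sketch.

```python
def minimum_rating(films, n_films, min_votes=0):
    """Rating of the n-th best film among those with at least min_votes.

    films is an iterable of (average_rating, num_votes) pairs.
    Returns None when fewer than n_films films qualify.
    """
    ratings = sorted(
        (rating for rating, votes in films if votes >= min_votes),
        reverse=True,
    )
    if n_films < 1 or len(ratings) < n_films:
        return None
    return ratings[n_films - 1]


# Made-up (rating, votes) pairs, for illustration only.
films = [
    (9.1, 50),
    (8.7, 20000),
    (8.2, 150),
    (7.9, 12000),
    (6.4, 300),
    (5.0, 5),
]

for min_votes in (0, 100, 10000):
    print(min_votes, minimum_rating(films, 3, min_votes=min_votes))
```

Calling the function for every N from 1 up to the number of qualifying films gives one curve per minimum number of votes.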
If you put all the plots on the same image, it becomes clear in what way the minimum number of votes influences the minimum rating:
The higher the number of votes, the fewer feature films there are with that many votes. In other words, you can be less strict/conservative about the minimum rating with movies that have lots of votes.
But let’s be honest, shall we? You probably won’t see more than 5,000 feature films in your entire life, unless you’re a movie buff. So let’s zoom in a little:
At this scale, it becomes clear that the minimum number of votes matters less: the minimum rating doesn’t go all the way down to 1; actually, it doesn’t even go below 7.5 for most plots and doesn’t go below ~6.5 for any of them. It appears that it’s probably a good idea to steer clear of feature films with a rating lower than 7 or 8, depending on the number of films.
Example 1. Let’s say you only have time to watch 1,500 feature films. These are the minimum ratings for various minimum numbers of votes:
Example 2. What about 250 feature films?
At the time of writing, all the movies in the IMDb Top 250 have more than 25,000 votes and a rating of 8.0 or more. If I had to guess, I would have given a minimum rating of 8.1 for a maximum of 250 movies to watch and a minimum of 25,000 votes. The discrepancy probably comes from the fact that “only votes from regular IMDb voters are considered when creating the top 250 out of the full voting database”. I have no way of knowing which votes come from “regular IMDb voters”. This information is not included in the IMDb dataset files.