Dataset of Influenza Incidence and Wikipedia Pagecounts (and Pageviews).
This dataset records influenza-like illness (ILI) activity levels in several European countries, from the 2007-2008 influenza season to the 2018-2019 one. It also includes Wikipedia pageviews and pagecounts data extracted for several specific pages.
The directories are named as follows:

wikipedia_{country}: contains the pageviews/pagecounts data for the selected Wikipedia pages. The data are divided by year and aggregated by week. Each file contains the following columns:
  week: a string composed of year-week_number;

{country}: contains the influenza incidence data for the specified country. The incidence data are divided by influenza season (each season spans two years), so the files are named {year}_{year+1}.csv. The columns are:
  week: a string composed of year-week_number;
  incidence: the incidence of influenza cases per 100000 people in that specific week.

Moreover, inside each wikipedia_{country} directory there is another level of subdivision (it is also present inside the {country} directories, but it matters only for the Wikipedia data, since for the incidence data it was added only to improve usability):

complete: contains the entire dataset, obtained by merging the pageviews and pagecounts data;
pageviews: contains only the pageviews data (available only from May 2015);
pagecounts: contains only the pagecounts data (pagecounts was the first method used to analyze traffic on Wikipedia);
cyclerank/pagerank: contain the complete dataset, restricted to a set of specific pages selected with the CycleRank or the PageRank algorithm;
cyclerank_pageviews/pagerank_pageviews: contain only the pageviews data (available only from May 2015), restricted to a set of specific pages selected with the CycleRank or the PageRank algorithm.

The only exception is the USA directory, in which the incidence data are provided in a single file called 2007_2013.csv. Moreover, for the USA, only the pagecounts data were extracted.
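The naming scheme described above can be sketched in Python. This is only an illustration under stated assumptions: the helper names (season_filename, incidence_path, parse_week), the country name "italy", the root folder "data", and the exact nesting {root}/{country}/{subset}/ are hypothetical and should be checked against the actual archive. The USA directory does not follow the per-season naming.

```python
from pathlib import Path
from typing import Tuple


def season_filename(start_year: int) -> str:
    """Return the incidence file name for the season starting in start_year."""
    # A season spans two calendar years, hence {year}_{year+1}.csv.
    return f"{start_year}_{start_year + 1}.csv"


def incidence_path(root: str, country: str, start_year: int,
                   subset: str = "complete") -> Path:
    """Build the path of one season file.

    Assumption: the subdivision directory (e.g. "complete") sits directly
    inside the {country} directory. Verify this against the real layout.
    """
    return Path(root) / country / subset / season_filename(start_year)


def parse_week(label: str) -> Tuple[int, int]:
    """Split a 'year-week_number' string into (year, week_number)."""
    year, week = label.split("-")
    return int(year), int(week)


if __name__ == "__main__":
    # "data" and "italy" are placeholder names, not taken from the dataset.
    print(incidence_path("data", "italy", 2018))
    print(parse_week("2015-07"))
```

Parsing the week column into integers makes it straightforward to align incidence rows with the weekly Wikipedia rows, since both use the same year-week_number string.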
The keywords directory contains the lists of the selected Wikipedia pages.
Each file is named keywords_{country}.csv and contains a simple list of all the pages monitored.
There are also files named keywords_{method}_{country}.csv, each containing a simple list of the monitored pages that were chosen using the given {method} (e.g. CycleRank or PageRank).
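As a rough sketch, the keyword file names can be built as follows. The function names, the country "italy", and the lowercase spelling of the method ("cyclerank") are assumptions made for illustration. The reader also assumes one page title per line, which is one plausible reading of "a simple list".

```python
from typing import Iterable, List, Optional


def keywords_filename(country: str, method: Optional[str] = None) -> str:
    """Return keywords_{country}.csv or keywords_{method}_{country}.csv."""
    if method is None:
        return f"keywords_{country}.csv"
    return f"keywords_{method}_{country}.csv"


def read_keywords(lines: Iterable[str]) -> List[str]:
    """Collect page titles, assuming one title per line (blank lines skipped)."""
    return [line.strip() for line in lines if line.strip()]


if __name__ == "__main__":
    # Placeholder names; adjust to the files actually present in keywords/.
    print(keywords_filename("italy"))
    print(keywords_filename("italy", "cyclerank"))
```

Because read_keywords accepts any iterable of strings, an open file handle can be passed to it directly.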
The influenza incidence values were extracted from several sources:
Licensing information about these datasets is unclear. While the copyright on these data lies with the institutions that produced them, we believe that we can share these data for research purposes. Please refer to the original websites for further information.
The pageviews dataset has been extracted from Wikimedia’s pagecounts-raw dataset, which is released in the Public Domain.
For further information, send us an email at cristian.consonni@unitn.it.