# Analyze contributors to CRAN using Libraries.io data

For the first part of this project, to do with PyPI, see here.
As a graduate student in statistics, I used R a lot. In fact, entire semester-long courses were dedicated to learning how to harness some of R's single-purpose (read: esoteric) packages for statistical modeling. But when it came time for my capstone project, the data manipulation was daunting… until I discovered `dplyr`.
Moreover, I was completely taken by the paradigm outlined by `dplyr`'s author, Hadley Wickham, in his Split-Apply-Combine paper, which introduced what would become the guiding principle of the Tidyverse.
Since then, the Tidyverse has exploded in popularity, becoming the de facto standard for data manipulation in R, and Hadley Wickham's veneration among R users has only increased. And for good reason: Python and R are now the two most-used languages in data science.
So, the motivation for this project is akin to that of the aforementioned PyPI contributors investigation: is Hadley Wickham the most influential R contributor?
To answer this question, we will analyze the R packages uploaded to CRAN; specifically:

- the R packages themselves
- the dependencies among them
- their contributors

Using these items, we will use the degree centrality algorithm from graph theory to find the most influential node in the graph of R packages, dependencies, and contributors.
After constructing the graph (which included imputing contributor data for more than 2/3 of the R packages in the Libraries.io Open Data dataset) and analyzing the degree centrality, Hadley Wickham is indeed the most influential R contributor according to the data from Libraries.io and CRAN. Below are the 10 Contributors with the highest degree centrality scores for this graph:
Contributor | GitHub login | Degree Centrality Score |
---|---|---|
Hadley Wickham | hadley | 244 751 |
Jim Hester | jimhester | 170 123 |
Kirill Müller | krlmlr | 159 577 |
Jennifer (Jenny) Bryan | jennybc | 121 543 |
Mara Averick | batpigandme | 121 253 |
Gábor Csárdi | gaborcsardi | 101 351 |
Hiroaki Yutani | yutannihilation | 100 625 |
Christophe Dervieux | cderv | 98 078 |
Jeroen Ooms | jeroen | 82 055 |
Craig Citro | craigcitro | 71 207 |
For insight into how this result was arrived at, read on.
CRAN is the repository for R packages that developers know and love. Analogously, other programming languages have their respective package managers, such as PyPI for Python. As a natural exercise in abstraction, Libraries.io is a meta-repository for package managers. From their website:
> Libraries.io gathers data from 36 package managers and 3 source code repositories. We track over 2.7m unique open source packages, 33m repositories and 235m interdependencies between [sic] them. This gives Libraries.io a unique understanding of open source software. An understanding that we want to share with you.
Libraries.io has an easy-to-use API, but given that CRAN has 15,000+ packages in the Open Data dataset, the number of API calls to various endpoints needed to collate the data is not appealing (also, Libraries.io rate-limits to 60 requests per minute). Fortunately, Jeremy Katz on Zenodo maintains snapshots of the Libraries.io Open Data source. The most recent version is a snapshot from 22 December 2018 and contains several CSV files.
More information about these CSVs is in the README
file included in the Open
Data tar.gz, copied here.
There is a substantial reduction in the data when subsetting these CSVs just
to the data pertaining to CRAN; find the code used to subset them and the
size comparisons here.
WARNING: The tar.gz file that contains these data is 13 GB itself and, once downloaded, takes quite a while to untar; once uncompressed, the data take up 64 GB on disk!
Because of the interconnected nature of software packages (dependencies, versions, contributors, etc.), graph databases and graph theory are the ideal tools for finding the most influential “item” in that web of data. Neo4j is the most popular graph database according to DB-Engines, and is the one that we will use for this analysis. Part of the reason for its popularity is that its query language, Cypher, is expressive and simple:
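For example, the following statement creates the graph described by the terminology below (a minimal sketch):

```cypher
// Two Person nodes joined by a directed KNOWS relationship
CREATE (jane:Person {name: 'Jane Doe'})-[:KNOWS]->(john:Person {name: 'John Smith'})
RETURN jane, john;
```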
Terminology that will be useful going forward:

- `Jane Doe` and `John Smith` are nodes (equivalently: vertexes), each with the label `Person` and the property `name`
- `KNOWS` is the relationship (equivalently: edge) between them
- `KNOWS`, and all Neo4j relationships, are directed; i.e. `Jane Doe` KNOWS `John Smith`, but not the converse

On macOS, the easiest way to use Neo4j is via the Neo4j Desktop app, available as the `neo4j` cask on Homebrew.
Neo4j Desktop is a great IDE for Neo4j, allowing simple installation of different versions of Neo4j, as well as optional plugins (e.g. APOC) that are really the best way to interact with the graph database. Moreover, the screenshot above is taken from the Neo4j Browser, a nice interactive database interface and query result visualization tool.
Before we dive into the data model and how the data are loaded, note that Neo4j's default configuration isn't going to cut it for the packages and approach that we are going to use, so a customized configuration file, corresponding to Neo4j version 3.5.7, can be found here.
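The kinds of settings involved look roughly like this (illustrative values only; the real ones are in the linked file):

```properties
# Memory for the bulk import (values are illustrative; tune to your machine)
dbms.memory.heap.initial_size=4G
dbms.memory.heap.max_size=4G
dbms.memory.pagecache.size=8G
# Let APOC read local files and run unrestricted procedures
apoc.import.file.enabled=true
dbms.security.procedures.unrestricted=apoc.*,algo.*
```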
Importing from CSV is the most common way to populate a Neo4j graph, and is how we will proceed, given that the Open Data snapshot untars into CSV files. First, however, a data model is necessary: what entities will be represented as labeled nodes with properties, and what the relationships among them will be. Moreover, some settings of Neo4j have to be customized for proper and timely import from CSV.
Basically, when translating a data paradigm into graph form, the nouns become nodes, and how the nouns interact (the verbs) becomes the relationships. In the case of the Libraries.io data, the data model is as follows:
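Expressed in Cypher pattern notation (a sketch; the relationship directions match the queries later in this post), the model is:

```cypher
(:Platform)-[:HOSTS]->(:Project)
(:Platform)-[:HAS_DEFAULT_LANGUAGE]->(:Language)
(:Project)-[:IS_WRITTEN_IN]->(:Language)
(:Project)-[:HAS_VERSION]->(:Version)
(:Version)-[:DEPENDS_ON]->(:Project)
(:Contributor)-[:CONTRIBUTES_TO]->(:Project)
```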
So, a `Platform` `HOSTS` a `Project`, which `IS_WRITTEN_IN` a `Language` and `HAS_VERSION` `Version`s. Moreover, a `Project` `DEPENDS_ON` other `Project`s, and `Contributor`s `CONTRIBUTE_TO` `Project`s. With respect to `Version`s, the diagram communicates a limitation of the Libraries.io Open Data: `Project` nodes are linked in the dependencies CSV to other `Project` nodes, despite the fact that different versions of a project depend on varying versions of other projects. Take, for example, this row from the dependencies CSV:
ID | Project_Name | Project_ID | Version_Number | Version_ID | Dependency_Name | Dependency_Kind | Optional_Dependency | Dependency_Requirements | Dependency_Project_ID |
---|---|---|---|---|---|---|---|---|---|
29033435 | archivist | 687281 | 1.0 | 7326353 | RCurl | imports | false | * | 688429 |
I.e., `Version` 1.0 of `Project` `archivist` depends on `Project` `RCurl`. There is no demarcation of which version of `RCurl` it is that version 1.0 of `archivist` depends on, other than `*`, which forces the modeling decision of `Project`s depending on other `Project`s, not `Version`s.
It is impossible to answer the question of which contributor to CRAN is most influential without, obviously, data on contributors. However, the Open Data dataset lacks this information, so connecting it with contributor data requires calls to the Libraries.io API. As mentioned above, there is a rate limit of 60 requests per minute. If there are

```bash
$ mlr --icsv --opprint filter '$Platform == "CRAN"' then uniq -n -g "ID" projects-1.4.0-2018-12-22.csv
14455
```

R packages hosted on CRAN, each of which requires one request to the Contributors endpoint of the Libraries.io API, then at “maximum velocity” it will require 14455 / 60 ≈ 241 minutes (about 4 hours) to get contributor data for each project.
Following the example of this blog, it is possible to use the aforementioned APOC utilities for Neo4j to load data from web APIs, but I found that approach to be unwieldy and difficult to monitor. So, I used Python's `requests` and `sqlite3` packages to send requests to the endpoint and store the responses, in a long-running Bash process (code for this here).
Analogously to the unique constraint in a relational database, Neo4j has a uniqueness constraint, which is very useful for limiting the number of nodes created. Basically, it isn't useful, and hurts performance, to have two different nodes representing the platform PyPI (or the language Python, or the project `pipenv`, …), because each is a unique entity. Moreover, uniqueness constraints enable more performant queries.
The following
Cypher commands
add uniqueness constraints on the properties of the nodes that should be unique
in this data paradigm:
```cypher
CREATE CONSTRAINT ON (platform:Platform) ASSERT platform.name IS UNIQUE;
CREATE CONSTRAINT ON (project:Project) ASSERT project.name IS UNIQUE;
CREATE CONSTRAINT ON (project:Project) ASSERT project.ID IS UNIQUE;
CREATE CONSTRAINT ON (version:Version) ASSERT version.ID IS UNIQUE;
CREATE CONSTRAINT ON (language:Language) ASSERT language.name IS UNIQUE;
CREATE CONSTRAINT ON (contributor:Contributor) ASSERT contributor.uuid IS UNIQUE;
CREATE INDEX ON :Contributor(name);
```
All of the `ID` properties come from the first column of the CSVs and are ostensibly primary key values. The `name` property of `Project` nodes is also constrained to be unique so that queries seeking to match nodes on the name property (the way that we think of them) are performant as well.
N.b. if the graph from the first part of this analysis, concerning PyPI packages, is already populated, it will be necessary to drop the uniqueness constraint on `Project` names to avoid collisions. This is acceptable, as there will still be distinct `ID` values for the projects; and since `Project` name is the natural property to use for querying, a Neo4j index will do the trick:

```cypher
DROP CONSTRAINT ON (p:Project) ASSERT p.name IS UNIQUE;
CREATE INDEX ON :Project(name);
```
With the constraints, plugins, and configuration of Neo4j in place, the Libraries.io Open Data dataset can be loaded. Loading CSVs into Neo4j can be done with the default LOAD CSV command, but the APOC plugin has an improved version, `apoc.load.csv`, which iterates over the CSV rows as map objects instead of arrays; when coupled with periodic execution (a.k.a. batching), loading CSVs can be done in parallel, as well.
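The query files linked below follow this general shape (a sketch; the file name, column names, and batch size here are assumptions, not the actual queries):

```cypher
CALL apoc.periodic.iterate(
  "CALL apoc.load.csv('file:///cran_projects.csv') YIELD map RETURN map",
  "MERGE (p:Project {ID: toInteger(map.ID)}) SET p.name = map.Name",
  {batchSize: 10000, parallel: true}
);
```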
### R and CRAN Nodes

As all projects that are to be loaded are hosted on CRAN, the first node to be created in the graph is the CRAN `Platform` node itself:

```cypher
CREATE (:Platform {name: 'CRAN'});
```
Not all projects hosted on CRAN are written in R, but those are the focus of this analysis, so we need an R `Language` node:

```cypher
CREATE (:Language {name: 'R'});
```
With these two, we create the first relationship of the graph:

```cypher
MATCH (p:Platform {name: 'CRAN'})
MATCH (l:Language {name: 'R'})
CREATE (p)-[:HAS_DEFAULT_LANGUAGE]->(l);
```
Now we can load the rest of the entities in our graph, connecting them to these as appropriate, starting with `Project`s.
### MERGE Operation

The key operation when loading data into Neo4j is the MERGE clause. Using the property specified in the query, MERGE either MATCHes the node/relationship with that property or, if it doesn't exist, duly CREATEs it. If the property has a uniqueness constraint, Neo4j can thus iterate over possible duplicates of the “same” node/relationship, creating it only once and “attaching” nodes to the uniquely-specified node on the go.
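For example, running the following twice leaves exactly one node in the graph; the first run creates it and the second merely matches it:

```cypher
// MATCH-or-CREATE: idempotent thanks to the uniqueness constraint on name
MERGE (l:Language {name: 'R'});
```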
MERGE is a double-edged sword, though, when creating relationships between unique nodes: if the participating nodes are not specified exactly, MERGEing a relationship between them will create duplicate node(s). This is undesirable from an ontological perspective as well as a database efficiency perspective. All this to say that creating unique node-relationship-node entities requires three passes over a CSV: the first to MERGE the first node type, the second to MERGE the second node type, and the third to MATCH node type 1, MATCH node type 2, and MERGE the relationship between them.
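A sketch of that third pass (the project name is illustrative):

```cypher
// Passes 1 and 2 have already MERGEd the Project and Language nodes;
// MATCHing them exactly means MERGE can only create the relationship.
MATCH (p:Project {name: 'dplyr'})
MATCH (l:Language {name: 'R'})
MERGE (p)-[:IS_WRITTEN_IN]->(l);
```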
Lastly, for the same reason as the above, it is necessary to create “base” nodes before creating nodes that “stem” from them. For example, if we had not created the R `Language` node above (with unique property `name`), then for every R project MERGEd from the projects CSV, Neo4j would create a new `Language` node with name ‘R’ and a relationship between it and the R `Project` node. This duplication can be useful in some data models, but in the interest of parsimony, we will load data in the following order:
1. `Project`s
2. `Version`s
3. `Project`s and `Version`s
4. `Contributor`s

### Projects

First are the `Project` nodes. The source CSV for this type of node is the CRAN-subsetted projects CSV, and the queries are run with the `apoc.cypher.runFile` command; i.e.

```cypher
CALL apoc.cypher.runFile('/path/to/libraries_io/cypher/projects_apoc.cypher') yield row, result return 0;
```

The result of this set of queries is the following portion of our graph:
### Versions

Next are the `Version`s of the `Project`s. The source CSV for this type of node is cran_versions.csv and the queries are in this file. These queries are run with

```cypher
CALL apoc.cypher.runFile('/path/to/libraries_io/cypher/versions_apoc.cypher') yield row, result return 0;
```

The result of this set of queries is that the graph has grown to include the following nodes and relationships:
### Projects and Versions

Now that there are `Project` nodes and `Version` nodes, it's time to link their dependencies. The source CSV for these data is cran_dependencies.csv and the query is in this file. Because the `Project`s and `Version`s already exist, this operation is just one MATCH-MATCH-MERGE query, creating relationships. It is run with

```cypher
CALL apoc.cypher.runFile('/path/to/libraries_io/cypher/dependencies_apoc.cypher') yield row, result return 0;
```
Although the Libraries.io Open Data dataset contains dependencies among R projects, there are some projects with no versions listed on CRAN that still report `imports` relationships on their CRAN pages. So, in order to include the impact of these `DEPENDS_ON` relationships in the degree centrality algorithm, a pseudo `Version` node was created, with `"NONE"` for its name and an auto-generated UUID (from the `apoc.create.uuid` function) for its number; i.e.

```cypher
CREATE (v:Version {name: "NONE", number: apoc.create.uuid()});
```
Then, the R script found here creates a JSON file using that `Version` node, attached to `Project` nodes with no `DEPENDS_ON` relationships in the current graph. The JSON file is then loaded into Neo4j using the Cypher query here. That is, using the Neo4j cypher-shell and the `Rscript` executable from R:

```bash
export GRAPHDBPASS=graph_db_pass_here
Rscript --vanilla get_missing_dependencies_from_crandb_api.R > some_file.json && bin/cypher-shell -u neo4j -p "$GRAPHDBPASS"
```

which opens up the cypher-shell, in which the aforementioned Cypher query and the just-created JSON file can be passed to the shell.
The result of these operations is that the graph has grown to include
the DEPENDS_ON
relationship:
### Contributors

Because the data corresponding to R `Project` `Contributor`s were retrieved from the Libraries.io API, they are not loaded with Cypher from a file, but with a Python script, particularly this section.
Unfortunately, that's not the end of the story for the `Contributor` data: over 70% of the R `Project`s have no `Contributor`s reported by the Libraries.io API. So, even after contributors for the ~15k `Project`s were scraped from the API, more than 10k of those projects needed `Contributor` data imputed. To do this, I used the `crandb` package from one of the Top-10 most influential `Contributor`s, Gábor Csárdi. For each package on CRAN, the `crandb` package will return the information on its official CRAN page, in an R object that is easily parsed. For example, using `crandb` on the venerable bootstrapping package, `boot`, gives `Contributor`s in the form of Author and Maintainer:
```r
> library(crandb)
> crandb::package('boot')
CRAN package boot 1.3-23, 4 months ago
Title: Bootstrap Functions (Originally by Angelo Canty for S)
Maintainer: Brian Ripley <ripley@stats.ox.ac.uk>
Author: Angelo Canty [aut], Brian Ripley [aut, trl, cre] (author of
    parallel support)
# ...
```
The `Maintainer` field is always of the form `Maintainer: name <email>`, so that text was extracted and used as the `name` property of the `Contributor` node for the `Project`. The `Author` field proved to be too unstructured for reliable scraping. This process is in this R file.
After executing this process, the graph is now in its final form:
On the way to understanding the most influential `Contributor`, it is useful to find the most influential `Project`. Intuitively, the most influential `Project` node should be the node with the most (or very many) incoming `DEPENDS_ON` relationships; however, the degree centrality algorithm is not as simple as counting the number of incoming and outgoing relationships and ordering by descending cardinality (although that is a useful metric for understanding a [sub]graph). This is because the subgraph that we are considering to understand the influence of `Project` nodes also contains relationships to `Version` nodes.
So, using the Neo4j Graph Algorithms plugin's `algo.degree` procedure, all we need are a node label and a relationship type. The arguments to this procedure could be as simple as two strings, one for the node label and one for the relationship type. However, as mentioned above, there are two node labels at play here, so we will use the alternative syntax of the `algo.degree` procedure, in which we pass Cypher statements returning the set of nodes and the relationships among them.
To run the degree centrality algorithm on the `Project`s written in R that are hosted on CRAN, the syntax (found here) is:

```cypher
call algo.degree(
  "MATCH (:Language {name:'R'})<-[:IS_WRITTEN_IN]-(p:Project)<-[:HOSTS]-(:Platform {name:'CRAN'}) return id(p) as id",
  "MATCH (p1:Project)-[:HAS_VERSION]->(:Version)-[:DEPENDS_ON]->(p2:Project) return id(p2) as source, id(p1) as target",
  {graph: 'cypher', write: true, writeProperty: 'cran_degree_centrality'}
);
```
It is crucially important to alias as `source` the `Project` node MATCHed in the second query as the end node of the `DEPENDS_ON` relationship, and to alias the start node of the relationship as `target`. This is not officially documented, but the example in the documentation has it as such, and I ran into Java errors when it was not aliased exactly that way.
Now that there is a property on each R `Project` node denoting its degree centrality score, the following query returns the top 10 `Project`s:

```cypher
MATCH (:Language {name:'R'})<-[:IS_WRITTEN_IN]-(p:Project)<-[:HOSTS]-(:Platform {name:'CRAN'})
RETURN p.name, p.cran_degree_centrality ORDER BY p.cran_degree_centrality DESC LIMIT 10;
```
Project | Degree Centrality Score |
---|---|
Rcpp | 6048 |
ggplot2 | 4269 |
MASS | 4024 |
dplyr | 3573 |
plyr | 3017 |
stringr | 2622 |
Matrix | 2512 |
magrittr | 2200 |
httr | 2073 |
jsonlite | 2070 |
The `Project` that is out in front by a good margin is `Rcpp`, the R package that allows developers to integrate C++ code into R, usually for significant speedup. Another interesting note is that 4 of these top 10 are part of the “Tidyverse”, Hadley Wickham's collection of packages designed for data science. Moreover, as noted on the Tidyverse website, the last two `Project`s, `httr` and `jsonlite`, are “Tidyverse-adjacent”, in that they have a similar design and philosophy. It seems that the hypothesis that @hadley is the most influential contributor deserves a hefty amount of a priori weight!
To properly evaluate the hypothesis, the degree centrality algorithm is run again, this time focusing on the `Contributor` nodes and their contributions to `Project`s. The query (found here) is:

```cypher
call algo.degree(
  "MATCH (:Platform {name:'CRAN'})-[:HOSTS]->(p:Project) with p MATCH (:Language {name:'R'})<-[:IS_WRITTEN_IN]-(p)<-[:CONTRIBUTES_TO]-(c:Contributor) return id(c) as id",
  "MATCH (c1:Contributor)-[:CONTRIBUTES_TO]->(:Project)-[:HAS_VERSION]->(:Version)-[:DEPENDS_ON]->(:Project)<-[:CONTRIBUTES_TO]-(c2:Contributor) return id(c2) as source, id(c1) as target",
  {graph: 'cypher', write: true, writeProperty: 'cran_degree_centrality'}
);
```
This puts a property on each `Contributor` node denoting its degree centrality score, and the following query returns the top 10 `Contributor`s and their scores:

```cypher
MATCH (:Platform {name:'CRAN'})-[:HOSTS]->(p:Project)-[:IS_WRITTEN_IN]->(:Language {name: 'R'})
MATCH (c:Contributor)-[:CONTRIBUTES_TO]->(p)
RETURN c.name, c.cran_degree_centrality ORDER BY c.cran_degree_centrality DESC LIMIT 10;
```
Contributor | GitHub login | Degree Centrality Score | # Top-10 Contributions | # Total Contributions | Total Contributions Rank |
---|---|---|---|---|---|
Hadley Wickham | hadley | 239 829 | 5 | 121 | 2nd |
Jim Hester | jimhester | 167 662 | 3 | 120 | 3rd |
Kirill Müller | krlmlr | 154 655 | 3 | 106 | 5th |
Jennifer (Jenny) Bryan | jennybc | 119 082 | 3 | 57 | 13th |
Mara Averick | batpigandme | 118 792 | 3 | 50 | 15th |
Hiroaki Yutani | yutannihilation | 98 164 | 3 | 49 | 16th |
Christophe Dervieux | cderv | 98 078 | 3 | 36 | 28th |
Gábor Csárdi | gaborcsardi | 93 968 | 2 | 91 | 6th |
Jeroen Ooms | jeroen | 72 211 | 2 | 117 | 4th |
Craig Citro | craigcitro | 71 207 | 3 | 15 | 107th |
As was surmised from the result of the `Project`s degree centrality query, the most influential R contributor on CRAN is Hadley Wickham, and it's not even close. Not only does @hadley contribute to the second-most R projects of any `Contributor` (behind only Scott Chamberlain, who is curiously absent from the élite of the most influential), he also contributes to the most Top-10 projects of any `Contributor`, with fully half bearing his mark.
There are only 253 `Contributor`s who contribute to a Top-10 project (in terms of degree centrality); however, even being one of those is not a sufficient condition for a high degree centrality score. That is, even though this table hints at a correlation between degree centrality score and the number of total projects contributed to (query here and rank query here), there is a higher association between degree centrality and the number of Top-10 projects contributed to.
Indeed, using the `algo.similarity.pearson` function:

```cypher
MATCH (:Language {name:'R'})<-[:IS_WRITTEN_IN]-(p:Project)<-[:HOSTS]-(:Platform {name:'CRAN'})
WITH p ORDER BY p.cran_degree_centrality DESC
WITH collect(p) as r_projects
UNWIND r_projects as project
SET project.cran_degree_centrality_rank = apoc.coll.indexOf(r_projects, project) + 1
WITH project WHERE project.cran_degree_centrality_rank <= 10
MATCH (project)<-[ct:CONTRIBUTES_TO]-(c:Contributor)
WITH c, count(ct) as num_top_10_contributions
WITH collect(c.cran_degree_centrality) as dc, collect(num_top_10_contributions) as tc
RETURN algo.similarity.pearson(dc, tc) AS degree_centrality_top_10_contributions_correlation_estimate;
```
yields an estimate of 0.8462, whereas

```cypher
MATCH (:Language {name: 'R'})<-[:IS_WRITTEN_IN]-(p:Project)<-[:HOSTS]-(:Platform {name: 'CRAN'})
MATCH (p)<-[ct:CONTRIBUTES_TO]-(c:Contributor)
WITH c, count(ct) as num_total_contributions
WITH collect(c.cran_degree_centrality) as dc, collect(num_total_contributions) as tc
RETURN algo.similarity.pearson(dc, tc) AS degree_centrality_total_contributions_correlation_estimate;
```

yields only 0.6830.
All this goes to show that, in a network, the centrality of a
node is determined by contributing to the right nodes,
not necessarily the most nodes.
Using the Libraries.io Open Data dataset, the R projects on CRAN and their contributors were analyzed using Neo4j (in particular, the degree centrality algorithm) to find out which contributor is the most influential in the graph of R packages, versions, dependencies, and contributors. That contributor is @hadley: the Tidyverse creator, Hadley Wickham.
This analysis did not take advantage of a commonly-used feature of graph data: weights on the edges between nodes. A future improvement of this analysis would be to use, say, the number of versions of a project as the weight in the degree centrality algorithm, down-weighting projects that have few versions as opposed to projects that have verifiable “weight” in the R community, e.g. `dplyr`.
Similarly, it was not possible to delineate the type of contribution made in this analysis; more accurate findings would no doubt result from distinguishing between, for example, a package's author and a contributor who merged a small pull request to fix a typo. Additionally, the imputation of just a single contributor for more than 70% of the R packages potentially influenced the topology of this network in a non-trivial way.
Moreover, the data used in this analysis are just a snapshot of the state of CRAN from December 22, 2018; needless to say, the number of versions, projects, and contributions is always in flux, so the analysis warrants periodic updating. However, the Libraries.io Open Data are a good window into the dynamics of statistical programming's premier community.