Project author: mpancia

Project description:
Analysis of the ethics filings associated with the San Francisco D5 supervisor race.
Language: R
Repository: git://github.com/mpancia/san-francisco-ethics-d5.git
Created: 2019-07-13T21:11:34Z
Project community: https://github.com/mpancia/san-francisco-ethics-d5

License: The Unlicense

San Francisco D5 Ethics Data

Summary

This is a reproducible data pipeline that collects and parses the San
Francisco ethics data associated with the 2019 District 5 supervisor race.
Roughly speaking, it does the following:

  • Downloads the data from the SF Ethics Socrata endpoint (see the
    query in plan.R);
  • Parses the raw data to extract the entities and relationships within
    the data, thought of as a graph (the data model is described below);
  • Loads the entities and relations into a Neo4j graph database;
  • Corrects some of the data and adds extra information not present in
    the raw data;
  • Summarizes and extracts the data in several forms.
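
As a rough illustration of the parsing step, the sketch below turns a small flat table into node and relationship tables and writes them out as CSVs. The column names, sample rows, and table names are all hypothetical; the real extraction lives in the repository's code and follows the data model described later in this README.

```r
# Hypothetical flat filings table. The real columns come from the Socrata
# query in plan.R and will differ from these.
filings <- data.frame(
  first_name = c("Jane", "Jane", "John"),
  last_name  = c("Doe", "Doe", "Smith"),
  employer   = c("Acme", "Acme", "City Hospital"),
  committee  = c("Committee A", "Committee B", "Committee A"),
  amount     = c(100, 250, 500),
  stringsAsFactors = FALSE
)

# Key each donor by first and last name (assumed normalization: trim and
# upper-case).
filings$donor_id <- toupper(
  paste(trimws(filings$first_name), trimws(filings$last_name))
)

# Node tables: one row per distinct entity.
donors     <- unique(filings[, "donor_id", drop = FALSE])
committees <- data.frame(name = unique(filings$committee),
                         stringsAsFactors = FALSE)
employers  <- data.frame(name = unique(filings$employer),
                         stringsAsFactors = FALSE)

# Relationship tables: one row per edge in the graph.
donated_to <- filings[, c("donor_id", "committee", "amount")]
works_for  <- unique(filings[, c("donor_id", "employer")])

# Write everything as CSV so a graph database could load it later.
out_dir <- tempdir()
tables <- list(donors = donors, committees = committees,
               employers = employers, donated_to = donated_to,
               works_for = works_for)
for (name in names(tables)) {
  write.csv(tables[[name]], file.path(out_dir, paste0(name, ".csv")),
            row.names = FALSE)
}
```

Keeping nodes and relationships in separate tables means each entity is written once, however many donations refer to it.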

Methodology

The goal of this project is to be as transparent as possible. Several
choices are made in the process of parsing and analyzing the data:

  • Donors are identified by their first and last names (the number of
    donors is small, so name collisions are unlikely but still
    possible);
  • Employer names are standardized and re-mapped using regular
    expression rules found here and some data cleaning found in the
    match.R file (see the dplyr mutations in the appropriate function);
  • Employers are categorized by industry using the regular expression
    rules found here;
  • Occupations are categorized by industry using the regular expression
    rules found here;
  • Individuals are categorized by industry using the regular expression
    rules found here;
  • Each donation is associated with an industry by choosing the most
    specific of the following mappings that applies (listed from least
    to most specific):
    • A mapping of the donor’s employer -> industry
    • A mapping of the donor’s occupation -> industry
    • A mapping of the donor -> industry
  • Occupations are associated with occupation classes using the
    mappings found here.
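
The rule-based categorization and the "most specific wins" choice above can be sketched in base R as follows. This is an illustration under stated assumptions, not the repository's implementation: the rule tables, patterns, and sample donors are invented, the first matching rule is assumed to win, and the list order is read as employer < occupation < donor in specificity.

```r
# Hypothetical rule tables: a regular expression and the industry it maps to.
employer_rules <- data.frame(
  pattern  = c("RESTAURANT|CATERING", "ACCOUNTING|GARDENING"),
  industry = c("FOOD SERVICES", "OTHER SERVICES"),
  stringsAsFactors = FALSE
)
occupation_rules <- data.frame(
  pattern  = c("ARTIST|MUSICIAN|JOURNALIST", "MANAGER|ANALYST"),
  industry = c("MEDIA", "CONSULTING (BUSINESS)"),
  stringsAsFactors = FALSE
)
donor_rules <- data.frame(
  pattern  = "^JANE DOE$",
  industry = "REAL ESTATE (OTHER)",
  stringsAsFactors = FALSE
)

# Apply the rules in order; the first matching rule wins (an assumption).
# Values that match no rule stay NA.
classify <- function(x, rules) {
  out <- rep(NA_character_, length(x))
  for (i in seq_len(nrow(rules))) {
    hit <- is.na(out) & !is.na(x) &
      grepl(rules$pattern[i], x, ignore.case = TRUE)
    out[hit] <- rules$industry[i]
  }
  out
}

# Hypothetical donations.
donations <- data.frame(
  donor      = c("JANE DOE", "JOHN SMITH", "ALEX LEE", "SAM PARK"),
  employer   = c("Corner Restaurant", "Smith Accounting", "Lee Catering", NA),
  occupation = c("Manager", "Musician", "Owner", "Retired"),
  stringsAsFactors = FALSE
)

by_employer   <- classify(donations$employer, employer_rules)
by_occupation <- classify(donations$occupation, occupation_rules)
by_donor      <- classify(donations$donor, donor_rules)

# Most specific mapping wins: donor, then occupation, then employer.
donations$industry <- ifelse(
  !is.na(by_donor), by_donor,
  ifelse(!is.na(by_occupation), by_occupation, by_employer)
)
```

In this sketch a donor with no matching rule at any level is left as NA, so unclassified donations stay visible instead of being forced into a category.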

The mappings were originally made by the maintainer, Matthew Pancia,
and subsequently checked by Elena Palesis.

PRS ARE HIGHLY ENCOURAGED TO CORRECT ANY INFORMATION. You can also
add an issue here if you see an error.

Industry classification

Some notes about the industry classification process:

  • All of the mappings were initially done blindly – the classifiers
    did not look at whom the donations were associated with when making
    industrial classifications.
  • The assignment of individuals, companies, and occupations is
    not absolute – there is some discretion in the assignment, and a
    given person may genuinely fall into several categories. Matthew
    tried to use his best judgement when determining these mappings.
  • Creative professions are generally grouped into the MEDIA
    category, which includes artists, musicians, journalists,
    filmmakers, etc.
  • General business positions (even potentially at companies) are
    labeled as CONSULTING (BUSINESS); this includes managers,
    analysts, and other generic administrators
  • FOOD SERVICES includes restaurants, bars, caterers, food
    distributors, etc.
  • Professional services like accounting, gardening, etc. are labeled
    as OTHER SERVICES
  • REAL ESTATE (OTHER) includes some real estate
    investors/financiers, etc. This excludes people involved in
    non-profit real estate and the other more specific real estate
    categories as much as possible

Occupation class classification

  • All of the mappings were initially done blindly – the classifiers
    did not look at whom the donations were associated with when making
    occupation class classifications.

Requirements for reproduction

Reproducing the pipeline requires an installation of R as well as
credentials for a Neo4j database to write the data to.

The Neo4j database needs to have the APOC procedures installed, and it
should be configured to allow CSVs to be loaded from the directory that
the pipeline runs in.

These credentials should be stored in a .env file with the following
variables:

  1. NEO4JDB_URL=
  2. NEO4JDB_BOLT_USER=
  3. NEO4JDB_BOLT_PASSWORD=

This requires using the HTTP/HTTPS endpoint for Neo4j, not Bolt. The
code in plan.R reads these variables from the .env file in order to
connect to the database and uses unsecured HTTP by default. If you
want to use the HTTPS URL, modify the code there.
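
A minimal sketch of loading and checking these variables with base R is shown below. It is not the code in plan.R, which may read the file differently. The credential values are placeholders, and the exact format that plan.R expects for NEO4JDB_URL is not specified here, so the value used is only an assumption.

```r
# Write a throwaway .env file so the sketch is self-contained. These values
# are placeholders, not real credentials.
env_file <- tempfile(fileext = ".env")
writeLines(c(
  "NEO4JDB_URL=localhost:7474",
  "NEO4JDB_BOLT_USER=neo4j",
  "NEO4JDB_BOLT_PASSWORD=changeme"
), env_file)

# readRenviron() is base R: it reads KEY=VALUE lines and sets them as
# environment variables for the current session.
readRenviron(env_file)

# Fail early if any variable the pipeline needs is missing or empty.
required <- c("NEO4JDB_URL", "NEO4JDB_BOLT_USER", "NEO4JDB_BOLT_PASSWORD")
values   <- Sys.getenv(required)
unset    <- required[!nzchar(values)]
if (length(unset) > 0) {
  stop("Missing variables in .env: ", paste(unset, collapse = ", "))
}
```

Checking for empty values up front gives a clearer error than a failed database connection later in the pipeline.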

Usage

This uses drake as a workflow-management tool and renv for
environment management. If you are unfamiliar with drake, it’s basically
an R-based, data-aware version of GNU Make. If you run make.R
interactively, it will run a pipeline that generates all of the data and
loads it into the Neo4j server defined above.

Alternatively, you can use the dumps of the output data found in the
data/output folder. These come in several forms.

Data Model

This is the schema for the graph database:

Graph Schema

The taxonomy of industries can be found here.

The taxonomy of occupation classes can be found here.

Misc.

A shortlink for this repository can be found here.