Project author: mchmarny

Project description:
GitHub BigQuery Export Utility
Primary language: Shell
Project URL: git://github.com/mchmarny/github-bigquery-exporter.git
Created: 2018-10-31T01:24:18Z
Project community: https://github.com/mchmarny/github-bigquery-exporter

License: Apache License 2.0


github-bigquery-exporter

GitHub BigQuery export utility, for those times when more granular PR and Issue queries are required. This is also a good way to query data for periods longer than the GitHub maximum of 30 days.

Requirements

token

You can run the export script without a GitHub API token, but you will be subject to much stricter rate limits. To avoid this (important for larger organizations), get a personal API token by following these instructions and define it in the GITHUB_ACCESS_TOKEN environment variable:

  export GITHUB_ACCESS_TOKEN="your long token string goes in here"

Remember, you have to be an org admin for this to work.
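
If you want to confirm that the token is being picked up and check your remaining quota, one option (not part of this repo's scripts, shown only for illustration) is to call the GitHub rate-limit endpoint directly:

  curl -s -H "Authorization: token ${GITHUB_ACCESS_TOKEN}" https://api.github.com/rate_limit

An authenticated request should report a core limit of 5,000 requests per hour instead of the unauthenticated 60.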

json2csv

The GitHub API returns data in JSON format. The simplest way to import the desired data elements is to convert them into CSV using json2csv, a Node.js utility that converts JSON to CSV.

  npm install -g json2csv
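
For reference, this is roughly how json2csv turns an exported JSON file into CSV; the file and field names below are only illustrative, since the actual fields are defined in the export script:

  json2csv -i issues.json -f number,title,state,created_at -o issues.csv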

Configuration

To configure the export script you will need to define the organization and provide a list of repositories in that organization:

  declare -r org="my-org-name"
  declare -a repos=("my-repo-1"
                    "my-repo-2"
                    "my-repo-3")

Optionally, to configure the import script you can edit the dataset name and the issue and pull table names. This step is only required if for some reason you have name conflicts in your BigQuery project.

  declare -r ds="github"
  declare -r issues_table="issues"
  declare -r pulls_table="pulls"
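
If you are not sure whether these default names collide with anything already in your project, one way to check (assuming the Google Cloud SDK bq CLI is installed and authenticated) is to list your datasets and inspect the one you plan to use before running the import:

  bq ls
  bq show github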

Export

To execute the GitHub export script run this command:

  ./export

The expected output should look something like this:

  Downloading issues for org/repo-1...
  Downloading prs for org/repo-1...
  Downloading issues for org/repo-2...
  Downloading prs for org/repo-2...

Import

To execute the BigQuery import script run this command:

  ./import

The expected output should look something like this:

  Dataset 'project:github' successfully created.
  Table 'project:github.issues' successfully created.
  Table 'project:github.pulls' successfully created.
  Waiting on bqjob_..._1 ... (0s) Current status: DONE
  Waiting on bqjob_..._1 ... (0s) Current status: DONE
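
As a quick sanity check (not part of the import script), you can confirm the load from the command line; the dataset and table names below assume the defaults from the Configuration section:

  bq query --use_legacy_sql=false 'select count(*) from github.issues'
  bq query --use_legacy_sql=false 'select count(*) from github.pulls'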

Query

When the above scripts have completed successfully, you should be able to query the imported data using SQL in the BigQuery console. For example, to find the repositories with the most issues over the last 90 days:

  select
    i.repo,
    count(*) num_of_issues
  from gh.issues i
  where date_diff(CURRENT_DATE(), date(i.ts), day) < 90
  group by
    i.repo
  order by 2 desc
TODO

  • Add org user export/import
  • Sort out the 2nd run where tables have to be appended
  • Bash, really? Can I haz me a service?

Scratch

Users who have activity (PR/issue) but are NOT in the users table:

  with active_users as (
    select username
    from gh.issues
    group by username
    union all
    select p.username
    from gh.pulls p
    group by username
  )
  select *
  from active_users
  where username not in (select username from gh.users)

Export the results as CSV and use them as input to user-export, which will download the GitHub data for each of those users. Then, when done, run user-import to bring those users into the gh.users table.
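
One possible way to produce that CSV (a sketch only; the flags, row cap, and output file name are illustrative) is to run the query above through the bq CLI with CSV output:

  bq query --format=csv --use_legacy_sql=false --max_rows=100000 '
    with active_users as (
      select username from gh.issues group by username
      union all
      select username from gh.pulls group by username
    )
    select * from active_users
    where username not in (select username from gh.users)
  ' > missing-users.csv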

Activity breakdown by company

  select all_prs.company, all_prs.prs apr, coalesce(m3_prs.prs, 0) rpr from (
    select
      COALESCE(u.company, 'Unknown') company,
      COUNT(*) prs
    from gh.pulls i
    join gh.users u on i.username = u.username
    group by company
  ) all_prs
  left join (
    select
      COALESCE(u.company, 'Unknown') company,
      COUNT(*) prs
    from gh.pulls i
    join gh.users u on i.username = u.username
    where i.ts > "2018-10-30 23:59:59"
    group by company
  ) m3_prs on all_prs.company = m3_prs.company
  order by 2 desc

  select u.company, count(*)
  from gh.pulls i join gh.users u on i.username = u.username
  where u.company is not null
  group by company order by 2 desc

  select
    pr_month,
    sum(google_prs) as total_google_prs,
    sum(non_google_prs) as total_non_google_prs
  from (
    select
      case when u.company = 'Google' then 1 else 0 end as google_prs,
      case when u.company = 'Google' then 0 else 1 end as non_google_prs,
      TIMESTAMP_TRUNC(i.`on`, MONTH) as pr_month
    from gh.pulls i
    join gh.users u on i.username = u.username
    where u.company is not null
  )
  group by pr_month
  order by 1

PRs

  select
    pr_month,
    sum(google_prs) as total_google_prs,
    sum(non_google_prs) as total_non_google_prs
  from (
    select
      case when u.company = 'Google' then 1 else 0 end as google_prs,
      case when u.company = 'Google' then 0 else 1 end as non_google_prs,
      TIMESTAMP_TRUNC(p.ts, MONTH) as pr_month
    from gh.pulls p
    join gh.users u on p.username = u.username
    where u.company <> ''
  )
  group by pr_month
  order by 1

Issues

  select
    pr_month,
    sum(google_prs) as total_google_prs,
    sum(non_google_prs) as total_non_google_prs
  from (
    select
      case when u.company = 'Google' then 1 else 0 end as google_prs,
      case when u.company = 'Google' then 0 else 1 end as non_google_prs,
      TIMESTAMP_TRUNC(i.ts, MONTH) as pr_month
    from gh.issues i
    join gh.users u on i.username = u.username
    where u.company <> ''
  )
  group by pr_month
  order by 1

  select pr_month, repo, count(*) as prs
  from (
    select
      i.repo,
      TIMESTAMP_TRUNC(i.ts, MONTH) as pr_month
    from gh.pulls i
    join gh.users u on i.username = u.username
    where u.company is not null
  )
  group by pr_month, repo
  order by 1, 3 desc

  select
    pr_month,
    repo,
    count(*) action
  from (
    select
      repo,
      SUBSTR(CAST(TIMESTAMP_TRUNC(ts, MONTH) as STRING), 0, 7) as pr_month
    from gh.issues
    union all
    select
      repo,
      SUBSTR(CAST(TIMESTAMP_TRUNC(ts, MONTH) as STRING), 0, 7) as pr_month
    from gh.pulls
  )
  where repo = 'build' --'build-pipeline'
  group by repo, pr_month
  order by 1, 2

  select repo, count(*) from (
    select
      repo
    from gh.issues
    union all
    select
      repo
    from gh.pulls
  )
  group by repo
  order by 2 desc