项目作者: casualcomputer

项目描述 :
Functions that generate SQL queries that summarize high-dimensional tables stored in various databases (e.g. Microsoft SQL Servers, Netezza, DB2, Postgres, Oracle, MySQL, etc.).
高级语言: R
项目地址: git://github.com/casualcomputer/sql.mechanic.git
创建时间: 2021-06-19T05:06:49Z
项目社区:https://github.com/casualcomputer/sql.mechanic

开源协议:GNU General Public License v3.0

下载


Summarizing large tables with SQL

Henry Luan

Overview

The core helper function get_summary_codes in the sql.mechanic
package takes in a string that specifies a table’s database name, schema
name, and table name. It outputs an SQL query that summarizes the
table’s column statistics.

If you don’t have much time to read or just want a quick solution, jump
to Example
2
.

Intended User

  • If you work with tables inside databases hosted on powerful servers
    (usually on-premise) but have limited compute resources (e.g. RAM,
    GPU, CPU) for advanced BI (e.g. Python, R, SAS, etc.)

  • If it’s much cheaper for you to use database servers than analytics
    servers (e.g. those for Python, R, etc.) either on-premise or on the
    cloud.

Limitations

  • You probably want to understand how Example
    2
    affect
    the CPU and disk usage of your database server, to avoid bad surprises
    on your server’s resource usage.

  • If you use the setup of Example
    2
    on a
    cloud database, you MUST do some testing, to understand how the
    example affects the CPU and disk usage of your cloud resource. Please
    avoid potentially expensive mistakes.

  • Currently, the function only works with Microsoft SQL Server and
    Netezza databases. Feel free to contribute to the codes, if
    interested.

Credits

Specail credits: the R codes in this package are built on top of Gordon
S. Linoff’s book Data Analysis Using SQL and Excel, 2nd Edition. His
work has been a tremendous inspiration for the creation of this package.

Example 1: Generate SQL queries and execute them in DMBS

Step 1: Install packages

You can install the library from my GitHub. If you have concerns
regarding the package’s security, you can download, check, and use the
“get_summary_codes.R” file directly.

  1. # Install package
  2. library(devtools)
  3. install_github("casualcomputer/sql.mechanic",quiet=TRUE)
  4. # Alternative: use the 'get_summary_codes.R' file only
  5. # source("get_summary_codes.R")

Step 2: Generate SQL quires

The following codes 1) generate the SQL queries you need to summarize a
table, and 2) copy (Ctrl+C) the codes to your clipboard. All you have to
do is paste it to your SQL editor and execute the queries.

  1. library(sql.mechanic)
  2. #SQL codes for basic summary, Netezza database
  3. sql_basic_netezza = get_summary_codes("DB_NAME.SCHEMA_NAME.TABLE_NAME", type="basic", dbtype="Netezza")
  4. #SQL codes for advanced summary, Netezza database
  5. sql_advanced_netezza = get_summary_codes("DB_NAME.SCHEMA_NAME.TABLE_NAME", type="advanced", dbtype="Netezza")
  6. #SQL codes for basic summary, Microsoft SQL Server
  7. sql_basic_mssql = get_summary_codes("DB_NAME.SCHEMA_NAME.TABLE_NAME", type="basic", dbtype="MSSQL")
  8. #SQL codes for advanced summary, Microsoft SQL Server
  9. sql_advanced_mssql = get_summary_codes("DB_NAME.SCHEMA_NAME.TABLE_NAME", type="advanced", dbtype="MSSQL")
  10. #copy some of the sql queries to the clipboard
  11. writeClipboard(sql_basic_netezza)

Step 3: Paste the codes in your clipboard and run it in SQL

In case you are curious, the SQL copied to your clipboard looks like
this.

  1. SELECT REPLACE(REPLACE(REPLACE('<start> SELECT ''<col>'' as colname,
  2. COUNT(*) as numvalues,
  3. MAX(freqnull) as freqnull,
  4. CAST(MIN(minval) as CHAR(100)) as minval,
  5. SUM(CASE WHEN <col> = minval THEN freq ELSE 0 END) as numminvals,
  6. CAST(MAX(maxval) as CHAR(100)) as maxval,
  7. SUM(CASE WHEN <col> = maxval THEN freq ELSE 0 END) as nummaxvals,
  8. SUM(CASE WHEN freq =1 THEN 1 ELSE 0 END) as numuniques
  9. FROM (SELECT <col>, COUNT(*) as freq
  10. FROM SCHEMA_NAME.<tab> GROUP BY <col>) osum
  11. CROSS JOIN (SELECT MIN(<col>) as minval, MAX(<col>) as maxval, SUM(CASE WHEN <col> IS NULL THEN 1 ELSE 0 END) as freqnull
  12. FROM (SELECT <col> FROM SCHEMA_NAME.<tab>) osum
  13. ) summary',
  14. '<col>', column_name),
  15. '<tab>', 'TABLE_NAME'),
  16. '<start>',
  17. (CASE WHEN ordinal_position = 1 THEN ''
  18. ELSE 'UNION ALL' END)) as codes_data_summary
  19. FROM (SELECT table_name, case when regexp_like(column_name,'[a-z.]|GROUP') then '"'||column_name||'"'
  20. else column_name end as column_name , ordinal_position
  21. FROM information_schema.columns
  22. WHERE table_name ='TABLE_NAME') a;

Step 4: Copy, paste and execute the query results from Step 3.

Example 2: Automatically summarize tables in your databases

This example shows you how you can summarize tables as you did in
Example 1, with only a few lines of R codes.

  1. # Install package
  2. library(devtools)
  3. install_github("casualcomputer/sql.mechanic",quiet=TRUE)
  4. # Alternative: use the 'get_summary_codes.R' file only
  5. # source("get_summary_codes.R")
  6. # Load packages
  7. library(sql.mechanic)
  8. library(odbc)
  9. library(DBI)
  10. # Connect to database(s)
  11. ## Method 1: prompting user (you need to make some changes here)
  12. con <- dbConnect(odbc(),
  13. Driver = "SQL Server",
  14. Server = "mysqlhost",
  15. Database = "mydbname",
  16. UID = "myuser",
  17. PWD = "Database password",
  18. Port = 1433, encoding = 'windows-1252') #'windows-1252' allows French to display properly
  19. ## Alternative Method: Using a DSN
  20. #con <- dbConnect(odbc::odbc(), "DNS_NAMES", encoding = 'windows-1252')
  21. sql_query = get_summary_codes("DB_NAME.SCHEMA_NAME.TABLE_NAME", type="basic", dbtype="Netezza")
  22. res = dbSendQuery(con, sql_query) # part of Step 3 in "Example 1"
  23. sql_query_mod = dbFetch(res) # part of Step 3 in "Example 1"
  24. res = dbSendQuery(con, sql_query_mod) # part of Step 4 in "Example 1"
  25. output_table = dbFetch(res) # part of Step 4 in "Example 1"
  26. print(output) #desired summary table
  27. dbDisconnect(con) #close database connection