Project author: osa-decentralized

Project description:
An easy-to-use database connector that allows one-command operations between PySpark and PostgreSQL or ClickHouse databases.
Language: Python
Repository: git://github.com/osa-decentralized/pyspark_db_utils.git
Created: 2018-04-02T11:02:36Z
Community: https://github.com/osa-decentralized/pyspark_db_utils

License: MIT License

pyspark_db_utils

It helps you work with databases in Spark.

Documentation

http://pyspark-db-utils.readthedocs.io/en/latest/

Usage example

You need JDBC drivers to use this library!
Download the drivers from
https://jdbc.postgresql.org/download.html
https://github.com/yandex/clickhouse-jdbc
and put them in the jars/ directory of your project.

Example settings:

  settings = {
      "PG_PROPERTIES": {
          "user": "user",
          "password": "pass",
          "driver": "org.postgresql.Driver"
      },
      "PG_DRIVER_PATH": "jars/postgresql-42.1.4.jar",
      "PG_URL": "jdbc:postgresql://db.olabs.com/dbname",
  }
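
As a minimal sketch (using only the standard PySpark API, not this library's internals), the PG_DRIVER_PATH entry from the settings above can be put on Spark's classpath when building the session; the app name is a hypothetical placeholder:

  from pyspark.sql import SparkSession

  # Minimal sketch, assuming the `settings` dict from the example above:
  # add the PostgreSQL JDBC jar to Spark's classpath via spark.jars.
  spark = (
      SparkSession.builder
      .appName("pyspark_db_utils_example")  # hypothetical app name
      .config("spark.jars", settings["PG_DRIVER_PATH"])
      .getOrCreate()
  )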

Code example

See example.py.
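
For reference, here is a hedged sketch of the read/write round trip using plain PySpark JDBC calls; the write_to_pg / read_from_pg helpers seen in the run log below presumably wrap something like this, and the table name is hypothetical:

  # Sketch with the standard Spark JDBC API; `spark` and `settings` are
  # as defined above, `df` is any DataFrame, and "test_table" is a
  # hypothetical table name.
  df.write.jdbc(
      url=settings["PG_URL"],
      table="test_table",
      mode="append",
      properties=settings["PG_PROPERTIES"],
  )
  df_back = spark.read.jdbc(
      url=settings["PG_URL"],
      table="test_table",
      properties=settings["PG_PROPERTIES"],
  )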

Example run

  vsmelov@vsmelov:~/PycharmProjects/pyspark_db_utils$ mkdir jars
  vsmelov@vsmelov:~/PycharmProjects/pyspark_db_utils$ cp /var/bigdata/spark-2.2.0-bin-hadoop2.7/jars/postgresql-42.1.4.jar ./jars/
  vsmelov@vsmelov:~/PycharmProjects/pyspark_db_utils$ python3 pyspark_db_utils/example.py
  host: ***SECRET***
  db: ***SECRET***
  user: ***SECRET***
  password: ***SECRET***
  Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
  Setting default log level to "WARN".
  To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
  18/03/05 11:43:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
  18/03/05 11:43:29 WARN Utils: Your hostname, vsmelov resolves to a loopback address: 127.0.1.1; using 192.168.43.26 instead (on interface wlp2s0)
  18/03/05 11:43:29 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
  TRY: create df
  OK: create df
  +---+-----------+
  | id|    mono_id|
  +---+-----------+
  |  1|          0|
  |  2|          1|
  |  3|          2|
  |  4|          3|
  |  5| 8589934592|
  |  6| 8589934593|
  |  7| 8589934594|
  |  8| 8589934595|
  |  9| 8589934596|
  | 10|17179869184|
  | 11|17179869185|
  | 12|17179869186|
  | 13|17179869187|
  | 14|17179869188|
  | 15|25769803776|
  | 16|25769803777|
  | 17|25769803778|
  | 18|25769803779|
  | 19|25769803780|
  +---+-----------+
  TRY: write_to_pg
  OK: write_to_pg
  TRY: read_from_pg
  OK: read_from_pg
  +---+-----------+
  | id|    mono_id|
  +---+-----------+
  | 10|17179869184|
  | 11|17179869185|
  | 12|17179869186|
  | 13|17179869187|
  | 14|17179869188|
  |  1|          0|
  |  2|          1|
  |  3|          2|
  |  4|          3|
  |  5| 8589934592|
  |  6| 8589934593|
  |  7| 8589934594|
  |  8| 8589934595|
  |  9| 8589934596|
  | 15|25769803776|
  | 16|25769803777|
  | 17|25769803778|
  | 18|25769803779|
  | 19|25769803780|
  |  1|          0|
  +---+-----------+
  only showing top 20 rows
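
A note on the mono_id values above: the jumps between groups (e.g. 8589934592 == 1 << 33) are characteristic of Spark's monotonically_increasing_id(), which encodes the partition id in the upper bits of each value. A minimal sketch of how such a DataFrame could be built (the exact construction in example.py may differ):

  from pyspark.sql.functions import monotonically_increasing_id

  # Sketch: ids 1..19 with a per-partition monotonic id. The value
  # jumps by 1 << 33 between partitions, matching the log above.
  df = spark.range(1, 20).withColumn("mono_id", monotonically_increasing_id())
  df.show()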