Project author: ykursadkaya

Project description:
PySpark in Docker Containers
Main language: Dockerfile
Repository: git://github.com/ykursadkaya/pyspark-Docker.git
Created: 2020-04-03T13:36:01Z
Project page: https://github.com/ykursadkaya/pyspark-Docker

License:


PySpark in Docker

Just an image for running PySpark.

Default versions
  • OpenJDK -> openjdk:8-slim-buster

  • Python -> python:3.9.5-slim-buster

  • PySpark -> 3.1.2

However, you can specify the OpenJDK, Python, and PySpark versions, as well as the image variant, when building.

  $ docker build -t pyspark --build-arg PYTHON_VERSION=3.7.10 --build-arg IMAGE=buster .
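A build that pins every version might look like the sketch below. PYTHON_VERSION and IMAGE are the build arguments shown above; OPENJDK_VERSION and PYSPARK_VERSION are assumed names for the remaining two knobs, so check the Dockerfile for the exact ARG spelling before relying on them.

```shell
# Pin all four build knobs at once.
# OPENJDK_VERSION and PYSPARK_VERSION are assumed ARG names -- verify
# them against the project's Dockerfile.
docker build -t pyspark \
  --build-arg OPENJDK_VERSION=8 \
  --build-arg PYTHON_VERSION=3.9.5 \
  --build-arg PYSPARK_VERSION=3.1.2 \
  --build-arg IMAGE=slim-buster .
```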

Running

The default entrypoint is “python”, so you will be interfacing directly with the Python shell.

  $ docker run -it pyspark
  Python 3.9.8 (main, Nov 10 2021, 03:21:27)
  [GCC 8.3.0] on linux
  Type "help", "copyright", "credits" or "license" for more information.
  >>> from pyspark.sql import SparkSession
  >>>
  >>> spark = SparkSession.builder.getOrCreate()
  Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
  Setting default log level to "WARN".
  To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
  21/11/13 23:44:54 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
  >>>
  >>> spark.version
  '3.2.0'
  >>>
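Because the entrypoint is the Python interpreter, anything placed after the image name is passed straight to it, which allows running a script non-interactively instead of the shell. A minimal sketch, assuming a local file job.py (a hypothetical example script) mounted into the container:

```shell
# Arguments after the image name go to the "python" entrypoint.
# job.py is a hypothetical local script; mount it and run it directly.
docker run --rm -v "$PWD/job.py:/job.py" pyspark /job.py
```

The --rm flag simply removes the container when the job finishes, which is convenient for one-off batch runs.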