Project author: bnosac

Project description:
Read in SAS data in parallel into Apache Spark

Language: R
Repository: git://github.com/bnosac/spark.sas7bdat.git
Created: 2016-07-20T20:25:46Z
Project community: https://github.com/bnosac/spark.sas7bdat

License:

spark.sas7bdat

The spark.sas7bdat package allows R users working with Apache Spark to read SAS datasets in .sas7bdat format into Spark, using the spark-sas7bdat Spark package. This allows R users to

  • load a SAS dataset in parallel into a Spark table for further processing with the sparklyr package
  • process the full SAS dataset in parallel with dplyr statements, instead of having to import it fully into RAM (using the foreign or haven packages), thereby avoiding the memory problems of large imports

Example

The following example reads a file called iris.sas7bdat into a table called sas_example in Spark. Do try this with bigger data on your cluster, and see the help of the sparklyr package for how to connect to your Spark cluster.

```r
library(sparklyr)
library(spark.sas7bdat)
mysasfile <- system.file("extdata", "iris.sas7bdat", package = "spark.sas7bdat")
sc <- spark_connect(master = "local")
x <- spark_read_sas(sc, path = mysasfile, table = "sas_example")
x
```

The resulting object is a reference to a Spark table, which can be used further in dplyr statements:

```r
library(dplyr)
x %>%
  group_by(Species) %>%
  summarise(count = n(), length = mean(Sepal_Length), width = mean(Sepal_Width))
```
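The aggregation above is executed inside Spark and returns another table reference. If the summarised result is small enough, it can be pulled back into an ordinary R data frame with dplyr's collect(); a minimal sketch, reusing the x object and column names from the example above:

```r
library(dplyr)

# Execute the aggregation in Spark, then fetch the (small) result into R.
result <- x %>%
  group_by(Species) %>%
  summarise(count = n(),
            length = mean(Sepal_Length),
            width = mean(Sepal_Width)) %>%
  collect()

# 'result' is now a regular local tibble, no longer a Spark table reference.
class(result)
```

Only collect() moves data to the R session; everything before it runs on the cluster, so the full dataset never has to fit in local RAM.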

Installation

Install the package from CRAN.

```r
install.packages('spark.sas7bdat')
```

Or install the development version from GitHub.

```r
devtools::install_github("bnosac/spark.sas7bdat", build_vignettes = TRUE)
vignette("spark_sas7bdat_examples", package = "spark.sas7bdat")
```

The package has been tested with Spark version 2.0.1 and Hadoop 2.7, which can be installed from R as follows:

```r
library(sparklyr)
spark_install(version = "2.0.1", hadoop_version = "2.7")
```

Speed comparison

To compare the functionality with the read_sas function from the haven package, below is a comparison on a small SAS dataset of 5,234,557 rows x 2 columns containing only numeric data. Processing was done on 8 cores. With the haven package you need to import the data into RAM; with the spark.sas7bdat package you can immediately execute dplyr statements on top of the SAS dataset.

```r
mysasfile <- "/home/bnosac/Desktop/testdata.sas7bdat"

system.time(x <- spark_read_sas(sc, path = mysasfile, table = "testdata"))
#   user  system elapsed
#  0.008   0.000   0.051

system.time(x <- haven::read_sas(mysasfile))
#   user  system elapsed
#  1.172   0.032   1.200
```

Support in big data and Spark analysis

Need support in big data and Spark analysis?
Contact BNOSAC: http://www.bnosac.be