I am new to Spark and I am trying to read CSV data from a file with Spark. Here is what I am doing:
sc.textFile("file.csv") \
    .map(lambda line: (line.split(",")[0], line.split(",")[1])) \
    .collect()
import pandas as pd

data1 = pd.read_csv("test1.csv")
data2 = pd.read_csv("train1.csv")
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# note: the keyword argument is sep, not separator
df = spark.read.csv("/home/stp/test1.csv", header=True, sep="|")
print(df.collect())
This is in line with what JP Mercier initially suggested about using Pandas, but with a major modification: if you read the data into Pandas in chunks, it should be more malleable. It means that you can parse a much larger file than Pandas could actually handle as a single piece and pass it to Spark in smaller sizes. (This also answers the comment about why one would want to use Spark if they can load everything into Pandas anyway.)
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pandas as pd

sc = SparkContext('local', 'example')  # if using locally
sql_sc = SQLContext(sc)

Spark_Full = sc.emptyRDD()
chunk_100k = pd.read_csv("Your_Data_File.csv", chunksize=100000)
# if you have headers in your csv file:
headers = list(pd.read_csv("Your_Data_File.csv", nrows=0).columns)

for chunky in chunk_100k:
    Spark_Full += sc.parallelize(chunky.values.tolist())

YourSparkDataFrame = Spark_Full.toDF(headers)
# if you do not have headers, leave empty instead:
# YourSparkDataFrame = Spark_Full.toDF()
YourSparkDataFrame.show()
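The chunking idea above does not depend on Pandas itself; as a minimal stdlib sketch (with a small in-memory stand-in for a large file), this is roughly what `chunksize=100000` does under the hood:

```python
import csv
import io
from itertools import islice

def read_in_chunks(fileobj, chunk_size):
    """Yield lists of parsed CSV rows, at most chunk_size rows each."""
    reader = csv.reader(fileobj)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        yield chunk

# Small in-memory stand-in for a large CSV file.
data = io.StringIO("a,1\nb,2\nc,3\nd,4\ne,5\n")
chunks = list(read_in_chunks(data, 2))
# Five rows in batches of two -> three chunks of sizes 2, 2, 1.
```

Each yielded chunk is what you would hand to `sc.parallelize(...)` in the loop above.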
There is another option, which consists of reading the CSV file with Pandas and then importing the Pandas DataFrame into Spark.
For example:
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pandas as pd

sc = SparkContext('local', 'example')  # if using locally
sql_sc = SQLContext(sc)

pandas_df = pd.read_csv('file.csv')  # assuming the file contains a header
# pandas_df = pd.read_csv('file.csv', names=['column 1', 'column 2'])  # if no header
s_df = sql_sc.createDataFrame(pandas_df)
Are you sure that all the lines have at least 2 columns? Can you try something like this, just to check?:
sc.textFile("file.csv") \
    .map(lambda line: line.split(",")) \
    .filter(lambda line: len(line) > 1) \
    .map(lambda line: (line[0], line[1])) \
    .collect()
Alternatively, you could print the culprit lines (if any):
sc.textFile("file.csv") \
    .map(lambda line: line.split(",")) \
    .filter(lambda line: len(line) <= 1) \
    .collect()
You could get this error if one or more rows in your dataset have fewer or more than 2 columns.
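To see why the original `.map` can fail: `line.split(",")` returns a single-element list for a blank line or a line with no delimiter, so indexing `[1]` raises `IndexError`. A small standalone illustration (no Spark required, made-up sample lines):

```python
lines = ["name,age", "alice,30", "", "bob"]

parsed = [l.split(",") for l in lines]
# Indexing p[1] blindly would raise IndexError on "" and "bob",
# which is exactly what the len(...) filter above guards against.
culprits = [p for p in parsed if len(p) <= 1]
good = [(p[0], p[1]) for p in parsed if len(p) > 1]
```

Here `culprits` ends up holding the two malformed lines and `good` the well-formed pairs.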
I am also new to PySpark and trying to read a CSV file. The following code worked for me:
In this code, I am using a dataset from Kaggle; the link is: https://www.kaggle.com/carrie1/ecommerce-data
1. Without mentioning the schema:
from pyspark.sql import SparkSession

scSpark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example: Reading CSV file without mentioning schema") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

sdfData = scSpark.read.csv("data.csv", header=True, sep=",")
sdfData.show()
Now check the columns: sdfData.columns
The output will be:
['InvoiceNo', 'StockCode','Description','Quantity', 'InvoiceDate', 'CustomerID', 'Country']
Check the data type of each column:
sdfData.schema
StructType(List(StructField(InvoiceNo,StringType,true),StructField(StockCode,StringType,true),StructField(Description,StringType,true),StructField(Quantity,StringType,true),StructField(InvoiceDate,StringType,true),StructField(UnitPrice,StringType,true),StructField(CustomerID,StringType,true),StructField(Country,StringType,true)))
This gives a dataframe with all columns of data type StringType.
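Conceptually, applying a schema (as in the next step) just means casting each string field with the right converter. A minimal stdlib sketch of that idea, using a hypothetical subset of the columns above:

```python
# Converters mirroring a StructType: column name -> target type.
schema = {"InvoiceNo": int, "Description": str,
          "Quantity": int, "CustomerID": float}

def apply_schema(row, schema):
    """Cast a dict of string fields according to the schema's converters."""
    return {col: schema[col](val) for col, val in row.items()}

raw = {"InvoiceNo": "536365", "Description": "WHITE METAL LANTERN",
       "Quantity": "6", "CustomerID": "17850"}
typed = apply_schema(raw, schema)
# typed now holds int, str, int and float values instead of all-strings.
```

Spark does this per row at read time when you pass `schema=` to `read.csv`, which is what the next section shows.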
2. With a schema: If you already know the schema or want to change the data type of any column in the table above, then use this (say I have the following columns and want them each in a particular data type):
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import DoubleType, IntegerType, StringType

schema = StructType([
    StructField("InvoiceNo", IntegerType()),
    StructField("StockCode", StringType()),
    StructField("Description", StringType()),
    StructField("Quantity", IntegerType()),
    StructField("InvoiceDate", StringType()),
    StructField("CustomerID", DoubleType()),
    StructField("Country", StringType())
])

scSpark = SparkSession \
    .builder \
    .appName("Python Spark SQL example: Reading CSV file with schema") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

sdfData = scSpark.read.csv("data.csv", header=True, sep=",", schema=schema)
Now check the schema for the data type of each column:
sdfData.schema
StructType(List(StructField(InvoiceNo,IntegerType,true),StructField(StockCode,StringType,true),StructField(Description,StringType,true),StructField(Quantity,IntegerType,true),StructField(InvoiceDate,StringType,true),StructField(CustomerID,DoubleType,true),StructField(Country,StringType,true)))
Edit: We can also use the following line of code without mentioning the schema explicitly:
sdfData = scSpark.read.csv("data.csv", header=True, inferSchema=True)
sdfData.schema
The output is:
StructType(List(StructField(InvoiceNo,StringType,true),StructField(StockCode,StringType,true),StructField(Description,StringType,true),StructField(Quantity,IntegerType,true),StructField(InvoiceDate,StringType,true),StructField(UnitPrice,DoubleType,true),StructField(CustomerID,IntegerType,true),StructField(Country,StringType,true)))
And the output will look like this:
sdfData.show()
+---------+---------+--------------------+--------+--------------+----------+-------+
|InvoiceNo|StockCode|         Description|Quantity|   InvoiceDate|CustomerID|Country|
+---------+---------+--------------------+--------+--------------+----------+-------+
|   536365|   85123A|WHITE HANGING HEA...|       6|12/1/2010 8:26|      2.55|  17850|
|   536365|    71053| WHITE METAL LANTERN|       6|12/1/2010 8:26|      3.39|  17850|
|   536365|   84406B|CREAM CUPID HEART...|       8|12/1/2010 8:26|      2.75|  17850|
|   536365|   84029G|KNITTED UNION FLA...|       6|12/1/2010 8:26|      3.39|  17850|
|   536365|   84029E|RED WOOLLY HOTTIE...|       6|12/1/2010 8:26|      3.39|  17850|
|   536365|    22752|SET 7 BABUSHKA NE...|       2|12/1/2010 8:26|      7.65|  17850|
|   536365|    21730|GLASS STAR FROSTE...|       6|12/1/2010 8:26|      4.25|  17850|
|   536366|    22633|HAND WARMER UNION...|       6|12/1/2010 8:28|      1.85|  17850|
|   536366|    22632|HAND WARMER RED P...|       6|12/1/2010 8:28|      1.85|  17850|
|   536367|    84879|ASSORTED COLOUR B...|      32|12/1/2010 8:34|      1.69|  13047|
|   536367|    22745|POPPY'S PLAYHOUSE...|       6|12/1/2010 8:34|       2.1|  13047|
|   536367|    22748|POPPY'S PLAYHOUSE...|       6|12/1/2010 8:34|       2.1|  13047|
|   536367|    22749|FELTCRAFT PRINCES...|       8|12/1/2010 8:34|      3.75|  13047|
|   536367|    22310|IVORY KNITTED MUG...|       6|12/1/2010 8:34|      1.65|  13047|
|   536367|    84969|BOX OF 6 ASSORTED...|       6|12/1/2010 8:34|      4.25|  13047|
|   536367|    22623|BOX OF VINTAGE JI...|       3|12/1/2010 8:34|      4.95|  13047|
|   536367|    22622|BOX OF VINTAGE AL...|       2|12/1/2010 8:34|      9.95|  13047|
|   536367|    21754|HOME BUILDING BLO...|       3|12/1/2010 8:34|      5.95|  13047|
|   536367|    21755|LOVE BUILDING BLO...|       3|12/1/2010 8:34|      5.95|  13047|
|   536367|    21777|RECIPE BOX WITH M...|       4|12/1/2010 8:34|      7.95|  13047|
+---------+---------+--------------------+--------+--------------+----------+-------+
only showing top 20 rows
If your CSV data happens not to contain newlines in any of the fields, you could load your data with textFile() and parse it:
import csv
import io  # Python 3: io.StringIO replaces Python 2's StringIO module

def loadRecord(line):
    input = io.StringIO(line)
    reader = csv.DictReader(input, fieldnames=["name1", "name2"])
    return next(reader)  # Python 3: next(reader) replaces reader.next()

input = sc.textFile(inputFile).map(loadRecord)
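A side benefit of parsing with the csv module, as above, is that it handles quoted fields that a naive `split(",")` would break on. A quick standalone check of the difference (made-up sample line in the style of the dataset above):

```python
import csv
import io

line = '536365,"WHITE HANGING HEART, T-LIGHT HOLDER",6'

naive = line.split(",")                       # splits inside the quotes: 4 fields
proper = next(csv.reader(io.StringIO(line)))  # respects the quoting: 3 fields
```

With the naive split, the description column is torn in two; the csv reader keeps it as one field with the quotes stripped.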
If you want to load the CSV as a dataframe, you can do the following:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true') \
    .load('sampleFile.csv')  # this is your csv file
It worked fine for me.