I am new to Spark and I am trying to read CSV data from a file with Spark. Here is what I am doing:
sc.textFile("file.csv") \
    .map(lambda line: (line.split(",")[0], line.split(",")[1])) \
    .collect()
import pandas as pd

data1 = pd.read_csv("test1.csv")
data2 = pd.read_csv("train1.csv")
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# note: the keyword argument is sep, not separator
df = spark.read.csv("/home/stp/test1.csv", header=True, sep="|")
print(df.collect())
This is in line with what JP Mercier initially suggested about using Pandas, but with a major modification: if you read the data into Pandas in chunks, it should be more malleable. It means that you can parse a much larger file than Pandas could actually handle as a single piece and pass it to Spark in smaller sizes. (This also answers the comment about why one would want to use Spark if they can load everything into Pandas anyway.)
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pandas as pd

sc = SparkContext('local', 'example')  # if using locally
sql_sc = SQLContext(sc)

Spark_Full = sc.emptyRDD()
chunk_100k = pd.read_csv("Your_Data_File.csv", chunksize=100000)
# if you have headers in your csv file:
headers = list(pd.read_csv("Your_Data_File.csv", nrows=0).columns)

for chunky in chunk_100k:
    Spark_Full += sc.parallelize(chunky.values.tolist())

YourSparkDataFrame = Spark_Full.toDF(headers)
# if you do not have headers, leave empty instead:
# YourSparkDataFrame = Spark_Full.toDF()
YourSparkDataFrame.show()
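The chunking idea above does not depend on Pandas itself; as a minimal stdlib sketch (with a small in-memory stand-in for a large file), this is roughly what `chunksize=100000` does under the hood:

```python
import csv
import io
from itertools import islice

def read_in_chunks(fileobj, chunk_size):
    """Yield lists of parsed CSV rows, at most chunk_size rows each."""
    reader = csv.reader(fileobj)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        yield chunk

# Small in-memory stand-in for a large CSV file.
data = io.StringIO("a,1\nb,2\nc,3\nd,4\ne,5\n")
chunks = list(read_in_chunks(data, 2))
# Five rows in batches of two -> three chunks of sizes 2, 2, 1.
```

Each yielded chunk is what you would hand to `sc.parallelize(...)` in the loop above.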
There is another option, which consists of reading the CSV file with Pandas and then importing the Pandas DataFrame into Spark.
For example:
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pandas as pd

sc = SparkContext('local', 'example')  # if using locally
sql_sc = SQLContext(sc)

pandas_df = pd.read_csv('file.csv')  # assuming the file contains a header
# pandas_df = pd.read_csv('file.csv', names=['column 1', 'column 2'])  # if no header
s_df = sql_sc.createDataFrame(pandas_df)
Are you sure that all the lines have at least 2 columns? Can you try something like this, just to check?:
sc.textFile("file.csv") \
    .map(lambda line: line.split(",")) \
    .filter(lambda line: len(line) > 1) \
    .map(lambda line: (line[0], line[1])) \
    .collect()
Alternatively, you could print the culprit lines (if any):
sc.textFile("file.csv") \
    .map(lambda line: line.split(",")) \
    .filter(lambda line: len(line) <= 1) \
    .collect()
You could get this error if one or more rows in your dataset have fewer or more than 2 columns.
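To see why the original `.map` can fail: `line.split(",")` returns a single-element list for a blank line or a line with no delimiter, so indexing `[1]` raises `IndexError`. A small standalone illustration (no Spark required, made-up sample lines):

```python
lines = ["name,age", "alice,30", "", "bob"]

parsed = [l.split(",") for l in lines]
# Indexing p[1] blindly would raise IndexError on "" and "bob",
# which is exactly what the len(...) filter above guards against.
culprits = [p for p in parsed if len(p) <= 1]
good = [(p[0], p[1]) for p in parsed if len(p) > 1]
```

Here `culprits` ends up holding the two malformed lines and `good` the well-formed pairs.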
I am also new to PySpark and trying to read a CSV file. The following code worked for me:
In this code, I am using a dataset from Kaggle; the link is: https://www.kaggle.com/carrie1/ecommerce-data
1. Without mentioning the schema:
from pyspark.sql import SparkSession

scSpark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example: Reading CSV file without mentioning schema") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

sdfData = scSpark.read.csv("data.csv", header=True, sep=",")
sdfData.show()
Now check the columns: sdfData.columns
The output will be:
['InvoiceNo', 'StockCode','Description','Quantity', 'InvoiceDate', 'CustomerID', 'Country']
Check the data type of each column:
sdfData.schema
StructType(List(StructField(InvoiceNo,StringType,true),StructField(StockCode,StringType,true),StructField(Description,StringType,true),StructField(Quantity,StringType,true),StructField(InvoiceDate,StringType,true),StructField(UnitPrice,StringType,true),StructField(CustomerID,StringType,true),StructField(Country,StringType,true)))
This gives a dataframe with all columns of data type StringType.
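Conceptually, applying a schema (as in the next step) just means casting each string field with the right converter. A minimal stdlib sketch of that idea, using a hypothetical subset of the columns above:

```python
# Converters mirroring a StructType: column name -> target type.
schema = {"InvoiceNo": int, "Description": str,
          "Quantity": int, "CustomerID": float}

def apply_schema(row, schema):
    """Cast a dict of string fields according to the schema's converters."""
    return {col: schema[col](val) for col, val in row.items()}

raw = {"InvoiceNo": "536365", "Description": "WHITE METAL LANTERN",
       "Quantity": "6", "CustomerID": "17850"}
typed = apply_schema(raw, schema)
# typed now holds int, str, int and float values instead of all-strings.
```

Spark does this per row at read time when you pass `schema=` to `read.csv`, which is what the next section shows.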
2. With a schema: If you already know the schema or want to change the data type of any column in the table above, then use this (say I have the following columns and want them each in a particular data type):
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import DoubleType, IntegerType, StringType

schema = StructType([
    StructField("InvoiceNo", IntegerType()),
    StructField("StockCode", StringType()),
    StructField("Description", StringType()),
    StructField("Quantity", IntegerType()),
    StructField("InvoiceDate", StringType()),
    StructField("CustomerID", DoubleType()),
    StructField("Country", StringType())
])

scSpark = SparkSession \
    .builder \
    .appName("Python Spark SQL example: Reading CSV file with schema") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

sdfData = scSpark.read.csv("data.csv", header=True, sep=",", schema=schema)
Now check the schema for the data type of each column:
sdfData.schema
StructType(List(StructField(InvoiceNo,IntegerType,true),StructField(StockCode,StringType,true),StructField(Description,StringType,true),StructField(Quantity,IntegerType,true),StructField(InvoiceDate,StringType,true),StructField(CustomerID,DoubleType,true),StructField(Country,StringType,true)))
Edit: We can also use the following line of code without mentioning the schema explicitly:
sdfData = scSpark.read.csv("data.csv", header=True, inferSchema=True)
sdfData.schema
The output is:
StructType(List(StructField(InvoiceNo,StringType,true),StructField(StockCode,StringType,true),StructField(Description,StringType,true),StructField(Quantity,IntegerType,true),StructField(InvoiceDate,StringType,true),StructField(UnitPrice,DoubleType,true),StructField(CustomerID,IntegerType,true),StructField(Country,StringType,true)))
And the output will look like this:
sdfData.show()
+---------+---------+--------------------+--------+--------------+----------+-------+
|InvoiceNo|StockCode|         Description|Quantity|   InvoiceDate|CustomerID|Country|
+---------+---------+--------------------+--------+--------------+----------+-------+
|   536365|   85123A|WHITE HANGING HEA...|       6|12/1/2010 8:26|      2.55|  17850|
|   536365|    71053| WHITE METAL LANTERN|       6|12/1/2010 8:26|      3.39|  17850|
|   536365|   84406B|CREAM CUPID HEART...|       8|12/1/2010 8:26|      2.75|  17850|
|   536365|   84029G|KNITTED UNION FLA...|       6|12/1/2010 8:26|      3.39|  17850|
|   536365|   84029E|RED WOOLLY HOTTIE...|       6|12/1/2010 8:26|      3.39|  17850|
|   536365|    22752|SET 7 BABUSHKA NE...|       2|12/1/2010 8:26|      7.65|  17850|
|   536365|    21730|GLASS STAR FROSTE...|       6|12/1/2010 8:26|      4.25|  17850|
|   536366|    22633|HAND WARMER UNION...|       6|12/1/2010 8:28|      1.85|  17850|
|   536366|    22632|HAND WARMER RED P...|       6|12/1/2010 8:28|      1.85|  17850|
|   536367|    84879|ASSORTED COLOUR B...|      32|12/1/2010 8:34|      1.69|  13047|
|   536367|    22745|POPPY'S PLAYHOUSE...|       6|12/1/2010 8:34|       2.1|  13047|
|   536367|    22748|POPPY'S PLAYHOUSE...|       6|12/1/2010 8:34|       2.1|  13047|
|   536367|    22749|FELTCRAFT PRINCES...|       8|12/1/2010 8:34|      3.75|  13047|
|   536367|    22310|IVORY KNITTED MUG...|       6|12/1/2010 8:34|      1.65|  13047|
|   536367|    84969|BOX OF 6 ASSORTED...|       6|12/1/2010 8:34|      4.25|  13047|
|   536367|    22623|BOX OF VINTAGE JI...|       3|12/1/2010 8:34|      4.95|  13047|
|   536367|    22622|BOX OF VINTAGE AL...|       2|12/1/2010 8:34|      9.95|  13047|
|   536367|    21754|HOME BUILDING BLO...|       3|12/1/2010 8:34|      5.95|  13047|
|   536367|    21755|LOVE BUILDING BLO...|       3|12/1/2010 8:34|      5.95|  13047|
|   536367|    21777|RECIPE BOX WITH M...|       4|12/1/2010 8:34|      7.95|  13047|
+---------+---------+--------------------+--------+--------------+----------+-------+
only showing top 20 rows
If your CSV data happens not to contain newlines in any of the fields, you could load your data with textFile() and parse it:
import csv
import io  # Python 3: io.StringIO replaces Python 2's StringIO module

def loadRecord(line):
    input = io.StringIO(line)
    reader = csv.DictReader(input, fieldnames=["name1", "name2"])
    return next(reader)  # Python 3: next(reader) replaces reader.next()

input = sc.textFile(inputFile).map(loadRecord)
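A side benefit of parsing with the csv module, as above, is that it handles quoted fields that a naive `split(",")` would break on. A quick standalone check of the difference (made-up sample line in the style of the dataset above):

```python
import csv
import io

line = '536365,"WHITE HANGING HEART, T-LIGHT HOLDER",6'

naive = line.split(",")                       # splits inside the quotes: 4 fields
proper = next(csv.reader(io.StringIO(line)))  # respects the quoting: 3 fields
```

With the naive split, the description column is torn in two; the csv reader keeps it as one field with the quotes stripped.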
If you want to load the CSV as a dataframe, you can do the following:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true') \
    .load('sampleFile.csv')  # this is your csv file
It worked fine for me.