Spark - Load a CSV file as a DataFrame?

0# 至此 | 2019-08-31 10-32

1# trpnest | 2019-08-31 10-32

2# 那年 | 2019-08-31 10-32

<div class="post-text" itemprop="text">
  <P>
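    Penny's Spark 2 example is the way to do it in Spark 2. There is one more trick: have the schema generated for you by an initial scan of the data, by setting the option
     <code>
      inferSchema
    </code>
     to
     <code>
      true
    </code>
     .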
  </p>
  <P>
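    Here, then, assuming that
     <code>
      spark
    </code>
     is a Spark session you have already set up, is the operation that loads the CSV index file of all the Landsat images that Amazon hosts on S3.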
  </p>
   <pre>
    <code>
        /*
   * Licensed to the Apache Software Foundation (ASF) under one or more
   * contributor license agreements.  See the NOTICE file distributed with
   * this work for additional information regarding copyright ownership.
   * The ASF licenses this file to You under the Apache License, Version 2.0
   * (the "License"); you may not use this file except in compliance with
   * the License.  You may obtain a copy of the License at
   *
   *    http://www.apache.org/licenses/LICENSE-2.0
   *
   * Unless required by applicable law or agreed to in writing, software
   * distributed under the License is distributed on an "AS IS" BASIS,
   * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   * See the License for the specific language governing permissions and
   * limitations under the License.
   */

val csvdata = spark.read.options(Map(
    "header" -> "true",
    "ignoreLeadingWhiteSpace" -> "true",
    "ignoreTrailingWhiteSpace" -> "true",
    "timestampFormat" -> "yyyy-MM-dd HH:mm:ss.SSSZZZ",
    "inferSchema" -> "true",
    "mode" -> "FAILFAST"))
  .csv("s3a://landsat-pds/scene_list.gz")

</code>
  </pre>
  <P>
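    The bad news: this triggers a scan through the file. For something as large as this 20+ MB compressed CSV file, that can take 30 seconds over a long-haul connection. Bear that in mind: once you have the inferred schema, you are better off coding it up by hand.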
  </p>
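  <p>
    The advice above can be sketched as follows. This is a minimal PySpark sketch, not the answer author's code: the column names and types in COLUMNS are hypothetical placeholders, and to_ddl and read_with_schema are helper names invented for this illustration. It assumes a Spark release whose DataFrameReader.schema accepts a DDL-formatted string; on releases that do not, a StructType would be needed instead.
  </p>

```python
# Hypothetical column names and types, for illustration only. Replace them with
# what df.printSchema() reports after a single run with inferSchema enabled.
COLUMNS = [
    ("entityId", "STRING"),
    ("acquisitionDate", "TIMESTAMP"),
    ("cloudCover", "DOUBLE"),
]


def to_ddl(columns):
    """Join (name, type) pairs into a DDL-formatted schema string."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns)


def read_with_schema(spark, path, columns=COLUMNS):
    """Read a CSV with a hand-written schema, so no inference scan is triggered.

    `spark` is assumed to be an existing SparkSession.
    """
    return (spark.read
            .schema(to_ddl(columns))
            .option("header", "true")
            .option("mode", "FAILFAST")
            .csv(path))


print(to_ddl(COLUMNS))
```

  <p>
    Because the schema is supplied up front, the inferSchema option is left out and Spark only reads the file when the DataFrame is actually used.
  </p>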
  <P>
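    (The code snippet is licensed under the Apache Software License 2.0 to avoid all ambiguity; it is something I wrote as a demo/integration test of the S3 integration.)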
  </p>
</DIV>

3# 岁爵 | 2019-08-31 10-32

<div class="post-text" itemprop="text">
  <P>
    This applies if you are building a jar with Scala 2.11 and Apache Spark 2.0 or higher.
  </p>
  <P>
    There is no need to create a
     <code>
      sqlContext
    </code>
     or
     <code>
      sparkContext
    </code>
     object. A single
     <code>
      SparkSession
    </code>
     object covers all of these needs.
  </p>
  <P>
    The following code works fine for me:
  </p>
   <pre>
    <code>
import org.apache.log4j.LogManager
import org.apache.spark.sql.SparkSession

object driver {

  def main(args: Array[String]): Unit = {

    val log = LogManager.getRootLogger
    log.info("**********JAR EXECUTION STARTED**********")

    // A single SparkSession replaces the older SQLContext/SparkContext pair
    val spark = SparkSession.builder().master("local").appName("ValidationFrameWork").getOrCreate()

    // Read a pipe-delimited file with a header row, inferring column types
    val df = spark.read.format("csv")
      .option("header", "true")
      .option("delimiter", "|")
      .option("inferSchema", "true")
      .load("d:/small_projects/spark/test.pos")
    df.show()
  }
}

</code>
  </pre>
  <P>
    If you are running on a cluster, just change
     <code>
      .master("local")
    </code>
     to
     <code>
      .master("yarn")
    </code>
     when building the
     <code>
      SparkSession
    </code>
     object.
  </p>
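  <p>
    One way to avoid editing the source when moving between a local run and a cluster is to read the master URL from the environment. The sketch below is in PySpark rather than Scala and rests on assumptions: SPARK_MASTER is a hypothetical variable name chosen for this example (Spark itself does not read it), and resolve_master and build_session are illustrative helpers.
  </p>

```python
import os


def resolve_master(env=None, default="local"):
    """Return the master URL, preferring the (hypothetical) SPARK_MASTER variable."""
    env = os.environ if env is None else env
    return env.get("SPARK_MASTER", default)


def build_session(app_name="ValidationFrameWork"):
    """Build a SparkSession whose master comes from resolve_master()."""
    # Imported lazily so resolve_master() can be used without Spark installed.
    from pyspark.sql import SparkSession
    return (SparkSession.builder
            .master(resolve_master())
            .appName(app_name)
            .getOrCreate())


print(resolve_master({}))                        # local
print(resolve_master({"SPARK_MASTER": "yarn"}))  # yarn
```

  <p>
    A common alternative is to leave the master out of the code entirely and pass it with the --master option of spark-submit.
  </p>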
  <P>
    The Spark documentation covers this:

<a href="https://spark.apache.org/docs/2.2.0/sql-programming-guide.html" rel="nofollow noreferrer">
      https://spark.apache.org/docs/2.2.0/sql-programming-guide.html
    </A>
  </p>
</DIV>

4# 那月静好 | 2019-08-31 10-32

<div class="post-text" itemprop="text">
  <P>
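    Parsing a CSV file comes with many challenges, and they add up when the file is large or when column values contain non-English, escape, separator, or other special characters, any of which can cause parsing errors.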
  </p>
  <P>
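    The magic, then, is in the options that are used. The ones that worked for me, and that I hope cover most edge cases, are in the code below: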
  </p>
   <pre>
    <code>
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.master("local").appName("Classify Urls").getOrCreate()

# html_csv_file_path is assumed to hold the path to the CSV file.
# Note the options that are used. You may have to tweak these in case of error;
# maxColumns=2 in particular caps each record at two columns, which fits this file.
html_df = spark.read.csv(html_csv_file_path,
                         header=True,
                         multiLine=True,
                         ignoreLeadingWhiteSpace=True,
                         ignoreTrailingWhiteSpace=True,
                         encoding="UTF-8",
                         sep=',',
                         quote='"',
                         escape='"',
                         maxColumns=2,
                         inferSchema=True)

</code>
  </pre>
  <P>
    Hope that helps. For more, see:
    <a href="https://blog.codonomics.com/2018/08/using-pyspark-2-to-read-csv-having-html.html#more" rel="nofollow noreferrer">
      Using PySpark 2 to read CSV having HTML source code
    </A>
  </p>
  <P>
    Note: the code above uses the Spark 2 API, where the CSV reading API is bundled with the built-in packages of the Spark distribution.
  </p>
  <P>
    Note: PySpark is a Python wrapper for Spark and shares the same API as Scala/Java.
  </p>
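  <p>
    Setting both quote and escape to the double quote tells the parser that a doubled quote inside a quoted field stands for a literal quote, and multiLine lets a quoted field span several lines. Python's standard csv module follows the same convention by default, so it can show what a correctly parsed row should look like without starting Spark. This is only an analogy using made-up sample data, not Spark's own parser.
  </p>

```python
import csv
import io

# Made-up sample: the second column holds HTML with doubled quotes and a line break.
raw = 'url,html\r\nhttp://example.com,"<a href=""x"">link</a>\nsecond line"\r\n'

# newline="" keeps the embedded line break intact for the csv module.
rows = list(csv.reader(io.StringIO(raw, newline="")))

for row in rows:
    print(row)
```

  <p>
    The data row comes back as two fields, with the doubled quotes collapsed to single quotes and the line break kept inside the second field, which is the shape the Spark options above aim for.
  </p>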
</DIV>

5# 夏花 | 2019-08-31 10-32

6# 生如夏花 | 2019-08-31 10-32

<div class="post-text" itemprop="text">
  <P>
    With Java 1.8, this code snippet works perfectly for reading CSV files.
  </p>
  <P>
    pom.xml
  </p>
   <pre>
    <code>
      <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.0.0</version>
</dependency>

<!-- All Spark artifacts must use the same Scala binary version (2.11 here) -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.0.0</version>
</dependency>

<dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>2.11.8</version>
</dependency>
<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-csv_2.11</artifactId>
    <version>1.4.0</version>
</dependency>

</code>
  </pre>
  <P>
    Java
  </p>
   <pre>
    <code>
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Create the Spark session through the public builder API
SparkSession sparkSession = SparkSession.builder()
        .appName("JavaWordCount")
        .master("local")
        .getOrCreate();

// Read the CSV with a header row and inferred column types
Dataset<Row> df = sparkSession.read()
        .format("com.databricks.spark.csv")
        .option("header", true)
        .option("inferSchema", true)
        .load("hdfs://localhost:9000/usr/local/hadoop_data/loan_100.csv");

System.out.println("========== Print Schema ============");
df.printSchema();
System.out.println("========== Print Data ==============");
df.show();
System.out.println("========== Print title ==============");
df.select("title").show();

</code>
  </pre>
</DIV>