YARN - Spark creates fewer partitions than the minPartitions argument on wholeTextFiles

<div class="post-text" itemprop="text">
  <BLOCKQUOTE>
    
      It would be clearer if we had the size of each file, but the code is not wrong. I am adding this answer based on the Spark code base.
    
  </BLOCKQUOTE>
  <UL>
    <LI>
      
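        First of all,
        <strong>maxSplitSize</strong>
        is calculated from the
        <strong>directory size</strong>
        and the
        <strong>minimum number of partitions</strong>
        passed to
        <code>wholeTextFiles</code>: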
      
       <pre>
 <code>
 // file: WholeTextFileInputFormat.scala
 def setMinPartitions(context: JobContext, minPartitions: Int) {
   val files = listStatus(context).asScala
   // Total size of all files in the input; directories contribute nothing.
   val totalLen = files.map(file => if (file.isDirectory) 0L else file.getLen).sum
   // Upper bound for a single split: total size divided by the requested partitions.
   val maxSplitSize = Math.ceil(totalLen * 1.0 /
     (if (minPartitions == 0) 1 else minPartitions)).toLong
   super.setMaxSplitSize(maxSplitSize)
 }

</code>
 </pre>
      
        <a href="https://github.com/apache/spark/blob/1055c94cdf072bfce5e36bb6552fe9b148bb9d17/core/src/main/scala/org/apache/spark/input/WholeTextFileInputFormat.scala#L51" rel="nofollow noreferrer">
          link
        </A>
      
    </LI>
    <LI>
      
        Based on
        <code>maxSplitSize</code>,
        the splits (partitions in Spark) are then extracted from the source:
      
       <pre>
 <code>
 // file: WholeTextFileRDD.scala
 inputFormat.setMinPartitions(jobContext, minPartitions)
 val rawSplits = inputFormat.getSplits(jobContext).toArray // the number of splits is decided here
 val result = new Array[Partition](rawSplits.size)
 for (i <- 0 until rawSplits.size) {
   result(i) = new NewHadoopPartition(id, i, rawSplits(i).asInstanceOf[InputSplit with Writable])
 }

</code>
 </pre>
      
        <a href="https://github.com/apache/spark/blob/a9339db99f0620d4828eb903523be55dfbf2fb64/core/src/main/scala/org/apache/spark/rdd/WholeTextFileRDD.scala#L41" rel="nofollow noreferrer">
          link
        </A>
      
    </LI>
  </UL>
  
    More information is available in
    <a href="https://github.com/apache/hadoop-common/blob/42a61a4fbc88303913c4681f0d40ffcc737e70b5/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/CombineFileInputFormat.java#L174" rel="nofollow noreferrer">
       <code>CombineFileInputFormat#getSplits</code>
    </A>,
    the class that reads the files and prepares the splits.
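To see why the result can be smaller than the requested minimum, here is a minimal sketch of the arithmetic. The <code>maxSplitSize</code> function mirrors the formula quoted above. The <code>packFiles</code> function is a hypothetical, simplified model and not the Hadoop implementation: it assumes whole files are never split, that files are added to a split until the running size reaches <code>maxSplitSize</code>, and it ignores the node and rack locality that the real <code>CombineFileInputFormat</code> takes into account. The object name and the file sizes are made up for illustration.

```scala
object SplitSizeSketch {
  // Same arithmetic as setMinPartitions in WholeTextFileInputFormat (quoted above).
  def maxSplitSize(fileLengths: Seq[Long], minPartitions: Int): Long = {
    val totalLen = fileLengths.sum
    Math.ceil(totalLen * 1.0 / (if (minPartitions == 0) 1 else minPartitions)).toLong
  }

  // ASSUMPTION (simplified model): whole files are accumulated into the current
  // split until its size reaches maxSize; locality is ignored entirely.
  def packFiles(fileLengths: Seq[Long], maxSize: Long): Seq[Seq[Long]] = {
    val splits = scala.collection.mutable.ArrayBuffer.empty[Seq[Long]]
    var current = Vector.empty[Long]
    var currentSize = 0L
    fileLengths.foreach { len =>
      current = current :+ len
      currentSize += len
      if (maxSize != 0 && currentSize >= maxSize) {
        // The running size reached the limit: close this split and start a new one.
        splits += current
        current = Vector.empty[Long]
        currentSize = 0L
      }
    }
    // Whatever is left over forms a final, smaller split.
    if (current.nonEmpty) splits += current
    splits.toSeq
  }
}
```

Under these assumptions, four files of 100 bytes each with <code>minPartitions = 3</code> give a <code>maxSplitSize</code> of 134. One file alone stays below that limit, so files are paired up and only two splits come out, fewer than the three requested.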
  
  <BLOCKQUOTE>
    <H2>
      Note:
    </H2>
    
      I refer to
      <strong>Spark partitions as MapReduce splits</strong>
      here, because Spark borrows its input and output formatters from MapReduce.
    
  </BLOCKQUOTE>
</DIV>