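I need to bulk-edit HBase data, changing the content of a specific cell in each row. Going through the HBase PUT/GET API is not an option, because that would be far too slow. I would like to set up a ...
Answering my own question, in case someone else needs this.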
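It is possible to load HFiles from an HBase snapshot. Follow these steps (in the HBase shell):
1. disable 'namespace:table'
2. snapshot 'namespace:table', 'your_snapshot'
This creates a snapshot that you can access at /[HBase_path]/.snapshot/[your_snapshot]
Load the snapshot as an RDD[(ImmutableBytesWritable, Result)]: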
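    // Imports assume HBase 1.x package locations.
    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.hbase.{HBaseConfiguration, HConstants}
    import org.apache.hadoop.hbase.client.{Result, Scan}
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapreduce.{KeyValueSerialization, MutationSerialization, ResultSerialization, TableInputFormat, TableSnapshotInputFormat}
    import org.apache.hadoop.hbase.protobuf.ProtobufUtil
    import org.apache.hadoop.hbase.util.Base64
    import org.apache.hadoop.mapreduce.Job
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // storageDirectory, restoreDirectory, snapshotDirectory, snapshotName and
    // settingsAccessor are not defined in this snippet; they come from the
    // surrounding application.
    def loadFromSnapshot(sc: SparkContext): RDD[(ImmutableBytesWritable, Result)] = {
      val restorePath = new Path(s"hdfs://$storageDirectory/$restoreDirectory/$snapshotName")
      val restorePathString = restorePath.toString

      // create hbase conf starting from spark's hadoop conf
      val hConf = HBaseConfiguration.create()
      val hadoopConf = sc.hadoopConfiguration
      HBaseConfiguration.merge(hConf, hadoopConf)

      // point HBase root dir to snapshot dir
      hConf.set("hbase.rootdir", s"hdfs://$storageDirectory/$snapshotDirectory/$snapshotName/")

      // point Hadoop to the bucket as default fs
      // (fs.default.name is the legacy name of fs.defaultFS)
      hConf.set("fs.default.name", s"hdfs://$storageDirectory/")

      // configure serializations
      hConf.setStrings("io.serializations",
        hadoopConf.get("io.serializations"),
        classOf[MutationSerialization].getName,
        classOf[ResultSerialization].getName,
        classOf[KeyValueSerialization].getName)

      // reset the on-heap block cache to its default size and disable the bucket cache
      hConf.setFloat(HConstants.HFILE_BLOCK_CACHE_SIZE_KEY, HConstants.HFILE_BLOCK_CACHE_SIZE_DEFAULT)
      hConf.setFloat(HConstants.BUCKET_CACHE_SIZE_KEY, 0f)
      hConf.unset(HConstants.BUCKET_CACHE_IOENGINE_KEY)

      // configure TableSnapshotInputFormat
      hConf.set("hbase.TableSnapshotInputFormat.snapshot.name", settingsAccessor.settings.snapshotName)
      hConf.set("hbase.TableSnapshotInputFormat.restore.dir", restorePathString)

      // "Fake" scan that Spark applies directly to the HFiles, bypassing RPC
      val scan = new Scan()
      val scanString = {
        val proto = ProtobufUtil.toScan(scan)
        Base64.encodeBytes(proto.toByteArray)
      }
      hConf.set(TableInputFormat.SCAN, scanString)

      val job = Job.getInstance(hConf)
      TableSnapshotInputFormat.setInput(job, settingsAccessor.settings.snapshotName, restorePath)

      // create RDD
      sc.newAPIHadoopRDD(job.getConfiguration,
        classOf[TableSnapshotInputFormat],
        classOf[ImmutableBytesWritable],
        classOf[Result])
    }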
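This loads the HFiles from the snapshot directory and applies a "fake" full scan to them, which avoids slow remote procedure calls while still producing the same output as a regular scan.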
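As an illustration of what can be done with the resulting RDD, here is a minimal sketch that pulls one column out of every row. It is not part of the original solution: the names SnapshotColumns, cellAsString and readColumn are hypothetical, and it assumes that row keys and cell values are UTF-8 strings.

```scala
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD

// Hypothetical helpers, not part of HBase or Spark.
object SnapshotColumns {

  // Read one cell of a row as a String; None if the row has no such cell.
  // Assumption: the cell value is a UTF-8 encoded string.
  def cellAsString(result: Result, family: String, qualifier: String): Option[String] =
    Option(result.getValue(Bytes.toBytes(family), Bytes.toBytes(qualifier)))
      .map(bytes => Bytes.toString(bytes))

  // Turn the snapshot RDD into (rowKey, cellValue) pairs of plain Scala types.
  def readColumn(snapshotRdd: RDD[(ImmutableBytesWritable, Result)],
                 family: String,
                 qualifier: String): RDD[(String, Option[String])] =
    snapshotRdd.map { case (key, result) =>
      // copyBytes() copies the key out of the Writable, which Hadoop may reuse
      (Bytes.toString(key.copyBytes()), cellAsString(result, family, qualifier))
    }
}
```

Mapping to plain strings right away is deliberate: Hadoop record readers reuse their Writable objects, and ImmutableBytesWritable and Result are not Java-serializable, so it is safer to copy the data out before caching or shuffling the RDD.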
When you are done, you can re-enable the table (enable 'namespace:table' in the HBase shell).