- The document discusses the Spark HBase Connector, which combines Spark and HBase for fast access to key-value data. It allows running Spark and SQL queries directly on top of HBase tables.
- It provides high performance through data locality, partition pruning, and column pruning to reduce network overhead. Operations include bulk load, bulk put, bulk delete, and language-integrated queries.
- The connector achieves these improvements through the Spark Catalyst engine for query planning and optimization, and by implementing HBase as a standard external data source with built-in filtering capabilities.
Apache Spark on Apache HBase: Current and Future
1. About Zhan Zhang
Zhan Zhang (Software Engineer at Hortonworks)
Currently focused on Apache Spark, Hadoop, etc.
Contributes to Apache Spark, YARN, HBase, Ambari, etc.
Experience in computer networks, distributed systems, and
machine learning platforms
2. About Jean-Marc Spaggiari
Java Message Service => JMS
A bit of everything…
12 years in professional software development
4 years as a team manager
4 years as a project manager
Joined Cloudera in May 2013
Had mostly HBase knowledge
O'Reilly author of Architecting HBase Applications
International
Worked from Paris to Los Angeles
More than 100 flights per year
HBase and Phoenix contributor
3. About Ted Malaska
PSA at Cloudera
Co-author of Hadoop Application Architectures
Contributor to 12 Apache projects
Worked with ~100 customers using big data
4. How it Started
• Demand started in the field
• Porting off MapReduce
• Huge value in Spark Streaming for storing aggregates and being used for point lookups
• Started as a GitHub project
• Andrew Purtell sparked the effort to put it into HBase
• Big call out to Sean B, Jon H, Ted Y, and Matteo B
• Components
• Normal Spark
• Spark Streaming
• Bulk Load
• SparkSQL
HBaseCon 2016
5. Under the covers
[Architecture diagram: the Spark Driver passes configs to each Worker Node; on every node, the Executor keeps the configs and a single shared HConnection in its static space, and all Tasks running on that Executor reuse them.]
6. Key Addition: HBaseContext
Create an HBaseContext
// a Hadoop/HBase Configuration object
val conf = HBaseConfiguration.create()
conf.addResource(new Path("/etc/hbase/conf/core-site.xml"))
conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"))
// sc is the Spark Context; hbase context corresponds to an HBase Connection
val hbaseContext = new HBaseContext(sc, conf)
// A sample RDD
val rdd = sc.parallelize(Array(
(Bytes.toBytes("1")), (Bytes.toBytes("2")),
(Bytes.toBytes("3")), (Bytes.toBytes("4")),
(Bytes.toBytes("5")), (Bytes.toBytes("6")),
(Bytes.toBytes("7"))))
HBaseCon 2016
7. • Foreach
• Map
• BulkLoad
• BulkLoadThinRows
• BulkGet (aka Multiget)
• BulkDelete
• Most of them in both Java and Scala
Operations on the HBaseContext
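Of the operations listed above, BulkDelete is the only one with no example in the deck. A minimal Scala sketch under the same pattern as the other examples; the table name "t1", the batch size, and the shape of the input RDD are assumptions, not from the slides:

```scala
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.Delete

// Parallelized delete: remove every row whose key appears in the RDD,
// batching 4 deletes per round trip to the region servers.
// Assumes rdd is an RDD[Array[Byte]] of row keys and hbaseContext
// was created as shown on the previous slide.
hbaseContext.bulkDelete[Array[Byte]](rdd,
  TableName.valueOf("t1"),
  rowKey => new Delete(rowKey),
  4) // batch size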
8. Foreach
Process each partition of the RDD in parallel against a shared HBase Connection
rdd.hbaseForeachPartition(hbaseContext, (it, conn) => {
  // do something with each partition, using the shared Connection
  val bufferedMutator = conn.getBufferedMutator(TableName.valueOf("t1"))
  it.foreach(r => {
    ... // HBase API put/incr/append/cas calls
  })
  bufferedMutator.flush()
  bufferedMutator.close()
})
9. Foreach
Process each partition of the RDD in parallel against a shared HBase Connection (Java)
hbaseContext.foreachPartition(keyValuesPuts,
  new VoidFunction<Tuple2<Iterator<Put>, Connection>>() {
    @Override
    public void call(Tuple2<Iterator<Put>, Connection> t) throws Exception {
      BufferedMutator mutator = t._2().getBufferedMutator(TABLE_NAME);
      while (t._1().hasNext()) {
        ... // HBase API put/incr/append/cas calls
      }
      mutator.flush();
      mutator.close();
    }
  });
10. Map
Take an HBase dataset and map it in parallel for each partition to produce a new RDD
val getRdd = rdd.hbaseMapPartitions(hbaseContext, (it, conn) => {
  val table = conn.getTable(TableName.valueOf("t1"))
  it.map(r => {
    ... // HBase API Scan Results
  })
})
11. BulkLoad
Bulk load a data set into HBase (for all cases, generally wide tables) (Scala only)
rdd.hbaseBulkLoad(hbaseContext, tableName,
t => {
val rowKey = t._1
val fam:Array[Byte] = t._2._1
val qual = t._2._2
val value = t._2._3
val keyFamilyQualifier = new KeyFamilyQualifier(rowKey, fam, qual)
Seq((keyFamilyQualifier, value)).iterator
},
stagingFolder)
val load = new LoadIncrementalHFiles(config)
load.run(Array(stagingFolder, tableNameString))
12. BulkLoadThinRows
Bulk load a data set into HBase (for skinny tables, <10k cols)
hbaseContext.bulkLoadThinRows[(String, Iterable[(Array[Byte], Array[Byte],
Array[Byte])])] (rdd, TableName.valueOf(tableName), t => {
val rowKey = Bytes.toBytes(t._1)
val familyQualifiersValues = new FamiliesQualifiersValues
t._2.foreach(f => {
val family:Array[Byte] = f._1
val qualifier = f._2
val value:Array[Byte] = f._3
familyQualifiersValues +=(family, qualifier, value)
})
(new ByteArrayWrapper(rowKey), familyQualifiersValues)
}, stagingFolder.getPath)
13. BulkPut
Parallelized HBase Multiput
hbaseContext.bulkPut[(Array[Byte], Array[(Array[Byte], Array[Byte],
  Array[Byte])])](rdd, tableName, (putRecord) => {
    val put = new Put(putRecord._1)
    putRecord._2.foreach((putValue) =>
      put.addColumn(putValue._1, putValue._2, putValue._3))
    put
  })
14. BulkPut
Parallelized HBase Multiput
hbaseContext.bulkPut(textFile, TABLE_NAME, new Function<String, Put>() {
@Override
public Put call(String v1) throws Exception {
String[] tokens = v1.split("\\|"); // split takes a regex, so "|" must be escaped
Put put = new Put(Bytes.toBytes(tokens[0]));
put.addColumn(Bytes.toBytes("segment"),
Bytes.toBytes(tokens[1]),
Bytes.toBytes(tokens[2]));
return put;
}
});
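BulkGet, listed among the operations on slide 7 but not shown in the deck, is the read-side counterpart of BulkPut. A hedged Scala sketch; the table name "t1", the batch size, and the conversion of each Result to a String are illustrative assumptions:

```scala
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{Get, Result}
import org.apache.hadoop.hbase.util.Bytes

// Parallelized multiget: turn an RDD of row keys into an RDD of values,
// batching 2 Gets per round trip to each region server.
// Assumes rdd is an RDD[Array[Byte]] of row keys.
val getRdd = hbaseContext.bulkGet[Array[Byte], String](
  TableName.valueOf("t1"),
  2, // batch size
  rdd,
  rowKey => new Get(rowKey),
  (result: Result) => Bytes.toString(result.value()))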
16. What Improvements Have We Made?
Combine Spark and HBase
• Spark Catalyst Engine for Query Plan and Optimization
• HBase for Fast Access KV Store
• Implement Standard External Data Source with Built-in Filter
• High Performance
• Data Locality: Move Computation to Data
• Partition Pruning: Task only Performed in RS Holding Requested Data
• Column Pruning / Predicate Pushdown: Reduce Network Overhead
• Full-Fledged DataFrame Support
• Spark SQL
• Language-Integrated Query
• Runs on Top of Existing HBase Tables
• Native Support for Java Primitive Types
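The DataFrame support described above maps an existing HBase table to Spark SQL through a catalog. A sketch of how that usage looks with the hbase-spark module's HBaseTableCatalog, assuming a table "t1" with a "segment" column family and the column names shown; all names here are illustrative, not from the slides:

```scala
import org.apache.hadoop.hbase.spark.datasources.HBaseTableCatalog

// Describe the HBase table as a catalog: row key plus typed columns.
val catalog = s"""{
    |"table":{"namespace":"default", "name":"t1"},
    |"rowkey":"key",
    |"columns":{
      |"col0":{"cf":"rowkey", "col":"key", "type":"string"},
      |"col1":{"cf":"segment", "col":"s1", "type":"string"}
    |}
  |}""".stripMargin

// Load the table as a DataFrame; partition/column pruning and predicate
// pushdown happen inside the data source.
val df = sqlContext.read
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.hadoop.hbase.spark")
  .load()

df.registerTempTable("t1")
sqlContext.sql("SELECT col0, col1 FROM t1 WHERE col0 = '1'").show()
```

Because HBase is implemented as a standard external data source, the WHERE clause above is pushed down as an HBase filter rather than evaluated in Spark.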