- The document discusses the Spark HBase Connector, which combines Spark and HBase for fast access to key-value data. It allows running Spark and SQL queries directly on top of HBase tables.
- It provides high performance through data locality, partition pruning, and column pruning to reduce network overhead. Operations include bulk load, bulk put, bulk delete, and language-integrated queries.
- The connector achieves these improvements through the Spark Catalyst engine for query planning and optimization, and by implementing HBase as a standard external data source with built-in filtering capabilities.
Apache Spark on Apache HBase: Current and Future
1. About Zhan Zhang
Zhan Zhang (Software Engineer at Hortonworks)
Currently focused on Apache Spark, Hadoop, etc.
Contributes to Apache Spark, YARN, HBase, Ambari, etc.
Experience in computer networks, distributed systems, and
machine learning platforms
2. About Jean-Marc Spaggiari
Java Message Service => JMS
A bit of everything…
12 years in professional software development
4 years as a team manager
4 years as a project manager
Joined Cloudera in May 2013
Had mostly HBase knowledge
O'Reilly author of Architecting HBase Applications
International
Worked from Paris to Los Angeles
More than 100 flights per year
HBase and Phoenix contributor
3. About Ted Malaska
PSA at Cloudera
Co-author of Hadoop Application Architectures
Contributor to 12 Apache projects
Worked with ~100 customers using big data
4. How it Started
• Demand started in the field
• Porting off MapReduce
• Huge value in Spark Streaming for storing aggregates and being used for point lookups
• Started as a GitHub project
• Andrew Purtell sparked the effort to put it into HBase
• Big call out to Sean B, Jon H, Ted Y, and Matteo B
• Components
• Normal Spark
• Spark Streaming
• Bulk Load
• SparkSQL
HBaseCon 2016
5. Under the covers
[Architecture diagram: the Spark Driver passes configs to each Worker Node; on every node, the Executor keeps the configs and a single shared HConnection in its static space, and all Tasks running on that Executor reuse them.]
6. Key Addition: HBaseContext
Create an HBaseContext
// a Hadoop/HBase Configuration object
val conf = HBaseConfiguration.create()
conf.addResource(new Path("/etc/hbase/conf/core-site.xml"))
conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"))
// sc is the Spark Context; hbase context corresponds to an HBase Connection
val hbaseContext = new HBaseContext(sc, conf)
// A sample RDD
val rdd = sc.parallelize(Array(
(Bytes.toBytes("1")), (Bytes.toBytes("2")),
(Bytes.toBytes("3")), (Bytes.toBytes("4")),
(Bytes.toBytes("5")), (Bytes.toBytes("6")),
(Bytes.toBytes("7"))))
HBaseCon 2016
7. • Foreach
• Map
• BulkLoad
• BulkLoadThinRows
• BulkGet (aka Multiget)
• BulkDelete
• Most of them in both Java and Scala
Operations on the HBaseContext
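Of the operations listed above, BulkDelete is the only one with no example in the deck. A minimal Scala sketch under the same pattern as the other examples; the table name "t1", the batch size, and the shape of the input RDD are assumptions, not from the slides:

```scala
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.Delete

// Parallelized delete: remove every row whose key appears in the RDD,
// batching 4 deletes per round trip to the region servers.
// Assumes rdd is an RDD[Array[Byte]] of row keys and hbaseContext
// was created as shown on the previous slide.
hbaseContext.bulkDelete[Array[Byte]](rdd,
  TableName.valueOf("t1"),
  rowKey => new Delete(rowKey),
  4) // batch size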
8. Foreach
Process each partition of the RDD in parallel against a shared HBase Connection
rdd.hbaseForeachPartition(hbaseContext, (it, conn) => {
  // do something with each partition, using the shared Connection
  val bufferedMutator = conn.getBufferedMutator(TableName.valueOf("t1"))
  it.foreach(r => {
    ... // HBase API put/incr/append/cas calls
  })
  bufferedMutator.flush()
  bufferedMutator.close()
})
9. Foreach
Process each partition of the RDD in parallel against a shared HBase Connection (Java)
hbaseContext.foreachPartition(keyValuesPuts,
  new VoidFunction<Tuple2<Iterator<Put>, Connection>>() {
    @Override
    public void call(Tuple2<Iterator<Put>, Connection> t) throws Exception {
      BufferedMutator mutator = t._2().getBufferedMutator(TABLE_NAME);
      while (t._1().hasNext()) {
        ... // HBase API put/incr/append/cas calls
      }
      mutator.flush();
      mutator.close();
    }
  });
10. Map
Take an HBase dataset and map it in parallel for each partition to produce a new RDD
val getRdd = rdd.hbaseMapPartitions(hbaseContext, (it, conn) => {
  val table = conn.getTable(TableName.valueOf("t1"))
  it.map(r => {
    ... // HBase API Scan Results
  })
})
11. BulkLoad
Bulk load a data set into HBase (for all cases, generally wide tables) (Scala only)
rdd.hbaseBulkLoad(hbaseContext, tableName,
t => {
val rowKey = t._1
val fam:Array[Byte] = t._2._1
val qual = t._2._2
val value = t._2._3
val keyFamilyQualifier = new KeyFamilyQualifier(rowKey, fam, qual)
Seq((keyFamilyQualifier, value)).iterator
},
stagingFolder)
val load = new LoadIncrementalHFiles(config)
load.run(Array(stagingFolder, tableNameString))
12. BulkLoadThinRows
Bulk load a data set into HBase (for skinny tables, <10k cols)
hbaseContext.bulkLoadThinRows[(String, Iterable[(Array[Byte], Array[Byte],
Array[Byte])])] (rdd, TableName.valueOf(tableName), t => {
val rowKey = Bytes.toBytes(t._1)
val familyQualifiersValues = new FamiliesQualifiersValues
t._2.foreach(f => {
val family:Array[Byte] = f._1
val qualifier = f._2
val value:Array[Byte] = f._3
familyQualifiersValues +=(family, qualifier, value)
})
(new ByteArrayWrapper(rowKey), familyQualifiersValues)
}, stagingFolder.getPath)
13. BulkPut
Parallelized HBase Multiput
hbaseContext.bulkPut[(Array[Byte], Array[(Array[Byte], Array[Byte],
  Array[Byte])])](rdd, tableName, (putRecord) => {
    val put = new Put(putRecord._1)
    putRecord._2.foreach((putValue) =>
      put.addColumn(putValue._1, putValue._2, putValue._3))
    put
  })
14. BulkPut
Parallelized HBase Multiput
hbaseContext.bulkPut(textFile, TABLE_NAME, new Function<String, Put>() {
@Override
public Put call(String v1) throws Exception {
String[] tokens = v1.split("\\|"); // split takes a regex, so "|" must be escaped
Put put = new Put(Bytes.toBytes(tokens[0]));
put.addColumn(Bytes.toBytes("segment"),
Bytes.toBytes(tokens[1]),
Bytes.toBytes(tokens[2]));
return put;
}
});
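BulkGet, listed among the operations on slide 7 but not shown in the deck, is the read-side counterpart of BulkPut. A hedged Scala sketch; the table name "t1", the batch size, and the conversion of each Result to a String are illustrative assumptions:

```scala
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{Get, Result}
import org.apache.hadoop.hbase.util.Bytes

// Parallelized multiget: turn an RDD of row keys into an RDD of values,
// batching 2 Gets per round trip to each region server.
// Assumes rdd is an RDD[Array[Byte]] of row keys.
val getRdd = hbaseContext.bulkGet[Array[Byte], String](
  TableName.valueOf("t1"),
  2, // batch size
  rdd,
  rowKey => new Get(rowKey),
  (result: Result) => Bytes.toString(result.value()))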
16. What Improvements Have We Made?
Combine Spark and HBase
• Spark Catalyst Engine for Query Plan and Optimization
• HBase for Fast Access KV Store
• Implement Standard External Data Source with Built-in Filter
• High Performance
• Data Locality: Move Computation to Data
• Partition Pruning: Task only Performed in RS Holding Requested Data
• Column Pruning / Predicate Pushdown: Reduce Network Overhead
• Full-Fledged DataFrame Support
• Spark SQL
• Language-Integrated Query
• Runs on Top of Existing HBase Tables
• Native Support for Java Primitive Types
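The DataFrame support described above maps an existing HBase table to Spark SQL through a catalog. A sketch of how that usage looks with the hbase-spark module's HBaseTableCatalog, assuming a table "t1" with a "segment" column family and the column names shown; all names here are illustrative, not from the slides:

```scala
import org.apache.hadoop.hbase.spark.datasources.HBaseTableCatalog

// Describe the HBase table as a catalog: row key plus typed columns.
val catalog = s"""{
    |"table":{"namespace":"default", "name":"t1"},
    |"rowkey":"key",
    |"columns":{
      |"col0":{"cf":"rowkey", "col":"key", "type":"string"},
      |"col1":{"cf":"segment", "col":"s1", "type":"string"}
    |}
  |}""".stripMargin

// Load the table as a DataFrame; partition/column pruning and predicate
// pushdown happen inside the data source.
val df = sqlContext.read
  .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
  .format("org.apache.hadoop.hbase.spark")
  .load()

df.registerTempTable("t1")
sqlContext.sql("SELECT col0, col1 FROM t1 WHERE col0 = '1'").show()
```

Because HBase is implemented as a standard external data source, the WHERE clause above is pushed down as an HBase filter rather than evaluated in Spark.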