SlideShare a Scribd company logo
1 of 23
About Zhan Zhang
 Zhan Zhang (Software Engineer at Hortonworks)
 Currently Focus on Apache Spark and Hadoop, etc
 Contribute to Apache Spark, Yarn, HBase, Ambari, etc
 Experiences on Computer Networks, Distributed System and
Machine Learning Platform
About MeJean-Marc Spaggiari
Java Message Service => JMS
A bit of everything…
12 years in professional software development
4 years of team manager
4 years of project manager
Joined Cloudera in May 2013
Had mostly HBase knowledge
O'Reilly author of Architecting HBase Applications
International
Worked from Paris to Los Angeles
More than 100 flights per year
HBase and Phoenix contributor
About Ted Malaska
 PSA at Cloudera
 Co-Author to Hadoop Application Architecture
 Contribute to 12 Apache projects
 Worked with ~100 customers using big data
How it Started
• Demand started in the Field
• Porting off Map Reduce
• Huge Value in Spark Streaming for storing Aggregates and being used for point
look ups
• Started as a Github
• Andrew Purtell sparked the effort to put into Hbase
• Big call out to Sean B, Jon H, Ted Y, and Matteo B
• Components
• Normal Spark
• Spark Streaming
• Bulk Load
• SparkSQL
HBaseCon 2016
Under the covers
HBaseCon 2016
Driver
Walker Node
Configs
Executor
Static Space
Configs
HConnection
Tasks Tasks
Walker Node
Executor
Static Space
Configs
HConnection
Tasks Tasks
Key Addition: HBaseContext
Create an HBaseContext
// an Hadoop/HBase Configuration object
val conf = HBaseConfiguration.create()
conf.addResource(new Path("/etc/hbase/conf/core-site.xml"))
conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"))
// sc is the Spark Context; hbase context corresponds to an HBase Connection
val hbaseContext = new HBaseContext(sc, conf)
// A sample RDD
val rdd = sc.parallelize(Array(
(Bytes.toBytes("1")), (Bytes.toBytes("2")),
(Bytes.toBytes("3")), (Bytes.toBytes("4")),
(Bytes.toBytes("5")), (Bytes.toBytes("6")),
(Bytes.toBytes("7"))))
HBaseCon 2016
• Foreach
• Map
• BulkLoad
• BulkLoadThinRows
• BulkGet (aka Multiget)
• BulkDelete
• Most of them in both Java and Scala
Operations on the HBaseContext
Foreach
Read HBase data in parallel for each partition and compute
rdd.hbaseForeachPartition(hbaseContext, (it, conn) => {
// do something
val bufferedMutator = conn.getBufferedMutator(TableName.valueOf("t1"))
it.foreach(r => {
... // HBase API put/incr/append/cas calls
}
bufferedMutator.flush()
bufferedMutator.close()
})
HBaseCon 2016
Foreach
Read HBase data in parallel for each partition and compute
hbaseContext.foreachPartition(keyValuesPuts,
new VoidFunction<Tuple2<Iterator<Put>, Connection>>() {
@Override
public void call(Tuple2<Iterator<Put>, Connection> t) throws Exception {
BufferedMutator mutator = t._2().getBufferedMutator(TABLE_NAME);
while (t._1().hasNext()) {
... // HBase API put/incr/append/cas calls
}
mutator.flush();
mutator.close();
}
});
});
HBaseCon 2016
Map
Take an HBase dataset and map it in parallel for each partition to produce a new RDD
val getRdd = rdd.hbaseMapPartitions(hbaseContext, (it, conn) => {
val table = conn.getTable(TableName.valueOf("t1"))
var res = mutable.MutableList[String]()
it.map( r => {
... // HBase API Scan Results
}
})
HBaseCon 2016
BulkLoadBulk load a data set into Hbase (for all cases, generally wide tables) (Scala only)
rdd.hbaseBulkLoad(hbaseContext, tableName,
t => {
val rowKey = t._1
val fam:Array[Byte] = t._2._1
val qual = t._2._2
val value = t._2._3
val keyFamilyQualifier= new KeyFamilyQualifier(rowKey, fam, qual)
Seq((keyFamilyQualifier, value)).iterator
},
stagingFolder)
val load = new LoadIncrementalHFiles(config)
load.run(Array(stagingFolder, tableNameString))
HBaseCon 2016
BulkLoadThinRows
Bulk load a data set into HBase (for skinny tables, <10k cols)
hbaseContext.bulkLoadThinRows[(String, Iterable[(Array[Byte], Array[Byte],
Array[Byte])])] (rdd, TableName.valueOf(tableName), t => {
val rowKey = Bytes.toBytes(t._1)
val familyQualifiersValues = new FamiliesQualifiersValues
t._2.foreach(f => {
val family:Array[Byte] = f._1
val qualifier = f._2
val value:Array[Byte] = f._3
familyQualifiersValues +=(family, qualifier, value)
})
(new ByteArrayWrapper(rowKey), familyQualifiersValues)
}, stagingFolder.getPath)
HBaseCon 2016
BulkPut
Parallelized HBase Multiput
hbaseContext.bulkPut[(Array[Byte], Array[(Array[Byte], Array[Byte],
Array[Byte])])](rdd, tableName, (putRecord) => {
val put = new Put(putRecord._1)
putRecord._2.foreach((putValue) =>
put.add(putValue._1, putValue._2, putValue._3))
put
}
HBaseCon 2016
BulkPut
Parallelized HBase Multiput
hbaseContext.bulkPut(textFile, TABLE_NAME, new Function<String, Put>() {
@Override
public Put call(String v1) throws Exception {
String[] tokens = v1.split("|");
Put put = new Put(Bytes.toBytes(tokens[0]));
put.addColumn(Bytes.toBytes("segment"),
Bytes.toBytes(tokens[1]),
Bytes.toBytes(tokens[2]));
return put;
}
});
HBaseCon 2016
BulkDelete
Parallelized HBase Multi-deletes
hbaseContext.bulkDelete[Array[Byte]](rdd, tableName,
putRecord => new Delete(putRecord),
4) // batch size
rdd.hbaseBulkDelete(hbaseContext, tableName,
putRecord => new Delete(putRecord),
4) // batch size
HBaseCon 2016
What Improvement Have We Made?
 Combine Spark and HBase
• Spark Catalyst Engine for Query Plan and Optimization
• HBase for Fast Access KV Store
• Implement Standard External Data Source with Built-in Filter
• High Performance
• Data Locality: Move Computation to Data
• Partition Pruning: Task only Performed in RS Holding Requested Data
• Column Pruning / Predicate Pushdown: Reduce Network Overhead
• Full Fledged DataFrame Support
• Spark-SQL
• Integrated Language Query
• Run on Top of Existing HBase Table
• Native Support Java Primitive Types
More …
• Composite Key
• Avro Format
Usage - Define the Catalog
Usage– Write to HBase
Usage– Construct DataFrame
Usage - Language Integrate Query/SQL
Spark HBase Connector Architecture
Apache Spark on Apache HBase: Current and Future

More Related Content

What's hot

HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWSHBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWSHBaseCon
 
HBase Read High Availability Using Timeline-Consistent Region Replicas
HBase Read High Availability Using Timeline-Consistent Region ReplicasHBase Read High Availability Using Timeline-Consistent Region Replicas
HBase Read High Availability Using Timeline-Consistent Region ReplicasHBaseCon
 
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...Cloudera, Inc.
 
Keynote: The Future of Apache HBase
Keynote: The Future of Apache HBaseKeynote: The Future of Apache HBase
Keynote: The Future of Apache HBaseHBaseCon
 
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBaseHBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBaseCloudera, Inc.
 
HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory...
HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory...HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory...
HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory...Cloudera, Inc.
 
Batch is Back: Critical for Agile Application Adoption
Batch is Back: Critical for Agile Application AdoptionBatch is Back: Critical for Agile Application Adoption
Batch is Back: Critical for Agile Application AdoptionDataWorks Summit/Hadoop Summit
 
HBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBaseHBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBaseCloudera, Inc.
 
Apache HBase - Introduction & Use Cases
Apache HBase - Introduction & Use CasesApache HBase - Introduction & Use Cases
Apache HBase - Introduction & Use CasesData Con LA
 
HBaseCon 2015- HBase @ Flipboard
HBaseCon 2015- HBase @ FlipboardHBaseCon 2015- HBase @ Flipboard
HBaseCon 2015- HBase @ FlipboardMatthew Blair
 
HBase for Architects
HBase for ArchitectsHBase for Architects
HBase for ArchitectsNick Dimiduk
 
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...HBaseCon
 
Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013)
Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013)Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013)
Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013)Suman Srinivasan
 
HBaseCon 2013: Honeycomb - MySQL Backed by Apache HBase
HBaseCon 2013: Honeycomb - MySQL Backed by Apache HBase HBaseCon 2013: Honeycomb - MySQL Backed by Apache HBase
HBaseCon 2013: Honeycomb - MySQL Backed by Apache HBase Cloudera, Inc.
 
Data Evolution in HBase
Data Evolution in HBaseData Evolution in HBase
Data Evolution in HBaseHBaseCon
 
Integration of HIve and HBase
Integration of HIve and HBaseIntegration of HIve and HBase
Integration of HIve and HBaseHortonworks
 
Introduction To HBase
Introduction To HBaseIntroduction To HBase
Introduction To HBaseAnil Gupta
 
HBase at Bloomberg: High Availability Needs for the Financial Industry
HBase at Bloomberg: High Availability Needs for the Financial IndustryHBase at Bloomberg: High Availability Needs for the Financial Industry
HBase at Bloomberg: High Availability Needs for the Financial IndustryHBaseCon
 
HBase Backups
HBase BackupsHBase Backups
HBase BackupsHBaseCon
 
HBaseCon 2013: Compaction Improvements in Apache HBase
HBaseCon 2013: Compaction Improvements in Apache HBaseHBaseCon 2013: Compaction Improvements in Apache HBase
HBaseCon 2013: Compaction Improvements in Apache HBaseCloudera, Inc.
 

What's hot (20)

HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWSHBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
HBaseCon 2015: Graph Processing of Stock Market Order Flow in HBase on AWS
 
HBase Read High Availability Using Timeline-Consistent Region Replicas
HBase Read High Availability Using Timeline-Consistent Region ReplicasHBase Read High Availability Using Timeline-Consistent Region Replicas
HBase Read High Availability Using Timeline-Consistent Region Replicas
 
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
HBaseCon 2012 | Content Addressable Storages for Fun and Profit - Berk Demir,...
 
Keynote: The Future of Apache HBase
Keynote: The Future of Apache HBaseKeynote: The Future of Apache HBase
Keynote: The Future of Apache HBase
 
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBaseHBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
HBaseCon 2013: Project Valta - A Resource Management Layer over Apache HBase
 
HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory...
HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory...HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory...
HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory...
 
Batch is Back: Critical for Agile Application Adoption
Batch is Back: Critical for Agile Application AdoptionBatch is Back: Critical for Agile Application Adoption
Batch is Back: Critical for Agile Application Adoption
 
HBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBaseHBaseCon 2013: Integration of Apache Hive and HBase
HBaseCon 2013: Integration of Apache Hive and HBase
 
Apache HBase - Introduction & Use Cases
Apache HBase - Introduction & Use CasesApache HBase - Introduction & Use Cases
Apache HBase - Introduction & Use Cases
 
HBaseCon 2015- HBase @ Flipboard
HBaseCon 2015- HBase @ FlipboardHBaseCon 2015- HBase @ Flipboard
HBaseCon 2015- HBase @ Flipboard
 
HBase for Architects
HBase for ArchitectsHBase for Architects
HBase for Architects
 
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
 
Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013)
Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013)Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013)
Real-Time Video Analytics Using Hadoop and HBase (HBaseCon 2013)
 
HBaseCon 2013: Honeycomb - MySQL Backed by Apache HBase
HBaseCon 2013: Honeycomb - MySQL Backed by Apache HBase HBaseCon 2013: Honeycomb - MySQL Backed by Apache HBase
HBaseCon 2013: Honeycomb - MySQL Backed by Apache HBase
 
Data Evolution in HBase
Data Evolution in HBaseData Evolution in HBase
Data Evolution in HBase
 
Integration of HIve and HBase
Integration of HIve and HBaseIntegration of HIve and HBase
Integration of HIve and HBase
 
Introduction To HBase
Introduction To HBaseIntroduction To HBase
Introduction To HBase
 
HBase at Bloomberg: High Availability Needs for the Financial Industry
HBase at Bloomberg: High Availability Needs for the Financial IndustryHBase at Bloomberg: High Availability Needs for the Financial Industry
HBase at Bloomberg: High Availability Needs for the Financial Industry
 
HBase Backups
HBase BackupsHBase Backups
HBase Backups
 
HBaseCon 2013: Compaction Improvements in Apache HBase
HBaseCon 2013: Compaction Improvements in Apache HBaseHBaseCon 2013: Compaction Improvements in Apache HBase
HBaseCon 2013: Compaction Improvements in Apache HBase
 

Similar to Apache Spark on Apache HBase: Current and Future

Cascading Map-Side Joins over HBase for Scalable Join Processing
Cascading Map-Side Joins over HBase for Scalable Join ProcessingCascading Map-Side Joins over HBase for Scalable Join Processing
Cascading Map-Side Joins over HBase for Scalable Join ProcessingAlexander Schätzle
 
Map-Reduce and Apache Hadoop
Map-Reduce and Apache HadoopMap-Reduce and Apache Hadoop
Map-Reduce and Apache HadoopSvetlin Nakov
 
Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...
Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...
Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...DataWorks Summit/Hadoop Summit
 
Big Data Everywhere Chicago: Unleash the Power of HBase Shell (Conversant)
Big Data Everywhere Chicago: Unleash the Power of HBase Shell (Conversant) Big Data Everywhere Chicago: Unleash the Power of HBase Shell (Conversant)
Big Data Everywhere Chicago: Unleash the Power of HBase Shell (Conversant) BigDataEverywhere
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with PythonDonald Miner
 
Spark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureSpark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureRussell Spitzer
 
DataStax: Spark Cassandra Connector - Past, Present and Future
DataStax: Spark Cassandra Connector - Past, Present and FutureDataStax: Spark Cassandra Connector - Past, Present and Future
DataStax: Spark Cassandra Connector - Past, Present and FutureDataStax Academy
 
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaAdvance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaDesing Pathshala
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Uwe Printz
 
Introduction to the hadoop ecosystem by Uwe Seiler
Introduction to the hadoop ecosystem by Uwe SeilerIntroduction to the hadoop ecosystem by Uwe Seiler
Introduction to the hadoop ecosystem by Uwe SeilerCodemotion
 
Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Uwe Printz
 
H base introduction & development
H base introduction & developmentH base introduction & development
H base introduction & developmentShashwat Shriparv
 
Cassandra/Hadoop Integration
Cassandra/Hadoop IntegrationCassandra/Hadoop Integration
Cassandra/Hadoop IntegrationJeremy Hanna
 
Chicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An IntroductionChicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An IntroductionCloudera, Inc.
 
Introduction to HBase - Phoenix HUG 5/14
Introduction to HBase - Phoenix HUG 5/14Introduction to HBase - Phoenix HUG 5/14
Introduction to HBase - Phoenix HUG 5/14Jeremy Walsh
 
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightEnterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightPaco Nathan
 

Similar to Apache Spark on Apache HBase: Current and Future (20)

Cascading Map-Side Joins over HBase for Scalable Join Processing
Cascading Map-Side Joins over HBase for Scalable Join ProcessingCascading Map-Side Joins over HBase for Scalable Join Processing
Cascading Map-Side Joins over HBase for Scalable Join Processing
 
Map-Reduce and Apache Hadoop
Map-Reduce and Apache HadoopMap-Reduce and Apache Hadoop
Map-Reduce and Apache Hadoop
 
מיכאל
מיכאלמיכאל
מיכאל
 
Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...
Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...
Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and S...
 
Big Data Everywhere Chicago: Unleash the Power of HBase Shell (Conversant)
Big Data Everywhere Chicago: Unleash the Power of HBase Shell (Conversant) Big Data Everywhere Chicago: Unleash the Power of HBase Shell (Conversant)
Big Data Everywhere Chicago: Unleash the Power of HBase Shell (Conversant)
 
Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
 
Spark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and FutureSpark Cassandra Connector: Past, Present, and Future
Spark Cassandra Connector: Past, Present, and Future
 
DataStax: Spark Cassandra Connector - Past, Present and Future
DataStax: Spark Cassandra Connector - Past, Present and FutureDataStax: Spark Cassandra Connector - Past, Present and Future
DataStax: Spark Cassandra Connector - Past, Present and Future
 
NoSQL & HBase overview
NoSQL & HBase overviewNoSQL & HBase overview
NoSQL & HBase overview
 
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaAdvance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
 
Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)Introduction to the Hadoop Ecosystem (codemotion Edition)
Introduction to the Hadoop Ecosystem (codemotion Edition)
 
Introduction to the hadoop ecosystem by Uwe Seiler
Introduction to the hadoop ecosystem by Uwe SeilerIntroduction to the hadoop ecosystem by Uwe Seiler
Introduction to the hadoop ecosystem by Uwe Seiler
 
Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)
 
H base introduction & development
H base introduction & developmentH base introduction & development
H base introduction & development
 
Lecture 2 part 3
Lecture 2 part 3Lecture 2 part 3
Lecture 2 part 3
 
Cassandra/Hadoop Integration
Cassandra/Hadoop IntegrationCassandra/Hadoop Integration
Cassandra/Hadoop Integration
 
Chicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An IntroductionChicago Data Summit: Apache HBase: An Introduction
Chicago Data Summit: Apache HBase: An Introduction
 
Hspark index conf
Hspark index confHspark index conf
Hspark index conf
 
Introduction to HBase - Phoenix HUG 5/14
Introduction to HBase - Phoenix HUG 5/14Introduction to HBase - Phoenix HUG 5/14
Introduction to HBase - Phoenix HUG 5/14
 
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsightEnterprise Data Workflows with Cascading and Windows Azure HDInsight
Enterprise Data Workflows with Cascading and Windows Azure HDInsight
 

More from HBaseCon

hbaseconasia2017: Building online HBase cluster of Zhihu based on Kubernetes
hbaseconasia2017: Building online HBase cluster of Zhihu based on Kuberneteshbaseconasia2017: Building online HBase cluster of Zhihu based on Kubernetes
hbaseconasia2017: Building online HBase cluster of Zhihu based on KubernetesHBaseCon
 
hbaseconasia2017: HBase on Beam
hbaseconasia2017: HBase on Beamhbaseconasia2017: HBase on Beam
hbaseconasia2017: HBase on BeamHBaseCon
 
hbaseconasia2017: HBase Disaster Recovery Solution at Huawei
hbaseconasia2017: HBase Disaster Recovery Solution at Huaweihbaseconasia2017: HBase Disaster Recovery Solution at Huawei
hbaseconasia2017: HBase Disaster Recovery Solution at HuaweiHBaseCon
 
hbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinterest
hbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinteresthbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinterest
hbaseconasia2017: Removable singularity: a story of HBase upgrade in PinterestHBaseCon
 
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程HBaseCon
 
hbaseconasia2017: Apache HBase at Netease
hbaseconasia2017: Apache HBase at Neteasehbaseconasia2017: Apache HBase at Netease
hbaseconasia2017: Apache HBase at NeteaseHBaseCon
 
hbaseconasia2017: HBase在Hulu的使用和实践
hbaseconasia2017: HBase在Hulu的使用和实践hbaseconasia2017: HBase在Hulu的使用和实践
hbaseconasia2017: HBase在Hulu的使用和实践HBaseCon
 
hbaseconasia2017: 基于HBase的企业级大数据平台
hbaseconasia2017: 基于HBase的企业级大数据平台hbaseconasia2017: 基于HBase的企业级大数据平台
hbaseconasia2017: 基于HBase的企业级大数据平台HBaseCon
 
hbaseconasia2017: HBase at JD.com
hbaseconasia2017: HBase at JD.comhbaseconasia2017: HBase at JD.com
hbaseconasia2017: HBase at JD.comHBaseCon
 
hbaseconasia2017: Large scale data near-line loading method and architecture
hbaseconasia2017: Large scale data near-line loading method and architecturehbaseconasia2017: Large scale data near-line loading method and architecture
hbaseconasia2017: Large scale data near-line loading method and architectureHBaseCon
 
hbaseconasia2017: Ecosystems with HBase and CloudTable service at Huawei
hbaseconasia2017: Ecosystems with HBase and CloudTable service at Huaweihbaseconasia2017: Ecosystems with HBase and CloudTable service at Huawei
hbaseconasia2017: Ecosystems with HBase and CloudTable service at HuaweiHBaseCon
 
hbaseconasia2017: HBase Practice At XiaoMi
hbaseconasia2017: HBase Practice At XiaoMihbaseconasia2017: HBase Practice At XiaoMi
hbaseconasia2017: HBase Practice At XiaoMiHBaseCon
 
hbaseconasia2017: hbase-2.0.0
hbaseconasia2017: hbase-2.0.0hbaseconasia2017: hbase-2.0.0
hbaseconasia2017: hbase-2.0.0HBaseCon
 
HBaseCon2017 Democratizing HBase
HBaseCon2017 Democratizing HBaseHBaseCon2017 Democratizing HBase
HBaseCon2017 Democratizing HBaseHBaseCon
 
HBaseCon2017 Removable singularity: a story of HBase upgrade in Pinterest
HBaseCon2017 Removable singularity: a story of HBase upgrade in PinterestHBaseCon2017 Removable singularity: a story of HBase upgrade in Pinterest
HBaseCon2017 Removable singularity: a story of HBase upgrade in PinterestHBaseCon
 
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBase
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBaseHBaseCon2017 Quanta: Quora's hierarchical counting system on HBase
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBaseHBaseCon
 
HBaseCon2017 Transactions in HBase
HBaseCon2017 Transactions in HBaseHBaseCon2017 Transactions in HBase
HBaseCon2017 Transactions in HBaseHBaseCon
 
HBaseCon2017 Highly-Available HBase
HBaseCon2017 Highly-Available HBaseHBaseCon2017 Highly-Available HBase
HBaseCon2017 Highly-Available HBaseHBaseCon
 
HBaseCon2017 Apache HBase at Didi
HBaseCon2017 Apache HBase at DidiHBaseCon2017 Apache HBase at Didi
HBaseCon2017 Apache HBase at DidiHBaseCon
 
HBaseCon2017 gohbase: Pure Go HBase Client
HBaseCon2017 gohbase: Pure Go HBase ClientHBaseCon2017 gohbase: Pure Go HBase Client
HBaseCon2017 gohbase: Pure Go HBase ClientHBaseCon
 

More from HBaseCon (20)

hbaseconasia2017: Building online HBase cluster of Zhihu based on Kubernetes
hbaseconasia2017: Building online HBase cluster of Zhihu based on Kuberneteshbaseconasia2017: Building online HBase cluster of Zhihu based on Kubernetes
hbaseconasia2017: Building online HBase cluster of Zhihu based on Kubernetes
 
hbaseconasia2017: HBase on Beam
hbaseconasia2017: HBase on Beamhbaseconasia2017: HBase on Beam
hbaseconasia2017: HBase on Beam
 
hbaseconasia2017: HBase Disaster Recovery Solution at Huawei
hbaseconasia2017: HBase Disaster Recovery Solution at Huaweihbaseconasia2017: HBase Disaster Recovery Solution at Huawei
hbaseconasia2017: HBase Disaster Recovery Solution at Huawei
 
hbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinterest
hbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinteresthbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinterest
hbaseconasia2017: Removable singularity: a story of HBase upgrade in Pinterest
 
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
hbaseconasia2017: HareQL:快速HBase查詢工具的發展過程
 
hbaseconasia2017: Apache HBase at Netease
hbaseconasia2017: Apache HBase at Neteasehbaseconasia2017: Apache HBase at Netease
hbaseconasia2017: Apache HBase at Netease
 
hbaseconasia2017: HBase在Hulu的使用和实践
hbaseconasia2017: HBase在Hulu的使用和实践hbaseconasia2017: HBase在Hulu的使用和实践
hbaseconasia2017: HBase在Hulu的使用和实践
 
hbaseconasia2017: 基于HBase的企业级大数据平台
hbaseconasia2017: 基于HBase的企业级大数据平台hbaseconasia2017: 基于HBase的企业级大数据平台
hbaseconasia2017: 基于HBase的企业级大数据平台
 
hbaseconasia2017: HBase at JD.com
hbaseconasia2017: HBase at JD.comhbaseconasia2017: HBase at JD.com
hbaseconasia2017: HBase at JD.com
 
hbaseconasia2017: Large scale data near-line loading method and architecture
hbaseconasia2017: Large scale data near-line loading method and architecturehbaseconasia2017: Large scale data near-line loading method and architecture
hbaseconasia2017: Large scale data near-line loading method and architecture
 
hbaseconasia2017: Ecosystems with HBase and CloudTable service at Huawei
hbaseconasia2017: Ecosystems with HBase and CloudTable service at Huaweihbaseconasia2017: Ecosystems with HBase and CloudTable service at Huawei
hbaseconasia2017: Ecosystems with HBase and CloudTable service at Huawei
 
hbaseconasia2017: HBase Practice At XiaoMi
hbaseconasia2017: HBase Practice At XiaoMihbaseconasia2017: HBase Practice At XiaoMi
hbaseconasia2017: HBase Practice At XiaoMi
 
hbaseconasia2017: hbase-2.0.0
hbaseconasia2017: hbase-2.0.0hbaseconasia2017: hbase-2.0.0
hbaseconasia2017: hbase-2.0.0
 
HBaseCon2017 Democratizing HBase
HBaseCon2017 Democratizing HBaseHBaseCon2017 Democratizing HBase
HBaseCon2017 Democratizing HBase
 
HBaseCon2017 Removable singularity: a story of HBase upgrade in Pinterest
HBaseCon2017 Removable singularity: a story of HBase upgrade in PinterestHBaseCon2017 Removable singularity: a story of HBase upgrade in Pinterest
HBaseCon2017 Removable singularity: a story of HBase upgrade in Pinterest
 
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBase
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBaseHBaseCon2017 Quanta: Quora's hierarchical counting system on HBase
HBaseCon2017 Quanta: Quora's hierarchical counting system on HBase
 
HBaseCon2017 Transactions in HBase
HBaseCon2017 Transactions in HBaseHBaseCon2017 Transactions in HBase
HBaseCon2017 Transactions in HBase
 
HBaseCon2017 Highly-Available HBase
HBaseCon2017 Highly-Available HBaseHBaseCon2017 Highly-Available HBase
HBaseCon2017 Highly-Available HBase
 
HBaseCon2017 Apache HBase at Didi
HBaseCon2017 Apache HBase at DidiHBaseCon2017 Apache HBase at Didi
HBaseCon2017 Apache HBase at Didi
 
HBaseCon2017 gohbase: Pure Go HBase Client
HBaseCon2017 gohbase: Pure Go HBase ClientHBaseCon2017 gohbase: Pure Go HBase Client
HBaseCon2017 gohbase: Pure Go HBase Client
 

Recently uploaded

Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024VictoriaMetrics
 
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...Jittipong Loespradit
 
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...chiefasafspells
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...masabamasaba
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2
 
%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in sowetomasabamasaba
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...masabamasaba
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfonteinmasabamasaba
 
What Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the SituationWhat Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the SituationJuha-Pekka Tolvanen
 
Artyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptxArtyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptxAnnaArtyushina1
 
WSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go PlatformlessWSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go PlatformlessWSO2
 
%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benoni%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benonimasabamasaba
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrainmasabamasaba
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension AidPhilip Schwarz
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareJim McKeeth
 
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...WSO2
 
tonesoftg
tonesoftgtonesoftg
tonesoftglanshi9
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburgmasabamasaba
 

Recently uploaded (20)

Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
 
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
 
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
 
%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
What Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the SituationWhat Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the Situation
 
Artyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptxArtyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptx
 
WSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go PlatformlessWSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go Platformless
 
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
 
%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benoni%in Benoni+277-882-255-28 abortion pills for sale in Benoni
%in Benoni+277-882-255-28 abortion pills for sale in Benoni
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
Direct Style Effect Systems -The Print[A] Example- A Comprehension AidDirect Style Effect Systems -The Print[A] Example- A Comprehension Aid
Direct Style Effect Systems - The Print[A] Example - A Comprehension Aid
 
Announcing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK SoftwareAnnouncing Codolex 2.0 from GDK Software
Announcing Codolex 2.0 from GDK Software
 
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
WSO2CON 2024 - Building the API First Enterprise – Running an API Program, fr...
 
tonesoftg
tonesoftgtonesoftg
tonesoftg
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
%in Rustenburg+277-882-255-28 abortion pills for sale in Rustenburg
 

Apache Spark on Apache HBase: Current and Future

  • 1. About Zhan Zhang  Zhan Zhang (Software Engineer at Hortonworks)  Currently Focus on Apache Spark and Hadoop, etc  Contribute to Apache Spark, Yarn, HBase, Ambari, etc  Experiences on Computer Networks, Distributed System and Machine Learning Platform
  • 2. About MeJean-Marc Spaggiari Java Message Service => JMS A bit of everything… 12 years in professional software development 4 years of team manager 4 years of project manager Joined Cloudera in May 2013 Had mostly HBase knowledge O'Reilly author of Architecting HBase Applications International Worked from Paris to Los Angeles More than 100 flights per year HBase and Phoenix contributor
  • 3. About Ted Malaska  PSA at Cloudera  Co-Author to Hadoop Application Architecture  Contribute to 12 Apache projects  Worked with ~100 customers using big data
  • 4. How it Started • Demand started in the Field • Porting off Map Reduce • Huge Value in Spark Streaming for storing Aggregates and being used for point look ups • Started as a Github • Andrew Purtell sparked the effort to put into Hbase • Big call out to Sean B, Jon H, Ted Y, and Matteo B • Components • Normal Spark • Spark Streaming • Bulk Load • SparkSQL HBaseCon 2016
  • 5. Under the covers HBaseCon 2016 Driver Walker Node Configs Executor Static Space Configs HConnection Tasks Tasks Walker Node Executor Static Space Configs HConnection Tasks Tasks
  • 6. Key Addition: HBaseContext Create an HBaseContext // an Hadoop/HBase Configuration object val conf = HBaseConfiguration.create() conf.addResource(new Path("/etc/hbase/conf/core-site.xml")) conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml")) // sc is the Spark Context; hbase context corresponds to an HBase Connection val hbaseContext = new HBaseContext(sc, conf) // A sample RDD val rdd = sc.parallelize(Array( (Bytes.toBytes("1")), (Bytes.toBytes("2")), (Bytes.toBytes("3")), (Bytes.toBytes("4")), (Bytes.toBytes("5")), (Bytes.toBytes("6")), (Bytes.toBytes("7")))) HBaseCon 2016
  • 7. • Foreach • Map • BulkLoad • BulkLoadThinRows • BulkGet (aka Multiget) • BulkDelete • Most of them in both Java and Scala Operations on the HBaseContext
  • 8. Foreach Read HBase data in parallel for each partition and compute rdd.hbaseForeachPartition(hbaseContext, (it, conn) => { // do something val bufferedMutator = conn.getBufferedMutator(TableName.valueOf("t1")) it.foreach(r => { ... // HBase API put/incr/append/cas calls } bufferedMutator.flush() bufferedMutator.close() }) HBaseCon 2016
  • 9. Foreach Read HBase data in parallel for each partition and compute hbaseContext.foreachPartition(keyValuesPuts, new VoidFunction<Tuple2<Iterator<Put>, Connection>>() { @Override public void call(Tuple2<Iterator<Put>, Connection> t) throws Exception { BufferedMutator mutator = t._2().getBufferedMutator(TABLE_NAME); while (t._1().hasNext()) { ... // HBase API put/incr/append/cas calls } mutator.flush(); mutator.close(); } }); }); HBaseCon 2016
  • 10. Map Take an HBase dataset and map it in parallel for each partition to produce a new RDD val getRdd = rdd.hbaseMapPartitions(hbaseContext, (it, conn) => { val table = conn.getTable(TableName.valueOf("t1")) var res = mutable.MutableList[String]() it.map( r => { ... // HBase API Scan Results } }) HBaseCon 2016
  • 11. BulkLoadBulk load a data set into Hbase (for all cases, generally wide tables) (Scala only) rdd.hbaseBulkLoad(hbaseContext, tableName, t => { val rowKey = t._1 val fam:Array[Byte] = t._2._1 val qual = t._2._2 val value = t._2._3 val keyFamilyQualifier= new KeyFamilyQualifier(rowKey, fam, qual) Seq((keyFamilyQualifier, value)).iterator }, stagingFolder) val load = new LoadIncrementalHFiles(config) load.run(Array(stagingFolder, tableNameString)) HBaseCon 2016
  • 12. BulkLoadThinRows Bulk load a data set into HBase (for skinny tables, <10k cols) hbaseContext.bulkLoadThinRows[(String, Iterable[(Array[Byte], Array[Byte], Array[Byte])])] (rdd, TableName.valueOf(tableName), t => { val rowKey = Bytes.toBytes(t._1) val familyQualifiersValues = new FamiliesQualifiersValues t._2.foreach(f => { val family:Array[Byte] = f._1 val qualifier = f._2 val value:Array[Byte] = f._3 familyQualifiersValues +=(family, qualifier, value) }) (new ByteArrayWrapper(rowKey), familyQualifiersValues) }, stagingFolder.getPath) HBaseCon 2016
  • 13. BulkPut Parallelized HBase Multiput hbaseContext.bulkPut[(Array[Byte], Array[(Array[Byte], Array[Byte], Array[Byte])])](rdd, tableName, (putRecord) => { val put = new Put(putRecord._1) putRecord._2.foreach((putValue) => put.add(putValue._1, putValue._2, putValue._3)) put } HBaseCon 2016
  • 14. BulkPut Parallelized HBase Multiput hbaseContext.bulkPut(textFile, TABLE_NAME, new Function<String, Put>() { @Override public Put call(String v1) throws Exception { String[] tokens = v1.split("|"); Put put = new Put(Bytes.toBytes(tokens[0])); put.addColumn(Bytes.toBytes("segment"), Bytes.toBytes(tokens[1]), Bytes.toBytes(tokens[2])); return put; } }); HBaseCon 2016
  • 15. BulkDelete Parallelized HBase Multi-deletes hbaseContext.bulkDelete[Array[Byte]](rdd, tableName, putRecord => new Delete(putRecord), 4) // batch size rdd.hbaseBulkDelete(hbaseContext, tableName, putRecord => new Delete(putRecord), 4) // batch size HBaseCon 2016
  • 16. What Improvement Have We Made?  Combine Spark and HBase • Spark Catalyst Engine for Query Plan and Optimization • HBase for Fast Access KV Store • Implement Standard External Data Source with Built-in Filter • High Performance • Data Locality: Move Computation to Data • Partition Pruning: Task only Performed in RS Holding Requested Data • Column Pruning / Predicate Pushdown: Reduce Network Overhead • Full Fledged DataFrame Support • Spark-SQL • Integrated Language Query • Run on Top of Existing HBase Table • Native Support Java Primitive Types
  • 17. More … • Composite Key • Avro Format
  • 18. Usage - Define the Catalog
  • 21. Usage - Language Integrate Query/SQL
  • 22. Spark HBase Connector Architecture