Apache Spark vs rest of the world
- Problems and Solutions
Arkadiusz Jachnik
#BigDataSpain 2017
About Arkadiusz
• Senior Data Scientist at AGORA SA
- user profiling & content personalization

- recommendation system

• PhD Student at Poznan University of Technology
- multi-class & multi-label classification

- multi-output prediction

- recommendation algorithms
2
#BigDataSpain 2017
Agora’s BigData Team
3
my boss Luiza :) it’s me!
we are all here at #BDS!
I invite you to these guys’ talk :)
Arek Wojtek
Paweł
Paweł
Dawid
Bartek Jacek Daniel
#BigDataSpain 2017
4
Internet Press
Polish Media Company
Magazines
Radio
Cinemas
Advertising
TV
Books
#BigDataSpain 2017
Spark in Agora's BigData Platform
5
DATA COLLECTING AND INTEGRATION
USER PROFILING SYSTEM
DATA ANALYTICS
RECOMMENDATION SYSTEM
DATA ENRICHMENT AND CONTENT STRUCTURISATION
HADOOP CLUSTER
own build, v2.2
structured streaming
Spark SQL, MLlib
Spark streaming
over 3 years of experience
#BigDataSpain 2017
Problems discussed today
6
1. Processing parts of data and loading them from Spark to a relational database in parallel
2. Bulk loading to an HBase database
3. From a relational database to a Spark DataFrame (with user-defined functions)
4. From HBase to Spark via a Hive external table (with timestamps of HBase cells)
5. Spark Streaming with Kafka - how to implement your own offset manager
#BigDataSpain 2017
I will show some code…
• I will show real technical problems we have encountered during our Spark deployment

• We have been using Spark at Agora for over 3 years, so we have gained solid experience

• I will present practical solutions, showing some code in Scala

• Scala is the natural language for Spark
7
1. Processing and writing parts of data in parallel
Problem description:

• We have a huge processed DataFrame of computed recommendations for users

• There are 4 defined types of recommendations

• For each type we want to take the top-K recommendations for each user

• Recommendations of each type should be loaded to a different PostgreSQL table
#BigDataSpain 2017
8
User      | Recommendation type | Article   | Score
Grzegorz  | TYPE_3              | Article F | 1.0
Bożena    | TYPE_4              | Article B | 0.2
Grażyna   | TYPE_2              | Article B | 0.2
Grzegorz  | TYPE_3              | Article D | 0.9
Krzysztof | TYPE_3              | Article D | 0.4
Grażyna   | TYPE_2              | Article C | 0.9
Grażyna   | TYPE_1              | Article D | 0.3
Bożena    | TYPE_2              | Article E | 0.9
Grzegorz  | TYPE_1              | Article E | 1.0
Grzegorz  | TYPE_1              | Article A | 0.7
#BigDataSpain 2017
Code intro: input & output
9
Grzegorz, Article A, 1.0
Grzegorz, Article F, 0.9
Grzegorz, Article C, 0.9
Grzegorz, Article D, 0.8
Grzegorz, Article B, 0.75
Bożena, ... ...
TYPE_1 - 5 recos. per user - save to table_1
Krzysztof, Article F, 1.0
Krzysztof, Article D, 1.0
Krzysztof, Article C, 0.8
Krzysztof, Article B, 0.85
Grażyna, Article C, 1.0
Grażyna, ... ...
TYPE_2 - 4 recos. per user - save to table_2
Grzegorz, Article E, 1.0
Grzegorz, Article B, 0.75
Grzegorz, Article A, 0.8
Bożena, Article E, 0.9
Bożena, Article A, 0.75
Bożena, Article C 0.75
TYPE_3 - 3 recos. per user - save to table_3
Grażyna, Article A, 1.0
Grażyna, Article F, 0.9
Bożena, Article B, 0.9
Bożena, Article D, 0.9
Grzegorz, Article B, 1.0
Grzegorz, Article E, 0.95
TYPE_4 - 2 recos. per user - save to table_4
#BigDataSpain 2017
Standard approach
recoTypes.foreach(recoType => {
val topNrecommendations = processedData.where($"type" === recoType.code)
.withColumn("row_number", row_number().over(Window.partitionBy("name").orderBy(desc("score"))))
.where(col("row_number") <= recoType.recoNum).drop("row_number")
RecoDAO.save(topNrecommendations.collect().map(OutputReco(_)), recoType.tableName)
})
10
no parallelism / parallelism, but most of the tasks skipped
#BigDataSpain 2017
maybe we can add .par?
recoTypes.par.foreach(recoType => {
val topNrecommendations = processedData.where($"type" === recoType.code)
.withColumn("row_number", row_number().over(Window.partitionBy("name").orderBy(desc("score"))))
.where(col("row_number") <= recoType.recoNum).drop("row_number")
RecoDAO.save(topNrecommendations.collect().map(OutputReco(_)), recoType.tableName)
})
11
parallelism, but too many tasks :(
#BigDataSpain 2017
Our trick
parallelizeProcessing(recoTypes, (recoType: RecoType) => {
val topNrecommendations = processedData.where($"type" === recoType.code)
.withColumn("row_number", row_number().over(Window.partitionBy("name").orderBy(desc("score"))))
.where(col("row_number") <= recoType.recoNum).drop("row_number")
RecoDAO.save(topNrecommendations.collect().map(OutputReco(_)), recoType.tableName)
})
def parallelizeProcessing(recoTypes: Seq[RecoType], f: RecoType => Unit) = {
f(recoTypes.head)
if(recoTypes.tail.nonEmpty) recoTypes.tail.par.foreach(f(_))
}
12
execute Spark action for the first type…
parallelize the rest
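RecoDAO.save and OutputReco are not shown on the slides; a minimal sketch of what such a DAO could look like, assuming a plain JDBC batch insert into PostgreSQL (the table columns, connection settings and the simplified OutputReco fields are assumptions; on the slides OutputReco is built from a Spark Row):

import java.sql.DriverManager

// hypothetical, simplified output record
case class OutputReco(user: String, article: String, score: Double)

object RecoDAO {
  // assumed connection settings
  private val url = "jdbc:postgresql://host:5432/database"

  def save(recos: Seq[OutputReco], tableName: String): Unit = {
    val conn = DriverManager.getConnection(url, "user", "password")
    try {
      conn.setAutoCommit(false)
      val stmt = conn.prepareStatement(
        s"INSERT INTO $tableName (user_name, article, score) VALUES (?, ?, ?)")
      recos.foreach { r =>
        stmt.setString(1, r.user)
        stmt.setString(2, r.article)
        stmt.setDouble(3, r.score)
        stmt.addBatch()
      }
      stmt.executeBatch()  // one batch insert per recommendation type
      conn.commit()
    } finally {
      conn.close()
    }
  }
}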
2. Fast bulk-loading to HBase
Problems with the standard HBase client (inserts with the Put class):

• Difficult integration with Spark

• Complicated parallelization

• For non-pre-split tables, problems with *Region*Exceptions

• Slow for millions of rows

(a sketch of this naive Put-based approach is shown below)
#BigDataSpain 2017
13
[Diagram: Spark DataFrame / RDD → .foreachPartition → hTable.put(…) executed for each partition]
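For reference, the naive Put-based approach criticized above looks roughly like this (a sketch, assuming an HBase 1.x client, a single column family and an RDD of (rowKey, value) pairs; the table, family and column names are illustrative):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD

val dataRdd: RDD[(Array[Byte], String)] = ??? // assumed input: (row key, cell value)

dataRdd.foreachPartition { partition =>
  // one connection per partition - still slow for millions of rows
  val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
  val table = connection.getTable(TableName.valueOf("hbase_table_name"))
  try {
    partition.foreach { case (rowKey, value) =>
      val put = new Put(rowKey)
      put.addColumn(Bytes.toBytes("col-fam"), Bytes.toBytes("col"), Bytes.toBytes(value))
      table.put(put)
    }
  } finally {
    table.close()
    connection.close()
  }
}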
#BigDataSpain 2017
Idea
Our approach is based on:
https://github.com/zeyuanxy/spark-hbase-bulk-loading
Input RDD:

data: RDD[( //pair RDD
Array[Byte], //HBase row key
Map[ //data:
String, //column-family
Array[(
String, //column name
(String, //cell value
Long) //timestamp
)]
]
)]
14
General idea:

We have to save our RDD data as HFiles (the file format in which HBase stores its data) and load them into a given pre-existing table.
General steps:

1. Implement a Spark Partitioner that defines how our key-value pair RDD should be partitioned by HBase row key (see the sketch after this list)

2. Repartition and sort the RDD within column families and the starting row keys of every HBase region

3. Save the RDD to HDFS as HFiles using the rdd.saveAsNewAPIHadoopFile method

4. Load the files into the table with LoadIncrementalHFiles (HBase API)
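A minimal sketch of such a partitioner and of the getPartitionedRdd helper used on the next slide, assuming the region start keys come from the RegionLocator (the fraction-based sub-splitting of regions used in the original project is omitted here):

import org.apache.hadoop.hbase.KeyValue
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// routes each ((rowKey, column), timestamp) key to the HBase region its row key belongs to
class HFilePartitioner(startKeys: Array[Array[Byte]]) extends Partitioner {
  override def numPartitions: Int = startKeys.length
  override def getPartition(key: Any): Int = {
    val rowKey = key.asInstanceOf[((Array[Byte], Array[Byte]), Long)]._1._1
    // index of the last region whose start key does not exceed the row key
    val idx = startKeys.lastIndexWhere(startKey => Bytes.compareTo(rowKey, startKey) >= 0)
    math.max(idx, 0)
  }
}

// repartitions by region, sorts cells the way HFiles expect (row asc, qualifier asc, timestamp desc)
// and maps them to the (ImmutableBytesWritable, KeyValue) pairs required by HFileOutputFormat2
def getPartitionedRdd(rdd: RDD[(((Array[Byte], Array[Byte]), Long), Array[Byte])],
                      family: String,
                      partitioner: HFilePartitioner): RDD[(ImmutableBytesWritable, KeyValue)] = {
  implicit val cellOrdering: Ordering[((Array[Byte], Array[Byte]), Long)] =
    new Ordering[((Array[Byte], Array[Byte]), Long)] {
      def compare(a: ((Array[Byte], Array[Byte]), Long), b: ((Array[Byte], Array[Byte]), Long)): Int = {
        val rowCmp = Bytes.compareTo(a._1._1, b._1._1)
        if (rowCmp != 0) rowCmp
        else {
          val colCmp = Bytes.compareTo(a._1._2, b._1._2)
          if (colCmp != 0) colCmp else java.lang.Long.compare(b._2, a._2) // newest cell first
        }
      }
    }
  rdd
    .repartitionAndSortWithinPartitions(partitioner)
    .map { case (((rowKey, column), ts), value) =>
      (new ImmutableBytesWritable(rowKey), new KeyValue(rowKey, Bytes.toBytes(family), column, ts, value))
    }
}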
#BigDataSpain 2017
Implementation
// Prepare hConnection, tableName, hTable ...
val regionLocator =
hConnection.getRegionLocator(tableName)
val columnFamilies = hTable.getTableDescriptor
.getFamiliesKeys.map(Bytes.toString(_))
val partitioner = new
HFilePartitioner(regionLocator.getStartKeys, fraction)
// prepare partitioned RDD
val rdds = for {
family <- columnFamilies
rdd = data
.collect{ case (key, dataMap) if
dataMap.contains(family) => (key, dataMap(family))}
.flatMap{ case (key, familyDataMap) =>
familyDataMap.map{
case (column: String, valueTs: (String, Long)) =>
(((key, Bytes.toBytes(column)), valueTs._2),
Bytes.toBytes(valueTs._1))
}
}
} yield getPartitionedRdd(rdd, family, partitioner)
15
val rddToSave = rdds.reduce(_ ++ _)
// prepare map-reduce job for bulk-load
val job = Job.getInstance(hbaseConfig) // org.apache.hadoop.mapreduce.Job
HFileOutputFormat2.configureIncrementalLoad(
  job, hTable, regionLocator)
// prepare path for HFiles output
val fs = FileSystem.get(hbaseConfig)
val hFilePath = new Path(...)
try {
rddToSave.saveAsNewAPIHadoopFile(hFilePath.toString,
classOf[ImmutableBytesWritable], classOf[KeyValue],
classOf[HFileOutputFormat2], job.getConfiguration)
// prepare HFiles for incremental load by setting
// folders permissions read/write/exec for all...
setRecursivePermission(hFilePath)
val loader = new LoadIncrementalHFiles(hbaseConfig)
loader.doBulkLoad(hFilePath, hConnection.getAdmin,
hTable, regionLocator)
} // finally close resources, ...
Prepare HBase connection, table and region locator
Prepare the Spark partitioner for HBase regions
Repartition and sort data within partitions by the partitioner
Save HFiles to HDFS by saveAsNewAPIHadoopFile
Load HFiles into the HBase table
#BigDataSpain 2017
Keep in mind
• Set the HBase parameter hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily (default 32) to an optimal value (see the configuration sketch after this slide)

• For large data, too small a value of this parameter may cause
IllegalArgumentException: Size exceeds Integer.MAX_VALUE

• Create HBase tables with splits adapted to the expected row keys
- example: for row keys of HEX IDs, create the table with splits like:
create 'hbase_table_name', 'col-fam', {SPLITS => ['0','1','2','3','4','5','6','7','8','9','a','b','c','d','e','f']}
- for subsequent single Puts this minimizes *Region*Exceptions
16
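For example, the parameter can be raised on the Configuration object used for the bulk load; a minimal sketch (the value 128 is just an illustration):

import org.apache.hadoop.hbase.HBaseConfiguration

val hbaseConfig = HBaseConfiguration.create()
// allow more HFiles per region and column family than the default 32
hbaseConfig.setInt("hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily", 128)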
#BigDataSpain 2017
3. Loading data from Postgres to Spark


This is possible for data from Hive:

val toUpperCase: String => String = _.toUpperCase
sparkSession.udf.register("toUpperCaseUdf", toUpperCase)
val data: DataFrame = sparkSession.sql(
  "SELECT id, toUpperCaseUdf(code) FROM types"
)
17
But this is not possible for data from
JDBC (for example PostgreSQL):

val toUpperCase: String => String = _.toUpperCase
sparkSession.udf.register("toUpperCaseUdf", toUpperCase)
val jdbcUrl = s"jdbc:postgresql://host:port/database"
val data: DataFrame = sparkSession.read
  .jdbc(jdbcUrl,
    "(SELECT toUpperCaseUdf(code) " +
    "FROM codes) as codesData",
    connectionConf)
this query is executed by Postgres (not Spark)
here you can specify just a Postgres table name
and how to parallelize data loading?
#BigDataSpain 2017
Try to load the ’raw’ data without UDFs and then use .withColumn with the UDF as an expression:

val toUpperCase: String => String = _.toUpperCase
sparkSession.udf.register("toUpperCaseUdf", toUpperCase)
val jdbcUrl = s"jdbc:postgresql://host:port/database"
val data: DataFrame = sparkSession.read
  .jdbc(jdbcUrl,
    "(SELECT code " +
    "FROM codes) as codesData",
    connectionConf)
  .withColumn("upperCode",
    expr("toUpperCaseUdf(code)"))
Our solution
18
.jdbc produces
DataFrame
We will split the table read across executors
on the selected column:

val jdbcUrl = s"jdbc:postgresql://host:port/database"
val data: DataFrame = sparkSession.read
  .jdbc(
    url = jdbcUrl,
    table = "(SELECT code, type_id " +
      "FROM codes) as codesData",
    columnName = "type_id",
    lowerBound = 1L,
    upperBound = 100L,
    numPartitions = 10,
    connectionProperties = connectionConf)
but it’s one partition!
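As a rough illustration of what the partitioned read does: Spark splits the [lowerBound, upperBound) range of the partition column into numPartitions strides and issues one query per partition; the bounds only control how the range is split, they do not filter rows. The sketch below only approximates the per-partition WHERE clauses (the exact SQL is generated by Spark's JDBC data source):

// illustrative only - approximate per-partition predicates for
// columnName = "type_id", lowerBound = 1, upperBound = 100, numPartitions = 10
val lowerBound = 1L
val upperBound = 100L
val numPartitions = 10
val stride = (upperBound - lowerBound) / numPartitions

val predicates = (0 until numPartitions).map { i =>
  val lower = lowerBound + i * stride
  val upper = lower + stride
  if (i == 0) s"type_id < $upper OR type_id IS NULL"        // first partition also takes NULLs
  else if (i == numPartitions - 1) s"type_id >= $lower"     // last partition is open-ended
  else s"type_id >= $lower AND type_id < $upper"
}
predicates.foreach(println)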
#BigDataSpain 2017
Is it working?
spark.read.jdbc(
url = "jdbc:mysql://localhost:3306/test",
table = "users",
properties = connectionProperties)
.cache()
spark.read.jdbc(
url = "jdbc:mysql://localhost:3306/test",
table = "users",
columnName = "type",
lowerBound = 1L,
upperBound = 100L,
numPartitions = 4,
connectionProperties = connectionProperties)
.cache()
19
test data
1 partition
4 partitions
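A quick way to verify the difference is to check the number of partitions of the resulting DataFrames (a sketch, assuming the two reads above were assigned to usersDf and partitionedUsersDf, which are names introduced here for illustration):

println(usersDf.rdd.getNumPartitions)             // 1 - the whole table is read by a single task
println(partitionedUsersDf.rdd.getNumPartitions)  // 4 - one task per range of the "type" column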
#BigDataSpain 2017
4. From HBase to Spark by Hive
There is a commonly used method for loading data from HBase to Spark via a Hive external table:

CREATE TABLE hive_view_on_hbase (
key int,
value string
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" = ":key, cf1:val"
)
TBLPROPERTIES (
"hbase.table.name" = "xyz"
);
20
Example HBase rows (column family "cities"; one column per city):

row key      | Poznan | Warsaw | Cracow | Gdansk
72A9DBA74524 | 40     | 5      | 1      | 3
58383B36275A |        | 120    | 60     | 5
009D22419988 | 75     | 1      |        |

Desired Hive view (via HiveHBaseHandler):

user_id      | cities_map                                       | last_city
72A9DBA74524 | map(Poznan->40, Warsaw->5, Cracow->1, Gdansk->3) | ?
58383B36275A | map(Warsaw->120, Cracow->60, Gdansk->5)          | ?
009D22419988 | map(Poznan->75, Warsaw->1)                       | ?

but how to get the last (most recent) values? where are the timestamps?
#BigDataSpain 2017
Our case
• We use the HDP distribution of the Hadoop cluster with HBase 1.1.x

• It is possible to add the latest row-modification timestamp to the Hive view on an HBase table:

CREATE TABLE hive_view_on_hbase (
key int,
value string,
ts timestamp
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
'hbase.columns.mapping' = ':key, cf1:val, :timestamp'
)
TBLPROPERTIES (
'hbase.table.name' = 'xyz'
);
21
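Once such a view exists it can be queried from Spark like any other Hive table; a minimal sketch (assuming Hive support is enabled on the SparkSession and the table name from the DDL above):

import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder()
  .appName("hbase-via-hive")
  .enableHiveSupport()  // so Spark can see tables registered in the Hive metastore
  .getOrCreate()

// the :timestamp mapping exposes the latest modification time of each HBase row as the ts column
val rows = sparkSession.sql("SELECT key, value, ts FROM hive_view_on_hbase")
rows.show()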
• How to extract the timestamp of each cell?

• Answer: rewrite the Hive-HBase-Handler that is responsible for creating the Hive views on HBase tables :) … but first …

• Do not download the Hive source code from the Hive GitHub repository - check your Hadoop distribution! (for example, HDP has its own code branch)
#BigDataSpain 2017
22
There is a patch on the Hive repo…
…but it is still not reviewed and merged :(
#BigDataSpain 2017
There is a lot of code…
…but we have some tips on how to change the Hive-HBase-Handler:

• The parsing of hbase.columns.mapping columns is located in HBaseSerDe.java, which returns a ColumnMappings object

• The LazyHBaseRow class stores the data from an HBase row

• Timestamps of processed HBase cells can be read from the rows loaded (by the scanner) in the LazyHBaseCellMap class

• The column parser and the HBase scanner are initialized in HBaseStorageHandler.java
23
#BigDataSpain 2017
5. Spark + Kafka: own offset manager
Problem description:

• Spark output operations are at-least-once

• For exactly-once semantics, you must store
offsets after an idempotent output, or in an
atomic transaction alongside output

• Options:

1. Checkpoints

+ easy to enable by Spark checkpointing

- output operation must be idempotent

- cannot recover from a checkpoint if
application code has changed

2. Own data store

+ works regardless of changes to your application code

+ you can use data stores that support
transactions

+ exactly-once semantics
24
Single Spark batch
Process
and save data
Save
offsets
Image source: Spark Streaming documentation

https://spark.apache.org/docs/latest/streaming-programming-guide.html
#BigDataSpain 2017
Some code with Spark Streaming
val ssc: StreamingContext = new StreamingContext(…)
val stream: DStream[ConsumerRecord[String, String]] = ...
stream.foreachRDD(rdd => {
val toSave: Seq[String] = rdd.collect().map(_.value())
saveData(toSave)
offsetsStore.saveOffsets(rdd, ...)
})
25
Single Spark batch
Process
and save data
Save
offsets
#BigDataSpain 2017
Some code with Spark Streaming
val ssc: StreamingContext = new StreamingContext(...)
val stream: DStream[ConsumerRecord[String, String]] =
kafkaStream(topic, zkPath, ssc, offsetsStore, kafkaParams)
stream.foreachRDD(rdd => {
val toSave: Seq[String] = rdd.collect().map(_.value())
saveData(toSave)
offsetsStore.saveOffsets(rdd, zkPath)
})
def kafkaStream(topic: String, zkPath: String, ssc: StreamingContext, offsetsStore: MyOffsetsStore,
kafkaParams: Map[String, Object]): DStream[ConsumerRecord[String, String]] = {
offsetsStore.readOffsets(topic, zkPath) match {
case Some(offsetsMap) =>
KafkaUtils.createDirectStream[String, String](ssc, LocationStrategies.PreferConsistent,
ConsumerStrategies.Assign[String, String](offsetsMap.map(_._1), kafkaParams, offsetsMap))
case None =>
KafkaUtils.createDirectStream[String, String](ssc, LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String](Seq(topic), kafkaParams)
)
}
}
26
#BigDataSpain 2017
Code of offset store
class MyOffsetsStore(zkHosts: String) {
val zkUtils = ZkUtils(zkHosts, 10000, 10000, false)
def saveOffsets(rdd: RDD[_], zkPath: String): Unit = {
val offsetsRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
offsetsRanges.groupBy(_.topic).foreach {
case (topic, offsetsRangesPerTopic) => {
val offsetsRangesStr = offsetsRangesPerTopic
.map(offRang => s"${offRang.partition}:${offRang.untilOffset}").mkString(",")
zkUtils.updatePersistentPath(zkPath, offsetsRangesStr)
}
}}
def readOffsets(topic: String, zkPath: String): Option[Map[TopicPartition, Long]] = {
val (offsetsRangesStrOpt, _) = zkUtils.readDataMaybeNull(zkPath)
offsetsRangesStrOpt match {
case Some(offsetsRangesStr) =>
Some(offsetsRangesStr.split(",").map(s => s.split(":")).map {
case Array(partitionStr, offsetStr) =>
new TopicPartition(topic, partitionStr.toInt) -> offsetStr.toLong
}.toMap)
case None => None
}
}
}
27
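A minimal sketch of wiring the pieces above together (the topic, Zookeeper hosts, path and Kafka parameters are placeholder assumptions; saveData is the application-specific output function from the previous slides):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "my-consumer-group",
  "enable.auto.commit" -> (false: java.lang.Boolean)  // offsets are managed by MyOffsetsStore
)

val ssc = new StreamingContext(new SparkConf().setAppName("kafka-offsets-demo"), Seconds(30))
val offsetsStore = new MyOffsetsStore("zk1:2181,zk2:2181")
val zkPath = "/offsets/my-app/my-topic"

val stream = kafkaStream("my-topic", zkPath, ssc, offsetsStore, kafkaParams)
stream.foreachRDD { rdd =>
  saveData(rdd.collect().map(_.value()))  // process and save the batch first...
  offsetsStore.saveOffsets(rdd, zkPath)   // ...then persist the offsets to Zookeeper
}

ssc.start()
ssc.awaitTermination()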
Thank you!
Questions?
arkadiusz.jachnik@agora.pl
www.linkedin.com/in/arkadiusz-jachnik
