Apache Spark is a great solution for building Big Data applications. It provides really fast SQL-like processing, a machine learning library, and a streaming module for near-real-time processing of data streams. Unfortunately, during application development and production deployments we often encounter difficulties when mixing various data sources or bulk loading computed data into SQL or NoSQL databases.
https://www.bigdataspain.org/2017/talk/apache-spark-vs-rest-of-the-world-problems-and-solutions
Big Data Spain 2017
16th - 17th November Kinépolis Madrid
Apache Spark vs rest of the world – Problems and Solutions by Arkadiusz Jachnik at Big Data Spain 2017
1.
2. Apache Spark vs rest of the world
- Problems and Solutions
Arkadiusz Jachnik
3. #BigDataSpain 2017
About Arkadiusz
• Senior Data Scientist at AGORA SA
- user profiling & content personalization
- recommendation system
• PhD Student at Poznan University of Technology
- multi-class & multi-label classification
- multi-output prediction
- recommendation algorithms
4. #BigDataSpain 2017
Agora’s BigData Team
[Team photo] Luiza (my boss), Arek (that's me!), Wojtek, Paweł, Paweł, Dawid, Bartek, Jacek, Daniel. We are all here at #BDS! I invite you to these guys' talk :)
6. #BigDataSpain 2017
Spark in Agora's BigData Platform
[Platform diagram] Hadoop cluster running: data collecting and integration, user profiling system, data analytics, recommendation system, data enrichment and content structurisation. Spark (own build, v2.2): Spark SQL, MLlib, Spark Streaming and Structured Streaming; over 3 years of experience.
7. #BigDataSpain 2017
Problems discussed today
1. Processing parts of data and loading them from Spark to a relational database in parallel
2. Bulk loading to HBase
3. From a relational database to a Spark DataFrame (with user-defined functions)
4. From HBase to Spark via a Hive external table (with timestamps of HBase cells)
5. Spark Streaming with Kafka - how to implement your own offset manager
8. #BigDataSpain 2017
I will show some code…
• I will show real technical problems we have encountered during Spark deployments
• We have used Spark at Agora for over 3 years, so we have a lot of hands-on experience
• I will present practical solutions, showing some code in Scala
• Scala is the natural language for Spark
9. 1. Processing and writing parts of data in parallel
Problem description:
• We have a huge processed DataFrame of computed recommendations for users
• There are 4 defined types of recommendations
• For each type we want to take the top-K recommendations for each user
• Recommendations of each type should be loaded into a different PostgreSQL table
#BigDataSpain 2017
User      | Recommendation type | Article   | Score
Grzegorz  | TYPE_3              | Article F | 1.0
Bożena    | TYPE_4              | Article B | 0.2
Grażyna   | TYPE_2              | Article B | 0.2
Grzegorz  | TYPE_3              | Article D | 0.9
Krzysztof | TYPE_3              | Article D | 0.4
Grażyna   | TYPE_2              | Article C | 0.9
Grażyna   | TYPE_1              | Article D | 0.3
Bożena    | TYPE_2              | Article E | 0.9
Grzegorz  | TYPE_1              | Article E | 1.0
Grzegorz  | TYPE_1              | Article A | 0.7
10. #BigDataSpain 2017
Code intro: input & output
Expected output: four groups, one per recommendation type, each saved to its own table:
TYPE_1 (top 5 recos per user, save to table_1): Grzegorz: Article A 1.0, Article F 0.9, Article C 0.9, Article D 0.8, Article B 0.75; Bożena: … …
TYPE_2 (top 4 recos per user, save to table_2): Krzysztof: Article F 1.0, Article D 1.0, Article C 0.8, Article B 0.85; Grażyna: Article C 1.0, … …
TYPE_3 (top 3 recos per user, save to table_3): Grzegorz: Article E 1.0, Article B 0.75, Article A 0.8; Bożena: Article E 0.9, Article A 0.75, Article C 0.75
TYPE_4 (top 2 recos per user, save to table_4): Grażyna: Article A 1.0, Article F 0.9; Bożena: Article B 0.9, Article D 0.9; Grzegorz: Article B 1.0, Article E 0.95
11. #BigDataSpain 2017
Standard approach
recoTypes.foreach(recoType => {
  val topNrecommendations = processedData.where($"type" === recoType.code)
    .withColumn("row_number", row_number().over(Window.partitionBy("name").orderBy(desc("score"))))
    .where(col("row_number") <= recoType.recoNum).drop("row_number")
  RecoDAO.save(topNrecommendations.collect().map(OutputReco(_)), recoType.tableName)
})
no parallelism across recommendation types; within a single type Spark parallelizes the job, but most of the tasks are skipped
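The snippets on these slides call RecoDAO.save and OutputReco, which are not shown in the talk. A minimal sketch of what they might look like (hypothetical class names, column names and connection details):

import java.sql.DriverManager
import org.apache.spark.sql.Row

// Hypothetical output row and DAO referenced in the snippets - the real classes are not shown in the talk.
case class OutputReco(user: String, article: String, score: Double)

object OutputReco {
  def apply(row: Row): OutputReco =
    OutputReco(row.getAs[String]("name"), row.getAs[String]("article"), row.getAs[Double]("score"))
}

object RecoDAO {
  // Plain JDBC batch insert on the driver; the collected top-K rows per type are small.
  def save(recos: Seq[OutputReco], tableName: String): Unit = {
    val connection = DriverManager.getConnection("jdbc:postgresql://host:port/database", "user", "password")
    try {
      val statement = connection.prepareStatement(
        s"INSERT INTO $tableName (user_name, article, score) VALUES (?, ?, ?)")
      recos.foreach { reco =>
        statement.setString(1, reco.user)
        statement.setString(2, reco.article)
        statement.setDouble(3, reco.score)
        statement.addBatch()
      }
      statement.executeBatch()
      statement.close()
    } finally connection.close()
  }
}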
12. #BigDataSpain 2017
maybe we can add .par ?
recoTypes.par.foreach(recoType => {
  val topNrecommendations = processedData.where($"type" === recoType.code)
    .withColumn("row_number", row_number().over(Window.partitionBy("name").orderBy(desc("score"))))
    .where(col("row_number") <= recoType.recoNum).drop("row_number")
  RecoDAO.save(topNrecommendations.collect().map(OutputReco(_)), recoType.tableName)
})
parallelism, but too many tasks :(
13. #BigDataSpain 2017
Our trick
parallelizeProcessing(recoTypes, (recoType: RecoType) => {
  val topNrecommendations = processedData.where($"type" === recoType.code)
    .withColumn("row_number", row_number().over(Window.partitionBy("name").orderBy(desc("score"))))
    .where(col("row_number") <= recoType.recoNum).drop("row_number")
  RecoDAO.save(topNrecommendations.collect().map(OutputReco(_)), recoType.tableName)
})

def parallelizeProcessing(recoTypes: Seq[RecoType], f: RecoType => Unit) = {
  f(recoTypes.head)
  if (recoTypes.tail.nonEmpty) recoTypes.tail.par.foreach(f(_))
}
execute the Spark action for the first type first…
…then parallelize the rest
14. 2. Fast bulk-loading to HBase
Problems with the standard HBase client (inserts with the Put class):
• Difficult integration with Spark
• Complicated parallelization
• For non pre-split tables, problems with *Region*Exception-s
• Slow for millions of rows
#BigDataSpain 2017
[Diagram] Spark DataFrame / RDD → .foreachPartition → hTable.put(…) executed on every partition
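For reference, the "standard approach" sketched in the diagram might look roughly like this (a minimal sketch; the dataFrame, table name, column family and column names are assumptions, not from the talk):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

// One HBase connection per partition and one Put per row - simple, but slow for millions of rows.
dataFrame.rdd.foreachPartition { rows =>
  val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
  val hTable = connection.getTable(TableName.valueOf("hbase_table_name"))
  try {
    rows.foreach { row =>
      val put = new Put(Bytes.toBytes(row.getAs[String]("rowKey")))
      put.addColumn(Bytes.toBytes("col-fam"), Bytes.toBytes("value"),
        Bytes.toBytes(row.getAs[String]("value")))
      hTable.put(put)
    }
  } finally {
    hTable.close()
    connection.close()
  }
}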
15. #BigDataSpain 2017
Idea
Our approach is based on:
https://github.com/zeyuanxy/spark-hbase-bulk-loading
Input RDD:
data: RDD[( //pair RDD
Array[Byte], //HBase row key
Map[ //data:
String, //column-family
Array[(
String, //column name
(String, //cell value
Long) //timestamp
)]
]
)]
General idea:
We have to save our RDD data as HFiles (HBase stores its data in such files) and load them into the given pre-existing table.
General steps:
1. Implement a Spark Partitioner that defines how our key-value pair RDD should be partitioned by HBase row key
2. Repartition and sort the RDD within column families and the starting row keys of every HBase region
3. Save the RDD to HDFS as HFiles using the rdd.saveAsNewAPIHadoopFile method
4. Load the files into the table with LoadIncrementalHFiles (HBase API)
16. #BigDataSpain 2017
Implementation
// Prepare hConnection, tableName, hTable ...
val regionLocator = hConnection.getRegionLocator(tableName)
val columnFamilies = hTable.getTableDescriptor.getFamiliesKeys.map(Bytes.toString(_))
val partitioner = new HFilePartitioner(regionLocator.getStartKeys, fraction)

// prepare partitioned RDD
val rdds = for {
  family <- columnFamilies
  rdd = data
    .collect { case (key, dataMap) if dataMap.contains(family) => (key, dataMap(family)) }
    .flatMap { case (key, familyDataMap) =>
      familyDataMap.map { case (column: String, valueTs: (String, Long)) =>
        (((key, Bytes.toBytes(column)), valueTs._2), Bytes.toBytes(valueTs._1))
      }
    }
} yield getPartitionedRdd(rdd, family, partitioner)

val rddToSave = rdds.reduce(_ ++ _)

// prepare map-reduce job for bulk-load
val job = Job.getInstance(hbaseConfig)
HFileOutputFormat2.configureIncrementalLoad(job, hTable, regionLocator)

// prepare path for HFiles output
val fs = FileSystem.get(hbaseConfig)
val hFilePath = new Path(...)
try {
  rddToSave.saveAsNewAPIHadoopFile(hFilePath.toString,
    classOf[ImmutableBytesWritable], classOf[KeyValue],
    classOf[HFileOutputFormat2], job.getConfiguration)
  // prepare HFiles for incremental load by setting
  // folder permissions to read/write/exec for all...
  setRecursivePermission(hFilePath)
  val loader = new LoadIncrementalHFiles(hbaseConfig)
  loader.doBulkLoad(hFilePath, hConnection.getAdmin, hTable, regionLocator)
} // finally close resources, ...
Steps highlighted in the code: (1) prepare the HBase connection, table and region locator; (2) prepare the Spark partitioner for HBase regions; (3) repartition and sort the data within partitions by the partitioner; (4) save HFiles to HDFS by saveAsNewAPIHadoopFile; (5) load the HFiles into the HBase table.
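The HFilePartitioner and getPartitionedRdd helpers used above are not shown on the slides. A simplified sketch of what they could look like, loosely following the spark-hbase-bulk-loading project (the fraction-based sub-splitting and the in-partition ordering are simplifying assumptions):

import org.apache.hadoop.hbase.KeyValue
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Step 1: send every cell to the partition of the HBase region owning its row key;
// each region is further split into `fraction` sub-partitions (simplified).
class HFilePartitioner(startKeys: Array[Array[Byte]], fraction: Int) extends Partitioner {

  override def numPartitions: Int = startKeys.length * fraction

  override def getPartition(key: Any): Int = {
    val rowKey = key.asInstanceOf[((Array[Byte], Array[Byte]), Long)]._1._1
    // index of the last region whose start key is <= the row key
    val region = startKeys.lastIndexWhere(startKey => Bytes.compareTo(rowKey, startKey) >= 0).max(0)
    region * fraction + (Bytes.hashCode(rowKey) & Int.MaxValue) % fraction
  }
}

// Step 2: repartition to match the HBase regions, sort cells within each partition and
// map them to the (ImmutableBytesWritable, KeyValue) pairs expected by HFileOutputFormat2.
def getPartitionedRdd(rdd: RDD[(((Array[Byte], Array[Byte]), Long), Array[Byte])],
                      family: String,
                      partitioner: HFilePartitioner): RDD[(ImmutableBytesWritable, KeyValue)] = {
  // cells are ordered by row key, then column, then timestamp descending (mirroring HBase's cell ordering)
  implicit val cellOrdering: Ordering[((Array[Byte], Array[Byte]), Long)] =
    new Ordering[((Array[Byte], Array[Byte]), Long)] {
      def compare(a: ((Array[Byte], Array[Byte]), Long), b: ((Array[Byte], Array[Byte]), Long)): Int = {
        val rowCmp = Bytes.compareTo(a._1._1, b._1._1)
        if (rowCmp != 0) rowCmp
        else {
          val colCmp = Bytes.compareTo(a._1._2, b._1._2)
          if (colCmp != 0) colCmp else b._2.compareTo(a._2)
        }
      }
    }

  rdd
    .repartitionAndSortWithinPartitions(partitioner)
    .map { case (((rowKey, column), ts), value) =>
      (new ImmutableBytesWritable(rowKey),
       new KeyValue(rowKey, Bytes.toBytes(family), column, ts, value))
    }
}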
17. #BigDataSpain 2017
Keep in mind
• Set the HBase parameter hbase.mapreduce.bulkload.max.hfiles.perRegion.perFamily optimally (default 32)
• For large data, too small a value of this parameter may cause IllegalArgumentException: Size exceeds Integer.MAX_VALUE
• Create HBase tables with splits adapted to the expected row keys
  - example: for row keys of HEX IDs, create the table with splits like:
    create 'hbase_table_name', 'col-fam', {SPLITS => ['0','1','2','3','4','5','6','7','8','9','a','b','c','d','e','f']}
  - for subsequent single puts this minimizes *Region*Exceptions
18. #BigDataSpain 2017
3. Loading data from Postgres to Spark
This is possible for data from Hive:
val toUpperCase: String => String = _.toUpperCase
sparkSession.udf.register("toUpperCaseUdf", toUpperCase)
val data: DataFrame = sparkSession.sql(
  "SELECT id, toUpperCaseUdf(code) FROM types"
)
But this is not possible for data from JDBC (for example PostgreSQL):
val toUpperCase: String => String = _.toUpperCase
sparkSession.udf.register("toUpperCaseUdf", toUpperCase)
val jdbcUrl = s"jdbc:postgresql://host:port/database"
val data: DataFrame = sparkSession.read
  .jdbc(jdbcUrl,
    "(SELECT toUpperCaseUdf(code) " +
    "FROM codes) as codesData",
    connectionConf)
This query is executed by Postgres (not Spark), so the Spark UDF is unknown there; you can only specify a Postgres table name or subquery here. And how do we parallelize data loading?
19. #BigDataSpain 2017
Our solution
Try to load the 'raw' data without UDFs first, and then apply the UDF with .withColumn as an expression:
val toUpperCase: String => String = _.toUpperCase
sparkSession.udf.register("toUpperCaseUdf", toUpperCase)
val jdbcUrl = s"jdbc:postgresql://host:port/database"
val data: DataFrame = sparkSession.read
  .jdbc(jdbcUrl,
    "(SELECT code " +
    "FROM codes) as codesData",
    connectionConf)
  .withColumn("upperCode",
    expr("toUpperCaseUdf(code)"))
.jdbc produces a DataFrame… but it is a single partition!
We can split the table read across executors on a selected numeric column:
val jdbcUrl = s"jdbc:postgresql://host:port/database"
val data: DataFrame = sparkSession.read
  .jdbc(
    url = jdbcUrl,
    table = "(SELECT code, type_id " +
      "FROM codes) as codesData",
    columnName = "type_id",
    lowerBound = 1L,
    upperBound = 100L,
    numPartitions = 10,
    connectionProperties = connectionConf)
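The two pieces can also be combined: a partitioned JDBC read with the UDF applied on the Spark side. A minimal sketch (same hypothetical table, column and bounds as above); calling the UDF object directly on a column also avoids having to register it by name:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

val toUpperCase: String => String = _.toUpperCase
val toUpperCaseUdf = udf(toUpperCase)

val jdbcUrl = s"jdbc:postgresql://host:port/database"
val data: DataFrame = sparkSession.read
  .jdbc(
    url = jdbcUrl,
    table = "(SELECT code, type_id FROM codes) as codesData",
    columnName = "type_id",                                // numeric column the read is split on
    lowerBound = 1L,
    upperBound = 100L,
    numPartitions = 10,
    connectionProperties = connectionConf)
  .withColumn("upperCode", toUpperCaseUdf(col("code")))    // the UDF runs in Spark, not in Postgres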
21. #BigDataSpain 2017
4. From HBase to Spark by Hive
A commonly used method for loading data from HBase into Spark is a Hive external table:
CREATE EXTERNAL TABLE hive_view_on_hbase (
  key int,
  value string
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key, cf1:val"
)
TBLPROPERTIES (
  "hbase.table.name" = "xyz"
);
[Example] HBase table with column family "cities", one row per user:
  72A9DBA74524: Poznan=40, Warsaw=5, Cracow=1, Gdansk=3
  58383B36275A: Warsaw=120, Cracow=60, Gdansk=5
  009D22419988: Poznan=75, Warsaw=1
Through the HiveHBaseHandler this becomes a table (user_id, cities_map, last_city):
  72A9DBA74524 | map(Poznan->40, Warsaw->5, Cracow->1, Gdansk->3) | ?
  58383B36275A | map(Warsaw->120, Cracow->60, Gdansk->5) | ?
  009D22419988 | map(Poznan->75, Warsaw->1) | ?
But how to get the last (most recent) values? Where are the timestamps?
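Once the Hive external table exists, reading it from Spark is an ordinary SQL query. A minimal sketch (it assumes the SparkSession is built with enableHiveSupport() so that it sees the Hive metastore):

import org.apache.spark.sql.{DataFrame, SparkSession}

val sparkSession: SparkSession = SparkSession.builder()
  .appName("hbase-via-hive")
  .enableHiveSupport()   // makes tables registered in the Hive metastore visible to Spark
  .getOrCreate()

// The HBase-backed Hive table defined above can be queried like any other table.
val data: DataFrame = sparkSession.sql("SELECT key, value FROM hive_view_on_hbase")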
22. #BigDataSpain 2017
Our case
• We use the HDP distribution of Hadoop with HBase 1.1.x
• It is possible to add the latest row-modification timestamp to the Hive view on an HBase table:
CREATE EXTERNAL TABLE hive_view_on_hbase (
  key int,
  value string,
  ts timestamp
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  'hbase.columns.mapping' = ':key, cf1:val, :timestamp'
)
TBLPROPERTIES (
  'hbase.table.name' = 'xyz'
);
• How to extract the timestamp of each cell?
• Answer: rewrite the Hive-HBase-Handler that is responsible for creating the Hive views on HBase tables :) … but first …
• Do not download the Hive source code from the Hive GitHub repository - check your Hadoop distribution! (for example, HDP has its own code branch)
24. #BigDataSpain 2017
There is a lot of code…
…but we have some tips on how to change the Hive-HBase-Handler:
• The functions that parse the columns of hbase.columns.mapping are located in HBaseSerDe.java, which returns a ColumnMappings object
• The LazyHBaseRow class stores the data of an HBase row
• Timestamps of processed HBase cells can be read from the rows loaded by the scanner in the LazyHBaseCellMap class
• The column parser and the HBase scanner are initialized in HBaseStorageHandler.java
25. #BigDataSpain 2017
5. Spark + Kafka: own offset manager
Problem description:
• Spark output operations are at-least-once
• For exactly-once semantics, you must store offsets after an idempotent output, or in an atomic transaction alongside the output
• Options:
1. Checkpoints
   + easy to enable with Spark checkpointing
   - the output operation must be idempotent
   - cannot recover from a checkpoint if the application code has changed
2. Own data store
   + robust to changes in your application code
   + you can use data stores that support transactions
   + exactly-once semantics
[Diagram] A single Spark batch: process and save the data, then save the offsets.
Image source: Spark Streaming documentation
https://spark.apache.org/docs/latest/streaming-programming-guide.html
26. #BigDataSpain 2017
Some code with Spark Streaming
val ssc: StreamingContext = new StreamingContext(…)
val stream: DStream[ConsumerRecord[String, String]] = ...
stream.foreachRDD(rdd => {
  val toSave: Seq[String] = rdd.collect().map(_.value())
  saveData(toSave)
  offsetsStore.saveOffsets(rdd, ...)
})
27. #BigDataSpain 2017
Some code with Spark Streaming
val ssc: StreamingContext = new StreamingContext(...)
val stream: DStream[ConsumerRecord[String, String]] =
kafkaStream(topic, zkPath, ssc, offsetsStore, kafkaParams)
stream.foreachRDD(rdd => {
  val toSave: Seq[String] = rdd.collect().map(_.value())
  saveData(toSave)
  offsetsStore.saveOffsets(rdd, zkPath)
})

def kafkaStream(topic: String, zkPath: String, ssc: StreamingContext, offsetsStore: MyOffsetsStore,
                kafkaParams: Map[String, Object]): DStream[ConsumerRecord[String, String]] = {
  offsetsStore.readOffsets(topic, zkPath) match {
    case Some(offsetsMap) =>
      KafkaUtils.createDirectStream[String, String](ssc, LocationStrategies.PreferConsistent,
        ConsumerStrategies.Assign[String, String](offsetsMap.map(_._1), kafkaParams, offsetsMap))
    case None =>
      KafkaUtils.createDirectStream[String, String](ssc, LocationStrategies.PreferConsistent,
        ConsumerStrategies.Subscribe[String, String](Seq(topic), kafkaParams))
  }
}
28. #BigDataSpain 2017
Code of offset store
import kafka.utils.ZkUtils
import org.apache.kafka.common.TopicPartition
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.kafka010.HasOffsetRanges

class MyOffsetsStore(zkHosts: String) {

  val zkUtils = ZkUtils(zkHosts, 10000, 10000, false)

  def saveOffsets(rdd: RDD[_], zkPath: String): Unit = {
    val offsetsRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    offsetsRanges.groupBy(_.topic).foreach {
      case (topic, offsetsRangesPerTopic) =>
        val offsetsRangesStr = offsetsRangesPerTopic
          .map(offRang => s"${offRang.partition}:${offRang.untilOffset}").mkString(",")
        zkUtils.updatePersistentPath(zkPath, offsetsRangesStr)
    }
  }

  def readOffsets(topic: String, zkPath: String): Option[Map[TopicPartition, Long]] = {
    val (offsetsRangesStrOpt, _) = zkUtils.readDataMaybeNull(zkPath)
    offsetsRangesStrOpt match {
      case Some(offsetsRangesStr) =>
        Some(offsetsRangesStr.split(",").map(s => s.split(":")).map {
          case Array(partitionStr, offsetStr) =>
            new TopicPartition(topic, partitionStr.toInt) -> offsetStr.toLong
        }.toMap)
      case None => None
    }
  }
}
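A rough sketch of how these pieces could be wired together (the host names, topic, Zookeeper path and batch interval are made up; kafkaParams and saveData are the values from the previous slides):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical wiring of the offset store, the kafkaStream helper and the batch logic above.
val sparkConf = new SparkConf().setAppName("kafka-own-offsets")
val ssc = new StreamingContext(sparkConf, Seconds(30))

val offsetsStore = new MyOffsetsStore("zk-host-1:2181,zk-host-2:2181")
val zkPath = "/consumers/my-app/offsets/my-topic"

val stream = kafkaStream("my-topic", zkPath, ssc, offsetsStore, kafkaParams)

stream.foreachRDD(rdd => {
  saveData(rdd.collect().map(_.value()))  // process and save the batch first...
  offsetsStore.saveOffsets(rdd, zkPath)   // ...then persist its offsets to Zookeeper
})

ssc.start()
ssc.awaitTermination()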