The document discusses a company's migration from its in-house computation engine to Apache Spark. It describes five key issues encountered during the migration: 1) difficulty adapting to Spark's low-level RDD API, 2) limitations of DataSource predicates, 3) incomplete Spark SQL functionality, 4) performance issues with round trips between Spark and other systems, and 5) OutOfMemory errors due to large result sizes. Lessons learned include being aware of new Spark features and data formats, and designing architectures and data structures to minimize data movement between systems.
2. About me
Roman Chukh
11+ years of experience
Java / PHP / Ruby / etc.
~1 year with Apache Spark
Interested in
Data Storage / Data Flow
Monitoring
Provisioning Tools
3. Agenda
Why Spark?
Our Migration to Spark
Issues
… and solutions
… or workarounds
… or at least the lessons learnt
13. Migrating To Spark
The Product
Cloud-based analytics application
Won the Big Data Startup Challenge
In-house computation engine
14. Migrating To Spark
Reasons
More data
More granular data
Support various data backends
Support Machine Learning algorithms
15. Migrating To Spark
Use Cases
❏ supplement Graph database used to store/query big dimensions
❏ supplement RDBMS for querying of high volumes of data
❏ represent existing computation graph as flow of Spark-based operations
16. Migrating To Spark
Star Schema
Diagram: Dimension and Metric tables; boxes labelled Process / Filter Dimension, Filter Metric, Data Processing, Result
23. Issue #1: Low-Level API
RDD: Issues
Functional transformations (e.g. map/reduce) are not as intuitive
Manual memory management
High (dev) maintenance cost
24. Issue #1: Low-Level API
DataFrame: Overview
❏ (Semi-) Structured data
❏ Columnar Storage
❏ Graph mutation
❏ Code generation
❏ "on" by default in 1.5+
❏ "always on" in latest master
25. Issue #1: Low-Level API
DataFrame: Example
lines.json
{"line":"some"}
{"line":"lines"}
{"line":"for"}
{"line":"test"}
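The code that accompanied this slide is not part of the transcript. As a rough stand-in, the sketch below contrasts the two styles discussed under Issue #1 on the same four records: an explicit map/reduce pipeline versus a declarative query. It is plain Python, not Spark. SQLite plays the role of the declarative engine, and the "count lines by length" task and all function names are assumptions made for illustration.

```python
import json
import sqlite3
from functools import reduce

# Contents of lines.json from the slide, one JSON object per line.
LINES_JSON = """{"line":"some"}
{"line":"lines"}
{"line":"for"}
{"line":"test"}"""

records = [json.loads(row) for row in LINES_JSON.splitlines()]


def count_by_length_functional(rows):
    """Low-level style: the caller spells out the map and reduce steps."""
    pairs = map(lambda r: (len(r["line"]), 1), rows)

    def merge(acc, pair):
        length, n = pair
        acc[length] = acc.get(length, 0) + n
        return acc

    return reduce(merge, pairs, {})


def count_by_length_declarative(rows):
    """Declarative style: describe the result, let the engine plan the work."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE lines (line TEXT)")
    conn.executemany("INSERT INTO lines VALUES (?)",
                     [(r["line"],) for r in rows])
    result = dict(conn.execute(
        "SELECT length(line), COUNT(*) FROM lines GROUP BY length(line)"))
    conn.close()
    return result


functional = count_by_length_functional(records)
declarative = count_by_length_declarative(records)
print(functional, declarative)
```

Both functions return the same counts. The declarative version leaves the execution strategy to the engine, which is the property the DataFrame slides rely on.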
26. Issue #1: Low-Level API
DataFrame vs RDD
Source: http://www.slideshare.net/databricks/spark-summit-eu-2015-matei-zaharia-keynote
27. Issue #1: Low-Level API
DataFrame: Graph Mutation
Source: http://www.slideshare.net/databricks/spark-whats-new-whats-coming
28. Issue #1: Low-Level API
Lessons Learnt
❏ Be aware of the new features
❏ … especially why they were introduced
❏ Low-Level API != Better Performance
30. “The fastest way to process big data is to never read it”
Source: http://www.slideshare.net/databricks/spark-whats-new-whats-coming
31. Spark Flow
RDBMS
WHERE
x > 0
Result
Issue #2: DataSource Predicates
Use Cases
SQL
SELECT *
FROM Table
WHERE x > 0
32. Spark Flow
RDBMS
WHERE
x > 0
Result
Issue #2: DataSource Predicates
Use Cases
SQL
SELECT *
FROM Table
WHERE x > 0
AND y < 10
WHERE
y < 10
AND
33. Spark Flow
RDBMS
WHERE
x > 0
Result
Issue #2: DataSource Predicates
Use Cases
SQL
SELECT *
FROM Table
WHERE x > 0
OR y < 10
WHERE
y < 10
OR
35. Issue #2: DataSource Predicates
JDBC … is at a very early stage
❏ Only simple predicates: <, <=, >, >=, =
❏ Only ‘AND’ predicate groups (no OR support)
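The two JDBC restrictions on this slide (simple comparisons only, AND groups only) can be sketched as a small predicate splitter. This is a minimal illustration of the rule as the slide states it, not Spark's actual implementation. The `Cmp`, `And`, `Or` classes and `split_pushdown` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# Comparison operators the slide lists as pushable.
SIMPLE_OPS = {"<", "<=", ">", ">=", "="}


@dataclass(frozen=True)
class Cmp:
    column: str
    op: str
    value: object


@dataclass(frozen=True)
class And:
    left: "Pred"
    right: "Pred"


@dataclass(frozen=True)
class Or:
    left: "Pred"
    right: "Pred"


Pred = Union[Cmp, And, Or]


def split_pushdown(pred: Pred) -> Tuple[List[Pred], List[Pred]]:
    """Split a predicate into (pushed to the source, evaluated afterwards)."""
    if isinstance(pred, And):
        # Each side of an AND can be handled independently.
        left_pushed, left_rest = split_pushdown(pred.left)
        right_pushed, right_rest = split_pushdown(pred.right)
        return left_pushed + right_pushed, left_rest + right_rest
    if isinstance(pred, Cmp) and pred.op in SIMPLE_OPS:
        return [pred], []
    # OR groups and anything else stay on the processing side.
    return [], [pred]


x_gt_0 = Cmp("x", ">", 0)
y_lt_10 = Cmp("y", "<", 10)

and_case = split_pushdown(And(x_gt_0, y_lt_10))
or_case = split_pushdown(Or(x_gt_0, y_lt_10))
print(and_case)
print(or_case)
```

Under these rules `x > 0 AND y < 10` is pushed down in full, while `x > 0 OR y < 10` is not pushed at all, so the source returns every row and the filter runs afterwards.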
36. Issue #2: DataSource Predicates
Apache Parquet … is buggy
❏ Parquet < 1.7
❏ PARQUET-136 - NPE if all column values are null
❏ Parquet 1.7
❏ PARQUET-251 - Possible incorrect results for String/Decimal/Binary columns
37. Issue #2: DataSource Predicates
Lessons Learnt
❏ Know your data format / data storage features
❏ ... and issues
❏ It's hard to check predicate pushdown behavior
❏ SPARK-11390: Pushdown information
❏ Simple aggregation operations are not supported
❏ Check out the talk “The Pushdown of Everything”
39. Issue #3: Spark (sort of) SQL
Missing Functionality
❏ Window functions (e.g. row_number)
❏ Introduced for HiveContext in 1.4
❏ Introduced for SQLContext in 1.5
❏ Subquery (e.g. not exists) support is still missing
❏ Can sometimes be replaced with left semi join
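The join-based workaround for missing subquery support can be sketched with SQLite standing in for Spark SQL. That is an assumption made so the example runs anywhere. SQLite has no LEFT SEMI JOIN syntax, so the semi join is emulated here with a DISTINCT inner join, and NOT EXISTS is rewritten as a left outer join with an IS NULL filter. The `orders` and `blocked` tables are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
CREATE TABLE blocked (customer_id INTEGER);
INSERT INTO orders VALUES (1, 10), (2, 11), (3, 12), (4, 10);
-- Duplicate on purpose: a plain join would return order 2 twice.
INSERT INTO blocked VALUES (11), (11);
""")


def ids(sql):
    """Run a query and return the first column as a list."""
    return [row[0] for row in conn.execute(sql)]


exists_subquery = ids("""
    SELECT o.id FROM orders o
    WHERE EXISTS (SELECT 1 FROM blocked b
                  WHERE b.customer_id = o.customer_id)
    ORDER BY o.id""")

# Semi-join emulation: DISTINCT removes duplicates caused by the join.
exists_as_join = ids("""
    SELECT DISTINCT o.id FROM orders o
    JOIN blocked b ON b.customer_id = o.customer_id
    ORDER BY o.id""")

not_exists_subquery = ids("""
    SELECT o.id FROM orders o
    WHERE NOT EXISTS (SELECT 1 FROM blocked b
                      WHERE b.customer_id = o.customer_id)
    ORDER BY o.id""")

# Anti-join: keep only the rows that found no match on the right side.
not_exists_as_join = ids("""
    SELECT o.id FROM orders o
    LEFT OUTER JOIN blocked b ON b.customer_id = o.customer_id
    WHERE b.customer_id IS NULL
    ORDER BY o.id""")

print(exists_subquery, exists_as_join)
print(not_exists_subquery, not_exists_as_join)
```

The rewrite is only equivalent when the join condition matches the correlation in the subquery, which is presumably why the slide says "sometimes".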
40. Issue #3: Spark (sort of) SQL
Lessons Learnt
❏ Know your use-case
❏ Spark SQL is still quite young
❏ SQL grammar is incomplete
❏ … but actively extended
42. Issue #4: Round Trips
Background
Diagram: Dimension, Process / Filter Dimension (Internal API), Dimension ids, Filter Metric, Metric, Data Processing, Result
44. Issue #4: Round Trips
Resolving Dimensions
Get ID for the ‘Year 2015’
Dimension: WHERE key = ‘2015’ → Result
45. Issue #4: Round Trips
Resolving Dimensions
Get IDs of all past months of the current year
Dimension: WHERE key = ‘2015’ → Dim. id of ‘2015’
Dimension: WHERE parent = 2015 and level = month → Result
46. Issue #4: Round Trips
Resolving Dimensions
Get IDs of all past months of the current year AND their siblings from the previous year
Dimension: WHERE key = ‘2015’ → Dim. id of ‘2015’
Dimension: WHERE parent = 2015 and level = month → Jan, Feb, …
Dimension: WHERE sibling_id = sibling_id - 1 → Result
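The cost pattern in these three slides is that each step needs the previous step's result before it can run, so every step is a separate round trip. The sketch below reproduces that pattern and compares it with one combined query. It is only an illustration under stated assumptions: SQLite stands in for the dimension store, and the `dimension` schema (`dim_key`, `level`, `parent_id`, `sibling_pos`) and the sample rows are hypothetical, since the slides do not show the real schema.

```python
import sqlite3


def make_dimension_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE dimension (
        id INTEGER PRIMARY KEY, dim_key TEXT, level TEXT,
        parent_id INTEGER, sibling_pos INTEGER);
    INSERT INTO dimension VALUES
        (1, '2014', 'year', NULL, NULL),
        (2, '2015', 'year', NULL, NULL),
        (11, '2014-01', 'month', 1, 1),
        (12, '2014-02', 'month', 1, 2),
        (13, '2014-03', 'month', 1, 3),
        (21, '2015-01', 'month', 2, 1),
        (22, '2015-02', 'month', 2, 2);
    """)
    return conn


def run(conn, log, sql, params=()):
    """Execute one query and record it as one round trip."""
    log.append(sql)
    return conn.execute(sql, params).fetchall()


def resolve_step_by_step(conn, year, prev_year):
    log = []
    year_id = run(conn, log,
                  "SELECT id FROM dimension WHERE dim_key = ?", (year,))[0][0]
    months = run(conn, log,
                 "SELECT id, sibling_pos FROM dimension "
                 "WHERE parent_id = ? AND level = 'month'", (year_id,))
    marks = ",".join("?" for _ in months)
    siblings = run(conn, log,
                   "SELECT m.id FROM dimension m "
                   "JOIN dimension y ON m.parent_id = y.id "
                   f"WHERE y.dim_key = ? AND m.sibling_pos IN ({marks})",
                   (prev_year, *[pos for _, pos in months]))
    found = sorted([i for i, _ in months] + [i for (i,) in siblings])
    return found, len(log)


def resolve_in_one_query(conn, year, prev_year):
    log = []
    rows = run(conn, log, """
        SELECT cur.id, prev.id
        FROM dimension y
        JOIN dimension cur ON cur.parent_id = y.id AND cur.level = 'month'
        JOIN dimension py ON py.dim_key = ?
        JOIN dimension prev ON prev.parent_id = py.id
                           AND prev.sibling_pos = cur.sibling_pos
        WHERE y.dim_key = ?""", (prev_year, year))
    return sorted({i for row in rows for i in row}), len(log)


conn = make_dimension_db()
step_ids, step_trips = resolve_step_by_step(conn, "2015", "2014")
one_ids, one_trips = resolve_in_one_query(conn, "2015", "2014")
print(step_ids, step_trips, one_ids, one_trips)
```

Both variants return the same ids, but the step-by-step version pays for three dependent requests where the combined version pays for one.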
47. Issue #4: Round Trips
Lessons Learnt
❏ Spark is better suited for a single complex request
❏ … though not too complex yet
❏ Invest time in architecture analysis and data flow
❏ It might be better to replace a higher-level API
49. “RAM's cheap, but not that cheap”
Source: http://superuser.com/questions/637302/if-ram-is-cheap-why-dont-we-load-everything-to-ram-and-run-it-from-there
50. Issue #5: OOM
Background
❏ Receive request
❏ Select / Filter / Process data (on Spark)
❏ Collect results
❏ … Out Of Memory
51. Issue #5: OOM
Workaround: Requirements
❏ Same data as before
❏ Same external API
52. Issue #5: OOM
Workaround: Before
❏ Result holds ~ 1M objects
❏ (Average) Object size 928 bytes
❏ Result size ~880 MB
53. Issue #5: OOM
Workaround: After
❏ Result holds ~ 1M objects
❏ (Average) Object size 272 bytes
❏ Result size ~261 MB
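The before/after figures can be sanity-checked with a back-of-the-envelope estimate. The sketch below assumes exactly 1,000,000 objects and binary megabytes (MiB). The slides only say "~1M", which presumably explains why the results land close to, but not exactly on, the ~880 MB and ~261 MB they quote. The helper name `result_size_mib` is hypothetical.

```python
def result_size_mib(n_objects, avg_object_bytes):
    """Estimated size of a collected result in MiB (1 MiB = 1024 * 1024 bytes)."""
    return n_objects * avg_object_bytes / (1024 * 1024)


# Assumed object count; the slides only state "~1M".
N_OBJECTS = 1_000_000

before_mib = result_size_mib(N_OBJECTS, 928)
after_mib = result_size_mib(N_OBJECTS, 272)
saving = 1 - after_mib / before_mib

print(f"before: {before_mib:.1f} MiB, after: {after_mib:.1f} MiB, "
      f"saving: {saving:.0%}")
```

Running the same estimate before collecting a result is a cheap way to spot a request that will not fit on the receiving side.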
54. Issue #5: OOM
Lessons Learnt
❏ Invest (more) time in data structures
❏ Some Java performance tips: http://java-performance.com/
❏ Know your serializer
❏ E.g. Kryo (v2.2.1) prepares an object for deserialization by using its default constructor.
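The "invest time in data structures" lesson can be illustrated in a language-neutral way. The talk's own change was made to Java classes that are not shown in the transcript, so the sketch below is only a Python analogue. It compares the shallow per-object footprint of a regular class, which carries a per-instance attribute dictionary, with a `__slots__` class that stores the same three fields in fixed slots. The class and field names are hypothetical.

```python
import sys


class ResultRowDict:
    """Regular class: every instance carries its own attribute dictionary."""

    def __init__(self, dimension_id, metric_id, value):
        self.dimension_id = dimension_id
        self.metric_id = metric_id
        self.value = value


class ResultRowSlots:
    """Compact class: fixed slots, no per-instance dictionary."""

    __slots__ = ("dimension_id", "metric_id", "value")

    def __init__(self, dimension_id, metric_id, value):
        self.dimension_id = dimension_id
        self.metric_id = metric_id
        self.value = value


def shallow_size(obj):
    """Size of the object itself plus its attribute dict, if it has one.

    Field values are not counted; they are identical in both variants.
    """
    size = sys.getsizeof(obj)
    if hasattr(obj, "__dict__"):
        size += sys.getsizeof(obj.__dict__)
    return size


dict_row = ResultRowDict(1, 2, 3.0)
slots_row = ResultRowSlots(1, 2, 3.0)
print(shallow_size(dict_row), shallow_size(slots_row))
```

The exact byte counts depend on the interpreter version, so none are quoted here. The point is the same as on the slides: multiplied by a million result objects, per-object overhead decides whether the result fits in memory.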
56. “The fact that there is a highway to hell and only a stairway to heaven says a lot about the traffic trends”
Source: https://www.reddit.com/r/Showerthoughts/comments/2wbvou/the_fact_that_there_is_a_highway_to_hell_and_only