5. • Machine Learning, Deep Learning, AI
• TensorFrames, TensorFlow on Spark, Apache MXNet, ...
• Cloudera Data Science Workbench
• IBM Data Science Experience (partnered with Hortonworks)
• Streaming
• Kafka, Beam, Structured Streaming, Flink, Apex,
Hortonworks Streaming Analytics Manager, etc.
Hot Topics
Foot note
6. • Spark still dominates the big data world and the
research area
• Innovations in streaming:
• event time, watermark, state management,
exactly-once, rescaling, streaming SQL
• Big Data X Cloud
• Hadoop, Hive, HBase, Spark on S3
Tech Trend
11. • RocketMQ (Similar to Kafka, Alibaba, graduated)
• CarbonData (File format, Huawei, graduated)
• MXNet (DL, Amazon)
• Apache Gearpump (Streaming, Intel China)
• Apache Omid (HBase ACID, Yahoo!)
Some interesting new projects
12. • RocketMQ, CarbonData, Gearpump, etc.
• Kylin (BI, OLAP cube)
• Alluxio (formerly Tachyon, in-memory cache)
• Blink (derived from Flink, Alibaba)
• MaxCompute (ODPS, Alibaba)
• HBaseCon Asia 2017 in Shenzhen, Huawei
China is playing a BIG role
16. • Treat a stream as an unbounded table
• Apply a query with an output mode specified:
• complete, append, update
• Query an input table, get a (filtered) result table
• The engine converts the query into an incremental query on
new data to generate the output
Concept
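To make the incremental-query idea concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's engine or API: the class name `IncrementalWordCount` and its methods are hypothetical, and only the `complete` and `update` output modes are modelled. Each micro-batch folds just the new rows into the stored result table, and the output mode decides how much of that table is emitted.

```python
from collections import Counter


class IncrementalWordCount:
    """Toy model of an incremental 'count rows per word' query (hypothetical, not Spark)."""

    def __init__(self, output_mode="complete"):
        if output_mode not in ("complete", "update"):
            raise ValueError("this toy supports only 'complete' and 'update'")
        self.output_mode = output_mode
        self.state = Counter()  # result table kept between micro-batches

    def process_batch(self, new_rows):
        """Fold only the new rows into the state and return this trigger's output."""
        delta = Counter(new_rows)
        self.state.update(delta)
        if self.output_mode == "complete":
            # complete: emit the whole result table every time
            return dict(self.state)
        # update: emit only the rows whose value changed in this batch
        return {word: self.state[word] for word in delta}


complete = IncrementalWordCount("complete")
update = IncrementalWordCount("update")
for batch in (["a", "b", "a"], ["b", "c"]):
    complete_out = complete.process_batch(batch)
    update_out = update.process_batch(batch)

print(complete_out)  # {'a': 2, 'b': 2, 'c': 1}
print(update_out)    # {'b': 2, 'c': 1}
```

Append mode is left out here because it only emits rows that will never change again, which for an aggregation requires knowing when a group is final.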
18. • Event time (handles late data)
• Watermark (limits the stateful data kept in memory)
• Checkpoints (offsets) stored in JSON (finally!)
• State management: mapGroupsWithState (Spark 2.2)
• Stream-stream join (Spark 2.3)
• Relies on the watermark to decide when to drop data
that can never yield a join result
New features
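The interplay of event time and watermark listed above can be sketched with a small plain-Python model. This is an illustration under simplifying assumptions, not Spark's implementation: the class `WatermarkedWindowCount`, integer timestamps, tumbling windows, and the rule "watermark = max event time seen minus allowed lateness, advanced at the end of each batch" are all choices made for this sketch. It shows late events still being counted while their window is open, and state being dropped once the watermark passes the window end.

```python
class WatermarkedWindowCount:
    """Toy event-time windowed count with a watermark (hypothetical, not Spark)."""

    def __init__(self, window_size, allowed_lateness):
        self.window_size = window_size
        self.allowed_lateness = allowed_lateness
        self.max_event_time = 0
        self.watermark = 0
        self.open_windows = {}  # window start -> running count (the kept state)

    def process_batch(self, event_times):
        """Count events per window; return windows finalized by this batch."""
        for t in event_times:
            start = t - t % self.window_size
            if start + self.window_size <= self.watermark:
                continue  # too late: this window was already finalized and dropped
            self.open_windows[start] = self.open_windows.get(start, 0) + 1
            self.max_event_time = max(self.max_event_time, t)

        # Advance the watermark only after the batch has been processed.
        self.watermark = max(
            self.watermark, self.max_event_time - self.allowed_lateness
        )

        finalized = {}
        for start in sorted(self.open_windows):
            if start + self.window_size <= self.watermark:
                # The window can no longer change: emit it and free its state.
                finalized[start] = self.open_windows.pop(start)
        return finalized


counter = WatermarkedWindowCount(window_size=10, allowed_lateness=5)
out1 = counter.process_batch([1, 4, 12])  # watermark becomes 7, nothing final yet
out2 = counter.process_batch([3, 21])     # late event 3 still counts; watermark 16
out3 = counter.process_batch([2, 30])     # event 2 is dropped; watermark 25

print(out1, out2, out3)  # {} {0: 3} {10: 1}
```

The same cutoff idea is what lets a stream-stream join discard buffered rows: once the watermark guarantees that no matching row can still arrive, the buffered row can be removed.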
19. • SQL interface supported
• Performance consideration:
• Runtime codegen, Off-heap, execution plan
optimization... all available in streaming
• A bright future with more developer support (!?)
Advantages
20. • Dealing with Encoders is quite annoying
• Output mode depends on operations [1]
• Stateful operations are still not intuitive compared to Flink's
state management
Disadvantages
21. • Ending every query with writeStream.start is mandatory now
• spark.readStream...T...writeStream.start
• org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
Disadvantages
22. • Hard to combine with other data, compared to the powerful
foreachRDD
• org.apache.spark.sql.AnalysisException: Right outer join with a streaming DataFrame/Dataset on the left is not supported;;
• org.apache.spark.sql.AnalysisException: Union between streaming and batch DataFrames/Datasets is not supported;;
Disadvantages
23. • Need to write a ForeachWriter if the sink is not supported
Disadvantages
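As a rough illustration of what writing a custom sink involves, the sketch below mimics the open / process / close lifecycle of a ForeachWriter in plain Python. It does not import Spark: `ListSinkWriter`, the in-memory `store`, and the `run_partition` driver that stands in for the engine are all hypothetical names invented for this example. The assumption modelled here is that the engine calls `open` once per partition and epoch, calls `process` for each row only if `open` returned True, and always calls `close` afterwards.

```python
class ListSinkWriter:
    """Toy writer following an open/process/close contract (hypothetical sink)."""

    def __init__(self, store):
        self.store = store  # dict standing in for an external system
        self.buffer = []
        self.key = None
        self.skip = False

    def open(self, partition_id, epoch_id):
        """Prepare for one partition of one epoch; False means 'skip these rows'."""
        self.key = (partition_id, epoch_id)
        self.buffer = []
        # If this partition/epoch was already committed, a replay must not rewrite it.
        self.skip = self.key in self.store
        return not self.skip

    def process(self, row):
        """Buffer one row; nothing is visible in the store until close succeeds."""
        self.buffer.append(row)

    def close(self, error):
        """Commit the buffered rows only if no error occurred and nothing was skipped."""
        if error is None and not self.skip:
            self.store[self.key] = list(self.buffer)


def run_partition(writer, partition_id, epoch_id, rows):
    """Hypothetical stand-in for the engine driving the writer."""
    error = None
    if writer.open(partition_id, epoch_id):
        try:
            for row in rows:
                writer.process(row)
        except Exception as exc:  # report any failure to close()
            error = exc
    writer.close(error)


store = {}
writer = ListSinkWriter(store)
run_partition(writer, 0, 1, ["x", "y"])
run_partition(writer, 0, 1, ["x", "y"])  # replay of the same epoch is skipped
run_partition(writer, 1, 1, ["z"])

print(store)  # {(0, 1): ['x', 'y'], (1, 1): ['z']}
```

Keying the commit on partition and epoch is one way to make replays idempotent; a real sink would need an equivalent check against the external system.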
24. • Use Structured Streaming
• if you need event time accuracy
• if you need stream-stream join
• if you need performance
• Use Spark Streaming
• if you want more control over your compute logic
• if you can't do it in Structured Streaming ;)
Recap
25. • Easy, Scalable, Fault-Tolerant Stream Processing
with Structured Streaming in Apache Spark
• Easy, Scalable, Fault-Tolerant Stream Processing
with Structured Streaming in Apache Spark –
continues
• Deep Dive into Stateful Stream Processing in
Structured Streaming
Ref
26. TensorFlowOnSpark
Scalable TensorFlow Learning on Spark Clusters
Lee Yang, Andrew Feng
Yahoo Big Data ML Platform Team
29. • Achieving HBase Multi-Tenancy with RegionServer
Groups and Favored Nodes
Ref
30. • Big data is flying to the cloud
• Batch:
SQL optimization everywhere
• Streaming:
Event time and exactly-once are the defaults
• AI:
TensorFlow wins the war. Try TensorFlow on Spark!
Summary