2017 big data landscape and cutting edge innovations public

2017 Big Data Landscape and Innovations
APAC Data Team
Evans Ye
Feb. 2018

Agenda
• 2017 Big Data Landscape
• Cutting Edge Innovations
• Spark Structured Streaming
• TensorFlow on Spark (copyright)
• HBase Multi-Tenancy: RSGroups and Favored Nodes (copyright)

Big Data Landscape

Hot Topics
• Machine Learning, Deep Learning, AI
• TensorFrames, TensorFlow on Spark, Apache MXNet, ...
• Cloudera Data Science Workbench
• IBM Data Science Experience (partnered with Hortonworks)
• Streaming
• Kafka, Beam, Structured Streaming, Flink, Apex, Hortonworks Streaming Analytics Manager, etc.

Tech Trend
• Spark still dominates the big data world and the research area
• Innovations in streaming:
• event time, watermark, state management, exactly-once, rescaling, streaming SQL
• Big Data X Cloud
• Hadoop, Hive, HBase, Spark on S3

SQL

SQL -> NoSQL -> NewSQL
• ACID: Ignite, Trafodion, Omid (incubator)
• Predicate Push-Down, Runtime Filter (Bloom filter)
• Rule-Based to Cost-Based Optimization: Spark Catalyst (2.2), Calcite
• Streaming SQL: Blink, KSQL, Storm, Samza
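
As a rough illustration of the runtime-filter idea on this slide, the plain-Python sketch below builds a Bloom filter from the join keys of a small table and uses it to skip rows of a large table before the join probe. The `BloomFilter` class, its sizing parameters, the `filtered_join` helper, and the sample tables are all assumptions made for this example; none of it is code from the engines named above.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def filtered_join(small, large):
    """Join on the first field, pre-filtering `large` with a Bloom filter."""
    bloom = BloomFilter()
    lookup = {}
    for key, value in small:
        bloom.add(key)
        lookup[key] = value

    probed = 0
    result = []
    for key, payload in large:
        if not bloom.might_contain(key):
            continue  # skipped early, before the join probe
        probed += 1
        if key in lookup:  # the exact check removes false positives
            result.append((key, lookup[key], payload))
    return result, probed


# Hypothetical sample data: a 2-row dimension table and a 100-row fact table.
small_table = [(1, "a"), (2, "b")]
large_table = [(k, f"row-{k}") for k in range(100)]
joined, probed = filtered_join(small_table, large_table)
print(joined, probed)
```

The filter can only say "definitely absent" or "possibly present", which is why the exact lookup is still needed after it. The saving comes from how few rows of the large table reach that lookup.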

ASF Status
Currently 193 Top Level Projects

ASF Status
https://projects.apache.org

Some interesting new projects
• RocketMQ (similar to Kafka, Alibaba, graduated)
• CarbonData (file format, Huawei, graduated)
• MXNet (DL, Amazon)
• Apache Gearpump (streaming, Intel China)
• Apache Omid (HBase ACID, Yahoo!)

China is playing a BIG role
• RocketMQ, CarbonData, Gearpump, etc.
• Kylin (BI, OLAP cube)
• Alluxio (formerly Tachyon, in-memory cache)
• Blink (derived from Flink, Alibaba)
• MaxCompute (ODPS, Alibaba)
• HBaseCon Asia 2017 in Shenzhen, Huawei

You should not have to reason about streaming

Concept
• Treat a stream as a table
• Apply a query with an output mode specified:
• complete, append, update
• Query an input table, get a (filtered) result table
• The engine converts the query into an incremental query on new data to generate output
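
To make the "incremental query" idea more concrete, here is a minimal plain-Python sketch of a running word count that only touches the rows of each new batch and then emits either the whole result table (complete) or only the changed rows (update). The class `IncrementalWordCount` and its `process_batch` method are hypothetical names for this illustration and are not Spark APIs. Append mode is left out because this sketch does not model when a running aggregate becomes final.

```python
from collections import Counter


class IncrementalWordCount:
    """Keeps a running count so each batch only processes new rows."""

    def __init__(self):
        self.counts = Counter()  # the "result table", kept as state

    def process_batch(self, new_rows, output_mode="complete"):
        changed = set()
        for word in new_rows:
            self.counts[word] += 1
            changed.add(word)

        if output_mode == "complete":
            # Emit the whole result table after every trigger.
            return dict(self.counts)
        if output_mode == "update":
            # Emit only the rows whose value changed in this trigger.
            return {word: self.counts[word] for word in changed}
        raise ValueError(f"unsupported output mode: {output_mode}")


query = IncrementalWordCount()
first = query.process_batch(["spark", "flink", "spark"], "complete")
second = query.process_batch(["spark", "kafka"], "update")
third = query.process_batch(["beam"], "complete")
print(first, second, third)
```

The user-facing query stays the same as a batch word count; only the engine-side bookkeeping (the retained `counts` state) differs.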

New features
• Event time (handles late data)
• Watermark (limits the stateful data kept in memory)
• Checkpoints (offsets) stored in JSON (finally!)
• State management: mapGroupsWithState (Spark 2.2)
• Stream-stream join (Spark 2.3)
• Relies on the watermark to decide when to drop data that can never yield a join result
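
The interplay of event time and watermark can be sketched in plain Python as follows. This is a simplified model, assuming tumbling windows, integer timestamps, and a watermark that advances once per batch to the maximum event time seen minus an allowed lateness. The class `WatermarkedWindowCount` and its fields are hypothetical and do not reproduce Spark's actual implementation.

```python
class WatermarkedWindowCount:
    """Tumbling-window counts over event time, bounded by a watermark."""

    def __init__(self, window_size, allowed_lateness):
        self.window_size = window_size
        self.allowed_lateness = allowed_lateness
        self.max_event_time = 0
        self.watermark = 0
        self.state = {}    # window start -> count, the state kept in memory
        self.dropped = []  # events that arrived after their window closed

    def process_batch(self, events):
        """events: iterable of (event_time, payload); returns closed windows."""
        for event_time, payload in events:
            window_start = event_time - event_time % self.window_size
            if window_start + self.window_size <= self.watermark:
                # The watermark already passed this window: too late.
                self.dropped.append((event_time, payload))
                continue
            self.state[window_start] = self.state.get(window_start, 0) + 1
            self.max_event_time = max(self.max_event_time, event_time)

        # Advance the watermark once per batch, then evict closed windows.
        self.watermark = max(self.watermark,
                             self.max_event_time - self.allowed_lateness)
        finalized = {}
        for window_start in sorted(self.state):
            if window_start + self.window_size <= self.watermark:
                finalized[window_start] = self.state.pop(window_start)
        return finalized


counter = WatermarkedWindowCount(window_size=10, allowed_lateness=5)
out1 = counter.process_batch([(1, "a"), (4, "b"), (12, "c")])
out2 = counter.process_batch([(8, "late-but-ok"), (27, "d")])
out3 = counter.process_batch([(3, "too-late"), (29, "e")])
print(out1, out2, out3, counter.dropped)
```

In the second batch the event at time 8 is late but still counted, because the watermark (7) has not yet passed the end of its window. In the third batch the event at time 3 is dropped, and the state only holds the one window that is still open.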

Advantages
• SQL interface supported
• Performance considerations:
• Runtime codegen, off-heap, execution plan optimization... all available in streaming
• A bright future with more dev support (!?)

Disadvantages
• Encoder stuff is quite annoying
• Output mode depends on operations [1]
• Stateful operations are still not intuitive compared to Flink's state management

Disadvantages
• Closing with writeStream is mandatory now
• spark.readStream...T...writeStream.start
• org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;

Disadvantages
• Hard to combine with other data, compared to the powerful foreachRDD
• org.apache.spark.sql.AnalysisException: Right outer join with a streaming DataFrame/Dataset on the left is not supported;;
org.apache.spark.sql.AnalysisException: Union between streaming and batch DataFrames/Datasets is not supported;;

Disadvantages
• Need to write a ForeachWriter if the sink is not supported
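
For readers unfamiliar with what writing a ForeachWriter involves, the sketch below mimics its open / process / close lifecycle in plain Python. It is not Spark code: `DictSinkWriter`, `run_partition`, and the in-memory `store` are hypothetical stand-ins, and the assumption that the (partition id, version) pair can be used to skip replayed data is an illustration of one possible design, not a statement of how any particular sink behaves.

```python
class DictSinkWriter:
    """Hypothetical writer that 'writes' rows into an in-memory dict."""

    def __init__(self, store):
        self.store = store  # stands in for an external system
        self.key = None
        self.buffer = []

    def open(self, partition_id, version):
        # Return False to skip a partition/version that was already written.
        self.key = (partition_id, version)
        self.buffer = []
        return self.key not in self.store

    def process(self, row):
        # Buffer rows; nothing is visible in the sink until close().
        self.buffer.append(row)

    def close(self, error):
        # Commit only when the partition finished without an error.
        if error is None and self.key not in self.store:
            self.store[self.key] = list(self.buffer)


def run_partition(writer, partition_id, version, rows):
    """Hypothetical driver that mimics how an engine might call the writer."""
    error = None
    try:
        if writer.open(partition_id, version):
            for row in rows:
                writer.process(row)
    except Exception as exc:  # hand the failure on to close()
        error = exc
    finally:
        writer.close(error)


store = {}
writer = DictSinkWriter(store)
run_partition(writer, 0, 1, ["a", "b"])
run_partition(writer, 0, 1, ["a", "b"])  # replay of the same batch is skipped
run_partition(writer, 1, 1, ["c"])
print(store)
```

The point of the sketch is the amount of bookkeeping (connection setup, deduplication, commit on close) that falls to the user when no built-in sink exists.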

Recap
• Use Structured Streaming
• if you need event time accuracy
• if you need stream-stream join
• if you need performance
• Use Spark Streaming
• if you want more control over your compute logic
• if you can't do it in Structured Streaming ;)

Ref
• Easy, Scalable, Fault-Tolerant Stream Processing with Structured Streaming in Apache Spark
• Easy, Scalable, Fault-Tolerant Stream Processing with Structured Streaming in Apache Spark – continues
• Deep Dive into Stateful Stream Processing in Structured Streaming

TensorFlowOnSpark 
Scalable TensorFlow Learning on Spark Clusters
Lee Yang, Andrew Feng
Yahoo Big Data ML Platform Team

Ref
• TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters

ACHIEVING HBASE MULTI-TENANCY: REGIONSERVER GROUPS AND FAVORED NODES
Francis Liu & Thiruvel Thirumoolan
HBase Yahoos

Ref
• Achieving HBase Multi-Tenancy with RegionServer Groups and Favored Nodes

Summary
• Big data is flying to the cloud
• Batch: SQL optimization everywhere
• Streaming: event time and exactly-once are the default
• AI: TensorFlow wins the war. Try TensorFlow on Spark!
