2017 Big Data Landscape and Innovations
APAC Data Team
Evans Ye
Feb. 2018
Agenda
• 2017 Big Data Landscape
• Cutting Edge Innovations
• Spark Structured Streaming
• TensorFlow on Spark (copyright)
• HBase Multi-Tenancy: RSGroups and Favored Nodes (copyright)
2017 Big Data Landscape
Big Data Landscape
Hot Topics
• Machine Learning, Deep Learning, AI
  • TensorFrames, TensorFlow on Spark, Apache MXNet, ...
  • Cloudera Data Science Workbench
  • IBM Data Science Experience (partnered with Hortonworks)
• Streaming
  • Kafka, Beam, Structured Streaming, Flink, Apex, Hortonworks Streaming Analytics Manager, etc.
Tech Trend
• Spark still dominates the big data world and the research area
• Innovations in streaming:
  • event time, watermark, state management, exactly-once, rescaling, streaming SQL
• Big Data X Cloud
  • Hadoop, Hive, HBase, Spark on S3
SQL
SQL -> NoSQL -> NewSQL
• ACID: Ignite, Trafodion, Omid (incubator)
• Predicate Push-Down, Runtime Filter (BloomFilter)
• Rule-Based to Cost-Based Optimization: Spark Catalyst (2.2), Calcite
• Streaming SQL: Blink, KSQL, Storm, Samza
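To make the runtime-filter bullet concrete, here is a minimal sketch in plain Python. It is not taken from the slides and does not reproduce how any particular engine implements the idea: a Bloom filter is built from the join keys of a small table and used to discard rows of a large table before the join probe. The `BloomFilter` class, its parameters, the `runtime_filtered_join` function and the sample tables are all hypothetical.

```python
import hashlib


class BloomFilter:
    """Tiny illustrative Bloom filter (hypothetical helper, not an engine API)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def runtime_filtered_join(small, large):
    """Join (key, value) rows, pre-filtering the large side with a Bloom filter."""
    bloom = BloomFilter()
    lookup = {}
    for key, value in small:
        bloom.add(key)
        lookup.setdefault(key, []).append(value)
    # Rows that fail the filter are skipped before the join probe.
    survivors = [(k, v) for k, v in large if bloom.might_contain(k)]
    # The exact lookup removes any false positives the filter let through.
    joined = [(k, lv, sv) for k, lv in survivors if k in lookup for sv in lookup[k]]
    return joined, len(survivors)


dims = [(1, "tw"), (2, "jp")]                                  # small side
facts = [(1, 10.0), (3, 7.5), (2, 4.0), (1, 1.5), (9, 2.0)]    # large side
joined, scanned = runtime_filtered_join(dims, facts)
print(joined)
print("rows that reached the join:", scanned)
```

The filter can report false positives but never false negatives, so the join result stays exact while most non-matching rows are typically dropped early.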
ASF Status
Currently 193 Top Level Projects
https://projects.apache.org
Some interesting new projects
• RocketMQ (similar to Kafka, Alibaba, graduated)
• CarbonData (file format, Huawei, graduated)
• MXNet (DL, Amazon)
• Apache Gearpump (streaming, Intel China)
• Apache Omid (HBase ACID, Yahoo!)
China is playing a BIG role
• RocketMQ, CarbonData, Gearpump, etc.
• Kylin (BI, OLAP cube)
• Alluxio (formerly Tachyon, in-memory cache)
• Blink (derived from Flink, Alibaba)
• MaxCompute (ODPS, Alibaba)
• HBaseCon Asia 2017 in Shenzhen, Huawei
Cutting Edge Innovations
Structured Streaming
You should not have to reason about streaming
Concept
• Treat a stream as a table
• Apply a query with an output mode specified:
  • complete, append, update
• Query an input table, get a (filtered) result table
• The engine converts the query into an incremental query on new data to generate output
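The "stream as a table" idea can be sketched without Spark. The following plain-Python toy is an illustration under stated assumptions, not the engine's implementation: the class `IncrementalWordCount` and its `process_batch` method are hypothetical names. It keeps a result table, updates it from new rows only, and emits output according to the output mode. Only `complete` and `update` are covered; `append` is left out of this sketch.

```python
from collections import Counter


class IncrementalWordCount:
    """Keeps a result table and updates it from new input rows only."""

    def __init__(self, output_mode="complete"):
        if output_mode not in ("complete", "update"):
            raise ValueError("this sketch only covers complete and update")
        self.output_mode = output_mode
        self.result_table = Counter()

    def process_batch(self, new_rows):
        # Incremental step: only the new rows are scanned, never the full history.
        delta = Counter(new_rows)
        self.result_table.update(delta)
        if self.output_mode == "complete":
            # complete: emit the whole result table after every trigger
            return dict(self.result_table)
        # update: emit only the rows whose value changed in this trigger
        return {word: self.result_table[word] for word in delta}


batches = [["spark", "flink", "spark"], ["flink", "beam"]]

complete = IncrementalWordCount("complete")
update = IncrementalWordCount("update")
for batch in batches:
    print("complete:", complete.process_batch(batch))
    print("update:  ", update.process_batch(batch))
```

After the second batch, the `complete` query returns all three words, while the `update` query returns only `flink` and `beam`, the rows touched by the new data.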
Example
New features
• Event time (handles late data)
• Watermark (limits the stateful data kept in memory)
• Checkpoints (offsets) stored in JSON (finally!)
• State management: mapGroupsWithState (Spark 2.2)
• Stream-stream join (Spark 2.3)
  • Relies on the watermark to decide when to drop data that can never yield a join result
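How event time and a watermark interact can be sketched in a few lines of plain Python. This is a simplified model, not Spark's implementation: here the watermark is assumed to advance after every event (an engine may advance it only between triggers), and the class `WatermarkedWindowCount` with its parameters is hypothetical. Events are counted per tumbling event-time window, a window is finalized and its state evicted once the watermark passes its end, and events for an already evicted window are dropped.

```python
class WatermarkedWindowCount:
    """Counts events per tumbling event-time window, evicting old state."""

    def __init__(self, window_size, allowed_lateness):
        self.window_size = window_size
        self.allowed_lateness = allowed_lateness
        self.max_event_time = 0
        self.state = {}    # window start -> count, kept only for open windows
        self.dropped = 0   # events that arrived too late to be counted

    @property
    def watermark(self):
        # Watermark = latest event time seen so far minus the allowed lateness.
        return self.max_event_time - self.allowed_lateness

    def on_event(self, event_time):
        """Process one event; return the windows finalized by the new watermark."""
        window_start = event_time - event_time % self.window_size
        if window_start + self.window_size <= self.watermark:
            # The window was already finalized and its state dropped.
            self.dropped += 1
            return []
        self.state[window_start] = self.state.get(window_start, 0) + 1
        self.max_event_time = max(self.max_event_time, event_time)
        # Finalize and evict every window that ends at or before the watermark.
        closed = sorted(w for w in self.state if w + self.window_size <= self.watermark)
        return [(w, self.state.pop(w)) for w in closed]


counter = WatermarkedWindowCount(window_size=10, allowed_lateness=5)
finalized = []
for t in [1, 4, 12, 8, 17, 3, 26]:   # event times, arriving out of order
    finalized.extend(counter.on_event(t))

print("finalized windows:", finalized)
print("dropped late events:", counter.dropped)
print("state still in memory:", counter.state)
```

In this run the event at time 8 is late but still counted, because the watermark (7) has not yet passed the end of its window. The event at time 3 arrives after the watermark has moved to 12 and is dropped. Only the still-open window remains in memory, which is the state-limiting effect the watermark bullet refers to.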
Advantages
• SQL interface supported
• Performance considerations:
  • Runtime codegen, off-heap, execution plan optimization... all available in streaming
• A bright future with more dev support (!?)
Disadvantages
• Dealing with Encoders is quite annoying
• Output mode depends on operations [1]
• Stateful operations are still not intuitive compared to Flink's state management
• Closing with writeStream is mandatory now
  • spark.readStream...T...writeStream.start
  • org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
• Hard to combine with other data, compared to the powerful foreachRDD
  • org.apache.spark.sql.AnalysisException: Right outer join with a streaming DataFrame/Dataset on the left is not supported;;
  • org.apache.spark.sql.AnalysisException: Union between streaming and batch DataFrames/Datasets is not supported;;
• Need to write a ForeachWriter if the sink is not supported
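A custom sink of this kind follows an open/process/close life cycle. The sketch below imitates that contract in plain Python so it runs without Spark. The method names mirror those of ForeachWriter, but everything else is an assumption for illustration: the class `IdempotentListWriter`, the in-memory `storage` list standing in for an external system, and the `run_partition` driver that plays the role of the engine are hypothetical. Skipping a (partition, epoch) pair that was already written is one possible way to keep replays from duplicating output.

```python
class IdempotentListWriter:
    """Custom sink following an open/process/close contract (illustrative only)."""

    committed = set()   # (partition_id, epoch_id) pairs already written
    storage = []        # stand-in for an external system

    def open(self, partition_id, epoch_id):
        self.key = (partition_id, epoch_id)
        self.buffer = []
        # Returning False tells the caller to skip rows that were already written.
        return self.key not in IdempotentListWriter.committed

    def process(self, row):
        # Buffer rows so that nothing is written if the partition fails midway.
        self.buffer.append(row)

    def close(self, error):
        # Commit only when the whole partition was processed without error.
        if error is None:
            IdempotentListWriter.storage.extend(self.buffer)
            IdempotentListWriter.committed.add(self.key)


def run_partition(writer, partition_id, epoch_id, rows):
    """Hypothetical driver imitating how an engine would call the writer."""
    error = None
    if writer.open(partition_id, epoch_id):
        try:
            for row in rows:
                writer.process(row)
        except Exception as exc:
            error = exc
    writer.close(error)


run_partition(IdempotentListWriter(), 0, 1, ["a", "b"])
run_partition(IdempotentListWriter(), 0, 1, ["a", "b"])   # replay after a failure
run_partition(IdempotentListWriter(), 0, 2, ["c"])
print(IdempotentListWriter.storage)
```

The replayed epoch is skipped, so the storage ends up holding each row once.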
Recap
• Use Structured Streaming
  • if you need event time accuracy
  • if you need stream-stream join
  • if you need performance
• Use Spark Streaming
  • if you want more control over your compute logic
  • if you can't do it in Structured Streaming ;)
Ref
• Easy, Scalable, Fault-Tolerant Stream Processing with Structured Streaming in Apache Spark
• Easy, Scalable, Fault-Tolerant Stream Processing with Structured Streaming in Apache Spark – continues
• Deep Dive into Stateful Stream Processing in Structured Streaming
TensorFlowOnSpark
Scalable TensorFlow Learning on Spark Clusters
Lee Yang, Andrew Feng
Yahoo Big Data ML Platform Team
Ref
• TensorFlowOnSpark: Scalable TensorFlow Learning on Spark Clusters
Achieving HBase Multi-Tenancy: RegionServer Groups and Favored Nodes
Francis Liu & Thiruvel Thirumoolan
HBase Yahoos
Ref
• Achieving HBase Multi-Tenancy with RegionServer Groups and Favored Nodes
Summary
• Big data is moving to the cloud
• Batch: SQL optimization everywhere
• Streaming: event time and exactly-once are the default
• AI: TensorFlow wins the war. Try TensorFlow on Spark!
