SlideShare una empresa de Scribd logo
1 de 35
Descargar para leer sin conexión
Spark: High-Speed Big
Data Analysis Framework
Intel
Andrew Xia
Weibo:Andrew-Xia
Agenda
•
•
•
•

Intel contributions to Spark
Collaboration
Real world cases
Summary
Spark Overview
• Open source projects initiated by
AMPLab in UC Berkeley
• Apache incubation since June 2013
UC BERKELEY

• Intel closely collaborating with
AMPLab & the community
on open source development
Contributions by Intel
• Netty based shuffle for Spark
Intel China
• 3 committers
• FairScheduler for Spark
• 7 contributors
• Spark job log files
• 50+ patches
• Metrics system for Spark
• Spark shell on YARN
• Spark (standalone mode) integration with security Hadoop
• Byte code generation for Shark
• Co-partitioned join in Shark
...
Agenda
•
•
•
•

Intel contributions to Spark
Collaboration
Real world cases
Summary
Collaboration Partners
• Intel partnering with several big websites
– Building next-gen big data analytics using the
Spark stack
– E.g., Alibaba , Baidu iQiyi, Youku, etc.
Big Data in Partners
• Advertising
– Operation analysis
– Effect analysis
– Directional optimization

• Analysis
– Website report
– Platform report
– Monitor system

• Recommendation
– Ranking list
– Personal recommendation
– Hot-click analysis
FAQs
FAQ #1: Poor Performance
– machine learning and graph computation
– OLAP for Tabular data, interactive query
Hadoop Data Sharing
iter. 1

. . .

iter. 2

Input

query 1

result 1

query 2

HDFS
read

result 2

query 3

result 3

Slow due to replication, serialization, and disk IO
Spark Data Sharing
iter. 1
Input

query 1
one-time
processing

Input

iter. 2

query 2
query 3

. . .

10-100× faster than network and disk

. . .
FAQs
FAQ #2: Too many big data systems
Big Data Systems Today

MapReduce
…
General batch
processing

Specialized systems
(iterative, interactive and
streaming apps)
Vision of Spark Ecosystem

One stack to rule them all!
Spark Ecosystem
Spark
Streaming

Graphx
Graphparallel

MLBase
Machine
learning

Spark
Tachyon
HDFS/Hadoop Storage
Mesos

YARN

MPI……
MapReduce

Shark
SQL
FAQs
FAQ #3: Study cost
Code Size
140000
120000
100000

GraphX

80000

Shark*

60000

Streaming

40000

20000
0
Hadoop
Storm
MapReduce (Streaming)

non-test, non-example source lines

Impala
(SQL)

Giraph
(Graph)

Spark

* also calls into Hive
FAQs
FAQ #4: Is Spark Stable?
Spark Status
• Spark 0.8 has been released
• Spark 0.9 will be release in Jan 2013
FAQs

FAQ #5: Not enough memory to cache
Not Enough Memory
• Graceful degradation
• Scheduler takes care of this
• Other options
– MEMORY_ONLY
– MEMORY_ONLY_SER
– MEMORY_AND_DISK
– DISK_ONLY
FAQs

FAQ #6: How to recover when failure?
How to Failover?
iter. 1
Input

query 1
one-time
processing

Input

iter. 2

query 2
query 3

. . .

. . .
How to Failover?
• Lineage: track the graph of transformations
that built RDD
• Checkpoint: lineage graphs get large
FAQs

FAQ #7: Is Spark compatible with
Hadoop ecosystem?
FAQs

FAQ #8:Need port to Spark?
FAQs

FAQ #9: Any cons about Spark?
Agenda
•
•
•
•

Intel contributions to Spark
Collaboration
Real world cases
Summary
Case1#:Real-Time Log Aggregation
• Logs continuously collected & streamed in
– Through queuing/messaging systems

• Incoming logs processed in a (semi) streaming
fashion
– Aggregations for different time periods, demographics,
etc.
– Join logs and history tables when necessary

• Aggregation results then consumed in a (semi)
streaming fashion
– Monitoring, alerting, etc.
Real-Time Log Aggregation: Spark Streaming
Log
Collectors

Kafka
Cluster

Spark
Cluster

RDBMS

• Implications
– Better streaming framework support
• Complex (e.g., statful) analysis, fault-tolerance, etc.

– Kafka & Spark not collocated
• DStream retrieves logs in background (over network) and caches blocks in memory

– Memory tuning to reduce GC is critical
• spark.cleaner.ttl (throughput * spark.cleaner.ttl < spark mem free size)
• Storage level (MEMORY_ONLY_SER2)

– Lower latency (several seconds)
• No startup overhead (reusing SparkContext)
Case #2:Machine Learning & Graph Analysis
• Algorithm: complex match operations
– Mostly matrix based
• Multiplication, factorization, etc.

– Sometime graph-based
• E.g., sparse matrix

• Iterative computations
– Matrix (graph) cached in memory across iterations
Graph Analysis: N-Degree Association
• N-degree association in the
graph
– Computing associations between two
vertices that are n-hop away
– E.g., friends of friend

• Graph-parallel implementation
– Bagel (Pregel on Spark) and GraphX
• Memory optimizations for efficient graph
caching critical

– Speedup from 20+ minutes to <2 minutes
Graph Analysis: N-Degree Association
u

State[u] = list of Weight(x, u)
(for current top K weights to vertex u)

v

w

State[w] = list of Weight(x, w)
(for current top K weights to vertex w)

State[v] = list of Weight(x, v)
(for current top K weights to vertex v)

u

u
Messages = {D(x, u) =
Weight(x, v) * edge(w, u)}
(for weight(x, v) in State[v])

v
Messages = {D(x, u) =
Weight(x, w) * edge(w, u)}
(for weight(x, w) in State[w])

v

w

w
Agenda
•
•
•
•

Intel contributions to Spark
Collaboration
Real world cases
Summary
Summary
• Memory is King!
• One stack to rule them all!
• Contribute to community!
夏俊鸾:Spark——基于内存的下一代大数据分析框架

Más contenido relacionado

La actualidad más candente

Build, Scale, and Deploy Deep Learning Pipelines with Ease
Build, Scale, and Deploy Deep Learning Pipelines with EaseBuild, Scale, and Deploy Deep Learning Pipelines with Ease
Build, Scale, and Deploy Deep Learning Pipelines with EaseDatabricks
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...Jose Quesada (hiring)
 
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Databricks
 
A Data Frame Abstraction Layer for SparkR-(Chris Freeman, Alteryx)
A Data Frame Abstraction Layer for SparkR-(Chris Freeman, Alteryx)A Data Frame Abstraction Layer for SparkR-(Chris Freeman, Alteryx)
A Data Frame Abstraction Layer for SparkR-(Chris Freeman, Alteryx)Spark Summit
 
Assessing Graph Solutions for Apache Spark
Assessing Graph Solutions for Apache SparkAssessing Graph Solutions for Apache Spark
Assessing Graph Solutions for Apache SparkDatabricks
 
Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi...
Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi...Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi...
Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi...Databricks
 
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim HunterDeep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim HunterDatabricks
 
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RSpark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RDatabricks
 
DASK and Apache Spark
DASK and Apache SparkDASK and Apache Spark
DASK and Apache SparkDatabricks
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkAlpine Data
 
SparkCruise: Automatic Computation Reuse in Apache Spark
SparkCruise: Automatic Computation Reuse in Apache SparkSparkCruise: Automatic Computation Reuse in Apache Spark
SparkCruise: Automatic Computation Reuse in Apache SparkDatabricks
 
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold Xin
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold XinUnifying State-of-the-Art AI and Big Data in Apache Spark with Reynold Xin
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold XinDatabricks
 
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...Databricks
 
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...Databricks
 
Stories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi TorresStories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi TorresSpark Summit
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
Bring Satellite and Drone Imagery into your Data Science Workflows
Bring Satellite and Drone Imagery into your Data Science WorkflowsBring Satellite and Drone Imagery into your Data Science Workflows
Bring Satellite and Drone Imagery into your Data Science WorkflowsDatabricks
 
Spark Summit EU talk by Elena Lazovik
Spark Summit EU talk by Elena LazovikSpark Summit EU talk by Elena Lazovik
Spark Summit EU talk by Elena LazovikSpark Summit
 
Spark Summit EU talk by Christos Erotocritou
Spark Summit EU talk by Christos ErotocritouSpark Summit EU talk by Christos Erotocritou
Spark Summit EU talk by Christos ErotocritouSpark Summit
 

La actualidad más candente (20)

Build, Scale, and Deploy Deep Learning Pipelines with Ease
Build, Scale, and Deploy Deep Learning Pipelines with EaseBuild, Scale, and Deploy Deep Learning Pipelines with Ease
Build, Scale, and Deploy Deep Learning Pipelines with Ease
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
Spark's Role in the Big Data Ecosystem (Spark Summit 2014)
 
A Data Frame Abstraction Layer for SparkR-(Chris Freeman, Alteryx)
A Data Frame Abstraction Layer for SparkR-(Chris Freeman, Alteryx)A Data Frame Abstraction Layer for SparkR-(Chris Freeman, Alteryx)
A Data Frame Abstraction Layer for SparkR-(Chris Freeman, Alteryx)
 
Assessing Graph Solutions for Apache Spark
Assessing Graph Solutions for Apache SparkAssessing Graph Solutions for Apache Spark
Assessing Graph Solutions for Apache Spark
 
Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi...
Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi...Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi...
Apache Spark AI Use Case in Telco: Network Quality Analysis and Prediction wi...
 
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim HunterDeep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
 
Meetup tensorframes
Meetup tensorframesMeetup tensorframes
Meetup tensorframes
 
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and RSpark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
Spark Summit EU 2015: Combining the Strengths of MLlib, scikit-learn, and R
 
DASK and Apache Spark
DASK and Apache SparkDASK and Apache Spark
DASK and Apache Spark
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using Spark
 
SparkCruise: Automatic Computation Reuse in Apache Spark
SparkCruise: Automatic Computation Reuse in Apache SparkSparkCruise: Automatic Computation Reuse in Apache Spark
SparkCruise: Automatic Computation Reuse in Apache Spark
 
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold Xin
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold XinUnifying State-of-the-Art AI and Big Data in Apache Spark with Reynold Xin
Unifying State-of-the-Art AI and Big Data in Apache Spark with Reynold Xin
 
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...
Flare: Scale Up Spark SQL with Native Compilation and Set Your Data on Fire! ...
 
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...
Ray: A Cluster Computing Engine for Reinforcement Learning Applications with ...
 
Stories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi TorresStories About Spark, HPC and Barcelona by Jordi Torres
Stories About Spark, HPC and Barcelona by Jordi Torres
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Bring Satellite and Drone Imagery into your Data Science Workflows
Bring Satellite and Drone Imagery into your Data Science WorkflowsBring Satellite and Drone Imagery into your Data Science Workflows
Bring Satellite and Drone Imagery into your Data Science Workflows
 
Spark Summit EU talk by Elena Lazovik
Spark Summit EU talk by Elena LazovikSpark Summit EU talk by Elena Lazovik
Spark Summit EU talk by Elena Lazovik
 
Spark Summit EU talk by Christos Erotocritou
Spark Summit EU talk by Christos ErotocritouSpark Summit EU talk by Christos Erotocritou
Spark Summit EU talk by Christos Erotocritou
 

Destacado

Big data hadoop FAQ's
Big data hadoop FAQ'sBig data hadoop FAQ's
Big data hadoop FAQ'sAshraf Ali
 
Java/Scala Lab 2016. Руслан Шевченко: Несколько трюков scala-разработки, приг...
Java/Scala Lab 2016. Руслан Шевченко: Несколько трюков scala-разработки, приг...Java/Scala Lab 2016. Руслан Шевченко: Несколько трюков scala-разработки, приг...
Java/Scala Lab 2016. Руслан Шевченко: Несколько трюков scala-разработки, приг...GeeksLab Odessa
 
Apache Spark An Overview
Apache Spark An OverviewApache Spark An Overview
Apache Spark An OverviewMohit Jain
 
Scala : language of the future
Scala : language of the futureScala : language of the future
Scala : language of the futureAnsviaLab
 
A Brief Intro to Scala
A Brief Intro to ScalaA Brief Intro to Scala
A Brief Intro to ScalaTim Underwood
 
Apache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsApache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsHortonworks
 
Functional Programming in Scala in a Nutshell: Review of Functional Programmi...
Functional Programming in Scala in a Nutshell: Review of Functional Programmi...Functional Programming in Scala in a Nutshell: Review of Functional Programmi...
Functional Programming in Scala in a Nutshell: Review of Functional Programmi...Namuk Park
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkPatrick Wendell
 
Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2Cloudera, Inc.
 
Classification of yarn yarn classification. Textile yarn. Yarn count.
Classification of yarn   yarn classification. Textile yarn. Yarn count. Classification of yarn   yarn classification. Textile yarn. Yarn count.
Classification of yarn yarn classification. Textile yarn. Yarn count. Vaibhav Mathankar
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersRahul Jain
 

Destacado (12)

Big data hadoop FAQ's
Big data hadoop FAQ'sBig data hadoop FAQ's
Big data hadoop FAQ's
 
Java/Scala Lab 2016. Руслан Шевченко: Несколько трюков scala-разработки, приг...
Java/Scala Lab 2016. Руслан Шевченко: Несколько трюков scala-разработки, приг...Java/Scala Lab 2016. Руслан Шевченко: Несколько трюков scala-разработки, приг...
Java/Scala Lab 2016. Руслан Шевченко: Несколько трюков scala-разработки, приг...
 
Apache Spark An Overview
Apache Spark An OverviewApache Spark An Overview
Apache Spark An Overview
 
Scala : language of the future
Scala : language of the futureScala : language of the future
Scala : language of the future
 
A Brief Intro to Scala
A Brief Intro to ScalaA Brief Intro to Scala
A Brief Intro to Scala
 
Apache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data ApplicationsApache Hadoop YARN - Enabling Next Generation Data Applications
Apache Hadoop YARN - Enabling Next Generation Data Applications
 
Functional Programming in Scala in a Nutshell: Review of Functional Programmi...
Functional Programming in Scala in a Nutshell: Review of Functional Programmi...Functional Programming in Scala in a Nutshell: Review of Functional Programmi...
Functional Programming in Scala in a Nutshell: Review of Functional Programmi...
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2
 
Classification of yarn yarn classification. Textile yarn. Yarn count.
Classification of yarn   yarn classification. Textile yarn. Yarn count. Classification of yarn   yarn classification. Textile yarn. Yarn count.
Classification of yarn yarn classification. Textile yarn. Yarn count.
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for Beginners
 

Similar a 夏俊鸾:Spark——基于内存的下一代大数据分析框架

An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache SparkAmir Sedighi
 
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
 
Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析Etu Solution
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsDatabricks
 
Intel realtime analytics_spark
Intel realtime analytics_sparkIntel realtime analytics_spark
Intel realtime analytics_sparkGeetanjali G
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introductionHektor Jacynycz García
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...Databricks
 
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"IT Event
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesJen Aman
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDatabricks
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesJen Aman
 
Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...
Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...
Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...Spark Summit
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analyticsinoshg
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupNed Shawa
 
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Wee Hyong Tok
 
Large Scale Machine learning with Spark
Large Scale Machine learning with SparkLarge Scale Machine learning with Spark
Large Scale Machine learning with SparkMd. Mahedi Kaysar
 
Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)tsliwowicz
 
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML ConferenceDB Tsai
 

Similar a 夏俊鸾:Spark——基于内存的下一代大数据分析框架 (20)

An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
 
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析Track A-2 基於 Spark 的數據分析
Track A-2 基於 Spark 的數據分析
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Intel realtime analytics_spark
Intel realtime analytics_sparkIntel realtime analytics_spark
Intel realtime analytics_spark
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
Volodymyr Lyubinets "Introduction to big data processing with Apache Spark"
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best Practices
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best Practices
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best Practices
 
Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...
Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...
Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analytics
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetup
 
Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425Spark summit 2019 infrastructure for deep learning in apache spark 0425
Spark summit 2019 infrastructure for deep learning in apache spark 0425
 
Large Scale Machine learning with Spark
Large Scale Machine learning with SparkLarge Scale Machine learning with Spark
Large Scale Machine learning with Spark
 
Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)
 
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
 

Más de hdhappy001

詹剑锋:Big databench—benchmarking big data systems
詹剑锋:Big databench—benchmarking big data systems詹剑锋:Big databench—benchmarking big data systems
詹剑锋:Big databench—benchmarking big data systemshdhappy001
 
翟艳堂:腾讯大规模Hadoop集群实践
翟艳堂:腾讯大规模Hadoop集群实践翟艳堂:腾讯大规模Hadoop集群实践
翟艳堂:腾讯大规模Hadoop集群实践hdhappy001
 
袁晓如:大数据时代可视化和可视分析的机遇与挑战
袁晓如:大数据时代可视化和可视分析的机遇与挑战袁晓如:大数据时代可视化和可视分析的机遇与挑战
袁晓如:大数据时代可视化和可视分析的机遇与挑战hdhappy001
 
俞晨杰:Linked in大数据应用和azkaban
俞晨杰:Linked in大数据应用和azkaban俞晨杰:Linked in大数据应用和azkaban
俞晨杰:Linked in大数据应用和azkabanhdhappy001
 
杨少华:阿里开放数据处理服务
杨少华:阿里开放数据处理服务杨少华:阿里开放数据处理服务
杨少华:阿里开放数据处理服务hdhappy001
 
薛伟:腾讯广点通——大数据之上的实时精准推荐
薛伟:腾讯广点通——大数据之上的实时精准推荐薛伟:腾讯广点通——大数据之上的实时精准推荐
薛伟:腾讯广点通——大数据之上的实时精准推荐hdhappy001
 
徐萌:中国移动大数据应用实践
徐萌:中国移动大数据应用实践徐萌:中国移动大数据应用实践
徐萌:中国移动大数据应用实践hdhappy001
 
肖永红:科研数据应用和共享方面的实践
肖永红:科研数据应用和共享方面的实践肖永红:科研数据应用和共享方面的实践
肖永红:科研数据应用和共享方面的实践hdhappy001
 
肖康:Storm在实时网络攻击检测和分析的应用与改进
肖康:Storm在实时网络攻击检测和分析的应用与改进肖康:Storm在实时网络攻击检测和分析的应用与改进
肖康:Storm在实时网络攻击检测和分析的应用与改进hdhappy001
 
魏凯:大数据商业利用的政策管制问题
魏凯:大数据商业利用的政策管制问题魏凯:大数据商业利用的政策管制问题
魏凯:大数据商业利用的政策管制问题hdhappy001
 
王涛:基于Cloudera impala的非关系型数据库sql执行引擎
王涛:基于Cloudera impala的非关系型数据库sql执行引擎王涛:基于Cloudera impala的非关系型数据库sql执行引擎
王涛:基于Cloudera impala的非关系型数据库sql执行引擎hdhappy001
 
王峰:阿里搜索实时流计算技术
王峰:阿里搜索实时流计算技术王峰:阿里搜索实时流计算技术
王峰:阿里搜索实时流计算技术hdhappy001
 
钱卫宁:在线社交媒体分析型查询基准评测初探
钱卫宁:在线社交媒体分析型查询基准评测初探钱卫宁:在线社交媒体分析型查询基准评测初探
钱卫宁:在线社交媒体分析型查询基准评测初探hdhappy001
 
穆黎森:Interactive batch query at scale
穆黎森:Interactive batch query at scale穆黎森:Interactive batch query at scale
穆黎森:Interactive batch query at scalehdhappy001
 
罗李:构建一个跨机房的Hadoop集群
罗李:构建一个跨机房的Hadoop集群罗李:构建一个跨机房的Hadoop集群
罗李:构建一个跨机房的Hadoop集群hdhappy001
 
刘书良:基于大数据公共云平台的Dsp技术
刘书良:基于大数据公共云平台的Dsp技术刘书良:基于大数据公共云平台的Dsp技术
刘书良:基于大数据公共云平台的Dsp技术hdhappy001
 
刘诚忠:Running cloudera impala on postgre sql
刘诚忠:Running cloudera impala on postgre sql刘诚忠:Running cloudera impala on postgre sql
刘诚忠:Running cloudera impala on postgre sqlhdhappy001
 
刘昌钰:阿里大数据应用平台
刘昌钰:阿里大数据应用平台刘昌钰:阿里大数据应用平台
刘昌钰:阿里大数据应用平台hdhappy001
 
李战怀:大数据背景下分布式系统的数据一致性策略
李战怀:大数据背景下分布式系统的数据一致性策略李战怀:大数据背景下分布式系统的数据一致性策略
李战怀:大数据背景下分布式系统的数据一致性策略hdhappy001
 
冯宏华:H base在小米的应用与扩展
冯宏华:H base在小米的应用与扩展冯宏华:H base在小米的应用与扩展
冯宏华:H base在小米的应用与扩展hdhappy001
 

Más de hdhappy001 (20)

詹剑锋:Big databench—benchmarking big data systems
詹剑锋:Big databench—benchmarking big data systems詹剑锋:Big databench—benchmarking big data systems
詹剑锋:Big databench—benchmarking big data systems
 
翟艳堂:腾讯大规模Hadoop集群实践
翟艳堂:腾讯大规模Hadoop集群实践翟艳堂:腾讯大规模Hadoop集群实践
翟艳堂:腾讯大规模Hadoop集群实践
 
袁晓如:大数据时代可视化和可视分析的机遇与挑战
袁晓如:大数据时代可视化和可视分析的机遇与挑战袁晓如:大数据时代可视化和可视分析的机遇与挑战
袁晓如:大数据时代可视化和可视分析的机遇与挑战
 
俞晨杰:Linked in大数据应用和azkaban
俞晨杰:Linked in大数据应用和azkaban俞晨杰:Linked in大数据应用和azkaban
俞晨杰:Linked in大数据应用和azkaban
 
杨少华:阿里开放数据处理服务
杨少华:阿里开放数据处理服务杨少华:阿里开放数据处理服务
杨少华:阿里开放数据处理服务
 
薛伟:腾讯广点通——大数据之上的实时精准推荐
薛伟:腾讯广点通——大数据之上的实时精准推荐薛伟:腾讯广点通——大数据之上的实时精准推荐
薛伟:腾讯广点通——大数据之上的实时精准推荐
 
徐萌:中国移动大数据应用实践
徐萌:中国移动大数据应用实践徐萌:中国移动大数据应用实践
徐萌:中国移动大数据应用实践
 
肖永红:科研数据应用和共享方面的实践
肖永红:科研数据应用和共享方面的实践肖永红:科研数据应用和共享方面的实践
肖永红:科研数据应用和共享方面的实践
 
肖康:Storm在实时网络攻击检测和分析的应用与改进
肖康:Storm在实时网络攻击检测和分析的应用与改进肖康:Storm在实时网络攻击检测和分析的应用与改进
肖康:Storm在实时网络攻击检测和分析的应用与改进
 
魏凯:大数据商业利用的政策管制问题
魏凯:大数据商业利用的政策管制问题魏凯:大数据商业利用的政策管制问题
魏凯:大数据商业利用的政策管制问题
 
王涛:基于Cloudera impala的非关系型数据库sql执行引擎
王涛:基于Cloudera impala的非关系型数据库sql执行引擎王涛:基于Cloudera impala的非关系型数据库sql执行引擎
王涛:基于Cloudera impala的非关系型数据库sql执行引擎
 
王峰:阿里搜索实时流计算技术
王峰:阿里搜索实时流计算技术王峰:阿里搜索实时流计算技术
王峰:阿里搜索实时流计算技术
 
钱卫宁:在线社交媒体分析型查询基准评测初探
钱卫宁:在线社交媒体分析型查询基准评测初探钱卫宁:在线社交媒体分析型查询基准评测初探
钱卫宁:在线社交媒体分析型查询基准评测初探
 
穆黎森:Interactive batch query at scale
穆黎森:Interactive batch query at scale穆黎森:Interactive batch query at scale
穆黎森:Interactive batch query at scale
 
罗李:构建一个跨机房的Hadoop集群
罗李:构建一个跨机房的Hadoop集群罗李:构建一个跨机房的Hadoop集群
罗李:构建一个跨机房的Hadoop集群
 
刘书良:基于大数据公共云平台的Dsp技术
刘书良:基于大数据公共云平台的Dsp技术刘书良:基于大数据公共云平台的Dsp技术
刘书良:基于大数据公共云平台的Dsp技术
 
刘诚忠:Running cloudera impala on postgre sql
刘诚忠:Running cloudera impala on postgre sql刘诚忠:Running cloudera impala on postgre sql
刘诚忠:Running cloudera impala on postgre sql
 
刘昌钰:阿里大数据应用平台
刘昌钰:阿里大数据应用平台刘昌钰:阿里大数据应用平台
刘昌钰:阿里大数据应用平台
 
李战怀:大数据背景下分布式系统的数据一致性策略
李战怀:大数据背景下分布式系统的数据一致性策略李战怀:大数据背景下分布式系统的数据一致性策略
李战怀:大数据背景下分布式系统的数据一致性策略
 
冯宏华:H base在小米的应用与扩展
冯宏华:H base在小米的应用与扩展冯宏华:H base在小米的应用与扩展
冯宏华:H base在小米的应用与扩展
 

Último

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 

Último (20)

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 

夏俊鸾:Spark——基于内存的下一代大数据分析框架

  • 1. Spark: High-Speed Big Data Analysis Framework Intel Andrew Xia Weibo:Andrew-Xia
  • 2. Agenda • • • • Intel contributions to Spark Collaboration Real world cases Summary
  • 3. Spark Overview • Open source projects initiated by AMPLab in UC Berkeley • Apache incubation since June 2013 UC BERKELEY • Intel closely collaborating with AMPLab & the community on open source development
  • 4. Contributions by Intel • Netty based shuffle for Spark Intel China • 3 committers • FairScheduler for Spark • 7 contributors • Spark job log files • 50+ patches • Metrics system for Spark • Spark shell on YARN • Spark (standalone mode) integration with security Hadoop • Byte code generation for Shark • Co-partitioned join in Shark ...
  • 5. Agenda • • • • Intel contributions to Spark Collaboration Real world cases Summary
  • 6. Collaboration Partners • Intel partnering with several big websites – Building next-gen big data analytics using the Spark stack – E.g., Alibaba , Baidu iQiyi, Youku, etc.
  • 7. Big Data in Partners • Advertising – Operation analysis – Effect analysis – Directional optimization • Analysis – Website report – Platform report – Monitor system • Recommendation – Ranking list – Personal recommendation – Hot-click analysis
  • 8. FAQs FAQ #1: Poor Performance – machine learning and graph computation – OLAP for Tabular data, interactive query
  • 9. Hadoop Data Sharing iter. 1 . . . iter. 2 Input query 1 result 1 query 2 HDFS read result 2 query 3 result 3 Slow due to replication, serialization, and disk IO
  • 10. Spark Data Sharing iter. 1 Input query 1 one-time processing Input iter. 2 query 2 query 3 . . . 10-100× faster than network and disk . . .
  • 11. FAQs FAQ #2: Too many big data systems
  • 12. Big Data Systems Today MapReduce … General batch processing Specialized systems (iterative, interactive and streaming apps)
  • 13. Vision of Spark Ecosystem One stack to rule them all!
  • 16. Code Size 140000 120000 100000 GraphX 80000 Shark* 60000 Streaming 40000 20000 0 Hadoop Storm MapReduce (Streaming) non-test, non-example source lines Impala (SQL) Giraph (Graph) Spark * also calls into Hive
  • 17. FAQs FAQ #4: Is Spark Stable?
  • 18. Spark Status • Spark 0.8 has been released • Spark 0.9 will be release in Jan 2013
  • 19. FAQs FAQ #5: Not enough memory to cache
  • 20. Not Enough Memory • Graceful degradation • Scheduler takes care of this • Other options – MEMORY_ONLY – MEMORY_ONLY_SER – MEMORY_AND_DISK – DISK_ONLY
  • 21. FAQs FAQ #6: How to recover when failure?
  • 22. How to Failover? iter. 1 Input query 1 one-time processing Input iter. 2 query 2 query 3 . . . . . .
  • 23. How to Failover? • Lineage: track the graph of transformations that built RDD • Checkpoint: lineage graphs get large
  • 24. FAQs FAQ #7: Is Spark compatible with Hadoop ecosystem?
  • 26. FAQs FAQ #9: Any cons about Spark?
  • 27. Agenda • • • • Intel contributions to Spark Collaboration Real world cases Summary
  • 28. Case1#:Real-Time Log Aggregation • Logs continuously collected & streamed in – Through queuing/messaging systems • Incoming logs processed in a (semi) streaming fashion – Aggregations for different time periods, demographics, etc. – Join logs and history tables when necessary • Aggregation results then consumed in a (semi) streaming fashion – Monitoring, alerting, etc.
  • 29. Real-Time Log Aggregation: Spark Streaming Log Collectors Kafka Cluster Spark Cluster RDBMS • Implications – Better streaming framework support • Complex (e.g., statful) analysis, fault-tolerance, etc. – Kafka & Spark not collocated • DStream retrieves logs in background (over network) and caches blocks in memory – Memory tuning to reduce GC is critical • spark.cleaner.ttl (throughput * spark.cleaner.ttl < spark mem free size) • Storage level (MEMORY_ONLY_SER2) – Lower latency (several seconds) • No startup overhead (reusing SparkContext)
  • 30. Case #2:Machine Learning & Graph Analysis • Algorithm: complex match operations – Mostly matrix based • Multiplication, factorization, etc. – Sometime graph-based • E.g., sparse matrix • Iterative computations – Matrix (graph) cached in memory across iterations
  • 31. Graph Analysis: N-Degree Association • N-degree association in the graph – Computing associations between two vertices that are n-hop away – E.g., friends of friend • Graph-parallel implementation – Bagel (Pregel on Spark) and GraphX • Memory optimizations for efficient graph caching critical – Speedup from 20+ minutes to <2 minutes
  • 32. Graph Analysis: N-Degree Association u State[u] = list of Weight(x, u) (for current top K weights to vertex u) v w State[w] = list of Weight(x, w) (for current top K weights to vertex w) State[v] = list of Weight(x, v) (for current top K weights to vertex v) u u Messages = {D(x, u) = Weight(x, v) * edge(w, u)} (for weight(x, v) in State[v]) v Messages = {D(x, u) = Weight(x, w) * edge(w, u)} (for weight(x, w) in State[w]) v w w
  • 33. Agenda • • • • Intel contributions to Spark Collaboration Real world cases Summary
  • 34. Summary • Memory is King! • One stack to rule them all! • Contribute to community!