SlideShare una empresa de Scribd logo
1 de 17
Descargar para leer sin conexión
How Spark Fits Baidu’s Scale
Dr. James Peng
Principal Architect, Baidu U.S.A.
Baidu is Leading Chinese Search Engine
No. 1
Website in China
By average daily visitors
and page views
No. 1
Chinese search engine
By number of search
question conducted
No. 1
Most valuable brand
Among private sector
companies in China
73
%
Search market share
in China
$73
bn.
Market capitalization
Premium global
Internet company
$2 bn.
Q1 2015 total revenues
By number of search
question conducted
47,000
Employees
Top 3 China’s
best employer
Baidu’s Big Data Infrastructure
Baidu’s History in Big Data Infrastructure
2003 2008 2009 2010 2011 2012 2013 2014
Distributed
Search
System
Hadoop in
production
Distributed
webpage storage:
storing over 100
billion web pages
Large-scale
machine learning
platform
> 10000 nodes
Distributed
computing engine
in production
Real-time
streaming system
with delay in
milliseconds range
Large-scale
deployment of in-house
developed network
switcher and ARM
servers
Large scale
DNN: supporting
over 100 billion
features
Largest MR Cluster
0
2000
4000
6000
8000
10000
12000
14000
2008 2009 2010 2011 2012 2013
MR Cluster Size
Number of Nodes
0
200000
400000
600000
800000
1000000
1200000
2008 2009 2010 2011 2012 2013
MR Cluster Daily Load
Number of Jobs
Baidu’s Distributed Shuffle Engine
http://sortbenchmark.org/BaiduSort2014.pdf
Champion of 2014 Sort Benchmark
Indy
2014, 8.38 TB/min
BaiduSort
100 TB in 716 seconds 
982 nodes x 
(2 2.10Ghz Intel Xeon E5-2450, 192 GB memory, 8x3TB
7200 RPM SATA) 
Dasheng Jiang 
Baidu Inc. and Peking University
Baidu Proprietary Platforms
DCE/DAG
•  DAG model
Distributed
Computing
Engines
•  Outperforms
open-source
MR by at
least 30%
DSTREAM
•  Distributed real-
time
computation
engine
•  99.99% of
tasks with
latency less
than 50 ms
TM
•  mini-batch
execution
engine
•  Transaction-
based,
guarantee
exactly once
delivery
PADDLE
•  PArallel
Distributed
Deep Learning
Engine
•  With CPU /
GPU support
and suitable
for vision,
speech, and
NLP
workloads
ELF
•  Essential
Learning
Framework
•  Dataflow-
based
programming
model
•  Parameter
server able to
store > 1
trillion
parameters
THERE’S STRONG, AND THEN THERE’S BAIDU STRONG
A Spark in Baidu
•  Scale: > 1000 nodes(20000 Cores, 100 TB Memory)
•  Daily Jobs:2000~3000
•  Supported Business Units: Ads, Search, Map, Commerce, etc
Initial
Evaluation
Spark
0.8
Internal
Beta
Usage
Spark
0.9
Beta
Cluster
with 200
nodes
Spark
1.0
Incubation
Integrate
into Baidu
Open
Cloud
Spark
1.1
Spark SQL
in
Production
Spark
1.2
Continue
Scaling
Spark
1.3
Production
and Scaling
Spark’s History in Baidu
Spark in Baidu Internal Infrastructure
SparkDSTREAM TM DCE/DAG ELF Compute
Layer
Baidu Data Centers + Resource Management + Resource Scheduling
Resource
Layer
Dataflow Compiler
Java C++ Python Scala SQL
Programming
Layer
Spark in Baidu Open Cloud
Baidu Cloud Cluster
HDFS
Map Reduce
Baidu Object Storage
HBase
Pig Hive SQL Streaming GraphX MLLib
BMR Console BMR SDK Audit
Scheduler
Management
http://bce.baidu.com/
Enabling Interactive Queries with
Spark and Tachyon
Need for Interaction (Speed)
Problem
•  John is a PM and he needs to keep track of the top queries submitted
to Baidu everyday
•  Based on the top queries of the day, he will perform additional analysis
•  But John is very frustrated that each query takes tens of minutes
Requirements for Baidu Interactive Query Engine
•  Manages Petabytes of data
•  Able to finish 95% of queries within 30 seconds
Query UI
Data Warehouse
Baidu File System
Storage Layer
Tachyon	
  Hot Data Layer
Baidu Interactive Query Architecture
Compute Layer Meta Store
HiveQL
Spark SQL
UDFs SerDes
Apache Spark
Baidu Interactive Query Performance
> 50X Acceleration of Big
Data Analytics Workloads
0
200
400
600
800
1000
1200
MR (sec) Spark (sec) Spark + Tachyon
(sec)
Setup:
1.  Use MR to query 6 TB of data
2.  Use Spark to query 6 TB of data
3.  Use Spark + Tachyon to query 6 TB
of data
Summary
Spark and Tachyon provides
significant performance advantage
•  Suitable for latency
sensitive workloads,
such as interactive ad-
hoc query
Takes time for Spark to reach MR
scale
•  We still use MR for
high throughput
batch workloads
•  Deployment Scale:
Spark 1000 nodes vs.
MR 13000 nodes
•  Memory Management
is one of the main
blockers of scalability
Look forward to Spark
Improvements
•  Project Tungsten:
better memory
management: main
blocker of scalability
•  Continuous
improvement of
Catalyst
•  Support for multiple
meta stores
Looking Into the Future
Spark + TachyonSystems Software
FPGA + GPU
Acceleration
Hardware Faster Network
Spark
SQL
Applications DNN Spark R
Spark
MlLib
Speed
Functionality

Más contenido relacionado

La actualidad más candente

Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...Spark Summit
 
Spark Summit San Francisco 2016 - Ali Ghodsi Keynote
Spark Summit San Francisco 2016 - Ali Ghodsi KeynoteSpark Summit San Francisco 2016 - Ali Ghodsi Keynote
Spark Summit San Francisco 2016 - Ali Ghodsi KeynoteDatabricks
 
From R Script to Production Using rsparkling with Navdeep Gill
From R Script to Production Using rsparkling with Navdeep GillFrom R Script to Production Using rsparkling with Navdeep Gill
From R Script to Production Using rsparkling with Navdeep GillDatabricks
 
Databricks @ Strata SJ
Databricks @ Strata SJDatabricks @ Strata SJ
Databricks @ Strata SJDatabricks
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataDatabricks
 
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...Databricks
 
Real-Time Analytics and Actions Across Large Data Sets with Apache Spark
Real-Time Analytics and Actions Across Large Data Sets with Apache SparkReal-Time Analytics and Actions Across Large Data Sets with Apache Spark
Real-Time Analytics and Actions Across Large Data Sets with Apache SparkDatabricks
 
Accelerating Data Science with Better Data Engineering on Databricks
Accelerating Data Science with Better Data Engineering on DatabricksAccelerating Data Science with Better Data Engineering on Databricks
Accelerating Data Science with Better Data Engineering on DatabricksDatabricks
 
CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...
CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...
CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...Databricks
 
Automated Production Ready ML at Scale
Automated Production Ready ML at ScaleAutomated Production Ready ML at Scale
Automated Production Ready ML at ScaleDatabricks
 
Apache Spark's MLlib's Past Trajectory and new Directions
Apache Spark's MLlib's Past Trajectory and new DirectionsApache Spark's MLlib's Past Trajectory and new Directions
Apache Spark's MLlib's Past Trajectory and new DirectionsDatabricks
 
Brandon obrien streaming_data
Brandon obrien streaming_dataBrandon obrien streaming_data
Brandon obrien streaming_dataNitin Kumar
 
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...Databricks
 
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...Databricks
 
Real Time Machine Learning Visualization With Spark
Real Time Machine Learning Visualization With SparkReal Time Machine Learning Visualization With Spark
Real Time Machine Learning Visualization With SparkChester Chen
 
Simplify and Scale Data Engineering Pipelines with Delta Lake
Simplify and Scale Data Engineering Pipelines with Delta LakeSimplify and Scale Data Engineering Pipelines with Delta Lake
Simplify and Scale Data Engineering Pipelines with Delta LakeDatabricks
 
Observability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageObservability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageDatabricks
 
How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...
How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...
How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...Databricks
 
R&D to Product Pipeline Using Apache Spark in AdTech: Spark Summit East talk ...
R&D to Product Pipeline Using Apache Spark in AdTech: Spark Summit East talk ...R&D to Product Pipeline Using Apache Spark in AdTech: Spark Summit East talk ...
R&D to Product Pipeline Using Apache Spark in AdTech: Spark Summit East talk ...Spark Summit
 
How Adobe uses Structured Streaming at Scale
How Adobe uses Structured Streaming at ScaleHow Adobe uses Structured Streaming at Scale
How Adobe uses Structured Streaming at ScaleDatabricks
 

La actualidad más candente (20)

Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...
Digital Attribution Modeling Using Apache Spark-(Anny Chen and William Yan, A...
 
Spark Summit San Francisco 2016 - Ali Ghodsi Keynote
Spark Summit San Francisco 2016 - Ali Ghodsi KeynoteSpark Summit San Francisco 2016 - Ali Ghodsi Keynote
Spark Summit San Francisco 2016 - Ali Ghodsi Keynote
 
From R Script to Production Using rsparkling with Navdeep Gill
From R Script to Production Using rsparkling with Navdeep GillFrom R Script to Production Using rsparkling with Navdeep Gill
From R Script to Production Using rsparkling with Navdeep Gill
 
Databricks @ Strata SJ
Databricks @ Strata SJDatabricks @ Strata SJ
Databricks @ Strata SJ
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
 
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...
 
Real-Time Analytics and Actions Across Large Data Sets with Apache Spark
Real-Time Analytics and Actions Across Large Data Sets with Apache SparkReal-Time Analytics and Actions Across Large Data Sets with Apache Spark
Real-Time Analytics and Actions Across Large Data Sets with Apache Spark
 
Accelerating Data Science with Better Data Engineering on Databricks
Accelerating Data Science with Better Data Engineering on DatabricksAccelerating Data Science with Better Data Engineering on Databricks
Accelerating Data Science with Better Data Engineering on Databricks
 
CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...
CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...
CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te...
 
Automated Production Ready ML at Scale
Automated Production Ready ML at ScaleAutomated Production Ready ML at Scale
Automated Production Ready ML at Scale
 
Apache Spark's MLlib's Past Trajectory and new Directions
Apache Spark's MLlib's Past Trajectory and new DirectionsApache Spark's MLlib's Past Trajectory and new Directions
Apache Spark's MLlib's Past Trajectory and new Directions
 
Brandon obrien streaming_data
Brandon obrien streaming_dataBrandon obrien streaming_data
Brandon obrien streaming_data
 
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
 
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
Real-time Machine Learning Analytics Using Structured Streaming and Kinesis F...
 
Real Time Machine Learning Visualization With Spark
Real Time Machine Learning Visualization With SparkReal Time Machine Learning Visualization With Spark
Real Time Machine Learning Visualization With Spark
 
Simplify and Scale Data Engineering Pipelines with Delta Lake
Simplify and Scale Data Engineering Pipelines with Delta LakeSimplify and Scale Data Engineering Pipelines with Delta Lake
Simplify and Scale Data Engineering Pipelines with Delta Lake
 
Observability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageObservability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineage
 
How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...
How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...
How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork with Than...
 
R&D to Product Pipeline Using Apache Spark in AdTech: Spark Summit East talk ...
R&D to Product Pipeline Using Apache Spark in AdTech: Spark Summit East talk ...R&D to Product Pipeline Using Apache Spark in AdTech: Spark Summit East talk ...
R&D to Product Pipeline Using Apache Spark in AdTech: Spark Summit East talk ...
 
How Adobe uses Structured Streaming at Scale
How Adobe uses Structured Streaming at ScaleHow Adobe uses Structured Streaming at Scale
How Adobe uses Structured Streaming at Scale
 

Destacado

Data Driven-Toyota Customer 360 Insights on Apache Spark and MLlib-(Brian Kur...
Data Driven-Toyota Customer 360 Insights on Apache Spark and MLlib-(Brian Kur...Data Driven-Toyota Customer 360 Insights on Apache Spark and MLlib-(Brian Kur...
Data Driven-Toyota Customer 360 Insights on Apache Spark and MLlib-(Brian Kur...Spark Summit
 
Scalable Deep Learning Platform On Spark In Baidu
Scalable Deep Learning Platform On Spark In BaiduScalable Deep Learning Platform On Spark In Baidu
Scalable Deep Learning Platform On Spark In BaiduJen Aman
 
Spark - Migration Story
Spark - Migration Story Spark - Migration Story
Spark - Migration Story Roman Chukh
 
Migration from Redshift to Spark
Migration from Redshift to SparkMigration from Redshift to Spark
Migration from Redshift to SparkSky Yin
 
Stanford splash spring 2016 basic programming
Stanford splash spring 2016 basic programmingStanford splash spring 2016 basic programming
Stanford splash spring 2016 basic programmingYu-Sheng (Yosen) Chen
 
冯宏华:H base在小米的应用与扩展
冯宏华:H base在小米的应用与扩展冯宏华:H base在小米的应用与扩展
冯宏华:H base在小米的应用与扩展hdhappy001
 
CV_Shilidong
CV_ShilidongCV_Shilidong
CV_Shilidong?? ?
 
前端规范(初稿)
前端规范(初稿)前端规范(初稿)
前端规范(初稿)EnLei-Cai
 
Dpdk Validation - Liu, Yong
Dpdk Validation - Liu, YongDpdk Validation - Liu, Yong
Dpdk Validation - Liu, Yongharryvanhaaren
 
Fast flux domain detection
Fast flux domain detectionFast flux domain detection
Fast flux domain detectionNi Zhiqiang
 
台湾趴趴走
台湾趴趴走台湾趴趴走
台湾趴趴走Limbo Wong
 
CVLinkedIn
CVLinkedInCVLinkedIn
CVLinkedInJun Ma
 
Cheng_Wang_resume
Cheng_Wang_resumeCheng_Wang_resume
Cheng_Wang_resumeCheng Wang
 
Kafka文件系统设计
Kafka文件系统设计Kafka文件系统设计
Kafka文件系统设计志涛 李
 
Wei_Zhao_Resume
Wei_Zhao_ResumeWei_Zhao_Resume
Wei_Zhao_ResumeWei Zhao
 

Destacado (20)

Data Driven-Toyota Customer 360 Insights on Apache Spark and MLlib-(Brian Kur...
Data Driven-Toyota Customer 360 Insights on Apache Spark and MLlib-(Brian Kur...Data Driven-Toyota Customer 360 Insights on Apache Spark and MLlib-(Brian Kur...
Data Driven-Toyota Customer 360 Insights on Apache Spark and MLlib-(Brian Kur...
 
Scalable Deep Learning Platform On Spark In Baidu
Scalable Deep Learning Platform On Spark In BaiduScalable Deep Learning Platform On Spark In Baidu
Scalable Deep Learning Platform On Spark In Baidu
 
Spark - Migration Story
Spark - Migration Story Spark - Migration Story
Spark - Migration Story
 
Migration from Redshift to Spark
Migration from Redshift to SparkMigration from Redshift to Spark
Migration from Redshift to Spark
 
Stanford splash spring 2016 basic programming
Stanford splash spring 2016 basic programmingStanford splash spring 2016 basic programming
Stanford splash spring 2016 basic programming
 
Cv 12112015
Cv 12112015Cv 12112015
Cv 12112015
 
冯宏华:H base在小米的应用与扩展
冯宏华:H base在小米的应用与扩展冯宏华:H base在小米的应用与扩展
冯宏华:H base在小米的应用与扩展
 
CV_Shilidong
CV_ShilidongCV_Shilidong
CV_Shilidong
 
前端规范(初稿)
前端规范(初稿)前端规范(初稿)
前端规范(初稿)
 
Lichang Wang_CV
Lichang Wang_CVLichang Wang_CV
Lichang Wang_CV
 
Sara Saile cv
Sara Saile cv Sara Saile cv
Sara Saile cv
 
Paradigm Shifts
Paradigm ShiftsParadigm Shifts
Paradigm Shifts
 
Dpdk Validation - Liu, Yong
Dpdk Validation - Liu, YongDpdk Validation - Liu, Yong
Dpdk Validation - Liu, Yong
 
Fast flux domain detection
Fast flux domain detectionFast flux domain detection
Fast flux domain detection
 
dpdp
dpdpdpdp
dpdp
 
台湾趴趴走
台湾趴趴走台湾趴趴走
台湾趴趴走
 
CVLinkedIn
CVLinkedInCVLinkedIn
CVLinkedIn
 
Cheng_Wang_resume
Cheng_Wang_resumeCheng_Wang_resume
Cheng_Wang_resume
 
Kafka文件系统设计
Kafka文件系统设计Kafka文件系统设计
Kafka文件系统设计
 
Wei_Zhao_Resume
Wei_Zhao_ResumeWei_Zhao_Resume
Wei_Zhao_Resume
 

Similar a How Spark Fits into Baidu's Scale-(James Peng, Baidu)

Data Culture Series - Keynote - 3rd Dec
Data Culture Series - Keynote - 3rd DecData Culture Series - Keynote - 3rd Dec
Data Culture Series - Keynote - 3rd DecJonathan Woodward
 
Customer migration to azure sql database from on-premises SQL, for a SaaS app...
Customer migration to azure sql database from on-premises SQL, for a SaaS app...Customer migration to azure sql database from on-premises SQL, for a SaaS app...
Customer migration to azure sql database from on-premises SQL, for a SaaS app...George Walters
 
Webinar: NoSQL as the New Normal
Webinar: NoSQL as the New NormalWebinar: NoSQL as the New Normal
Webinar: NoSQL as the New NormalMongoDB
 
Google Developer Group - Cloud Singapore BigQuery Webinar
Google Developer Group - Cloud Singapore BigQuery WebinarGoogle Developer Group - Cloud Singapore BigQuery Webinar
Google Developer Group - Cloud Singapore BigQuery WebinarRasel Rana
 
Google BigQuery is the future of Analytics! (Google Developer Conference)
Google BigQuery is the future of Analytics! (Google Developer Conference)Google BigQuery is the future of Analytics! (Google Developer Conference)
Google BigQuery is the future of Analytics! (Google Developer Conference)Rasel Rana
 
J1 - Keynote Data Platform - Rohan Kumar
J1 - Keynote Data Platform - Rohan KumarJ1 - Keynote Data Platform - Rohan Kumar
J1 - Keynote Data Platform - Rohan KumarMS Cloud Summit
 
Google for モバイル アプリ 16:00: モバイル kpi 分析の新標準 fluentd + google big query
Google for モバイル アプリ   16:00: モバイル kpi 分析の新標準 fluentd + google big queryGoogle for モバイル アプリ   16:00: モバイル kpi 分析の新標準 fluentd + google big query
Google for モバイル アプリ 16:00: モバイル kpi 分析の新標準 fluentd + google big queryGoogle Cloud Platform - Japan
 
High-performance database technology for rock-solid IoT solutions
High-performance database technology for rock-solid IoT solutionsHigh-performance database technology for rock-solid IoT solutions
High-performance database technology for rock-solid IoT solutionsClusterpoint
 
Critical Breakthroughs and Challenges in Big Data and Analytics
Critical Breakthroughs and Challenges in Big Data and AnalyticsCritical Breakthroughs and Challenges in Big Data and Analytics
Critical Breakthroughs and Challenges in Big Data and AnalyticsData Driven Innovation
 
Cherokee nation 2 day AIAD & DIAD - App in a day and Dashboard in day
Cherokee nation 2 day AIAD & DIAD - App in a day and Dashboard in dayCherokee nation 2 day AIAD & DIAD - App in a day and Dashboard in day
Cherokee nation 2 day AIAD & DIAD - App in a day and Dashboard in dayVishal Pawar
 
IoT NY - Google Cloud Services for IoT
IoT NY - Google Cloud Services for IoTIoT NY - Google Cloud Services for IoT
IoT NY - Google Cloud Services for IoTJames Chittenden
 
SharePoint and Large Scale SQL Deployments - NZSPC
SharePoint and Large Scale SQL Deployments - NZSPCSharePoint and Large Scale SQL Deployments - NZSPC
SharePoint and Large Scale SQL Deployments - NZSPCguest7c2e070
 
Large Scale SharePoint SQL Deployments
Large Scale SharePoint SQL DeploymentsLarge Scale SharePoint SQL Deployments
Large Scale SharePoint SQL DeploymentsJoel Oleson
 
Choosing the Right Database: Exploring MySQL Alternatives for Modern Applicat...
Choosing the Right Database: Exploring MySQL Alternatives for Modern Applicat...Choosing the Right Database: Exploring MySQL Alternatives for Modern Applicat...
Choosing the Right Database: Exploring MySQL Alternatives for Modern Applicat...Mydbops
 
Dot netcampus2015 francescosodano-sharepoint2016whatsnew
Dot netcampus2015 francescosodano-sharepoint2016whatsnewDot netcampus2015 francescosodano-sharepoint2016whatsnew
Dot netcampus2015 francescosodano-sharepoint2016whatsnewDotNetCampus
 
SHAREPOINT 2016 - WHAT'S NEW
SHAREPOINT 2016 - WHAT'S NEWSHAREPOINT 2016 - WHAT'S NEW
SHAREPOINT 2016 - WHAT'S NEWDotNetCampus
 
An afternoon with mongo db new delhi
An afternoon with mongo db new delhiAn afternoon with mongo db new delhi
An afternoon with mongo db new delhiRajnish Verma
 
Microsoft SQL Server - Parallel Data Warehouse Presentation
Microsoft SQL Server - Parallel Data Warehouse PresentationMicrosoft SQL Server - Parallel Data Warehouse Presentation
Microsoft SQL Server - Parallel Data Warehouse PresentationMicrosoft Private Cloud
 
Java/Scala Lab: Anton Vidishchev - Microsoft Azure как облачная платформа для...
Java/Scala Lab: Anton Vidishchev - Microsoft Azure как облачная платформа для...Java/Scala Lab: Anton Vidishchev - Microsoft Azure как облачная платформа для...
Java/Scala Lab: Anton Vidishchev - Microsoft Azure как облачная платформа для...GeeksLab Odessa
 

Similar a How Spark Fits into Baidu's Scale-(James Peng, Baidu) (20)

Data Culture Series - Keynote - 3rd Dec
Data Culture Series - Keynote - 3rd DecData Culture Series - Keynote - 3rd Dec
Data Culture Series - Keynote - 3rd Dec
 
Customer migration to azure sql database from on-premises SQL, for a SaaS app...
Customer migration to azure sql database from on-premises SQL, for a SaaS app...Customer migration to azure sql database from on-premises SQL, for a SaaS app...
Customer migration to azure sql database from on-premises SQL, for a SaaS app...
 
Webinar: NoSQL as the New Normal
Webinar: NoSQL as the New NormalWebinar: NoSQL as the New Normal
Webinar: NoSQL as the New Normal
 
Google Developer Group - Cloud Singapore BigQuery Webinar
Google Developer Group - Cloud Singapore BigQuery WebinarGoogle Developer Group - Cloud Singapore BigQuery Webinar
Google Developer Group - Cloud Singapore BigQuery Webinar
 
Google BigQuery is the future of Analytics! (Google Developer Conference)
Google BigQuery is the future of Analytics! (Google Developer Conference)Google BigQuery is the future of Analytics! (Google Developer Conference)
Google BigQuery is the future of Analytics! (Google Developer Conference)
 
J1 - Keynote Data Platform - Rohan Kumar
J1 - Keynote Data Platform - Rohan KumarJ1 - Keynote Data Platform - Rohan Kumar
J1 - Keynote Data Platform - Rohan Kumar
 
Google for モバイル アプリ 16:00: モバイル kpi 分析の新標準 fluentd + google big query
Google for モバイル アプリ   16:00: モバイル kpi 分析の新標準 fluentd + google big queryGoogle for モバイル アプリ   16:00: モバイル kpi 分析の新標準 fluentd + google big query
Google for モバイル アプリ 16:00: モバイル kpi 分析の新標準 fluentd + google big query
 
High-performance database technology for rock-solid IoT solutions
High-performance database technology for rock-solid IoT solutionsHigh-performance database technology for rock-solid IoT solutions
High-performance database technology for rock-solid IoT solutions
 
Ibm db2 big sql
Ibm db2 big sqlIbm db2 big sql
Ibm db2 big sql
 
Critical Breakthroughs and Challenges in Big Data and Analytics
Critical Breakthroughs and Challenges in Big Data and AnalyticsCritical Breakthroughs and Challenges in Big Data and Analytics
Critical Breakthroughs and Challenges in Big Data and Analytics
 
Cherokee nation 2 day AIAD & DIAD - App in a day and Dashboard in day
Cherokee nation 2 day AIAD & DIAD - App in a day and Dashboard in dayCherokee nation 2 day AIAD & DIAD - App in a day and Dashboard in day
Cherokee nation 2 day AIAD & DIAD - App in a day and Dashboard in day
 
IoT NY - Google Cloud Services for IoT
IoT NY - Google Cloud Services for IoTIoT NY - Google Cloud Services for IoT
IoT NY - Google Cloud Services for IoT
 
SharePoint and Large Scale SQL Deployments - NZSPC
SharePoint and Large Scale SQL Deployments - NZSPCSharePoint and Large Scale SQL Deployments - NZSPC
SharePoint and Large Scale SQL Deployments - NZSPC
 
Large Scale SharePoint SQL Deployments
Large Scale SharePoint SQL DeploymentsLarge Scale SharePoint SQL Deployments
Large Scale SharePoint SQL Deployments
 
Choosing the Right Database: Exploring MySQL Alternatives for Modern Applicat...
Choosing the Right Database: Exploring MySQL Alternatives for Modern Applicat...Choosing the Right Database: Exploring MySQL Alternatives for Modern Applicat...
Choosing the Right Database: Exploring MySQL Alternatives for Modern Applicat...
 
Dot netcampus2015 francescosodano-sharepoint2016whatsnew
Dot netcampus2015 francescosodano-sharepoint2016whatsnewDot netcampus2015 francescosodano-sharepoint2016whatsnew
Dot netcampus2015 francescosodano-sharepoint2016whatsnew
 
SHAREPOINT 2016 - WHAT'S NEW
SHAREPOINT 2016 - WHAT'S NEWSHAREPOINT 2016 - WHAT'S NEW
SHAREPOINT 2016 - WHAT'S NEW
 
An afternoon with mongo db new delhi
An afternoon with mongo db new delhiAn afternoon with mongo db new delhi
An afternoon with mongo db new delhi
 
Microsoft SQL Server - Parallel Data Warehouse Presentation
Microsoft SQL Server - Parallel Data Warehouse PresentationMicrosoft SQL Server - Parallel Data Warehouse Presentation
Microsoft SQL Server - Parallel Data Warehouse Presentation
 
Java/Scala Lab: Anton Vidishchev - Microsoft Azure как облачная платформа для...
Java/Scala Lab: Anton Vidishchev - Microsoft Azure как облачная платформа для...Java/Scala Lab: Anton Vidishchev - Microsoft Azure как облачная платформа для...
Java/Scala Lab: Anton Vidishchev - Microsoft Azure как облачная платформа для...
 

Más de Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraSpark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovSpark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
 

Más de Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Último

Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxUnduhUnggah1
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一F La
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一F La
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 

Último (20)

Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docx
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
办理(UWIC毕业证书)英国卡迪夫城市大学毕业证成绩单原版一比一
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
办理(Vancouver毕业证书)加拿大温哥华岛大学毕业证成绩单原版一比一
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 

How Spark Fits into Baidu's Scale-(James Peng, Baidu)

  • 1. How Spark Fits Baidu’s Scale Dr. James Peng Principal Architect, Baidu U.S.A.
  • 2. Baidu is Leading Chinese Search Engine No. 1 Website in China By average daily visitors and page views No. 1 Chinese search engine By number of search question conducted No. 1 Most valuable brand Among private sector companies in China 73 % Search market share in China $73 bn. Market capitalization Premium global Internet company $2 bn. Q1 2015 total revenues By number of search question conducted 47,000 Employees Top 3 China’s best employer
  • 3. Baidu’s Big Data Infrastructure
  • 4. Baidu’s History in Big Data Infrastructure 2003 2008 2009 2010 2011 2012 2013 2014 Distributed Search System Hadoop in production Distributed webpage storage: storing over 100 billion web pages Large-scale machine learning platform > 10000 nodes Distributed computing engine in production Real-time streaming system with delay in milliseconds range Large-scale deployment of in-house developed network switcher and ARM servers Large scale DNN: supporting over 100 billion features
  • 5. Largest MR Cluster 0 2000 4000 6000 8000 10000 12000 14000 2008 2009 2010 2011 2012 2013 MR Cluster Size Number of Nodes 0 200000 400000 600000 800000 1000000 1200000 2008 2009 2010 2011 2012 2013 MR Cluster Daily Load Number of Jobs
  • 6. Baidu’s Distributed Shuffle Engine http://sortbenchmark.org/BaiduSort2014.pdf Champion of 2014 Sort Benchmark Indy 2014, 8.38 TB/min BaiduSort 100 TB in 716 seconds  982 nodes x  (2 2.10Ghz Intel Xeon E5-2450, 192 GB memory, 8x3TB 7200 RPM SATA)  Dasheng Jiang  Baidu Inc. and Peking University
  • 7. Baidu Proprietary Platforms DCE/DAG •  DAG model Distributed Computing Engines •  Outperforms open-source MR by at least 30% DSTREAM •  Distributed real- time computation engine •  99.99% of tasks with latency less than 50 ms TM •  mini-batch execution engine •  Transaction- based, guarantee exactly once delivery PADDLE •  PArallel Distributed Deep Learning Engine •  With CPU / GPU support and suitable for vision, speech, and NLP workloads ELF •  Essential Learning Framework •  Dataflow- based programming model •  Parameter server able to store > 1 trillion parameters THERE’S STRONG, AND THEN THERE’S BAIDU STRONG
  • 8. A Spark in Baidu
  • 9. •  Scale: > 1000 nodes(20000 Cores, 100 TB Memory) •  Daily Jobs:2000~3000 •  Supported Business Units: Ads, Search, Map, Commerce, etc Initial Evaluation Spark 0.8 Internal Beta Usage Spark 0.9 Beta Cluster with 200 nodes Spark 1.0 Incubation Integrate into Baidu Open Cloud Spark 1.1 Spark SQL in Production Spark 1.2 Continue Scaling Spark 1.3 Production and Scaling Spark’s History in Baidu
  • 10. Spark in Baidu Internal Infrastructure SparkDSTREAM TM DCE/DAG ELF Compute Layer Baidu Data Centers + Resource Management + Resource Scheduling Resource Layer Dataflow Compiler Java C++ Python Scala SQL Programming Layer
  • 11. Spark in Baidu Open Cloud Baidu Cloud Cluster HDFS Map Reduce Baidu Object Storage HBase Pig Hive SQL Streaming GraphX MLLib BMR Console BMR SDK Audit Scheduler Management http://bce.baidu.com/
  • 12. Enabling Interactive Queries with Spark and Tachyon
  • 13. Need for Interaction (Speed) Problem •  John is a PM and he needs to keep track of the top queries submitted to Baidu everyday •  Based on the top queries of the day, he will perform additional analysis •  But John is very frustrated that each query takes tens of minutes Requirements for Baidu Interactive Query Engine •  Manages Petabytes of data •  Able to finish 95% of queries within 30 seconds
  • 14. Query UI Data Warehouse Baidu File System Storage Layer Tachyon  Hot Data Layer Baidu Interactive Query Architecture Compute Layer Meta Store HiveQL Spark SQL UDFs SerDes Apache Spark
  • 15. Baidu Interactive Query Performance > 50X Acceleration of Big Data Analytics Workloads 0 200 400 600 800 1000 1200 MR (sec) Spark (sec) Spark + Tachyon (sec) Setup: 1.  Use MR to query 6 TB of data 2.  Use Spark to query 6 TB of data 3.  Use Spark + Tachyon to query 6 TB of data
  • 16. Summary Spark and Tachyon provides significant performance advantage •  Suitable for latency sensitive workloads, such as interactive ad- hoc query Takes time for Spark to reach MR scale •  We still use MR for high throughput batch workloads •  Deployment Scale: Spark 1000 nodes vs. MR 13000 nodes •  Memory Management is one of the main blockers of scalability Look forward to Spark Improvements •  Project Tungsten: better memory management: main blocker of scalability •  Continuous improvement of Catalyst •  Support for multiple meta stores
  • 17. Looking Into the Future Spark + TachyonSystems Software FPGA + GPU Acceleration Hardware Faster Network Spark SQL Applications DNN Spark R Spark MlLib Speed Functionality