Empower Hive with Spark
DataWorks Summit
Hadoop Summit 2015
1.
© Cloudera, Inc. All rights reserved.
Empower Hive with Spark
Chao Sun, Cloudera
Chengxiang Li, Intel
2.
Outline
• Background
• Architecture and Design
• Challenges
• Current Status
• Benchmarks
3.
Background
• Apache Hive: a popular data processing tool for Hadoop
• Apache Spark: a data computing framework to succeed MapReduce
• Marrying the two can benefit users from both communities
• HIVE-7292: the most watched JIRA in Hive (170+)
4.
Community Involvement
• Efforts from both communities (Hive and Spark)
• Contributions from many organizations
5.
Design Principles
• No or limited impact on Hive’s existing code path
• Maximum code reuse
• Minimum feature customization
• Low future maintenance cost
6.
Hive Internal
[Diagram: Hive internals. Components: Hive Client (Beeline, JDBC), HiveServer2, Parser, Semantic Analyzer, Task Compiler, MetaStore, Hadoop Client, Hadoop. Flow labels: HQL, AST, Operator Tree, Task, Result.]
8.
Class Hierarchy
• TaskCompiler: MapReduceCompiler, TezCompiler
• Task: MapRedTask, TezTask
• Work: MapRedWork, TezWork
• A TaskCompiler generates Tasks; a Task is described by a Work
9.
Class Hierarchy
• TaskCompiler: MapReduceCompiler, TezCompiler, SparkCompiler
• Task: MapRedTask, TezTask, SparkTask
• Work: MapRedWork, TezWork, SparkWork
• A TaskCompiler generates Tasks; a Task is described by a Work
10.
Work – Metadata for Task
• MapRedWork contains a MapWork and a possible ReduceWork
• SparkWork contains a graph of MapWorks and ReduceWorks
Ex Query: SELECT name, sum(value) AS v FROM src GROUP BY name ORDER BY v;
[Diagram: with MapReduce the query runs as two jobs, MR Job 1 (MapWork1 → ReduceWork1) and MR Job 2 (MapWork2 → ReduceWork2); with Spark it runs as a single Spark job (MapWork1 → ReduceWork1 → ReduceWork2).]
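The difference between the two Work shapes can be sketched as a small data structure. This is a toy model, not Hive's actual Java classes: the Python types `Work`, `MapRedWork`, and `SparkWork` below are hypothetical stand-ins that only borrow the names from the slide, and the graph is built by hand for the example query.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Work:
    """A unit of map-side or reduce-side processing (toy stand-in)."""
    name: str


@dataclass
class MapRedWork:
    """One MR job: a map work plus an optional reduce work."""
    map_work: Work
    reduce_work: Optional[Work] = None


@dataclass
class SparkWork:
    """One Spark job: a directed graph of map and reduce works."""
    edges: Dict[Work, List[Work]] = field(default_factory=dict)

    def connect(self, parent: Work, child: Work) -> None:
        self.edges.setdefault(parent, []).append(child)
        self.edges.setdefault(child, [])

    def topological_order(self) -> List[Work]:
        # Kahn's algorithm: repeatedly emit works with no unprocessed parents.
        indegree = {work: 0 for work in self.edges}
        for children in self.edges.values():
            for child in children:
                indegree[child] += 1
        ready = [work for work, degree in indegree.items() if degree == 0]
        order = []
        while ready:
            work = ready.pop(0)
            order.append(work)
            for child in self.edges[work]:
                indegree[child] -= 1
                if indegree[child] == 0:
                    ready.append(child)
        return order


m1, r1 = Work("MapWork1"), Work("ReduceWork1")
m2, r2 = Work("MapWork2"), Work("ReduceWork2")

# MapReduce: GROUP BY and ORDER BY need two separate jobs.
mr_jobs = [MapRedWork(m1, r1), MapRedWork(m2, r2)]

# Spark: one job whose graph chains the second reduce after the first.
spark_work = SparkWork()
spark_work.connect(m1, r1)
spark_work.connect(r1, r2)
spark_order = [work.name for work in spark_work.topological_order()]
```

In this sketch the intermediate MapWork2 disappears because the second reduce step consumes the first reduce step's output directly inside one job.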
11.
Spark Client
• Sits alongside the MR client and Tez client
• Talks to the Spark cluster
• Handles job submission, monitoring, error reporting, statistics, metrics, and counters
• Supports local, local-cluster, standalone, yarn-cluster, and yarn-client modes
12.
Spark Context
• Core of the Spark client
• Heavyweight and not thread-safe
• Designed for a single-user application
• Doesn’t work in a multi-session environment
• Doesn’t scale with user sessions
13.
Remote Spark Context (RSC)
• Created and lives outside HiveServer2
• In yarn-cluster mode, the Spark context lives in the application master (AM)
• Otherwise, the Spark context lives in a separate process (other than HiveServer2)
[Diagram: User 1 and User 2 connect to Session 1 and Session 2 in HiveServer2; two AM (RSC) instances run on nodes of the YARN cluster (Node 1, Node 2, Node 3).]
14.
Data Processing via MapReduce
• Tables are stored as HDFS files and read by the MR framework
• Map-side processing
• Map output is shuffled by the MR framework
• Reduce-side processing
• Reduce output is written to disk as part of reduce-side processing
• Output may be further processed by the next MR job or returned to the client
15.
Data Processing via Spark
• Treat a table as a HadoopRDD (input RDD)
• Apply a transformation that wraps MR’s map-side processing
• Shuffle map output using Spark’s transformations (groupByKey, sortByKey, etc.)
• Apply a transformation that wraps MR’s reduce-side processing
• Output is either written to file or shuffled again
16.
Spark Plan
• MapInput – encapsulates a table
• MapTran – map-side processing
• ShuffleTran – shuffling
• ReduceTran – reduce-side processing
Ex Query: SELECT name, sum(value) AS v FROM src GROUP BY name ORDER BY v;
Plan: MapInput → MapTran → ShuffleTran → ReduceTran → ShuffleTran → ReduceTran
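The plan for the example query can be imitated in plain Python to show what each stage contributes. This is a sketch under stated assumptions: it does not use Spark or Hive, the rows in `src` are made up, and the function names (`map_tran`, `shuffle_group`, `reduce_sum`, `shuffle_sort`, `reduce_project`) are hypothetical labels for the MapTran, ShuffleTran, and ReduceTran steps.

```python
from collections import defaultdict

# Made-up rows standing in for table `src` as (name, value).
src = [("a", 3), ("b", 1), ("a", 2), ("c", 10), ("b", 1)]


def map_tran(rows):
    """Map-side processing: emit (key, value) pairs keyed by name."""
    return [(name, value) for name, value in rows]


def shuffle_group(pairs):
    """First shuffle: group values by key, in the spirit of groupByKey."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return list(groups.items())


def reduce_sum(groups):
    """First reduce-side step: compute sum(value) per name."""
    return [(name, sum(values)) for name, values in groups]


def shuffle_sort(rows):
    """Second shuffle: re-key by v and sort, in the spirit of sortByKey."""
    return sorted((v, name) for name, v in rows)


def reduce_project(pairs):
    """Second reduce-side step: restore the (name, v) column order."""
    return [(name, v) for v, name in pairs]


# MapInput -> MapTran -> ShuffleTran -> ReduceTran -> ShuffleTran -> ReduceTran
result = reduce_project(shuffle_sort(reduce_sum(shuffle_group(map_tran(src)))))
```

With the sample rows, `result` lists each name with its summed value in ascending order of the sum, which is what GROUP BY name ORDER BY v asks for.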
17.
Advantages
• Reuse existing map-side and reduce-side processing
• Agnostic to Spark’s special transformations or actions
• No need to reinvent the wheel
• Adopt existing features: authorization, window functions, UDFs, etc.
• Open to future features
18.
Challenges
• Missing features or functional gaps in Spark
• Concurrent data pipelines
• Spark Context issues
• Scala vs Java API
• Scheduling issues
• Large code base in Hive, many contributors working in different areas
• Library dependency conflicts among projects
19.
Dynamic Executor Scaling
• Spark cluster per user session
• Heavy user vs light user
• Big query vs small query
• Solution: scale executors up and down based on workload
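The idea of scaling executors with the workload can be sketched as a simple sizing rule. This is a hypothetical policy for illustration only, not the algorithm Spark or Hive actually uses: the function `target_executors` and its parameters are assumptions, and the bounds and task counts are made-up numbers.

```python
import math


def target_executors(pending_tasks, cores_per_executor, min_executors, max_executors):
    """Toy policy: enough executors to run every pending task at once, within bounds."""
    wanted = math.ceil(pending_tasks / cores_per_executor)
    return max(min_executors, min(max_executors, wanted))


# A small query, a big query, and an idle session under the same bounds.
small_query = target_executors(pending_tasks=6, cores_per_executor=4,
                               min_executors=1, max_executors=64)
big_query = target_executors(pending_tasks=1000, cores_per_executor=4,
                             min_executors=1, max_executors=64)
idle_session = target_executors(pending_tasks=0, cores_per_executor=4,
                                min_executors=1, max_executors=64)
```

Under this rule a light session holds only the minimum number of executors, while a heavy query grows up to the cap and releases capacity once its tasks drain.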
20.
Current Status
• All functionality is implemented
• First round of optimization is completed
• More optimization and benchmarking coming
• Beta release in CDH 5.4
• Released in Apache Hive 1.1.0 onward
• Follow HIVE-7292 for current and future work
21.
Optimizations
• Map join, bucket map join, SMB, skew join (static and dynamic)
• Split generation and grouping
• CBO, vectorization
• More to come, including table caching and dynamic partition pruning
22.
Summary
• Community driven project
• Multi-organization support
• Combining merits from multiple projects
• Benefiting a large user base
• Bearing solid foundations
• A solid, evolving project
23.
Benchmarks – Cluster Setup
• 8 physical nodes
• Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz, with 32 logical cores and 64GB memory
• 10Gb/s network between the nodes
Component   Version
Hive        Spark-branch
Spark       1.3.0
Hadoop      2.6.0
Tez         0.5.3
24.
Benchmarks – Test Configurations
• 320GB and 4TB TPC-DS datasets
• The three engines share most configurations
• Each node is allocated 32 cores and 48GB memory
• Vectorization enabled
• CBO enabled
25.
Benchmarks – Test Configurations
Hive on Spark
• spark.master = yarn-client
• spark.executor.memory = 5120m
• spark.yarn.executor.memoryOverhead = 1024
• spark.executor.cores = 4
• spark.kryo.referenceTracking = false
• spark.io.compression.codec = lzf
Hive on Tez
• hive.prewarm.numcontainers = 250
• hive.tez.auto.reducer.parallelism = true
• hive.tez.dynamic.partition.pruning = true
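The Hive on Spark settings line up with the per-node allocation of 32 cores and 48GB, as a short calculation suggests. This is a back-of-the-envelope sketch, assuming each executor occupies a YARN container of executor memory plus memory overhead and ignoring the application master's container and any YARN allocation rounding. The slides do not state how many executors were actually launched, so the counts below are upper bounds, not reported figures.

```python
# Values taken from the cluster setup and test configuration slides.
NODES = 8
NODE_CORES = 32
NODE_MEMORY_MB = 48 * 1024
EXECUTOR_MEMORY_MB = 5120      # spark.executor.memory
EXECUTOR_OVERHEAD_MB = 1024    # spark.yarn.executor.memoryOverhead
EXECUTOR_CORES = 4             # spark.executor.cores

# Assumption: one executor needs heap plus overhead in its container.
container_mb = EXECUTOR_MEMORY_MB + EXECUTOR_OVERHEAD_MB

# How many executors fit on a node by cores and by memory.
fit_by_cores = NODE_CORES // EXECUTOR_CORES
fit_by_memory = NODE_MEMORY_MB // container_mb

executors_per_node = min(fit_by_cores, fit_by_memory)
cluster_executors = NODES * executors_per_node
```

Both limits give the same per-node count, so under these assumptions neither cores nor memory is left stranded on a node.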
26.
Benchmarks – Data Collection
• We run each query 2 times and measure the 2nd run
• Spark on YARN waits for a number of executors to register before scheduling tasks, and thus has a bigger start-up overhead
• We also measure a few queries for Tez with dynamic partition pruning disabled for fair comparison, as this optimization hasn't been implemented in Hive on Spark yet
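Measuring only the second run can be sketched as a small timing helper. This is an illustrative harness, not the authors' benchmark tooling: `measure_second_run` is a hypothetical name, and `fake_query` is a stand-in for submitting a real TPC-DS query.

```python
import time


def measure_second_run(run_query, runs=2):
    """Run the query `runs` times and return the elapsed seconds of the last run."""
    elapsed = None
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        elapsed = time.perf_counter() - start
    return elapsed


calls = []


def fake_query():
    # Hypothetical stand-in for submitting a query to the engine.
    calls.append(len(calls) + 1)


second_run_seconds = measure_second_run(fake_query)
```

Discarding the first run keeps one-time costs, such as waiting for executors to register, out of the reported number.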
27.
[Chart: MR vs Spark, 320GB]
28.
[Chart: MR vs Spark vs Tez, 320GB]
29.
[Chart: MR vs Spark vs Tez, 320GB. Annotation: DPP helps Tez]
30.
Benchmarks – Dynamic Partition Pruning
• Prune partitions at runtime
• Can dramatically improve performance when tables are joined on partitioned columns
• To be implemented for Hive on Spark
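The effect of pruning partitions at runtime can be shown with a toy join. This is a sketch under stated assumptions, not how Hive implements the optimization: the tables are plain Python structures with made-up rows, the names `store_sales` and `date_dim` are used only for illustration, and the "scan count" is simply the number of partitions touched.

```python
# Fact table partitioned by date_sk: partition key -> rows (made-up data).
store_sales = {
    1: [("item1", 10.0), ("item2", 5.0)],
    2: [("item3", 7.5)],
    3: [("item1", 2.5)],
    4: [("item4", 1.0)],
}

# Dimension table rows as (date_sk, year).
date_dim = [(1, 1999), (2, 1999), (3, 2000), (4, 2001)]


def join_without_pruning(fact, dim, year):
    """Read every fact partition, then keep rows whose key survives the filter."""
    keys = {sk for sk, y in dim if y == year}
    scanned = list(fact)
    rows = [(sk, row) for sk in scanned if sk in keys for row in fact[sk]]
    return rows, len(scanned)


def join_with_pruning(fact, dim, year):
    """Evaluate the dimension filter first, then read only matching partitions."""
    keys = {sk for sk, y in dim if y == year}
    scanned = [sk for sk in fact if sk in keys]
    rows = [(sk, row) for sk in scanned for row in fact[sk]]
    return rows, len(scanned)


rows_full, scanned_full = join_without_pruning(store_sales, date_dim, 2000)
rows_pruned, scanned_pruned = join_with_pruning(store_sales, date_dim, 2000)
```

Both joins return the same rows, but the pruned version touches one partition instead of four, which is why the benefit shows up on joins over partitioned columns.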
31.
[Chart: Spark vs Tez vs Tez w/o DPP, 320GB]
32.
[Chart: MR vs Spark, 4TB]
33.
[Chart: Spark vs Tez, 4TB]
34.
[Chart: Spark vs Tez, 4TB. Annotation: DPP helps Tez]
35.
[Chart: Spark vs Tez vs Tez w/o DPP, 4TB]
36.
[Chart: Spark vs Tez, 4TB. Annotation: Spark is faster]
37.
[Chart: Spark vs Tez, 4TB. Annotation: Tez is faster]
38.
Benchmark – Summary
• In general, Spark is a few times faster than MR
• Spark is as fast as or faster than Tez on many queries
• Dynamic partition pruning helps Tez a lot on certain queries (Q3, Q15, Q19); without DPP for Tez, the two are close
• Tez is slightly faster on certain queries (common join, Q84)
• A bigger dataset seems to help Spark more
• Spark will likely be faster after DPP is implemented
39.
Questions?