SlideShare una empresa de Scribd logo
1 de 18
Big Data Benchmarks
Srinivasa Rao Aravilli
N Venkata Naga Ravi
2
Why ..
 Evaluating the effect of a hardware/software
upgrade:
 OS, Java VM,. . .
 Hadoop, Cloudera CDH, Pig, Hive, Impala,.
. .
 Debugging:
 Compare with other clusters or published
results.
 Performance tuning
3
Industry Standard benchmarking organizations
• TPC - Transaction Processing Performance Council (http://www.tpc.org/ )
• SPEC - The Standard Performance Evaluation Corporation
(https://www.spec.org/ )
• CLDS – Centre for Large- scale Data System Research
(http://clds.sdsc.edu/bdbc)
• Top Outcomes
• BigData Top100 - an end-to-end application-layer benchmark for big data
applications
• Terasort - Functional benchmark focusing on Sort function ( quicksort using
MapReduce)
• Hibench
• Sort, Machine learning ( K-means clustering, Classification)
4
Types of Benchmark
• Micro-benchmarks. To evaluate specific lower-level, system operations
• E.g., A Micro-benchmark Suite for Evaluating HDFS Operations on
Modern Clusters, Panda et al, OSU
• Functional / component benchmarks. Specific high-level function.
• E.g. Sorting: Terasort
• E.g. Basic SQL: Individual SQL operations, e.g. Select, Project, Join,
Order-By, ...
• Application-level benchmarks.
• Measure system performance (hardware and software) for a given
application scenario—with given data and workload
5
Terasort using Hadoop
Terasort includes 3 MapReduce Applications
• Teragen – generates the data
• Terasort – samples the input data and uses them with MapReduce to
sort the data
• Teravalidate – validates the output data is sorted
6
MapReduce for Teragen
7
Map Reduce Modelloser look at MapReduce’s implementation model
source: http:/ / developer.yahoo.com/ hadoop/ tutorial/ module4.html”
8
Benchmarking Suite
• HiBench, Yan Li, Intel (https://github.com/intel-hadoop/HiBench)
• YCSB -Yahoo Cloud Serving Benchmark, Brian Cooper, Yahoo!
(https://github.com/brianfrankcooper/YCSB/)
• Berkeley Big Data Benchmark, Pavlo et al., AMPLab
(https://amplab.cs.berkeley.edu/benchmark/)
• BigDataBench, Jianfeng Zhan, Chinese Academy of Sciences
(http://prof.ict.ac.cn/BigDataBench/)
• Grid Mix (http://hadoop.apache.org/docs/r1.2.1/gridmix.html)
• Big Bench (https://github.com/intel-hadoop/Big-Bench)
• TPCx-HS (http://www.tpc.org/tpcx-hs/ )
9
TPCx-HS benchmarks
X: Express H: Hadoop S: Sort
• TPCx-HS was developed to provide an
objective measure of hardware,
operating system and commercial
Apache Hadoop File System API
compatible software distributions, and
to provide the industry with verifiable
performance, price-performance and
availability metrics.
• http://www.tpc.org/tpcx-hs/
10
TPCx HS Demo
11
TPCx-HS benchmarks
Scale Factor
The TPCx-HS follows a stepped size model. Scale factor (SF) used for the test
dataset must be chosen from the set of fixed Scale Factors defined as :
• 1TB, 3TB, 10TB, 30TB, 100TB, 300TB, 1000TB, 3000TB, 10000TB.
• The corresponding number of records are
• 10B, 30B, 100B, 300B, 1000B, 3000B, 10000B, 30000B, 100000B,
where each record is 100 bytes generated by HSGen.
• http://www.tpc.org/tpcx-hs/
12
TPCx-HS benchmarks - Metrics
13
TPCx-HS Results on Cisco UCS
Cisco Published Results
14
Comparison of various Benchmarks Suites.
15
16
Spark Performance
17
Spark sorted the same data 3X faster using 10X fewer machines. All the sorting
took place on disk (HDFS), without using Spark’s in-memory cache.
18
Sort Bench Mark http://sortbenchmark.org/
• GraySort
• MinuteSort
• CloudSort
• JouleSort
• PennySort
• TeraByteSort
• DatamationSort

Más contenido relacionado

La actualidad más candente

Getting started with Splunk - Break out Session
Getting started with Splunk - Break out SessionGetting started with Splunk - Break out Session
Getting started with Splunk - Break out SessionGeorg Knon
 
Introduction to Apache NiFi 1.11.4
Introduction to Apache NiFi 1.11.4Introduction to Apache NiFi 1.11.4
Introduction to Apache NiFi 1.11.4Timothy Spann
 
Nginx Reverse Proxy with Kafka.pptx
Nginx Reverse Proxy with Kafka.pptxNginx Reverse Proxy with Kafka.pptx
Nginx Reverse Proxy with Kafka.pptxwonyong hwang
 
Real-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiReal-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiManish Gupta
 
Scaling Machine Learning with Apache Spark
Scaling Machine Learning with Apache SparkScaling Machine Learning with Apache Spark
Scaling Machine Learning with Apache SparkDatabricks
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceDatabricks
 
Spark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike PercySpark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike PercySpark Summit
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkDatabricks
 
Data Federation with Apache Spark
Data Federation with Apache SparkData Federation with Apache Spark
Data Federation with Apache SparkDataWorks Summit
 
Big data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle DatabaseBig data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle DatabaseMartin Toshev
 
Spark overview
Spark overviewSpark overview
Spark overviewLisa Hua
 
Stephan Ewen - Experiences running Flink at Very Large Scale
Stephan Ewen -  Experiences running Flink at Very Large ScaleStephan Ewen -  Experiences running Flink at Very Large Scale
Stephan Ewen - Experiences running Flink at Very Large ScaleVerverica
 
Hadoop Meetup Jan 2019 - Dynamometer and a Case Study in NameNode GC
Hadoop Meetup Jan 2019 - Dynamometer and a Case Study in NameNode GCHadoop Meetup Jan 2019 - Dynamometer and a Case Study in NameNode GC
Hadoop Meetup Jan 2019 - Dynamometer and a Case Study in NameNode GCErik Krogen
 
Running Apache NiFi with Apache Spark : Integration Options
Running Apache NiFi with Apache Spark : Integration OptionsRunning Apache NiFi with Apache Spark : Integration Options
Running Apache NiFi with Apache Spark : Integration OptionsTimothy Spann
 
Cisco Live! :: Content Delivery Networks (CDN)
Cisco Live! :: Content Delivery Networks (CDN)Cisco Live! :: Content Delivery Networks (CDN)
Cisco Live! :: Content Delivery Networks (CDN)Bruno Teixeira
 
Apache NiFi: A Drag and Drop Approach
Apache NiFi: A Drag and Drop ApproachApache NiFi: A Drag and Drop Approach
Apache NiFi: A Drag and Drop ApproachCalculated Systems
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesWalaa Hamdy Assy
 

La actualidad más candente (20)

Getting started with Splunk - Break out Session
Getting started with Splunk - Break out SessionGetting started with Splunk - Break out Session
Getting started with Splunk - Break out Session
 
Apache Nifi Crash Course
Apache Nifi Crash CourseApache Nifi Crash Course
Apache Nifi Crash Course
 
Introduction to Apache NiFi 1.11.4
Introduction to Apache NiFi 1.11.4Introduction to Apache NiFi 1.11.4
Introduction to Apache NiFi 1.11.4
 
Apache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop Ecosystem Apache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop Ecosystem
 
Nginx Reverse Proxy with Kafka.pptx
Nginx Reverse Proxy with Kafka.pptxNginx Reverse Proxy with Kafka.pptx
Nginx Reverse Proxy with Kafka.pptx
 
Nifi workshop
Nifi workshopNifi workshop
Nifi workshop
 
Real-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiReal-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFi
 
Scaling Machine Learning with Apache Spark
Scaling Machine Learning with Apache SparkScaling Machine Learning with Apache Spark
Scaling Machine Learning with Apache Spark
 
Introducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data ScienceIntroducing DataFrames in Spark for Large Scale Data Science
Introducing DataFrames in Spark for Large Scale Data Science
 
Spark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike PercySpark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike Percy
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
 
Data Federation with Apache Spark
Data Federation with Apache SparkData Federation with Apache Spark
Data Federation with Apache Spark
 
Big data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle DatabaseBig data processing with Apache Spark and Oracle Database
Big data processing with Apache Spark and Oracle Database
 
Spark overview
Spark overviewSpark overview
Spark overview
 
Stephan Ewen - Experiences running Flink at Very Large Scale
Stephan Ewen -  Experiences running Flink at Very Large ScaleStephan Ewen -  Experiences running Flink at Very Large Scale
Stephan Ewen - Experiences running Flink at Very Large Scale
 
Hadoop Meetup Jan 2019 - Dynamometer and a Case Study in NameNode GC
Hadoop Meetup Jan 2019 - Dynamometer and a Case Study in NameNode GCHadoop Meetup Jan 2019 - Dynamometer and a Case Study in NameNode GC
Hadoop Meetup Jan 2019 - Dynamometer and a Case Study in NameNode GC
 
Running Apache NiFi with Apache Spark : Integration Options
Running Apache NiFi with Apache Spark : Integration OptionsRunning Apache NiFi with Apache Spark : Integration Options
Running Apache NiFi with Apache Spark : Integration Options
 
Cisco Live! :: Content Delivery Networks (CDN)
Cisco Live! :: Content Delivery Networks (CDN)Cisco Live! :: Content Delivery Networks (CDN)
Cisco Live! :: Content Delivery Networks (CDN)
 
Apache NiFi: A Drag and Drop Approach
Apache NiFi: A Drag and Drop ApproachApache NiFi: A Drag and Drop Approach
Apache NiFi: A Drag and Drop Approach
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
 

Destacado

Big Data Benchmarking Tutorial
Big Data Benchmarking TutorialBig Data Benchmarking Tutorial
Big Data Benchmarking TutorialTilmann Rabl
 
Hadoop & Big Data benchmarking
Hadoop & Big Data benchmarkingHadoop & Big Data benchmarking
Hadoop & Big Data benchmarkingBart Vandewoestyne
 
Benchmarking Hadoop and Big Data
Benchmarking Hadoop and Big DataBenchmarking Hadoop and Big Data
Benchmarking Hadoop and Big DataNicolas Poggi
 
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapRHadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapRDouglas Bernardini
 
TestDFSIO
TestDFSIOTestDFSIO
TestDFSIOhhyin
 
Hadoop Summit 2010 Benchmarking And Optimizing Hadoop
Hadoop Summit 2010 Benchmarking And Optimizing HadoopHadoop Summit 2010 Benchmarking And Optimizing Hadoop
Hadoop Summit 2010 Benchmarking And Optimizing HadoopYahoo Developer Network
 
Performance Testing of Big Data Applications - Impetus Webcast
Performance Testing of Big Data Applications - Impetus WebcastPerformance Testing of Big Data Applications - Impetus Webcast
Performance Testing of Big Data Applications - Impetus WebcastImpetus Technologies
 
Crafting bigdatabenchmarks
Crafting bigdatabenchmarksCrafting bigdatabenchmarks
Crafting bigdatabenchmarksTilmann Rabl
 
Ycsb benchmarking
Ycsb benchmarkingYcsb benchmarking
Ycsb benchmarkingSqrrl
 
Terasort
TerasortTerasort
Terasorthhyin
 
Java Agile ALM: OTAP and DevOps in the Cloud
Java Agile ALM: OTAP and DevOps in the CloudJava Agile ALM: OTAP and DevOps in the Cloud
Java Agile ALM: OTAP and DevOps in the CloudMongoDB
 
Linux io-stack-diagram v1.0
Linux io-stack-diagram v1.0Linux io-stack-diagram v1.0
Linux io-stack-diagram v1.0bsd free
 
Hadoop in the Clouds, Virtualization and Virtual Machines
Hadoop in the Clouds, Virtualization and Virtual MachinesHadoop in the Clouds, Virtualization and Virtual Machines
Hadoop in the Clouds, Virtualization and Virtual MachinesDataWorks Summit
 
Business Intelligence on Hadoop Benchmark
Business Intelligence on Hadoop BenchmarkBusiness Intelligence on Hadoop Benchmark
Business Intelligence on Hadoop Benchmarkatscaleinc
 
Hortonworks.Cluster Config Guide
Hortonworks.Cluster Config GuideHortonworks.Cluster Config Guide
Hortonworks.Cluster Config GuideDouglas Bernardini
 
Yahoo Cloud Serving Benchmark
Yahoo Cloud Serving BenchmarkYahoo Cloud Serving Benchmark
Yahoo Cloud Serving Benchmarkkevin han
 
Hortonworks Technical Workshop - HDP Search
Hortonworks Technical Workshop - HDP Search Hortonworks Technical Workshop - HDP Search
Hortonworks Technical Workshop - HDP Search Hortonworks
 
Spark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni SchieferSpark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni SchieferSpark Summit
 

Destacado (20)

Big Data Benchmarking Tutorial
Big Data Benchmarking TutorialBig Data Benchmarking Tutorial
Big Data Benchmarking Tutorial
 
Hadoop & Big Data benchmarking
Hadoop & Big Data benchmarkingHadoop & Big Data benchmarking
Hadoop & Big Data benchmarking
 
Benchmarking Hadoop and Big Data
Benchmarking Hadoop and Big DataBenchmarking Hadoop and Big Data
Benchmarking Hadoop and Big Data
 
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapRHadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
 
TestDFSIO
TestDFSIOTestDFSIO
TestDFSIO
 
TeraSort
TeraSortTeraSort
TeraSort
 
Hadoop Summit 2010 Benchmarking And Optimizing Hadoop
Hadoop Summit 2010 Benchmarking And Optimizing HadoopHadoop Summit 2010 Benchmarking And Optimizing Hadoop
Hadoop Summit 2010 Benchmarking And Optimizing Hadoop
 
Microservices with Docker
Microservices with Docker Microservices with Docker
Microservices with Docker
 
Performance Testing of Big Data Applications - Impetus Webcast
Performance Testing of Big Data Applications - Impetus WebcastPerformance Testing of Big Data Applications - Impetus Webcast
Performance Testing of Big Data Applications - Impetus Webcast
 
Crafting bigdatabenchmarks
Crafting bigdatabenchmarksCrafting bigdatabenchmarks
Crafting bigdatabenchmarks
 
Ycsb benchmarking
Ycsb benchmarkingYcsb benchmarking
Ycsb benchmarking
 
Terasort
TerasortTerasort
Terasort
 
Java Agile ALM: OTAP and DevOps in the Cloud
Java Agile ALM: OTAP and DevOps in the CloudJava Agile ALM: OTAP and DevOps in the Cloud
Java Agile ALM: OTAP and DevOps in the Cloud
 
Linux io-stack-diagram v1.0
Linux io-stack-diagram v1.0Linux io-stack-diagram v1.0
Linux io-stack-diagram v1.0
 
Hadoop in the Clouds, Virtualization and Virtual Machines
Hadoop in the Clouds, Virtualization and Virtual MachinesHadoop in the Clouds, Virtualization and Virtual Machines
Hadoop in the Clouds, Virtualization and Virtual Machines
 
Business Intelligence on Hadoop Benchmark
Business Intelligence on Hadoop BenchmarkBusiness Intelligence on Hadoop Benchmark
Business Intelligence on Hadoop Benchmark
 
Hortonworks.Cluster Config Guide
Hortonworks.Cluster Config GuideHortonworks.Cluster Config Guide
Hortonworks.Cluster Config Guide
 
Yahoo Cloud Serving Benchmark
Yahoo Cloud Serving BenchmarkYahoo Cloud Serving Benchmark
Yahoo Cloud Serving Benchmark
 
Hortonworks Technical Workshop - HDP Search
Hortonworks Technical Workshop - HDP Search Hortonworks Technical Workshop - HDP Search
Hortonworks Technical Workshop - HDP Search
 
Spark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni SchieferSpark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni Schiefer
 

Similar a Big Data Benchmarking

Cloudera Federal Forum 2014: The Evolution of Machine Learning from Science t...
Cloudera Federal Forum 2014: The Evolution of Machine Learning from Science t...Cloudera Federal Forum 2014: The Evolution of Machine Learning from Science t...
Cloudera Federal Forum 2014: The Evolution of Machine Learning from Science t...Cloudera, Inc.
 
Cloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsCloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsGeoffrey Fox
 
Cloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsCloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsGeoffrey Fox
 
sudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJAsudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJANicolas Poggi
 
Testing Big Data: Automated Testing of Hadoop with QuerySurge
Testing Big Data: Automated  Testing of Hadoop with QuerySurgeTesting Big Data: Automated  Testing of Hadoop with QuerySurge
Testing Big Data: Automated Testing of Hadoop with QuerySurgeRTTS
 
Etl with apache impala by athemaster
Etl with apache impala by athemasterEtl with apache impala by athemaster
Etl with apache impala by athemasterAthemaster Co., Ltd.
 
Introduction to Mahout and Machine Learning
Introduction to Mahout and Machine LearningIntroduction to Mahout and Machine Learning
Introduction to Mahout and Machine LearningVarad Meru
 
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time LearningDesign Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time LearningSwiss Big Data User Group
 
Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakHakka Labs
 
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...Big Data Value Association
 
Hannes end-of-the-router-tnc17
Hannes end-of-the-router-tnc17Hannes end-of-the-router-tnc17
Hannes end-of-the-router-tnc17Hannes Gredler
 
Testing Big Data: Automated ETL Testing of Hadoop
Testing Big Data: Automated ETL Testing of HadoopTesting Big Data: Automated ETL Testing of Hadoop
Testing Big Data: Automated ETL Testing of HadoopRTTS
 
Agile Big Data Analytics Development: An Architecture-Centric Approach
Agile Big Data Analytics Development: An Architecture-Centric ApproachAgile Big Data Analytics Development: An Architecture-Centric Approach
Agile Big Data Analytics Development: An Architecture-Centric ApproachSoftServe
 
2012 sept 18_thug_biotech
2012 sept 18_thug_biotech2012 sept 18_thug_biotech
2012 sept 18_thug_biotechAdam Muise
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesGeoffrey Fox
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesGeoffrey Fox
 
Michael stack -the state of apache h base
Michael stack -the state of apache h baseMichael stack -the state of apache h base
Michael stack -the state of apache h basehdhappy001
 
Orca: A Modular Query Optimizer Architecture for Big Data
Orca: A Modular Query Optimizer Architecture for Big DataOrca: A Modular Query Optimizer Architecture for Big Data
Orca: A Modular Query Optimizer Architecture for Big DataEMC
 

Similar a Big Data Benchmarking (20)

Cloudera Federal Forum 2014: The Evolution of Machine Learning from Science t...
Cloudera Federal Forum 2014: The Evolution of Machine Learning from Science t...Cloudera Federal Forum 2014: The Evolution of Machine Learning from Science t...
Cloudera Federal Forum 2014: The Evolution of Machine Learning from Science t...
 
Cloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsCloud Services for Big Data Analytics
Cloud Services for Big Data Analytics
 
Cloud Services for Big Data Analytics
Cloud Services for Big Data AnalyticsCloud Services for Big Data Analytics
Cloud Services for Big Data Analytics
 
Dev Ops Training
Dev Ops TrainingDev Ops Training
Dev Ops Training
 
sudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJAsudoers: Benchmarking Hadoop with ALOJA
sudoers: Benchmarking Hadoop with ALOJA
 
Testing Big Data: Automated Testing of Hadoop with QuerySurge
Testing Big Data: Automated  Testing of Hadoop with QuerySurgeTesting Big Data: Automated  Testing of Hadoop with QuerySurge
Testing Big Data: Automated Testing of Hadoop with QuerySurge
 
Etl with apache impala by athemaster
Etl with apache impala by athemasterEtl with apache impala by athemaster
Etl with apache impala by athemaster
 
Introduction to Mahout and Machine Learning
Introduction to Mahout and Machine LearningIntroduction to Mahout and Machine Learning
Introduction to Mahout and Machine Learning
 
Design Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time LearningDesign Patterns for Large-Scale Real-Time Learning
Design Patterns for Large-Scale Real-Time Learning
 
Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe Crobak
 
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...
ExtremeEarth: Hopsworks, a data-intensive AI platform for Deep Learning with ...
 
Hannes end-of-the-router-tnc17
Hannes end-of-the-router-tnc17Hannes end-of-the-router-tnc17
Hannes end-of-the-router-tnc17
 
Testing Big Data: Automated ETL Testing of Hadoop
Testing Big Data: Automated ETL Testing of HadoopTesting Big Data: Automated ETL Testing of Hadoop
Testing Big Data: Automated ETL Testing of Hadoop
 
Agile Big Data Analytics Development: An Architecture-Centric Approach
Agile Big Data Analytics Development: An Architecture-Centric ApproachAgile Big Data Analytics Development: An Architecture-Centric Approach
Agile Big Data Analytics Development: An Architecture-Centric Approach
 
2012 sept 18_thug_biotech
2012 sept 18_thug_biotech2012 sept 18_thug_biotech
2012 sept 18_thug_biotech
 
Benefits of Hadoop as Platform as a Service
Benefits of Hadoop as Platform as a ServiceBenefits of Hadoop as Platform as a Service
Benefits of Hadoop as Platform as a Service
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software Architectures
 
Matching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software ArchitecturesMatching Data Intensive Applications and Hardware/Software Architectures
Matching Data Intensive Applications and Hardware/Software Architectures
 
Michael stack -the state of apache h base
Michael stack -the state of apache h baseMichael stack -the state of apache h base
Michael stack -the state of apache h base
 
Orca: A Modular Query Optimizer Architecture for Big Data
Orca: A Modular Query Optimizer Architecture for Big DataOrca: A Modular Query Optimizer Architecture for Big Data
Orca: A Modular Query Optimizer Architecture for Big Data
 

Más de Venkata Naga Ravi

Más de Venkata Naga Ravi (11)

Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Quick Trip with Docker
Quick Trip with DockerQuick Trip with Docker
Quick Trip with Docker
 
Glint with Apache Spark
Glint with Apache SparkGlint with Apache Spark
Glint with Apache Spark
 
Flocker
FlockerFlocker
Flocker
 
Go Lang
Go LangGo Lang
Go Lang
 
Kubernetes
KubernetesKubernetes
Kubernetes
 
NoSQL & HBase overview
NoSQL & HBase overviewNoSQL & HBase overview
NoSQL & HBase overview
 
Software Defined Network - SDN
Software Defined Network - SDNSoftware Defined Network - SDN
Software Defined Network - SDN
 
Virtual Container - Docker
Virtual Container - Docker Virtual Container - Docker
Virtual Container - Docker
 
Java 8 Lambda and Streams
Java 8 Lambda and StreamsJava 8 Lambda and Streams
Java 8 Lambda and Streams
 
In Memory Analytics with Apache Spark
In Memory Analytics with Apache SparkIn Memory Analytics with Apache Spark
In Memory Analytics with Apache Spark
 

Último

Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Visualising and forecasting stocks using Dash
Visualising and forecasting stocks using DashVisualising and forecasting stocks using Dash
Visualising and forecasting stocks using Dashnarutouzumaki53779
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate Agents
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate AgentsRyan Mahoney - Will Artificial Intelligence Replace Real Estate Agents
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate AgentsRyan Mahoney
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 

Último (20)

Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Visualising and forecasting stocks using Dash
Visualising and forecasting stocks using DashVisualising and forecasting stocks using Dash
Visualising and forecasting stocks using Dash
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate Agents
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate AgentsRyan Mahoney - Will Artificial Intelligence Replace Real Estate Agents
Ryan Mahoney - Will Artificial Intelligence Replace Real Estate Agents
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 

Big Data Benchmarking

  • 1. Big Data Benchmarks Srinivasa Rao Aravilli N Venkata Naga Ravi
  • 2. 2 Why ..  Evaluating the effect of a hardware/software upgrade:  OS, Java VM,. . .  Hadoop, Cloudera CDH, Pig, Hive, Impala,. . .  Debugging:  Compare with other clusters or published results.  Performance tuning
  • 3. 3 Industry Standard benchmarking organizations • TPC - Transaction Processing Performance Council (http://www.tpc.org/ ) • SPEC - The Standard Performance Evaluation Corporation (https://www.spec.org/ ) • CLDS – Centre for Large- scale Data System Research (http://clds.sdsc.edu/bdbc) • Top Outcomes • BigData Top100 - an end-to-end application-layer benchmark for big data applications • Terasort - Functional benchmark focusing on Sort function ( quicksort using MapReduce) • Hibench • Sort, Machine learning ( K-means clustering, Classification)
  • 4. 4 Types of Benchmark • Micro-benchmarks. To evaluate specific lower-level, system operations • E.g., A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters, Panda et al, OSU • Functional / component benchmarks. Specific high-level function. • E.g. Sorting: Terasort • E.g. Basic SQL: Individual SQL operations, e.g. Select, Project, Join, Order-By, ... • Application-level benchmarks. • Measure system performance (hardware and software) for a given application scenario—with given data and workload
  • 5. 5 Terasort using Hadoop Terasort includes 3 MapReduce Applications • Teragen – generates the data • Terasort – samples the input data and uses them with MapReduce to sort the data • Teravalidate – validates the output data is sorted
  • 7. 7 Map Reduce Modelloser look at MapReduce’s implementation model source: http:/ / developer.yahoo.com/ hadoop/ tutorial/ module4.html”
  • 8. 8 Benchmarking Suite • HiBench, Yan Li, Intel (https://github.com/intel-hadoop/HiBench) • YCSB -Yahoo Cloud Serving Benchmark, Brian Cooper, Yahoo! (https://github.com/brianfrankcooper/YCSB/) • Berkeley Big Data Benchmark, Pavlo et al., AMPLab (https://amplab.cs.berkeley.edu/benchmark/) • BigDataBench, Jianfeng Zhan, Chinese Academy of Sciences (http://prof.ict.ac.cn/BigDataBench/) • Grid Mix (http://hadoop.apache.org/docs/r1.2.1/gridmix.html) • Big Bench (https://github.com/intel-hadoop/Big-Bench) • TPCx-HS (http://www.tpc.org/tpcx-hs/ )
  • 9. 9 TPCx-HS benchmarks X: Express H: Hadoop S: Sort • TPCx-HS was developed to provide an objective measure of hardware, operating system and commercial Apache Hadoop File System API compatible software distributions, and to provide the industry with verifiable performance, price-performance and availability metrics. • http://www.tpc.org/tpcx-hs/
  • 11. 11 TPCx-HS benchmarks Scale Factor The TPCx-HS follows a stepped size model. Scale factor (SF) used for the test dataset must be chosen from the set of fixed Scale Factors defined as : • 1TB, 3TB, 10TB, 30TB, 100TB, 300TB, 1000TB, 3000TB, 10000TB. • The corresponding number of records are • 10B, 30B, 100B, 300B, 1000B, 3000B, 10000B, 30000B, 100000B, where each record is 100 bytes generated by HSGen. • http://www.tpc.org/tpcx-hs/
  • 13. 13 TPCx-HS Results on Cisco UCS Cisco Published Results
  • 14. 14 Comparison of various Benchmarks Suites.
  • 15. 15
  • 17. 17 Spark sorted the same data 3X faster using 10X fewer machines. All the sorting took place on disk (HDFS), without using Spark’s in-memory cache.
  • 18. 18 Sort Bench Mark http://sortbenchmark.org/ • GraySort • MinuteSort • CloudSort • JouleSort • PennySort • TeraByteSort • DatamationSort

Notas del editor

  1. <10 bytes key><10 bytes rowid><78 bytes filler>\r\n $ hadoop jar hadoop-*examples*.jar teragen -D dfs.block.size=536870912 ... http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/
  2. CPU Type: Intel Xeon E5-2660 - 2.20 GHz   Total # of Processors: 32   Total # of Cores: 320  Total # of Threads: 640  Cluster: Yes Data Generation Time (hours): .23 Data Sort Time (hours): 1.29 Data Validation Time (hours): .22 Total Storage/Database Size Ratio: 38.40 TPCx - HS FDR 11 January, 2015 Measured Configuration: The measured configuration consisted of :  Total Nodes: 16  Total Processors/Cores/Threads: 32/320/640  Total Memory: 4,096GB  Total Number of Storage Drives/Devices: 384  Total Storage Capacity: 384 TB
  3. MVAPICH2 is an open source implementation of Message Passing Interface (MPI) that delivers the best performance, scalability and fault tolerance for high-end computing systems and servers using InfiniBand, 10GigE/iWARP and RoCE networking technologies.
  4. https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html