RUNNING A PETABYTE SCALE DATA SYSTEM
Good, Bad, and Ugly Decisions
Alexey Kharlamov
Nov 14th, 2016
AGENDA
1. INTRODUCTION
• Who?
• What?
• Why?
2. CONTINUOUS INTEGRATION
• What is different?
• Caveats of the conventional approach
• Big Data release pipeline
3. MULTITENANCY
• Problem statement
• Resource management
• Workload isolation
SERVICES
• Data Strategy
• Big Data Architecture
• Data Science
• Big Data DevOps and Support
• Solutions and Accelerators
BIG DATA AND DATA SCIENCE PRACTICE
• 15+ World-Class Data Architects
• 200+ Big Data Engineers & Hadoop DevOps
• 10% Hadoop Certified Engineers
• 20+ Data Scientists
BIO
Alexey is a Solution Architect at EPAM Systems Ltd, where he leads the EMEA Big Data
Competency Center. He has over 20 years of software engineering experience
and has built multiple systems for low-latency and distributed data
processing in the financial, e-retail and advertising industries.
During his career, Alexey has designed systems processing millions of messages
per second and managing petabytes of stored data. He uses RDBMSs, NoSQL, data
grids, and the Big Data toolchain in his daily work to help companies on their Big
Data journey.
Alexey Kharlamov
EPAM Systems, Solution Architect
DATA THAT CANNOT BE PROCESSED ON A SINGLE MACHINE
BIG DATA SYSTEM
• Data
– Machine-generated data from social networks, games, sensors, ad networks
– Large volumes
– Allows building fine-grained models of reality
• Traits
– ~1000 USD/TB
– Hundreds of servers, thousands of rotational drives (failure is a reality)
– High-performance server-to-server network
– It takes days to copy data from a single server
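As a rough illustration of the "days to copy" trait, the arithmetic can be sketched as below. The server capacity (12 drives of 4 TB) and the sustained 1 Gbit/s transfer rate are assumptions chosen for the example, not figures from the talk, and `copy_days` is a hypothetical helper name.

```python
SECONDS_PER_DAY = 86_400


def copy_days(capacity_tb: float, throughput_gbit_s: float) -> float:
    """Days needed to copy capacity_tb terabytes at a sustained rate.

    Uses decimal units: 1 TB = 10**12 bytes, 1 Gbit/s = 10**9 bits/s.
    """
    if throughput_gbit_s <= 0:
        raise ValueError("throughput must be positive")
    bits = capacity_tb * 1e12 * 8
    seconds = bits / (throughput_gbit_s * 1e9)
    return seconds / SECONDS_PER_DAY


# Hypothetical server: 12 drives x 4 TB, drained over a sustained 1 Gbit/s link.
days = copy_days(capacity_tb=12 * 4, throughput_gbit_s=1.0)
print(f"{days:.1f} days")
```

Under these assumptions the copy takes roughly four and a half days, and a faster link only helps until the drives themselves become the bottleneck.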
CONTINUOUS INTEGRATION @ SCALE
TRADITIONAL APPROACH
CLASSICAL (WEB) APPROACH
• Multiple environments for different purposes
– Local/Continuous Integration
– Quality Assurance
– Production
• The environments are kept in sync
– Configuration
– Databases
• Code and test datasets are deployed to the environments to test different aspects of a system
[Diagram labels: 1 laptop, 1 VM, 2 hosts, 100+ hosts]
TOTALLY DIFFERENT
ENVIRONMENT SYNCHRONIZATION
• Environments have different hardware
– Number of nodes
– Generations of servers
• Hard to synchronize configuration
– Reprovisioning takes hours
– Engineers tend to forget to copy configuration parameters
• Hard to synchronize data
– Different amounts of disk space and CPU
– Copying takes hours
OUTCOME
• CI, QA and PROD are constantly different
• A test failure on CI or QA does not mean the job will fail in PROD, and vice versa
• People stop relying on the additional environments to test their jobs
• The most frequent bugs
– Unexpected field value / rubbish
– Input data change
– Resource issues due to data skew or growth
PREVAILING ISSUE TYPES
• Unexpected field value / rubbish
– Test data do not cover all possible values
– Sampled data may miss exactly this error
– Need to test on production data
• Incompatible change in data format
– Frequently brought in by third parties and unexpected
– Falls through ETL layers
– Need to test on production data
• Resource issue due to data skew or growth
– Causes job termination or cluster failure
– Must be tested on exactly the same hardware configuration
– Need to test on production data
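One way to spot data skew before it terminates a job is to compare the record count of the heaviest key with the mean count across all keys. The sketch below does this in plain Python for illustration; on a real cluster the counting would be a distributed aggregation over the full production input rather than a sample. The function name, the sample keys and the threshold are all hypothetical.

```python
from collections import Counter
from typing import Hashable, Iterable, Tuple


def skew_ratio(keys: Iterable[Hashable]) -> Tuple[Hashable, float]:
    """Return the heaviest key and its count divided by the mean count per key."""
    counts = Counter(keys)
    if not counts:
        raise ValueError("no keys supplied")
    heaviest, top = counts.most_common(1)[0]
    mean = sum(counts.values()) / len(counts)
    return heaviest, top / mean


# Hypothetical join keys: one advertiser dominates the input.
sample = ["adv-1"] * 90 + ["adv-2"] * 5 + ["adv-3"] * 5
key, ratio = skew_ratio(sample)

SKEW_THRESHOLD = 2.0  # assumed limit; tune per job
if ratio > SKEW_THRESHOLD:
    print(f"key {key!r} holds {ratio:.1f}x the mean record count per key")
```

A ratio close to 1 means records are spread evenly across keys; a large ratio suggests that a single reducer or partition will receive a disproportionate share of the work.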
A PERFECT TEST USES PRODUCTION DATA
A PERFECT TEST USES PRODUCTION HARDWARE
QA: SINGLE CLUSTER FOR EVERYTHING
• Logical partitions for DEV, QA, PROD on the cluster
– Full processing capacity available
– Always up-to-date data and configuration
– No environment synchronization needed
• Cluster becomes multitenant
– Partitions must be isolated!
– Code must be portable!
• Developers need more
– Faster turnaround times
– Easy interactive debugging and cross-process traceability
QA: HADOOP MINICLUSTER
• Full clone of a Hadoop cluster in a single JVM
– Job Driver
– NameNode
– DataNode
– Hive
– HBase
• Step into Hadoop and debug
– MapReduce Jobs
– User Defined Functions
– Coprocessors
– Queries
QA: CONTINUOUS QUALITY MONITORING
• Assertion of invariants per data chunk or time period
– Number of records
– Field data profile
– Conversion failures
– Missing dictionary/dimension data
– Field values range
• Alerting on assertion failure
– Too many errors!
– Number of records differs!
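The invariant checks listed above can be sketched as a function that takes summary statistics for one data chunk and returns the alerts to raise. This is a minimal illustration: the `ChunkStats` fields, the tolerance values and the alert texts are assumptions, and in practice the statistics would be produced by the processing job itself.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ChunkStats:
    """Summary of one data chunk (for example, one hourly partition)."""
    records: int
    conversion_failures: int
    missing_dimension_keys: int
    field_min: Dict[str, float]
    field_max: Dict[str, float]


def check_invariants(
    stats: ChunkStats,
    expected_records: int,
    allowed_range: Dict[str, Tuple[float, float]],
    tolerance: float = 0.2,          # assumed: allow +/-20% record count drift
    max_failure_rate: float = 0.01,  # assumed: at most 1% conversion failures
) -> List[str]:
    """Return human-readable alerts; an empty list means the chunk passed."""
    alerts: List[str] = []
    if abs(stats.records - expected_records) > tolerance * expected_records:
        alerts.append(
            f"Number of records differs: {stats.records} vs ~{expected_records}"
        )
    if stats.records and stats.conversion_failures / stats.records > max_failure_rate:
        alerts.append(
            f"Too many errors: {stats.conversion_failures} conversion failures"
        )
    if stats.missing_dimension_keys > 0:
        alerts.append(
            f"Missing dictionary data for {stats.missing_dimension_keys} keys"
        )
    for name, (low, high) in allowed_range.items():
        # Fields absent from the profile are treated as within range.
        if stats.field_min.get(name, low) < low or stats.field_max.get(name, high) > high:
            alerts.append(f"Field {name} outside [{low}, {high}]")
    return alerts


# Hypothetical chunk: too few records, too many failures, negative price.
chunk = ChunkStats(
    records=700,
    conversion_failures=35,
    missing_dimension_keys=0,
    field_min={"price": -1.0},
    field_max={"price": 120.0},
)
for alert in check_invariants(chunk, 1000, {"price": (0.0, 1000.0)}):
    print(alert)
```

Returning a list of alerts rather than raising on the first failure lets a monitoring job report every violated invariant for a chunk in one pass.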
MULTITENANCY
APARTMENT RENTAL
TENANT
• Uses the unit allocated to them, but would always like to get more
• Wants independence from others
• Does not want to be bothered by others, but may throw a party from time to time
LANDLORD
• Provides a unit fulfilling the tenant's needs
• Fixes broken facilities
• Ensures tenants follow the rules
• Evicts misbehaving tenants
APPLICATION DOMAIN
• A logical partition of platform resources independently executing a cluster application
– Data processing scripts and drivers
– Cluster services (workflow managers, query engines)
– Bespoke services (REST, Web UI, etc.)
• Resource management
– A YARN resource pool defines the share of resources available to the application
– HDFS quotas for data volume control
• Isolation
– Linux cgroups enforce CPU/RAM utilization limits
– Filesystem ACLs restrict access
– Own service instance per domain (Hive, scheduler, etc.)
– YARN can preempt tasks running for too long
– Watchdog processes terminate runaway jobs
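The watchdog idea can be sketched as a pure selection step that decides which applications have exceeded their domain's runtime limit. The records below are modelled on the entries of the YARN ResourceManager REST endpoint `/ws/v1/cluster/apps` (fields `id`, `queue`, and `elapsedTime` in milliseconds); the queue names, limits and the `find_runaway_apps` helper are hypothetical, and the kill step itself (for example `yarn application -kill <id>`) is deliberately left out.

```python
from typing import Any, Iterable, List, Mapping


def find_runaway_apps(
    apps: Iterable[Mapping[str, Any]],
    max_runtime_ms: Mapping[str, int],
    default_limit_ms: int,
) -> List[str]:
    """Return ids of applications running longer than their queue allows."""
    runaway = []
    for app in apps:
        # Fall back to the default limit for queues without an explicit one.
        limit = max_runtime_ms.get(app["queue"], default_limit_ms)
        if app["elapsedTime"] > limit:
            runaway.append(app["id"])
    return runaway


HOUR_MS = 60 * 60 * 1000

# Hypothetical snapshot of running applications in two domains.
running = [
    {"id": "application_1_0001", "queue": "dev", "elapsedTime": 3 * HOUR_MS},
    {"id": "application_1_0002", "queue": "prod", "elapsedTime": 3 * HOUR_MS},
    {"id": "application_1_0003", "queue": "dev", "elapsedTime": HOUR_MS // 2},
]
limits = {"dev": 1 * HOUR_MS, "prod": 6 * HOUR_MS}  # assumed per-domain limits

print(find_runaway_apps(running, limits, default_limit_ms=2 * HOUR_MS))
```

Keeping the selection separate from the kill action makes the policy easy to test and lets the watchdog run in a dry-run mode before it is allowed to terminate anything.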
ELASTIC COMPUTING CAPACITY
Mesosphere
• Researchers and developers frequently need a playground
• Application domains need to allocate resources dynamically
– Metal as a Service
– Virtualization
– Containerization
• Containers are perfect for portable code bundling
– Statelessness encourages externalization of configuration
– All dependencies included
– Explicit amount of resources allocated
– Easy migration between hosts
TAKEAWAYS
• AUGMENT HADOOP WITH FLUID COMPUTATIONAL CAPACITY
• CREATE ISOLATED DOMAINS FOR TENANTS AND WORKLOADS
• USE A UNIFIED PLATFORM FOR ALL ACTIVITIES
THANK YOU
alexey@kharlamov.biz
@aih1013

RUNNING A PETASCALE DATA SYSTEM: GOOD, BAD, AND UGLY CHOICES by Alexey Kharlamov

  • 1. 1
  • 2. 2 RUNNING A PETABYTE SCALE DATA SYSTEM Alexey Kharlamov Nov 14st, 2016 Good, Bad, and Ugly Decisions
  • 3. 3 2 1 3 AGENDA MULTITENANCY • Problem statement • Resource management • Workload isolation CONTINOUS INTEGRATION • What is different? • Caveats of the conventional approach • BigData release pipeline INTRODUCTION • Who? • What? • Why?
  • 4. 4 SERVICES Data Strategy Big Data Architecture Data Science Big Data DevOps and Support Solutions and Accelerators BIG DATA AND DATA SCIENCE PRACTICE 15+ World-Class Data Architects 200+ Big Data Engineers & Hadoop DevOps 10% Hadoop Certified Engineers 20+ Data Scientists
  • 5. 5 BIO Alexey a Solution Architect at EPAM Systems Ltd, where he leads EMEA Big Data Competency Center. He has over 20 years of software engineering experience and built multiple systems in the area of low-latency and distributed data processing in financial, e-retail and advertising industries. During his career, Alexey has designed systems processing millions of messages per second and managing petabytes of stored data. He uses RDBMs, NoSQL, data grids, and Big Data toolchain in his daily work to help companies on their Big Data journey. Alexey Kharlamov EPAM Systems, Solution Architect
  • 6. 6 DATA THAT CAN NOT BE PROCESSED ON A SINGLE MACHINE
  • 7. 7 • Data – Machine generated data by social networks, games, sensors, ad networks – Large volumes – Allow to build fine grained models of reality • Traits – ~1000 USD/TB – Hundreds of servers, thousands of rotational drives (Failure is a reality) – High performance server to server network – It takes days to copy data from a single server BIG DATA SYSTEM
  • 9. TRADITIONAL APPROACH
      • Multiple environments for different purposes
          – Local/Continuous Integration
          – Quality Assurance
          – Production
      • The environments are kept in sync
          – Configuration
          – Databases
      • Code and test datasets are deployed to the environments to test different aspects of a system
      • Diagram labels, classical (web) approach: 1 laptop, 1 VM, 2 hosts, 100+ hosts
  • 10. TOTALLY DIFFERENT
      • ENVIRONMENT SYNCHRONIZATION
          – Environments have different hardware: number of nodes, generations of servers
          – Hard to synchronize configuration: reprovisioning takes hours, and engineers tend to forget to copy configuration parameters
          – Hard to synchronize data: different amounts of disk space and CPU, and copying takes hours
      • OUTCOME
          – CI, QA and PROD are constantly different
          – A test failure on CI or QA does not mean the job will fail in PROD, and vice versa
          – People stop relying on additional environments to test their jobs
          – The most frequent bugs: unexpected field value / rubbish, input data change, resource issues due to data skew or growth
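Forgotten configuration parameters are easy to detect mechanically. The following is a minimal sketch, assuming each environment's settings have already been loaded into a flat dictionary; the function name `config_drift` and the sample values are illustrative and not from the talk.

```python
def config_drift(reference: dict, candidate: dict) -> dict:
    """Compare two flat configuration maps.

    Returns the keys missing from `candidate`, the keys present only in
    `candidate`, and the keys whose values differ as (reference, candidate).
    """
    shared = reference.keys() & candidate.keys()
    return {
        "missing": sorted(reference.keys() - candidate.keys()),
        "extra": sorted(candidate.keys() - reference.keys()),
        "changed": {
            key: (reference[key], candidate[key])
            for key in sorted(shared)
            if reference[key] != candidate[key]
        },
    }


# Hypothetical values for two environments.
prod = {
    "mapreduce.map.memory.mb": "4096",
    "yarn.nodemanager.resource.memory-mb": "98304",
    "dfs.replication": "3",
}
qa = {
    "mapreduce.map.memory.mb": "2048",
    "dfs.replication": "3",
}

drift = config_drift(prod, qa)
print(drift)
```

A report like this only shows that the environments differ. It does not remove the underlying cost of keeping them aligned, which is the point the slide makes.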
  • 11. PREVAILING ISSUE TYPES
      • Unexpected field value / rubbish
          – Test data do not cover all possible values
          – Sampled data may miss exactly this error
          – Need to test on production data
      • Incompatible change in data format
          – Frequently introduced by third parties without warning
          – Falls through ETL layers
          – Need to test on production data
      • Resource issue due to data skew or growth
          – Causes job termination or cluster failure
          – Must be tested on exactly the same hardware configuration
          – Need to test on production data
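To illustrate why data skew becomes a resource issue, the sketch below assigns record keys to partitions by hash and compares the busiest partition with the average. It uses CRC32 as a deterministic stand-in for a real partitioner's hash; the key names and partition count are made up for the example.

```python
import zlib
from collections import Counter


def partition_loads(keys, num_partitions):
    """Count how many records land in each hash partition."""
    loads = Counter()
    for key in keys:
        # CRC32 stands in for the partitioner's hash function (assumption).
        loads[zlib.crc32(key.encode("utf-8")) % num_partitions] += 1
    return [loads.get(p, 0) for p in range(num_partitions)]


def skew_ratio(loads):
    """Ratio of the busiest partition to the mean load (1.0 means balanced)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 0.0


# Hypothetical input: one "hot" key dominates the dataset.
keys = ["hot-key"] * 90 + [f"user-{i}" for i in range(10)]
loads = partition_loads(keys, num_partitions=4)
print(loads, round(skew_ratio(loads), 2))
```

All records with the same key go to the same partition, so one task receives at least 90% of the work regardless of cluster size. A small or sampled test dataset without the hot key would not show the problem.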
  • 12. A PERFECT TEST USES PRODUCTION DATA. A PERFECT TEST USES PRODUCTION HARDWARE.
  • 13. QA: SINGLE CLUSTER FOR EVERYTHING
      • Logical partitions for DEV, QA, PROD on the cluster
          – Full processing capacity available
          – Always up-to-date data and configuration
          – No environment synchronization needed
      • Cluster becomes multitenant
          – Partitions must be isolated!
          – Code must be portable!
      • Developers need more
          – Faster turnaround times
          – Easy interactive debugging and cross-process traceability
  • 14. QA: HADOOP MINICLUSTER
      • Full clone of a Hadoop cluster in a single JVM
          – Job Driver
          – NameNode
          – DataNode
          – Hive
          – HBase
      • Step into... Hadoop and debug
          – MapReduce jobs
          – User-defined functions
          – Coprocessors
          – Queries
  • 15. QA: CONTINUOUS QUALITY MONITORING
      • Assertion of invariants per data chunk or time period
          – Number of records
          – Field data profile
          – Conversion failures
          – Missing dictionary/dimension data
          – Field value range
      • Alerting on assertion failure
          – Too many errors!
          – Number of records differs!
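The invariants listed on this slide can be sketched as a per-chunk check that returns alert messages. This is a minimal illustration, not the speaker's implementation: the `ChunkStats` fields, thresholds and value range are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class ChunkStats:
    """Summary statistics collected for one data chunk (hypothetical shape)."""
    records: int
    conversion_failures: int
    missing_dimension_keys: int
    min_value: float
    max_value: float


def check_chunk(stats, expected_records, tolerance=0.2,
                max_failure_rate=0.01, value_range=(0.0, 1000.0)):
    """Return a list of alerts for every invariant the chunk violates."""
    alerts = []
    low = expected_records * (1 - tolerance)
    high = expected_records * (1 + tolerance)
    if not low <= stats.records <= high:
        alerts.append("Number of records differs!")
    if stats.records and stats.conversion_failures / stats.records > max_failure_rate:
        alerts.append("Too many errors!")
    if stats.missing_dimension_keys > 0:
        alerts.append("Missing dictionary/dimension data")
    if stats.min_value < value_range[0] or stats.max_value > value_range[1]:
        alerts.append("Field value out of range")
    return alerts


# Hypothetical chunks: one healthy, one violating every invariant.
healthy = ChunkStats(1000, 2, 0, 1.0, 900.0)
broken = ChunkStats(500, 50, 3, -5.0, 900.0)
print(check_chunk(healthy, expected_records=1000))
print(check_chunk(broken, expected_records=1000))
```

Running such checks on every production chunk turns data problems into alerts close to their source, rather than leaving them to surface as failures in downstream jobs.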
  • 17. APARTMENT RENTAL
      • TENANT
          – Uses the unit allocated to them, but would always like to get more
          – Wants independence from others
          – Does not want to be bothered by others, but may throw a party from time to time
      • LANDLORD
          – Provides a unit that fulfils the tenant's needs
          – Fixes broken facilities
          – Ensures tenants follow the rules
          – Evicts misbehaving tenants
  • 18. APPLICATION DOMAIN
      • A logical partition of platform resources independently executing a cluster application
          – Data processing scripts and drivers
          – Cluster services (workflow managers, query engines)
          – Bespoke services (REST, Web UI, etc.)
      • Resource management
          – A YARN resource pool defines the share of resources available to an application
          – HDFS quotas for data volume control
      • Isolation
          – Linux cgroups enforce CPU/RAM utilization limits
          – Filesystem ACLs restrict access
          – Own service instance per domain (Hive, scheduler, etc.)
          – YARN can preempt tasks running for too long
          – Watchdog processes terminate runaway jobs
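The watchdog idea from this slide can be sketched as a pure function that picks out jobs exceeding their domain's runtime limit. The `Job` record, the per-domain limits and the default are hypothetical; a real watchdog would read the job list from the cluster's resource manager and pass the result to whatever kill mechanism the cluster provides.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Minimal view of a running job (hypothetical shape)."""
    job_id: str
    domain: str
    started_at: float  # epoch seconds


def find_runaway_jobs(jobs, limits_s, now, default_limit_s=3600.0):
    """Return ids of jobs running longer than their domain's limit."""
    runaway = []
    for job in jobs:
        limit = limits_s.get(job.domain, default_limit_s)
        if now - job.started_at > limit:
            runaway.append(job.job_id)
    return runaway


# Hypothetical snapshot taken at t = 1000 seconds.
jobs = [
    Job("job-a", "dev", started_at=0.0),
    Job("job-b", "prod", started_at=0.0),
    Job("job-c", "qa", started_at=900.0),
]
limits = {"dev": 600.0, "prod": 7200.0}
print(find_runaway_jobs(jobs, limits, now=1000.0))
```

Keeping the selection logic separate from the kill action makes the policy easy to test without a cluster.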
  • 19. ELASTIC COMPUTING CAPACITY
      • Logo shown: Mesosphere
      • Researchers and developers frequently need a playground
      • Application domains need to dynamically allocate resources
          – Metal as a Service
          – Virtualization
          – Containerization
      • Containers are perfect for portable code bundling
          – Statelessness encourages externalization of configuration
          – All dependencies included
          – Explicit amount of resources allocated
          – Easy migration between hosts
  • 20. TAKEAWAYS
      • USE UNIFIED PLATFORM FOR ALL ACTIVITIES
      • CREATE ISOLATED DOMAINS FOR TENANTS AND WORKLOADS
      • AUGMENT HADOOP WITH FLUID COMPUTATIONAL CAPACITY