A cluster is only as strong as its weakest link.
@DanRomike
Hadoop Tooling Engineer / Configuration Manager
@Twitter
#HadoopSummit
Introduction
• Hadoop health at Twitter:
– Scope of our operation
– What are some of our weak links?
– What is in our checkup?
– Where does our health check run?
– Which faults are meaningful to us?
– What is our future health strategy?
– Summary of our achievements
Cluster Health Pyramid
Us
Tools and Jenkins
A Cluster Management Shell
Health Scans
Management of 1000s of nodes, 10s of clusters
MANAGING HADOOP
What we support
Clusters
• Data Warehouse / HBase
– Large number of computing jobs: tens of thousands per day
– High storage consumption
– Tripled in size
• Processing
– Large number of computing jobs: tens of thousands per day
– Doubled in size
• Backups
– HDFS storage
– Doubled in size
• Test
– Test releases
– Evaluate jobs
Site Operations
Central Site Operations Team
• Ticket based
• Short repair times
• Infrastructure
Generally, what breaks?
• PSU, LOM, BIOS, Wiring
• Network Bonding
• Disks, Controllers
• TOR Switches
• Rack Power
Our Configuration Manager
Role
Run
Attribute
8#HadoopSummit
Automation
Refined processes
Source Control Repository
Config Mgmt
Puppet
Cluster Reliability Team
• Manage: build, grow, and migrate
– On-boarding
– Migrate: distcp harness
• Configuration: optimized properties
– heartbeats.in.seconds: set to cluster size
• Reliability
– Data integrity: failures, under-rep, 3-reps; fsck, -report, metasave; Violated, MISSING
– Balance: Balancer, rack-topology.sh
– Nodes: LIVE, DEAD, B-LIST; break/fix; recommission
– HEALTH: scan, isolate issues, report failures
Weak Links
Node Issues
• Performance loss, slow
• Storage failures
• High CPU usage
• Memory failures
• Onboard network failures
• Power On/Off
Infrastructure Issues
• Changes, adds and moves
• Site power maintenance
• Rack issues
• Unscheduled changes
• Cooling
• Network infrastructure
CLUSTER HEALTH
Health checks for Hadoop production environments
Health Check Mission
Create and deploy a comprehensive health check that reports failing nodes, reduces impact to performance, and uses common standard tools.
• Fast: logs may grow quickly, avoid timeouts
• Adjustable: setting the right thresholds
• Reliable: must not cause issues or ‘brownouts’
• Reusable: new tools will use status and results
Health Goals
• Reduce on-call incidents
• Reduce troubleshooting
• Prevent cascading failures
• Verify after maintenance
• Facilitate change and growth
Early Detection
• Health: 1-3 mins
– Thresholds: preset level
– Blacklist: ERROR, Exclude
– Notify: Alert
• Monitor: threshold alert
– Alerts: email, page, on-call
• Heartbeats: It’s Alive
– Delays: performance
– Datanodes: 0-3 secs
– Tasks: 0-5 secs
mapred-site.xml
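<!-- Health check script that the TaskTracker runs on each node -->
<property>
  <name>mapred.healthChecker.script.path</name>
  <value>/etc/hadoop/conf/healthcheck2</value>
</property>
<!-- Run every 180000 ms (3 minutes); give up after 45000 ms (45 seconds) -->
<property>
  <name>mapred.healthChecker.interval</name>
  <value>180000</value>
</property>
<property>
  <name>mapred.healthChecker.script.timeout</name>
  <value>45000</value>
</property>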
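A minimal sketch of what a script at that path might print, written in Python for illustration. The slides do not show the real healthcheck2 script, so the function name and the example reasons are hypothetical. The one fixed point is Hadoop's convention that an output line beginning with ERROR marks the node as unhealthy.

```python
# Values copied from the mapred-site.xml slide (milliseconds).
INTERVAL_MS = 180000   # mapred.healthChecker.interval
TIMEOUT_MS = 45000     # mapred.healthChecker.script.timeout


def report(failures):
    """Build the text a health script would write to stdout.

    `failures` is a list of short reason strings collected by the
    individual checks (hypothetical; the real checks are not shown
    in the slides). Each reason becomes one line starting with ERROR,
    which is what Hadoop's node health checker looks for.
    """
    if not failures:
        return "OK: node is healthy"
    return "\n".join("ERROR " + reason for reason in failures)


# Example: two failed checks produce two ERROR lines.
print(report(["disk /data/1 is full", "kernel too old"]))
```

With the values on the slide, the script runs every 3 minutes and has 45 seconds to finish, so every check inside it has to stay well under that budget.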
Healthy to Blacklisted
Steps: Configure, Execute, Evaluate
Health states: PASS, WARN, ERROR, FAIL
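The evaluate step can be sketched as a threshold comparison. This is an illustration only: the threshold numbers are made up, and FAIL is left out because the slide does not define how it differs from ERROR.

```python
from enum import IntEnum


class Health(IntEnum):
    """Health states from the slide, ordered by severity (FAIL omitted)."""
    PASS = 0
    WARN = 1
    ERROR = 2


def evaluate(value, warn_at, error_at):
    """Map one measured value onto a health state using two thresholds."""
    if value >= error_at:
        return Health.ERROR
    if value >= warn_at:
        return Health.WARN
    return Health.PASS


def overall(states):
    """The node is as healthy as its worst individual check."""
    return max(states, default=Health.PASS)


# Hypothetical readings: disk 97% full, CPU load 40%.
states = [evaluate(97, warn_at=80, error_at=95),
          evaluate(40, warn_at=85, error_at=98)]
print(overall(states).name)
```

Only an overall ERROR would need to reach the blacklist; a WARN can go to monitoring without taking the node out of service.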
FAULTS
What to scan for
Faults to Detect
• Network
– Speed decrease
– Partial rack power outages, loss of services
– Rack switch packet loss
– Errors/drops/retries bursts
• Reported memory vs. installed memory
• Induced fault: for node maintenance
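One way the reported-versus-installed memory check might look, as a sketch. It assumes a Linux node where the text of /proc/meminfo is available; the expected size and the 5% tolerance are hypothetical values that would normally come from configuration management.

```python
def mem_total_kb(meminfo_text):
    """Extract MemTotal (in kB) from the text of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1])
    raise ValueError("MemTotal not found")


def memory_fault(reported_kb, expected_kb, tolerance=0.05):
    """True when the kernel reports noticeably less memory than installed.

    A small tolerance is allowed because the kernel reserves some
    memory, so MemTotal is always a little below the installed size.
    """
    return reported_kb < expected_kb * (1 - tolerance)


# Sample text in the /proc/meminfo layout; the node should have 64 GiB.
sample = "MemTotal:       65843128 kB\nMemFree:         1234567 kB\n"
expected_kb = 64 * 1024 * 1024
print(memory_fault(mem_total_kb(sample), expected_kb))
```

A node that has lost a memory module would report roughly half the expected value and trip the check.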
More Faults
• Storage
– Full
– Incorrect disk installed
– Correct inodes per file system
– File system type: ext4
– HW disk controller issues
• Kernel is too old
• High CPU spikes with high loads
• Datanode failure
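A few of these checks can be sketched as small pure functions. The minimum kernel version, the 95% fullness limit, and the mount point are hypothetical; the slides name the faults but not the thresholds.

```python
import re


def kernel_tuple(release):
    """Turn a release string such as '2.6.32-431.el6' into (2, 6, 32)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if not match:
        raise ValueError("unrecognized kernel release: " + release)
    return tuple(int(part) for part in match.groups())


def kernel_too_old(release, minimum=(2, 6, 32)):
    """True when the running kernel is older than the assumed minimum."""
    return kernel_tuple(release) < minimum


def fs_type(mounts_text, mount_point):
    """Look up the file system type of a mount point in /proc/mounts text."""
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1] == mount_point:
            return fields[2]
    return None


def disk_too_full(used_bytes, total_bytes, limit=0.95):
    """True when usage is at or above the assumed fullness limit."""
    return total_bytes > 0 and used_bytes / total_bytes >= limit


mounts = "/dev/sdb1 /data/1 ext4 rw,noatime 0 0\n"
print(fs_type(mounts, "/data/1") == "ext4", kernel_too_old("2.6.18-308.el5"))
```

Keeping each check free of side effects makes it easy to test against captured output before it runs on production nodes.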
Log Checking
• Which logs to check
– System logs
– Datanode logs
– Tasktracker logs
• How to check
– Relevant records
– Bottom up scan
– Positive Pattern Matching
– Use of fault counters and scan thresholds
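The scan described above could look roughly like this. The patterns, counter names, and thresholds are invented for illustration; the slides do not list the real ones.

```python
import re
from collections import Counter
from itertools import islice

# Hypothetical positive patterns: a line counts only if it matches one.
PATTERNS = {
    "io_error": re.compile(r"I/O error"),
    "nic_down": re.compile(r"NIC Link is Down"),
}
# Hypothetical scan thresholds: matches needed before a fault is raised.
THRESHOLDS = {"io_error": 3, "nic_down": 1}


def scan(lines, patterns=PATTERNS, max_lines=1000):
    """Scan from the bottom (newest records) up, counting pattern hits.

    Capping the scan at `max_lines` keeps it fast on logs that have
    grown quickly and limits it to the most recent, relevant records.
    """
    counts = Counter()
    for line in islice(reversed(lines), max_lines):
        for name, pattern in patterns.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def faults(counts, thresholds=THRESHOLDS):
    """Names of the fault counters that reached their threshold."""
    return sorted(name for name, limit in thresholds.items()
                  if counts[name] >= limit)


log = ["boot ok", "sda: I/O error", "sda: I/O error",
       "eth0 NIC Link is Down", "sda: I/O error"]
print(faults(scan(log)))
```

A single stray error stays below the threshold, so one noisy record does not blacklist an otherwise healthy node.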
FUTURE STRATEGY
Reduce recovery time by building a management shell
Management Shell
• Health Shell (CLI) maintains a working list
– Refines the list as node state changes
– Interactive BASH Shell is the CLI
– Concurrent execution functions
– Interfaces to all Hadoop admin functions
– Familiar interface
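The working-list idea can be illustrated in a few lines. The shell in the talk is an interactive BASH shell; this Python sketch only shows the concept of running an action on many nodes concurrently and then narrowing the list, and the node names and the action are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_on_nodes(nodes, action, max_workers=8):
    """Run `action(node)` on every node concurrently.

    Returns a dict mapping each node name to the action's result.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(action, nodes))
    return dict(zip(nodes, results))


def refine(working_list, states, keep=("ERROR",)):
    """Narrow the working list to nodes whose state still needs attention."""
    return [node for node in working_list if states.get(node) in keep]


def fake_health(node):
    """Stand-in for a real remote health scan (hypothetical)."""
    return "ERROR" if node == "node02" else "PASS"


working = ["node01", "node02", "node03"]
working = refine(working, run_on_nodes(working, fake_health))
print(working)
```

Each pass shrinks the list, so follow-up commands touch only the nodes that still have a problem.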
Today’s Health Pyramid
Us
Tools and Jenkins
A Cluster Management Shell
Health Scans
Management of 1000s of nodes, 10s of clusters
CONCLUSION
Change weak links into strong links
Achievements
• Failing nodes are blacklisted
• New cluster validations
• Fewer Job tails
• Less intervention
• Increased job throughput
• Improved health
#ThankYou
@DanRomike

A Cluster Is Only As Strong As its Weakest Link

Editor's Notes

  1. Dan Romike, Hadoop Tooling Engineer / Configuration Manager, Twitter, Inc. Dan Romike started with Hadoop in the summer of 2008 at Yahoo!, Inc., in their Hadoop data warehouse and site operations teams and received a ‘You Rock’ award for a very large data management project. He has since worked with Hadoop operations at eBay, Inc. and now at Twitter as a Hadoop Reliability Engineer. He recently gave a presentation at the 2011 Summit discussing Hadoop automation and has an extensive background building and managing Unix-based production environments.
  2. Early detection and correction of cluster health issues is a vital part of daily cluster management, no matter the size. Building and managing a healthy cluster is the best way to meet service level agreements and to prevent or avoid prolonged troubleshooting. A cluster is effective and efficient when problems are detected and eliminated early.
  3. Deploying simple tools and processes prevents minor problems from becoming major headaches. This talk covers how Twitter's Hadoop Reliability team developed, tested, and deployed a broad-spectrum cluster health check that detects problems quickly and early.
  4. Clusters run at full efficiency when all LIVE nodes are working at their peak. During node failures, partial or full, the cluster may behave in unexpected ways, creating a weak link. Finding a small problem on thousands of nodes is time consuming. What we deployed is an internal check that is able to effect a change in the cluster’s behavior and blacklist failing nodes, preventing new tasks from starting on a node in a failed condition.
  5. We start with a high-level review of the Hadoop environment at Twitter. We are a very small operational team, and we need the ability to manage a large Hadoop environment from installation to production while avoiding time lost troubleshooting issues that affect the cluster. Our team's effort is to build the missing layers in the health and management pyramid, which will provide meaningful and simple interfaces for the Hadoop admins.
  6. Each cluster has a primary use. We run close to maximum for storage and processing on most clusters, so it is important to test and evaluate all releases and production changes to prevent failures on the large clusters. These clusters are thousands of nodes and 10s of petabytes in multiple datacenters, with a large number of jobs per day.
  7. The Site Operations team manages our infrastructure and corrects node failures (after the nodes have been withdrawn from the clusters). We ticket each failure, one per ticket, and they quickly and accurately correct the issues and return the node ready for commissioning. The support we receive is invaluable; without it we would not have been able to grow as quickly as we have in the last year. Some of the issues that are discovered and resolved by Site Operations are discussed.
  8. Our nodes belong to roles managed by an internal Configuration Manager. All nodes must belong to a role, each node has inherited attributes, and we may effect a role-wide operation by executing commands through the manager.
  9. To ensure that our code and configurations are accurate, we have a rigorous process that includes peer reviews, review boards, staging, validations, canary, restaging, and production. The code and configurations are checked in and distributed to the nodes via Puppet, without exception.
  10. Reliability covers many aspects of cluster management and is part of the daily maintenance, outages, preventative care, and health evaluation that every cluster, irrespective of size, requires. Our focus is the HEALTH aspect of Hadoop and being able to manage failures without intervention. We do so with a complex health process that has simple roots: it isolates node issues, and reported failures are rolled up into the monitoring system, which is an independent function.
  11. Hadoop is highly dependent on a healthy cluster, be it 10 or 1000 nodes. A cluster may exhibit failed behaviors from minor issues on a single node, so discovering the issue and immediately blacklisting the node is important. Listed here are most of the weak links that will cause data and job issues.
  12. This section covers what we wanted to achieve to obtain full cluster health. We realized early that the health process plays an important role in node health as well as in validating that returning nodes enter the cluster fully functional. The same script is able to perform multiple tasks with no code changes.
  13. What are some of the best methods for building and deploying a check? There is a limited amount of time, a matter of seconds, to run checks and to scan for other issues, and a full log body scan was neither reasonable nor necessarily accurate. Here are some of the aspects we sought.
  14. After the script was deployed, we needed to verify these goals. Though difficult to track, we used our workload and the number of people required to manage our clusters as a primary indicator. We are pleased with the results.
  15. Each Hadoop cluster has three primary columns of health; we created two, and one is provided. The health check finds issues collecting in logs and process states, based on thresholds and time. Our monitoring system notifies us of issues over time using aggregation. Finally, Hadoop manages heartbeats for both datanodes and tasks, which provide critical information on the node’s status; should a heartbeat be delayed too long, the cluster automatically takes corrective action. The administrator takes manual actions to exclude nodes from or include nodes in the cluster; in some cases, nodes have to be excluded to kill an issue.
  16. To install a health check, update these properties, as described.
  17. The health check works by returning a result message to the job manager. An ‘ERROR’ indicates that the node is to be taken out of circulation, but the running attempts are allowed to finish. Any other term may be used to indicate to automation that the node PASSed or that other issues exist and actions are required. Because tasks finish instead of being terminated, the blacklist gives us time to evaluate the problem and take corrective actions such as fail-tasking.
      Stories:
      – Full file systems: errant jobs filled node storage; the health check caused a brownout, which showed up as blacklisted nodes.
      – Errant jars caused Full GCs in the TT: we updated the health check to count Full GCs over 999 records and restart the TT.
      – Racks lost packets: we added packet loss detection within the same rack to blacklist the rack, and wrote a crawler for inter-rack loss.
      – Predictive disk failures in the controller: detected and blacklisted.
      – Kickstart installed root on the wrong disk: detected.
      – High load averages slow down jobs: blacklist immediately.
      – Memory shortfall: detect and blacklist nodes.
      Check table:
      binfsusedfsused $E $FAULTS_DF root $ERRS_DF
      sbinmkfswrongfsused $W $FAULTS_DF root $WARN_DF
      sbindiskwrongfsused $W $FAULTS_DW root $WARN_DW
      file mounts $proc/mounts $E 1 root ^\/dev\/
      file fstab $etc/fstab $E 1 root ^LABEL=
      file loadavg $proc/loadavg $W 70.0 root [0-9.]+
      proc datanode $dnpid $E 1 hadoop $PROC_DN
      proc tasktracker $ttpid $W 1 hadoop $PROC_TT
      proc regionserver $rspid $W 1 hadoop $PROC_RS
      proc monit $log/monit.log $W 1 root $ubin/monit
      proc syslogd $run/syslog-ng.pid $W 1 root syslog-ng
      proc scribed $run/scribe.pid $W 1 root $usbin/scribed
      log syslog-dev $log/syslog $E $FAULTS_DV root $ERRS_DV
      log syslog-hw $log/syslog $E $FAULTS_HW root $ERRS_HW
      log mcelog-hw $log/mcelog $E $FAULTS_MC root $ERRS_MC
      log ttlog $ttlog $W $FAULTS_TT hadoop $ERRS_TT
      log dnlog $dnlog $W $FAULTS_DN hadoop $ERRS_DN
      log rslog $rslog $W $FAULTS_RS hadoop $ERRS_RS
      log scribe $sclog $W $FAULTS_SC hadoop $ERRS_SC
      log shortmem $proc/meminfo $W $FAULTS_SM root $ERRS_SM
      log bonding $bonding $F $FAULTS_EB root $ERRS_EB
      toggle blacklisted $bllog $E $FAULTS_BL hadoop $ERRS_BL
  18. Detecting faults is based on real-life experience and is usually taught by errors and failures. This section describes some of the faults we scan for, which provide the basis of a health check. We also induce faults into the health system to perform maintenance operations. It is easier to perform maintenance by blacklisting a node than by excluding it.
  19. Managing a node’s expected performance is a major concern and ‘weakness’ in working with large clusters. A single node issue may cause cascading problems that extend job run times. Some of the possible issues are network losses, speed reductions, and issues caused by manual interventions as part of general maintenance. The health process needs to trap and blacklist nodes that are not meeting specifications.
  20. The Hadoop storage system may be difficult to maintain as the cluster grows. With 10s of petabytes spinning on 1000s of nodes, storage issues have caused major problems in the past. However, with improvements in disks and controllers, storage has been far less of an issue. We are now focused on performance gains and storage efficiency:
      – Running the latest file systems
      – Reducing inodes to recover 3% in storage
      – Improving build time through reduced inodes
      – Improving fsck time through reduced inodes
      – Hadoop may only use 1-2% of inodes
      – fsck time dramatically improves
      – Kickstart improvements
      – Old kernels have security issues
      – DataNode goes down, TT needs to blacklist
      – Task tracker failures on disk full
  21. Bottom-up log scanning is an effective method of limiting the amount of data to process and of locating just the recent issues. Some logs are large and may be stale, so keeping the information fresh and current prevents brownouts and reduces blacklisting. We also use ‘positive exception’ matching logic via egrep, which we adopted after receiving many ‘false positives’. We chose to match the majority of the pattern directly and then add a positive column match such as [123]; on the ‘not’ side, we negated a column match, as in [^123]. We want to match what we are looking for, not what we aren’t looking for.
  22. We have two layers remaining in our health strategy, or pyramid. Underway is a management shell that will ease the process of managing lists and faults and reduce recovery time.
  23. We are currently wrapping up our management CLI, which helps Hadoop administrators start and stop clusters, manage lists, and perform break/fix actions, to name a few. Our goal is to reduce the time to manage and recover a cluster:
      – Improve recovery time from a crash
      – Reduce node time to repair
      – Reduce recovery time from brownouts
      – Improve the ability to manage nodes based on state without a SQL database
  24. The management shell is a BASH CLI that eases the administrative functions for large clusters.
  25. The last part of the pyramid is the future: integrating automation and tools, where the health process plays an essential role.
  26. Our Hadoop clusters run more efficiently because they have fewer task failures due to node hardware faults. Long tails rarely occur due to node issues. On-call issues have also declined because we have fewer troubleshooting issues due to stuck jobs. We are looking forward to hearing your comments.
  27. Questions.
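
The result message from note 17 and the bottom-up log scan from note 21 can be sketched together in Bash, the language the notes name for the management shell. This is a minimal illustration under stated assumptions, not the script Twitter deployed: the function names `count_recent_faults` and `check_log`, the message wording, and the default window of 1000 lines are all hypothetical. It assumes, as note 17 states, that a result beginning with ERROR marks the node for blacklisting and that any other term may be used for a passing node.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of one log-based health check. All names, messages,
# and defaults below are illustrative assumptions, not the original script.

# Count lines matching an extended regex within the last WINDOW lines of a
# log. Scanning from the bottom limits the data processed and ignores stale
# entries further up the file.
count_recent_faults() {
    local logfile=$1 pattern=$2 window=$3
    # An unreadable or missing log counts as zero faults in this sketch.
    [ -r "$logfile" ] || { echo 0; return 0; }
    # grep -c prints the count but exits non-zero when it is 0; "|| true"
    # keeps the function's exit status clean in that case.
    tail -n "$window" "$logfile" | grep -E -c -- "$pattern" || true
}

# Print one result line: "ERROR ..." when the number of recent matches
# reaches THRESHOLD, otherwise "OK ...".
check_log() {
    local name=$1 logfile=$2 pattern=$3 threshold=$4 window=${5:-1000}
    local count
    count=$(count_recent_faults "$logfile" "$pattern" "$window")
    if [ "$count" -ge "$threshold" ]; then
        echo "ERROR $name: fault count $count in last $window lines"
    else
        echo "OK $name"
    fi
}

# Example call (path, pattern, and threshold are assumptions). The pattern
# names the codes we want with a positive class such as [123], rather than
# excluding the ones we do not want:
#   check_log syslog-hw /var/log/syslog 'disk error code [123]' 1 500
```

Keeping the scan window and the threshold as arguments lets one function serve many rows of a check table like the one in note 17, where each row carries its own source, threshold, and pattern.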