Hadoop @ eBay Marketplaces
Ming Ma
June 27th, 2013
Overview
• Hadoop growth @ eBay Marketplaces
• Availability study
• Opportunities ahead
Big Data @ eBay Marketplaces
120+ Million Active users
300+ Million search queries every single day
350+ Million items available
hadoop @ eBay Marketplaces 3
Data Sets
•Inventory Data
– Product Listings, Catalogue, Quantity etc.
•Transactional Data
– Buying, Returning etc.
•User Behavioral Data
– Click stream, comments, suggestions, user activities etc.
•Customer profiles
– Buyer, Seller, Partner information etc.
•Machine data
– Logs, application data etc.
Hadoop Evolution @ eBay Marketplaces
•2007
– Single-digit nodes
•2009
– Search
– 10s of nodes
•2010
– Shared cluster
– 100s of nodes
– 1000s+ cores
– PB
– CDH2
•2011
– Shared clusters
– 1000s of nodes
– 10,000+ cores
– 10s of PB
– Wilma (0.20)
•2012
– Shared clusters
– 1000s of nodes
– 10,000+ cores
– 10s of PB
•2013
– Shared clusters
– 4k+ nodes
– 40,000+ cores
– 50s of PB
– HDP
Shared vs. Dedicated Clusters
Shared clusters
– 10s of PB and 10s of thousands of slots per cluster
– Run HDP 1.2
– Used primarily for analytics of user behavior and inventory
– Mix of production and ad-hoc jobs
– Mix of MR, Hive, Pig, Cascading etc.
– Hadoop and HBase security enabled
Dedicated clusters
– Very specific use cases like Index Building
– Tight SLAs for jobs (on the order of minutes)
– Immediate revenue impact
– Usually smaller than our shared clusters, but still big (100s of nodes…)
Job Distribution by Type
(Figure: chart of job distribution by type; not recoverable from the extracted text.)
Use Case Examples
•Cassini, full re-write of eBay’s search engine:
– Use MR to build full and incremental near-real-time indexes
– Data for indexing is stored in HBase for efficient updates and random reads
– Strong SLAs
– Run on dedicated clusters
•Related and similar Items recommendations:
– Use transactional data, click stream data, search index, etc.
– Production MR jobs on a shared cluster
•Analytics dashboard:
– Run Mobius MR jobs to join click stream data and transactional data
– Store summary data in HBase
– Web application to query HBase
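
The analytics dashboard join of click stream and transactional data can be pictured as a reduce-side join. The sketch below is a minimal in-memory Python illustration of that pattern, not Mobius or Hadoop code. The record layout, the `item_id` join key, and the function name are assumptions made for illustration only.

```python
from collections import defaultdict

def reduce_side_join(clicks, transactions, key="item_id"):
    """Join two record lists on a shared key, mimicking map -> shuffle -> reduce."""
    # Map phase: tag each record with its source so the reducer can tell them apart.
    tagged = [(rec[key], ("click", rec)) for rec in clicks]
    tagged += [(rec[key], ("txn", rec)) for rec in transactions]

    # Shuffle phase: group the tagged records by join key.
    groups = defaultdict(list)
    for k, value in tagged:
        groups[k].append(value)

    # Reduce phase: emit one summary row per key that appears in both inputs.
    summary = {}
    for k, values in groups.items():
        n_clicks = sum(1 for tag, _ in values if tag == "click")
        n_txns = sum(1 for tag, _ in values if tag == "txn")
        if n_clicks and n_txns:
            summary[k] = {"clicks": n_clicks, "purchases": n_txns}
    return summary

# Hypothetical sample data.
clicks = [{"item_id": "a"}, {"item_id": "a"}, {"item_id": "b"}]
transactions = [{"item_id": "a"}, {"item_id": "c"}]
result = reduce_side_join(clicks, transactions)
print(result)  # {'a': {'clicks': 2, 'purchases': 1}}
```

In a setup like the one described, summary rows of this kind would be what gets written to HBase for the web application to query.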
eBay Hadoop Data Platform
•Data Ingest
– Extract, Load, Validate, Transform
•Clients
– Java, Scala, Pig, Hive, Cascading, Mobius
•Hadoop
– Behavioral, Transactional, Inventory
•Metadata
– Metastore, Type System, Service API
•Data Access
– Java POJO, Pig UDF, Hive UDF
•Tools
– ETL Monitor, Metadata Mgmt, Data Catalog, User Mgmt
Platform Innovation
•Many reliability improvements
•New Security features
– Multi-realm support
– Encryption
– HTTPS in Hadoop 1
•Hadoop 2.0
– MR 1 and YARN binary compatibility
•Automation for operations
– Machine decommission and re-commission process
•Data and user management
– Metadata management
– User account provisioning
Overview
• Hadoop growth @ eBay
• Availability study
• Next steps
Case study – defective applications
•HBase: A test app created heavy write load
– Test app used all region server RPC threads
– All RPCs were blocked by region flush
– RPC requests from production HBase MR job timed out
•HDFS: An app created lots of small files inside map tasks
– NN RPC Queue length spiked
– DN heartbeat RPCs could not be processed
– HDFS replication storm
Case study – platform bugs
•Hadoop:
– DFSClient.LeaseChecker thread leak in job tracker -> bi-weekly JT restart
– dfs.datanode.balance.bandwidthPerSec set to 200MB -> big performance impact
•JVM:
– leap second bug -> All clusters were down at the same time
– GC setting -> NN full GC happened regularly
•OS:
– “Divide by zero” in CentOS and RH 6.1 -> machine reboot
Case study – cluster maintenance
•Code rollout:
– NN SPOF
– RPC compatibility between old and new versions
•Hadoop configuration change:
– Likely required Hadoop JVM restart
– Rolling restart has impact on job latency
– Datanode rolling restart caused HBase region servers to exit
•Machines re-commission:
– Hadoop version drift
– OS configuration bug reappeared
Metrics
•Definition:
– Availability = MTBF / (MTBF + MDT), where MTBF is mean time between failures and MDT is mean down time
– Down time includes planned maintenance
•Measurement:
– Synthetic transaction approach
– Run regular canary word count MR job
– Canary job times out in X minutes
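
As a rough illustration of the definition and the canary-based measurement, the Python sketch below computes availability from MTBF and MDT, and separately estimates it as the fraction of canary runs that finish within the timeout. Treating that fraction as availability is an assumption of this sketch. The slide leaves the timeout as "X minutes", so the 10-minute value and the sample run durations are placeholders, and the function names are hypothetical.

```python
def availability(mtbf_hours, mdt_hours):
    """Availability = MTBF / (MTBF + MDT)."""
    return mtbf_hours / (mtbf_hours + mdt_hours)

def availability_from_canaries(durations_min, timeout_min):
    """Fraction of canary runs that finished within the timeout.

    A duration of None stands for a run that failed or never completed.
    """
    if not durations_min:
        raise ValueError("no canary runs recorded")
    ok = sum(1 for d in durations_min if d is not None and d <= timeout_min)
    return ok / len(durations_min)

# Placeholder numbers: 99 hours between failures, 1 hour down each time.
formula_value = availability(mtbf_hours=99.0, mdt_hours=1.0)
print(round(formula_value, 4))  # 0.99

# Placeholder canary durations in minutes; 10.0 stands in for the unspecified "X".
runs = [4.0, 5.5, None, 3.2, 12.0]
canary_value = availability_from_canaries(runs, timeout_min=10.0)
print(canary_value)  # 0.6
```

Because down time includes planned maintenance, canary runs that fail during a maintenance window would count against availability in this scheme as well.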
More about metrics
•Availability != MTTR ( mean time to recover )
– MTTR is more important for applications like Cassini index build
•What is considered “available”?
– Performance degradation
– % of live slave nodes
– Other entry points such as Web UI
– Core data set availability
– Multi-tenancy scenario
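
A small worked example may help show why availability and recovery time are different measures. The numbers below are invented for illustration, and the sketch treats mean down time as the time to recover, which is a simplifying assumption.

```python
def availability(mtbf_hours, mdt_hours):
    """Availability = MTBF / (MTBF + MDT)."""
    return mtbf_hours / (mtbf_hours + mdt_hours)

# Cluster A: fails rarely, but each outage lasts 10 hours.
a = availability(mtbf_hours=990.0, mdt_hours=10.0)
# Cluster B: fails ten times as often, but recovers within 1 hour.
b = availability(mtbf_hours=99.0, mdt_hours=1.0)

print(a, b)  # 0.99 0.99
```

Both clusters report the same availability, yet a job with an SLA on the order of minutes is affected very differently by a 10-hour outage than by a 1-hour one.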
Ways to improve availability
•Automation
– Use Puppet and daemontools
– Monitor system health
•Redundancy
– Namenode HA
– Hot standby region server
•Isolation
– HDFS federation
– Region server grouping
•Congestion control
– RPC congestion control, HADOOP-9640
– Apply to both HDFS and HBase
•Features to enable “no downtime maintenance”
– Dynamic configuration update
– RPC compatibility
– Better ways to do rolling restart
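
One general way to picture RPC congestion control is a multi-level queue in which callers who already account for a large share of recent calls are placed in lower-priority levels, so a single heavy client cannot starve others. The Python sketch below is a toy illustration of that idea under stated assumptions. It is not the design or code from HADOOP-9640, and the class name, thresholds, and weights are all hypothetical.

```python
from collections import Counter, deque

class ToyCallQueue:
    """Toy multi-level RPC queue: heavy callers land in lower-priority levels."""

    def __init__(self, thresholds=(0.25, 0.5), weights=(4, 2, 1)):
        # thresholds: ascending shares of past calls; each one exceeded demotes a caller by one level.
        # weights: maximum calls served from each level per round (level 0 is highest priority).
        if len(weights) != len(thresholds) + 1:
            raise ValueError("need one more weight than thresholds")
        self.thresholds = thresholds
        self.weights = weights
        self.levels = [deque() for _ in weights]
        self.calls_by_user = Counter()

    def _level_for(self, user):
        # Share of all previously seen calls that came from this user.
        total = sum(self.calls_by_user.values())
        share = self.calls_by_user[user] / total if total else 0.0
        return sum(1 for t in self.thresholds if share > t)

    def put(self, user, call):
        # Pick the level from past traffic, then record this call.
        level = self._level_for(user)
        self.calls_by_user[user] += 1
        self.levels[level].append(call)

    def drain(self):
        """Yield queued calls using weighted round-robin across levels."""
        while any(self.levels):
            for level, weight in zip(self.levels, self.weights):
                for _ in range(min(weight, len(level))):
                    yield level.popleft()

# A test app floods the queue before a production job sends two calls.
q = ToyCallQueue()
for i in range(1, 7):
    q.put("test", f"test-{i}")
for i in range(1, 3):
    q.put("prod", f"prod-{i}")
order = list(q.drain())
print(order)
# ['test-1', 'prod-1', 'prod-2', 'test-2', 'test-3', 'test-4', 'test-5', 'test-6']
```

With a plain FIFO queue the two production calls would wait behind all six test calls. Here they are served second and third, which is the kind of isolation the heavy-write-load case study would have benefited from.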
Overview
• Hadoop growth @ eBay
• Availability study
• Next steps
Opportunities ahead
•More automation
•Availability and scalability
– Hadoop 2.0
– HBase fast recovery time
•Multi-tenancy
– Run production jobs with strong SLAs in big shared clusters
– QoS in HDFS and HBase
•New scenarios
– Interactive analysis with SQL
– Direct Hadoop Access from dev machines

Editor's notes

  1. Need to identify user or usage metrics: click rates, volume of data in the hub, cluster size, size of data in the cluster.
     Meeting notes (5/15/13 16:22): numbers need to be adjusted - Charles Cox/Bass Chong
  2. This list needs to be updated – Stephen Lee – Data domains