The curious case of
Data Lake Redemption
Shivinder Singh
Distinguished Member Technical Staff
© 2017 Verizon. This document is the property of Verizon and may not be used, modified or further distributed without Verizon’s written permission.
2
About Verizon
The best, most reliable networks in the industry
The largest U.S. wireless company with the largest 4G LTE network
The largest and fastest all-fiber network in the U.S.
One of the largest, most reliable and secure global networks
Using technology to address big challenges
Verizon Innovation Center in San Francisco, CA
3
Dedicated Corporate Citizen
Creating a platform for long-term growth for our customers, shareowners and society
Using our talent and technology to address society’s biggest challenges
Focusing on finding new ways our technology can improve healthcare, education and energy management
Focusing our philanthropic resources on becoming a channel for innovation and social change
Applying innovative technology to social issues
4
Big Data in the Enterprise
As the enterprise masters Big Data, it will become part of the enterprise solution framework
5
Shrinking the Interval
Analyzing
Reporting
Predicting
Operationalizing
Activating
WHAT happened?
WHY did it happen?
WHAT is happening?
What WILL happen?
MAKING it happen!
Batch
Ad Hoc Analysis
Analytics
Continuous Updates / Short Queries
Event-Based Triggering
Understand, Change, Grow, Compete, Lead
6
Effective strategies answer three key questions:
How will we deliver value?
How will we create value?
How will we capture value?
7
Unix Inode Management
mode
owners (2)
timestamps (3)
size
block count
direct blocks
single indirect
double indirect
triple indirect
[Diagram: the direct and indirect block pointers lead to data blocks]
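The pointer levels listed on this slide can be made concrete with a little arithmetic. The following is a minimal sketch, not taken from the slides: it assumes a classic ext2-style layout with 12 direct pointers, 4 KiB blocks, and 4-byte block pointers, and shows how many data blocks each level of indirection can address.

```python
# Assumed layout (not stated on the slide): 12 direct pointers,
# 4 KiB blocks, 4-byte block pointers, as in classic ext2-style inodes.
BLOCK_SIZE = 4096
POINTER_SIZE = 4
DIRECT_POINTERS = 12


def addressable_blocks(block_size=BLOCK_SIZE, pointer_size=POINTER_SIZE,
                       direct=DIRECT_POINTERS):
    """Return the number of data blocks reachable at each pointer level."""
    per_block = block_size // pointer_size  # pointers held by one indirect block
    return {
        "direct": direct,
        "single indirect": per_block,
        "double indirect": per_block ** 2,
        "triple indirect": per_block ** 3,
    }


def max_file_size(block_size=BLOCK_SIZE, pointer_size=POINTER_SIZE,
                  direct=DIRECT_POINTERS):
    """Largest file, in bytes, that one inode can address under these assumptions."""
    levels = addressable_blocks(block_size, pointer_size, direct)
    return sum(levels.values()) * block_size


if __name__ == "__main__":
    for level, blocks in addressable_blocks().items():
        print(f"{level:>15}: {blocks:,} blocks")
    print(f"max file size: {max_file_size():,} bytes")
```

Whatever the file size, each file consumes one inode, which is why very large numbers of small files strain metadata long before they strain capacity.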
8
Block Size Comparison: Data Lake vs. Single Client

DATA LAKE TOP 20
DB Size (GB)  DB Name                        Total Files  Total Blocks  Average Block Size (bytes)
328,807       /apps/hive/warehouse/prd1.db   32,461,500   30,283,722    11,678,898
180,361       /apps/hive/warehouse/prd2.db   7,030,688    6,568,455     29,498,992
114,237       /apps/hive/warehouse/prd3db    7,218,443    7,663,817     16,004,037
113,144       /apps/hive/warehouse/prd4.db   2,041,641    2,830,226     42,925,340
42,535        /apps/hive/warehouse/prd5.db   169,111      504,297       90,567,016
30,615        /apps/hive/warehouse/prd6.db   86,923       297,950       110,331,894
21,433        /apps/hive/warehouse/prd7.db   637,283      730,173       31,520,262
21,401        /apps/hive/warehouse/prd8.db   29,971       188,875       121,668,441
11,564        /apps/hive/warehouse/prd9.db   30,873       110,838       119,432,578
11,184        /apps/hive/warehouse/prd10.db  157,975      196,467       61,127,078
10,301        /apps/hive/warehouse/prd11.db  9,713,823    8,953,109     1,236,123
8,972         /apps/hive/warehouse/prd12.db  20,236       80,666        119,426,068
8,711         /apps/hive/warehouse/prd13.db  352,294      390,780       23,994,662
8,359         /apps/hive/warehouse/prd14.db  21,175       70,756        126,829,445
7,920         /apps/hive/warehouse/prd15.db  1,316,631    1,215,234     7,017,294
5,843         /apps/hive/warehouse/prd16.db  1,055,270    468,010       13,406,724
5,829         /apps/hive/warehouse/prd17.db  552,918      486,693       12,881,117
5,669         /apps/hive/warehouse/prd18.db  1,605        46,147        131,925,260
5,652         /apps/hive/warehouse/prd19.db  5,362,238    5,360,747     1,135,249
987           /apps/hive/warehouse/prd20.db  565,537      571,859       1,854,672

SINGLE CLIENT
DB Size (GB)  DB Name                        Total Files  Total Blocks  Average Block Size (bytes)
315,866       /apps/hive/warehouse/prd.db    2,245,257    2,574,897     131,717,734
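To make the comparison above easier to read, here is a rough sketch that turns a few rows of the table into NameNode load figures. Two inputs are assumptions rather than facts from the slides: a configured HDFS block size of 128 MiB, and the commonly quoted rule of thumb of roughly 150 bytes of NameNode heap per namespace object. Only files and blocks are counted, since the table does not list directories.

```python
# File, block, and average-block figures are copied from the table above.
# The 128 MiB block size and the ~150 bytes-per-object heap estimate are
# assumptions for illustration, not values given in the slides.
HDFS_BLOCK_SIZE = 128 * 1024 * 1024
BYTES_PER_NAMESPACE_OBJECT = 150

ROWS = {
    "prd1.db (data lake)": {"files": 32_461_500, "blocks": 30_283_722,
                            "avg_block": 11_678_898},
    "prd11.db (data lake)": {"files": 9_713_823, "blocks": 8_953_109,
                             "avg_block": 1_236_123},
    "prd.db (single client)": {"files": 2_245_257, "blocks": 2_574_897,
                               "avg_block": 131_717_734},
}


def namespace_objects(files, blocks):
    """Objects the NameNode tracks for this database (files plus blocks)."""
    return files + blocks


def estimated_heap_bytes(files, blocks):
    """Very rough NameNode heap estimate under the assumed per-object cost."""
    return namespace_objects(files, blocks) * BYTES_PER_NAMESPACE_OBJECT


def block_fill_ratio(avg_block):
    """Fraction of the assumed block size that an average block actually uses."""
    return avg_block / HDFS_BLOCK_SIZE


if __name__ == "__main__":
    for name, row in ROWS.items():
        heap_gib = estimated_heap_bytes(row["files"], row["blocks"]) / 2**30
        fill = block_fill_ratio(row["avg_block"])
        print(f"{name:<24} heap ~{heap_gib:5.2f} GiB  block fill {fill:6.1%}")
```

Under these assumptions, the single client stores a similar volume of data to prd1.db while tracking far fewer objects, because its blocks are nearly full.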
9
Small File Namenode Impact
High GC pauses
High RPC response times, running into minutes
Cluster Unresponsive
Jobs stalled
Full downtime
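One way to spot the directories behind these symptoms is to look at average file size per path. The sketch below assumes input in the column layout printed by `hdfs dfs -count` (directory count, file count, content size in bytes, path). The sample lines and the 16 MiB cutoff are hypothetical and chosen only for illustration.

```python
# Assumed cutoff below which a path counts as "small-file heavy".
SMALL_FILE_THRESHOLD = 16 * 1024 * 1024


def parse_count_line(line):
    """Parse one line shaped like: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME."""
    dirs, files, size, path = line.strip().split(None, 3)
    return {"path": path, "dirs": int(dirs), "files": int(files), "bytes": int(size)}


def small_file_paths(lines, threshold=SMALL_FILE_THRESHOLD):
    """Return paths whose average file size falls below the threshold."""
    flagged = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        row = parse_count_line(line)
        if row["files"] and row["bytes"] / row["files"] < threshold:
            flagged.append(row["path"])
    return flagged


# Hypothetical sample data, not taken from the slides.
SAMPLE = """\
         120       950000       1200000000000 /apps/hive/warehouse/example_small.db
          15         2000        250000000000 /apps/hive/warehouse/example_large.db
"""

if __name__ == "__main__":
    for path in small_file_paths(SAMPLE.splitlines()):
        print("small-file heavy:", path)
```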
10
The S-curve Maps Major Transitions
[Figure: S-curve of Performance over Time, with phases Ferment, Takeoff, Maturity, and Reverse Aging]
11
Analysis
Support Engagement
Increase NN heap
Bounce the NN/cluster
5 bug fix patches
Root Cause still not found
12
Root Cause and fix
Deep dive across 40 data lake clients
Review of 456 databases
Review of 373,083 tables
Review of 5K jobs
Fix
Reduce job frequency
Block size parameters for Hive and YARN
ZooKeeper tuning
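The slide does not say which parameters were changed. As an illustration only, the sketch below renders a set of standard Hive file-merge properties (`hive.merge.mapfiles`, `hive.merge.mapredfiles`, `hive.merge.tezfiles`, `hive.merge.smallfiles.avgsize`, `hive.merge.size.per.task`) as `SET` statements. The choice of properties and the 128 MiB target are assumptions, not the settings used in the talk.

```python
# Assumed target size; the talk does not state the values that were applied.
TARGET_FILE_BYTES = 128 * 1024 * 1024


def hive_merge_settings(target_bytes=TARGET_FILE_BYTES):
    """Build an illustrative set of Hive small-file merge properties."""
    return {
        # Merge small output files of map-only, map-reduce, and Tez jobs.
        "hive.merge.mapfiles": "true",
        "hive.merge.mapredfiles": "true",
        "hive.merge.tezfiles": "true",
        # Trigger a merge step when the average output file is smaller than this.
        "hive.merge.smallfiles.avgsize": str(target_bytes),
        # Size the merge step aims for per merged file.
        "hive.merge.size.per.task": str(2 * target_bytes),
    }


def render_set_statements(settings):
    """Render the properties as Hive session-level SET statements."""
    return "\n".join(f"SET {key}={value};" for key, value in settings.items())


if __name__ == "__main__":
    print(render_set_statements(hive_merge_settings()))
```

Settings like these reduce the number of files each job leaves behind; they complement, rather than replace, running jobs less often.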
13
Run Times
[Chart: run times, Average_2017 vs. Average_2018; axis range 0 to 400]
14
Job Counts
[Chart: job count, 2017_Job_count vs. 2018_Job_count; axis range 0 to 3500]
15
Other considerations
ZK is one of the most critical components
Numerous third-party components
Znodes being written outside of HDP components
ZK image size: 10 GB
5 M znodes
Fix
Targeted purge of znodes down to 100 K
Znode image size down to 100 MB
Ongoing ZK tuning
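A targeted purge starts with knowing which parts of the tree hold the znodes. The sketch below works on a plain list of znode paths (how the list is obtained is left open) and counts them per top-level node. The sample paths are hypothetical. It also checks the size of the reduction reported on this slide.

```python
from collections import Counter


def top_level(path):
    """Return the top-level znode for a path, e.g. '/app/locks/1' -> '/app'."""
    first = path.strip("/").split("/")[0]
    return "/" + first if first else "/"


def count_by_top_level(paths):
    """Count znodes under each top-level node to find purge candidates."""
    return Counter(top_level(p) for p in paths)


def reduction_percent(before, after):
    """Percentage reduction from `before` to `after`."""
    return 100 * (before - after) / before


if __name__ == "__main__":
    # Hypothetical paths for illustration; not taken from the slides.
    sample = [
        "/third_party_app/locks/0001",
        "/third_party_app/locks/0002",
        "/third_party_app/locks/0003",
        "/hiveserver2/instance-1",
    ]
    for node, count in count_by_top_level(sample).most_common():
        print(f"{node}: {count}")

    # Figures from the slide: 5 M znodes purged down to 100 K.
    print(f"znode reduction: {reduction_percent(5_000_000, 100_000):.0f}%")
```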
16
Stack Selection
Physical limit?
Performance is ultimately constrained by physical limits
E.g.:
Sailing ships & the power of the wind
Copper wire & transmission capability
Semiconductors & the speed of the electron
[Figure: Performance plotted against Time]
17
Once Upon a Time There Was an Inode…
• Redemption…
Andy Dufresne: “He's a phantom, an apparition, second cousin to Harvey the Rabbit.”
The Unix kernel is the basics!
Packaging changes; the basics remain the same
Small files are a technology limitation
Data democracy can be a boon or a bane
Issues are platform agnostic
18
Q & A
You can reach us at
shivinder.singh@vzw.com
Go to www.verizon.com/about/ for more information and news about our company, social responsibility, investor relations and careers.
