SlideShare una empresa de Scribd logo
1 de 14
Capacity Planning
Big Data Solution
Hello!
I am Riyaz A Shaikh
Full Stack Architect
You can find me at:
@jf @rizAShaikh
Riyaz A Shaikh
www.riyazshaikh.com
Requirement
Need to setup analytical and alerting system
on data produced by 10,000 servers.
Assuming 10 million events generated per
day by all servers. Considering 50 GB of data
per day.
Big Data Cluster
Considering Hortonworks Hadoop
distribution for cluster setup with following
systems.
 HDFS for data backup in compressed
format.
 Spark for data computation and
transformation.
 Apache Kafka as messaging service for
data completeness.
 Flume for data capture
 Elasticsearch for Analytical data storage
and search engine.
 Kibana for data visualization
Kafka cluster capacity
Assumption Size in GB Rationale
Daily average raw data ingest rate 50
Kafka retention period of 2 days 100 Raw data * retention period
Kafka replication factor of 3 300 Raw data * retention period * replication factor
Storage per Day 300 GB
Storage per Month
This is staging. Monthly calculation is not required because
data will be auto purged after retention period.
Table 1
Elasticsearch cluster capacity
Assumption Size in GB Rationale Remarks
Daily average raw data ingest rate 50
Elasticsearch 3 shards 50
Shards are index split. No
extra space required.
Elasticsearch 3 replica 150 Raw data * replicas
Each shards will be
replicated 3 times
Storage per Day 150 GB
Storage per Month 4500 GB Per day * 30 4.5 TB per month
Table 2
HDFS to backup Elasticsearch data
Assumption Size in GB Rationale Remarks
Daily average raw data ingest rate 50
HDFS replication factor by 3 150 Raw data * replication factor
70 % Compression 45 (150 – (150*70/100)) LZO compression
Storage per Day 45 GB
Storage per Month 1350 GB 1.35 TB per month
Table 3
Typical Node structure
Table 4
Node Structure
Typical per data node storage capacity
4 TB 2 X 2 TB HDD
Temp space for processing by Spark,
Map Reduce etc. 1 TB 25% of the data node
Data node usable storage
3 TB
Raw storage - Spark
reserve
Considering storage capacity from above three tables
Table 1, Table 2 and Table 3.
Total storage required per month is
300GB+ 4500GB+1350GB = 6150 GB (approx.. 6.15 TB)
“
Assuming 10% data growth per quarter. Further, considering
15% year-on-year growth in data volume.
Below given Table 5 indicated capacity required as per data
growth year-on-year
Capacity growth year-on-year
Table 5
10% Data Growth Quarterly (Data in TB)
Quarter Year 1 Year 2 Year 3 Year 4 Year 5
Q1 6.15 9.4 12.5 16.7 22.2
Q2 6.8 9.9 13.2 17.5 23.3
Q3 7.4 10.4 13.8 18.4 24.5
Q4 8.2 10.9 14.5 19.3 25.7
Yearly storage 28.5 40.6 54.0 71.9 95.7
Data nodes required =
yearly storage / Data node usable storage
10 14 18 24 32
Hardware Specs
Considering one year storage on ten data
node with one Namenode and one standby
Namenode.
Table 6 & 7 shows hardware configuration of
each machines.
Typical worker node hardware configurations
Table 6
Midline configuration (Data Node)
CPU 2 × 8 core 2.9 Ghz
Memory 64 GB DDR3-1600 ECC
Disk controller SAS 6 Gb/s
Disks 5 × 1 TB LFF SATA II 7200 RPM. 1 TB for OS
Network controller 2 × 1 Gb Ethernet
Notes
CPU features such as Intel’s Hyper-Threading and QPI are desirable.
Allocate memory to take advantage of triple- or quad-channel
memory configurations.
Typical Namenode hardware configurations
Table 7
Namenode configuration
CPU 2 × 8 core 2.9 Ghz
Memory 128 GB
Disk controller RAID 1
Disks 4 × 1 TB 1 for the OS, 2 TB for FS image and 1 for Journal node
Network controller 2 × 1 Gb Ethernet
Notes
CPU features such as Intel’s Hyper-Threading and QPI are desirable.
Allocate memory to take advantage of triple- or quad-channel
memory configurations.
Thanks!
Any questions & feedback!
Write to me at:
@rizAShaikh
Shaikh.r.a@gmail.com

Más contenido relacionado

La actualidad más candente

Improve Presto Architectural Decisions with Shadow Cache
 Improve Presto Architectural Decisions with Shadow Cache Improve Presto Architectural Decisions with Shadow Cache
Improve Presto Architectural Decisions with Shadow CacheAlluxio, Inc.
 
Hive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingHive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingMitsuharu Hamba
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and PigRicardo Varela
 
Can the elephants handle the no sql onslaught
Can the elephants handle the no sql onslaughtCan the elephants handle the no sql onslaught
Can the elephants handle the no sql onslaughtAung Thu Rha Hein
 
BioPig for scalable analysis of big sequencing data
BioPig for scalable analysis of big sequencing dataBioPig for scalable analysis of big sequencing data
BioPig for scalable analysis of big sequencing dataZhong Wang
 
20170927 pydata tokyo データサイエンスな皆様に送る分散処理の基礎の基礎、そしてPySparkの勘所
20170927 pydata tokyo データサイエンスな皆様に送る分散処理の基礎の基礎、そしてPySparkの勘所20170927 pydata tokyo データサイエンスな皆様に送る分散処理の基礎の基礎、そしてPySparkの勘所
20170927 pydata tokyo データサイエンスな皆様に送る分散処理の基礎の基礎、そしてPySparkの勘所Ryuji Tamagawa
 
Hadoop Architecture in Depth
Hadoop Architecture in DepthHadoop Architecture in Depth
Hadoop Architecture in DepthSyed Hadoop
 
Introduction to Hadoop - FinistJug
Introduction to Hadoop - FinistJugIntroduction to Hadoop - FinistJug
Introduction to Hadoop - FinistJugDavid Morin
 
Let's Compare: A Benchmark review of InfluxDB and Elasticsearch
Let's Compare: A Benchmark review of InfluxDB and ElasticsearchLet's Compare: A Benchmark review of InfluxDB and Elasticsearch
Let's Compare: A Benchmark review of InfluxDB and ElasticsearchInfluxData
 
SUNY Ulster - GIS Program Server Storage Options
SUNY Ulster - GIS Program Server Storage OptionsSUNY Ulster - GIS Program Server Storage Options
SUNY Ulster - GIS Program Server Storage OptionsMichael Dobe, Ph.D.
 
Hadoop breizhjug
Hadoop breizhjugHadoop breizhjug
Hadoop breizhjugDavid Morin
 
Getting Started on Hadoop
Getting Started on HadoopGetting Started on Hadoop
Getting Started on HadoopPaco Nathan
 
20171012 found IT #9 PySparkの勘所
20171012 found  IT #9 PySparkの勘所20171012 found  IT #9 PySparkの勘所
20171012 found IT #9 PySparkの勘所Ryuji Tamagawa
 
Aerospike Nested CDTs - Meetup Dec 2019
Aerospike Nested CDTs - Meetup Dec 2019Aerospike Nested CDTs - Meetup Dec 2019
Aerospike Nested CDTs - Meetup Dec 2019Aerospike
 
20170210 sapporotechbar7
20170210 sapporotechbar720170210 sapporotechbar7
20170210 sapporotechbar7Ryuji Tamagawa
 

La actualidad más candente (20)

Improve Presto Architectural Decisions with Shadow Cache
 Improve Presto Architectural Decisions with Shadow Cache Improve Presto Architectural Decisions with Shadow Cache
Improve Presto Architectural Decisions with Shadow Cache
 
Coriani 2
Coriani 2Coriani 2
Coriani 2
 
MongoDB @ fliptop
MongoDB @ fliptopMongoDB @ fliptop
MongoDB @ fliptop
 
Hive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingHive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReading
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pig
 
Can the elephants handle the no sql onslaught
Can the elephants handle the no sql onslaughtCan the elephants handle the no sql onslaught
Can the elephants handle the no sql onslaught
 
Big data
Big dataBig data
Big data
 
BioPig for scalable analysis of big sequencing data
BioPig for scalable analysis of big sequencing dataBioPig for scalable analysis of big sequencing data
BioPig for scalable analysis of big sequencing data
 
20170927 pydata tokyo データサイエンスな皆様に送る分散処理の基礎の基礎、そしてPySparkの勘所
20170927 pydata tokyo データサイエンスな皆様に送る分散処理の基礎の基礎、そしてPySparkの勘所20170927 pydata tokyo データサイエンスな皆様に送る分散処理の基礎の基礎、そしてPySparkの勘所
20170927 pydata tokyo データサイエンスな皆様に送る分散処理の基礎の基礎、そしてPySparkの勘所
 
PgconfSV compression
PgconfSV compressionPgconfSV compression
PgconfSV compression
 
Hadoop Architecture in Depth
Hadoop Architecture in DepthHadoop Architecture in Depth
Hadoop Architecture in Depth
 
Introduction to Hadoop - FinistJug
Introduction to Hadoop - FinistJugIntroduction to Hadoop - FinistJug
Introduction to Hadoop - FinistJug
 
R&D for L&D
R&D for L&DR&D for L&D
R&D for L&D
 
Let's Compare: A Benchmark review of InfluxDB and Elasticsearch
Let's Compare: A Benchmark review of InfluxDB and ElasticsearchLet's Compare: A Benchmark review of InfluxDB and Elasticsearch
Let's Compare: A Benchmark review of InfluxDB and Elasticsearch
 
SUNY Ulster - GIS Program Server Storage Options
SUNY Ulster - GIS Program Server Storage OptionsSUNY Ulster - GIS Program Server Storage Options
SUNY Ulster - GIS Program Server Storage Options
 
Hadoop breizhjug
Hadoop breizhjugHadoop breizhjug
Hadoop breizhjug
 
Getting Started on Hadoop
Getting Started on HadoopGetting Started on Hadoop
Getting Started on Hadoop
 
20171012 found IT #9 PySparkの勘所
20171012 found  IT #9 PySparkの勘所20171012 found  IT #9 PySparkの勘所
20171012 found IT #9 PySparkの勘所
 
Aerospike Nested CDTs - Meetup Dec 2019
Aerospike Nested CDTs - Meetup Dec 2019Aerospike Nested CDTs - Meetup Dec 2019
Aerospike Nested CDTs - Meetup Dec 2019
 
20170210 sapporotechbar7
20170210 sapporotechbar720170210 sapporotechbar7
20170210 sapporotechbar7
 

Similar a Big data solution capacity planning

Security sizing meetup
Security sizing meetupSecurity sizing meetup
Security sizing meetupDaliya Spasova
 
SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...
SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...
SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...Fred de Villamil
 
(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New Features
(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New Features(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New Features
(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New FeaturesAmazon Web Services
 
Mapping Data Flows Perf Tuning April 2021
Mapping Data Flows Perf Tuning April 2021Mapping Data Flows Perf Tuning April 2021
Mapping Data Flows Perf Tuning April 2021Mark Kromer
 
Extended memory access in PHP
Extended memory access in PHPExtended memory access in PHP
Extended memory access in PHPAndrew Goodwin
 
Whitepaper: Exadata Consolidation Success Story
Whitepaper: Exadata Consolidation Success StoryWhitepaper: Exadata Consolidation Success Story
Whitepaper: Exadata Consolidation Success StoryKristofferson A
 
Ceph Performance and Sizing Guide
Ceph Performance and Sizing GuideCeph Performance and Sizing Guide
Ceph Performance and Sizing GuideJose De La Rosa
 
AWS June Webinar Series - Getting Started: Amazon Redshift
AWS June Webinar Series - Getting Started: Amazon RedshiftAWS June Webinar Series - Getting Started: Amazon Redshift
AWS June Webinar Series - Getting Started: Amazon RedshiftAmazon Web Services
 
Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101Mark Kromer
 
Espc17 make your share point fly by tuning and optimising sql server
Espc17 make your share point  fly by tuning and optimising sql serverEspc17 make your share point  fly by tuning and optimising sql server
Espc17 make your share point fly by tuning and optimising sql serverIsabelle Van Campenhoudt
 
Make your SharePoint fly by tuning and optimizing SQL Server
Make your SharePoint  fly by tuning and optimizing SQL ServerMake your SharePoint  fly by tuning and optimizing SQL Server
Make your SharePoint fly by tuning and optimizing SQL Serverserge luca
 
Ceph Day Beijing - Optimizing Ceph Performance by Leveraging Intel Optane and...
Ceph Day Beijing - Optimizing Ceph Performance by Leveraging Intel Optane and...Ceph Day Beijing - Optimizing Ceph Performance by Leveraging Intel Optane and...
Ceph Day Beijing - Optimizing Ceph Performance by Leveraging Intel Optane and...Danielle Womboldt
 
Ceph Day Beijing - Optimizing Ceph performance by leveraging Intel Optane and...
Ceph Day Beijing - Optimizing Ceph performance by leveraging Intel Optane and...Ceph Day Beijing - Optimizing Ceph performance by leveraging Intel Optane and...
Ceph Day Beijing - Optimizing Ceph performance by leveraging Intel Optane and...Ceph Community
 
Sql server 2016 it just runs faster sql bits 2017 edition
Sql server 2016 it just runs faster   sql bits 2017 editionSql server 2016 it just runs faster   sql bits 2017 edition
Sql server 2016 it just runs faster sql bits 2017 editionBob Ward
 
3.INTEL.Optane_on_ceph_v2.pdf
3.INTEL.Optane_on_ceph_v2.pdf3.INTEL.Optane_on_ceph_v2.pdf
3.INTEL.Optane_on_ceph_v2.pdfhellobank1
 

Similar a Big data solution capacity planning (20)

Dba tuning
Dba tuningDba tuning
Dba tuning
 
Hadoop Research
Hadoop Research Hadoop Research
Hadoop Research
 
Security sizing meetup
Security sizing meetupSecurity sizing meetup
Security sizing meetup
 
SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...
SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...
SUE 2018 - Migrating a 130TB Cluster from Elasticsearch 2 to 5 in 20 Hours Wi...
 
(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New Features
(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New Features(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New Features
(DAT402) Amazon RDS PostgreSQL:Lessons Learned & New Features
 
Mapping Data Flows Perf Tuning April 2021
Mapping Data Flows Perf Tuning April 2021Mapping Data Flows Perf Tuning April 2021
Mapping Data Flows Perf Tuning April 2021
 
Extended memory access in PHP
Extended memory access in PHPExtended memory access in PHP
Extended memory access in PHP
 
Intro to hadoop
Intro to hadoopIntro to hadoop
Intro to hadoop
 
Empower Data-Driven Organizations
Empower Data-Driven OrganizationsEmpower Data-Driven Organizations
Empower Data-Driven Organizations
 
Whitepaper: Exadata Consolidation Success Story
Whitepaper: Exadata Consolidation Success StoryWhitepaper: Exadata Consolidation Success Story
Whitepaper: Exadata Consolidation Success Story
 
Ceph Performance and Sizing Guide
Ceph Performance and Sizing GuideCeph Performance and Sizing Guide
Ceph Performance and Sizing Guide
 
AWS June Webinar Series - Getting Started: Amazon Redshift
AWS June Webinar Series - Getting Started: Amazon RedshiftAWS June Webinar Series - Getting Started: Amazon Redshift
AWS June Webinar Series - Getting Started: Amazon Redshift
 
Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101
 
Espc17 make your share point fly by tuning and optimising sql server
Espc17 make your share point  fly by tuning and optimising sql serverEspc17 make your share point  fly by tuning and optimising sql server
Espc17 make your share point fly by tuning and optimising sql server
 
Make your SharePoint fly by tuning and optimizing SQL Server
Make your SharePoint  fly by tuning and optimizing SQL ServerMake your SharePoint  fly by tuning and optimizing SQL Server
Make your SharePoint fly by tuning and optimizing SQL Server
 
Ceph Day Beijing - Optimizing Ceph Performance by Leveraging Intel Optane and...
Ceph Day Beijing - Optimizing Ceph Performance by Leveraging Intel Optane and...Ceph Day Beijing - Optimizing Ceph Performance by Leveraging Intel Optane and...
Ceph Day Beijing - Optimizing Ceph Performance by Leveraging Intel Optane and...
 
Ceph Day Beijing - Optimizing Ceph performance by leveraging Intel Optane and...
Ceph Day Beijing - Optimizing Ceph performance by leveraging Intel Optane and...Ceph Day Beijing - Optimizing Ceph performance by leveraging Intel Optane and...
Ceph Day Beijing - Optimizing Ceph performance by leveraging Intel Optane and...
 
Sql server 2016 it just runs faster sql bits 2017 edition
Sql server 2016 it just runs faster   sql bits 2017 editionSql server 2016 it just runs faster   sql bits 2017 edition
Sql server 2016 it just runs faster sql bits 2017 edition
 
3.INTEL.Optane_on_ceph_v2.pdf
3.INTEL.Optane_on_ceph_v2.pdf3.INTEL.Optane_on_ceph_v2.pdf
3.INTEL.Optane_on_ceph_v2.pdf
 
11g R2
11g R211g R2
11g R2
 

Último

Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024The Digital Insurer
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 

Último (20)

Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 

Big data solution capacity planning

  • 2. Hello! I am Riyaz A Shaikh Full Stack Architect You can find me at: @jf @rizAShaikh Riyaz A Shaikh www.riyazshaikh.com
  • 3. Requirement Need to setup analytical and alerting system on data produced by 10,000 servers. Assuming 10 million events generated per day by all servers. Considering 50 GB of data per day.
  • 4. Big Data Cluster Considering Hortonworks Hadoop distribution for cluster setup with following systems.  HDFS for data backup in compressed format.  Spark for data computation and transformation.  Apache Kafka as messaging service for data completeness.  Flume for data capture  Elasticsearch for Analytical data storage and search engine.  Kibana for data visualization
  • 5. Kafka cluster capacity Assumption Size in GB Rationale Daily average raw data ingest rate 50 Kafka retention period of 2 days 100 Raw data * retention period Kafka replication factor of 3 300 Raw data * retention period * replication factor Storage per Day 300 GB Storage per Month This is staging. Monthly calculation is not required because data will be auto purged after retention period. Table 1
  • 6. Elasticsearch cluster capacity Assumption Size in GB Rationale Remarks Daily average raw data ingest rate 50 Elasticsearch 3 shards 50 Shards are index split. No extra space required. Elasticsearch 3 replica 150 Raw data * replicas Each shards will be replicated 3 times Storage per Day 150 GB Storage per Month 4500 GB Per day * 30 4.5 TB per month Table 2
  • 7. HDFS to backup Elasticsearch data Assumption Size in GB Rationale Remarks Daily average raw data ingest rate 50 HDFS replication factor by 3 150 Raw data * replication factor 70 % Compression 45 (150 – (150*70/100)) LZO compression Storage per Day 45 GB Storage per Month 1350 GB 1.35 TB per month Table 3
  • 8. Typical Node structure Table 4 Node Structure Typical per data node storage capacity 4 TB 2 X 2 TB HDD Temp space for processing by Spark, Map Reduce etc. 1 TB 25% of the data node Data node usable storage 3 TB Raw storage - Spark reserve Considering storage capacity from above three tables Table 1, Table 2 and Table 3. Total storage required per month is 300GB+ 4500GB+1350GB = 6150 GB (approx.. 6.15 TB)
  • 9. “ Assuming 10% data growth per quarter. Further, considering 15% year-on-year growth in data volume. Below given Table 5 indicated capacity required as per data growth year-on-year
  • 10. Capacity growth year-on-year Table 5 10% Data Growth Quarterly (Data in TB) Quarter Year 1 Year 2 Year 3 Year 4 Year 5 Q1 6.15 9.4 12.5 16.7 22.2 Q2 6.8 9.9 13.2 17.5 23.3 Q3 7.4 10.4 13.8 18.4 24.5 Q4 8.2 10.9 14.5 19.3 25.7 Yearly storage 28.5 40.6 54.0 71.9 95.7 Data nodes required = yearly storage / Data node usable storage 10 14 18 24 32
  • 11. Hardware Specs Considering one year storage on ten data node with one Namenode and one standby Namenode. Table 6 & 7 shows hardware configuration of each machines.
  • 12. Typical worker node hardware configurations Table 6 Midline configuration (Data Node) CPU 2 × 8 core 2.9 Ghz Memory 64 GB DDR3-1600 ECC Disk controller SAS 6 Gb/s Disks 5 × 1 TB LFF SATA II 7200 RPM. 1 TB for OS Network controller 2 × 1 Gb Ethernet Notes CPU features such as Intel’s Hyper-Threading and QPI are desirable. Allocate memory to take advantage of triple- or quad-channel memory configurations.
  • 13. Typical Namenode hardware configurations Table 7 Namenode configuration CPU 2 × 8 core 2.9 Ghz Memory 128 GB Disk controller RAID 1 Disks 4 × 1 TB 1 for the OS, 2 TB for FS image and 1 for Journal node Network controller 2 × 1 Gb Ethernet Notes CPU features such as Intel’s Hyper-Threading and QPI are desirable. Allocate memory to take advantage of triple- or quad-channel memory configurations.
  • 14. Thanks! Any questions & feedback! Write to me at: @rizAShaikh Shaikh.r.a@gmail.com