Intelligent People. Uncommon Ideas. Handling Data in Mega Scale Web Apps (lessons learnt @ Directi) Vineet Gupta | GM – Software Engineering | Directi http://vineetgupta.spaces.live.com Licensed under Creative Commons Attribution Sharealike Noncommercial
Outline Characteristics App Tier Scaling Replication Partitioning Consistency Normalization Caching Data Engine Types
Not Covering Offline Processing (Batching / Queuing) Distributed Processing – Map Reduce Non-blocking IO Fault Detection, Tolerance and Recovery
How Big Does it Get 22M+ users Dozens of DB servers Dozens of Web servers Six specialized graph database servers to run recommendations engine Source:http://highscalability.com/digg-architecture
How Big Does it Get 1 TB / Day 100 M blogs indexed / day 10 B objects indexed / day 0.5 B photos and videos Data doubles in 6 months Users double in 6 months Source:http://www.royans.net/arch/2007/10/25/scaling-technorati-100-million-blogs-indexed-everyday/
How Big Does it Get 2 PB Raw Storage 470 M photos, 4-5 sizes each 400 k photos added / day 35 M photos in Squid cache (total) 2 M photos in Squid RAM 38k reqs / sec to Memcached 4 B queries / day Source:http://mysqldba.blogspot.com/2008/04/mysql-uc-2007-presentation-file.html
How Big Does it Get Virtualized database spans 600 production instances residing in 100+ server clusters distributed over 8 datacenters 2 PB of data 26 B SQL queries / day 1 B page views / day 3 B API calls / month 15,000 App servers Source:http://highscalability.com/ebay-architecture/
How Big Does it Get 450,000 low cost commodity servers in 2006 Indexed 8 B web-pages in 2005 200 GFS clusters (1 cluster = 1,000 – 5,000 machines) Read / write thruput = 40 GB / sec across a cluster Map-Reduce 100k jobs / day 20 PB of data processed / day 10k MapReduce programs Source:http://highscalability.com/google-architecture/
Key Trends Data Size ~ PB Data Growth ~ TB / day No of servers – 10s to 10,000 No of datacenters – 1 to 10 Queries – B+ / day Specialized needs – more / other than RDBMS
Host RAM CPU CPU RAM CPU RAM App Server DB Server Vertical Scaling (Scaling Up)
Big Irons Sunfire E20k PowerEdge SC1435 36x 1.8GHz processors Dualcore 1.8 GHz processor $450,000 - $2,500,000 Around $1,500
Vertical Scaling (Scaling Up) Increasing the hardware resources on a host Pros Simple to implement Fast turnaround time Cons Finite limit Hardware does not scale linearly (diminishing returns for each incremental unit) Requires downtime Increases Downtime Impact Incremental costs increase exponentially
Host Host App Server DB Server Vertical Partitioning of Services
Vertical Partitioning of Services Split services on separate nodes Each node performs different tasks Pros Increases per application Availability Task-based specialization, optimization and tuning possible Reduces context switching Simple to implement for out of band processes No changes to App required Flexibility increases Cons Sub-optimal resource utilization May not increase overall availability Finite Scalability
Horizontal Scaling of App Server Web Server Load Balancer Web Server DB Server Web Server
Horizontal Scaling of App Server Add more nodes for the same service Identical, doing the same task Load Balancing Hardware balancers are faster Software balancers are more customizable
The problem - State Web Server User 1 Load Balancer Web Server DB Server User 2 Web Server
Sticky Sessions Web Server User 1 Load Balancer Web Server DB Server User 2 Web Server Asymmetrical load distribution Downtime
Central Session Store Web Server User 1 Load Balancer Web Server Session Store User 2 Web Server SPOF Reads and Writes generate network + disk IO
Clustered Sessions Web Server User 1 Load Balancer Web Server User 2 Web Server
Clustered Sessions Pros No SPOF Easier to setup Fast Reads Cons n x Writes Increase in network IO with increase in nodes Stale data (rare)
Sticky Sessions with Central Store Web Server User 1 Load Balancer Web Server DB Server User 2 Web Server
More Session Management No Sessions Stuff state in a cookie and sign it! Cookie is sent with every request / response Super Slim Sessions Keep small amount of frequently used data in cookie Pull rest from DB (or central session store)
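The "stuff state in a cookie and sign it" idea can be sketched as follows. This is a minimal illustration using an HMAC-SHA256 signature, not the scheme used at Directi; `SECRET_KEY`, `encode_session` and `decode_session` are hypothetical names. The state is signed but not encrypted, so the client can still read it.

```python
import base64
import hashlib
import hmac
import json

# Hypothetical server-side secret, shared by all web servers (assumption).
SECRET_KEY = b"replace-with-a-real-secret"


def encode_session(state: dict) -> str:
    """Serialize the state and append an HMAC-SHA256 signature."""
    raw = json.dumps(state, sort_keys=True).encode("utf-8")
    payload = base64.urlsafe_b64encode(raw)
    signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    return payload.decode("ascii") + "." + signature


def decode_session(cookie: str):
    """Return the state if the signature is valid, otherwise None."""
    payload, sep, signature = cookie.rpartition(".")
    if not sep:
        return None  # no signature part at all
    expected = hmac.new(SECRET_KEY, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    # Constant-time comparison, so the check does not leak timing information.
    if not hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8")):
        return None
    return json.loads(base64.urlsafe_b64decode(payload))


cookie = encode_session({"user_id": 42, "theme": "dark"})
print(decode_session(cookie))
```

Because any web server holding the key can verify the cookie, no server needs to remember the user, which is what removes the need for sticky or shared sessions.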
Sessions - Recommendation Bad Sticky sessions Good Clustered sessions for small number of nodes and / or small write volume Central sessions for large number of nodes or large write volume Great No Sessions!
App Tier Scaling - More HTTP Accelerators / Reverse Proxy Static content caching, redirect to lighter HTTP Async NIO on user-side, Keep-alive connection pool CDN Get closer to your user Akamai, Limelight IP Anycasting Async NIO
Scaling a Web App App-Layer Add more nodes and load balance! Avoid Sticky Sessions Avoid Sessions!! Data Store Tricky! Very Tricky!!!
Replication = Scaling by Duplication App Layer T1, T2, T3, T4
Replication = Scaling by Duplication App Layer T1, T2, T3, T4 T1, T2, T3, T4 T1, T2, T3, T4 T1, T2, T3, T4 T1, T2, T3, T4 Each node has its own copy of data Shared Nothing Cluster
Replication Read : Write = 4:1 Scale reads at cost of writes! Duplicate Data – each node has its own copy Master Slave Writes sent to one node, cascaded to others Multi-Master Writes can be sent to multiple nodes Can lead to deadlocks Requires conflict management
Master-Slave App Layer Master Slave Slave Slave Slave n x Writes – Async vs. Sync SPOF Async -  Critical Reads from Master!
Multi-Master App Layer Master Master Slave Slave Slave n x Writes – Async vs. Sync No SPOF Conflicts!
Replication Considerations Asynchronous Guaranteed, but out-of-band replication from Master to Slave Master updates its own db and returns a response to client Replication from Master to Slave takes place asynchronously Faster response to a client  Slave data is marginally behind the Master Requires modification to App to send critical reads and writes to master, and load balance all other reads Synchronous Guaranteed, in-band replication from Master to Slave Master updates its own db, and confirms all slaves have updated their db before returning a response to client Slower response to a client  Slaves have the same data as the Master at all times Requires modification to App to send writes to master and load balance all reads
Replication Considerations Replication at RDBMS level Support may exist in RDBMS or through 3rd party tool Faster and more reliable App must send writes to Master, reads to any db and critical reads to Master Replication at Driver / DAO level Driver / DAO layer ensures writes are performed on all connected DBs Reads are load balanced Critical reads are sent to a Master In most cases RDBMS agnostic Slower and in some cases less reliable
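The routing rule for asynchronous master-slave replication (writes and critical reads to the master, all other reads load balanced) might look like the sketch below. `ReplicatedRouter` and the node names are hypothetical; a real driver or DAO layer would hand back connections rather than strings.

```python
import itertools


class ReplicatedRouter:
    """Pick a node for each query under async master-slave replication."""

    def __init__(self, master, slaves):
        self.master = master
        # Round-robin over the slaves; fall back to the master if there are none.
        self._read_cycle = itertools.cycle(slaves or [master])

    def node_for(self, is_write, critical=False):
        # Writes must hit the master. Critical reads also go there because
        # an async slave may be slightly behind.
        if is_write or critical:
            return self.master
        return next(self._read_cycle)


router = ReplicatedRouter("master", ["slave-1", "slave-2"])
print(router.node_for(is_write=True))                   # master
print(router.node_for(is_write=False, critical=True))   # master
print(router.node_for(is_write=False))                  # slave-1
```

With synchronous replication the `critical` flag becomes unnecessary, since every slave holds the same data as the master.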
Diminishing Returns Per Server: 4R, 1W 2R, 1W 1R, 1W Read Read Read Write Write Write Read Read Read Read Write Write Write Write
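The per-server figures above can be reproduced with a little arithmetic, assuming the three cases correspond to 1, 2 and 4 replicas and the 4:1 read-to-write ratio from the Replication slide. Reads are divided across replicas, but every replica has to apply every write. `per_server_load` is a hypothetical helper.

```python
def per_server_load(total_reads, total_writes, servers):
    """Reads are split across replicas; each replica applies every write."""
    return total_reads / servers + total_writes


for n in (1, 2, 4):
    print(n, "server(s):", per_server_load(4, 1, n), "units each")
# 1 server(s): 5.0 units each
# 2 server(s): 3.0 units each
# 4 server(s): 2.0 units each
```

Four times the hardware cuts the per-server load only from 5 to 2 units, and no number of replicas can push it below the write load of 1.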
Partitioning = Scaling by Division Vertical Partitioning Divide data on tables / columns Scale to as many boxes as there are tables or columns Finite Horizontal Partitioning Divide data on rows Scale to as many boxes as there are rows! Limitless scaling
Vertical Partitioning App Layer T1, T2, T3, T4, T5 Note: A node here typically represents a shared nothing cluster
Vertical Partitioning App Layer T3 T4 T5 T2 T1 Facebook - User table, posts table can be on separate nodes Joins need to be done in code (Why have them?)
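Once the user table and the posts table live on separate nodes, the join moves into application code. A minimal sketch, with in-memory dictionaries standing in for the two nodes (`users_node`, `posts_node` and `posts_with_authors` are hypothetical names):

```python
# Hypothetical stand-ins for two separate database nodes.
users_node = {1: {"name": "alice"}, 2: {"name": "bob"}}
posts_node = [
    {"post_id": 10, "user_id": 1, "text": "hello"},
    {"post_id": 11, "user_id": 2, "text": "world"},
    {"post_id": 12, "user_id": 1, "text": "again"},
]


def posts_with_authors(posts, users):
    """Join posts to users in code: one batched user lookup, then a merge."""
    wanted = {p["user_id"] for p in posts}
    authors = {uid: users[uid] for uid in wanted if uid in users}
    return [
        {**p, "author": authors.get(p["user_id"], {}).get("name")}
        for p in posts
    ]


for row in posts_with_authors(posts_node, users_node):
    print(row["post_id"], row["author"], row["text"])
```

Collecting the user IDs first keeps this to one round trip per node instead of one lookup per post.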
Horizontal Partitioning App Layer T3 T4 T5 T2 T1 First million rows T3 T4 T5 T2 T1 Second million rows T3 T4 T5 T2 T1 Third million rows
Horizontal Partitioning Schemes Value Based Split on timestamp of posts Split on first alphabet of user name Hash Based Use a hash function to determine cluster Lookup Map First Come First Serve Round Robin
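The three schemes can be sketched side by side. All names here are hypothetical, and the a-m / n-z split is just one example of a value-based rule. `hashlib` is used instead of Python's built-in `hash()` because the latter is salted per process for strings and would send the same key to different clusters after a restart.

```python
import hashlib


def hash_shard(key: str, num_shards: int) -> int:
    """Hash based: a stable hash of the key picks the cluster."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def value_shard(user_name: str) -> int:
    """Value based: split on the first letter (up to 'm' -> 0, after -> 1)."""
    return 0 if user_name[:1].lower() <= "m" else 1


class LookupMap:
    """Lookup map: remember each key's cluster, assigning new keys round robin."""

    def __init__(self, num_shards: int):
        self.num_shards = num_shards
        self.assignments = {}

    def shard_for(self, key: str) -> int:
        if key not in self.assignments:
            self.assignments[key] = len(self.assignments) % self.num_shards
        return self.assignments[key]


lookup = LookupMap(3)
print([lookup.shard_for(k) for k in ["a", "b", "c", "d", "a"]])  # [0, 1, 2, 0, 0]
print(value_shard("alice"), value_shard("zed"))                  # 0 1
```

A lookup map costs an extra hop per query but lets individual keys be moved later; with the plain modulo hash shown here, changing `num_shards` remaps most keys.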
CAP Theorem Source:http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.1495
Transactions Transactions make you feel alone No one else manipulates the data when you are Transactional serializability The behavior is as if a serial order exists Source:http://blogs.msdn.com/pathelland/ Slide 46
Life in the “Now” Transactions live in the “now” inside services Time marches forward Transactions commit  Advancing time Transactions see the committed transactions A service’s biz-logic lives in the “now” Source:http://blogs.msdn.com/pathelland/ Slide 47
Sending Unlocked Data Isn’t “Now” Messages contain unlocked data Assume no shared transactions Unlocked data may change Unlocking it allows change Messages are not from the “now” They are from the past There is no simultaneity at a distance! Knowledge travels at speed of light By the time you see a distant object it may have changed! By the time you see a message, the data may have changed! Services, transactions, and locks bound simultaneity! Simultaneity only inside a transaction! Simultaneity only inside a service! Source:http://blogs.msdn.com/pathelland/ Slide 48
Outside Data: a Blast from the Past All data from distant stars is from the past The sun may have blown up 5 minutes ago We won’t know for 3 minutes more… All data seen from a distant service is from the “past” By the time you see it, it has been unlocked and may change Each service has its own perspective Inside data is “now”; outside data is “past” My inside is not your inside; my outside is not your outside This is like going from Newtonian to Einsteinian physics Instant knowledge Classic distributed computing: many systems look like one RPC, 2-phase commit, remote method calls… In Einstein’s world, everything is “relative” to one’s perspective Today: No attempt to blur the boundary Source:http://blogs.msdn.com/pathelland/ Slide 49
Versions and Distributed Systems Can’t have “the same” data at many locations Unless it is a snapshot Changing distributed data needs versions Creates a snapshot… Source:http://blogs.msdn.com/pathelland/
Subjective Consistency Given what I know here and now, make a decision Remember the versions of all the data used to make this decision Record the decision as being predicated on these versions Other copies of the object may make divergent decisions Try to sort out conflicts within the family If necessary, programmatically apologize Very rarely, whine and fuss for human help Subjective Consistency  Given the information I have at hand, make a decision and act on it !  Remember the information at hand ! Ambassadors Had Authority Back before radio, it could be months between communication with the king.  Ambassadors would make treaties and much more... They had binding authority.  The mess was sorted out later! Source:http://blogs.msdn.com/pathelland/
Eventual Consistency Eventually, all the copies of the object share their changes “I’ll show you mine if you show me yours!” Now, apply subjective consistency: “Given the information I have at hand, make a decision and act on it!” Everyone has the same information, everyone comes to the same conclusion about the decisions to take… Eventual Consistency Everyone sharing their knowledge leads to the same result... This is NOT magic; it is a design requirement ! Idempotence, commutativity, and associativity of the operations (decisions made) are all implied by this requirement Source:http://blogs.msdn.com/pathelland/
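One simple way to meet that design requirement is to let each copy hold a set of observed facts and merge by set union, which is idempotent, commutative and associative. This is an illustrative sketch, not something the slides prescribe; `merge`, `converge` and the sample facts are hypothetical.

```python
from functools import reduce


def merge(mine: frozenset, theirs: frozenset) -> frozenset:
    """Share knowledge: the merged copy knows everything either copy knew."""
    return mine | theirs


def converge(replicas):
    """Fold any number of copies together, in whatever order they arrive."""
    return reduce(merge, replicas, frozenset())


a = frozenset({"order-1 placed"})
b = frozenset({"order-1 placed", "order-1 paid"})
c = frozenset({"order-2 placed"})

# Exchange order does not matter, and receiving a message twice is harmless.
print(converge([a, b, c]) == converge([c, a, b]))   # True
print(merge(b, b) == b)                             # True
```

Operations that lack these properties, such as "overwrite with my value", can leave copies disagreeing depending on the order in which messages arrive.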
Why Normalize? Classic problem with de-normalization Can’t update Sam’s phone # since there are many copies

Emp #  Emp Name  Mgr #  Mgr Name  Emp Phone  Mgr Phone
47     Joe       13     Sam       5-1234     6-9876
18     Sally     38     Harry     3-3123     5-6782
91     Pete      13     Sam       2-1112     6-9876
66     Mary      02     Betty     5-7349     4-0101

Normalization’s Goal Is Eliminating Update Anomalies Can Be Changed Without “Funny Behavior” Each Data Item Lives in One Place De-normalization is OK if you aren’t going to update! Source:http://blogs.msdn.com/pathelland/
Eliminate Joins
Eliminate Joins 6 joins for 1 query! Do you think FB would do this? And how would you do joins with partitioned data? De-normalization removes joins But increases data volume But disk is cheap and getting cheaper And can lead to inconsistent data If you are lazy However this is not really an issue
“Append-Only” Data Many Kinds of Computing are “Append-Only” Lots of observations are made about the world Debits, credits, Purchase-Orders, Customer-Change-Requests, etc As time moves on, more observations are added You can’t change the history but you can add new observations Derived Results May Be Calculated Estimate of the “current” inventory Frequently inaccurate Historic Rollups Are Calculated Monthly bank statements
Databases and Transaction Logs Transaction Logs Are the Truth High-performance & write-only Describe ALL the changes to the data Data-Base  the Current Opinion Describes the latest value of the data as perceived by the application Log DB The Database Is a Caching of the Transaction Log ! It is the subset of the latest committed values represented in  the transaction log… Source:http://blogs.msdn.com/pathelland/
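The claim that the database is a cache of the transaction log can be made concrete by replaying a log into a dictionary. The log format and the `replay` function are hypothetical, chosen only to illustrate the idea.

```python
def replay(log):
    """Rebuild the current 'opinion' of the data from an append-only log."""
    state = {}
    for op, key, value in log:
        if op == "set":
            state[key] = value        # latest committed value wins
        elif op == "delete":
            state.pop(key, None)
    return state


# Hypothetical log: entries are only ever appended, never changed.
log = [
    ("set", "sam_phone", "6-9876"),
    ("set", "joe_phone", "5-1234"),
    ("set", "sam_phone", "6-0000"),
    ("delete", "joe_phone", None),
]

print(replay(log))        # {'sam_phone': '6-0000'}
print(replay(log[:2]))    # state as of an earlier point in time
```

Replaying a prefix of the log yields the state at an earlier moment, which is why the log, not the table, is the record of what actually happened.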
We Are Swimming in a Sea of Immutable Data  Source:http://blogs.msdn.com/pathelland/

Directi Case Study Contest - Team Alkaline Jazz from IIFT
 
Directi Case Study Contest - Singles 360 by Team Awesome from IIM A
Directi Case Study Contest - Singles 360 by Team Awesome from IIM ADirecti Case Study Contest - Singles 360 by Team Awesome from IIM A
Directi Case Study Contest - Singles 360 by Team Awesome from IIM A
 
Directi On Campus- Engineering Presentation - 2011-2012
Directi On Campus- Engineering Presentation - 2011-2012Directi On Campus- Engineering Presentation - 2011-2012
Directi On Campus- Engineering Presentation - 2011-2012
 
Directi On Campus- Engineering Presentation
Directi On Campus- Engineering PresentationDirecti On Campus- Engineering Presentation
Directi On Campus- Engineering Presentation
 
Directi On Campus- Engineering Presentation
Directi On Campus- Engineering PresentationDirecti On Campus- Engineering Presentation
Directi On Campus- Engineering Presentation
 
Directi On Campus- Engineering Presentation
Directi On Campus- Engineering PresentationDirecti On Campus- Engineering Presentation
Directi On Campus- Engineering Presentation
 

Último

Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusZilliz
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbuapidays
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 

Último (20)

Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 

Handling Data in Mega Scale Systems

  • 1. Intelligent People. Uncommon Ideas. Handling Data in Mega Scale Web Apps (lessons learnt @ Directi) Vineet Gupta | GM – Software Engineering | Directi http://vineetgupta.spaces.live.com Licensed under Creative Commons Attribution Sharealike Noncommercial
  • 2. Outline Characteristics App Tier Scaling Replication Partitioning Consistency Normalization Caching Data Engine Types
  • 3. Not Covering Offline Processing (Batching / Queuing) Distributed Processing – Map Reduce Non-blocking IO Fault Detection, Tolerance and Recovery
  • 4. Outline Characteristics App Tier Scaling Replication Partitioning Consistency Normalization Caching Data Engine Types
  • 5. How Big Does it Get 22M+ users Dozens of DB servers Dozens of Web servers Six specialized graph database servers to run recommendations engine Source: http://highscalability.com/digg-architecture
  • 6. How Big Does it Get 1 TB / Day 100 M blogs indexed / day 10 B objects indexed / day 0.5 B photos and videos Data doubles in 6 months Users double in 6 months Source: http://www.royans.net/arch/2007/10/25/scaling-technorati-100-million-blogs-indexed-everyday/
  • 7. How Big Does it Get 2 PB Raw Storage 470 M photos, 4-5 sizes each 400 k photos added / day 35 M photos in Squid cache (total) 2 M photos in Squid RAM 38k reqs / sec to Memcached 4 B queries / day Source: http://mysqldba.blogspot.com/2008/04/mysql-uc-2007-presentation-file.html
  • 8. How Big Does it Get Virtualized database spans 600 production instances residing in 100+ server clusters distributed over 8 datacenters 2 PB of data 26 B SQL queries / day 1 B page views / day 3 B API calls / month 15,000 App servers Source: http://highscalability.com/ebay-architecture/
  • 9. How Big Does it Get 450,000 low cost commodity servers in 2006 Indexed 8 B web-pages in 2005 200 GFS clusters (1 cluster = 1,000 – 5,000 machines) Read / write throughput = 40 GB / sec across a cluster Map-Reduce 100k jobs / day 20 PB of data processed / day 10k MapReduce programs Source: http://highscalability.com/google-architecture/
  • 10. Key Trends Data Size ~ PB Data Growth ~ TB / day No of servers – 10s to 10,000 No of datacenters – 1 to 10 Queries – B+ / day Specialized needs – more / other than RDBMS
  • 11. Outline Characteristics App Tier Scaling Replication Partitioning Consistency Normalization Caching Data Engine Types
  • 12. Host RAM CPU CPU RAM CPU RAM App Server DB Server Vertical Scaling (Scaling Up)
  • 13. Big Irons Sunfire E20k PowerEdge SC1435 36x 1.8GHz processors Dualcore 1.8 GHz processor $450,000 - $2,500,000 Around $1,500
  • 14. Vertical Scaling (Scaling Up) Increasing the hardware resources on a host Pros Simple to implement Fast turnaround time Cons Finite limit Hardware does not scale linearly (diminishing returns for each incremental unit) Requires downtime Increases Downtime Impact Incremental costs increase exponentially
  • 15. Host Host App Server DB Server Vertical Partitioning of Services
  • 16. Vertical Partitioning of Services Split services on separate nodes Each node performs different tasks Pros Increases per application Availability Task-based specialization, optimization and tuning possible Reduces context switching Simple to implement for out of band processes No changes to App required Flexibility increases Cons Sub-optimal resource utilization May not increase overall availability Finite Scalability
  • 17. Horizontal Scaling of App Server Web Server Load Balancer Web Server DB Server Web Server
  • 18. Horizontal Scaling of App Server Add more nodes for the same service Identical, doing the same task Load Balancing Hardware balancers are faster Software balancers are more customizable
  • 19. The problem - State Web Server User 1 Load Balancer Web Server DB Server User 2 Web Server
  • 20. Sticky Sessions Web Server User 1 Load Balancer Web Server DB Server User 2 Web Server Asymmetrical load distribution Downtime
  • 21. Central Session Store Web Server User 1 Load Balancer Web Server Session Store User 2 Web Server SPOF Reads and Writes generate network + disk IO
  • 22. Clustered Sessions Web Server User 1 Load Balancer Web Server User 2 Web Server
  • 23. Clustered Sessions Pros No SPOF Easier to setup Fast Reads Cons n x Writes Increase in network IO with increase in nodes Stale data (rare)
  • 24. Sticky Sessions with Central Store Web Server User 1 Load Balancer Web Server DB Server User 2 Web Server
  • 25. More Session Management No Sessions Stuff state in a cookie and sign it! Cookie is sent with every request / response Super Slim Sessions Keep small amount of frequently used data in cookie Pull rest from DB (or central session store)
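The "stuff state in a cookie and sign it" idea can be sketched as follows. This is a minimal illustration, not the deck's implementation: the secret key, the function names and the cookie layout (payload, a dot, then an HMAC-SHA256 tag) are all assumptions made for the example.

```python
import base64
import hashlib
import hmac
import json

# Hypothetical server-side secret; every web server must share it.
SECRET_KEY = b"change-me"


def sign_state(state: dict) -> str:
    """Serialize session state and append an HMAC so tampering is detectable."""
    payload = base64.urlsafe_b64encode(
        json.dumps(state, sort_keys=True).encode("utf-8")
    ).decode("ascii")
    mac = hmac.new(SECRET_KEY, payload.encode("ascii"), hashlib.sha256).hexdigest()
    return payload + "." + mac


def verify_state(cookie: str):
    """Return the state dict if the signature matches, otherwise None."""
    try:
        payload, mac = cookie.rsplit(".", 1)
    except ValueError:
        return None  # no separator: not a cookie we issued
    expected = hmac.new(SECRET_KEY, payload.encode("ascii"), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(mac, expected):
        return None
    return json.loads(base64.urlsafe_b64decode(payload.encode("ascii")))
```

Because any server holding the key can verify the cookie, no server-side session lookup is needed. The state is signed but not encrypted, so it should not hold secrets.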
  • 26. Sessions - Recommendation Bad Sticky sessions Good Clustered sessions for small number of nodes and / or small write volume Central sessions for large number of nodes or large write volume Great No Sessions!
  • 27. App Tier Scaling - More HTTP Accelerators / Reverse Proxy Static content caching, redirect to lighter HTTP Async NIO on user-side, Keep-alive connection pool CDN Get closer to your user Akamai, Limelight IP Anycasting Async NIO
  • 28. Scaling a Web App App-Layer Add more nodes and load balance! Avoid Sticky Sessions Avoid Sessions!! Data Store Tricky! Very Tricky!!!
  • 29. Outline Characteristics App Tier Scaling Replication Partitioning Consistency Normalization Caching Data Engine Types
  • 30. Replication = Scaling by Duplication App Layer T1, T2, T3, T4
  • 31. Replication = Scaling by Duplication App Layer T1, T2, T3, T4 T1, T2, T3, T4 T1, T2, T3, T4 T1, T2, T3, T4 T1, T2, T3, T4 Each node has its own copy of data Shared Nothing Cluster
  • 32. Replication Read : Write = 4:1 Scale reads at cost of writes! Duplicate Data – each node has its own copy Master Slave Writes sent to one node, cascaded to others Multi-Master Writes can be sent to multiple nodes Can lead to deadlocks Requires conflict management
  • 33. Master-Slave App Layer Master Slave Slave Slave Slave n x Writes – Async vs. Sync SPOF Async - Critical Reads from Master!
  • 34. Multi-Master App Layer Master Master Slave Slave Slave n x Writes – Async vs. Sync No SPOF Conflicts!
  • 35. Replication Considerations Asynchronous Guaranteed, but out-of-band replication from Master to Slave Master updates its own db and returns a response to client Replication from Master to Slave takes place asynchronously Faster response to a client Slave data is marginally behind the Master Requires modification to App to send critical reads and writes to master, and load balance all other reads Synchronous Guaranteed, in-band replication from Master to Slave Master updates its own db, and confirms all slaves have updated their db before returning a response to client Slower response to a client Slaves have the same data as the Master at all times Requires modification to App to send writes to master and load balance all reads
  • 36. Replication Considerations Replication at RDBMS level Support may exist in RDBMS or through 3rd party tool Faster and more reliable App must send writes to Master, reads to any db and critical reads to Master Replication at Driver / DAO level Driver / DAO layer ensures writes are performed on all connected DBs Reads are load balanced Critical reads are sent to a Master In most cases RDBMS agnostic Slower and in some cases less reliable
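The routing rule described on the last two slides (writes and critical reads go to the master, other reads are load balanced) might look like this in the application layer. The class and method names are hypothetical, and the connections are represented by plain strings.

```python
import itertools


class ReplicatedRouter:
    """Pick a database node per operation for a master-slave setup."""

    def __init__(self, master, slaves):
        self.master = master
        # Round-robin over slaves; fall back to the master if there are none.
        self._slaves = itertools.cycle(list(slaves) or [master])

    def for_write(self):
        # All writes go to the master, which replicates them onward.
        return self.master

    def for_read(self, critical=False):
        # With asynchronous replication a slave may lag behind, so reads
        # that must see the latest write are sent to the master.
        if critical:
            return self.master
        return next(self._slaves)
```

With synchronous replication the `critical` flag becomes unnecessary, since slaves hold the same data as the master at all times.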
  • 37. Diminishing Returns Per Server: 4R, 1W 2R, 1W 1R, 1W Read Read Read Write Write Write Read Read Read Read Write Write Write Write
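The per-server figures on this slide follow from one observation: reads are divided among the replicas, but every replica must apply every write. A small worked calculation, assuming a read and a write cost one unit each (the function name is illustrative):

```python
def per_server_load(reads, writes, nodes):
    """Load on each replica: reads are shared, writes are repeated on every node."""
    return reads / nodes + writes


# Total workload of 4 reads and 1 write, as on the slide.
for nodes in (1, 2, 4):
    print(nodes, "server(s):", per_server_load(4, 1, nodes), "units each")
```

Going from 1 to 4 servers only reduces the per-server load from 5 units to 2, a 2.5x gain for 4x the hardware, and the write cost sets a floor that more replicas cannot lower.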
  • 38. Outline Characteristics App Tier Scaling Replication Partitioning Consistency Normalization Caching Data Engine Types
  • 39. Partitioning = Scaling by Division Vertical Partitioning Divide data on tables / columns Scale to as many boxes as there are tables or columns Finite Horizontal Partitioning Divide data on rows Scale to as many boxes as there are rows! Limitless scaling
  • 40. Vertical Partitioning App Layer T1, T2, T3, T4, T5 Note: A node here typically represents a shared nothing cluster
  • 41. Vertical Partitioning App Layer T3 T4 T5 T2 T1 Facebook - User table, posts table can be on separate nodes Joins need to be done in code (Why have them?)
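When the users table and the posts table live on separate nodes, the join moves into application code. A minimal sketch, with in-memory collections standing in for the two nodes (the data and field names are made up for illustration):

```python
# Stand-ins for two separate database nodes.
USERS_NODE = {1: {"name": "Joe"}, 2: {"name": "Sally"}}
POSTS_NODE = [
    {"id": 10, "user_id": 1, "text": "hello"},
    {"id": 11, "user_id": 2, "text": "world"},
]


def posts_with_authors(posts_node, users_node):
    """Join posts to users in code: one query per node, then merge in memory."""
    posts = list(posts_node)                      # query 1: posts node
    user_ids = {p["user_id"] for p in posts}
    users = {uid: users_node[uid] for uid in user_ids if uid in users_node}  # query 2: users node
    return [
        {**p, "author": users.get(p["user_id"], {}).get("name")}
        for p in posts
    ]
```

Fetching all needed users in one batch, rather than one lookup per post, keeps the number of round trips to the users node constant.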
  • 42. Horizontal Partitioning App Layer T3 T4 T5 T2 T1 First million rows T3 T4 T5 T2 T1 Second million rows T3 T4 T5 T2 T1 Third million rows
  • 43. Horizontal Partitioning Schemes Value Based Split on timestamp of posts Split on first alphabet of user name Hash Based Use a hash function to determine cluster Lookup Map First Come First Serve Round Robin
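The three families of schemes can be sketched as small functions that map a row key to a shard number. The function and class names are assumptions, and filling the lookup map round-robin is only one reading of the slide.

```python
import zlib


def shard_by_first_letter(user_name, num_shards):
    """Value based: split the alphabet into num_shards contiguous ranges."""
    first = user_name[:1].lower()
    if not ("a" <= first <= "z"):
        return num_shards - 1  # non-letters go to the last shard
    return (ord(first) - ord("a")) * num_shards // 26


def shard_by_hash(key, num_shards):
    """Hash based: crc32 is stable across processes, unlike Python's hash()."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


class LookupMapPartitioner:
    """Lookup map: remember each key's shard; new keys are assigned round-robin."""

    def __init__(self, num_shards):
        self.num_shards = num_shards
        self.assignments = {}
        self._next = 0

    def shard_for(self, key):
        if key not in self.assignments:
            self.assignments[key] = self._next
            self._next = (self._next + 1) % self.num_shards
        return self.assignments[key]
```

Value-based splits are easy to reason about but can be uneven, hashing spreads keys evenly but makes range scans hard, and a lookup map allows rows to be moved at the cost of maintaining the map.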
  • 44. Outline Characteristics App Tier Scaling Replication Partitioning Consistency Normalization Caching Data Engine Types
  • 46. Transactions Transactions make you feel alone No one else manipulates the data when you are Transactional serializability The behavior is as if a serial order exists Source: http://blogs.msdn.com/pathelland/
  • 47. Life in the “Now” Transactions live in the “now” inside services Time marches forward Transactions commit Advancing time Transactions see the committed transactions A service’s biz-logic lives in the “now” Source: http://blogs.msdn.com/pathelland/
  • 49. Knowledge travels at speed of light
  • 50. By the time you see a distant object it may have changed!
  • 52. Simultaneity only inside a transaction!
  • 53. Simultaneity only inside a service! Source: http://blogs.msdn.com/pathelland/
  • 55. The sun may have blown up 5 minutes ago
  • 58. Classic distributed computing: many systems look like one
  • 59. RPC, 2-phase commit, remote method calls…
  • 60. In Einstein’s world, everything is “relative” to one’s perspective
  • 61. Today: No attempt to blur the boundary Source: http://blogs.msdn.com/pathelland/
  • 62. Versions and Distributed Systems Can’t have “the same” data at many locations Unless it is a snapshot Changing distributed data needs versions Creates a snapshot… Source: http://blogs.msdn.com/pathelland/
  • 63. Subjective Consistency Given what I know here and now, make a decision Remember the versions of all the data used to make this decision Record the decision as being predicated on these versions Other copies of the object may make divergent decisions Try to sort out conflicts within the family If necessary, programmatically apologize Very rarely, whine and fuss for human help Subjective Consistency: Given the information I have at hand, make a decision and act on it! Remember the information at hand! Ambassadors Had Authority Back before radio, it could be months between communication with the king. Ambassadors would make treaties and much more... They had binding authority. The mess was sorted out later! Source: http://blogs.msdn.com/pathelland/
  • 65. Everyone sharing their knowledge leads to the same result... This is NOT magic; it is a design requirement! Idempotence, commutativity, and associativity of the operations (decisions made) are all implied by this requirement Source: http://blogs.msdn.com/pathelland/
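One way to see why idempotence, commutativity and associativity matter is a merge function that has all three properties. The grow-only counter below is not from the slides; it is a standard illustration chosen here, with each replica tracking its own count and merges taking the per-replica maximum.

```python
def merge(a, b):
    """Merge two replica states by taking the per-node maximum count."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


def value(state):
    """The counter's value is the sum of every node's contribution."""
    return sum(state.values())


# Three replicas that have each seen different updates (hypothetical data).
a = {"n1": 2, "n2": 1}
b = {"n1": 1, "n3": 4}
c = {"n2": 5}

# Whatever order the replicas share knowledge in, they end up with the same state.
print(merge(merge(a, b), c) == merge(a, merge(c, b)))
```

Because merging is idempotent, a replica can safely receive the same update twice. Because it is commutative and associative, updates can arrive in any order or grouping.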
  • 66. Outline Characteristics App Tier Scaling Replication Partitioning Consistency Normalization Caching Data Engine Types
  • 67. Why Normalize? Classic problem with de-normalization: can’t update Sam’s phone # since there are many copies
      Emp # | Emp Name | Mgr # | Mgr Name | Emp Phone | Mgr Phone
      47    | Joe      | 13    | Sam      | 5-1234    | 6-9876
      18    | Sally    | 38    | Harry    | 3-3123    | 5-6782
      91    | Pete     | 13    | Sam      | 2-1112    | 6-9876
      66    | Mary     | 02    | Betty    | 5-7349    | 4-0101
    Normalization’s Goal Is Eliminating Update Anomalies Can Be Changed Without “Funny Behavior” Each Data Item Lives in One Place De-normalization is OK if you aren’t going to update! Source: http://blogs.msdn.com/pathelland/
  • 69. Eliminate Joins 6 joins for 1 query! Do you think FB would do this? And how would you do joins with partitioned data? De-normalization removes joins But increases data volume But disk is cheap and getting cheaper And can lead to inconsistent data If you are lazy However this is not really an issue
  • 70. “Append-Only” Data Many Kinds of Computing are “Append-Only” Lots of observations are made about the world Debits, credits, Purchase-Orders, Customer-Change-Requests, etc As time moves on, more observations are added You can’t change the history but you can add new observations Derived Results May Be Calculated Estimate of the “current” inventory Frequently inaccurate Historic Rollups Are Calculated Monthly bank statements
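The append-only pattern can be sketched with a ledger that only ever grows, from which rollups are derived on demand. The entries and month labels below are invented for illustration.

```python
from collections import defaultdict


def record(ledger, month, amount):
    """Add an observation; existing entries are never modified or removed."""
    ledger.append((month, amount))


def monthly_rollup(ledger):
    """Derive per-month totals from the full history of observations."""
    totals = defaultdict(int)
    for month, amount in ledger:
        totals[month] += amount
    return dict(totals)


ledger = []
record(ledger, "2009-01", 100)   # credit
record(ledger, "2009-01", -30)   # debit
record(ledger, "2009-02", 50)
record(ledger, "2009-01", 30)    # a correction is a new entry, not an edit
print(monthly_rollup(ledger))
```

The rollup is a derived result: it can be thrown away and recomputed at any time, because the ledger itself is the record of what happened.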
  • 71. Databases and Transaction Logs Transaction Logs Are the Truth High-performance & write-only Describe ALL the changes to the data Data-Base: the Current Opinion Describes the latest value of the data as perceived by the application The Database Is a Caching of the Transaction Log! It is the subset of the latest committed values represented in the transaction log… Source: http://blogs.msdn.com/pathelland/
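The claim that the database is a cache of the transaction log can be made concrete by rebuilding the current state from the log alone. The log format here (operation, key, value tuples) is an assumption for the sketch, not a real engine's format.

```python
def replay(log):
    """Rebuild the current database state by applying every logged change in order."""
    db = {}
    for op, key, value in log:
        if op == "set":
            db[key] = value          # later entries overwrite earlier ones
        elif op == "delete":
            db.pop(key, None)
    return db


log = [
    ("set", "sam.phone", "5-1234"),
    ("set", "sam.phone", "6-9876"),  # the log keeps both values
    ("set", "joe.phone", "2-1112"),
    ("delete", "joe.phone", None),
]
print(replay(log))
```

The log retains every change, while the replayed state holds only the latest committed value per key, which is why it can always be regenerated from the log.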
  • 72. We Are Swimming in a Sea of Immutable Data Source: http://blogs.msdn.com/pathelland/
  • 73. Outline Characteristics App Tier Scaling Replication Partitioning Consistency Normalization Caching Data Engine Types
  • 74. Caching Makes scaling easier (cheaper) Core Idea Read data from persistent store into memory Store in a hash-table Read first from cache, if not, load from persistent store
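The core idea (read from the cache first, load from the persistent store on a miss) fits in a few lines. In this sketch the persistent store is any callable that loads a value by key; the class name and the miss counter are illustrative additions.

```python
class ReadThroughCache:
    """Serve reads from an in-memory hash table, loading from the store on a miss."""

    def __init__(self, load_from_store):
        self._load = load_from_store
        self._cache = {}
        self.misses = 0

    def get(self, key):
        if key in self._cache:
            return self._cache[key]
        self.misses += 1
        value = self._load(key)      # slow path: hit the persistent store
        self._cache[key] = value
        return value


store = {"user:1": "Joe"}            # stand-in for a database
cache = ReadThroughCache(store.__getitem__)
cache.get("user:1")                  # miss: loaded from the store
cache.get("user:1")                  # hit: served from memory
print(cache.misses)
```

A real cache also needs eviction and invalidation when the underlying data changes, both of which are omitted here.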
  • 75. Write thru Cache App Server Cache
  • 76. Write back Cache App Server Cache
  • 77. Sideline Cache App Server Cache
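The slides above are diagrams, so the following sketch relies on the usual definitions of the first two terms rather than on anything stated in the deck: a write-through cache updates the store on every write, while a write-back cache updates only the cache and flushes dirty entries later. Dicts stand in for the persistent store.

```python
class WriteThroughCache:
    """Every write goes to the cache and, synchronously, to the store."""

    def __init__(self, store):
        self.store = store
        self.cache = {}

    def put(self, key, value):
        self.cache[key] = value
        self.store[key] = value


class WriteBackCache:
    """Writes land in the cache only; dirty keys reach the store on flush()."""

    def __init__(self, store):
        self.store = store
        self.cache = {}
        self.dirty = set()

    def put(self, key, value):
        self.cache[key] = value
        self.dirty.add(key)

    def flush(self):
        for key in self.dirty:
            self.store[key] = self.cache[key]
        self.dirty.clear()
```

Write-back gives faster writes and can coalesce repeated updates to one key, at the risk of losing unflushed data if the cache node fails.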
  • 79. How does it work In-memory Distributed Hash Table Memcached instance manifests as a process (often on the same machine as web-server) Memcached Client maintains a hash table Which item is stored on which instance Memcached Server maintains a hash table Which item is stored in which memory location
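The two levels of hashing described here can be simulated in a few lines: the client hashes the key to choose an instance, and each instance is itself a hash table. This is a simplified sketch, not the memcached protocol. The instance names are hypothetical, and modulo hashing is used for brevity where real client libraries may use other schemes.

```python
import zlib


class ShardedCacheClient:
    """Client-side view: decide which instance holds a key, then talk only to it."""

    def __init__(self, instances):
        self.instances = instances          # name -> dict standing in for a server
        self._names = sorted(instances)

    def _pick(self, key):
        # Client-level hash: which instance stores this item.
        return self._names[zlib.crc32(key.encode("utf-8")) % len(self._names)]

    def set(self, key, value):
        # Server-level hash: the dict maps the key to its memory location.
        self.instances[self._pick(key)][key] = value

    def get(self, key):
        return self.instances[self._pick(key)].get(key)
```

The instances never talk to each other; the distribution exists only in the client's choice of instance, so all clients must use the same instance list and hash function.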
  • 80. Outline Characteristics App Tier Scaling Replication Partitioning Consistency Normalization Caching Data Engine Types
  • 81. It’s not all Relational! Amazon - S3, SimpleDB, Dynamo Google - App Engine Datastore, BigTable Microsoft – SQL Data Services, Azure Storages Facebook – Cassandra LinkedIn - Project Voldemort Ringo, Scalaris, Kai, Dynomite, MemcacheDB, ThruDB, CouchDB, HBase, Hypertable
  • 82. Tuplespaces Basic Concepts No tables - Containers-Entity No schema - each tuple has its own set of properties Amazon SimpleDB – strings only Microsoft Azure SQL Data Services Strings, blob, datetime, bool, int, double, etc. No x-container joins as of now Google App Engine Datastore Strings, blob, datetime, bool, int, double, etc.
  • 83. Key-Value Stores Google BigTable Sparse, Distributed, multi-dimensional sorted map Indexed by row key, column key, timestamp Each value is an un-interpreted array of bytes Amazon Dynamo Data partitioned and replicated using consistent hashing Decentralized replica sync protocol Consistency thru versioning Facebook Cassandra Used for Inbox search Open Source Scalaris Keys stored in lexicographical order Improved Paxos to provide ACID Memory resident, no persistence
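Consistent hashing, which the slide names as Dynamo's partitioning method, can be sketched as a sorted ring of hash points. This is a generic textbook version, not Dynamo's implementation; the class name, the use of SHA-256 and the number of virtual nodes are assumptions.

```python
import bisect
import hashlib


class HashRing:
    """Map keys to nodes by walking clockwise to the next point on a hash ring."""

    def __init__(self, nodes, vnodes=8):
        # Each node is placed at several points to even out the distribution.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(text):
        return int(hashlib.sha256(text.encode("utf-8")).hexdigest(), 16)

    def node_for(self, key):
        # First ring point at or after the key's hash, wrapping around at the end.
        idx = bisect.bisect_right(self._points, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]
```

The benefit over modulo hashing is that removing a node only reassigns the keys that node held; every other key keeps its owner.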
  • 84. In Summary Real Life Scaling requires trade offs No Silver Bullet Need to learn new things Need to un-learn Balance!
  • 86. Intelligent People. Uncommon Ideas. Licensed under Creative Commons Attribution Sharealike Noncommercial