Scaling Out With Hadoop And HBase


A very high-level introduction to scaling out with Hadoop and NoSQL, combined with some experiences from my current project. I gave this presentation at the JFall 2009 conference in the Netherlands.


An Introduction to Dealing with

Big Data

My Current Project...

IP Address Registration for
Europe, Middle East, Russia

IPv4: 2^32 (4.3×10^9) addresses
IPv6: 2^128 (3.4×10^38) addresses
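
The two address-space sizes can be checked with a few lines of Python. This is only a sanity check of the slide's arithmetic, using the standard `ipaddress` module; the variable names are my own.

```python
import ipaddress

# Count every address in the "match everything" network of each protocol.
ipv4_total = ipaddress.ip_network("0.0.0.0/0").num_addresses
ipv6_total = ipaddress.ip_network("::/0").num_addresses

print(f"IPv4: {ipv4_total:,} addresses ({ipv4_total:.1e})")
print(f"IPv6: {ipv6_total:.1e} addresses")
```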

Challenge

10 years of historical registration/routing data in flat files
200+ billion (!) historical data records (25 TB)

30 billion records per year (4 TB)
80 million per day / 1,000 per second
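
The per-day and per-second figures follow from the yearly volume. A quick check, assuming a 365-day year and 1 TB = 10^12 bytes (both assumptions are mine, not stated on the slide):

```python
RECORDS_PER_YEAR = 30e9          # from the slide
DAYS_PER_YEAR = 365              # assumption
SECONDS_PER_DAY = 24 * 60 * 60

per_day = RECORDS_PER_YEAR / DAYS_PER_YEAR
per_second = per_day / SECONDS_PER_DAY

# Average record size, assuming 1 TB = 10**12 bytes.
bytes_per_record_history = 25e12 / 200e9   # 25 TB over 200 billion records
bytes_per_record_yearly = 4e12 / 30e9      # 4 TB over 30 billion records

print(f"{per_day / 1e6:.0f} million records per day")
print(f"{per_second:.0f} records per second")
print(f"{bytes_per_record_history:.0f} to {bytes_per_record_yearly:.0f} bytes per record")
```

The exact values come out a little above 80 million per day and a little below 1,000 per second, so the slide's numbers are sensible roundings.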

Make it searchable...

Google, Yahoo, Amazon, eBay, Wikipedia, Digg, Hyves, YouTube
Facebook: 300M users
MySpace: 264M users
LinkedIn: 50M users
Twitter: 45M users
Flickr: 32M users
Marktplaats: 6.5M users, 5.5M ads

Scalability:

Handling more load / requests
Handling more data
Handling more types of data

...without anything breaking or falling over
...and without going bankrupt

Scaling Up vs Scaling Out

Scaling Out, Part 1

Processing Data
a.k.a. Data Crunching

Map/Reduce

Parallel Batch Processing of Data
Break the data into chunks
Distribute the chunks
Process the chunks in parallel
Merge the results
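
The four steps above can be sketched in plain Python. This is a toy, single-machine illustration and not Hadoop code: a thread pool stands in for the cluster nodes, and the function names and the sample addresses are made up for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_chunks(records, chunk_size):
    """Step 1: break the data into chunks."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def map_chunk(chunk):
    """Step 3 (per worker): count how often each key occurs in one chunk."""
    return Counter(chunk)


def merge(left, right):
    """Step 4: merge two partial results by adding their counts."""
    return left + right


def map_reduce(records, chunk_size=2, workers=4):
    chunks = split_into_chunks(records, chunk_size)
    # Step 2: distribute the chunks over the workers and process them in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(map_chunk, chunks))
    return reduce(merge, partial_counts, Counter())


# Hypothetical sample data using documentation addresses.
sample = ["192.0.2.1", "192.0.2.2", "192.0.2.1", "192.0.2.3", "192.0.2.1"]
print(map_reduce(sample))
```

Because each chunk is processed independently and merging is just addition, adding more workers (or machines) does not change the result, only how quickly it is produced.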

Reliable, Scalable, Distributed Computing

(written in Java)

Distributed File System (DFS)

Foundation for all Hadoop projects
Automatic file replication
Automatic checksumming / error correction
Based on Google’s File System (GFS)
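
To make "replication" and "checksumming" concrete, here is a toy sketch of the idea, not the actual Hadoop DFS implementation. Files are split into blocks, each block is stored with a CRC32 checksum on several nodes, and a reader falls back to another replica when a checksum does not match. The block size, replica count, placement rule and all names are assumptions chosen for illustration.

```python
import zlib


def write_file(data, nodes, block_size=8, replication=3):
    """Store each block and its CRC32 on `replication` nodes.

    `nodes` is a list of dicts standing in for storage machines.
    Returns, per block, the list of node indexes that hold a replica.
    """
    placement = []
    for block_no, start in enumerate(range(0, len(data), block_size)):
        block = data[start:start + block_size]
        targets = [(block_no + r) % len(nodes) for r in range(replication)]
        for target in targets:
            nodes[target][block_no] = (block, zlib.crc32(block))
        placement.append(targets)
    return placement


def read_file(placement, nodes):
    """Read every block from the first replica whose checksum still matches."""
    result = bytearray()
    for block_no, targets in enumerate(placement):
        for target in targets:
            block, checksum = nodes[target][block_no]
            if zlib.crc32(block) == checksum:
                result += block
                break
        else:
            raise IOError(f"all replicas of block {block_no} are corrupt")
    return bytes(result)


nodes = [dict() for _ in range(4)]
original = b"ten years of routing data"
placement = write_file(original, nodes)

# Simulate silent corruption of one replica of block 0.
_, stored_checksum = nodes[0][0]
nodes[0][0] = (b"XXXXXXXX", stored_checksum)

print(read_file(placement, nodes) == original)
```

The corrupted replica is detected by its checksum and skipped, so the read still returns the original bytes.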

Map / Reduce

Simple Java API
Powerful supporting framework
Powerful tools
Good support for non-Java languages

4 TB of raw TIFF image data (stored in S3)
100 Amazon EC2 instances
Hadoop Map/Reduce
11 million finished PDFs
24 hours, about $240
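
The numbers on this slide imply a price per instance-hour and a cost per document. The rate below is derived from the slide's own figures, not a quoted price, and the per-instance data share assumes an even split and 1 TB = 1,000 GB.

```python
INSTANCES = 100
HOURS = 24
TOTAL_COST_USD = 240.0
PDFS = 11_000_000
INPUT_TB = 4

instance_hours = INSTANCES * HOURS
cost_per_instance_hour = TOTAL_COST_USD / instance_hours
cost_per_thousand_pdfs = TOTAL_COST_USD / PDFS * 1000
gb_per_instance = INPUT_TB * 1000 / INSTANCES  # assumes an even split

print(f"{instance_hours} instance-hours at ${cost_per_instance_hour:.2f} each")
print(f"${cost_per_thousand_pdfs:.3f} per 1,000 PDFs")
print(f"{gb_per_instance:.0f} GB of input per instance")
```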

Scaling Out, Part 2

Storing & Retrieving Data
Reads and Writes

Relational Databases
are hard to scale out

Ways to Scale out an RDBMS (1)

Replication
Good for scaling reads
Master-Slave: single point of failure, single point of bottleneck
Master-Master: limited scaling of writes, complicated

Ways to Scale out an RDBMS (2)

Partitioning
Vertical: by function / table
Horizontal: by key / id (sharding)

Not truly relational anymore (application joins)
Limited scalability (relocating, resharding)
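
The resharding problem is easy to demonstrate. The sketch below uses the simplest horizontal partitioning scheme, hash of the key modulo the number of shards, and measures how many keys land on a different shard when a fifth shard is added. The scheme, key names and shard counts are assumptions for illustration; real setups may use other placement strategies.

```python
import hashlib


def shard_for(key, shard_count):
    """Pick a shard from a stable hash of the key (modulo placement)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def fraction_relocated(keys, old_count, new_count):
    """Share of keys whose shard changes when the shard count changes."""
    moved = sum(1 for key in keys
                if shard_for(key, old_count) != shard_for(key, new_count))
    return moved / len(keys)


keys = [f"record-{i}" for i in range(10_000)]  # hypothetical record ids
moved = fraction_relocated(keys, old_count=4, new_count=5)
print(f"{moved:.0%} of the keys move when going from 4 to 5 shards")
```

With a uniform hash, a key keeps its shard only when its hash gives the same remainder for 4 and for 5, which happens for about one key in five. The large majority of the data therefore has to be relocated just to add one machine.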

Brewer’s CAP Theorem

Consistency
Availability
Partition Tolerance

...pick any two

Relational vs Non-Relational: ACID vs BASE
ACID: Atomic, Consistent, Isolated, Durable
BASE: Basic Availability, Soft State, Eventual Consistency

NoSQL NO-SQL

Non-Relational Databases

Better Different

Types of NoSQL

(Distributed) Key-Value: Redis, Voldemort, Scalaris (D)
Document Oriented: CouchDB, MongoDB, Riak (D)
Column Oriented: Cassandra (D), HBase (D)
Graph Oriented: Neo4J

(D) = Distributed (automatic scaling out)

Those Big Numbers Again...

10 years of historical data in flat files
200+ billion (!) historical data records (25 TB)

30 billion records per year (4 TB)
80 million per day / 1,000 per second

Make it searchable...

~200,000,000,000 records → Map/Reduce → ~15,000,000,000 records

Our Data is 3D

IP Address 1 → 0..* Record
Record 1 → 0..* Timestamp

Best fit & performance:
Column Oriented

Row | Column Name (!) | Values (!)
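
One way to picture this three-level structure is a nested map: row key, then column name, then timestamped values. The sketch below is a toy in-memory model and not the HBase API. It assumes that the IP address (or prefix) is the row key, that each record becomes a column, and that the timestamps become versions of the value; the prefix and AS numbers are documentation placeholders.

```python
from collections import defaultdict


class VersionedTable:
    """Toy table: row key -> column name -> {timestamp: value}."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, timestamp, value):
        self._rows[row][column][timestamp] = value

    def get(self, row, column, at=None):
        """Newest value at or before `at`, or the newest overall if `at` is None."""
        versions = self._rows.get(row, {}).get(column, {})
        candidates = [ts for ts in versions if at is None or ts <= at]
        return versions[max(candidates)] if candidates else None

    def history(self, row, column):
        """All versions of one cell, oldest first."""
        return sorted(self._rows.get(row, {}).get(column, {}).items())


table = VersionedTable()
table.put("192.0.2.0/24", "route:origin", 20050101, "AS64496")
table.put("192.0.2.0/24", "route:origin", 20090601, "AS64497")

print(table.get("192.0.2.0/24", "route:origin"))
print(table.get("192.0.2.0/24", "route:origin", at=20070101))
```

Asking "what did we know about this address at time T" then becomes a single row lookup followed by picking the right version, which is the query shape the project needs.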

Cassandra
Used by: Facebook, Twitter, Digg

Tunable: Availability vs Consistency
Very active community
Version 0.4.1
No documentation
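
"Tunable" here refers to the quorum rule used by Dynamo-style stores: with N replicas, a read is guaranteed to overlap the latest acknowledged write when the number of write acknowledgements W plus the number of replicas read R exceeds N. The sketch below only illustrates that rule; the function names are mine and it does not use any Cassandra API.

```python
def quorum(replicas):
    """Smallest majority of `replicas`."""
    return replicas // 2 + 1


def reads_see_latest_write(replicas, write_acks, read_acks):
    """True when every read set must overlap every write set (W + R > N)."""
    return write_acks + read_acks > replicas


N = 3
for w, r in [(1, 1), (quorum(N), quorum(N)), (N, 1)]:
    print(f"N={N} W={w} R={r} consistent={reads_see_latest_write(N, w, r)}")
```

Low values of W and R favor availability and latency; raising them until W + R > N trades some of that for consistency.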

HBase
Used by: Yahoo, Adobe, Meetup, Tumblr, StumbleUpon, Streamy

Built on top of Hadoop DFS
Very active community
Version 0.20.1
Good documentation

Initial Results:
Tested on an EC2 cluster of 8 XLarge instances

3.8 B records (23 GB) → 33 M records (1 GB): 5 hours

33 M records (1 GB) → 15 GB (record duplication: 6x): 75 minutes, 44,000 inserts/second

“Needle in a haystack” full on-disk table scan: 0.5 M records/second
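
The throughput figures can be cross-checked against each other. The sketch assumes that the insert rate counts every duplicated copy (33 M records times the 6x duplication factor) over the 75-minute load, and that the 5-hour step processed all 3.8 B input records; both readings are my interpretation of the slide.

```python
# Load step: 33 M records, each stored 6 times, in 75 minutes.
records_loaded = 33e6
duplication_factor = 6
load_seconds = 75 * 60

total_inserts = records_loaded * duplication_factor
inserts_per_second = total_inserts / load_seconds

# Reduction step: 3.8 B input records in 5 hours.
input_records = 3.8e9
reduce_seconds = 5 * 60 * 60
input_records_per_second = input_records / reduce_seconds

print(f"{total_inserts / 1e6:.0f} M inserts at {inserts_per_second:,.0f} per second")
print(f"{input_records_per_second:,.0f} input records per second")
```

Under these assumptions the insert rate comes out at exactly the 44,000 per second quoted on the slide.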

In order to choose the right
scaling tools, you need to:
Understand your data
Know what you want to query and how

val shameless = <SelfPromotion>

Try some Scala in the basement !

</SelfPromotion>


Scaling Out With Hadoop And HBase

1. Scaling Out Hadoop and NoSQL Age Mooij

2. An Introduction to Dealing with Big Data

3. About me... @agemooij

4. Big Data ...and me

5. My Current Project... IP Address Registration for Europe, Middle East, Russia IPv4: 2^32 (4.3×10^9) addresses IPv6: 2^128 (3.4×10^38) addresses

6. Challenge 10 years of historical registration/routing data in flat files 200+ billion (!) historical data records (25 TB) 30 billion records per year (4 TB) 80 million per day / 1,000 per second Make it searchable...

7. Big Data ...and you

8. Google, Yahoo, Amazon, eBay, Wikipedia, Digg, Hyves, YouTube; Facebook: 300M users; MySpace: 264M users; LinkedIn: 50M users; Twitter: 45M users; Flickr: 32M users; Marktplaats: 6.5M users, 5.5M ads

9. Scalability: Handling more load / requests Handling more data Handling more types of data ...without anything breaking or falling over ...and without going bankrupt

10. Scaling Up vs Scaling Out

11. Scaling Out, Part 1 Processing Data a.k.a. Data Crunching

12. Map/Reduce Parallel Batch Processing of Data Break the data into chunks Distribute the chunks Process the chunks in parallel Merge the results

13. Reliable, Scalable, Distributed Computing (written in Java)

14. Distributed File System (DFS) Foundation for all Hadoop projects Automatic file replication Automatic checksumming / error correction Based on Google’s File System (GFS)

15. Map / Reduce Simple Java API Powerful supporting framework Powerful tools Good support for non-java languages

16.

17. 4 TB of raw TIFF image data (stored in S3) 100 Amazon EC2 instances Hadoop Map/Reduce 11 million finished PDFs 24 hours, about $240

18. Scaling Out, Part 2 Storing & Retrieving Data Reads and Writes

19. Relational Databases are hard to scale out

20. Ways to Scale out an RDBMS (1) Replication Good for scaling reads Master-Slave Single point of failure Single point of bottleneck Master-Master Limited scaling of writes Complicated

21. Ways to Scale out an RDBMS (2) Partitioning Vertical : by function / table Horizontal : by key / id (Sharding) Not truly Relational anymore (application joins) Limited Scalability (relocating, resharding)

22. Why are RDBMSs so hard to scale out

23. Brewer’s CAP Theorem Consistency Availability Partition Tolerance ...pick any two

24. Relational Non-Relational ACID vs BASE Atomic Basic Consistent Availability Isolated Soft State Durable Eventual Consistency

25. NoSQL NO-SQL Non-Relational Databases Better Different

26. Types of NOSQL (Distributed) Key-Value Redis Voldemort Document Oriented Scalaris (D) CouchDB MongoDB Riak (D) Column Oriented Cassandra (D) HBase (D) Graph Oriented Neo4J (D) = Distributed (automatic out scaling)

27. RIPE NCC Experiences so far...

28. Those Big Numbers Again... 10 years of historical data in flat files 200+ billion (!) historical data records (25 TB) 30 billion records per year (4 TB) 80 million per day / 1,000 per second Make it searchable...

29. ~ 200 000 000 000 records Map / Reduce ~ 15 000 000 000 records

30. Our Data is 3D IP Address 1 → 0..* Record, Record 1 → 0..* Timestamp Best fit & performance: Column Oriented Row | Column Name (!) | Values (!)

31. Facebook Cassandra Twitter Digg Tunable: Availability vs Consistency Very active community 0.4.1 No documentation

32. Yahoo Adobe Meetup Tumblr StumbleUpon Streamy Built on top of Hadoop DFS Very active community 0.20.1 Good Documentation

33. Initial Results: Tested on an EC2 cluster of 8 XLarge instances. 3.8 B records (23 GB) → 33 M records (1 GB): 5 hours. 33 M records (1 GB) → 15 GB (record duplication: 6x): 75 minutes, 44,000 inserts/second. “Needle in a haystack” full on-disk table scan: 0.5 M records/second

34. In order to choose the right scaling tools, you need to: Understand your data Know what you want to query and how

35. Big Data ...Be Prepared !

36. val shameless = <SelfPromotion> Try some Scala in the basement ! </SelfPromotion>
