SlideShare una empresa de Scribd logo
1 de 36
Descargar para leer sin conexión
Scaling Out
Hadoop and NoSQL


    Age Mooij
An Introduction to Dealing with




Big Data
About me...




              @agemooij
Big Data
  ...and me
My Current Project...




           IP Address Registration for
           Europe, Middle East, Russia

           Ipv4:2 32   (4.3×109)addresses
           Ipv6: 2128 (3.4×1038) addresses
Challenge

10 years of historical registration/routing data in flat files
200+ billion (!) historical data records (25 TB)

                30 billion records per year (4 TB)
                80 million per day / 1,000 per second




        Make it searchable...
Big Data
  ...and you
Google             Yahoo          Amazon
                                                  eBay
            Facebookusers
                  300M           MySpace users
                                      264M         Wikipedia
LinkedInusers
                      Twitterusers
      50M

                           45M           Digg         Hyves
       Flickr users       YouTube
           32M
                                              Marktplaats 5.5M ads
                                                    6.5M users,
Scalability:

         Handling more load / requests
             Handling more data
          Handling more types of data



  ...without anything breaking or falling over
         ...and without going bankrupt
UP
          Out Out Out Out
          Out Out Out Out
          Out Out Out Out
     VS   Out Out Out Out
          Out Out Out Out
          Out Out Out Out
Scaling Out, Part 1

Processing Data
  a.k.a. Data Crunching
Map/Reduce

 Parallel Batch Processing of Data
     Break the data into chunks
       Distribute the chunks
    Process the chunks in parallel
         Merge the results
Reliable, Scalable, Distributed Computing




           (written in Java)
Distributed File System (DFS)

    Foundation for all Hadoop projects
        Automatic file replication
Automatic checksumming / error correction
   Based on Google’s File System (GFS)
Map / Reduce

Simple Java API
Powerful supporting framework
Powerful tools
Good support for non-java languages
4TB of raw image TIFF data (stored in S3)
       100 Amazon EC2 instances
          Hadoop Map/Reduce
        11 million finished PDFs
         24 hours, about $240
Scaling Out, Part 1I

Storing & Retrieving Data
       Reads and Writes
Relational Databases
are hard to scale out
Ways to Scale out an RDBMS (1)


    Replication
                       Good for scaling reads
     Master-Slave      Single point of failure
                       Single point of bottleneck
    Master-Master      Limited scaling of writes
                       Complicated
Ways to Scale out an RDBMS (2)


                           Partitioning
Vertical   : by function / table
Horizontal : by key / id (Sharding)


     Not truly Relational anymore (application joins)
      Limited Scalability (relocating, resharding)
Why are RDBMSs
so hard to
scale out
Brewer’s CAP Theorem

Consistency
Availability
Partition Tolerance   ...pick any two
Relational   Non-Relational



ACID vs      BASE
Atomic       Basic
Consistent   Availability
Isolated     Soft State
Durable      Eventual Consistency
NoSQL             NO-SQL

 Non-Relational Databases

    Better Different
Types of NOSQL
(Distributed) Key-Value
        Redis
        Voldemort             Document Oriented
        Scalaris (D)
                                            CouchDB
                                            MongoDB
                                            Riak (D)


  Column Oriented
       Cassandra (D)
       HBase (D)
                                  Graph Oriented
                                              Neo4J



                          (D) = Distributed (automatic out scaling)
RIPE NCC
Experiences so far...
Those Big Numbers Again...


10 years of historical data in flat files
200+ billion (!) historical data records (25 TB)

                  30 billion records per year (4 TB)
                  80 million per day / 1,000 per second




                       Make it searchable...
~ 200 000 000 000 records




        Map / Reduce




~ 15 000 000 000 records
Our Data is 3D

IP Address
             1     0..*
                           Record
                          Record
                                    1   0..*
                                                Timestamp
                                               Timestamp



       Best fit & performance:
                   Column Oriented


 Row             Column Name (!)               Values (!)
Facebook
Cassandra                                 Twitter
                                           Digg


  Tunable: Availability vs Consistency
  Very active community
  0.4.1
  No documentation
Yahoo Adobe
                      Meetup Tumblr
                       StumbleUpon
                          Streamy


Built on top of Hadoop DFS
Very active community
0.20.1
Good Documentation
Initial Results:
   Tested on an EC2 cluster of 8 XLarge instances


3.8 B (23 GB)                                        33 M (1 GB)
                            5 hours




33 M (1 GB)                                            15 GB
                                                 Record duplication: 6x

    75 minutes                        “Needle in a haystack” full on-disk table scan:
44000 inserts/second                             0.5 M records/second
In order to choose the right
  scaling tools, you need to:
       Understand your data
Know what you want to query and how
Big Data
   ...Be Prepared !
val shameless = <SelfPromotion>




    Try some Scala in the basement !



        </SelfPromotion>

Más contenido relacionado

La actualidad más candente

PPT on Data Science Using Python
PPT on Data Science Using PythonPPT on Data Science Using Python
PPT on Data Science Using PythonNishantKumar1179
 
Big Data: Its Characteristics And Architecture Capabilities
Big Data: Its Characteristics And Architecture CapabilitiesBig Data: Its Characteristics And Architecture Capabilities
Big Data: Its Characteristics And Architecture CapabilitiesAshraf Uddin
 
The dag representation of basic blocks
The dag representation of basic blocksThe dag representation of basic blocks
The dag representation of basic blocksShabeen Taj
 
Concurrency Control in Database Management System
Concurrency Control in Database Management SystemConcurrency Control in Database Management System
Concurrency Control in Database Management SystemJanki Shah
 
17. Recovery System in DBMS
17. Recovery System in DBMS17. Recovery System in DBMS
17. Recovery System in DBMSkoolkampus
 
Transaction processing ppt
Transaction processing pptTransaction processing ppt
Transaction processing pptJaved Khan
 
Inheritance in JAVA PPT
Inheritance  in JAVA PPTInheritance  in JAVA PPT
Inheritance in JAVA PPTPooja Jaiswal
 
12. Indexing and Hashing in DBMS
12. Indexing and Hashing in DBMS12. Indexing and Hashing in DBMS
12. Indexing and Hashing in DBMSkoolkampus
 
9. Object Relational Databases in DBMS
9. Object Relational Databases in DBMS9. Object Relational Databases in DBMS
9. Object Relational Databases in DBMSkoolkampus
 

La actualidad más candente (20)

PPT on Data Science Using Python
PPT on Data Science Using PythonPPT on Data Science Using Python
PPT on Data Science Using Python
 
Big Data: Its Characteristics And Architecture Capabilities
Big Data: Its Characteristics And Architecture CapabilitiesBig Data: Its Characteristics And Architecture Capabilities
Big Data: Its Characteristics And Architecture Capabilities
 
Sql fundamentals
Sql fundamentalsSql fundamentals
Sql fundamentals
 
Concurrency control
Concurrency controlConcurrency control
Concurrency control
 
The dag representation of basic blocks
The dag representation of basic blocksThe dag representation of basic blocks
The dag representation of basic blocks
 
Concurrency Control in Database Management System
Concurrency Control in Database Management SystemConcurrency Control in Database Management System
Concurrency Control in Database Management System
 
17. Recovery System in DBMS
17. Recovery System in DBMS17. Recovery System in DBMS
17. Recovery System in DBMS
 
DBMS Keys
DBMS KeysDBMS Keys
DBMS Keys
 
Transaction processing ppt
Transaction processing pptTransaction processing ppt
Transaction processing ppt
 
Inheritance in JAVA PPT
Inheritance  in JAVA PPTInheritance  in JAVA PPT
Inheritance in JAVA PPT
 
Acid properties
Acid propertiesAcid properties
Acid properties
 
SQL(DDL & DML)
SQL(DDL & DML)SQL(DDL & DML)
SQL(DDL & DML)
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Python exception handling
Python   exception handlingPython   exception handling
Python exception handling
 
12. Indexing and Hashing in DBMS
12. Indexing and Hashing in DBMS12. Indexing and Hashing in DBMS
12. Indexing and Hashing in DBMS
 
9. Object Relational Databases in DBMS
9. Object Relational Databases in DBMS9. Object Relational Databases in DBMS
9. Object Relational Databases in DBMS
 
Abstract data types
Abstract data typesAbstract data types
Abstract data types
 
Hashing
HashingHashing
Hashing
 
Temporal data mining
Temporal data miningTemporal data mining
Temporal data mining
 
Introduction to java
Introduction to javaIntroduction to java
Introduction to java
 

Destacado

An Introduction to Functional Programming using Haskell
An Introduction to Functional Programming using HaskellAn Introduction to Functional Programming using Haskell
An Introduction to Functional Programming using HaskellMichel Rijnders
 
Next-Generation SIEM: Delivered from the Cloud
Next-Generation SIEM: Delivered from the Cloud Next-Generation SIEM: Delivered from the Cloud
Next-Generation SIEM: Delivered from the Cloud Alert Logic
 
Modern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An OverviewModern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An OverviewGreat Wide Open
 
NewSQL overview, Feb 2015
NewSQL overview, Feb 2015NewSQL overview, Feb 2015
NewSQL overview, Feb 2015Ivan Glushkov
 
MySQL vs. NoSQL and NewSQL - survey results
MySQL vs. NoSQL and NewSQL - survey resultsMySQL vs. NoSQL and NewSQL - survey results
MySQL vs. NoSQL and NewSQL - survey resultsMatthew Aslett
 
Up to speed in domain driven design
Up to speed in domain driven designUp to speed in domain driven design
Up to speed in domain driven designRick van der Arend
 

Destacado (7)

An Introduction to Functional Programming using Haskell
An Introduction to Functional Programming using HaskellAn Introduction to Functional Programming using Haskell
An Introduction to Functional Programming using Haskell
 
Next-Generation SIEM: Delivered from the Cloud
Next-Generation SIEM: Delivered from the Cloud Next-Generation SIEM: Delivered from the Cloud
Next-Generation SIEM: Delivered from the Cloud
 
Modern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An OverviewModern Big Data Analytics Tools: An Overview
Modern Big Data Analytics Tools: An Overview
 
NewSQL overview, Feb 2015
NewSQL overview, Feb 2015NewSQL overview, Feb 2015
NewSQL overview, Feb 2015
 
Big data unit i
Big data unit iBig data unit i
Big data unit i
 
MySQL vs. NoSQL and NewSQL - survey results
MySQL vs. NoSQL and NewSQL - survey resultsMySQL vs. NoSQL and NewSQL - survey results
MySQL vs. NoSQL and NewSQL - survey results
 
Up to speed in domain driven design
Up to speed in domain driven designUp to speed in domain driven design
Up to speed in domain driven design
 

Similar a Scaling Out With Hadoop And HBase

Small, Medium and Big Data
Small, Medium and Big DataSmall, Medium and Big Data
Small, Medium and Big DataPierre De Wilde
 
Above the cloud: Big Data and BI
Above the cloud: Big Data and BIAbove the cloud: Big Data and BI
Above the cloud: Big Data and BIDenny Lee
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Andrey Vykhodtsev
 
Data Driven Innovation with Amazon Web Services
Data Driven Innovation with Amazon Web ServicesData Driven Innovation with Amazon Web Services
Data Driven Innovation with Amazon Web ServicesAmazon Web Services
 
Next Generation Data Platforms - Deon Thomas
Next Generation Data Platforms - Deon ThomasNext Generation Data Platforms - Deon Thomas
Next Generation Data Platforms - Deon ThomasThoughtworks
 
The Cassandra Distributed Database
The Cassandra Distributed DatabaseThe Cassandra Distributed Database
The Cassandra Distributed DatabaseEric Evans
 
Introduction to NoSQL
Introduction to NoSQLIntroduction to NoSQL
Introduction to NoSQLYan Cui
 
Schemaless Databases
Schemaless DatabasesSchemaless Databases
Schemaless DatabasesDan Gunter
 
(DAT203) Building Graph Databases on AWS
(DAT203) Building Graph Databases on AWS(DAT203) Building Graph Databases on AWS
(DAT203) Building Graph Databases on AWSAmazon Web Services
 
NO SQL: What, Why, How
NO SQL: What, Why, HowNO SQL: What, Why, How
NO SQL: What, Why, HowIgor Moochnick
 
BDI- The Beginning (Big data training in Coimbatore)
BDI- The Beginning (Big data training in Coimbatore)BDI- The Beginning (Big data training in Coimbatore)
BDI- The Beginning (Big data training in Coimbatore)Ashok Rangaswamy
 
Microsoft Openness Mongo DB
Microsoft Openness Mongo DBMicrosoft Openness Mongo DB
Microsoft Openness Mongo DBHeriyadi Janwar
 
Big Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsBig Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsRichard McDougall
 
Apache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data ProcessingApache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data Processingprajods
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作James Chen
 
Sf NoSQL MeetUp: Apache Hadoop and HBase
Sf NoSQL MeetUp: Apache Hadoop and HBaseSf NoSQL MeetUp: Apache Hadoop and HBase
Sf NoSQL MeetUp: Apache Hadoop and HBaseCloudera, Inc.
 
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...MongoDB
 
MySQL And Search At Craigslist
MySQL And Search At CraigslistMySQL And Search At Craigslist
MySQL And Search At CraigslistJeremy Zawodny
 

Similar a Scaling Out With Hadoop And HBase (20)

Small, Medium and Big Data
Small, Medium and Big DataSmall, Medium and Big Data
Small, Medium and Big Data
 
Above the cloud: Big Data and BI
Above the cloud: Big Data and BIAbove the cloud: Big Data and BI
Above the cloud: Big Data and BI
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
 
Data Driven Innovation with Amazon Web Services
Data Driven Innovation with Amazon Web ServicesData Driven Innovation with Amazon Web Services
Data Driven Innovation with Amazon Web Services
 
Mongodb lab
Mongodb labMongodb lab
Mongodb lab
 
Next Generation Data Platforms - Deon Thomas
Next Generation Data Platforms - Deon ThomasNext Generation Data Platforms - Deon Thomas
Next Generation Data Platforms - Deon Thomas
 
The Cassandra Distributed Database
The Cassandra Distributed DatabaseThe Cassandra Distributed Database
The Cassandra Distributed Database
 
Introduction to NoSQL
Introduction to NoSQLIntroduction to NoSQL
Introduction to NoSQL
 
Schemaless Databases
Schemaless DatabasesSchemaless Databases
Schemaless Databases
 
(DAT203) Building Graph Databases on AWS
(DAT203) Building Graph Databases on AWS(DAT203) Building Graph Databases on AWS
(DAT203) Building Graph Databases on AWS
 
Yahoo compares Storm and Spark
Yahoo compares Storm and SparkYahoo compares Storm and Spark
Yahoo compares Storm and Spark
 
NO SQL: What, Why, How
NO SQL: What, Why, HowNO SQL: What, Why, How
NO SQL: What, Why, How
 
BDI- The Beginning (Big data training in Coimbatore)
BDI- The Beginning (Big data training in Coimbatore)BDI- The Beginning (Big data training in Coimbatore)
BDI- The Beginning (Big data training in Coimbatore)
 
Microsoft Openness Mongo DB
Microsoft Openness Mongo DBMicrosoft Openness Mongo DB
Microsoft Openness Mongo DB
 
Big Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsBig Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure Considerations
 
Apache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data ProcessingApache Spark: The Next Gen toolset for Big Data Processing
Apache Spark: The Next Gen toolset for Big Data Processing
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作
 
Sf NoSQL MeetUp: Apache Hadoop and HBase
Sf NoSQL MeetUp: Apache Hadoop and HBaseSf NoSQL MeetUp: Apache Hadoop and HBase
Sf NoSQL MeetUp: Apache Hadoop and HBase
 
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
 
MySQL And Search At Craigslist
MySQL And Search At CraigslistMySQL And Search At Craigslist
MySQL And Search At Craigslist
 

Último

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 

Último (20)

EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 

Scaling Out With Hadoop And HBase

  • 1. Scaling Out Hadoop and NoSQL Age Mooij
  • 2. An Introduction to Dealing with Big Data
  • 3. About me... @agemooij
  • 4. Big Data ...and me
  • 5. My Current Project... IP Address Registration for Europe, Middle East, Russia Ipv4:2 32 (4.3×109)addresses Ipv6: 2128 (3.4×1038) addresses
  • 6. Challenge 10 years of historical registration/routing data in flat files 200+ billion (!) historical data records (25 TB) 30 billion records per year (4 TB) 80 million per day / 1,000 per second Make it searchable...
  • 7. Big Data ...and you
  • 8. Google Yahoo Amazon eBay Facebookusers 300M MySpace users 264M Wikipedia LinkedInusers Twitterusers 50M 45M Digg Hyves Flickr users YouTube 32M Marktplaats 5.5M ads 6.5M users,
  • 9. Scalability: Handling more load / requests Handling more data Handling more types of data ...without anything breaking or falling over ...and without going bankrupt
  • 10. UP Out Out Out Out Out Out Out Out Out Out Out Out VS Out Out Out Out Out Out Out Out Out Out Out Out
  • 11. Scaling Out, Part 1 Processing Data a.k.a. Data Crunching
  • 12. Map/Reduce Parallel Batch Processing of Data Break the data into chunks Distribute the chunks Process the chunks in parallel Merge the results
  • 13. Reliable, Scalable, Distributed Computing (written in Java)
  • 14. Distributed File System (DFS) Foundation for all Hadoop projects Automatic file replication Automatic checksumming / error correction Based on Google’s File System (GFS)
  • 15. Map / Reduce Simple Java API Powerful supporting framework Powerful tools Good support for non-java languages
  • 16.
  • 17. 4TB of raw image TIFF data (stored in S3) 100 Amazon EC2 instances Hadoop Map/Reduce 11 million finished PDFs 24 hours, about $240
  • 18. Scaling Out, Part 1I Storing & Retrieving Data Reads and Writes
  • 20. Ways to Scale out an RDBMS (1) Replication Good for scaling reads Master-Slave Single point of failure Single point of bottleneck Master-Master Limited scaling of writes Complicated
  • 21. Ways to Scale out an RDBMS (2) Partitioning Vertical : by function / table Horizontal : by key / id (Sharding) Not truly Relational anymore (application joins) Limited Scalability (relocating, resharding)
  • 22. Why are RDBMSs so hard to scale out
  • 24. Relational Non-Relational ACID vs BASE Atomic Basic Consistent Availability Isolated Soft State Durable Eventual Consistency
  • 25. NoSQL NO-SQL Non-Relational Databases Better Different
  • 26. Types of NOSQL (Distributed) Key-Value Redis Voldemort Document Oriented Scalaris (D) CouchDB MongoDB Riak (D) Column Oriented Cassandra (D) HBase (D) Graph Oriented Neo4J (D) = Distributed (automatic out scaling)
  • 28. Those Big Numbers Again... 10 years of historical data in flat files 200+ billion (!) historical data records (25 TB) 30 billion records per year (4 TB) 80 million per day / 1,000 per second Make it searchable...
  • 29. ~ 200 000 000 000 records Map / Reduce ~ 15 000 000 000 records
  • 30. Our Data is 3D IP Address 1 0..* Record Record 1 0..* Timestamp Timestamp Best fit & performance: Column Oriented Row Column Name (!) Values (!)
  • 31. Facebook Cassandra Twitter Digg Tunable: Availability vs Consistency Very active community 0.4.1 No documentation
  • 32. Yahoo Adobe Meetup Tumblr StumbleUpon Streamy Built on top of Hadoop DFS Very active community 0.20.1 Good Documentation
  • 33. Initial Results: Tested on an EC2 cluster of 8 XLarge instances 3.8 B (23 GB) 33 M (1 GB) 5 hours 33 M (1 GB) 15 GB Record duplication: 6x 75 minutes “Needle in a haystack” full on-disk table scan: 44000 inserts/second 0.5 M records/second
  • 34. In order to choose the right scaling tools, you need to: Understand your data Know what you want to query and how
  • 35. Big Data ...Be Prepared !
  • 36. val shameless = <SelfPromotion> Try some Scala in the basement ! </SelfPromotion>