Design, Scale & Performance of the MapR Distribution

M.C. Srivas
CTO, MapR Technologies, Inc.

6/29/2011
© MapR Technologies, Inc.
Outline of Talk
• What does MapR do?
• Motivation: why build this?
• Distributed NameNode Architecture
  • Scalability factors
  • Programming model
  • Distributed transactions in MapR
• Performance across a variety of loads


Complete Distribution
• Integrated, tested, hardened
• Super simple
• Unique advanced features
• 100% compatible with MapReduce, HBase, HDFS APIs
• No recompile required; drop in and use now

MapR Areas of Development

[Diagram: MapR development areas include Map/Reduce, HBase, the Hadoop ecosystem, storage services, and management]

JIRAs Open For Year(s)
• HDFS-347 – 7/Dec/08 - Streaming perf sub-optimal
• HDFS-273, 395 – 7/Mar/07 – DFS Scalability problems, optimize
  block-reports
• HDFS-222, 950 – Concatenate files into larger files
   • Tom White on 2/Jan/09: "Small files are a big problem for Hadoop ... 10
     million files, each using a block, would use about 3 gigabytes of memory.
     Scaling up much beyond this level is a problem with current hardware.
     Certainly a billion files is not feasible."
• HDFS Append – no 'blessed' Apache Hadoop distro has the fix
• HDFS-233 – 25/Jun/08 – Snapshot support
   • Dhruba Borthakur on 10/Feb/09 "...snapshots can be designed very elegantly
     only if there is complete separation between namespace management and
     block management."



Observations on Apache Hadoop
• Inefficient (HDFS-347)
• Scaling problems (HDFS-273)
  • NameNode bottleneck (HDFS-395)
  • Limited number of files (HDFS-222)
• Admin overhead significant
• NameNode failure loses data
  • Not trusted as permanent store
• Write-once
  • Data lost unless file closed
  • hflush/hsync: unrealistic to expect folks will re-write apps

[Chart: READ/WRITE throughput in MB/sec (0-1200), raw hardware vs. HDFS]

MapR Approach
• Some are architectural issues
• Change at that level is a big deal
  – Will not be accepted unless proven
  – Hard to prove without building it first


• Build it and prove it
  – Improve reliability significantly
  – Make it tremendously faster at the same time
  – Enable new class of apps (e.g., real-time analytics)

HDFS Architecture Review
• Files are broken into blocks
  • Distributed across data-nodes
• NameNode holds (in memory)
  • Directories, files
  • Block replica locations
• Data nodes
  • Serve blocks
  • No idea about files/dirs
  • All ops go to NN

[Diagram: files sharded into blocks; DataNodes store the blocks]

HDFS Architecture Review
DataNode (DN) reports its blocks to the NameNode (NN)
  • A large DN does 60K blocks/report (256MB x 60K = 15TB = 5 disks @ 3TB each)
  • >100K blocks per report causes extreme load
  • A 40GB NN restart takes 1-2 hours

Addressing unit is an individual block
  • Flat block-address space forces DNs to send giant block-reports
  • NN can hold about ~300M blocks max
    • Limits cluster size to tens of petabytes
    • Increasing block size negatively impacts map/reduce

[Diagram: DataNodes sending block-reports to the NameNode]

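These figures can be sanity-checked with a few lines of arithmetic. The sketch below does so; the ~150 bytes of NameNode heap per block is a commonly quoted rule of thumb and an assumption here, not a number taken from the slide.

```cpp
#include <cstdio>

int main() {
    // How much data one DataNode's 60K-block report covers (figures from the slide).
    const double block_size_mb  = 256.0;    // HDFS block size
    const double blocks_per_dn  = 60e3;     // blocks in one report
    const double dn_capacity_tb = block_size_mb * blocks_per_dn / (1024 * 1024);
    std::printf("Data per DataNode: %.1f TB (about 5 x 3TB disks)\n", dn_capacity_tb);

    // NameNode heap needed to track ~300M blocks.
    // ~150 bytes per block is an assumed rule of thumb, not from the slide.
    const double bytes_per_block = 150.0;
    const double nn_heap_gb = 300e6 * bytes_per_block / (1024.0 * 1024 * 1024);
    std::printf("NameNode heap for 300M blocks: ~%.0f GB\n", nn_heap_gb);
    return 0;
}
```

The second number lines up with the 40GB NameNode heap mentioned above, which is why ~300M blocks is roughly the ceiling.
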
How to Scale
• Central meta server does not scale
  – Make every server a meta-data server too
  – But need memory for map/reduce
    •    Must page meta-data to disk
• Reduce size of block-reports
  – while increasing number of blocks per DN
• Reduce memory footprint of location service
  – cannot add memory indefinitely
• Need fast-restart (HA)
MapR Goal: Scale to 1000X
                       HDFS                         MapR
     # files           150 million                  1 trillion
     # data            10-50 PB                     1-10 Exabytes
     # nodes           2000                         10,000+

Full random read/write semantics
      export via NFS and other protocols
      with enterprise-class reliability: instant-restart, snapshots,
       mirrors, no-single-point-of-failure, …
Run close to hardware speeds
      On extreme scale, efficiency matters extremely
      exploit emerging technology like SSD, 10GE

MapR's Distributed NameNode
Files and directories are sharded into blocks, which are placed into mini-NameNodes ("containers") on disk.

Containers are 16-32 GB segments of disk, placed on nodes.

Each container contains:
  • Directories & files
  • Data blocks

Containers are replicated on servers; there is no need to manage them directly: use MapR Volumes.
Patent pending.

MapR Volumes
Significant advantages over "cluster-wide" or "file-level" approaches.

Volumes allow management attributes to be applied in a scalable way, at a very granular level, and with flexibility:
• Replication factor
• Scheduled mirroring
• Scheduled snapshots
• Data placement control
• Usage tracking
• Administrative permissions

Example volume layout: /projects (/tahoe, /yosemite), /user (/msmith, /bjohnson).
100K volumes are OK; create as many as desired!

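One way to picture a volume is as a policy record attached to a subtree. The sketch below is purely illustrative; the field names and types are assumptions, not MapR's actual data structures or management API.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical per-volume policy carrying the management attributes
// listed above; applied to everything stored under the volume's path.
struct VolumePolicy {
    std::string path;                            // e.g. "/user/msmith"
    uint8_t     replicationFactor;               // copies kept of each container
    std::string snapshotSchedule;                // e.g. "hourly"
    std::string mirrorSchedule;                  // cadence for remote mirrors
    std::vector<std::string> placementTopology;  // racks/nodes allowed to hold the data
    uint64_t    quotaBytes;                      // usage tracking / limit
    std::vector<std::string> admins;             // users allowed to administer the volume
};
```

Because the policy rides with the volume, settings like replication or snapshots can differ per project or per user without any cluster-wide reconfiguration.
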
MapR Distributed NameNode
Containers are tracked globally
• Clients cache container and server info for extended periods

[Diagram: the client fetches container locations from the NameNode map (e.g. one container maps to {S1, S2, S4}), then contacts one of those servers directly to read data from the container]

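A minimal sketch of that client-side lookup path, assuming a simple cache keyed by container id; the names and the stubbed-out RPC are illustrative, not MapR's actual client library.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Cached locations for one container: the servers holding a replica.
struct ContainerLocations {
    std::vector<std::string> servers;   // e.g. {"S1", "S2", "S4"}
};

class ContainerLocationCache {
public:
    // Return replica locations, consulting the cluster-wide container map
    // (the "NameNode map") only on a cache miss.
    const ContainerLocations& lookup(uint64_t containerId) {
        auto it = cache_.find(containerId);
        if (it == cache_.end()) {
            ContainerLocations locs = fetchFromContainerMap(containerId);
            it = cache_.emplace(containerId, std::move(locs)).first;
        }
        return it->second;   // kept for extended periods, as the slide notes
    }

private:
    // Stub standing in for an RPC to the container-location service.
    ContainerLocations fetchFromContainerMap(uint64_t /*containerId*/) {
        return ContainerLocations{{"S1", "S2", "S4"}};
    }

    std::unordered_map<uint64_t, ContainerLocations> cache_;
};
```

Once an entry is cached, reads and writes go straight to the data servers without a per-operation metadata hop.
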
MapR's Distr NameNode Scaling
Containers represent 16 - 32GB of data
      Each can hold up to 1 Billion files and directories
      100M containers = ~ 2 Exabytes (a very large cluster)
250 bytes DRAM to cache a container
      25GB to cache all containers for 2EB cluster
         But not necessary, can page to disk
      Typical large 10PB cluster needs 2GB
Container-reports are 100x - 1000x < HDFS block-reports
      Serve 100x more data-nodes
      Increase container size to 64G to serve 4EB cluster
            Map/reduce not affected


MapR Distr NameNode HA
MapR:
1. apt-get install mapr-cldb (while the cluster is online)

Apache Hadoop*:
1. Stop cluster very carefully
2. Move fs.checkpoint.dir onto NAS (e.g. NetApp)
3. Install, configure DRBD + Heartbeat packages
   i. yum -y install drbd82 kmod-drbd82 heartbeat
   ii. chkconfig --add heartbeat (both machines)
   iii. edit /etc/drbd.conf on both machines
   iv-xxxix. make a raid-0 md, ask drbd to manage the raid md, zero it if drbd dies & try again
   xxxx. mkfs ext3 on it, mount /hadoop (both machines)
   xxxxi. install all rpms in /hadoop, but don't run them yet (chkconfig off)
   xxxxii. umount /hadoop (!!)
   xxxxiii. edit 3 files under /etc/ha.d/* to configure heartbeat
...
40. Restart cluster. If any problems, start at /var/log/ha.log for hints on what went wrong.

*As described in www.cloudera.com/blog/2009/07/hadoop-ha-configuration (author: Christophe Bisciglia, Cloudera).

Step Back & Rethink Problem
Big disruption in hardware landscape
                             Year 2000                Year 2012
        # cores per box              2                   128
        DRAM per box               4GB                 512GB
        # disks per box           250+                   12
        Disk capacity             18GB                   6TB
        Cluster size               2-10                10,000

   No spin-locks / mutexes, 10,000+ threads
   Minimal footprint – preserve resources for App
   Rapid re-replication, scale to several Exabytes

MapR's Programming Model
Written in C++ and is asynchronous
       ioMgr->read(…, callbackFunc, void *arg)
Each module runs requests from its request-queue
       One OS thread per cpu-core
       Dispatch: map container-> queue -> cpu-core
       Callback guaranteed to be invoked on same core
             No mutexes needed
       When load increases, add cpu-core + move some queues
        to it
State machines on each queue
    'thread stack' is 4K; 10,000+ threads cost ~40MB
    Context-switch is 3 instructions; 250K context switches per core per second is OK

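A minimal sketch of this dispatch pattern: one worker thread per core, each draining only its own queue, with requests routed by container id so that all callbacks for a container run on the same core and need no mutexes. Everything here is illustrative; it is not MapR's internal API, and a real server would use long-running workers fed by lock-free per-core queues rather than the simplified drain loop shown.

```cpp
#include <cstdint>
#include <deque>
#include <functional>
#include <thread>
#include <utility>
#include <vector>

// One work item: a closure plus the container it touches. The container id
// decides which queue (and therefore which core) runs it, so per-container
// state is only ever touched from one thread.
struct Request {
    uint64_t containerId;
    std::function<void()> run;
};

class PerCoreDispatcher {
public:
    explicit PerCoreDispatcher(unsigned cores) : queues_(cores) {}

    // Dispatch: map container -> queue -> core.
    void submit(Request r) {
        queues_[r.containerId % queues_.size()].push_back(std::move(r));
    }

    // Start one OS thread per core; each drains only its own queue, so no
    // locking is needed around the queued state machines.
    void drainAll() {
        std::vector<std::thread> workers;
        for (auto& q : queues_) {
            workers.emplace_back([&q] {
                while (!q.empty()) {
                    q.front().run();
                    q.pop_front();
                }
            });
        }
        for (auto& w : workers) w.join();
    }

private:
    std::vector<std::deque<Request>> queues_;
};

int main() {
    PerCoreDispatcher d(4);
    for (uint64_t c = 0; c < 8; ++c)
        d.submit({c, [] { /* one step of a container's state machine runs here */ }});
    d.drainAll();
    return 0;
}
```
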
MapR on Linux
   User-space process, avoids system crashes
   Minimal footprint
        Preserves cpu, memory & resources for app
            uses only 1/5th of system memory
            runs on 1 or 2 cores, others left for app
        Emphasis on efficiency, avoids lots of layering
          uses raw devices and direct I/O; doesn't go through the Linux VM (virtual memory) layer
   CPU/memory firewalls implemented
      runaway tasks no longer impact system processes




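For context, the sketch below shows generic Linux direct I/O against a raw device: O_DIRECT bypasses the kernel page cache, which is the kind of layering avoidance described above. The device path is a placeholder, and this is standard POSIX/Linux usage, not MapR code.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   // needed for O_DIRECT on Linux
#endif
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>

int main() {
    const size_t kBlock = 4096;   // buffer, offset, and length must be aligned
    int fd = open("/dev/sdb", O_RDONLY | O_DIRECT);   // placeholder raw device
    if (fd < 0) { perror("open"); return 1; }

    void* buf = nullptr;
    if (posix_memalign(&buf, kBlock, kBlock) != 0) { close(fd); return 1; }

    ssize_t n = pread(fd, buf, kBlock, 0);   // bypasses the page cache
    if (n < 0) perror("pread");
    else       printf("read %zd bytes directly from the device\n", n);

    free(buf);
    close(fd);
    return 0;
}
```
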
Random Writing in MapR
[Diagram: a client writing data asks the NameNode map for a 64MB block; the NameNode map creates/attaches a container, picking a master and 2 replica slaves; the client then writes each next chunk directly to the master (S2 in the figure)]

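A rough sketch of that write path from the client's side, assuming the client asks the cluster for the next 64MB block allocation and then streams the chunk to the chosen master replica. The function names, the RPC stubs, and the replica-set layout are hypothetical, not MapR's real client API.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical allocation returned by the cluster: a container with room
// for the next 64MB block, plus its replica set (first entry = master).
struct BlockAllocation {
    uint64_t containerId;
    std::vector<std::string> replicas;   // e.g. {"S2", "S3", "S5"}
};

// Stubs standing in for RPCs, so the sketch compiles on its own.
static BlockAllocation askForBlock(const std::string& /*path*/, uint64_t /*offset*/) {
    return {42, {"S2", "S3", "S5"}};
}
static void writeChunkToServer(const std::string& /*server*/, uint64_t /*containerId*/,
                               const char* /*data*/, size_t /*len*/) {}

// Stream a buffer as a series of 64MB blocks, each written to the master
// replica of whatever container the cluster handed back.
void streamWrite(const std::string& path, const char* data, size_t len) {
    const size_t kBlock = 64u * 1024 * 1024;
    for (size_t off = 0; off < len; off += kBlock) {
        BlockAllocation a = askForBlock(path, off);
        size_t chunk = (len - off < kBlock) ? (len - off) : kBlock;
        writeChunkToServer(a.replicas.front(), a.containerId, data + off, chunk);
    }
}
```
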
MapR's Distributed NameNode
   Distributed transactions to stitch containers together
   Each node uses write-ahead log
       Supports both value-logging and operational-logging
          Value log, record = { disk-offset, old, new }
          Op log, record = { op-details, undo-op, redo-op }
       Recovery in 2 seconds
       'global ids' enable participation in distributed
        transactions




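The two log-record shapes named above, written out as structs. Field types and the transaction-id field name are assumptions; the slide only gives the record contents.

```cpp
#include <cstdint>
#include <vector>

// Value logging (physical): record the bytes at a disk offset before and
// after the change, so recovery can undo or redo by copying bytes.
struct ValueLogRecord {
    uint64_t globalTxnId;            // lets the record join a distributed transaction
    uint64_t diskOffset;
    std::vector<uint8_t> oldBytes;   // undo image
    std::vector<uint8_t> newBytes;   // redo image
};

// Operational logging (logical): record the operation itself plus how to
// undo and redo it, e.g. "create file X in directory Y".
struct OpLogRecord {
    uint64_t globalTxnId;
    std::vector<uint8_t> opDetails;
    std::vector<uint8_t> undoOp;
    std::vector<uint8_t> redoOp;
};
```
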
2-Phase Commit Unsuitable
• BeginTrans .. work .. Commit   (C = coordinator, P = participant)
• On app-commit
  • C sends prepare to each P
  • P sends prepare-ack, gives up its right to abort
    • Waits for C even across crashes/reboots
  • P unlocks only when commit received

Too many message exchanges
Single failure can lock up entire cluster

[Diagram: an app's coordinator C force-logging and exchanging prepare/commit messages with many participants P]

Quorum-completion Unsuitable
• BeginTrans .. work .. Commit   (C = coordinator, P = participant)
• On app-commit
  • C broadcasts prepare
  • If a majority responds, C commits
  • If not, cluster goes into election mode
  • If no majority found, all fails

Update throughput very poor
Does not work with < N/2 nodes
Monolithic. Hierarchical? Cycles? Oh No!!

[Diagram: an app's coordinator C broadcasting to a quorum of participants P]

MapR Lockless Transactions
• BeginTrans + work + Commit
• No explicit commit
• Uses rollback
  • Confirm callback, piggy-backed
  • Undo on confirmed failure
  • Any replica can confirm

Update throughput very high
No locks held across messages
Crash resistant, cycles OK
Patent pending

[Diagram: replicated nodes NN1-NN4 participating in lockless transactions]

Small Files (Apache Hadoop, 10 nodes)

[Chart: file-create rate (files/sec) vs. number of files (millions), for an out-of-the-box and a tuned configuration]

Op: create file, write 100 bytes, close
Notes:
• NN not replicated
• NN uses 20G DRAM
• DN uses 2G DRAM

MapR Distributed NameNode
Same 10 nodes, but with 3x replication added …

[Chart: create rate (100-byte files/sec) vs. number of files (millions), annotated "Test stopped here"]

MapR's Data Integrity
   End-to-end check-sums on all data (not optional)
      Computed in client's memory, written to disk at server
      On read, validated at both client & server


   RPC packets have own independent check-sum
      Detects RPC msg corruption


   Transactional with ACID semantics
      Meta data incl. log itself is check-summed


      Allocation bitmaps written to two places (dual blocks)


   Automatic compression built-in


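A minimal sketch of the end-to-end checksum idea: the client seals each block with a checksum computed in its own memory, the server stores data plus checksum, and both ends re-verify on read. The CRC-32 routine and the struct layout are purely illustrative, not MapR's on-disk format.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Tiny bitwise CRC-32 (reflected, poly 0xEDB88320), for illustration only.
uint32_t crc32(const uint8_t* data, size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int b = 0; b < 8; ++b)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

struct ChecksummedBlock {
    std::vector<uint8_t> bytes;
    uint32_t checksum;   // computed in the client's memory, stored with the data
};

// Client side: seal the block before it ever leaves the client.
ChecksummedBlock sealForWrite(std::vector<uint8_t> bytes) {
    uint32_t c = crc32(bytes.data(), bytes.size());
    return {std::move(bytes), c};
}

// Server on receipt, and both client and server on read: re-verify.
void verify(const ChecksummedBlock& blk) {
    if (crc32(blk.bytes.data(), blk.bytes.size()) != blk.checksum)
        throw std::runtime_error("checksum mismatch: corruption in transit or on disk");
}
```
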
MapR’s Random-Write Eases Data Import
With MapR, use NFS:
1. mount /mapr (real-time, HA)

Otherwise, use Flume/Scribe:
1. Set up sinks (find unused machines??)
2. Set up intrusive agents
   i. tail("xxx"), tailDir("y")
   ii. agentBESink
3. All reliability levels lose data
   i. best-effort
   ii. one-shot
   iii. disk fail-over
   iv. end-to-end
4. Data not available now

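Because the cluster is NFS-mountable with full random read/write semantics, ordinary file APIs are enough to import data; a trivial sketch follows. The /mapr mount point comes from the slide, but the cluster name and path underneath it are placeholders.

```cpp
#include <fstream>
#include <string>

// Append a log line straight into the cluster through the NFS mount.
// No agents or sinks: any tool that can write a file can import data.
void appendLogLine(const std::string& line) {
    std::ofstream out("/mapr/mycluster/logs/app.log",
                      std::ios::app | std::ios::binary);
    out << line << '\n';
}
```

Per the summary slide, file contents are visible immediately, so downstream jobs can read the data as soon as it is written.
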
MapR's Streaming Performance
[Charts: read and write throughput in MB/sec (0-2250) for raw hardware, MapR, and Hadoop; left panel: 11 x 7200rpm SATA disks, right panel: 11 x 15Krpm SAS disks. Higher is better.]

Tests: i. 16 streams x 120GB   ii. 2000 streams x 1GB

HBase on MapR
YCSB Insert with 1 billion 1K records
10+1 node cluster: 8 core, 24GB DRAM, 11 x 1TB 7200 RPM

[Chart: insert throughput (thousands of records per second, 0-600) for MapR vs. Apache, with WAL off and WAL on. Higher is better.]

HBase on MapR
YCSB Random Read with 1 billion 1K records
10+1 node cluster: 8 core, 24GB DRAM, 11 x 1TB 7200 RPM

[Chart: read throughput (records per second, 0-25000) for MapR vs. Apache, under Zipfian and Uniform key distributions. Higher is better.]

Terasort on MapR
10+1 nodes: 8 core, 24GB DRAM, 11 x 1TB SATA 7200 rpm

[Chart: elapsed time in minutes for MapR vs. Hadoop at 1.0 TB (0-60 scale) and 3.5 TB (0-300 scale). Lower is better.]

PigMix on MapR
[Chart: PigMix run time in seconds (0-4000) for MapR vs. Hadoop. Lower is better.]

Summary
   Fully HA
       JobTracker, Snapshot, Mirrors, multi-cluster capable
   Super simple to manage
       NFS mountable
   Complete read/write semantics
       Can see file contents immediately
   MapR has signed Apache CCLA
        ZooKeeper, Mahout, YCSB, HBase fixes contributed
       Continue to contribute more and more
   Download it at www.mapr.com
