Design, Scale & Performance of the MapR Distribution

M.C. Srivas
CTO, MapR Technologies, Inc.

Outline of Talk
• What does MapR do?
• Motivation: why build this?
• Distributed NameNode Architecture
  • Scalability factors
  • Programming model
  • Distributed transactions in MapR
• Performance across a variety of loads


Complete Distribution
• Integrated, tested, hardened
• Super simple
• Unique advanced features
• 100% compatible with MapReduce, HBase, HDFS APIs
• No recompile required, drop in and use now

MapR Areas of Development
[Diagram: MapReduce, HBase, Ecosystem, Storage Services, Management]

JIRAs Open For Year(s)
• HDFS-347 – 7/Dec/08 - Streaming perf sub-optimal
• HDFS-273, 395 – 7/Mar/07 – DFS Scalability problems, optimize
  block-reports
• HDFS-222, 950 – Concatenate files into larger files
   • Tom White on 2/Jan/09: "Small files are a big problem for Hadoop ... 10
     million files, each using a block, would use about 3 gigabytes of memory.
     Scaling up much beyond this level is a problem with current hardware.
     Certainly a billion files is not feasible."
• HDFS Append – no 'blessed' Apache Hadoop distro has fix
• HDFS-233 – 25/Jun/08 – Snapshot support
   • Dhruba Borthakur on 10/Feb/09 "...snapshots can be designed very elegantly
     only if there is complete separation between namespace management and
     block management."



Observations on Apache Hadoop
• Inefficient (HDFS-347)
• Scaling problems (HDFS-273)
  • NameNode bottleneck (HDFS-395)
  • Limited number of files (HDFS-222)
• Admin overhead significant
• NameNode failure loses data
  • Not trusted as a permanent store
• Write-once
  • Data lost unless file closed
  • hflush/hsync – unrealistic to expect folks will re-write apps
[Chart: read/write throughput in MB/sec, raw hardware vs. HDFS]

MapR Approach
• Some are architectural issues
• Change at that level is a big deal
  – Will not be accepted unless proven
  – Hard to prove without building it first


• Build it and prove it
  – Improve reliability significantly
  – Make it tremendously faster at the same time
  – Enable a new class of apps (e.g., real-time analytics)

HDFS Architecture Review
• Files are broken into blocks
  • Distributed across data-nodes
• NameNode holds (in memory)
  • Directories, files
  • Block replica locations
• Data Nodes
  • Serve blocks
  • No idea about files/dirs
  • All ops go to the NN
[Diagram: files sharded into blocks; DataNodes store the blocks]

HDFS Architecture Review
DataNode (DN) reports its blocks to the NameNode (NN)
• A large DN does 60K blocks per report
  • 256 MB x 60K = 15 TB = 5 disks @ 3 TB each
• >100K blocks per report causes extreme load
• A 40 GB NN restart takes 1-2 hours

The addressing unit is an individual block
• The flat block-address space forces DNs to send giant block-reports
• The NN can hold ~300M blocks max
  • Limits cluster size to 10's of petabytes
  • Increasing block size negatively impacts map/reduce

How to Scale
• Central meta server does not scale
  – Make every server a meta-data server too
  – But need memory for map/reduce
    •    Must page meta-data to disk
• Reduce size of block-reports
  – while increasing number of blocks per DN
• Reduce memory footprint of location service
  – cannot add memory indefinitely
• Need fast-restart (HA)
MapR Goal: Scale to 1000X

              HDFS           MapR
  # files     150 million    1 trillion
  # data      10-50 PB       1-10 Exabytes
  # nodes     2,000          10,000+

Full random read/write semantics
• export via NFS and other protocols
• with enterprise-class reliability: instant restart, snapshots, mirrors, no single point of failure, …
Run close to hardware speeds
• At extreme scale, efficiency matters extremely
• exploit emerging technology like SSD and 10GbE

MapR's Distributed NameNode
Files/directories are sharded into blocks, which are placed into mini NameNodes (containers) on disks
• Containers are 16-32 GB segments of disk, placed on nodes
• Each container contains
  • Directories & files
  • Data blocks
• Replicated on servers
• No need to manage containers directly; use MapR Volumes
(Patent pending)

MapR Volumes
Significant advantages over “Cluster-wide” or “File-level” approaches.
Volumes allow management attributes to be applied in a scalable way, at a very granular level, and with flexibility:
• Replication factor
• Scheduled mirroring
• Scheduled snapshots
• Data placement control
• Usage tracking
• Administrative permissions
100K volumes are OK; create as many as desired!
[Diagram: example volume tree, e.g. /projects/tahoe, /projects/yosemite, /user/msmith, /user/bjohnson]

MapR Distributed NameNode
Containers are tracked globally
• Clients cache container & server info for extended periods
[Diagram: the client fetches container locations from the container-location map (e.g. S1,S2,S4 / S1,S3 / S1,S4,S5 / S2,S3,S5 / S2,S4,S5 / S3), then contacts a server directly to read data from the container]

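Because the container-location map is so compact, clients can cache container-to-server mappings for long periods and go straight to a replica for reads. A minimal sketch of such a client-side cache is below; the class and function names (ContainerLocationCache, fetch_locations) and the TTL-based refresh are illustrative assumptions, not MapR's actual client API.

    #include <chrono>
    #include <cstdint>
    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    // Hypothetical client-side cache of container id -> replica servers.
    // Only illustrates the idea that locations are fetched once and reused.
    struct ContainerLocations {
        std::vector<std::string> servers;                  // e.g. {"S1", "S2", "S4"}
        std::chrono::steady_clock::time_point fetched_at;  // when we learned this
    };

    class ContainerLocationCache {
    public:
        explicit ContainerLocationCache(std::chrono::seconds ttl) : ttl_(ttl) {}

        // Return the replica list for a container, asking the central
        // location service only on a miss or after the (assumed) TTL expires.
        const std::vector<std::string>& lookup(uint64_t container_id) {
            auto now = std::chrono::steady_clock::now();
            auto it = cache_.find(container_id);
            if (it == cache_.end() || now - it->second.fetched_at > ttl_) {
                ContainerLocations loc{fetch_locations(container_id), now};
                it = cache_.insert_or_assign(container_id, std::move(loc)).first;
            }
            return it->second.servers;
        }

    private:
        // Placeholder for the RPC to the container-location service.
        std::vector<std::string> fetch_locations(uint64_t /*container_id*/) {
            return {"S1", "S2", "S4"};  // stub data for illustration
        }

        std::chrono::seconds ttl_;
        std::unordered_map<uint64_t, ContainerLocations> cache_;
    };

Caching a few hundred bytes per container is what keeps the location service off the data path.
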
MapR's Distr NameNode Scaling
Containers represent 16-32 GB of data
• Each can hold up to 1 billion files and directories
• 100M containers = ~2 Exabytes (a very large cluster)
250 bytes of DRAM to cache a container
• 25 GB to cache all containers for a 2 EB cluster
  • But not necessary; can page to disk
• A typical large 10 PB cluster needs 2 GB
Container-reports are 100x-1000x smaller than HDFS block-reports
• Serve 100x more data-nodes
• Increase container size to 64 GB to serve a 4 EB cluster
  • Map/reduce not affected

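The capacity arithmetic on this slide is easy to verify. A small sketch using the slide's own figures (the ~20 GB average per 16-32 GB container is an assumption on my part):

    #include <cstdint>
    #include <cstdio>

    int main() {
        // Figures taken from the slide; the ~20 GB average is assumed.
        const double   container_gb   = 20.0;          // containers hold 16-32 GB
        const uint64_t num_containers = 100000000ULL;  // 100M containers
        const uint64_t cache_bytes    = 250;           // DRAM to track one container

        std::printf("data addressed:  ~%.1f EB\n", num_containers * container_gb / 1e9);
        std::printf("location cache:  ~%.1f GB\n",
                    static_cast<double>(num_containers * cache_bytes) / 1e9);
        // Prints roughly 2 EB of addressed data and 25 GB of cache -- the
        // slide's numbers -- versus the ~300M-block ceiling of a single NameNode.
        return 0;
    }
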
MapR Distr NameNode HA

MapR:
1. apt-get install mapr-cldb (while the cluster is online)

Apache Hadoop*:
1. Stop cluster very carefully
2. Move fs.checkpoint.dir onto NAS (e.g. NetApp)
3. Install, configure DRBD + Heartbeat packages
   i. yum -y install drbd82 kmod-drbd82 heartbeat
   ii. chkconfig --add heartbeat (both machines)
   iii. edit /etc/drbd.conf on both machines
   iv-xxxix. make a raid-0 md device, ask drbd to manage the raid md, zero it if drbd dies & try again
   xxxx. mkfs ext3 on it, mount /hadoop (both machines)
   xxxxi. install all rpms in /hadoop, but don't run them yet (chkconfig off)
   xxxxii. umount /hadoop (!!)
   xxxxiii. edit 3 files in /etc/ha.d/* to configure heartbeat
...
40. Restart cluster. If any problems, start at /var/log/ha.log for hints on what went wrong.

*As described in www.cloudera.com/blog/2009/07/hadoop-ha-configuration (author: Christophe Bisciglia, Cloudera)

Step Back & Rethink Problem
Big disruption in the hardware landscape:

                     Year 2000    Year 2012
  # cores per box    2            128
  DRAM per box       4 GB         512 GB
  # disks per box    250+         12
  Disk capacity      18 GB        6 TB
  Cluster size       2-10         10,000

• No spin-locks / mutexes; 10,000+ threads
• Minimal footprint: preserve resources for the app
• Rapid re-replication, scale to several Exabytes

MapR's Programming Model
Written in C++ and fully asynchronous
    ioMgr->read(…, callbackFunc, void *arg)
Each module runs requests from its own request queue
• One OS thread per CPU core
• Dispatch: map container -> queue -> CPU core
• Callback guaranteed to be invoked on the same core
  • No mutexes needed
• When load increases, add a CPU core and move some queues to it
State machines on each queue
• A 'thread stack' is 4K, so 10,000+ threads cost ~40 MB
• Context switch is 3 instructions; 250K context switches/core/sec is OK

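The dispatch rule described above (map a container to a queue owned by one core, one OS thread per core, callbacks pinned to that core so no mutexes are needed) can be sketched roughly as follows. This is not MapR's code: the modulo mapping is an assumption, and cross-core handoff is assumed to go through a lock-free mailbox that is omitted here.

    #include <cstdint>
    #include <deque>
    #include <functional>
    #include <utility>
    #include <vector>

    using Task = std::function<void()>;  // one small state-machine step

    // Minimal sketch of per-core run queues. Wakeups, lock-free handoff
    // between cores, and queue migration under load are all omitted.
    class CoreQueues {
    public:
        explicit CoreQueues(unsigned num_cores) : queues_(num_cores) {}

        // Dispatch: container -> queue -> cpu-core. All work for a given
        // container lands on one core, so its state needs no mutex.
        void submit(uint64_t container_id, Task t) {
            queues_[container_id % queues_.size()].push_back(std::move(t));
        }

        // Each core's single OS thread drains only its own queue, so every
        // callback is invoked on the same core that owns the state.
        void run_core(unsigned core) {
            auto& q = queues_[core];
            while (!q.empty()) {
                Task t = std::move(q.front());
                q.pop_front();
                t();
            }
        }

    private:
        std::vector<std::deque<Task>> queues_;
    };
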
MapR on Linux
• User-space process; avoids system crashes
• Minimal footprint
  • Preserves CPU, memory & resources for the app
    • uses only 1/5th of system memory
    • runs on 1 or 2 cores, others left for the app
  • Emphasis on efficiency, avoids lots of layering
    • raw devices, direct I/O, doesn't use the Linux VM
• CPU/memory firewalls implemented
  • runaway tasks no longer impact system processes

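"Raw devices, direct I/O, doesn't use the Linux VM" means bypassing the kernel page cache entirely. The sketch below shows the usual way a user-space storage process does that on Linux: open with O_DIRECT and issue aligned reads. The device path is just an example, and this is not MapR's actual I/O layer.

    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstdlib>

    int main() {
        // O_DIRECT bypasses the Linux page cache, leaving that memory for
        // the application. Buffers and offsets must be block-aligned.
        const char* dev = "/dev/sdb";  // example raw device, not a MapR requirement
        int fd = open(dev, O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        void* buf = nullptr;
        if (posix_memalign(&buf, 4096, 4096) != 0) { close(fd); return 1; }

        ssize_t n = pread(fd, buf, 4096, 0);  // read one aligned 4 KB block
        std::printf("read %zd bytes directly from %s\n", n, dev);

        free(buf);
        close(fd);
        return 0;
    }
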
Random Writing in MapR
[Diagram: a client writing data asks the container-location (NameNode) map for a 64M block; the map creates a container, picking a master and 2 replica slaves; the client then attaches and writes the next chunk to the container's master (S2)]

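A hedged reading of the diagram above, as client-side C++ pseudocode: the client asks the location service for a 64M block, the service creates a container and picks a master plus two replica slaves, and the client then streams chunks to the master. The function names, chunk size, and the assumption that replication to the slaves happens server-side are mine, not documented MapR behavior.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <string>
    #include <vector>

    // Hypothetical client-side write path mirroring the diagram.
    struct BlockGrant {
        uint64_t container_id = 0;
        std::string master;               // e.g. "S2"
        std::vector<std::string> slaves;  // e.g. {"S3", "S4"}
    };

    // Stand-in for the RPC that creates/locates a container and returns a
    // 64M block grant with a master and two replica slaves.
    BlockGrant allocate_block() {
        return BlockGrant{1, "S2", {"S3", "S4"}};  // stub data
    }

    // Stand-in for sending one chunk to the master; replication to the
    // slaves is assumed to be handled server-side.
    bool write_chunk(const BlockGrant&, const char*, std::size_t) { return true; }

    void write_stream(const char* data, std::size_t len) {
        const std::size_t chunk = 1 << 20;  // 1 MB chunks (assumed)
        BlockGrant grant = allocate_block();
        for (std::size_t off = 0; off < len; off += chunk) {
            write_chunk(grant, data + off, std::min(chunk, len - off));
        }
    }
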
MapR's Distributed NameNode
• Distributed transactions to stitch containers together
• Each node uses a write-ahead log
  • Supports both value-logging and operational-logging
    • Value log, record = { disk-offset, old, new }
    • Op log, record = { op-details, undo-op, redo-op }
  • Recovery in 2 seconds
  • 'global ids' enable participation in distributed transactions

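The two log record formats named above translate almost directly into structs. Field names and types below are illustrative, not MapR's on-disk layout.

    #include <cstdint>
    #include <vector>

    // Value-logging: physical before/after image of a region on disk.
    // record = { disk-offset, old, new }
    struct ValueLogRecord {
        uint64_t disk_offset;
        std::vector<uint8_t> old_bytes;  // contents before the change (undo)
        std::vector<uint8_t> new_bytes;  // contents after the change (redo)
    };

    // Operational logging: logical description of the operation.
    // record = { op-details, undo-op, redo-op }
    struct OpLogRecord {
        uint64_t global_txn_id;          // the 'global id' used in distributed transactions
        std::vector<uint8_t> op_details; // what was done
        std::vector<uint8_t> undo_op;    // how to reverse it
        std::vector<uint8_t> redo_op;    // how to reapply it after a crash
    };
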
2-Phase Commit Unsuitable
• BeginTrans .. work .. Commit
• On app-commit
  • Coordinator (C) sends prepare to each participant (P)
  • P sends prepare-ack, giving up its right to abort
    • Waits for C, even across crashes/reboots
  • P unlocks only when commit is received
Too many message exchanges
A single failure can lock up the entire cluster
[Diagram: one coordinator force-logging its decision and exchanging prepare/commit messages with many participants]

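The blocking problem is visible even in a bare-bones participant sketch: after acknowledging prepare it has surrendered its right to abort and must hold its locks until the coordinator's decision arrives, even across crashes. This is a generic 2PC illustration, not anything MapR-specific.

    // Generic two-phase-commit participant states (illustration only).
    enum class PState { Working, Prepared, Committed, Aborted };

    struct Participant {
        PState state = PState::Working;

        // Phase 1: coordinator sends prepare.
        bool on_prepare() {
            // After durably logging "prepared" and replying yes, the participant
            // gives up its right to abort unilaterally: it must keep its locks
            // and wait for the coordinator, even across crashes/reboots.
            state = PState::Prepared;
            return true;  // prepare-ack
        }

        // Phase 2: the coordinator's decision finally arrives.
        void on_decision(bool commit) {
            state = commit ? PState::Committed : PState::Aborted;
            // Locks can be released only now. If the coordinator is down, the
            // participant stays Prepared indefinitely, which is why a single
            // failure can lock up the entire cluster.
        }
    };
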
Quorum-completion Unsuitable
• BeginTrans .. work .. Commit
• On app-commit
  • Coordinator (C) broadcasts prepare to participants (P)
  • If a majority responds, C commits
  • If not, the cluster goes into election mode
  • If no majority is found, everything fails
Update throughput is very poor
Does not work with < N/2 nodes
Monolithic. Hierarchical? Cycles? Oh no!!
[Diagram: a coordinator broadcasting to participants and waiting for a majority]

MapR Lockless Transactions
• BeginTrans + work + Commit
• No explicit commit
• Uses rollback
  • confirm callback, piggy-backed on other messages
  • Undo on confirmed failure
  • Any replica can confirm
Update throughput very high
No locks held across messages
Crash resistant, cycles OK
Patent pending
[Diagram: transactions flowing among replicated mini NameNodes (NN1-NN4)]

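MapR's protocol itself is patent-pending and not spelled out here, but the ingredients the slide names (no explicit commit, a piggy-backed confirm callback, undo only on confirmed failure) have a recognizable general shape. The sketch below illustrates only those ingredients, not MapR's actual design.

    #include <cstdint>
    #include <functional>
    #include <unordered_map>
    #include <utility>

    // Generic 'optimistic apply + undo' shape: do the work immediately,
    // remember how to roll it back, and resolve it when a confirmation
    // piggy-backed on some later message (from any replica) arrives.
    class OptimisticUndoLog {
    public:
        using Undo = std::function<void()>;

        // Apply the update right away (no explicit commit) and keep its undo.
        void applied(uint64_t txn_id, Undo undo) {
            pending_.emplace(txn_id, std::move(undo));
        }

        // Outcome learned later; no locks were held while waiting for it.
        void on_confirm(uint64_t txn_id, bool succeeded) {
            auto it = pending_.find(txn_id);
            if (it == pending_.end()) return;  // already resolved
            if (!succeeded) it->second();      // undo only on confirmed failure
            pending_.erase(it);
        }

    private:
        std::unordered_map<uint64_t, Undo> pending_;
    };
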
Small Files (Apache Hadoop, 10 nodes)
Op: create file, write 100 bytes, close
Notes:
• NN not replicated
• NN uses 20 GB DRAM
• DN uses 2 GB DRAM
[Chart: create rate (files/sec) vs. number of files (millions), out-of-the-box vs. tuned Apache Hadoop]

MapR Distributed NameNode
Same 10 nodes, but with 3x replication added …
[Chart: create rate of 100-byte files (files/sec) vs. number of files (millions); annotation marks where the test was stopped]

MapR's Data Integrity
• End-to-end check-sums on all data (not optional)
  • Computed in the client's memory, written to disk at the server
  • On read, validated at both client & server
• RPC packets have their own independent check-sum
  • Detects RPC message corruption
• Transactional with ACID semantics
  • Metadata, including the log itself, is check-summed
  • Allocation bitmaps written to two places (dual blocks)
• Automatic compression built-in

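End-to-end here means the checksum is computed in the client's memory, travels with the data, is re-verified at the server on write, and is checked again by both sides on read. A toy sketch of that flow is below, using a simple Fletcher-style checksum; the slide does not say which algorithm MapR actually uses.

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <string>

    // Toy Fletcher-style checksum, for illustration only.
    uint64_t checksum(const uint8_t* data, std::size_t len) {
        uint64_t a = 1, b = 0;
        for (std::size_t i = 0; i < len; ++i) {
            a = (a + data[i]) % 4294967291ULL;  // large prime just below 2^32
            b = (b + a) % 4294967291ULL;
        }
        return (b << 32) | a;
    }

    int main() {
        std::string block = "block contents";
        const auto* p = reinterpret_cast<const uint8_t*>(block.data());

        // 1. Client computes the checksum in its own memory before sending.
        uint64_t client_sum = checksum(p, block.size());

        // 2. Server recomputes on receipt and stores data + checksum together;
        //    a mismatch means the data was corrupted in transit.
        uint64_t server_sum = checksum(p, block.size());
        if (server_sum != client_sum) std::puts("corruption detected on write");

        // 3. On read, the server validates before sending and the client
        //    validates again before trusting the data.
        std::printf("checksum %016llx verified end-to-end\n",
                    static_cast<unsigned long long>(client_sum));
        return 0;
    }
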
MapR’s Random-Write Eases Data Import
With MapR, use NFS:
1. mount /mapr
   • real-time, HA

Otherwise, use Flume/Scribe:
1. Set up sinks (find unused machines??)
2. Set up intrusive agents
   i. tail(“xxx”), tailDir(“y”)
   ii. agentBESink
3. All reliability levels lose data
   i. best-effort
   ii. one-shot
   iii. disk fail-over
   iv. end-to-end
4. Data not available immediately

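Because the cluster has full random read/write and is NFS-mountable, ordinary programs can import data with plain file I/O instead of staging it through agents. A small sketch, assuming the cluster is mounted under /mapr (the exact path below is hypothetical and depends on your mount):

    #include <fstream>
    #include <iostream>
    #include <string>

    int main() {
        // Hypothetical path under an NFS mount of the cluster.
        const std::string path = "/mapr/my.cluster.com/user/msmith/imported.log";

        // Plain buffered appends work; the written data is immediately
        // visible to jobs reading the same file -- no Flume/Scribe sinks,
        // no special client library.
        std::ofstream out(path, std::ios::app | std::ios::binary);
        if (!out) { std::cerr << "is /mapr mounted?\n"; return 1; }

        out << "2011-06-29T12:00:00Z event=signup user=42\n";
        return 0;
    }
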
MapR's Streaming Performance
Two configurations: 11 x 7200rpm SATA, and 11 x 15Krpm SAS
Tests: (i) 16 streams x 120 GB, (ii) 2000 streams x 1 GB
[Charts: read and write throughput in MB/sec for raw hardware, MapR, and Hadoop; higher is better]

HBase on MapR
YCSB Insert with 1 billion 1K records
10+1 node cluster: 8 cores, 24 GB DRAM, 11 x 1 TB 7200 RPM
[Chart: insert throughput (1000 records/sec), MapR vs. Apache, with WAL off and WAL on; higher is better]

HBase on MapR
YCSB Random Read with 1 billion 1K records
10+1 node cluster: 8 cores, 24 GB DRAM, 11 x 1 TB 7200 RPM
[Chart: read throughput (records/sec), MapR vs. Apache, with Zipfian and Uniform key distributions; higher is better]

Terasort on MapR
10+1 nodes: 8 cores, 24 GB DRAM, 11 x 1 TB SATA 7200 rpm
[Chart: elapsed time in minutes for 1.0 TB and 3.5 TB sorts, MapR vs. Hadoop; lower is better]

PigMix on MapR
[Chart: time in seconds per PigMix query, MapR vs. Hadoop; lower is better]

Summary
• Fully HA
  • JobTracker, Snapshots, Mirrors, multi-cluster capable
• Super simple to manage
• NFS mountable
• Complete read/write semantics
  • Can see file contents immediately
• MapR has signed the Apache CCLA
  • Zookeeper, Mahout, YCSB, HBase fixes contributed
  • Continue to contribute more and more
• Download it at www.mapr.com
