SlideShare una empresa de Scribd logo
1 de 48
Descargar para leer sin conexión
Apache Hadoop,
                  Big Data, and You
                                   Philip Zeyliger
                               philip@cloudera.com
                               @philz42 @cloudera
                                November 18, 2009




Wednesday, November 18, 2009
Hi there!

                          Software Engineer
                          Worked at




Wednesday, November 18, 2009
I work on stuff...




Wednesday, November 18, 2009
Outline
                          Why should you care? (Intro)
                          Challenging yesteryear’s assumptions
                          The MapReduce Model
                          HDFS, Hadoop Map/Reduce
                          The Hadoop Ecosystem
                          Questions

Wednesday, November 18, 2009
Data is everywhere.

                        Data is important.

Wednesday, November 18, 2009
Wednesday, November 18, 2009
Wednesday, November 18, 2009
Wednesday, November 18, 2009
“I keep saying that the sexy job
                in the next 10 years will be
             statisticians, and I’m not kidding.”

                                                Hal Varian
                               (Google’s chief economist)




Wednesday, November 18, 2009
So, what’s Hadoop?


                               The Little Prince, Antoine de Saint-Exupéry, Irene Testot-Ferry




Wednesday, November 18, 2009
Apache Hadoop is an open-source
    system (written in Java!) to store and
                  process
                                    gobs of data
       across many commodity computers.




                               The Little Prince, Antoine de Saint-Exupéry, Irene Testot-Ferry


Wednesday, November 18, 2009
Two Big
                                 Components
                               HDFS      Map/Reduce

                Self-healing high-
                   bandwidth               Fault-tolerant
               clustered storage.     distributed computing.



Wednesday, November 18, 2009
Challenging some of
                     yesteryear’s
                    assumptions...

Wednesday, November 18, 2009
Assumption 1: Machines can be reliable...




 Image: MadMan the Mighty CC BY-NC-SA
Wednesday, November 18, 2009
Hadoop Goal:

                  Separate distributed
                 system fault-tolerance
               code from application logic.
                                  Systems
                                Programmers   Statisticians

Wednesday, November 18, 2009
Assumption 2: Machines have identities...
                                             Image:Laughing Squid CC BY-
                                             NC-SA
Wednesday, November 18, 2009
Hadoop Goal:

                 Users should interact with
                  clusters, not machines.




Wednesday, November 18, 2009
Assumption 3: A data set fits on one machine...
                                                  Image: Matthew J. Stinson CC-
                                                  BY-NC
Wednesday, November 18, 2009
Hadoop Goal:

                           System should scale
                        linearly (or better) with
                                data size.


Wednesday, November 18, 2009
The M/R
                  Programming Model




Wednesday, November 18, 2009
You specify map()
                            and reduce()
                             functions.

               The framework does
                     the rest.
Wednesday, November 18, 2009
map()
                          map: K₁,V₁→list K₂,V₂

   public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
     /**
       * Called once for each key/value pair in the input split. Most applications
       * should override this, but the default is the identity function.
       */
     protected void map(KEYIN key, VALUEIN value,
                         Context context) throws IOException,
                         InterruptedException {
        // context.write() can be called many times
        // this is default “identity mapper” implementation
        context.write((KEYOUT) key, (VALUEOUT) value);
     }
   }




Wednesday, November 18, 2009
(the shuffle)

             map output is assigned to a “reducer”

             map output is sorted by key




Wednesday, November 18, 2009
reduce()
                      K₂, iter(V₂)→list(K₃,V₃)

      public class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
        /**
          * This method is called once for each key. Most applications will define
          * their reduce class by overriding this method. The default implementation
          * is an identity function.
          */
        @SuppressWarnings("unchecked")
        protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context
                               ) throws IOException, InterruptedException {
           for(VALUEIN value: values) {
             context.write((KEYOUT) key, (VALUEOUT) value);
           }
        }
      }




Wednesday, November 18, 2009
Putting it together...
  Logical


                                  Physical Flow



 Physical




Wednesday, November 18, 2009
Some samples...
                          Build an inverted index.
                          Summarize data grouped by a key.
                          Build map tiles from geographic data.
                          OCRing many images.
                          Learning ML models. (e.g., Naive Bayes
                          for text classification)
                          Augment traditional BI/DW
                          technologies (by archiving raw data).
Wednesday, November 18, 2009
There’s more than the
                          Java API
           Streaming               Pig                Hive
            perl, python,      Higher-level       SQL interface.
            ruby, whatever.    dataflow language
                               for easy ad-hoc    Great for
            stdin/stdout/      analysis.          analysts.
            stderr
                               Developed at       Developed at
                               Yahoo!             Facebook

                                                         Friday,
                                                         @10:10
Wednesday, November 18, 2009
A typical look...
                          Commodity servers (8-core, 8-16GB
                          RAM, 4-12 TB, 2x1 gE NIC)
                          2-level network architecture
                               20-40 nodes per rack




Wednesday, November 18, 2009
The cast...
       Starring...

                           NameNode (metadata server and database)

                               SecondaryNameNode (assistant to NameNode)
                               JobTracker (scheduler)

     The Chorus…
                DataNodes                                       TaskTrackers
              (block storage)                                 (task execution)


                                                        Thanks to Zak Stone for earmuff image!
Wednesday, November 18, 2009
HDFS
                                                 3x64MB file, 3 rep
                                                 4x64MB file, 3 rep
                Namenode
                                                 Small file, 7 rep
  Datanodes




Wednesday, November 18, 2009
                               One Rack      A Different Rack
HDFS Write Path




Wednesday, November 18, 2009
HDFS Failures?
               Datanode crash?
                     Clients read another copy
                     Background rebalance
               Namenode crash?
                     uh-oh



Wednesday, November 18, 2009
M/R
                                                        Job on stars
     Tasktrackers on the same                          Different job
      machines as datanodes                            Idle




Wednesday, November 18, 2009
                               One Rack         A Different Rack
M/R




Wednesday, November 18, 2009
M/R Failures
                          Task fails
                               Try again?
                               Try again somewhere else?
                               Report failure
                          Retries possible because of idempotence


Wednesday, November 18, 2009
Hadoop in the Wild

                         Yahoo! Hadoop Clusters: > 82PB, >25k machines
                         (Eric14, HadoopWorld NYC ’09)

                         Google: 40 GB/s GFS read/write load (Jeff Dean,
                         LADIS ’09) [~3,500 TB/day]

                         Facebook: 4TB new data per day; DW: 4800 cores, 5.5
                         PB (Dhruba Borthakur, HadoopWorld)




Wednesday, November 18, 2009
The Hadoop Ecosystem
                                          ETL Tools       BI Reporting       RDBMS

                                        Pig (Data Flow)    Hive (SQL)         Sqoop
              Zookeepr (Coordination)




                                                                                      Avro (Serialization)
                                        MapReduce (Job Scheduling/Execution System)

                                        HBase (Key-Value store)


                                                             HDFS
                                                (Hadoop Distributed File System)



Wednesday, November 18, 2009
Ok, fine, what next?
                          Get Hadoop!
                               http://hadoop.apache.org/
                               Cloudera Distribution for Hadoop
                          Try it out! (Locally, or on EC2)




Wednesday, November 18, 2009
Just one slide...

                          Software: Cloudera Distribution for
                          Hadoop, Cloudera Desktop, more…
                          Training and certification…
                          Free on-line training materials
                          (including video)
                          Support & Professional Services
                          @cloudera, blog, etc.
Wednesday, November 18, 2009
Questions?

                               philip@cloudera.com




Wednesday, November 18, 2009
Backup Slides




Wednesday, November 18, 2009
Important APIs
                                        → is 1:many
               Input Format       data→K₁,V₁
                                                       Writable
                        Mapper    K₁,V₁→K₂,V₂
                                                       JobClient
  M/R Flow




                                                                    Other
                    Combiner      K₂,iter(V₂)→K₂,V₂

                   Partitioner    K₂,V₂→int            *Context

                      Reducer     K₂, iter(V₂)→K₃,V₃   Filesystem
                 Out. Format K₃,V₃→data
Wednesday, November 18, 2009
$ cat input.txt
    adams dunster kirkland dunster
    kirland dudley dunster
    adams dunster winthrop

    $ bin/hadoop jar hadoop-0.18.3-
    examples.jar grep input.txt output1
    'dunster|adams'

    $ cat output1/part-00000
    4 dunster
    2 adams



Wednesday, November 18, 2009
public int run(String[] args)
    throws Exception {                    grepJob.setReducerClass(LongSumRedu   FileOutputFormat.setOutputPath(sort
      if (args.length < 3) {              cer.class);                           Job, new Path(args[1]));
        System.out.println("Grep                                                    // sort by decreasing freq
    <inDir> <outDir> <regex>
    [<group>]");                          FileOutputFormat.setOutputPath(grep   sortJob.setOutputKeyComparatorClass
                                          Job, tempDir);                        (LongWritable.DecreasingComparator.
    ToolRunner.printGenericCommandUsage                                         class);
    (System.out);                         grepJob.setOutputFormat(SequenceFil
        return -1;                        eOutputFormat.class);                     JobClient.runJob(sortJob);
      }                                                                           } finally {
      Path tempDir = new Path("grep-      grepJob.setOutputKeyClass(Text.clas
    temp-"+Integer.toString(new           s);                                   FileSystem.get(grepJob).delete(temp
    Random().nextInt(Integer.MAX_VALUE)                                         Dir, true);
    ));                                   grepJob.setOutputValueClass(LongWri     }
      JobConf grepJob = new               table.class);                           return 0;
    JobConf(getConf(), Grep.class);                                             }
      try {                                   JobClient.runJob(grepJob);
        grepJob.setJobName("grep-
    search");                                 JobConf sortJob = new
                                          JobConf(Grep.class);
                                              sortJob.setJobName("grep-
    FileInputFormat.setInputPaths(grepJ   sort");




                                                                                the “grep”
    ob, args[0]);

                                          FileInputFormat.setInputPaths(sortJ
    grepJob.setMapperClass(RegexMapper.   ob, tempDir);
    class);




                                                                                 example
                                          sortJob.setInputFormat(SequenceFile
    grepJob.set("mapred.mapper.regex",    InputFormat.class);
    args[2]);
        if (args.length == 4)
                                          sortJob.setMapperClass(InverseMappe
    grepJob.set("mapred.mapper.regex.gr   r.class);
    oup", args[3]);
                                              // write a single file
                                              sortJob.setNumReduceTasks(1);
    grepJob.setCombinerClass(LongSumRed
    ucer.class);




Wednesday, November 18, 2009
JobConf grepJob = new JobConf(getConf(), Grep.class);
        try {
          grepJob.setJobName("grep-search");

            FileInputFormat.setInputPaths(grepJob, args[0]);




                                                                    Job
            grepJob.setMapperClass(RegexMapper.class);
            grepJob.set("mapred.mapper.regex", args[2]);
            if (args.length == 4)
              grepJob.set("mapred.mapper.regex.group", args[3]);

            grepJob.setCombinerClass(LongSumReducer.class);
            grepJob.setReducerClass(LongSumReducer.class);
                                                                   1of 2
            FileOutputFormat.setOutputPath(grepJob, tempDir);
            grepJob.setOutputFormat(SequenceFileOutputFormat.class);
            grepJob.setOutputKeyClass(Text.class);
            grepJob.setOutputValueClass(LongWritable.class);

          JobClient.runJob(grepJob);
        } ...


Wednesday, November 18, 2009
JobConf sortJob = new JobConf(Grep.class);
            sortJob.setJobName("grep-sort");

            FileInputFormat.setInputPaths(sortJob, tempDir);


                                                                 Job
            sortJob.setInputFormat(SequenceFileInputFormat.class);

            sortJob.setMapperClass(InverseMapper.class);
                               (implicit identity reducer)
            // write a single file
            sortJob.setNumReduceTasks(1);                       2 of 2
            FileOutputFormat.setOutputPath(sortJob, new Path(args[1]));
            // sort by decreasing freq
            sortJob.setOutputKeyComparatorClass(
               LongWritable.DecreasingComparator.class);

          JobClient.runJob(sortJob);
        } finally {
          FileSystem.get(grepJob).delete(tempDir, true);
        }
        return 0;
    }


Wednesday, November 18, 2009
The types there...
                           ?, Text

                    Text, Long

             Text, list(Long)

                    Text, Long

                    Long, Text

Wednesday, November 18, 2009
A Simple Join
                                          Id         Last       First
                       People

                                          1       Washington   George
                                          2        Lincoln     Abraham
                       Key Entry Log




                                       Location       Id        Time
                                       Dunster        1        11:00am
                                       Dunster        2        11:02am
                                       Kirkland       2        11:08am


                You want to track individuals throughout the day.
                  How would you do this in M/R, if you had to?
Wednesday, November 18, 2009

Más contenido relacionado

Similar a Apache Hadoop Talk at QCon

Erlang for video delivery
Erlang for video deliveryErlang for video delivery
Erlang for video deliveryHugh Watkins
 
Apache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's GroupApache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's GroupCloudera, Inc.
 
Moving Biodiversity to the Cloud @TDWG09
Moving Biodiversity to the Cloud @TDWG09Moving Biodiversity to the Cloud @TDWG09
Moving Biodiversity to the Cloud @TDWG09Javier de la Torre
 
Triage: real-world error logging for web applications
Triage: real-world error logging for web applicationsTriage: real-world error logging for web applications
Triage: real-world error logging for web applicationsLuke Cawood
 
SAM SIG: Hadoop architecture, MapReduce patterns, and best practices with Cas...
SAM SIG: Hadoop architecture, MapReduce patterns, and best practices with Cas...SAM SIG: Hadoop architecture, MapReduce patterns, and best practices with Cas...
SAM SIG: Hadoop architecture, MapReduce patterns, and best practices with Cas...cwensel
 
Hands on with Ruby & MongoDB
Hands on with Ruby & MongoDBHands on with Ruby & MongoDB
Hands on with Ruby & MongoDBWynn Netherland
 
The Open-PC - OpenSourceExpo 2009
The Open-PC - OpenSourceExpo 2009The Open-PC - OpenSourceExpo 2009
The Open-PC - OpenSourceExpo 2009Frank Karlitschek
 
OpenStack-Design-Summit-HA-Pairs-Are-Not-The-Only-Answer copy.pdf
OpenStack-Design-Summit-HA-Pairs-Are-Not-The-Only-Answer copy.pdfOpenStack-Design-Summit-HA-Pairs-Are-Not-The-Only-Answer copy.pdf
OpenStack-Design-Summit-HA-Pairs-Are-Not-The-Only-Answer copy.pdfOpenStack Foundation
 
OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"
OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"
OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"Randy Bias
 
Kim Hammar - Spotify ML Guild Meetup - Feature Stores
Kim Hammar - Spotify ML Guild Meetup - Feature StoresKim Hammar - Spotify ML Guild Meetup - Feature Stores
Kim Hammar - Spotify ML Guild Meetup - Feature StoresKim Hammar
 
HOW TO DO AI IN 2013 from Roadmap 2012
HOW TO DO AI IN 2013 from Roadmap 2012HOW TO DO AI IN 2013 from Roadmap 2012
HOW TO DO AI IN 2013 from Roadmap 2012Gigaom
 
Building Scale Free Applications with Hadoop and Cascading
Building Scale Free Applications with Hadoop and CascadingBuilding Scale Free Applications with Hadoop and Cascading
Building Scale Free Applications with Hadoop and Cascadingcwensel
 
CloudFoundry and MongoDb, a marriage made in heaven
CloudFoundry and MongoDb, a marriage made in heavenCloudFoundry and MongoDb, a marriage made in heaven
CloudFoundry and MongoDb, a marriage made in heavenPatrick Chanezon
 
Kim Hammar - Distributed Deep Learning - RISE Learning Machines Meetup
Kim Hammar - Distributed Deep Learning - RISE Learning Machines MeetupKim Hammar - Distributed Deep Learning - RISE Learning Machines Meetup
Kim Hammar - Distributed Deep Learning - RISE Learning Machines MeetupKim Hammar
 

Similar a Apache Hadoop Talk at QCon (20)

21 Polishing
21 Polishing21 Polishing
21 Polishing
 
Erlang for video delivery
Erlang for video deliveryErlang for video delivery
Erlang for video delivery
 
Apache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's GroupApache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's Group
 
Everyday - mongodb
Everyday - mongodbEveryday - mongodb
Everyday - mongodb
 
Moving Biodiversity to the Cloud @TDWG09
Moving Biodiversity to the Cloud @TDWG09Moving Biodiversity to the Cloud @TDWG09
Moving Biodiversity to the Cloud @TDWG09
 
Btree Nosql Oak
Btree Nosql OakBtree Nosql Oak
Btree Nosql Oak
 
Triage: real-world error logging for web applications
Triage: real-world error logging for web applicationsTriage: real-world error logging for web applications
Triage: real-world error logging for web applications
 
SAM SIG: Hadoop architecture, MapReduce patterns, and best practices with Cas...
SAM SIG: Hadoop architecture, MapReduce patterns, and best practices with Cas...SAM SIG: Hadoop architecture, MapReduce patterns, and best practices with Cas...
SAM SIG: Hadoop architecture, MapReduce patterns, and best practices with Cas...
 
Developing Distributed Semantic Systems
Developing Distributed Semantic SystemsDeveloping Distributed Semantic Systems
Developing Distributed Semantic Systems
 
Hands on with Ruby & MongoDB
Hands on with Ruby & MongoDBHands on with Ruby & MongoDB
Hands on with Ruby & MongoDB
 
The Open-PC - OpenSourceExpo 2009
The Open-PC - OpenSourceExpo 2009The Open-PC - OpenSourceExpo 2009
The Open-PC - OpenSourceExpo 2009
 
OpenStack-Design-Summit-HA-Pairs-Are-Not-The-Only-Answer copy.pdf
OpenStack-Design-Summit-HA-Pairs-Are-Not-The-Only-Answer copy.pdfOpenStack-Design-Summit-HA-Pairs-Are-Not-The-Only-Answer copy.pdf
OpenStack-Design-Summit-HA-Pairs-Are-Not-The-Only-Answer copy.pdf
 
OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"
OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"
OpenStack Summit :: Redundancy Doesn't Always Mean "HA" or "Cluster"
 
Kim Hammar - Spotify ML Guild Meetup - Feature Stores
Kim Hammar - Spotify ML Guild Meetup - Feature StoresKim Hammar - Spotify ML Guild Meetup - Feature Stores
Kim Hammar - Spotify ML Guild Meetup - Feature Stores
 
Rubypalooza 2009
Rubypalooza 2009Rubypalooza 2009
Rubypalooza 2009
 
HOW TO DO AI IN 2013 from Roadmap 2012
HOW TO DO AI IN 2013 from Roadmap 2012HOW TO DO AI IN 2013 from Roadmap 2012
HOW TO DO AI IN 2013 from Roadmap 2012
 
Building Scale Free Applications with Hadoop and Cascading
Building Scale Free Applications with Hadoop and CascadingBuilding Scale Free Applications with Hadoop and Cascading
Building Scale Free Applications with Hadoop and Cascading
 
All The Little Pieces
All The Little PiecesAll The Little Pieces
All The Little Pieces
 
CloudFoundry and MongoDb, a marriage made in heaven
CloudFoundry and MongoDb, a marriage made in heavenCloudFoundry and MongoDb, a marriage made in heaven
CloudFoundry and MongoDb, a marriage made in heaven
 
Kim Hammar - Distributed Deep Learning - RISE Learning Machines Meetup
Kim Hammar - Distributed Deep Learning - RISE Learning Machines MeetupKim Hammar - Distributed Deep Learning - RISE Learning Machines Meetup
Kim Hammar - Distributed Deep Learning - RISE Learning Machines Meetup
 

Más de Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 

Más de Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Último

Russian Call Girls In Rajiv Chowk Gurgaon ❤️8448577510 ⊹Best Escorts Service ...
Russian Call Girls In Rajiv Chowk Gurgaon ❤️8448577510 ⊹Best Escorts Service ...Russian Call Girls In Rajiv Chowk Gurgaon ❤️8448577510 ⊹Best Escorts Service ...
Russian Call Girls In Rajiv Chowk Gurgaon ❤️8448577510 ⊹Best Escorts Service ...lizamodels9
 
Al Mizhar Dubai Escorts +971561403006 Escorts Service In Al Mizhar
Al Mizhar Dubai Escorts +971561403006 Escorts Service In Al MizharAl Mizhar Dubai Escorts +971561403006 Escorts Service In Al Mizhar
Al Mizhar Dubai Escorts +971561403006 Escorts Service In Al Mizharallensay1
 
The Abortion pills for sale in Qatar@Doha [+27737758557] []Deira Dubai Kuwait
The Abortion pills for sale in Qatar@Doha [+27737758557] []Deira Dubai KuwaitThe Abortion pills for sale in Qatar@Doha [+27737758557] []Deira Dubai Kuwait
The Abortion pills for sale in Qatar@Doha [+27737758557] []Deira Dubai Kuwaitdaisycvs
 
Call Girls Kengeri Satellite Town Just Call 👗 7737669865 👗 Top Class Call Gir...
Call Girls Kengeri Satellite Town Just Call 👗 7737669865 👗 Top Class Call Gir...Call Girls Kengeri Satellite Town Just Call 👗 7737669865 👗 Top Class Call Gir...
Call Girls Kengeri Satellite Town Just Call 👗 7737669865 👗 Top Class Call Gir...amitlee9823
 
Whitefield CALL GIRL IN 98274*61493 ❤CALL GIRLS IN ESCORT SERVICE❤CALL GIRL
Whitefield CALL GIRL IN 98274*61493 ❤CALL GIRLS IN ESCORT SERVICE❤CALL GIRLWhitefield CALL GIRL IN 98274*61493 ❤CALL GIRLS IN ESCORT SERVICE❤CALL GIRL
Whitefield CALL GIRL IN 98274*61493 ❤CALL GIRLS IN ESCORT SERVICE❤CALL GIRLkapoorjyoti4444
 
Dr. Admir Softic_ presentation_Green Club_ENG.pdf
Dr. Admir Softic_ presentation_Green Club_ENG.pdfDr. Admir Softic_ presentation_Green Club_ENG.pdf
Dr. Admir Softic_ presentation_Green Club_ENG.pdfAdmir Softic
 
Famous Olympic Siblings from the 21st Century
Famous Olympic Siblings from the 21st CenturyFamous Olympic Siblings from the 21st Century
Famous Olympic Siblings from the 21st Centuryrwgiffor
 
Call Girls Service In Old Town Dubai ((0551707352)) Old Town Dubai Call Girl ...
Call Girls Service In Old Town Dubai ((0551707352)) Old Town Dubai Call Girl ...Call Girls Service In Old Town Dubai ((0551707352)) Old Town Dubai Call Girl ...
Call Girls Service In Old Town Dubai ((0551707352)) Old Town Dubai Call Girl ...allensay1
 
👉Chandigarh Call Girls 👉9878799926👉Just Call👉Chandigarh Call Girl In Chandiga...
👉Chandigarh Call Girls 👉9878799926👉Just Call👉Chandigarh Call Girl In Chandiga...👉Chandigarh Call Girls 👉9878799926👉Just Call👉Chandigarh Call Girl In Chandiga...
👉Chandigarh Call Girls 👉9878799926👉Just Call👉Chandigarh Call Girl In Chandiga...rajveerescorts2022
 
Call Girls Electronic City Just Call 👗 7737669865 👗 Top Class Call Girl Servi...
Call Girls Electronic City Just Call 👗 7737669865 👗 Top Class Call Girl Servi...Call Girls Electronic City Just Call 👗 7737669865 👗 Top Class Call Girl Servi...
Call Girls Electronic City Just Call 👗 7737669865 👗 Top Class Call Girl Servi...amitlee9823
 
Call Girls From Pari Chowk Greater Noida ❤️8448577510 ⊹Best Escorts Service I...
Call Girls From Pari Chowk Greater Noida ❤️8448577510 ⊹Best Escorts Service I...Call Girls From Pari Chowk Greater Noida ❤️8448577510 ⊹Best Escorts Service I...
Call Girls From Pari Chowk Greater Noida ❤️8448577510 ⊹Best Escorts Service I...lizamodels9
 
PHX May 2024 Corporate Presentation Final
PHX May 2024 Corporate Presentation FinalPHX May 2024 Corporate Presentation Final
PHX May 2024 Corporate Presentation FinalPanhandleOilandGas
 
Falcon Invoice Discounting: Unlock Your Business Potential
Falcon Invoice Discounting: Unlock Your Business PotentialFalcon Invoice Discounting: Unlock Your Business Potential
Falcon Invoice Discounting: Unlock Your Business PotentialFalcon investment
 
Organizational Transformation Lead with Culture
Organizational Transformation Lead with CultureOrganizational Transformation Lead with Culture
Organizational Transformation Lead with CultureSeta Wicaksana
 
Call Girls Zirakpur👧 Book Now📱7837612180 📞👉Call Girl Service In Zirakpur No A...
Call Girls Zirakpur👧 Book Now📱7837612180 📞👉Call Girl Service In Zirakpur No A...Call Girls Zirakpur👧 Book Now📱7837612180 📞👉Call Girl Service In Zirakpur No A...
Call Girls Zirakpur👧 Book Now📱7837612180 📞👉Call Girl Service In Zirakpur No A...Sheetaleventcompany
 
Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...
Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...
Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...daisycvs
 
Marel Q1 2024 Investor Presentation from May 8, 2024
Marel Q1 2024 Investor Presentation from May 8, 2024Marel Q1 2024 Investor Presentation from May 8, 2024
Marel Q1 2024 Investor Presentation from May 8, 2024Marel
 
How to Get Started in Social Media for Art League City
How to Get Started in Social Media for Art League CityHow to Get Started in Social Media for Art League City
How to Get Started in Social Media for Art League CityEric T. Tung
 
Chandigarh Escorts Service 📞8868886958📞 Just📲 Call Nihal Chandigarh Call Girl...
Chandigarh Escorts Service 📞8868886958📞 Just📲 Call Nihal Chandigarh Call Girl...Chandigarh Escorts Service 📞8868886958📞 Just📲 Call Nihal Chandigarh Call Girl...
Chandigarh Escorts Service 📞8868886958📞 Just📲 Call Nihal Chandigarh Call Girl...Sheetaleventcompany
 

Último (20)

Russian Call Girls In Rajiv Chowk Gurgaon ❤️8448577510 ⊹Best Escorts Service ...
Russian Call Girls In Rajiv Chowk Gurgaon ❤️8448577510 ⊹Best Escorts Service ...Russian Call Girls In Rajiv Chowk Gurgaon ❤️8448577510 ⊹Best Escorts Service ...
Russian Call Girls In Rajiv Chowk Gurgaon ❤️8448577510 ⊹Best Escorts Service ...
 
Al Mizhar Dubai Escorts +971561403006 Escorts Service In Al Mizhar
Al Mizhar Dubai Escorts +971561403006 Escorts Service In Al MizharAl Mizhar Dubai Escorts +971561403006 Escorts Service In Al Mizhar
Al Mizhar Dubai Escorts +971561403006 Escorts Service In Al Mizhar
 
The Abortion pills for sale in Qatar@Doha [+27737758557] []Deira Dubai Kuwait
The Abortion pills for sale in Qatar@Doha [+27737758557] []Deira Dubai KuwaitThe Abortion pills for sale in Qatar@Doha [+27737758557] []Deira Dubai Kuwait
The Abortion pills for sale in Qatar@Doha [+27737758557] []Deira Dubai Kuwait
 
Call Girls Kengeri Satellite Town Just Call 👗 7737669865 👗 Top Class Call Gir...
Call Girls Kengeri Satellite Town Just Call 👗 7737669865 👗 Top Class Call Gir...Call Girls Kengeri Satellite Town Just Call 👗 7737669865 👗 Top Class Call Gir...
Call Girls Kengeri Satellite Town Just Call 👗 7737669865 👗 Top Class Call Gir...
 
Whitefield CALL GIRL IN 98274*61493 ❤CALL GIRLS IN ESCORT SERVICE❤CALL GIRL
Whitefield CALL GIRL IN 98274*61493 ❤CALL GIRLS IN ESCORT SERVICE❤CALL GIRLWhitefield CALL GIRL IN 98274*61493 ❤CALL GIRLS IN ESCORT SERVICE❤CALL GIRL
Whitefield CALL GIRL IN 98274*61493 ❤CALL GIRLS IN ESCORT SERVICE❤CALL GIRL
 
Dr. Admir Softic_ presentation_Green Club_ENG.pdf
Dr. Admir Softic_ presentation_Green Club_ENG.pdfDr. Admir Softic_ presentation_Green Club_ENG.pdf
Dr. Admir Softic_ presentation_Green Club_ENG.pdf
 
Famous Olympic Siblings from the 21st Century
Famous Olympic Siblings from the 21st CenturyFamous Olympic Siblings from the 21st Century
Famous Olympic Siblings from the 21st Century
 
unwanted pregnancy Kit [+918133066128] Abortion Pills IN Dubai UAE Abudhabi
unwanted pregnancy Kit [+918133066128] Abortion Pills IN Dubai UAE Abudhabiunwanted pregnancy Kit [+918133066128] Abortion Pills IN Dubai UAE Abudhabi
unwanted pregnancy Kit [+918133066128] Abortion Pills IN Dubai UAE Abudhabi
 
Call Girls Service In Old Town Dubai ((0551707352)) Old Town Dubai Call Girl ...
Call Girls Service In Old Town Dubai ((0551707352)) Old Town Dubai Call Girl ...Call Girls Service In Old Town Dubai ((0551707352)) Old Town Dubai Call Girl ...
Call Girls Service In Old Town Dubai ((0551707352)) Old Town Dubai Call Girl ...
 
👉Chandigarh Call Girls 👉9878799926👉Just Call👉Chandigarh Call Girl In Chandiga...
👉Chandigarh Call Girls 👉9878799926👉Just Call👉Chandigarh Call Girl In Chandiga...👉Chandigarh Call Girls 👉9878799926👉Just Call👉Chandigarh Call Girl In Chandiga...
👉Chandigarh Call Girls 👉9878799926👉Just Call👉Chandigarh Call Girl In Chandiga...
 
Call Girls Electronic City Just Call 👗 7737669865 👗 Top Class Call Girl Servi...
Call Girls Electronic City Just Call 👗 7737669865 👗 Top Class Call Girl Servi...Call Girls Electronic City Just Call 👗 7737669865 👗 Top Class Call Girl Servi...
Call Girls Electronic City Just Call 👗 7737669865 👗 Top Class Call Girl Servi...
 
Call Girls From Pari Chowk Greater Noida ❤️8448577510 ⊹Best Escorts Service I...
Call Girls From Pari Chowk Greater Noida ❤️8448577510 ⊹Best Escorts Service I...Call Girls From Pari Chowk Greater Noida ❤️8448577510 ⊹Best Escorts Service I...
Call Girls From Pari Chowk Greater Noida ❤️8448577510 ⊹Best Escorts Service I...
 
PHX May 2024 Corporate Presentation Final
PHX May 2024 Corporate Presentation FinalPHX May 2024 Corporate Presentation Final
PHX May 2024 Corporate Presentation Final
 
Falcon Invoice Discounting: Unlock Your Business Potential
Falcon Invoice Discounting: Unlock Your Business PotentialFalcon Invoice Discounting: Unlock Your Business Potential
Falcon Invoice Discounting: Unlock Your Business Potential
 
Organizational Transformation Lead with Culture
Organizational Transformation Lead with CultureOrganizational Transformation Lead with Culture
Organizational Transformation Lead with Culture
 
Call Girls Zirakpur👧 Book Now📱7837612180 📞👉Call Girl Service In Zirakpur No A...
Call Girls Zirakpur👧 Book Now📱7837612180 📞👉Call Girl Service In Zirakpur No A...Call Girls Zirakpur👧 Book Now📱7837612180 📞👉Call Girl Service In Zirakpur No A...
Call Girls Zirakpur👧 Book Now📱7837612180 📞👉Call Girl Service In Zirakpur No A...
 
Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...
Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...
Quick Doctor In Kuwait +2773`7758`557 Kuwait Doha Qatar Dubai Abu Dhabi Sharj...
 
Marel Q1 2024 Investor Presentation from May 8, 2024
Marel Q1 2024 Investor Presentation from May 8, 2024Marel Q1 2024 Investor Presentation from May 8, 2024
Marel Q1 2024 Investor Presentation from May 8, 2024
 
How to Get Started in Social Media for Art League City
How to Get Started in Social Media for Art League CityHow to Get Started in Social Media for Art League City
How to Get Started in Social Media for Art League City
 
Chandigarh Escorts Service 📞8868886958📞 Just📲 Call Nihal Chandigarh Call Girl...
Chandigarh Escorts Service 📞8868886958📞 Just📲 Call Nihal Chandigarh Call Girl...Chandigarh Escorts Service 📞8868886958📞 Just📲 Call Nihal Chandigarh Call Girl...
Chandigarh Escorts Service 📞8868886958📞 Just📲 Call Nihal Chandigarh Call Girl...
 

Apache Hadoop Talk at QCon

  • 1. Apache Hadoop, Big Data, and You Philip Zeyliger philip@cloudera.com @philz42 @cloudera November 18, 2009 Wednesday, November 18, 2009
  • 2. Hi there! Software Engineer Worked at Wednesday, November 18, 2009
  • 3. I work on stuff... Wednesday, November 18, 2009
  • 4. Outline Why should you care? (Intro) Challenging yesteryear’s assumptions The MapReduce Model HDFS, Hadoop Map/Reduce The Hadoop Ecosystem Questions Wednesday, November 18, 2009
  • 5. Data is everywhere. Data is important. Wednesday, November 18, 2009
  • 9. “I keep saying that the sexy job in the next 10 years will be statisticians, and I’m not kidding.” Hal Varian (Google’s chief economist) Wednesday, November 18, 2009
  • 10. So, what’s Hadoop? The Little Prince, Antoine de Saint-Exupéry, Irene Testot-Ferry Wednesday, November 18, 2009
  • 11. Apache Hadoop is an open-source system (written in Java!) to store and process gobs of data across many commodity computers. The Little Prince, Antoine de Saint-Exupéry, Irene Testot-Ferry Wednesday, November 18, 2009
  • 12. Two Big Components HDFS Map/Reduce Self-healing high- bandwidth Fault-tolerant clustered storage. distributed computing. Wednesday, November 18, 2009
  • 13. Challenging some of yesteryear’s assumptions... Wednesday, November 18, 2009
  • 14. Assumption 1: Machines can be reliable... Image: MadMan the Mighty CC BY-NC-SA Wednesday, November 18, 2009
  • 15. Hadoop Goal: Separate distributed system fault-tolerance code from application logic. Systems Programmers Statisticians Wednesday, November 18, 2009
  • 16. Assumption 2: Machines have identities... Image:Laughing Squid CC BY- NC-SA Wednesday, November 18, 2009
  • 17. Hadoop Goal: Users should interact with clusters, not machines. Wednesday, November 18, 2009
  • 18. Assumption 3: A data set fits on one machine... Image: Matthew J. Stinson CC- BY-NC Wednesday, November 18, 2009
  • 19. Hadoop Goal: System should scale linearly (or better) with data size. Wednesday, November 18, 2009
  • 20. The M/R Programming Model Wednesday, November 18, 2009
  • 21. You specify map() and reduce() functions. The framework does the rest. Wednesday, November 18, 2009
  • 22. map() map: K₁,V₁→list K₂,V₂ public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> { /** * Called once for each key/value pair in the input split. Most applications * should override this, but the default is the identity function. */ protected void map(KEYIN key, VALUEIN value, Context context) throws IOException, InterruptedException { // context.write() can be called many times // this is default “identity mapper” implementation context.write((KEYOUT) key, (VALUEOUT) value); } } Wednesday, November 18, 2009
  • 23. (the shuffle) map output is assigned to a “reducer” map output is sorted by key Wednesday, November 18, 2009
  • 24. reduce() K₂, iter(V₂)→list(K₃,V₃) public class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT> { /** * This method is called once for each key. Most applications will define * their reduce class by overriding this method. The default implementation * is an identity function. */ @SuppressWarnings("unchecked") protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context ) throws IOException, InterruptedException { for(VALUEIN value: values) { context.write((KEYOUT) key, (VALUEOUT) value); } } } Wednesday, November 18, 2009
  • 25. Putting it together... Logical Physical Flow Physical Wednesday, November 18, 2009
  • 26. Some samples... Build an inverted index. Summarize data grouped by a key. Build map tiles from geographic data. OCRing many images. Learning ML models. (e.g., Naive Bayes for text classification) Augment traditional BI/DW technologies (by archiving raw data). Wednesday, November 18, 2009
  • 27. There’s more than the Java API Streaming Pig Hive perl, python, Higher-level SQL interface. ruby, whatever. dataflow language for easy ad-hoc Great for stdin/stdout/ analysis. analysts. stderr Developed at Developed at Yahoo! Facebook Friday, @10:10 Wednesday, November 18, 2009
  • 28. A typical look... Commodity servers (8-core, 8-16GB RAM, 4-12 TB, 2x1 gE NIC) 2-level network architecture 20-40 nodes per rack Wednesday, November 18, 2009
  • 29. The cast... Starring... NameNode (metadata server and database) SecondaryNameNode (assistant to NameNode) JobTracker (scheduler) The Chorus… DataNodes TaskTrackers (block storage) (task execution) Thanks to Zak Stone for earmuff image! Wednesday, November 18, 2009
  • 30. HDFS 3x64MB file, 3 rep 4x64MB file, 3 rep Namenode Small file, 7 rep Datanodes Wednesday, November 18, 2009 One Rack A Different Rack
  • 31. HDFS Write Path Wednesday, November 18, 2009
  • 32. HDFS Failures? Datanode crash? Clients read another copy Background rebalance Namenode crash? uh-oh Wednesday, November 18, 2009
  • 33. M/R Job on stars Tasktrackers on the same Different job machines as datanodes Idle Wednesday, November 18, 2009 One Rack A Different Rack
  • 35. M/R Failures Task fails Try again? Try again somewhere else? Report failure Retries possible because of idempotence Wednesday, November 18, 2009
  • 36. Hadoop in the Wild Yahoo! Hadoop Clusters: > 82PB, >25k machines (Eric14, HadoopWorld NYC ’09) Google: 40 GB/s GFS read/write load (Jeff Dean, LADIS ’09) [~3,500 TB/day] Facebook: 4TB new data per day; DW: 4800 cores, 5.5 PB (Dhruba Borthakur, HadoopWorld) Wednesday, November 18, 2009
  • 37. The Hadoop Ecosystem ETL Tools BI Reporting RDBMS Pig (Data Flow) Hive (SQL) Sqoop Zookeepr (Coordination) Avro (Serialization) MapReduce (Job Scheduling/Execution System) HBase (Key-Value store) HDFS (Hadoop Distributed File System) Wednesday, November 18, 2009
  • 38. Ok, fine, what next? Get Hadoop! http://hadoop.apache.org/ Cloudera Distribution for Hadoop Try it out! (Locally, or on EC2) Wednesday, November 18, 2009
  • 39. Just one slide... Software: Cloudera Distribution for Hadoop, Cloudera Desktop, more… Training and certification… Free on-line training materials (including video) Support & Professional Services @cloudera, blog, etc. Wednesday, November 18, 2009
  • 40. Questions? philip@cloudera.com Wednesday, November 18, 2009
  • 42. Important APIs → is 1:many Input Format data→K₁,V₁ Writable Mapper K₁,V₁→K₂,V₂ JobClient M/R Flow Other Combiner K₂,iter(V₂)→K₂,V₂ Partitioner K₂,V₂→int *Context Reducer K₂, iter(V₂)→K₃,V₃ Filesystem Out. Format K₃,V₃→data Wednesday, November 18, 2009
  • 43. $ cat input.txt adams dunster kirkland dunster kirland dudley dunster adams dunster winthrop $ bin/hadoop jar hadoop-0.18.3- examples.jar grep input.txt output1 'dunster|adams' $ cat output1/part-00000 4 dunster 2 adams Wednesday, November 18, 2009
  • 44. public int run(String[] args) throws Exception { grepJob.setReducerClass(LongSumRedu FileOutputFormat.setOutputPath(sort if (args.length < 3) { cer.class); Job, new Path(args[1])); System.out.println("Grep // sort by decreasing freq <inDir> <outDir> <regex> [<group>]"); FileOutputFormat.setOutputPath(grep sortJob.setOutputKeyComparatorClass Job, tempDir); (LongWritable.DecreasingComparator. ToolRunner.printGenericCommandUsage class); (System.out); grepJob.setOutputFormat(SequenceFil return -1; eOutputFormat.class); JobClient.runJob(sortJob); } } finally { Path tempDir = new Path("grep- grepJob.setOutputKeyClass(Text.clas temp-"+Integer.toString(new s); FileSystem.get(grepJob).delete(temp Random().nextInt(Integer.MAX_VALUE) Dir, true); )); grepJob.setOutputValueClass(LongWri } JobConf grepJob = new table.class); return 0; JobConf(getConf(), Grep.class); } try { JobClient.runJob(grepJob); grepJob.setJobName("grep- search"); JobConf sortJob = new JobConf(Grep.class); sortJob.setJobName("grep- FileInputFormat.setInputPaths(grepJ sort"); the “grep” ob, args[0]); FileInputFormat.setInputPaths(sortJ grepJob.setMapperClass(RegexMapper. ob, tempDir); class); example sortJob.setInputFormat(SequenceFile grepJob.set("mapred.mapper.regex", InputFormat.class); args[2]); if (args.length == 4) sortJob.setMapperClass(InverseMappe grepJob.set("mapred.mapper.regex.gr r.class); oup", args[3]); // write a single file sortJob.setNumReduceTasks(1); grepJob.setCombinerClass(LongSumRed ucer.class); Wednesday, November 18, 2009
  • 45. JobConf grepJob = new JobConf(getConf(), Grep.class); try { grepJob.setJobName("grep-search"); FileInputFormat.setInputPaths(grepJob, args[0]); Job grepJob.setMapperClass(RegexMapper.class); grepJob.set("mapred.mapper.regex", args[2]); if (args.length == 4) grepJob.set("mapred.mapper.regex.group", args[3]); grepJob.setCombinerClass(LongSumReducer.class); grepJob.setReducerClass(LongSumReducer.class); 1of 2 FileOutputFormat.setOutputPath(grepJob, tempDir); grepJob.setOutputFormat(SequenceFileOutputFormat.class); grepJob.setOutputKeyClass(Text.class); grepJob.setOutputValueClass(LongWritable.class); JobClient.runJob(grepJob); } ... Wednesday, November 18, 2009
  • 46. JobConf sortJob = new JobConf(Grep.class); sortJob.setJobName("grep-sort"); FileInputFormat.setInputPaths(sortJob, tempDir); Job sortJob.setInputFormat(SequenceFileInputFormat.class); sortJob.setMapperClass(InverseMapper.class); (implicit identity reducer) // write a single file sortJob.setNumReduceTasks(1); 2 of 2 FileOutputFormat.setOutputPath(sortJob, new Path(args[1])); // sort by decreasing freq sortJob.setOutputKeyComparatorClass( LongWritable.DecreasingComparator.class); JobClient.runJob(sortJob); } finally { FileSystem.get(grepJob).delete(tempDir, true); } return 0; } Wednesday, November 18, 2009
  • 47. The types there... ?, Text Text, Long Text, list(Long) Text, Long Long, Text Wednesday, November 18, 2009
  • 48. A Simple Join Id Last First People 1 Washington George 2 Lincoln Abraham Key Entry Log Location Id Time Dunster 1 11:00am Dunster 2 11:02am Kirkland 2 11:08am You want to track individuals throughout the day. How would you do this in M/R, if you had to? Wednesday, November 18, 2009