SlideShare una empresa de Scribd logo
1 de 13
Descargar para leer sin conexión
Weekly meeting
         Week 5


    Le Duc Tung

 PRL group, NII, Tokyo


    Nov 21, 2012




 Le Duc Tung   Weekly meeting
Last week




     Described Quicksort using MapReduce, however,
         Required many MapReduce tasks: (log n) to n steps.




                           Le Duc Tung   Weekly meeting
Terasort using Hadoop



     Written by Owen O’Malley and Arun Murthy at Yahoo Inc.
     Won the annual general purpose terabyte sort benchmark in 2008
     and 2009.
         In 2008, 910 nodes x (4 dual-core processors, 4 disks, 8 GB
         memory). Sorting 1TB data in 3.48 minutes.
         In 2009, 3452 nodes x (2 Quadcore Xeons, 8 GB memory, 4 SATA).
         Sorting 100TB data in 173 minutes.
     Terasort includes 3 MapReduce applications:
         Teragen: generates the data.
         Terasort: samples the input data and uses them with MapReduce to
         sort the data.
         Teravalidate: validates the output is sorted.




                           Le Duc Tung   Weekly meeting
A closer look at MapReduce’s implementation model




  ”source: http://developer.yahoo.com/hadoop/tutorial/module4.html”
                           Le Duc Tung   Weekly meeting
Teragen




     Input: The number of rows and the output directory.
     Output format:




                           Le Duc Tung   Weekly meeting
MapReduce for Teragen


     Preparation of data for Map task.
                                    InputSplit
         The number of rows            →         splits of (startid, num of rows).
                                      RecordReader
         (startid, num of rows)            →         key-value pairs of (rowid, null)
     Example:
         We need generate data of 100 rows
         We use 2 mappers
         We generate 2 splits for 2 mappers.
                Split 0: consists of two values: startid = 0, number of rows = 50.
                Split 1: consists of two values: startid = 50, number of rows = 50.
         Key-value pairs generated from split 0:
                (0, null), (1, null), ..., (49, null)
         Key-value pairs generated from split 1:
                (50, null), (51, null), ..., (99, null)




                                  Le Duc Tung      Weekly meeting
Map and Reduce task


     Map
           Input is a pair of (rowid, null)
           Output is a pair of (key, value), in which
                Step 1: Create a random generator based on ”rowid”.
                Step 2: Create a key that is a text of 10 random bytes.
                Step 3: The value is a text of: ”rowid + 7 blocks of 10 characters +
                1 block of 8 characters”
                Characters are got from [’A’ .. ’Z’]
     Reduce task
           Does nothing.




                               Le Duc Tung   Weekly meeting
Terasort



     Step 1: (Before submitting job) Generate the sample keys by
     sampling the input. It is a sorted list of sampled keys.
     Step 2: (Using Partitioner) Distribute keys to the correspondent
     reducers using sample keys.
     Example: If we have N reducers.
           Generate (N-1) sample keys.
           All keys such that sample[i − 1] ≤ key ≤ sample[i] are sent to
           reduce i
           The above guarantees that the output of reduce i are all less than
           the output of reduce (i + 1)




                              Le Duc Tung   Weekly meeting
Generate sample keys

     From the input, create an array R containing all keys or the
     maximum of 100.000 keys (1MB). However, we can specify how
     many elements the array has.
     Using Quicksort to sort keys in the array.
     Create an array of (N-1) sample keys from the array R, in which N is
     the number of reducers, such that,
         (i − 1)-th element in the new array is (i ∗ stepSize)-th element in the
         original array.
         stepSize is of (No. of elements of R)/(No. of reducers)
     List of sampled keys is written into HDFS.




                             Le Duc Tung   Weekly meeting
Partitioner

  public interface Partitioner<K2, V2> extends JobConfigurable {
    /**
     * Get the partition number for a given key (hence record)
     * given the total number of partitions
     * i.e. number of reduce-tasks for the job.
     *
     * <p>Typically a hash function on a all or a subset of
     * the key.</p>
     *
     * @param key the key to be partitioned.
     * @param value the entry value.
     * @param numPartitions the total number of partitions.
     * @return the partition number for the <code>key</code>.
     */
    int getPartition(K2 key, V2 value, int numPartitions);
  }


                        Le Duc Tung   Weekly meeting
Partitioner and Trie tree




    getPartition() function will
    return the position of k in the
    trie tree.                              ”source:
                                            http://en.wikipedia.org/wiki/Trie”


                              Le Duc Tung   Weekly meeting
Trie for sample keys
      Assume that we have three sample keys:
           :x=P[ Q%D¡, in which, ASCII code of : and x is 58 and 120
           respectively.
           LsS8)—.ZLD, in which, ASCII code of L and s is 76 and 115
           respectively.
           le5awB.$sm, in which, ASCII code of l and e is 108 and 101
           respectively.
      Using the 2 first bytes of keys for building a 2-level trie tree.




                              Le Duc Tung   Weekly meeting
Default sort




    By default, data
    will be sorted
    before passing to
    reducers.
    Reduce task
    does nothing, it
    just outputs the
    (K, V) pairs to
    output files.




                        Le Duc Tung   Weekly meeting

Más contenido relacionado

Similar a TeraSort

AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
Paul Chao
 
Literary Genre MatrixPart 1 Matrix FictionNon-fiction.docx
Literary Genre MatrixPart 1 Matrix FictionNon-fiction.docxLiterary Genre MatrixPart 1 Matrix FictionNon-fiction.docx
Literary Genre MatrixPart 1 Matrix FictionNon-fiction.docx
jeremylockett77
 
Profiling and optimization
Profiling and optimizationProfiling and optimization
Profiling and optimization
g3_nittala
 
Julia - Easier, Better, Faster, Stronger
Julia - Easier, Better, Faster, StrongerJulia - Easier, Better, Faster, Stronger
Julia - Easier, Better, Faster, Stronger
Kenta Sato
 
AssignmentThe purpose of this assignment is to get you familiar .docx
AssignmentThe purpose of this assignment is to get you familiar .docxAssignmentThe purpose of this assignment is to get you familiar .docx
AssignmentThe purpose of this assignment is to get you familiar .docx
ssuser562afc1
 

Similar a TeraSort (20)

Real Time Big Data Management
Real Time Big Data ManagementReal Time Big Data Management
Real Time Big Data Management
 
Itroroduction to R language
Itroroduction to R languageItroroduction to R language
Itroroduction to R language
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
 
Python lecture 05
Python lecture 05Python lecture 05
Python lecture 05
 
Introduction2R
Introduction2RIntroduction2R
Introduction2R
 
Literary Genre MatrixPart 1 Matrix FictionNon-fiction.docx
Literary Genre MatrixPart 1 Matrix FictionNon-fiction.docxLiterary Genre MatrixPart 1 Matrix FictionNon-fiction.docx
Literary Genre MatrixPart 1 Matrix FictionNon-fiction.docx
 
data structures using C 2 sem BCA univeristy of mysore
data structures using C 2 sem BCA univeristy of mysoredata structures using C 2 sem BCA univeristy of mysore
data structures using C 2 sem BCA univeristy of mysore
 
Basic concept of MATLAB.ppt
Basic concept of MATLAB.pptBasic concept of MATLAB.ppt
Basic concept of MATLAB.ppt
 
Ida python intro
Ida python introIda python intro
Ida python intro
 
Profiling and optimization
Profiling and optimizationProfiling and optimization
Profiling and optimization
 
sonam Kumari python.ppt
sonam Kumari python.pptsonam Kumari python.ppt
sonam Kumari python.ppt
 
Time and Space Complexity Analysis.pptx
Time and Space Complexity Analysis.pptxTime and Space Complexity Analysis.pptx
Time and Space Complexity Analysis.pptx
 
Julia - Easier, Better, Faster, Stronger
Julia - Easier, Better, Faster, StrongerJulia - Easier, Better, Faster, Stronger
Julia - Easier, Better, Faster, Stronger
 
Python Day1
Python Day1Python Day1
Python Day1
 
1.2 matlab numerical data
1.2  matlab numerical data1.2  matlab numerical data
1.2 matlab numerical data
 
NUMPY-2.pptx
NUMPY-2.pptxNUMPY-2.pptx
NUMPY-2.pptx
 
Introduction to python
Introduction to pythonIntroduction to python
Introduction to python
 
Arrays C#
Arrays C#Arrays C#
Arrays C#
 
AssignmentThe purpose of this assignment is to get you familiar .docx
AssignmentThe purpose of this assignment is to get you familiar .docxAssignmentThe purpose of this assignment is to get you familiar .docx
AssignmentThe purpose of this assignment is to get you familiar .docx
 
Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...
Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...
Apache Spark - Basics of RDD & RDD Operations | Big Data Hadoop Spark Tutoria...
 

Último

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Último (20)

Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 

TeraSort

  • 1. Weekly meeting Week 5 Le Duc Tung PRL group, NII, Tokyo Nov 21, 2012 Le Duc Tung Weekly meeting
  • 2. Last week Described Quicksort using MapReduce, however, Required many MapReduce tasks: (log n) to n steps. Le Duc Tung Weekly meeting
  • 3. Terasort using Hadoop Written by Owen O’Malley and Arun Murthy at Yahoo Inc. Won the annual general purpose terabyte sort benchmark in 2008 and 2009. In 2008, 910 nodes x (4 dual-core processors, 4 disks, 8 GB memory). Sorting 1TB data in 3.48 minutes. In 2009, 3452 nodes x (2 Quadcore Xeons, 8 GB memory, 4 SATA). Sorting 100TB data in 173 minutes. Terasort includes 3 MapReduce applications: Teragen: generates the data. Terasort: samples the input data and uses them with MapReduce to sort the data. Teravalidate: validates the output is sorted. Le Duc Tung Weekly meeting
  • 4. A closer look at MapReduce’s implementation model ”source: http://developer.yahoo.com/hadoop/tutorial/module4.html” Le Duc Tung Weekly meeting
  • 5. Teragen Input: The number of rows and the output directory. Output format: Le Duc Tung Weekly meeting
  • 6. MapReduce for Teragen Preparation of data for Map task. InputSplit The number of rows → splits of (startid, num of rows). RecordReader (startid, num of rows) → key-value pairs of (rowid, null) Example: We need generate data of 100 rows We use 2 mappers We generate 2 splits for 2 mappers. Split 0: consists of two values: startid = 0, number of rows = 50. Split 1: consists of two values: startid = 50, number of rows = 50. Key-value pairs generated from split 0: (0, null), (1, null), ..., (49, null) Key-value pairs generated from split 1: (50, null), (51, null), ..., (99, null) Le Duc Tung Weekly meeting
  • 7. Map and Reduce task Map Input is a pair of (rowid, null) Output is a pair of (key, value), in which Step 1: Create a random generator based on ”rowid”. Step 2: Create a key that is a text of 10 random bytes. Step 3: The value is a text of: ”rowid + 7 blocks of 10 characters + 1 block of 8 characters” Characters are got from [’A’ .. ’Z’] Reduce task Does nothing. Le Duc Tung Weekly meeting
  • 8. Terasort Step 1: (Before submitting job) Generate the sample keys by sampling the input. It is a sorted list of sampled keys. Step 2: (Using Partitioner) Distribute keys to the correspondent reducers using sample keys. Example: If we have N reducers. Generate (N-1) sample keys. All keys such that sample[i − 1] ≤ key ≤ sample[i] are sent to reduce i The above guarantees that the output of reduce i are all less than the output of reduce (i + 1) Le Duc Tung Weekly meeting
  • 9. Generate sample keys From the input, create an array R containing all keys or the maximum of 100.000 keys (1MB). However, we can specify how many elements the array has. Using Quicksort to sort keys in the array. Create an array of (N-1) sample keys from the array R, in which N is the number of reducers, such that, (i − 1)-th element in the new array is (i ∗ stepSize)-th element in the original array. stepSize is of (No. of elements of R)/(No. of reducers) List of sampled keys is written into HDFS. Le Duc Tung Weekly meeting
  • 10. Partitioner public interface Partitioner<K2, V2> extends JobConfigurable { /** * Get the partition number for a given key (hence record) * given the total number of partitions * i.e. number of reduce-tasks for the job. * * <p>Typically a hash function on a all or a subset of * the key.</p> * * @param key the key to be partitioned. * @param value the entry value. * @param numPartitions the total number of partitions. * @return the partition number for the <code>key</code>. */ int getPartition(K2 key, V2 value, int numPartitions); } Le Duc Tung Weekly meeting
  • 11. Partitioner and Trie tree getPartition() function will return the position of k in the trie tree. ”source: http://en.wikipedia.org/wiki/Trie” Le Duc Tung Weekly meeting
  • 12. Trie for sample keys Assume that we have three sample keys: :x=P[ Q%D¡, in which, ASCII code of : and x is 58 and 120 respectively. LsS8)—.ZLD, in which, ASCII code of L and s is 76 and 115 respectively. le5awB.$sm, in which, ASCII code of l and e is 108 and 101 respectively. Using the 2 first bytes of keys for building a 2-level trie tree. Le Duc Tung Weekly meeting
  • 13. Default sort By default, data will be sorted before passing to reducers. Reduce task does nothing, it just outputs the (K, V) pairs to output files. Le Duc Tung Weekly meeting