Hadoop Design and k-Means Clustering

                                    Kenneth Heafield

                                           Google Inc


                                    January 15, 2008



Example code from Hadoop 0.13.1 used under the Apache License Version 2.0
and modified for presentation. Except as otherwise noted, the content of this
presentation is licensed under the Creative Commons Attribution 2.5 License.


 Kenneth Heafield (Google Inc)   Hadoop Design and k-Means Clustering   January 15, 2008   1 / 31
Outline


Hadoop Design

1    Fault Tolerance

2    Data Flow
       Input
       Output

3    MapTask
      Map
      Partition

4    ReduceTask
       Fetch and Sort
       Reduce

Later in this talk: Performance and k-Means Clustering

Fault Tolerance


Managing Tasks

                            JobTracker

        TaskTracker                       TaskTracker

ReduceTask MapTask                 MapTask            MapTask

Design
    TaskTracker reports status or requests work every 10 seconds
     MapTask and ReduceTask report progress every 10 seconds

Issues
  + Detects failures and slow workers quickly
  - JobTracker is a single point of failure

Fault Tolerance


Coping With Failure



Failed Tasks
Rerun map and reduce as necessary.

Slow Tasks
Start a second backup instance of the same task.

Consistency
    Any MapTask or ReduceTask might be run multiple times
     Map and Reduce should be functional




Fault Tolerance


Use of Random Numbers

Purpose
Support randomized algorithms while remaining consistent

Sampling Mapper
private Random rand = new Random();
void configure(JobConf conf) {
  rand.setSeed((long)conf.getInt("mapred.task.partition", 0));
}
void map(WritableComparable key, Writable value,
          OutputCollector output, Reporter reporter) {
  if (rand.nextFloat() < 0.1) {
    output.collect(key, value);
  }
}
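The seed matters because of the consistency requirement: a re-executed MapTask must make identical sampling decisions. A minimal plain-Java sketch of that property (no Hadoop dependency; the 10% threshold matches the mapper above):

```java
import java.util.Random;

// Sketch: a Random seeded with the task's partition id makes the same
// accept/reject decisions on every run, so a re-executed MapTask emits
// exactly the same sample.
class SeededSampling {
    static boolean[] decisions(long seed, int n) {
        Random rand = new Random();
        rand.setSeed(seed);
        boolean[] out = new boolean[n];
        for (int i = 0; i < n; i++) {
            out[i] = rand.nextFloat() < 0.1f;  // keep roughly 10%
        }
        return out;
    }
}
```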

Data Flow


Data Flow


 HDFS Input                     InputFormat splits and reads files

    Mapper

 Local Output                   SequenceFileOutputFormat writes serialized values

 HTTP Input                     Map outputs are retrieved over HTTP and merged

    Reduce

HDFS Output                     OutputFormat writes a SequenceFile or text



Data Flow    Input


InputSplit


Purpose
Locate a single map task’s input.

Important Functions
Path FileSplit.getPath();

Implementations
    MultiFileSplit is a list of small files to be concatenated.
     FileSplit is a file path, offset, and length.
     TableSplit is a table name, start row, and end row.



Data Flow    Input


RecordReader

Purpose
Parse input specified by InputSplit into keys and values. Handle records
on split boundaries.

Important Functions
boolean next(Writable key, Writable value);

Implementations
    LineRecordReader reads lines. Key is an offset, value is the text.
     KeyValueLineRecordReader reads delimited key-value pairs.
     SequenceFileRecordReader reads a SequenceFile, Hadoop’s
     binary representation of key-value pairs.


Data Flow    Input


InputFormat

Purpose
Specifies input file format by constructing InputSplit and
RecordReader.

Important Functions
RecordReader getRecordReader(InputSplit split, JobConf job,
                             Reporter reporter);
InputSplit[] getSplits(JobConf job, int numSplits);

Implementations
    TextInputFormat reads text files.
     TableInputFormat reads from a table.


Data Flow    Output


OutputFormat


Purpose
    Machine or human readable output.
     Makes RecordWriter, which is analogous to RecordReader

Important Functions
RecordWriter getRecordWriter(FileSystem fs, JobConf job,
  String name, Progressable progress);

Formats
    SequenceFileOutputFormat writes a binary SequenceFile
     TextOutputFormat writes text files



MapTask


MapTask
Default Setup

        InputFormat             Split files and read records


       MapRunnable              Map all records in the task


            Mapper              Map a record


      OutputCollector           Consult Partitioner and save files


          Partitioner           Assign key-value pairs to reducers


  Reducer            Reducer    Reducers retrieve files over HTTP
MapTask     Map


MapRunnable
Purpose
Sequence of map operations

Default Implementation
public void run(RecordReader input, OutputCollector output,
                  Reporter reporter) throws IOException {
  try {
    WritableComparable key = input.createKey();
    Writable value = input.createValue();
    while (input.next(key, value)) {
       mapper.map(key, value, output, reporter);
    }
  } finally {
    mapper.close();
  }
}
MapTask     Map


Mapper

Purpose
Single map operation

Important Functions
void map(WritableComparable key, Writable value,
          OutputCollector output, Reporter reporter);

Pre-defined Mappers
    IdentityMapper
     InverseMapper flips key and value.
     RegexMapper matches regular expressions set in job.
     TokenCountMapper implements word count map.


MapTask     Partition


Partitioner


Purpose
Decide which reducer handles map output.

Important Functions
int getPartition(WritableComparable key, Writable value,
                    int numReduceTasks);

Implementations
    HashPartitioner uses key.hashCode() % numReduceTasks.
     KeyFieldBasedPartitioner hashes only part of key.
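The hash rule can be sketched in plain Java. One detail the slide's shorthand omits: HashPartitioner masks off the sign bit, so a negative hashCode() cannot produce a negative partition number.

```java
// Sketch of HashPartitioner's rule. The sign-bit mask keeps the result
// in [0, numReduceTasks) even when hashCode() is negative.
class HashPartitionSketch {
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

The same key always lands on the same reducer, which is what lets a reduce see every value for its keys.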




ReduceTask    Fetch and Sort


Fetch and Sort

Fetch
    TaskTracker tells Reducer where mappers are
     Reducer requests input files from mappers via HTTP

Merge Sort
    Recursively merges 10 files at a time
     100 MB in-memory sort buffer
     Calls key’s Comparator, which defaults to key.compareTo

Important Functions
int WritableComparable.compareTo(Object o);
int WritableComparator.compare(WritableComparable a,
                               WritableComparable b);
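The merge sort needs nothing from a key beyond a total order, supplied by compareTo. A plain-Java stand-in for a WritableComparable key (hypothetical MovieIdKey, no Hadoop dependency) shows the contract:

```java
// Hypothetical key type: the fetch-and-sort phase only requires that
// compareTo define a consistent total order over keys.
class MovieIdKey implements Comparable<MovieIdKey> {
    final long id;
    MovieIdKey(long id) { this.id = id; }
    public int compareTo(MovieIdKey o) { return Long.compare(id, o.id); }
}
```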

ReduceTask    Reduce


Reduce


Important Functions
void reduce(WritableComparable key, Iterator values,
              OutputCollector output, Reporter reporter);

Pre-defined Reducers
    IdentityReducer
     LongSumReducer sums LongWritable values

Behavior
Reduce cannot start until all Mappers finish and their output is merged.




Using Hadoop




5    Performance
       Combiners


6    k-Means Clustering
       Algorithm
       Implementation




Performance


Performance




Why We Care
   ≥ 10,000 programs
     Average 100,000 jobs/day
     ≥ 20 petabytes/day

Source: Dean, Jeffrey and Ghemawat, Sanjay. MapReduce: Simplified
Data Processing on Large Clusters. Commun. ACM 51 (2008), 107–113.




Performance


Barriers


Concept
Barriers wait for N things to happen

Examples
    Reduce waits for all Mappers to finish
     Job waits for all Reducers to finish
     Search engine assembles pieces of results

Moral
Worry about the maximum time. This implies balance.



Performance    Combiners


Combiner

Purpose
Lessen network traffic by combining repeated keys in MapTask.

Important Functions
void reduce(WritableComparable key, Iterator values,
              OutputCollector output, Reporter reporter);

Example Implementation
    LongSumReducer adds LongWritable values

Behavior
    Framework decides when to call.
     Uses Reducer interface, but called with partial list of values.
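A sum-style combiner is safe because addition is associative: combining partial value lists in the MapTask and reducing the partial results later gives the same total as one reduce over all values. A plain-Java sketch of that equivalence:

```java
import java.util.List;

// Sketch: reducing partial sums equals one global sum, which is why the
// framework may call a LongSumReducer-style combiner whenever it likes.
class SumCombineSketch {
    static long sum(List<Long> values) {
        long total = 0;
        for (long v : values) total += v;
        return total;
    }
}
```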

Performance    Combiners


Extended Combining


Problem
    1000 map outputs are buffered before combining.
      Keys can still be repeated enough to unbalance a reduce.

Two Phase Reduce
 1 Run a MapReduce to combine values

             Use Partitioner to balance a key over Reducers
             Run Combiner in Mapper and Reducer
  2   Run a MapReduce to reduce values
             Map with IdentityMapper
             Partition normally
             Reduce normally



Performance    General Advice


General Advice

Small Work Units
    More inputs than Mappers
     Ideally, more reduce tasks than Reducers
     Too many tasks increases overhead
     Aim for constant-memory Mappers and Reducers

Map Only
    Skip IdentityReducer by setting numReduceTasks to 0

Outside Tables
    Increase HDFS replication before launching
     Keep random access tables in memory
     Use multithreading to share memory

k-Means Clustering   Data


Netflix data

Goal
Find similar movies from ratings provided by users

Vector Model
    Give each movie a vector
     Make one dimension per user
     Put origin at average rating (so poor is negative)
     Normalize all vectors to unit length
             Often called cosine similarity
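The vector recipe above can be sketched in plain Java: shift the origin to the average rating, then scale to unit length (an illustrative sketch of the slide's recipe, not Netflix-specific code):

```java
// Sketch: center ratings at the average, then normalize to unit length,
// so dot products between movies become cosine similarities.
class RatingVector {
    static double[] normalize(double[] ratings, double averageRating) {
        double[] v = new double[ratings.length];
        double norm = 0;
        for (int i = 0; i < ratings.length; i++) {
            v[i] = ratings[i] - averageRating;   // origin at average rating
            norm += v[i] * v[i];
        }
        norm = Math.sqrt(norm);
        for (int i = 0; i < v.length; i++) v[i] /= norm;  // unit length
        return v;
    }
}
```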

Issues
   - Users are biased in the movies they rate
 + Addresses different numbers of raters

k-Means Clustering   Algorithm


k-Means Clustering

Two Dimensional Clusters


                                       Goal
                                       Cluster similar data points

                                       Approach
                                       Given data points x[i] and distance d:
                                              Select k centers c
                                              Assign x[i] to closest center c[i]
                                               Minimize Σ_i d(x[i], c[i])


d is sum of squares

k-Means Clustering   Algorithm


Lloyd’s Algorithm


Algorithm
  1   Randomly pick centers, possibly from data points
  2   Assign points to closest center
  3   Average assigned points to obtain new centers
  4   Repeat 2 and 3 until nothing changes
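A single Lloyd iteration in plain Java (one-dimensional points, squared distance) makes steps 2 and 3 concrete; the MapReduce formulations on the following slides distribute exactly these two loops:

```java
// One Lloyd iteration: assign each point to its nearest center (step 2),
// then replace each center with the mean of its assigned points (step 3).
class LloydStep {
    static double[] iterate(double[] points, double[] centers) {
        double[] sum = new double[centers.length];
        int[] count = new int[centers.length];
        for (double x : points) {
            int best = 0;
            for (int c = 1; c < centers.length; c++) {
                if ((x - centers[c]) * (x - centers[c])
                        < (x - centers[best]) * (x - centers[best])) {
                    best = c;
                }
            }
            sum[best] += x;
            count[best]++;
        }
        double[] next = new double[centers.length];
        for (int c = 0; c < centers.length; c++) {
            // Leave an empty center where it was rather than divide by zero.
            next[c] = count[c] == 0 ? centers[c] : sum[c] / count[c];
        }
        return next;
    }
}
```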

Issues
   - Takes superpolynomial time on some inputs
  - Not guaranteed to find optimal solution
 + Converges quickly in practice



k-Means Clustering   Implementation


Lloyd’s Algorithm in MapReduce


Reformatting Data
Create a SequenceFile for fast reading. Partition as you see fit.

Initialization
Use a seeded random number generator to pick initial centers.

Iteration
Load centers table in MapRunnable or Mapper.

Termination
Use TextOutputFormat to list movies in each cluster.



k-Means Clustering    Implementation


Iterative MapReduce

           Centers Version i

 Points                           Points


 Mapper                          Mapper              Find Nearest Center

                                                     Key is Center, Value is Movie

Reducer                          Reducer             Average Ratings


        Centers Version i + 1


k-Means Clustering   Implementation


Direct Implementation


Mapper
   Load all centers into RAM off HDFS
     For each movie, measure distance to each center
     Output key identifying the closest center

Reducer
    Output average ratings of movies

Issues
   - Brute force distance and all centers in memory
  - Unbalanced reduce, possibly even for large k


k-Means Clustering   Implementation


Two Phase Reduce

Implementation
 1   Combine
             Mapper key identifies closest center, value is point.
             Partitioner balances centers over reducers.
             Combiner and Reducer add and count points.
 2   Recenter
             IdentityMapper
             Reducer averages values

Issues
  + Balanced reduce
  - Two phases
  - Mapper still has all k centers in memory

k-Means Clustering   Implementation


Large k

Implementation
    Map task responsible for part of movies and part of k centers.
             For each movie, finds closest of known centers.
             Output key is point, value identifies center and distance.
     Reducer takes minimum distance center.
             Output key identifies center, value is movie.
     Second phase averages points in each center.

Issues
  + Large k while still fitting in RAM
  - Reads data points multiple times
  - Startup and intermediate storage costs


Exercises


Exercises




k-Means
   Run on part of Netflix to cluster movies
     Read about and implement Canopies:
     http://www.kamalnigam.com/papers/canopy-kdd00.pdf





More Related Content

What's hot

An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduce
Frane Bandov
 
Analytical Queries with Hive: SQL Windowing and Table Functions
Analytical Queries with Hive: SQL Windowing and Table FunctionsAnalytical Queries with Hive: SQL Windowing and Table Functions
Analytical Queries with Hive: SQL Windowing and Table Functions
DataWorks Summit
 

What's hot (19)

MapReduce basic
MapReduce basicMapReduce basic
MapReduce basic
 
Map Reduce introduction
Map Reduce introductionMap Reduce introduction
Map Reduce introduction
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce intro
 
Map reduce and Hadoop on windows
Map reduce and Hadoop on windowsMap reduce and Hadoop on windows
Map reduce and Hadoop on windows
 
MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design Patterns
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
An Introduction to MapReduce
An Introduction to MapReduceAn Introduction to MapReduce
An Introduction to MapReduce
 
Apache PIG
Apache PIGApache PIG
Apache PIG
 
Hadoop hbase mapreduce
Hadoop hbase mapreduceHadoop hbase mapreduce
Hadoop hbase mapreduce
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
High-level Programming Languages: Apache Pig and Pig Latin
High-level Programming Languages: Apache Pig and Pig LatinHigh-level Programming Languages: Apache Pig and Pig Latin
High-level Programming Languages: Apache Pig and Pig Latin
 
Map Reduce data types and formats
Map Reduce data types and formatsMap Reduce data types and formats
Map Reduce data types and formats
 
Analytical Queries with Hive: SQL Windowing and Table Functions
Analytical Queries with Hive: SQL Windowing and Table FunctionsAnalytical Queries with Hive: SQL Windowing and Table Functions
Analytical Queries with Hive: SQL Windowing and Table Functions
 
Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and Hadoop
 
Hadoop-Introduction
Hadoop-IntroductionHadoop-Introduction
Hadoop-Introduction
 
Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
 
Apache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce TutorialApache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce Tutorial
 
Spark Sql and DataFrame
Spark Sql and DataFrameSpark Sql and DataFrame
Spark Sql and DataFrame
 
Map reduce presentation
Map reduce presentationMap reduce presentation
Map reduce presentation
 

Viewers also liked

Kmeans in-hadoop
Kmeans in-hadoopKmeans in-hadoop
Kmeans in-hadoop
Tianwei Liu
 
K means Clustering
K means ClusteringK means Clustering
K means Clustering
Edureka!
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerank
gothicane
 
Hadoop introduction 2
Hadoop introduction 2Hadoop introduction 2
Hadoop introduction 2
Tianwei Liu
 
Fredrick Ishengoma - HDFS+- Erasure Coding Based Hadoop Distributed File System
Fredrick Ishengoma -  HDFS+- Erasure Coding Based Hadoop Distributed File SystemFredrick Ishengoma -  HDFS+- Erasure Coding Based Hadoop Distributed File System
Fredrick Ishengoma - HDFS+- Erasure Coding Based Hadoop Distributed File System
Fredrick Ishengoma
 

Viewers also liked (20)

Hadoop installation and Running KMeans Clustering with MapReduce Program on H...
Hadoop installation and Running KMeans Clustering with MapReduce Program on H...Hadoop installation and Running KMeans Clustering with MapReduce Program on H...
Hadoop installation and Running KMeans Clustering with MapReduce Program on H...
 
Data clustering using map reduce
Data clustering using map reduceData clustering using map reduce
Data clustering using map reduce
 
Parallel-kmeans
Parallel-kmeansParallel-kmeans
Parallel-kmeans
 
Kmeans in-hadoop
Kmeans in-hadoopKmeans in-hadoop
Kmeans in-hadoop
 
K means Clustering Algorithm
K means Clustering AlgorithmK means Clustering Algorithm
K means Clustering Algorithm
 
K means Clustering
K means ClusteringK means Clustering
K means Clustering
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerank
 
Hadoop introduction 2
Hadoop introduction 2Hadoop introduction 2
Hadoop introduction 2
 
Fredrick Ishengoma - HDFS+- Erasure Coding Based Hadoop Distributed File System
Fredrick Ishengoma -  HDFS+- Erasure Coding Based Hadoop Distributed File SystemFredrick Ishengoma -  HDFS+- Erasure Coding Based Hadoop Distributed File System
Fredrick Ishengoma - HDFS+- Erasure Coding Based Hadoop Distributed File System
 
211 l oral
211 l oral211 l oral
211 l oral
 
25 Machine Learning Unsupervised Learaning K-means K-centers
25 Machine Learning Unsupervised Learaning K-means K-centers25 Machine Learning Unsupervised Learaning K-means K-centers
25 Machine Learning Unsupervised Learaning K-means K-centers
 
Kmeans
KmeansKmeans
Kmeans
 
Understanding Hadoop through examples
Understanding Hadoop through examplesUnderstanding Hadoop through examples
Understanding Hadoop through examples
 
Agile software developement
Agile software developementAgile software developement
Agile software developement
 
Dinoflagellates
DinoflagellatesDinoflagellates
Dinoflagellates
 
Dinoflagellate
DinoflagellateDinoflagellate
Dinoflagellate
 
Mr
MrMr
Mr
 
Lec4 Clustering
Lec4 ClusteringLec4 Clustering
Lec4 Clustering
 
Implementing page rank algorithm using hadoop map reduce
Implementing page rank algorithm using hadoop map reduceImplementing page rank algorithm using hadoop map reduce
Implementing page rank algorithm using hadoop map reduce
 
Intro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopIntro to Mahout -- DC Hadoop
Intro to Mahout -- DC Hadoop
 

Similar to Hadoop Design and k -Means Clustering

HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
Xiao Qin
 

Similar to Hadoop Design and k -Means Clustering (20)

Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaAdvance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
 
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...Processing massive amount of data with Map Reduce using Apache Hadoop  - Indi...
Processing massive amount of data with Map Reduce using Apache Hadoop - Indi...
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptx
 
Introduction to Map-Reduce Programming with Hadoop
Introduction to Map-Reduce Programming with HadoopIntroduction to Map-Reduce Programming with Hadoop
Introduction to Map-Reduce Programming with Hadoop
 
Hadoop Map Reduce
Hadoop Map ReduceHadoop Map Reduce
Hadoop Map Reduce
 
Hadoop
HadoopHadoop
Hadoop
 
Big-data-analysis-training-in-mumbai
Big-data-analysis-training-in-mumbaiBig-data-analysis-training-in-mumbai
Big-data-analysis-training-in-mumbai
 
MapReduce
MapReduceMapReduce
MapReduce
 
Hadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologiesHadoop trainting in hyderabad@kelly technologies
Hadoop trainting in hyderabad@kelly technologies
 
mapReduce.pptx
mapReduce.pptxmapReduce.pptx
mapReduce.pptx
 
Hadoop institutes-in-bangalore
Hadoop institutes-in-bangaloreHadoop institutes-in-bangalore
Hadoop institutes-in-bangalore
 
Hadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologiesHadoop trainting-in-hyderabad@kelly technologies
Hadoop trainting-in-hyderabad@kelly technologies
 
TheEdge10 : Big Data is Here - Hadoop to the Rescue
TheEdge10 : Big Data is Here - Hadoop to the RescueTheEdge10 : Big Data is Here - Hadoop to the Rescue
TheEdge10 : Big Data is Here - Hadoop to the Rescue
 
ACADILD:: HADOOP LESSON
ACADILD:: HADOOP LESSON ACADILD:: HADOOP LESSON
ACADILD:: HADOOP LESSON
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
 
Hadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comHadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.com
 
Mapreduce advanced
Mapreduce advancedMapreduce advanced
Mapreduce advanced
 
Big data
Big dataBig data
Big data
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
 
Hadoop
HadoopHadoop
Hadoop
 

More from George Ang

Wrapper induction construct wrappers automatically to extract information f...
Wrapper induction   construct wrappers automatically to extract information f...Wrapper induction   construct wrappers automatically to extract information f...
Wrapper induction construct wrappers automatically to extract information f...
George Ang
 
Opinion mining and summarization
Opinion mining and summarizationOpinion mining and summarization
Opinion mining and summarization
George Ang
 
Huffman coding
Huffman codingHuffman coding
Huffman coding
George Ang
 
Do not crawl in the dust 
different ur ls similar text
Do not crawl in the dust 
different ur ls similar textDo not crawl in the dust 
different ur ls similar text
Do not crawl in the dust 
different ur ls similar text
George Ang
 
大规模数据处理的那些事儿
大规模数据处理的那些事儿大规模数据处理的那些事儿
大规模数据处理的那些事儿
George Ang
 
腾讯大讲堂02 休闲游戏发展的文化趋势
腾讯大讲堂02 休闲游戏发展的文化趋势腾讯大讲堂02 休闲游戏发展的文化趋势
腾讯大讲堂02 休闲游戏发展的文化趋势
George Ang
 
腾讯大讲堂03 qq邮箱成长历程
腾讯大讲堂03 qq邮箱成长历程腾讯大讲堂03 qq邮箱成长历程
腾讯大讲堂03 qq邮箱成长历程
George Ang
 
腾讯大讲堂04 im qq
腾讯大讲堂04 im qq腾讯大讲堂04 im qq
腾讯大讲堂04 im qq
George Ang
 
腾讯大讲堂05 面向对象应对之道
腾讯大讲堂05 面向对象应对之道腾讯大讲堂05 面向对象应对之道
腾讯大讲堂05 面向对象应对之道
George Ang
 
腾讯大讲堂06 qq邮箱性能优化
腾讯大讲堂06 qq邮箱性能优化腾讯大讲堂06 qq邮箱性能优化
腾讯大讲堂06 qq邮箱性能优化
George Ang
 
腾讯大讲堂07 qq空间
腾讯大讲堂07 qq空间腾讯大讲堂07 qq空间
腾讯大讲堂07 qq空间
George Ang
 
腾讯大讲堂08 可扩展web架构探讨
腾讯大讲堂08 可扩展web架构探讨腾讯大讲堂08 可扩展web架构探讨
腾讯大讲堂08 可扩展web架构探讨
George Ang
 
腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站
George Ang
 
腾讯大讲堂01 移动qq产品发展历程
腾讯大讲堂01 移动qq产品发展历程腾讯大讲堂01 移动qq产品发展历程
腾讯大讲堂01 移动qq产品发展历程
George Ang
 
腾讯大讲堂10 customer engagement
腾讯大讲堂10 customer engagement腾讯大讲堂10 customer engagement
腾讯大讲堂10 customer engagement
George Ang
 
腾讯大讲堂11 拍拍ce工作经验分享
腾讯大讲堂11 拍拍ce工作经验分享腾讯大讲堂11 拍拍ce工作经验分享
腾讯大讲堂11 拍拍ce工作经验分享
George Ang
 
腾讯大讲堂14 qq直播(qq live) 介绍
腾讯大讲堂14 qq直播(qq live) 介绍腾讯大讲堂14 qq直播(qq live) 介绍
腾讯大讲堂14 qq直播(qq live) 介绍
George Ang
 
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
George Ang
 
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
George Ang
 
腾讯大讲堂16 产品经理工作心得分享
腾讯大讲堂16 产品经理工作心得分享腾讯大讲堂16 产品经理工作心得分享
腾讯大讲堂16 产品经理工作心得分享
George Ang
 

More from George Ang (20)

Wrapper induction construct wrappers automatically to extract information f...
Wrapper induction   construct wrappers automatically to extract information f...Wrapper induction   construct wrappers automatically to extract information f...
Wrapper induction construct wrappers automatically to extract information f...
 
Opinion mining and summarization
Opinion mining and summarizationOpinion mining and summarization
Opinion mining and summarization
 
Huffman coding
Huffman codingHuffman coding
Huffman coding
 
Do not crawl in the dust 
different ur ls similar text
Do not crawl in the dust 
different ur ls similar textDo not crawl in the dust 
different ur ls similar text
Do not crawl in the dust 
different ur ls similar text
 
大规模数据处理的那些事儿
大规模数据处理的那些事儿大规模数据处理的那些事儿
大规模数据处理的那些事儿
 
腾讯大讲堂02 休闲游戏发展的文化趋势
腾讯大讲堂02 休闲游戏发展的文化趋势腾讯大讲堂02 休闲游戏发展的文化趋势
腾讯大讲堂02 休闲游戏发展的文化趋势
 
腾讯大讲堂03 qq邮箱成长历程
腾讯大讲堂03 qq邮箱成长历程腾讯大讲堂03 qq邮箱成长历程
腾讯大讲堂03 qq邮箱成长历程
 
腾讯大讲堂04 im qq
腾讯大讲堂04 im qq腾讯大讲堂04 im qq
腾讯大讲堂04 im qq
 
腾讯大讲堂05 面向对象应对之道
腾讯大讲堂05 面向对象应对之道腾讯大讲堂05 面向对象应对之道
腾讯大讲堂05 面向对象应对之道
 
腾讯大讲堂06 qq邮箱性能优化
腾讯大讲堂06 qq邮箱性能优化腾讯大讲堂06 qq邮箱性能优化
腾讯大讲堂06 qq邮箱性能优化
 
腾讯大讲堂07 qq空间
腾讯大讲堂07 qq空间腾讯大讲堂07 qq空间
腾讯大讲堂07 qq空间
 
腾讯大讲堂08 可扩展web架构探讨
腾讯大讲堂08 可扩展web架构探讨腾讯大讲堂08 可扩展web架构探讨
腾讯大讲堂08 可扩展web架构探讨
 
腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站
 
腾讯大讲堂01 移动qq产品发展历程
腾讯大讲堂01 移动qq产品发展历程腾讯大讲堂01 移动qq产品发展历程
腾讯大讲堂01 移动qq产品发展历程
 
腾讯大讲堂10 customer engagement
腾讯大讲堂10 customer engagement腾讯大讲堂10 customer engagement
腾讯大讲堂10 customer engagement
 
腾讯大讲堂11 拍拍ce工作经验分享
腾讯大讲堂11 拍拍ce工作经验分享腾讯大讲堂11 拍拍ce工作经验分享
腾讯大讲堂11 拍拍ce工作经验分享
 
腾讯大讲堂14 qq直播(qq live) 介绍
腾讯大讲堂14 qq直播(qq live) 介绍腾讯大讲堂14 qq直播(qq live) 介绍
腾讯大讲堂14 qq直播(qq live) 介绍
 
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
 
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
 
腾讯大讲堂16 产品经理工作心得分享
腾讯大讲堂16 产品经理工作心得分享腾讯大讲堂16 产品经理工作心得分享
腾讯大讲堂16 产品经理工作心得分享
 

Recently uploaded

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Recently uploaded (20)

AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business

Hadoop Design and k-Means Clustering

  • 1. Hadoop Design and k-Means Clustering
       Kenneth Heafield, Google Inc
       January 15, 2008
       Example code from Hadoop 0.13.1 used under the Apache License Version 2.0 and modified
       for presentation. Except as otherwise noted, the content of this presentation is
       licensed under the Creative Commons Attribution 2.5 License.
       Kenneth Heafield (Google Inc)   Hadoop Design and k-Means Clustering   January 15, 2008   1 / 31
  • 2. Outline: Hadoop Design
       1 Fault Tolerance
       2 Data Flow: Input, Output
       3 MapTask: Map, Partition
       4 ReduceTask: Fetch and Sort, Reduce
       Later in this talk: Performance and k-Means Clustering
  • 3. Fault Tolerance: Managing Tasks
       [Diagram: a JobTracker coordinates TaskTrackers, each running MapTasks and ReduceTasks]
       Design
         TaskTracker reports status or requests work every 10 seconds
         MapTask and ReduceTask report progress every 10 seconds
       Issues
         + Detects failures and slow workers quickly
         - JobTracker is a single point of failure
  • 4. Fault Tolerance: Coping With Failure
       Failed Tasks: rerun map and reduce as necessary.
       Slow Tasks: start a second backup instance of the same task.
       Consistency
         Any MapTask or ReduceTask might be run multiple times
         Map and Reduce should be functional
  • 5. Fault Tolerance: Use of Random Numbers
       Purpose: support randomized algorithms while remaining consistent
       Sampling Mapper
         private Random rand;
         void configure(JobConf conf) {
           rand = new Random((long) conf.getInt("mapred.task.partition", 0));
         }
         void map(WritableComparable key, Writable value,
                  OutputCollector output, Reporter reporter) {
           if (rand.nextFloat() < 0.1) {
             output.collect(key, value);
           }
         }
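The consistency requirement is easy to check outside Hadoop: seeding the generator with the task's partition number makes the 10% sample deterministic, so a rerun MapTask emits exactly the same records. A minimal plain-Java sketch (ConsistentSample is a hypothetical stand-in for the mapper above, with a List in place of the RecordReader):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class ConsistentSample {
    // Keep roughly 10% of the records, deterministically per partition:
    // the same seed always yields the same sequence of nextFloat() draws.
    static List<Integer> sample(int partition, List<Integer> records) {
        Random rand = new Random((long) partition);
        List<Integer> kept = new ArrayList<>();
        for (Integer record : records) {
            if (rand.nextFloat() < 0.1f) {
                kept.add(record);
            }
        }
        return kept;
    }
}
```

Because the seed depends only on the partition, a backup instance of the same task produces an identical sample, which is exactly what the rerun-safety design requires.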
  • 6. Data Flow
       HDFS Input: InputFormat splits and reads files
       Mapper
       Local Output: SequenceFileOutputFormat writes serialized values
       HTTP Input: map outputs are retrieved over HTTP and merged
       Reduce
       HDFS Output: OutputFormat writes a SequenceFile or text
  • 7. Data Flow / Input: InputSplit
       Purpose: locate a single map task’s input.
       Important Functions
         Path FileSplit.getPath();
       Implementations
         MultiFileSplit is a list of small files to be concatenated.
         FileSplit is a file path, offset, and length.
         TableSplit is a table name, start row, and end row.
  • 8. Data Flow / Input: RecordReader
       Purpose: parse input specified by InputSplit into keys and values.
       Handle records on split boundaries.
       Important Functions
         boolean next(Writable key, Writable value);
       Implementations
         LineRecordReader reads lines. Key is an offset, value is the text.
         KeyValueLineRecordReader reads delimited key-value pairs.
         SequenceFileRecordReader reads a SequenceFile, Hadoop’s binary representation
         of key-value pairs.
  • 9. Data Flow / Input: InputFormat
       Purpose: specifies the input file format by constructing an InputSplit and RecordReader.
       Important Functions
         RecordReader getRecordReader(InputSplit split, JobConf job, Reporter reporter);
         InputSplit[] getSplits(JobConf job, int numSplits);
       Implementations
         TextInputFormat reads text files.
         TableInputFormat reads from a table.
  • 10. Data Flow / Output: OutputFormat
       Purpose: machine- or human-readable output. Makes a RecordWriter, which is
       analogous to RecordReader.
       Important Functions
         RecordWriter getRecordWriter(FileSystem fs, JobConf job, String name,
                                      Progressable progress);
       Formats
         SequenceFileOutputFormat writes a binary SequenceFile
         TextOutputFormat writes text files
  • 11. MapTask: Default Setup
       InputFormat: split files and read records
       MapRunnable: map all records in the task
       Mapper: map a record
       OutputCollector: consult Partitioner and save files
       Partitioner: assign key-value pairs to reducers
       Reducers retrieve files over HTTP
  • 12. MapTask / Map: MapRunnable
       Purpose: sequence of map operations
       Default Implementation
         public void run(RecordReader input, OutputCollector output,
                         Reporter reporter) throws IOException {
           try {
             WritableComparable key = input.createKey();
             Writable value = input.createValue();
             while (input.next(key, value)) {
               mapper.map(key, value, output, reporter);
             }
           } finally {
             mapper.close();
           }
         }
  • 13. MapTask / Map: Mapper
       Purpose: single map operation
       Important Functions
         void map(WritableComparable key, Writable value,
                  OutputCollector output, Reporter reporter);
       Pre-defined Mappers
         IdentityMapper
         InverseMapper flips key and value.
         RegexMapper matches regular expressions set in the job.
         TokenCountMapper implements the word count map.
  • 14. MapTask / Partition: Partitioner
       Purpose: decide which reducer handles map output.
       Important Functions
         int getPartition(WritableComparable key, Writable value, int numReduceTasks);
       Implementations
         HashPartitioner uses key.hashCode() % numReduceTasks.
         KeyFieldBasedPartitioner hashes only part of the key.
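The HashPartitioner idea fits in one line of plain Java. The sketch below (HashPartitionSketch is illustrative, not a Hadoop class) also masks the sign bit, an assumption added here so that negative hashCode values cannot produce a negative partition:

```java
class HashPartitionSketch {
    // Mask the sign bit so a negative hashCode cannot yield a negative
    // partition index; the result is always in [0, numReduceTasks).
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

Because the partition is a pure function of the key, every occurrence of a key, from any mapper, lands at the same reducer.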
  • 15. ReduceTask / Fetch and Sort
       Fetch
         TaskTracker tells the Reducer where mappers are
         Reducer requests input files from mappers via HTTP
       Merge Sort
         Recursively merges 10 files at a time
         100 MB in-memory sort buffer
         Calls the key’s Comparator, which defaults to key.compareTo
       Important Functions
         int WritableComparable.compareTo(Object o);
         int WritableComparator.compare(WritableComparable a, WritableComparable b);
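The 10-file merge is a k-way merge of already-sorted runs. A toy in-memory version using a priority queue (KWayMerge is illustrative, not a Hadoop class; integers stand in for serialized key-value records):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class KWayMerge {
    // Merge several sorted runs into one sorted list. Queue entries are
    // {value, runIndex, positionInRun}; the smallest head is popped and
    // replaced by the next element of its run.
    static List<Integer> merge(List<List<Integer>> runs) {
        PriorityQueue<int[]> pq =
            new PriorityQueue<>(Comparator.comparingInt((int[] e) -> e[0]));
        for (int i = 0; i < runs.size(); i++) {
            if (!runs.get(i).isEmpty()) {
                pq.add(new int[]{runs.get(i).get(0), i, 0});
            }
        }
        List<Integer> out = new ArrayList<>();
        while (!pq.isEmpty()) {
            int[] e = pq.poll();
            out.add(e[0]);
            int next = e[2] + 1;
            if (next < runs.get(e[1]).size()) {
                pq.add(new int[]{runs.get(e[1]).get(next), e[1], next});
            }
        }
        return out;
    }
}
```

Capping the fan-in (10 files in Hadoop's case) and recursing keeps the number of open file handles and the comparison work per output record bounded.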
  • 16. ReduceTask / Reduce
       Important Functions
         void reduce(WritableComparable key, Iterator values,
                     OutputCollector output, Reporter reporter);
       Pre-defined Reducers
         IdentityReducer
         LongSumReducer sums LongWritable values
       Behavior
         Reduce cannot start until all Mappers finish and their output is merged.
  • 17. Outline: Using Hadoop
       5 Performance: Combiners
       6 k-Means Clustering: Algorithm, Implementation
  • 18. Performance: Why We Care
       ≥ 10,000 programs
       Average 100,000 jobs/day
       ≥ 20 petabytes/day
       Source: Dean, Jeffrey and Ghemawat, Sanjay. MapReduce: Simplified Data Processing
       on Large Clusters. Commun. ACM 51 (2008), 107–113.
  • 19. Performance: Barriers
       Concept: barriers wait for N things to happen
       Examples
         Reduce waits for all Mappers to finish
         Job waits for all Reducers to finish
         Search engine assembles pieces of results
       Moral: worry about the maximum time. This implies balance.
  • 20. Performance / Combiners: Combiner
       Purpose: lessen network traffic by combining repeated keys in the MapTask.
       Important Functions
         void reduce(WritableComparable key, Iterator values,
                     OutputCollector output, Reporter reporter);
       Example Implementation
         LongSumReducer adds LongWritable values
       Behavior
         Framework decides when to call it.
         Uses the Reducer interface, but is called with a partial list of values.
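What a LongSumReducer-style combiner buys can be sketched without Hadoop: repeated keys in the map output collapse to a single (key, partial count) pair before the data crosses the network. CombineSketch is a hypothetical illustration, using a word-count map whose output value is implicitly 1 per key:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CombineSketch {
    // Collapse repeated keys from a map task's output into partial sums.
    // The reducer later adds these partial sums; since addition is
    // associative and commutative, combining does not change the result.
    static Map<String, Long> combine(List<String> mapOutputKeys) {
        Map<String, Long> combined = new HashMap<>();
        for (String key : mapOutputKeys) {
            combined.merge(key, 1L, Long::sum);
        }
        return combined;
    }
}
```

Three map outputs for two distinct keys become two pairs on the wire; for skewed word counts the savings are far larger.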
  • 21. Performance / Combiners: Extended Combining
       Problem
         1000 map outputs are buffered before combining.
         Keys can still be repeated enough to unbalance a reduce.
       Two Phase Reduce
         1 Run a MapReduce to combine values
           Use the Partitioner to balance a key over Reducers
           Run the Combiner in the Mapper and the Reducer
         2 Run a MapReduce to reduce values
           Map with IdentityMapper
           Partition normally
           Reduce normally
  • 22. Performance: General Advice
       Small Work Units
         More inputs than Mappers
         Ideally, more reduce tasks than Reducers
         Too many tasks increases overhead
         Aim for constant-memory Mappers and Reducers
       Map Only
         Skip IdentityReducer by setting numReduceTasks to -1
       Outside Tables
         Increase HDFS replication before launching
         Keep random access tables in memory
         Use multithreading to share memory
  • 23. k-Means Clustering: Data
       Netflix data
       Goal: find similar movies from ratings provided by users
       Vector Model
         Give each movie a vector
         Make one dimension per user
         Put origin at average rating (so poor is negative)
         Normalize all vectors to unit length
         Often called cosine similarity
       Issues
         - Users are biased in the movies they rate
         + Addresses different numbers of raters
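The vector model above reduces to centering, normalizing, and taking a dot product. A small sketch with dense arrays standing in for the sparse per-user dimensions of the real Netflix data (CosineSketch and its average parameter are illustrative assumptions, not part of the talk's code):

```java
class CosineSketch {
    // Put the origin at the average rating, then scale to unit length.
    static double[] normalize(double[] ratings, double average) {
        double[] v = new double[ratings.length];
        double norm = 0.0;
        for (int i = 0; i < ratings.length; i++) {
            v[i] = ratings[i] - average;   // poor ratings become negative
            norm += v[i] * v[i];
        }
        norm = Math.sqrt(norm);
        for (int i = 0; i < v.length; i++) {
            v[i] /= norm;
        }
        return v;
    }

    // On unit vectors the dot product is the cosine of the angle between
    // them, hence "cosine similarity".
    static double similarity(double[] a, double[] b) {
        double dot = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
        }
        return dot;
    }
}
```

Normalizing first is what makes movies with many raters comparable to movies with few, the "+" on the slide.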
  • 24. k-Means Clustering / Algorithm: k-Means Clustering
       [Figure: two-dimensional clusters]
       Goal: cluster similar data points
       Approach: given data points x[i] and distance d:
         Select k centers c
         Assign x[i] to closest center c[i]
         Minimize Σ_i d(x[i], c[i])
         d is sum of squares
  • 25. k-Means Clustering / Algorithm: Lloyd’s Algorithm
       Algorithm
         1 Randomly pick centers, possibly from data points
         2 Assign points to the closest center
         3 Average assigned points to obtain new centers
         4 Repeat 2 and 3 until nothing changes
       Issues
         - Takes superpolynomial time on some inputs
         - Not guaranteed to find an optimal solution
         + Converges quickly in practice
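Steps 2 and 3 can be sketched in one dimension: assign each point to its nearest center, then replace each center with the mean of its assigned points. LloydStep is a toy in-memory version, an assumption-laden stand-in for the MapReduce formulation:

```java
class LloydStep {
    // One Lloyd iteration: nearest-center assignment (step 2) followed by
    // averaging the assigned points into new centers (step 3).
    static double[] step(double[] points, double[] centers) {
        double[] sum = new double[centers.length];
        int[] count = new int[centers.length];
        for (double p : points) {
            int best = 0;
            for (int c = 1; c < centers.length; c++) {
                if (Math.abs(p - centers[c]) < Math.abs(p - centers[best])) {
                    best = c;
                }
            }
            sum[best] += p;
            count[best]++;
        }
        double[] next = centers.clone();  // a center with no points stays put
        for (int c = 0; c < centers.length; c++) {
            if (count[c] > 0) {
                next[c] = sum[c] / count[c];
            }
        }
        return next;
    }
}
```

Iterating step until the centers stop moving is the whole algorithm; in MapReduce the assignment becomes the map and the averaging becomes the reduce.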
  • 26. k-Means Clustering / Implementation: Lloyd’s Algorithm in MapReduce
       Reformatting Data: create a SequenceFile for fast reading. Partition as you see fit.
       Initialization: use a seeded random number generator to pick initial centers.
       Iteration: load the centers table in MapRunnable or Mapper.
       Termination: use TextOutputFormat to list movies in each cluster.
  • 27. k-Means Clustering / Implementation: Iterative MapReduce
       [Diagram: centers version i and the points feed the Mappers, which find the nearest
       center (key is center, value is movie); the Reducers average ratings to produce
       centers version i + 1]
  • 28. k-Means Clustering / Implementation: Direct Implementation
       Mapper
         Load all centers into RAM off HDFS
         For each movie, measure the distance to each center
         Output a key identifying the closest center
       Reducer
         Output average ratings of movies
       Issues
         - Brute-force distance and all centers in memory
         - Unbalanced reduce, possibly even for large k
  • 29. k-Means Clustering / Implementation: Two Phase Reduce Implementation
       1 Combine
         Mapper key identifies the closest center, value is the point.
         Partitioner balances centers over reducers.
         Combiner and Reducer add and count points.
       2 Recenter
         IdentityMapper
         Reducer averages values
       Issues
         + Balanced reduce
         - Two phases
         - Mapper still has all k centers in memory
  • 30. k-Means Clustering / Implementation: Large k Implementation
       Each map task is responsible for part of the movies and part of the k centers.
         For each movie, it finds the closest of its known centers.
         Output key is the point, value identifies the center and distance.
       Reducer takes the minimum-distance center.
         Output key identifies the center, value is the movie.
       A second phase averages the points in each center.
       Issues
         + Large k while still fitting in RAM
         - Reads data points multiple times
         - Startup and intermediate storage costs
  • 31. Exercises
       k-Means: run on part of Netflix to cluster movies
       Read about and implement Canopies:
       http://www.kamalnigam.com/papers/canopy-kdd00.pdf