SlideShare una empresa de Scribd logo
1 de 57
Big Data Analytics with
    Apache Hadoop
           Milind Bhandarkar




 @techmilind                             @milindb
               Data Computing Division
Speaker Profile
Milind Bhandarkar was the founding member of the team at Yahoo!
that took Apache Hadoop from 20-node prototype to datacenter-
scale production system, and has been contributing and working with
Hadoop since version 0.1.0. He started the Yahoo! Grid solutions
team focused on training, consulting, and supporting hundreds of new
migrants to Hadoop. Parallel programming languages and paradigms
has been his area of focus for over 20 years, and his area of
specialization for PhD (Computer Science) from University of Illinois
at Urbana-Champaign. He worked at the Center for Development of
Advanced       Computing    (C-DAC),       National    Center     for
Supercomputing Applications (NCSA), Center for Simulation of
Advanced Rockets, Siebel Systems, Pathscale Inc. (acquired by
QLogic), Yahoo! and Linkedin. Currently, he is the Chief Scientist,
Machine Learning Platforms at Greenplum, a division of EMC.


                   Data Computing Division
Outline

• Intro to Hadoop (10 mins)
• MapReduce (15 mins)
• Hadoop Examples (30 mins)
• Q &A

               Data Computing Division
Apache Hadoop
    Data Computing Division
Apache Hadoop

• January 2006: Subproject of Lucene
• January 2008: Top-level Apache project
• Stable Version: 1.0.3
• Latest Version: 2.0 alpha

                Data Computing Division
Apache Hadoop

• Reliable, Performant Distributed file system
• MapReduce Programming framework
• Ecosystem: HBase, Hive, Pig, Howl, Oozie,
  Zookeeper, Chukwa, Mahout, Cascading,
  Scribe, Cassandra, Hypertable, Voldemort,
  Azkaban, Sqoop, Flume, Avro ...


                 Data Computing Division
Problem: Bandwidth to Data

• Scan 100TB Datasets on 1000 node cluster
 • Remote storage @ 10MB/s = 165 mins
 • Local storage @ 50-200MB/s = 33-8 mins
• Moving computation is more efficient than
  moving data
 • Need visibility into data placement
                Data Computing Division
Problem: Scaling
          Reliably
• Failure is not an option, it’s a rule !
 • 1000 nodes, MTBF < 1 day
 • 4000 disks, 8000 cores, 25 switches, 1000
    NICs, 2000 DIMMS (16TB RAM)
• Need fault tolerant store with reasonable
  availability guarantees
  • Handle hardware faults transparently
                 Data Computing Division
Hadoop Goals
•   Scalable: Petabytes (1015 Bytes) of data on
    thousands on nodes
• Economical: Commodity components only
• Reliable
 • Engineering reliability into every
      application is expensive


                   Data Computing Division
Hadoop MapReduce



     Data Computing Division
Think MapReduce

• Record = (Key,Value)
• Key : Comparable, Serializable
• Value: Serializable
• Input, Map, Shuffle, Reduce, Output

                 Data Computing Division
Seems Familiar ?


cat /var/log/auth.log* | 
grep “session opened” | cut -d’ ‘ -f10 | 
sort | 
uniq -c > 
~/userlist




                       Data Computing Division
Map

• Input: (Key ,Value )
             1          1


• Output: List(Key ,Value )
                    2              2


• Projections, Filtering, Transformation

                 Data Computing Division
Shuffle

• Input: List(Key ,Value )
                 2                2


• Output
 • Sort(Partition(List(Key , List(Value ))))
                                           2   2


• Provided by Hadoop

                     Data Computing Division
Reduce

• Input: (Key , List(Value ))
              2                  2


• Output: List(Key ,Value )
                     3               3


• Aggregation

                  Data Computing Division
Hadoop Streaming
• Hadoop is written in Java
 • Java MapReduce code is “native”
• What about Non-Java Programmers ?
 • Perl, Python, Shell, R
 • grep, sed, awk, uniq as Mappers/Reducers
• Text Input and Output
                Data Computing Division
Hadoop Streaming
• Thin Java wrapper for Map & Reduce Tasks
• Forks actual Mapper & Reducer
• IPC via stdin, stdout, stderr
• Key.toString() t Value.toString() n
• Slower than Java programs
 • Allows for quick prototyping / debugging
                Data Computing Division
Hadoop Streaming

$ bin/hadoop jar hadoop-streaming.jar 
      -input in-files -output out-dir 
      -mapper mapper.sh -reducer reducer.sh
# mapper.sh

sed -e 's/ /n/g' | grep .

# reducer.sh

uniq -c | awk '{print $2 "t" $1}'




                      Data Computing Division
Hadoop Examples



     Data Computing Division
Example: Standard
         Deviation

•   Takeaway: Changing
    algorithm to suit
    architecture yields the
    best implementation




                        Data Computing Division
Implementation 1

• Two Map-Reduce stages
• First stage computes Mean
• Second stage computes standard deviation

                Data Computing Division
Stage 1: Compute Mean

• Map Input (x for i = 1 ..N )
             i                             m


• Map Output (N , Mean(x ))
                   m                   1..Nm


• Single Reducer
• Reduce Input (Group(Map Output))
• Reduce Output (Mean(x ))            1..N




                 Data Computing Division
Stage 2: Compute
   Standard Deviation
• Map Input (x for i = 1 ..N ) & Mean(x )
              i                             m       1..N


• Map Output (Sum(x – Mean(x)) for i =
                             i
                                                2

  1 ..Nm
• Single Reducer
• Reduce Input (Group (Map Output)) & N
• Reduce Output (σ)
                  Data Computing Division
Standard Deviation

•   Algebraically equivalent

•   Be careful about
    numerical accuracy,
    though




                          Data Computing Division
Implementation 2

• Map Input (x for i = 1 ..N )
             i                             m


• Map Output (N , [Sum(x ),Mean(x )])
                   m
                                      2
                                          1..Nm   1..Nm


• Single Reducer
• Reduce Input (Group (Map Output))
• Reduce Output (σ)
                 Data Computing Division
NGrams
 Data Computing Division
Bigrams
• Input: A large text corpus
• Output: List(word , Top (word ))
                      1          K       2


• Two Stages:
 • Generate all possible bigrams
 • Find most frequent K bigrams for each
    word

               Data Computing Division
Bigrams: Stage 1
           Map
• Generate all possible Bigrams
• Map Input: Large text corpus
• Map computation
 • In each sentence, or each “word word ”            1       2


 • Output (word , word ), (word , word )
                  1              2           2           1


• Partition & Sort by (word , word )     1       2



               Data Computing Division
pairs.pl
while(<STDIN>) {
     chomp;
     $_ =~ s/[^a-zA-Z]+/ /g ;
     $_ =~ s/^s+//g ;
     $_ =~ s/s+$//g ;
     $_ =~ tr/A-Z/a-z/;
     my @words = split(/s+/, $_);
     for (my $i = 0; $i < $#words - 1; ++$i) {
          print "$words[$i]:$words[$i+1]n”;
          print "$words[$i+1]:$words[$i]n”;
     }
}




                      Data Computing Division
Bigrams: Stage 1
          Reduce

• Input: List(word , word ) sorted and
                   1              2
  partitioned
• Output: List(word , [freq, word ])
                        1                  2


• Counting similar to Unigrams example

                 Data Computing Division
count.pl
$_ = <STDIN>;
chomp;
my ($pw1, $pw2) = split(/:/, $_);
$count = 1;
while(<STDIN>) {
  chomp;
  my ($w1, $w2) = split(/:/, $_);
  if ($w1 eq $pw1 && $w2 eq $pw2) {
    $count++;
  } else {
    print "$pw1:$count:$pw2n";
    $pw1 = $w1;
    $pw2 = $w2;
    $count = 1;
  }
}
print "$pw1:$count:$pw2n";


                      Data Computing Division
Bigrams: Stage 2
           Map
• Input: List(word , [freq,word ])
                   1                           2


• Output: List(word , [freq, word ])
                        1                          2


• Identity Mapper (/bin/cat)
• Partition by word    1


• Sort descending by (word , freq)         1




                 Data Computing Division
Bigrams: Stage 2
          Reduce
• Input: List(word , [freq,word ])
                   1                       2


 • partitioned by word           1


 • sorted descending by (word , freq)          1


• Output: Top (List(word , [freq, word ]))
             K                       1             2


• For each word, throw away after K records
                 Data Computing Division
firstN.pl
$N = 5;
$_ = <STDIN>;
chomp;
my ($pw1, $count, $pw2) = split(/:/, $_);
$idx = 1;
$out = "$pw1t$pw2,$count;";
while(<STDIN>) {
  chomp;
  my ($w1, $c, $w2) = split(/:/, $_);
  if ($w1 eq $pw1) {
    if ($idx < $N) {
      $out .= "$w2,$c;";
      $idx++;
    }
  } else {
    print "$outn";
    $pw1 = $w1;
    $idx = 1;
    $out = "$pw1t$w2,$c;";
  }
}
print "$outn";



                             Data Computing Division
Partitioner
• By default, evenly distributes keys
 • hashcode(key) % NumReducers
• Overriding partitioner
 • Skew in map-outputs
 • Restrictions on reduce outputs
   • All URLs in a domain together
                 Data Computing Division
Partitioner

// JobConf.setPartitionerClass(className)

public interface Partitioner <K, V>
    extends JobConfigurable {
  int getPartition(K key, V value, int maxPartitions);
}




                      Data Computing Division
Fully Sorted Output
• By contract, reducer gets input sorted on
  key
• Typically reducer output order is the same
  as input order
  • Each output file (part file) is sorted
• How to make sure that Keys in part i are
  all less than keys in part i+1 ?

                   Data Computing Division
Fully Sorted Output

• Use single reducer for small output
• Insight: Reducer input must be fully sorted
• Partitioner should provide fully sorted
  reduce input
• Sampling + Histogram equalization

                 Data Computing Division
Number of Maps
• Number of Input Splits
 • Number of HDFS blocks
• mapred.map.tasks
• Minimum Split Size (mapred.min.split.size)
• split_size = max(min(hdfs_block_size,
  data_size/#maps), min_split_size)

                 Data Computing Division
Parameter Sweeps
• External program processes data based on
  command-line parameters
• ./prog –params=“0.1,0.3” < in.dat > out.dat
• Objective: Run an instance of ./prog for each
  parameter combination
• Number of Mappers = Number of different
  parameter combinations

                 Data Computing Division
Parameter Sweeps
• Input File: params.txt
 • Each line contains one combination of
    parameters
• Input format is NLineInputFormat (N=1)
• Number of maps = Number of splits =
  Number of lines in params.txt


                 Data Computing Division
Auxiliary Files

• -file auxFile.dat
• Job submitter adds file to job.jar
• Unjarred on the task tracker
• Available to task as $cwd/auxFile.dat
• Not suitable for large / frequently used files
                  Data Computing Division
Auxiliary Files
• Tasks need to access “side” files
 • Read-only Dictionaries (such as for porn
    filtering)
 • Dynamically linked libraries
• Tasks themselves can fetch files from HDFS
 • Not Always ! (Hint: Unresolved symbols)
                 Data Computing Division
Distributed Cache

• Specify “side” files via –cacheFile
• If lot of such files needed
 • Create a tar.gz archive
 • Upload to HDFS
 • Specify via –cacheArchive
                  Data Computing Division
Distributed Cache

• TaskTracker downloads these files “once”
• Untars archives
• Accessible in task’s $cwd before task starts
• Cached across multiple tasks
• Cleaned up upon exit
                 Data Computing Division
Joining Multiple
           Datasets
• Datasets are streams of key-value pairs
• Could be split across multiple files in a
  single directory
• Join could be on Key, or any field in Value
• Join could be inner, outer, left outer, cross
  product etc
• Join is a natural Reduce operation
                  Data Computing Division
Example

• A = (id, name), B = (id, address)
• A is in /path/to/A/part-*
• B is in /path/to/B/part-*
• Select A.name, B.address where A.id == B.id

                  Data Computing Division
Map in Join

• Input: (Key ,Value ) from A or B
           1          1


 • map.input.file indicates A or B
    • MAP_INPUT_FILE in Streaming
• Output: (Key , [Value , A|B])
               2             2


 • Key is the Join Key
       2




                   Data Computing Division
Reduce in Join

• Input: Groups of [Value , A|B] for each Key
                                    2            2


• Operation depends on which kind of join
 • Inner join checks if key has values from
    both A & B
• Output: (Key , JoinFunction(Value ,…))
               2                             2




                   Data Computing Division
MR Join Performance
• Map Input = Total of A & B
• Map output = Total of A & B
• Shuffle & Sort
• Reduce input = Total of A & B
• Reduce output = Size of Joined dataset
• Filter and Project in Map
                Data Computing Division
Join Special Cases
• Fragment-Replicate
 • 100GB dataset with 100 MB dataset
• Equipartitioned Datasets
 • Identically Keyed
 • Equal Number of partitions
 • Each partition locally sorted
               Data Computing Division
Fragment-Replicate
• Fragment larger dataset
 • Specify as Map input
• Replicate smaller dataset
 • Use Distributed Cache
• Map-Only computation
 • No shuffle / sort
                Data Computing Division
Equipartitioned Join
• Available since Hadoop 0.16
• Datasets joined “before” input to mappers
• Input format: CompositeInputFormat
• mapred.join.expr
• Simpler to use in Java, but can be used in
  Streaming

                 Data Computing Division
Example

mapred.join.expr =
  inner (
    tbl (
       ....SequenceFileInputFormat.class,
"hdfs://namenode:8020/path/to/data/A"
    ),
    tbl (
       ....SequenceFileInputFormat.class,
"hdfs://namenode:8020/path/to/data/B"
    )
  )




                       Data Computing Division
Get Social @EMCAcademics




         Data Computing Division
Next Session:
Technology Lecture Series : Classic D
            Center
              on
         16 Aug 2012



            Data Computing Division
Questions ?



   Data Computing Division

Más contenido relacionado

La actualidad más candente

Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...CloudxLab
 
Map reduce: beyond word count
Map reduce: beyond word countMap reduce: beyond word count
Map reduce: beyond word countJeff Patti
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectMao Geng
 
Stratosphere Intro (Java and Scala Interface)
Stratosphere Intro (Java and Scala Interface)Stratosphere Intro (Java and Scala Interface)
Stratosphere Intro (Java and Scala Interface)Robert Metzger
 
Apache Spark - Key-Value RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Key-Value RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Key-Value RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Key-Value RDD | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceDr Ganesh Iyer
 
Data profiling with Apache Calcite
Data profiling with Apache CalciteData profiling with Apache Calcite
Data profiling with Apache CalciteJulian Hyde
 
Dive into Catalyst
Dive into CatalystDive into Catalyst
Dive into CatalystCheng Lian
 
DTCC '14 Spark Runtime Internals
DTCC '14 Spark Runtime InternalsDTCC '14 Spark Runtime Internals
DTCC '14 Spark Runtime InternalsCheng Lian
 
07 2
07 207 2
07 2a_b_g
 
User Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryUser Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryDatabricks
 
Should I Use Scalding or Scoobi or Scrunch?
Should I Use Scalding or Scoobi or Scrunch? Should I Use Scalding or Scoobi or Scrunch?
Should I Use Scalding or Scoobi or Scrunch? DataWorks Summit
 
Pivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew RayPivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew RaySpark Summit
 
Hadoop trainingin bangalore
Hadoop trainingin bangaloreHadoop trainingin bangalore
Hadoop trainingin bangaloreappaji intelhunt
 

La actualidad más candente (20)

Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
 
Map reduce: beyond word count
Map reduce: beyond word countMap reduce: beyond word count
Map reduce: beyond word count
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log project
 
Stratosphere Intro (Java and Scala Interface)
Stratosphere Intro (Java and Scala Interface)Stratosphere Intro (Java and Scala Interface)
Stratosphere Intro (Java and Scala Interface)
 
Apache Spark - Key-Value RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Key-Value RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Key-Value RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Key-Value RDD | Big Data Hadoop Spark Tutorial | CloudxLab
 
Apache Spark & Streaming
Apache Spark & StreamingApache Spark & Streaming
Apache Spark & Streaming
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
 
Data profiling with Apache Calcite
Data profiling with Apache CalciteData profiling with Apache Calcite
Data profiling with Apache Calcite
 
Scio
ScioScio
Scio
 
Om nom nom nom
Om nom nom nomOm nom nom nom
Om nom nom nom
 
Intro to Map Reduce
Intro to Map ReduceIntro to Map Reduce
Intro to Map Reduce
 
Dive into Catalyst
Dive into CatalystDive into Catalyst
Dive into Catalyst
 
DTCC '14 Spark Runtime Internals
DTCC '14 Spark Runtime InternalsDTCC '14 Spark Runtime Internals
DTCC '14 Spark Runtime Internals
 
MapReduce
MapReduceMapReduce
MapReduce
 
07 2
07 207 2
07 2
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 
User Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love StoryUser Defined Aggregation in Apache Spark: A Love Story
User Defined Aggregation in Apache Spark: A Love Story
 
Should I Use Scalding or Scoobi or Scrunch?
Should I Use Scalding or Scoobi or Scrunch? Should I Use Scalding or Scoobi or Scrunch?
Should I Use Scalding or Scoobi or Scrunch?
 
Pivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew RayPivoting Data with SparkSQL by Andrew Ray
Pivoting Data with SparkSQL by Andrew Ray
 
Hadoop trainingin bangalore
Hadoop trainingin bangaloreHadoop trainingin bangalore
Hadoop trainingin bangalore
 

Destacado

A new methodology for large scale nosql benchmarking
A new methodology for large scale nosql benchmarkingA new methodology for large scale nosql benchmarking
A new methodology for large scale nosql benchmarkingThibault Dory
 
[150824]symposium v4
[150824]symposium v4[150824]symposium v4
[150824]symposium v4yyooooon
 
Top 3 design patterns in Map Reduce
Top 3 design patterns in Map ReduceTop 3 design patterns in Map Reduce
Top 3 design patterns in Map ReduceEdureka!
 
Hivemail: Scalable Machine Learning Library for Apache Hive
Hivemail: Scalable Machine Learning Library for Apache HiveHivemail: Scalable Machine Learning Library for Apache Hive
Hivemail: Scalable Machine Learning Library for Apache HiveDataWorks Summit
 
Use cases for Hadoop and Big Data Analytics - InfoSphere BigInsights
Use cases for Hadoop and Big Data Analytics - InfoSphere BigInsightsUse cases for Hadoop and Big Data Analytics - InfoSphere BigInsights
Use cases for Hadoop and Big Data Analytics - InfoSphere BigInsightsGord Sissons
 
Harnessing hadoop for big data analytics v0.1
Harnessing hadoop for big data analytics v0.1Harnessing hadoop for big data analytics v0.1
Harnessing hadoop for big data analytics v0.1jobinwilson
 
Big Data Analytics for Non Programmers
Big Data Analytics for Non ProgrammersBig Data Analytics for Non Programmers
Big Data Analytics for Non ProgrammersEdureka!
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsLynn Langit
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reducerantav
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 

Destacado (12)

A new methodology for large scale nosql benchmarking
A new methodology for large scale nosql benchmarkingA new methodology for large scale nosql benchmarking
A new methodology for large scale nosql benchmarking
 
[150824]symposium v4
[150824]symposium v4[150824]symposium v4
[150824]symposium v4
 
Top 3 design patterns in Map Reduce
Top 3 design patterns in Map ReduceTop 3 design patterns in Map Reduce
Top 3 design patterns in Map Reduce
 
Hivemail: Scalable Machine Learning Library for Apache Hive
Hivemail: Scalable Machine Learning Library for Apache HiveHivemail: Scalable Machine Learning Library for Apache Hive
Hivemail: Scalable Machine Learning Library for Apache Hive
 
Big data Analytics Hadoop
Big data Analytics HadoopBig data Analytics Hadoop
Big data Analytics Hadoop
 
Use cases for Hadoop and Big Data Analytics - InfoSphere BigInsights
Use cases for Hadoop and Big Data Analytics - InfoSphere BigInsightsUse cases for Hadoop and Big Data Analytics - InfoSphere BigInsights
Use cases for Hadoop and Big Data Analytics - InfoSphere BigInsights
 
Harnessing hadoop for big data analytics v0.1
Harnessing hadoop for big data analytics v0.1Harnessing hadoop for big data analytics v0.1
Harnessing hadoop for big data analytics v0.1
 
Big Data Analytics for Non Programmers
Big Data Analytics for Non ProgrammersBig Data Analytics for Non Programmers
Big Data Analytics for Non Programmers
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
Customer Journey Analytics and Big Data
Customer Journey Analytics and Big DataCustomer Journey Analytics and Big Data
Customer Journey Analytics and Big Data
 

Similar a Big Data Analytics with Hadoop with @techmilind

Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
L19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptL19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptMaruthiPrasad96
 
Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)Matthew Lease
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxHARIKRISHNANU13
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)Paul Chao
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerankgothicane
 
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)Matthew Lease
 
Geoff Rothman Presentation on Parallel Processing
Geoff Rothman Presentation on Parallel ProcessingGeoff Rothman Presentation on Parallel Processing
Geoff Rothman Presentation on Parallel ProcessingGeoff Rothman
 
Big Data for Mobile
Big Data for MobileBig Data for Mobile
Big Data for MobileBugSense
 
Understanding Hadoop through examples
Understanding Hadoop through examplesUnderstanding Hadoop through examples
Understanding Hadoop through examplesYoshitomo Matsubara
 
Distributed Computing with Apache Hadoop. Introduction to MapReduce.
Distributed Computing with Apache Hadoop. Introduction to MapReduce.Distributed Computing with Apache Hadoop. Introduction to MapReduce.
Distributed Computing with Apache Hadoop. Introduction to MapReduce.Konstantin V. Shvachko
 
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsGreg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsFlink Forward
 
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ..."MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...Adrian Florea
 

Similar a Big Data Analytics with Hadoop with @techmilind (20)

Hadoop london
Hadoop londonHadoop london
Hadoop london
 
Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
L19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptL19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .ppt
 
Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 2: Data-Intensive Computing for Text Analysis (Fall 2011)
 
iot.pptx
iot.pptxiot.pptx
iot.pptx
 
Hadoop classes in mumbai
Hadoop classes in mumbaiHadoop classes in mumbai
Hadoop classes in mumbai
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptx
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
 
Enar short course
Enar short courseEnar short course
Enar short course
 
Behm Shah Pagerank
Behm Shah PagerankBehm Shah Pagerank
Behm Shah Pagerank
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)
 
Geoff Rothman Presentation on Parallel Processing
Geoff Rothman Presentation on Parallel ProcessingGeoff Rothman Presentation on Parallel Processing
Geoff Rothman Presentation on Parallel Processing
 
Big Data for Mobile
Big Data for MobileBig Data for Mobile
Big Data for Mobile
 
MapReduce basics
MapReduce basicsMapReduce basics
MapReduce basics
 
Understanding Hadoop through examples
Understanding Hadoop through examplesUnderstanding Hadoop through examples
Understanding Hadoop through examples
 
Distributed Computing with Apache Hadoop. Introduction to MapReduce.
Distributed Computing with Apache Hadoop. Introduction to MapReduce.Distributed Computing with Apache Hadoop. Introduction to MapReduce.
Distributed Computing with Apache Hadoop. Introduction to MapReduce.
 
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsGreg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
 
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ..."MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
"MapReduce: Simplified Data Processing on Large Clusters" Paper Presentation ...
 

Más de EMC

INDUSTRY-LEADING TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
INDUSTRY-LEADING  TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUDINDUSTRY-LEADING  TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
INDUSTRY-LEADING TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUDEMC
 
Cloud Foundry Summit Berlin Keynote
Cloud Foundry Summit Berlin Keynote Cloud Foundry Summit Berlin Keynote
Cloud Foundry Summit Berlin Keynote EMC
 
EMC GLOBAL DATA PROTECTION INDEX
EMC GLOBAL DATA PROTECTION INDEX EMC GLOBAL DATA PROTECTION INDEX
EMC GLOBAL DATA PROTECTION INDEX EMC
 
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIOTransforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIOEMC
 
Citrix ready-webinar-xtremio
Citrix ready-webinar-xtremioCitrix ready-webinar-xtremio
Citrix ready-webinar-xtremioEMC
 
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES EMC
 
EMC with Mirantis Openstack
EMC with Mirantis OpenstackEMC with Mirantis Openstack
EMC with Mirantis OpenstackEMC
 
Modern infrastructure for business data lake
Modern infrastructure for business data lakeModern infrastructure for business data lake
Modern infrastructure for business data lakeEMC
 
Force Cyber Criminals to Shop Elsewhere
Force Cyber Criminals to Shop ElsewhereForce Cyber Criminals to Shop Elsewhere
Force Cyber Criminals to Shop ElsewhereEMC
 
Pivotal : Moments in Container History
Pivotal : Moments in Container History Pivotal : Moments in Container History
Pivotal : Moments in Container History EMC
 
Data Lake Protection - A Technical Review
Data Lake Protection - A Technical ReviewData Lake Protection - A Technical Review
Data Lake Protection - A Technical ReviewEMC
 
Mobile E-commerce: Friend or Foe
Mobile E-commerce: Friend or FoeMobile E-commerce: Friend or Foe
Mobile E-commerce: Friend or FoeEMC
 
Virtualization Myths Infographic
Virtualization Myths Infographic Virtualization Myths Infographic
Virtualization Myths Infographic EMC
 
Intelligence-Driven GRC for Security
Intelligence-Driven GRC for SecurityIntelligence-Driven GRC for Security
Intelligence-Driven GRC for SecurityEMC
 
The Trust Paradox: Access Management and Trust in an Insecure Age
The Trust Paradox: Access Management and Trust in an Insecure AgeThe Trust Paradox: Access Management and Trust in an Insecure Age
The Trust Paradox: Access Management and Trust in an Insecure AgeEMC
 
EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015EMC
 
EMC Academic Summit 2015
EMC Academic Summit 2015EMC Academic Summit 2015
EMC Academic Summit 2015EMC
 
Data Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education ServicesData Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education ServicesEMC
 
Using EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere EnvironmentsUsing EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere EnvironmentsEMC
 
Using EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBookUsing EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBookEMC
 

Más de EMC (20)

INDUSTRY-LEADING TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
INDUSTRY-LEADING  TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUDINDUSTRY-LEADING  TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
INDUSTRY-LEADING TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
 
Cloud Foundry Summit Berlin Keynote
Cloud Foundry Summit Berlin Keynote Cloud Foundry Summit Berlin Keynote
Cloud Foundry Summit Berlin Keynote
 
EMC GLOBAL DATA PROTECTION INDEX
EMC GLOBAL DATA PROTECTION INDEX EMC GLOBAL DATA PROTECTION INDEX
EMC GLOBAL DATA PROTECTION INDEX
 
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIOTransforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
 
Citrix ready-webinar-xtremio
Citrix ready-webinar-xtremioCitrix ready-webinar-xtremio
Citrix ready-webinar-xtremio
 
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
 
EMC with Mirantis Openstack
EMC with Mirantis OpenstackEMC with Mirantis Openstack
EMC with Mirantis Openstack
 
Modern infrastructure for business data lake
Modern infrastructure for business data lakeModern infrastructure for business data lake
Modern infrastructure for business data lake
 
Force Cyber Criminals to Shop Elsewhere
Force Cyber Criminals to Shop ElsewhereForce Cyber Criminals to Shop Elsewhere
Force Cyber Criminals to Shop Elsewhere
 
Pivotal : Moments in Container History
Pivotal : Moments in Container History Pivotal : Moments in Container History
Pivotal : Moments in Container History
 
Data Lake Protection - A Technical Review
Data Lake Protection - A Technical ReviewData Lake Protection - A Technical Review
Data Lake Protection - A Technical Review
 
Mobile E-commerce: Friend or Foe
Mobile E-commerce: Friend or FoeMobile E-commerce: Friend or Foe
Mobile E-commerce: Friend or Foe
 
Virtualization Myths Infographic
Virtualization Myths Infographic Virtualization Myths Infographic
Virtualization Myths Infographic
 
Intelligence-Driven GRC for Security
Intelligence-Driven GRC for SecurityIntelligence-Driven GRC for Security
Intelligence-Driven GRC for Security
 
The Trust Paradox: Access Management and Trust in an Insecure Age
The Trust Paradox: Access Management and Trust in an Insecure AgeThe Trust Paradox: Access Management and Trust in an Insecure Age
The Trust Paradox: Access Management and Trust in an Insecure Age
 
EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015
 
EMC Academic Summit 2015
EMC Academic Summit 2015EMC Academic Summit 2015
EMC Academic Summit 2015
 
Data Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education ServicesData Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education Services
 
Using EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere EnvironmentsUsing EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere Environments
 
Using EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBookUsing EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBook
 

Último

Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 

Último (20)

Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 

Big Data Analytics with Hadoop with @techmilind

  • 1. Big Data Analytics with Apache Hadoop Milind Bhandarkar @techmilind @milindb Data Computing Division
  • 2. Speaker Profile Milind Bhandarkar was the founding member of the team at Yahoo! that took Apache Hadoop from 20-node prototype to datacenter- scale production system, and has been contributing and working with Hadoop since version 0.1.0. He started the Yahoo! Grid solutions team focused on training, consulting, and supporting hundreds of new migrants to Hadoop. Parallel programming languages and paradigms has been his area of focus for over 20 years, and his area of specialization for PhD (Computer Science) from University of Illinois at Urbana-Champaign. He worked at the Center for Development of Advanced Computing (C-DAC), National Center for Supercomputing Applications (NCSA), Center for Simulation of Advanced Rockets, Siebel Systems, Pathscale Inc. (acquired by QLogic), Yahoo! and Linkedin. Currently, he is the Chief Scientist, Machine Learning Platforms at Greenplum, a division of EMC. Data Computing Division
  • 3. Outline • Intro to Hadoop (10 mins) • MapReduce (15 mins) • Hadoop Examples (30 mins) • Q &A Data Computing Division
  • 4. Apache Hadoop Data Computing Division
  • 5. Apache Hadoop • January 2006: Subproject of Lucene • January 2008: Top-level Apache project • Stable Version: 1.0.3 • Latest Version: 2.0 alpha Data Computing Division
  • 6. Apache Hadoop • Reliable, Performant Distributed file system • MapReduce Programming framework • Ecosystem: HBase, Hive, Pig, Howl, Oozie, Zookeeper, Chukwa, Mahout, Cascading, Scribe, Cassandra, Hypertable, Voldemort, Azkaban, Sqoop, Flume, Avro ... Data Computing Division
  • 7. Problem: Bandwidth to Data • Scan 100TB Datasets on 1000 node cluster • Remote storage @ 10MB/s = 165 mins • Local storage @ 50-200MB/s = 33-8 mins • Moving computation is more efficient than moving data • Need visibility into data placement Data Computing Division
  • 8. Problem: Scaling Reliably • Failure is not an option, it’s a rule ! • 1000 nodes, MTBF < 1 day • 4000 disks, 8000 cores, 25 switches, 1000 NICs, 2000 DIMMS (16TB RAM) • Need fault tolerant store with reasonable availability guarantees • Handle hardware faults transparently Data Computing Division
  • 9. Hadoop Goals • Scalable: Petabytes (1015 Bytes) of data on thousands on nodes • Economical: Commodity components only • Reliable • Engineering reliability into every application is expensive Data Computing Division
  • 10. Hadoop MapReduce Data Computing Division
  • 11. Think MapReduce • Record = (Key,Value) • Key : Comparable, Serializable • Value: Serializable • Input, Map, Shuffle, Reduce, Output Data Computing Division
  • 12. Seems Familiar ? cat /var/log/auth.log* | grep “session opened” | cut -d’ ‘ -f10 | sort | uniq -c > ~/userlist Data Computing Division
  • 13. Map • Input: (Key ,Value ) 1 1 • Output: List(Key ,Value ) 2 2 • Projections, Filtering, Transformation Data Computing Division
  • 14. Shuffle • Input: List(Key ,Value ) 2 2 • Output • Sort(Partition(List(Key , List(Value )))) 2 2 • Provided by Hadoop Data Computing Division
  • 15. Reduce • Input: (Key , List(Value )) 2 2 • Output: List(Key ,Value ) 3 3 • Aggregation Data Computing Division
  • 16. Hadoop Streaming • Hadoop is written in Java • Java MapReduce code is “native” • What about Non-Java Programmers ? • Perl, Python, Shell, R • grep, sed, awk, uniq as Mappers/Reducers • Text Input and Output Data Computing Division
  • 17. Hadoop Streaming • Thin Java wrapper for Map & Reduce Tasks • Forks actual Mapper & Reducer • IPC via stdin, stdout, stderr • Key.toString() t Value.toString() n • Slower than Java programs • Allows for quick prototyping / debugging Data Computing Division
  • 18. Hadoop Streaming $ bin/hadoop jar hadoop-streaming.jar -input in-files -output out-dir -mapper mapper.sh -reducer reducer.sh # mapper.sh sed -e 's/ /n/g' | grep . # reducer.sh uniq -c | awk '{print $2 "t" $1}' Data Computing Division
  • 19. Hadoop Examples Data Computing Division
  • 20. Example: Standard Deviation • Takeaway: Changing algorithm to suit architecture yields the best implementation Data Computing Division
  • 21. Implementation 1 • Two Map-Reduce stages • First stage computes Mean • Second stage computes standard deviation Data Computing Division
  • 22. Stage 1: Compute Mean • Map Input (x for i = 1 ..N ) i m • Map Output (N , Mean(x )) m 1..Nm • Single Reducer • Reduce Input (Group(Map Output)) • Reduce Output (Mean(x )) 1..N Data Computing Division
  • 23. Stage 2: Compute Standard Deviation • Map Input (x for i = 1 ..N ) & Mean(x ) i m 1..N • Map Output (Sum(x – Mean(x)) for i = i 2 1 ..Nm • Single Reducer • Reduce Input (Group (Map Output)) & N • Reduce Output (σ) Data Computing Division
  • 24. Standard Deviation • Algebraically equivalent • Be careful about numerical accuracy, though Data Computing Division
  • 25. Implementation 2 • Map Input (x for i = 1 ..N ) i m • Map Output (N , [Sum(x ),Mean(x )]) m 2 1..Nm 1..Nm • Single Reducer • Reduce Input (Group (Map Output)) • Reduce Output (σ) Data Computing Division
  • 27. Bigrams • Input: A large text corpus • Output: List(word , Top (word )) 1 K 2 • Two Stages: • Generate all possible bigrams • Find most frequent K bigrams for each word Data Computing Division
  • 28. Bigrams: Stage 1 Map • Generate all possible Bigrams • Map Input: Large text corpus • Map computation • In each sentence, or each “word word ” 1 2 • Output (word , word ), (word , word ) 1 2 2 1 • Partition & Sort by (word , word ) 1 2 Data Computing Division
  • 29. pairs.pl while(<STDIN>) { chomp; $_ =~ s/[^a-zA-Z]+/ /g ; $_ =~ s/^s+//g ; $_ =~ s/s+$//g ; $_ =~ tr/A-Z/a-z/; my @words = split(/s+/, $_); for (my $i = 0; $i < $#words - 1; ++$i) { print "$words[$i]:$words[$i+1]n”; print "$words[$i+1]:$words[$i]n”; } } Data Computing Division
  • 30. Bigrams: Stage 1 Reduce • Input: List(word , word ) sorted and 1 2 partitioned • Output: List(word , [freq, word ]) 1 2 • Counting similar to Unigrams example Data Computing Division
  • 31. count.pl $_ = <STDIN>; chomp; my ($pw1, $pw2) = split(/:/, $_); $count = 1; while(<STDIN>) { chomp; my ($w1, $w2) = split(/:/, $_); if ($w1 eq $pw1 && $w2 eq $pw2) { $count++; } else { print "$pw1:$count:$pw2n"; $pw1 = $w1; $pw2 = $w2; $count = 1; } } print "$pw1:$count:$pw2n"; Data Computing Division
  • 32. Bigrams: Stage 2 Map • Input: List(word , [freq,word ]) 1 2 • Output: List(word , [freq, word ]) 1 2 • Identity Mapper (/bin/cat) • Partition by word 1 • Sort descending by (word , freq) 1 Data Computing Division
  • 33. Bigrams: Stage 2 Reduce • Input: List(word , [freq,word ]) 1 2 • partitioned by word 1 • sorted descending by (word , freq) 1 • Output: Top (List(word , [freq, word ])) K 1 2 • For each word, throw away after K records Data Computing Division
  • 34. firstN.pl $N = 5; $_ = <STDIN>; chomp; my ($pw1, $count, $pw2) = split(/:/, $_); $idx = 1; $out = "$pw1t$pw2,$count;"; while(<STDIN>) { chomp; my ($w1, $c, $w2) = split(/:/, $_); if ($w1 eq $pw1) { if ($idx < $N) { $out .= "$w2,$c;"; $idx++; } } else { print "$outn"; $pw1 = $w1; $idx = 1; $out = "$pw1t$w2,$c;"; } } print "$outn"; Data Computing Division
  • 35. Partitioner • By default, evenly distributes keys • hashcode(key) % NumReducers • Overriding partitioner • Skew in map-outputs • Restrictions on reduce outputs • All URLs in a domain together Data Computing Division
  • 36. Partitioner // JobConf.setPartitionerClass(className) public interface Partitioner <K, V> extends JobConfigurable { int getPartition(K key, V value, int maxPartitions); } Data Computing Division
  • 37. Fully Sorted Output • By contract, reducer gets input sorted on key • Typically reducer output order is the same as input order • Each output file (part file) is sorted • How to make sure that Keys in part i are all less than keys in part i+1 ? Data Computing Division
  • 38. Fully Sorted Output • Use single reducer for small output • Insight: Reducer input must be fully sorted • Partitioner should provide fully sorted reduce input • Sampling + Histogram equalization Data Computing Division
  • 39. Number of Maps • Number of Input Splits • Number of HDFS blocks • mapred.map.tasks • Minimum Split Size (mapred.min.split.size) • split_size = max(min(hdfs_block_size, data_size/#maps), min_split_size) Data Computing Division
  • 40. Parameter Sweeps • External program processes data based on command-line parameters • ./prog –params=“0.1,0.3” < in.dat > out.dat • Objective: Run an instance of ./prog for each parameter combination • Number of Mappers = Number of different parameter combinations Data Computing Division
  • 41. Parameter Sweeps • Input File: params.txt • Each line contains one combination of parameters • Input format is NLineInputFormat (N=1) • Number of maps = Number of splits = Number of lines in params.txt Data Computing Division
  • 42. Auxiliary Files • -file auxFile.dat • Job submitter adds file to job.jar • Unjarred on the task tracker • Available to task as $cwd/auxFile.dat • Not suitable for large / frequently used files Data Computing Division
  • 43. Auxiliary Files • Tasks need to access “side” files • Read-only Dictionaries (such as for porn filtering) • Dynamically linked libraries • Tasks themselves can fetch files from HDFS • Not Always ! (Hint: Unresolved symbols) Data Computing Division
  • 44. Distributed Cache • Specify “side” files via –cacheFile • If lot of such files needed • Create a tar.gz archive • Upload to HDFS • Specify via –cacheArchive Data Computing Division
  • 45. Distributed Cache • TaskTracker downloads these files “once” • Untars archives • Accessible in task’s $cwd before task starts • Cached across multiple tasks • Cleaned up upon exit Data Computing Division
  • 46. Joining Multiple Datasets • Datasets are streams of key-value pairs • Could be split across multiple files in a single directory • Join could be on Key, or any field in Value • Join could be inner, outer, left outer, cross product etc • Join is a natural Reduce operation Data Computing Division
  • 47. Example • A = (id, name), B = (id, address) • A is in /path/to/A/part-* • B is in /path/to/B/part-* • Select A.name, B.address where A.id == B.id Data Computing Division
  • 48. Map in Join • Input: (Key ,Value ) from A or B 1 1 • map.input.file indicates A or B • MAP_INPUT_FILE in Streaming • Output: (Key , [Value , A|B]) 2 2 • Key is the Join Key 2 Data Computing Division
  • 49. Reduce in Join • Input: Groups of [Value , A|B] for each Key 2 2 • Operation depends on which kind of join • Inner join checks if key has values from both A & B • Output: (Key , JoinFunction(Value ,…)) 2 2 Data Computing Division
  • 50. MR Join Performance • Map Input = Total of A & B • Map output = Total of A & B • Shuffle & Sort • Reduce input = Total of A & B • Reduce output = Size of Joined dataset • Filter and Project in Map Data Computing Division
  • 51. Join Special Cases • Fragment-Replicate • 100GB dataset with 100 MB dataset • Equipartitioned Datasets • Identically Keyed • Equal Number of partitions • Each partition locally sorted Data Computing Division
  • 52. Fragment-Replicate • Fragment larger dataset • Specify as Map input • Replicate smaller dataset • Use Distributed Cache • Map-Only computation • No shuffle / sort Data Computing Division
  • 53. Equipartitioned Join • Available since Hadoop 0.16 • Datasets joined “before” input to mappers • Input format: CompositeInputFormat • mapred.join.expr • Simpler to use in Java, but can be used in Streaming Data Computing Division
  • 54. Example mapred.join.expr = inner ( tbl ( ....SequenceFileInputFormat.class, "hdfs://namenode:8020/path/to/data/A" ), tbl ( ....SequenceFileInputFormat.class, "hdfs://namenode:8020/path/to/data/B" ) ) Data Computing Division
  • 55. Get Social @EMCAcademics Data Computing Division
  • 56. Next Session: Technology Lecture Series : Classic D Center on 16 Aug 2012 Data Computing Division
  • 57. Questions ? Data Computing Division