Processing Megadata With Python and Hadoop July 2010 TriHUG Ryan Cox www.asciiarmor.com
0029029070999991901010106004+64333+023450FM-12+000599999V0202701N015919999999N0000001N9-00781+99999102001ADDGF108991999999999999999999
0029029070999991901010113004+64333+023450FM-12+000599999V0202901N008219999999N0000001N9-00721+99999102001ADDGF104991999999999999999999
0029029070999991901010120004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00941+99999102001ADDGF108991999999999999999999
(sample fixed-width weather records; several slides of further records omitted)

~130 GB NCDC climate dataset
high_temp = 0
for line in open('1901'):
    line = line.strip()
    (year, temp, quality) = (line[15:19], line[87:92], line[92:93])
    if temp != "+9999" and quality in "01459":
        high_temp = max(high_temp, float(temp))
print high_temp

How can we make this scale? (and do more interesting things)
Jeffrey Dean – Google – 2004: “Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical record in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.”
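The model described in the quote — map each record to intermediate key/value pairs, group by key, reduce each group — can be sketched in a few lines of plain Python. This is an illustrative word-count example, not code from the paper or the slides:

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, mapper, reducer):
    # map: each record -> zero or more intermediate (key, value) pairs
    intermediate = [kv for rec in records for kv in mapper(rec)]
    # shuffle: bring together all values that share a key
    intermediate.sort(key=itemgetter(0))
    # reduce: combine the values for each key
    return {key: reducer(key, [v for _, v in group])
            for key, group in groupby(intermediate, key=itemgetter(0))}

# Word count, the canonical example
mapper = lambda line: [(word, 1) for word in line.split()]
reducer = lambda key, values: sum(values)
print(run_mapreduce(["a b a", "b a"], mapper, reducer))  # {'a': 3, 'b': 2}
```

A real framework distributes the map calls and the per-key reduce calls across machines; the sort stands in for the shuffle phase.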
def mapper(line):
    line = line.strip()
    (year, temp, quality) = (line[15:19], line[87:92], line[92:93])
    if temp != "+9999" and quality in "01459":
        return float(temp)
    return None

output = map(mapper, open('1901'))
print reduce(max, output)

MapReduce in pure Python
mapper.py:

import sys, re

for line in sys.stdin:
    val = line.strip()
    (year, temp, q) = (val[15:19], val[87:92], val[92:93])
    if (temp != "+9999" and re.match("[01459]", q)):
        print "%s\t%s" % (year, temp)

reducer.py:

import sys

(last_key, max_val) = (None, 0)
for line in sys.stdin:
    (key, val) = line.strip().split("\t")
    if last_key and last_key != key:
        print "%s\t%s" % (last_key, max_val)
        (last_key, max_val) = (key, int(val))
    else:
        (last_key, max_val) = (key, max(max_val, int(val)))
if last_key:
    print "%s\t%s" % (last_key, max_val)

cat dataFile | ./mapper.py | sort | ./reducer.py

Hadoop Streaming
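Before pushing this to a cluster, the streaming pair can be sanity-checked locally by simulating `cat | mapper | sort | reducer` in-process. The sketch below restates the mapper and reducer logic as functions (Python 3 syntax) and fabricates minimal fixed-width records so only the sliced fields matter:

```python
import re

def map_record(line):
    # One streaming mapper step: fixed-width NCDC record ->
    # tab-separated "year<TAB>temp" line, or None if filtered.
    val = line.strip()
    year, temp, q = val[15:19], val[87:92], val[92:93]
    if temp != "+9999" and re.match("[01459]", q):
        return "%s\t%s" % (year, temp)
    return None

def reduce_sorted(lines):
    # The streaming reducer: expects mapper output already sorted by
    # key, exactly as `sort` would leave it between the two stages.
    last_key, max_val = None, 0
    out = []
    for line in lines:
        key, val = line.split("\t")
        if last_key and last_key != key:
            out.append((last_key, max_val))
            last_key, max_val = key, int(val)
        else:
            last_key, max_val = key, max(max_val, int(val))
    if last_key:
        out.append((last_key, max_val))
    return out

def fake_record(year, temp, q):
    # Hypothetical test helper: pad a 93-char record with zeros so
    # stripping is a no-op and only the sliced positions carry data.
    line = ["0"] * 93
    line[15:19] = year
    line[87:92] = temp
    line[92] = q
    return "".join(line)

records = [fake_record("1901", "-0078", "1"),
           fake_record("1901", "+0006", "1"),
           fake_record("1902", "+9999", "1")]   # missing reading, filtered out
mapped = [m for m in (map_record(r) for r in records) if m is not None]
print(reduce_sorted(sorted(mapped)))   # one (year, max_temp) pair per year
```

The `fake_record` helper and the sample values are invented for the test; real input would be the `1901` NCDC file.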
Dumbo
def mapper(key, value):
    line = value.strip()
    (year, temp, quality) = (line[15:19], line[87:92], line[92:93])
    if temp != "+9999" and quality in "01459":
        yield year, int(temp)

def reducer(key, values):
    yield key, max(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer, reducer)  # third argument: the reducer doubles as combiner

Dumbo
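Dumbo runs these generators on Hadoop, but they can be exercised without a cluster by a small driver that mimics one map/shuffle/reduce pass. This is a sketch for local testing, not part of Dumbo's API, and the `record` helper fabricates fixed-width input:

```python
from itertools import groupby
from operator import itemgetter

def local_run(mapper, reducer, inputs):
    # Mimic one map/shuffle/reduce pass over (key, value) inputs,
    # feeding grouped values to the generator-style reducer.
    mapped = [kv for k, v in inputs for kv in mapper(k, v)]
    mapped.sort(key=itemgetter(0))
    return [out
            for key, group in groupby(mapped, key=itemgetter(0))
            for out in reducer(key, (v for _, v in group))]

def mapper(key, value):
    line = value.strip()
    year, temp, quality = line[15:19], line[87:92], line[92:93]
    if temp != "+9999" and quality in "01459":
        yield year, int(temp)

def reducer(key, values):
    yield key, max(values)

def record(year, temp, q):
    # Hypothetical helper: pad a fake 93-char record so only the
    # sliced fields (15:19, 87:92, 92) carry meaningful data.
    return "x" * 15 + year + "x" * 68 + temp + q

inputs = [(0, record("1901", "-0078", "1")),
          (1, record("1901", "+0006", "1"))]
print(local_run(mapper, reducer, inputs))  # [('1901', 6)]
```

On a real run, Dumbo supplies the inputs from HDFS and the shuffle happens across the cluster; only the two generators above are the job.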
Job / Iteration Abstraction
Counter / Status Abstraction
Simplified Joining mechanism
Ability to use non-Java combiners
Built-in library of mappers / reducers
Excellent way to model MR algorithms
CLI – API – Web Console. Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). Elastic MapReduce
Elastic MapReduce
Use Hadoop’s Job Tracker or Amazon’s ‘Debugger’. Elastic MapReduce
CloudWatch metrics. Elastic MapReduce

  • 1. Processing Megadata With Python and Hadoop July 2010 TriHUG Ryan Cox www.asciiarmor.com
  • 2. 0029029070999991901010106004+64333+023450FM-12+000599999V0202701N015919999999N0000001N9-00781+99999102001ADDGF108991999999999999999999 0029029070999991901010113004+64333+023450FM-12+000599999V0202901N008219999999N0000001N9-00721+99999102001ADDGF104991999999999999999999 0029029070999991901010120004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00941+99999102001ADDGF108991999999999999999999 0029029070999991901010206004+64333+023450FM-12+000599999V0201801N008219999999N0000001N9-00611+99999101831ADDGF108991999999999999999999 0029029070999991901010213004+64333+023450FM-12+000599999V0201801N009819999999N0000001N9-00561+99999101761ADDGF108991999999999999999999 0029029070999991901010220004+64333+023450FM-12+000599999V0201801N009819999999N0000001N9-00281+99999101751ADDGF108991999999999999999999 0029029070999991901010306004+64333+023450FM-12+000599999V0202001N009819999999N0000001N9-00671+99999101701ADDGF106991999999999999999999 0029029070999991901010313004+64333+023450FM-12+000599999V0202301N011819999999N0000001N9-00331+99999101741ADDGF108991999999999999999999 0029029070999991901010320004+64333+023450FM-12+000599999V0202301N011819999999N0000001N9-00281+99999101741ADDGF108991999999999999999999 0029029070999991901010406004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00331+99999102311ADDGF108991999999999999999999 0029029070999991901010413004+64333+023450FM-12+000599999V0202301N008219999999N0000001N9-00441+99999102261ADDGF108991999999999999999999 0029029070999991901010420004+64333+023450FM-12+000599999V0202001N011819999999N0000001N9-00391+99999102231ADDGF108991999999999999999999 0029029070999991901010506004+64333+023450FM-12+000599999V0202701N004119999999N0000001N9+00001+99999101821ADDGF104991999999999999999999 0029029070999991901010513004+64333+023450FM-12+000599999V0202701N002119999999N0000001N9+00061+99999102591ADDGF104991999999999999999999 
0029029070999991901010520004+64333+023450FM-12+000599999V0202301N004119999999N0000001N9+00001+99999102671ADDGF104991999999999999999999 0029029070999991901010606004+64333+023450FM-12+000599999V0202701N006219999999N0000001N9+00061+99999102751ADDGF103991999999999999999999 0029029070999991901010613004+64333+023450FM-12+000599999V0202701N006219999999N0000001N9+00061+99999102981ADDGF100991999999999999999999 0029029070999991901010620004+64333+023450FM-12+000599999V0203201N002119999999N0000001N9-00111+99999103191ADDGF100991999999999999999999 0029029070999991901010706004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00331+99999103341ADDGF100991999999999999999999 0029029070999991901010713004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00501+99999103321ADDGF100991999999999999999999 0029029070999991901010720004+64333+023450FM-12+000599999V0202001N009819999999N0000001N9-00441+99999103321ADDGF100991999999999999999999 0029029070999991901010806004+64333+023450FM-12+000599999V0202301N009819999999N0000001N9-00281+99999103221ADDGF108991999999999999999999 0029029070999991901010813004+64333+023450FM-12+000599999V0202301N011819999999N0000001N9-00331+99999103201ADDGF108991999999999999999999 0035029070999991901010820004+64333+023450FM-12+000599999V0202301N013919999999N0000001N9-00331+99999102991ADDGF108991999999999999999999MW1701 0029029070999991901010906004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00501+99999102871ADDGF108991999999999999999999 0029029070999991901010913004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00331+99999102661ADDGF108991999999999999999999 0029029070999991901010920004+64333+023450FM-12+000599999V0201801N009819999999N0000001N9-00281+99999102391ADDGF108991999999999999999999 0029029070999991901011006004+64333+023450FM-12+000599999V0202301N009819999999N0000001N9-00441+99999101601ADDGF100991999999999999999999 
0029029070999991901011013004+64333+023450FM-12+000599999V0202301N011819999999N0000001N9-00441+99999101481ADDGF100991999999999999999999 0029029070999991901011020004+64333+023450FM-12+000599999V0202301N013919999999N0000001N9-00441+99999101381ADDGF100991999999999999999999 0029029070999991901011106004+64333+023450FM-12+000599999V0202501N006219999999N0000001N9-00391+99999101061ADDGF100991999999999999999999 0029029070999991901011113004+64333+023450FM-12+000599999V0202701N008219999999N0000001N9-00501+99999101141ADDGF100991999999999999999999 0029029070999991901011120004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00441+99999101261ADDGF100991999999999999999999 0029029070999991901011206004+64333+023450FM-12+000599999V0202701N004119999999N0000001N9-00391+99999101311ADDGF104991999999999999999999 0029029070999991901011213004+64333+023450FM-12+000599999V0203201N008219999999N0000001N9-00331+99999102071ADDGF103991999999999999999999 0029029070999991901011220004+64333+023450FM-12+000599999V0202901N009819999999N0000001N9-00221+99999102191ADDGF100991999999999999999999 0029029070999991901011306004+64333+023450FM-12+000599999V0203201N004119999999N0000001N9+00001+99999101661ADDGF100991999999999999999999 0029029070999991901011313004+64333+023450FM-12+000599999V0203201N008219999999N0000001N9-00061+99999102351ADDGF100991999999999999999999 0029029070999991901011320004+64333+023450FM-12+000599999V0203201N004119999999N0000001N9-00171+99999102321ADDGF100991999999999999999999 0029029070999991901011406004+64333+023450FM-12+000599999V0209991C000019999999N0000001N9-00441+99999102721ADDGF100991999999999999999999 0029029070999991901011413004+64333+023450FM-12+000599999V0202301N009819999999N0000001N9-00391+99999102551ADDGF100991999999999999999999 0029029070999991901011420004+64333+023450FM-12+000599999V0202301N011819999999N0000001N9-00331+99999102261ADDGF100991999999999999999999 
0029029070999991901011506004+64333+023450FM-12+000599999V0202301N013919999999N0000001N9-00061+99999101831ADDGF108991999999999999999999 0029029070999991901011513004+64333+023450FM-12+000599999V0202301N013919999999N0000001N9+00171+99999101541ADDGF108991999999999999999999 0035029070999991901011520004+64333+023450FM-12+000599999V0202301N015919999999N0000001N9+00221+99999101321ADDGF108991999999999999999999MW1721 ~130 GB NCDC climate Dataset
  • 3. high_temp=0 forline inopen('1901'): line =line.strip() (year, temp, quality) = (line[15:19], line[87:92], line[92:93]) if(temp !="+9999"and quality in"01459"): high_temp=max(high_temp,float(temp)) printhigh_temp How can we make this scale? ( and do more interesting things )
  • 4.
  • 5. JeffREY DEAN – Google - 2004 “Our abstraction is in-spired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical record in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.”
  • 6.
  • 7. def mapper(line):
    line = line.strip()
    (year, temp, quality) = (line[15:19], line[87:92], line[92:93])
    if (temp != "+9999" and quality in "01459"):
        return float(temp)
    return None

output = map(mapper, open('1901'))
print reduce(max, output)
MapReduce in pure Python
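The pure-Python version collapses everything to a single maximum. To see the intermediate key/value grouping that the Dean quote describes, the same computation can be sketched with explicit (year, temp) pairs grouped by key. This is an illustrative local sketch (Python 3 syntax, not from the slides); the two sample records are NCDC lines from the dataset shown earlier.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # emit an intermediate (year, temp) key/value pair, or None to skip
    (year, temp, quality) = (line[15:19], line[87:92], line[92:93])
    if temp != "+9999" and quality in "01459":
        return (year, float(temp))
    return None

def reducer(year, temps):
    # combine all values that share a key: here, take the maximum
    return (year, max(temps))

# two NCDC-style records for 1901 (readings in tenths of a degree C)
records = [
    "0029029070999991901010106004+64333+023450FM-12+000599999V0202701N015919999999N0000001N9-00781+99999102001ADDGF108991999999999999999999",
    "0029029070999991901010113004+64333+023450FM-12+000599999V0202901N008219999999N0000001N9-00721+99999102001ADDGF104991999999999999999999",
]

# sort emitted pairs so equal keys are adjacent, then reduce each group
pairs = sorted(p for p in map(mapper, records) if p is not None)
results = [reducer(year, [t for _, t in group])
           for year, group in groupby(pairs, key=itemgetter(0))]
print(results)  # [('1901', -72.0)]
```

The sort-then-group step is exactly the shuffle that Hadoop performs between the map and reduce phases.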
  • 8. import sys
import re

for line in sys.stdin:
    val = line.strip()
    (year, temp, q) = (val[15:19], val[87:92], val[92:93])
    if (temp != "+9999" and re.match("[01459]", q)):
        print "%s\t%s" % (year, temp)
mapper.py

import sys

(last_key, max_val) = (None, -sys.maxint)
for line in sys.stdin:
    (key, val) = line.strip().split("\t")
    if last_key and last_key != key:
        print "%s\t%s" % (last_key, max_val)
        (last_key, max_val) = (key, int(val))
    else:
        (last_key, max_val) = (key, max(max_val, int(val)))
if last_key:
    print "%s\t%s" % (last_key, max_val)
reducer.py

cat dataFile | python mapper.py | sort | python reducer.py
Hadoop Streaming
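The `cat | mapper | sort | reducer` pipeline can be checked without a cluster by simulating the three phases in-process. This is a sketch (Python 3 syntax, not from the slides) that mirrors the streaming scripts: tab-separated key/value lines, a lexicographic sort as the shuffle, and a running per-key maximum initialized to negative infinity, which the sub-zero winter readings in this dataset require.

```python
# in-process stand-in for: cat dataFile | mapper.py | sort | reducer.py
def run_streaming(records):
    # map phase: emit tab-separated key/value lines
    mapped = []
    for line in records:
        val = line.strip()
        (year, temp, q) = (val[15:19], val[87:92], val[92:93])
        if temp != "+9999" and q in "01459":
            mapped.append("%s\t%s" % (year, temp))
    mapped.sort()  # shuffle/sort phase: brings equal keys together
    # reduce phase: running maximum per key
    out, (last_key, max_val) = [], (None, float("-inf"))
    for line in mapped:
        (key, val) = line.split("\t")
        if last_key and last_key != key:
            out.append((last_key, max_val))
            (last_key, max_val) = (key, int(val))
        else:
            (last_key, max_val) = (key, max(max_val, int(val)))
    if last_key:
        out.append((last_key, max_val))
    return out

# two NCDC-style records for 1901 (temps in tenths of a degree C)
records = [
    "0029029070999991901010106004+64333+023450FM-12+000599999V0202701N015919999999N0000001N9-00781+99999102001ADDGF108991999999999999999999",
    "0029029070999991901010113004+64333+023450FM-12+000599999V0202901N008219999999N0000001N9-00721+99999102001ADDGF104991999999999999999999",
]
print(run_streaming(records))  # [('1901', -72)]
```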
  • 10. def mapper(key, value):
    line = value.strip()
    (year, temp, quality) = (line[15:19], line[87:92], line[92:93])
    if (temp != "+9999" and quality in "01459"):
        yield year, int(temp)

def reducer(key, values):
    yield key, max(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer, reducer)
Dumbo
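Dumbo's mapper and reducer are plain generators, so they compose locally without the framework. Below is a minimal local harness (not part of Dumbo; Python 3 syntax, and `fake_record` is a hypothetical helper that pads a synthetic line so the fields land at the slide's offsets) showing how the generator-based API fits together.

```python
from itertools import groupby
from operator import itemgetter

def mapper(key, value):
    line = value.strip()
    (year, temp, quality) = (line[15:19], line[87:92], line[92:93])
    if temp != "+9999" and quality in "01459":
        yield year, int(temp)

def reducer(key, values):
    yield key, max(values)

def local_run(mapper, reducer, records):
    # map phase: line numbers stand in for the byte-offset keys Hadoop supplies
    pairs = [kv for i, line in enumerate(records) for kv in mapper(i, line)]
    pairs.sort(key=itemgetter(0))  # shuffle/sort
    out = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        out.extend(reducer(key, (v for _, v in group)))
    return out

def fake_record(year, temp, quality):
    # pad a synthetic line so year/temp/quality land at offsets 15, 87 and 92
    return "0" * 15 + year + "0" * 68 + temp + quality

records = [fake_record("1901", "-0078", "1"), fake_record("1901", "-0072", "1")]
print(local_run(mapper, reducer, records))  # [('1901', -72)]
```

Because the third argument to `dumbo.run` is a combiner, passing `reducer` there works only when the reduce operation (like `max`) is associative and commutative.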
  • 11.
  • 12. Job / Iteration Abstraction
  • 13. Counter / Status Abstraction
  • 15. Ability to use non-Java combiners
  • 16. Built-in library of mappers / reducers
  • 17. Excellent way to model MR algorithms Dumbo
  • 18. CLI – API – Web Console Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). Elastic MapReduce
  • 20. Use Hadoop’s Job Tracker or Amazon’s ‘Debugger’ Elastic MapReduce
  • 23. Quiz: How would you do this?
  • 24. MapReduce algorithms ARE different
BFS(G, s)                       // G is the graph and s is the starting node
  for each vertex u ∈ V[G] - {s}
      do color[u] ← WHITE       // color of vertex u
         d[u] ← ∞               // distance from source s to vertex u
         π[u] ← NIL             // predecessor of u
  color[s] ← GRAY
  d[s] ← 0
  π[s] ← NIL
  Q ← Ø                         // Q is a FIFO queue
  ENQUEUE(Q, s)
  while Q ≠ Ø                   // iterates as long as there are gray vertices
      do u ← DEQUEUE(Q)
         for each v ∈ Adj[u]
             do if color[v] = WHITE     // discover the undiscovered adjacent vertices
                   then color[v] ← GRAY // enqueued whenever painted gray
                        d[v] ← d[u] + 1
                        π[v] ← u
                        ENQUEUE(Q, v)
         color[u] ← BLACK               // painted black whenever dequeued
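The queue-based BFS above is inherently sequential; a MapReduce formulation instead expands the entire frontier in each job, iterating until no distance changes. The following is an illustrative single-machine sketch of that reformulation (Python 3 syntax; the graph and function names are hypothetical, not from the slides).

```python
def bfs_round(distances, adj):
    # "map": each node with a known distance emits candidate distances
    # for its neighbors, and re-emits its own distance
    emitted = []
    for node, d in distances.items():
        emitted.append((node, d))
        for v in adj.get(node, []):
            emitted.append((v, d + 1))
    # "reduce": keep the minimum candidate distance per node
    new = {}
    for node, d in emitted:
        if node not in new or d < new[node]:
            new[node] = d
    return new

def mr_bfs(adj, source):
    distances = {source: 0}
    while True:
        new = bfs_round(distances, adj)
        if new == distances:  # fixed point: all reachable nodes settled
            return distances
        distances = new

adj = {'s': ['a', 'b'], 'a': ['c'], 'b': ['c'], 'c': ['d']}
print(mr_bfs(adj, 's'))  # {'s': 0, 'a': 1, 'b': 1, 'c': 2, 'd': 3}
```

Each call to `bfs_round` corresponds to one MapReduce job, so the shortest-path depth of the graph determines the number of jobs: the per-node queue state of the sequential algorithm becomes per-iteration key/value traffic.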
  • 25. > m = function() { emit(this.user_id, 1); }
> r = function(k, vals) { return 1; }
> res = db.events.mapReduce(m, r, { query : {type:'sale'} });
> db[res.result].find().limit(2)
{ "_id" : 8321073716060, "value" : 1 }
{ "_id" : 7921232311289, "value" : 1 }
MongoDB

{ok, [R]} = Client:mapred([{<<"groceries">>, <<"mine">>},
                           {<<"groceries">>, <<"yours">>}],
                          [{'map', {'qfun', Count}, 'none', false},
                           {'reduce', {'qfun', Merge}, 'none', true}]).
Riak

Map Reduce Elsewhere
  • 26. Hadoop: The Definitive Guide - http://www.hadoopbook.com
Dumbo - http://dumbotics.com/ - http://github.com/klbostee/dumbo/
Elastic MapReduce - http://aws.amazon.com/
Boto - http://github.com/boto
Getting Started Slides - http://www.slideshare.net/pacoid/getting-started-on-hadoop
Learn More
  • 27. DEMO
