Hadoop Streaming
Programming Hadoop without Java

Glenn K. Lockwood, Ph.D.
User Services Group
San Diego Supercomputer Center
University of California, San Diego

November 8, 2013
Hadoop Streaming
HADOOP ARCHITECTURE RECAP
Map/Reduce Parallelism

[Figure: six independent map tasks (task 0 through task 5), each working on its own block of data]
Magic of HDFS

[Figure in original slide]
Hadoop Workflow

[Figure in original slide]
Hadoop Processing Pipeline

1. Map – convert raw input into key/value pairs on each node
2. Shuffle/Sort – send all key/value pairs with the same key to the same reducer node
3. Reduce – for each unique key, do something with all the corresponding values

(A minimal in-memory sketch of these three phases follows below.)
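Illustrative only: this is plain in-memory Python, not how Hadoop itself runs, and it is not part of the hands-on material.

# Sketch of the three phases on a toy input
from itertools import groupby

lines = ["call me ishmael", "call me anything"]

# 1. Map: turn each line of raw input into (key, value) pairs
pairs = [(word, 1) for line in lines for word in line.split()]

# 2. Shuffle/Sort: bring identical keys together
pairs.sort(key=lambda kv: kv[0])

# 3. Reduce: for each unique key, do something with all of its values
for key, group in groupby(pairs, key=lambda kv: kv[0]):
    print("%s\t%d" % (key, sum(value for _, value in group)))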
Hadoop Streaming
WORDCOUNT EXAMPLES
Hadoop and Python

• Hadoop streaming w/ Python mappers/reducers
  • portable
  • most difficult (or least difficult) to use
  • you are the glue between Python and Hadoop
• mrjob (or others: hadoopy, dumbo, etc.)
  • comprehensive integration
  • Python interface to Hadoop streaming
  • analogous interface libraries exist in R, Perl
  • can interface directly with Amazon
Wordcount Example

[Figure in original slide]
Hadoop Streaming with Python

• "Simplest" (most portable) method
• Uses raw Python, Hadoop – you are the glue:

    cat input.txt | mapper.py | sort | reducer.py > output.txt

  Provide these two scripts; Hadoop does the rest.
• Generalizable to any language you want (Perl, R, etc.)
HANDS ON – Hadoop Streaming

Located in streaming/streaming/:

• wordcount-streaming-mapper.py
  We'll look at this first.
• wordcount-streaming-reducer.py
  We'll look at this second.
• run-wordcount.sh
  All of the Hadoop commands needed to run this example. Run the script (./run-wordcount.sh) or paste each command line-by-line.
• pg2701.txt
  The full text of Melville's Moby Dick.
Wordcount: Hadoop streaming mapper

#!/usr/bin/env python

import sys

for line in sys.stdin:
    line = line.strip()
    keys = line.split()
    for key in keys:
        value = 1
        print( '%s\t%d' % (key, value) )

...
What One Mapper Does

line = "Call me Ishmael. Some years ago--never mind how long"

keys = [ Call, me, Ishmael., Some, years, ago--never, mind, how, long ]

emit.keyval(key, value) ... each key is emitted with a value of 1 and sent to the reducers:

    Call 1   me 1   Ishmael. 1   Some 1   years 1   ago--never 1   mind 1   how 1   long 1
Reducer Loop

• If this key is the same as the previous key,
  • add this key's value to our running total.
• Otherwise,
  • print out the previous key's name and the running total,
  • reset our running total to 0,
  • add this key's value to the running total, and
  • "this key" is now considered the "previous key".
Wordcount: Streaming Reducer (1/2)

#!/usr/bin/env python

import sys

last_key = None
running_total = 0

for input_line in sys.stdin:
    input_line = input_line.strip()
    this_key, value = input_line.split("\t", 1)
    value = int(value)

(to be continued...)
Wordcount: Streaming Reducer (2/2)

    if last_key == this_key:
        running_total += value
    else:
        if last_key:
            print( "%s\t%d" % (last_key, running_total) )
        running_total = value
        last_key = this_key

# after stdin is exhausted, emit the total for the final key
if last_key == this_key:
    print( "%s\t%d" % (last_key, running_total) )
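For comparison, the same reduction can be written with itertools.groupby, which removes the manual last_key bookkeeping. This is a sketch only, not part of the hands-on material:

#!/usr/bin/env python
# Alternative reducer sketch: groupby collects consecutive lines that
# share a key, which works because the shuffle/sort phase hands the
# reducer its input already sorted by key.
import sys
from itertools import groupby

def parse(stream):
    for line in stream:
        key, value = line.rstrip("\n").split("\t", 1)
        yield key, int(value)

for key, group in groupby(parse(sys.stdin), key=lambda kv: kv[0]):
    print("%s\t%d" % (key, sum(value for _, value in group)))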
Testing Mappers/Reducers

• Debugging Hadoop is not fun, so test the scripts outside of Hadoop first:

$ head -n100 pg2701.txt |
    ./wordcount-streaming-mapper.py | sort |
    ./wordcount-streaming-reducer.py
...
with                5
word,               1
world.              1
www.gutenberg.org   1
you                 3
You                 1
Launching Hadoop Streaming
"
$	
  hadoop	
  dfs	
  -­‐copyFromLocal	
  ./pg2701.txt	
  mobydick.txt	
  
	
  
$	
  hadoop	
  jar	
  	
  
	
  /opt/hadoop/contrib/streaming/hadoop-­‐streaming-­‐1.1.1.jar	
  	
  
	
  	
  	
  	
  -­‐D	
  mapred.reduce.tasks=2	
  	
  
	
  	
  	
  	
  -­‐mapper	
  "$(which	
  python)	
  $PWD/wordcount-­‐streaming-­‐mapper.py"	
  	
  
	
  	
  	
  	
  -­‐reducer	
  "$(which	
  python)	
  $PWD/wordcount-­‐streaming-­‐reducer.py"	
  	
  
	
  	
  	
  	
  -­‐input	
  mobydick.txt	
  	
  
	
  	
  	
  	
  -­‐output	
  output	
  
	
  
$	
  hadoop	
  dfs	
  -­‐cat	
  output/part-­‐*	
  >	
  ./output.txt	
  

Hadoop with Python - mrjob

• Mapper, reducer written as functions
• Can serialize (Pickle) objects to use as values (see the sketch below)
• Presents a single key + all values at once
• Extracts map/reduce errors from Hadoop for you
• Hadoop runs entirely through Python:

$ ./wordcount-mrjob.py \
      --jobconf mapred.reduce.tasks=2 \
      -r hadoop \
      hdfs:///user/glock/mobydick.txt \
      --output-dir hdfs:///user/glock/output
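The "serialize (Pickle) objects to use as values" bullet refers to mrjob's pluggable protocols. A sketch of how that could look (this job and its names are hypothetical and not part of the hands-on material; it assumes mrjob's mrjob.protocol.PickleProtocol):

#!/usr/bin/env python
# Hypothetical sketch: pickle the intermediate values so arbitrary
# Python objects can travel from mapper to reducer.
from mrjob.job import MRJob
from mrjob.protocol import PickleProtocol

class MRWordStats(MRJob):
    INTERNAL_PROTOCOL = PickleProtocol   # serialize mapper->reducer values

    def mapper(self, _, line):
        for word in line.split():
            yield word.lower(), (1, len(word))   # value is a tuple

    def reducer(self, key, values):
        counts, lengths = zip(*values)
        yield key, {'count': sum(counts), 'characters': sum(lengths)}

if __name__ == '__main__':
    MRWordStats.run()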
HANDS ON - mrjob

Located in streaming/mrjob:

• wordcount-mrjob.py
  Contains both mapper and reducer code.
• run-wordcount-mrjob.sh
  All of the Hadoop commands needed to run this example. Run the script (./run-wordcount-mrjob.sh) or paste each command line-by-line.
• pg2701.txt
  The full text of Melville's Moby Dick.
mrjob - Mapper

#!/usr/bin/env python

from mrjob.job import MRJob

class MRwordcount(MRJob):

    def mapper(self, _, line):
        line = line.strip()
        keys = line.split()
        for key in keys:
            value = 1
            yield key, value

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    MRwordcount.run()

(The original slide shows this side by side with the raw streaming mapper: the explicit "for line in sys.stdin:" loop and its print('%s\t%d' % (key, value)) statement become mrjob's mapper() method and a yield.)
mrjob - Reducer

    def mapper(self, _, line):
        line = line.strip()
        keys = line.split()
        for key in keys:
            value = 1
            yield key, value

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    MRwordcount.run()

• Reducer gets one key and ALL values
• No need to loop through key/value pairs
• Use list methods/iterators to deal with keys (see the sketch below)

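Because reducer() receives every value for a key through a single iterator, richer aggregations need no extra bookkeeping. A hypothetical variation (not in the slides) that keys words by their first letter and summarizes lengths with ordinary list methods:

#!/usr/bin/env python
from mrjob.job import MRJob

class MRWordLengthStats(MRJob):
    def mapper(self, _, line):
        for word in line.split():
            yield word[0].lower(), len(word)   # key = first letter

    def reducer(self, key, values):
        lengths = list(values)                 # all values for this key
        yield key, {'words': len(lengths),
                    'avg_len': sum(lengths) / float(len(lengths)),
                    'max_len': max(lengths)}

if __name__ == '__main__':
    MRWordLengthStats.run()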
mrjob – Job Launch

Run as a Python script like any other; Hadoop parameters (and many more!) can be passed in through Python:

$ ./wordcount-mrjob.py \
      --jobconf mapred.reduce.tasks=2 \
      -r hadoop \
      hdfs:///user/glock/mobydick.txt \
      --output-dir hdfs:///user/glock/output

Default file locations are NOT on HDFS; copying to/from HDFS is done automatically.

Default output action is to print results to your screen.
Hadoop Streaming
VCF PARSING: A REAL EXAMPLE
VCF Parsing Problem

• Variant Calling Files (VCFs) are a standard in bioinformatics
• Large files (> 10 GB), semi-structured
• Format is a moving target, BUT parsing libraries exist (PyVCF, VCFtools)
• Large VCFs still take too long to process serially
VCF File Format

##fileformat=VCFv4.1
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth
...
#CHROM  POS    ID  REF  ALT  QUAL    FILTER  INFO                                           FORMAT  HT020en  HT0
1       10186  .   T    G    45.44   .       AC=2;AF=0.500;AN=4;BaseQRankSum=-0.584;DP=43
1       10198  .   T    G    33.46   .       AC=2;AF=0.500;AN=4;BaseQRankSum=0.277;DP=51
1       10279  .   T    G    48.49   .       AC=2;AF=0.500;AN=4;BaseQRankSum=1.855;DP=28
1       10389  .   AC   A    288.40  .       AC=2;AF=0.500;AN=4;BaseQRankSum=2.540;DP=
...

(Fields are tab-separated; the ^I characters on the original slide are literal tabs, and the records are truncated at the right edge of the slide. The structure of the entire header must remain intact to describe each variant record.)
Strategy: Expand the Pipeline

1. Preprocess VCF to separate the header (a sketch follows this list)
2. Map
   1. read in header to make sense of records
   2. filter out useless records
   3. generate key/value pairs for interesting variants
3. Sort/Shuffle
4. Reduce (if necessary)
5. Postprocess (upload to PostgreSQL)
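Step 1 corresponds to preprocess.py in the hands-on files. A minimal sketch of what such a preprocessor could look like (an assumption based on its description, not the actual script):

#!/usr/bin/env python
# Hypothetical preprocessing sketch (not the actual preprocess.py):
# copy the VCF header lines (those starting with '#') into a separate
# file so every mapper can load them alongside its chunk of records.
import sys

with open(sys.argv[1]) as vcf_file, open('header.txt', 'w') as header:
    for line in vcf_file:
        if line.startswith('#'):
            header.write(line)
        else:
            break   # all header lines precede the first record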
HANDS ON - VCF

Located in: streaming/vcf/

• preprocess.py
  Extracts the header from the VCF file.
• mapper.py
  Simple PyVCF-based mapper.
• run-parsevcf.sh
  Commands to launch the simple VCF parser example.
• sample.vcf
  Sample VCF (cancer).
• parsevcf.py
  Full preprocess+map+reduce+postprocess application.
• run-parsevcf-full.sh
  Commands to run the full pre+map+red+post pipeline.
Our Hands-On Version: Mapper

#!/usr/bin/env python

import vcf
import sys

# vcfHeader holds the path to the extracted header file (e.g. the
# header.txt produced by the preprocessing step); PyVCF parses the
# header from that file...
vcf_reader = vcf.Reader(open(vcfHeader, 'r'))

# ...then the reader is pointed at stdin so it parses whatever chunk of
# records Hadoop streams to this mapper (comment lines are skipped).
vcf_reader._reader = sys.stdin

vcf_reader.reader = (line.rstrip() for line in
    vcf_reader._reader if line.rstrip() and line[0] != '#')

(continued...)
Our Hands-On Version: Mapper (continued)

for record in vcf_reader:
    chrom = record.CHROM
    id = record.ID
    pos = record.POS
    ref = record.REF
    alt = record.ALT

    try:
        # emit only variants whose allele frequency exceeds the threshold
        # (target_af, e.g. the 0.30 passed on the command line)
        for idx, af in enumerate(record.INFO['AF']):
            if af > target_af:
                print( "%d\t%s\t%d\t%s\t%s\t%.2f\t%d\t%d" % (
                    record.POS, record.CHROM, record.POS,
                    record.REF, record.ALT[idx],
                    record.INFO['AF'][idx], record.INFO['AC'][idx],
                    record.INFO['AN'] ) )
    except KeyError:
        # records without AF/AC/AN annotations are simply skipped
        pass
Our Hands-On Version: Reducer

No reduction step is needed here; the reducer can be turned off entirely:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
      -D mapred.reduce.tasks=0 \
      -mapper "$(which python) $PWD/parsevcf.py -m $PWD/header.txt,0.30" \
      -reducer "$(which python) $PWD/parsevcf.py -r" \
      -input vcfparse-input/sample.vcf \
      -output vcfparse-output
No Reducer – What's the Point?

8-node test: two mappers per node = 9x speedup
