A presentation covering the use of Python frameworks in the Hadoop ecosystem. Covers, in particular, Hadoop Streaming, mrjob, luigi, PySpark, and using Numba with Impala.
Python in the Hadoop Ecosystem (Rock Health presentation)
1. 1
A Guide to Python Frameworks for Hadoop
Uri Laserson
laserson@cloudera.com
20 March 2014
2. Goals for today
1. Make it easy to jump into Hadoop with Python
2. Describe 5 ways to use Python with Hadoop, batch and interactive
3. Guidelines for choosing a Python framework
2
4. About the speaker
⢠Joined Cloudera late 2012
⢠Focus on life sciences/medical
⢠PhD in BME/computational biology at MIT/Harvard
(2005-2012)
⢠Focused on genomics
⢠Cofounded Good Start Genetics (2007-)
⢠Applying next-gen DNA sequencing to genetic carrier
screening
4
5. About the speaker
⢠No formal training in computer science
⢠Never touched Java
⢠Almost all work using Python
5
14. 14
A partial differential equation is an equation that contains partial derivatives.
A 1
partial 2
differential 1
equation 2
is 1
an 1
that 1
contains 1
derivatives. 1
1-grams
15. 15
A partial differential equation is an equation that contains partial derivatives.
A partial 1
partial differential 1
differential equation 1
equation is 1
is an 1
an equation 1
equation that 1
that contains 1
contains partial 1
partial derivatives. 1
2-grams
16. 16
A partial differential equation is an equation that contains partial derivatives.
A partial differential equation is 1
partial differential equation is an 1
differential equation is an equation 1
equation is an equation that 1
is an equation that contains 1
an equation that contains partial 1
equation that contains partial derivatives. 1
5-grams
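The three tallies above can be reproduced with a few lines of plain Python (an in-memory sketch for illustration, not a Hadoop job; the function name `ngram_counts` is invented here):

```python
from collections import Counter

def ngram_counts(text, n):
    """Count n-grams, each joined into a single space-separated string."""
    tokens = text.split()
    grams = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams)

sentence = ("A partial differential equation is an equation "
            "that contains partial derivatives.")
# e.g. ngram_counts(sentence, 1)["equation"] == 2
```

At Hadoop scale, this same tally is what the MapReduce job later in the deck computes, one (n-gram, count) record at a time.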
22. What is Hadoop?
⢠Ecosystem of tools
⢠Core is the HDFS file system
⢠Downloadable set of jars that can be run on any
machine
22
23. HDFS design assumptions
⢠Based on Google File System
⢠Files are large (GBs to TBs)
⢠Failures are common
⢠Massive scale means failures very likely
⢠Disk, node, or network failures
⢠Accesses are large and sequential
⢠Files are append-only
23
24. HDFS properties
⢠Fault-tolerant
⢠Gracefully responds to node/disk/network failures
⢠Horizontally scalable
⢠Low marginal cost
⢠High-bandwidth
24
[Diagram: HDFS storage distribution. An input file is split into five blocks (1-5); each block is stored three times across Node A through Node E.]
26. MapReduce computation
⢠Structured as
1. Embarrassingly parallel âmap stageâ
2. Cluster-wide distributed sort (âshuffleâ)
3. Aggregation âreduce stageâ
⢠Data-locality: process the data where it is stored
⢠Fault-tolerance: failed tasks automatically detected
and restarted
⢠Schema-on-read: data must not be stored conforming
to rigid schema
26
27. Pseudocode for MapReduce
27
def map(record):
    (ngram, year, count) = unpack(record)
    # ensure word1 is the lexicographically first word:
    (word1, word2) = sorted((ngram[0], ngram[-1]))
    key = (word1, word2, year)
    emit(key, count)

def reduce(key, values):
    emit(key, sum(values))
All source code available on GitHub:
https://github.com/laserson/rock-health-python
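The pseudocode can be exercised locally with a small single-process simulation, in which an in-memory sort plus `itertools.groupby` stands in for the shuffle (a sketch for illustration; `ngram_mapper` and `run_local` are names invented here, not part of the repo above):

```python
from itertools import groupby
from operator import itemgetter

def ngram_mapper(record):
    # mirror of map(): key on (word1, word2, year), word1 lexicographically first
    ngram, year, count = record
    word1, word2 = sorted((ngram[0], ngram[-1]))
    yield (word1, word2, year), count

def run_local(records):
    # "shuffle": sort all (key, count) pairs; "reduce": sum counts per key
    pairs = [kv for rec in records for kv in ngram_mapper(rec)]
    pairs.sort(key=itemgetter(0))
    return {key: sum(count for _, count in group)
            for key, group in groupby(pairs, key=itemgetter(0))}

records = [(("partial", "differential"), 2000, 3),
           (("differential", "partial"), 2000, 4)]
# both records collapse onto the key ('differential', 'partial', 2000)
```

The sort-then-group step is exactly what Hadoop's shuffle provides for free across a cluster.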
28. Native Java
28
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class NgramsDriver extends Configured implements Tool {
public int run(String[] args) throws Exception {
Job job = new Job(getConf());
job.setJarByClass(getClass());
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(NgramsMapper.class);
job.setCombinerClass(NgramsReducer.class);
job.setReducerClass(NgramsReducer.class);
job.setOutputKeyClass(TextTriple.class);
job.setOutputValueClass(IntWritable.class);
job.setNumReduceTasks(10);
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new NgramsDriver(), args);
System.exit(exitCode);
}
}
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.log4j.Logger;
public class NgramsMapper extends Mapper<LongWritable, Text, TextTriple, IntWritable> {
private Logger LOG = Logger.getLogger(getClass());
private int expectedTokens;
@Override
protected void setup(Context context) throws IOException, InterruptedException {
String inputFile = ((FileSplit) context.getInputSplit()).getPath().getName();
LOG.info("inputFile: " + inputFile);
Pattern c = Pattern.compile("(\\d+)gram");
Matcher m = c.matcher(inputFile);
m.find();
expectedTokens = Integer.parseInt(m.group(1));
return;
}
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String[] data = value.toString().split("\t");
if (data.length < 3) {
return;
}
String[] ngram = data[0].split("\\s+");
String year = data[1];
IntWritable count = new IntWritable(Integer.parseInt(data[2]));
if (ngram.length != this.expectedTokens) {
return;
}
// build keyOut
List<String> triple = new ArrayList<String>(3);
triple.add(ngram[0]);
triple.add(ngram[expectedTokens - 1]);
Collections.sort(triple);
triple.add(year);
TextTriple keyOut = new TextTriple(triple);
context.write(keyOut, count);
}
}
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;
public class NgramsReducer extends Reducer<TextTriple, IntWritable, TextTriple, IntWritable> {
@Override
protected void reduce(TextTriple key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable value : values) {
sum += value.get();
}
context.write(key, new IntWritable(sum));
}
}
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
public class TextTriple implements WritableComparable<TextTriple> {
private Text first;
private Text second;
private Text third;
public TextTriple() {
set(new Text(), new Text(), new Text());
}
public TextTriple(List<String> list) {
set(new Text(list.get(0)),
new Text(list.get(1)),
new Text(list.get(2)));
}
public void set(Text first, Text second, Text third) {
this.first = first;
this.second = second;
this.third = third;
}
public void write(DataOutput out) throws IOException {
first.write(out);
second.write(out);
third.write(out);
}
public void readFields(DataInput in) throws IOException {
first.readFields(in);
second.readFields(in);
third.readFields(in);
}
@Override
public int hashCode() {
return first.hashCode() * 163 + second.hashCode() * 31 + third.hashCode();
}
@Override
public boolean equals(Object obj) {
if (obj instanceof TextTriple) {
TextTriple tt = (TextTriple) obj;
return first.equals(tt.first) && second.equals(tt.second) && third.equals(tt.third);
}
return false;
}
@Override
public String toString() {
return first + "\t" + second + "\t" + third;
}
public int compareTo(TextTriple other) {
int comp = first.compareTo(other.first);
if (comp != 0) {
return comp;
}
comp = second.compareTo(other.second);
if (comp != 0) {
return comp;
}
return third.compareTo(other.third);
}
}
29. Native Java
⢠Maximum flexibility
⢠Fastest performance
⢠Native to Hadoop
⢠Most difficult to write
29
31. Hadoop Streaming: features
⢠Canonical method for using any executable as
mapper/reducer
⢠Includes shell commands, like grep
⢠Transparent communication with Hadoop though
stdin/stdout
⢠Key boundaries manually detected in reducer
⢠Built-in with Hadoop: should require no additional
framework installation
⢠Developer must decide how to encode more
complicated objects (e.g., JSON) or binary data
31
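A minimal Streaming word-count pair might look like the following (a hedged sketch: the script name and the map/reduce argv convention are inventions here, not from the talk). The reducer shows the manual key-boundary detection mentioned above, since Hadoop delivers the reducer's input sorted by key but does not group it:

```python
import sys

def map_line(line):
    """Mapper: emit a tab-separated (word, 1) pair per token."""
    for word in line.split():
        yield "%s\t1" % word

def reduce_lines(lines):
    """Reducer: sum counts, detecting key boundaries by hand."""
    current, total = None, 0
    for line in lines:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                yield "%s\t%d" % (current, total)
            current, total = key, 0
        total += int(value)
    if current is not None:
        yield "%s\t%d" % (current, total)

if __name__ == "__main__" and len(sys.argv) > 1:
    # e.g. hadoop jar hadoop-streaming.jar \
    #        -mapper 'python wc.py map' -reducer 'python wc.py reduce' ...
    if sys.argv[1] == "reduce":
        outputs = reduce_lines(sys.stdin)
    else:
        outputs = (out for line in sys.stdin for out in map_line(line))
    for out in outputs:
        print(out)
```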
33. mrjob
33
class NgramNeighbors(MRJob):
    # specify input/intermed/output serialization
    # default output protocol is JSON; here we set it to text
    OUTPUT_PROTOCOL = RawProtocol

    def mapper(self, key, line):
        pass

    def combiner(self, key, counts):
        pass

    def reducer(self, key, counts):
        pass

if __name__ == '__main__':
    # sets up a runner, based on command line options
    NgramNeighbors.run()
34. mrjob: features
⢠Abstracted MapReduce interface
⢠Handles complex Python objects
⢠Multi-step MapReduce workflows
⢠Extremely tight AWS integration
⢠Easily choose to run locally, on Hadoop cluster, or on
EMR
⢠Actively developed; great documentation
34
36. mrjob: serialization
36
class MyMRJob(mrjob.job.MRJob):
    INPUT_PROTOCOL = mrjob.protocol.RawValueProtocol
    INTERNAL_PROTOCOL = mrjob.protocol.JSONProtocol
    OUTPUT_PROTOCOL = mrjob.protocol.JSONProtocol

(the defaults shown above)

Available:
RawProtocol / RawValueProtocol
JSONProtocol / JSONValueProtocol
PickleProtocol / PickleValueProtocol
ReprProtocol / ReprValueProtocol

Custom protocols can be written.
No current support for binary serialization schemes.
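mrjob protocols are duck-typed: any object with `read()` and `write()` methods will serve, which is how a custom protocol is written. A hedged sketch of a tab-separated-text protocol (this class is illustrative, not part of mrjob; note that newer mrjob versions pass `read()` bytes rather than str):

```python
class TSVProtocol(object):
    """Custom mrjob-style protocol: key and value as tab-separated text."""

    def read(self, line):
        # split one input line into a (key, value) pair
        key, _, value = line.partition("\t")
        return key, value

    def write(self, key, value):
        # render a (key, value) pair back into one output line
        return "%s\t%s" % (key, value)

# Wired into a job the same way as the built-ins, e.g.:
#   class MyMRJob(mrjob.job.MRJob):
#       INTERNAL_PROTOCOL = TSVProtocol
```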
37. luigi
⢠Full-fledged workflow management, task
scheduling, dependency resolution tool in Python
(similar to Apache Oozie)
⢠Built-in support for Hadoop by wrapping Streaming
⢠Not as fully-featured as mrjob for Hadoop, but easily
customizable
⢠Internal serialization through repr/eval
⢠Actively developed at Spotify
⢠README is good but documentation is lacking
37
48. What is Spark?
⢠Started in 2009 as academic project from Amplab at
UCBerkeley; now ASF and >100 contributors
⢠In-memory distributed execution engine
⢠Operates on Resilient Distributed Datasets (RDDs)
⢠Provides richer distributed computing primitives for
various problems
⢠Can support SQL, stream processing, ML, graph
computation
⢠Supports Scala, Java, and Python
48
49. Spark uses a general DAG scheduler
⢠Application aware scheduler
⢠Uses locality for both disk
and memory
⢠Partitioning-aware
to avoid shuffles
⢠Can rewrite and optimize
graph based on analysis
join
union
groupBy
map
Stage 3
Stage 1
Stage 2
A: B:
C: D:
E:
F:
G:
= cached data partition
51. Apache Spark
51
file = spark.textFile("hdfs://...")
errors = file.filter(lambda line: "ERROR" in line)
# Count all the errors
errors.count()
# Count errors mentioning MySQL
errors.filter(lambda line: "MySQL" in line).count()
# Fetch the MySQL errors as an array of strings
errors.filter(lambda line: "MySQL" in line).collect()
val points = spark.textFile(...).map(parsePoint).cache()
var w = Vector.random(D) // current separating plane
for (i <- 1 to ITERATIONS) {
val gradient = points.map(p =>
(1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
).reduce(_ + _)
w -= gradient
}
println("Final separating plane: " + w)
Log filtering (Python)
Logistic regression (Scala)
53. What's Impala?
• Interactive SQL
• Typically 4-65x faster than the latest Hive (observed up to 100x faster)
• Responses in seconds instead of minutes (sometimes sub-second)
• ANSI-92 standard SQL queries with HiveQL
• Compatible SQL interface for existing Hadoop/CDH applications
• Based on industry-standard SQL
• Natively on Hadoop/HBase storage and metadata
• Flexibility, scale, and cost advantages of Hadoop
• No duplication/synchronization of data and metadata
• Local processing to avoid network bottlenecks
• Separate runtime from batch processing
• Hive, Pig, and MapReduce are designed and great for batch
• Impala is purpose-built for low-latency SQL queries on Hadoop
Cloudera Confidential. ©2013 Cloudera, Inc. All Rights Reserved.
53
54. Cloudera Impala
54
SELECT cosmic as snp_id,
vcf_chrom as chr,
vcf_pos as pos,
sample_id as sample,
vcf_call_gt as genotype,
sample_affection as phenotype
FROM
hg19_parquet_snappy_join_cached_partitioned
WHERE
COSMIC IS NOT NULL AND
dbSNP IS NULL AND
sample_study = "breast_cancer" AND
VCF_CHROM = "16";
55. Impala Architecture: Planner
⢠Example: query with join and aggregation
SELECT state, SUM(revenue)
FROM HdfsTbl h JOIN HbaseTbl b ON (...)
GROUP BY 1 ORDER BY 2 desc LIMIT 10
[Diagram: distributed plan for the query above. HDFS scans run at the DataNodes and HBase scans at the region servers; each node performs a hash join and partial aggregation, then exchanges results to the coordinator, which does the final aggregation and TopN.]
55
56. Impala User-defined Functions (UDFs)
⢠Tuple => Scalar value
⢠Substring
⢠sin, cos, pow, âŚ
⢠Machine-learning models
⢠Supports Hive UDFs (Java)
⢠Highly unpleasurable
⢠Impala (native) UDFs
⢠C++ interface designed for efficiency
⢠Similar to Postgres UDFs
⢠Runs any LLVM-compiled code
56
61. Iris data and BigML
61
def predict_species_orig(sepal_width=None,
                         petal_length=None,
                         petal_width=None):
    """ Predictor for species from model/52952081035d07727e01d836
    Predictive model by BigML - Machine Learning Made Easy
    """
    if (petal_width is None):
        return u'Iris-virginica'
    if (petal_width > 0.8):
        if (petal_width <= 1.75):
            if (petal_length is None):
                return u'Iris-versicolor'
            if (petal_length > 4.95):
                if (petal_width <= 1.55):
                    return u'Iris-virginica'
                if (petal_width > 1.55):
                    if (petal_length > 5.45):
                        return u'Iris-virginica'
                    if (petal_length <= 5.45):
                        return u'Iris-versicolor'
            if (petal_length <= 4.95):
                if (petal_width <= 1.65):
                    return u'Iris-versicolor'
                if (petal_width > 1.65):
                    return u'Iris-virginica'
        if (petal_width > 1.75):
            if (petal_length is None):
                return u'Iris-virginica'
            if (petal_length > 4.85):
                return u'Iris-virginica'
            if (petal_length <= 4.85):
                if (sepal_width is None):
                    return u'Iris-virginica'
                if (sepal_width <= 3.1):
                    return u'Iris-virginica'
                if (sepal_width > 3.1):
                    return u'Iris-versicolor'
    if (petal_width <= 0.8):
        return u'Iris-setosa'