1
A Guide to Python Frameworks for Hadoop
Uri Laserson
laserson@cloudera.com
20 March 2014
Goals for today
1. Easy to jump into Hadoop with Python
2. Describe 5 ways to use Python with Hadoop, batch
and interactive
3. Guidelines for choosing Python framework
2
3
Code:
https://github.com/laserson/rock-health-python
Blog post:
http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/
Slides:
http://www.slideshare.net/urilaserson/
About the speaker
• Joined Cloudera late 2012
• Focus on life sciences/medical
• PhD in BME/computational biology at MIT/Harvard
(2005-2012)
• Focused on genomics
• Cofounded Good Start Genetics (2007-)
• Applying next-gen DNA sequencing to genetic carrier
screening
4
About the speaker
• No formal training in computer science
• Never touched Java
• Almost all work using Python
5
6
Python frameworks for Hadoop
• Hadoop Streaming
• mrjob (Yelp)
• dumbo
• Luigi (Spotify)
• hadoopy
• pydoop
• PySpark
• happy
• Disco
• octopy
• Mortar Data
• Pig UDF/Jython
• hipy
• Impala + Numba
7
Goals for Python framework
1. “Pseudocodiness”/simplicity
2. Flexibility/generality
3. Ease of use/installation
4. Performance
8
Python frameworks for Hadoop
• Hadoop Streaming
• mrjob (Yelp)
• dumbo
• Luigi (Spotify)
• hadoopy
• pydoop
• PySpark
• happy
• Disco
• octopy
• Mortar Data
• Pig UDF/Jython
• hipy
• Impala + Numba
9
Python frameworks for Hadoop
• Hadoop Streaming
• mrjob (Yelp)
• dumbo
• Luigi (Spotify)
• hadoopy
• pydoop
• PySpark
• happy abandoned? Jython-based
• Disco not Hadoop
• octopy not serious/not Hadoop
• Mortar Data HaaS (Hadoop-as-a-Service); supports numpy, scipy, nltk, pip-installable packages in UDFs
• Pig UDF/Jython Pig is another talk; Jython limited
• hipy Python syntactic sugar to construct Hive queries
• Impala + Numba
10
11
An n-gram is a tuple of n words.
Problem: aggregating the Google n-gram data
http://books.google.com/ngrams
12
An n-gram is a tuple of n words.
Problem: aggregating the Google n-gram data
http://books.google.com/ngrams
[Figure: the eight words of a sentence bracketed together as an 8-gram]
13
"A partial differential equation is an equation that contains partial derivatives."
14
A partial differential equation is an equation that contains partial derivatives.
A 1
partial 2
differential 1
equation 2
is 1
an 1
that 1
contains 1
derivatives. 1
1-grams
15
A partial differential equation is an equation that contains partial derivatives.
A partial 1
partial differential 1
differential equation 1
equation is 1
is an 1
an equation 1
equation that 1
that contains 1
contains partial 1
partial derivatives. 1
2-grams
16
A partial differential equation is an equation that contains partial derivatives.
A partial differential equation is 1
partial differential equation is an 1
differential equation is an equation 1
equation is an equation that 1
is an equation that contains 1
an equation that contains partial 1
equation that contains partial derivatives. 1
5-grams
17
18
goto code
19
2-gram         year  matches  pages  volumes
flourished in  1993  2        2      2
flourished in  1998  2        2      1
flourished in  1999  6        6      4
flourished in  2000  5        5      5
flourished in  2001  1        1      1
flourished in  2002  7        7      3
flourished in  2003  9        9      4
flourished in  2004  22       21     13
flourished in  2005  37       37     22
flourished in  2006  55       55     38
flourished in  2007  99       98     76
flourished in  2008  220      215    118
fluid of       1899  2        2      1
fluid of       2000  3        3      1
fluid of       2002  2        1      1
fluid of       2003  3        3      1
fluid of       2004  3        3      3
20
Compute how often two words are near each
other in a given year.
Two words are “near” if they are both
present in a 2-, 3-, 4-, or 5-gram.
21
Raw data:
...2-grams...
(cat, the) 1999 14
(the, cat) 1999 7002
...3-grams...
(the, cheshire, cat) 1999 563
...4-grams...
...5-grams...
(the, cat, in, the, hat) 1999 1023
(the, dog, chased, the, cat) 1999 403
(cat, is, one, of, the) 1999 24

Aggregated results (lexicographic ordering):
(cat, the) 1999 8006
(hat, the) 1999 1023

Internal n-grams are counted by the smaller n-grams:
• avoids double-counting
• increases sensitivity (the corpus only includes n-grams observed at least 40 times)
What is Hadoop?
• Ecosystem of tools
• Core is the HDFS file system
• Downloadable set of jars that can be run on any
machine
22
HDFS design assumptions
• Based on Google File System
• Files are large (GBs to TBs)
• Failures are common
• Massive scale means failures very likely
• Disk, node, or network failures
• Accesses are large and sequential
• Files are append-only
23
HDFS properties
• Fault-tolerant
• Gracefully responds to node/disk/network failures
• Horizontally scalable
• Low marginal cost
• High-bandwidth
24
[Figure: HDFS storage distribution — an input file split into blocks 1–5, each block replicated three times across Nodes A–E]
MapReduce computation
25
MapReduce computation
• Structured as
1. Embarrassingly parallel “map stage”
2. Cluster-wide distributed sort (“shuffle”)
3. Aggregation “reduce stage”
• Data-locality: process the data where it is stored
• Fault-tolerance: failed tasks automatically detected
and restarted
• Schema-on-read: data need not be stored in a
rigid schema
26
Pseudocode for MapReduce
27
def map(record):
(ngram, year, count) = unpack(record)
# ensure word1 is the lexicographically first word:
(word1, word2) = sorted([ngram[first], ngram[last]])
key = (word1, word2, year)
emit(key, count)
def reduce(key, values):
emit(key, sum(values))
All source code available on GitHub:
https://github.com/laserson/rock-health-python
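Before the framework tour, it can help to see this model run end to end. The following is a minimal single-process sketch (mine, not from the repo above): sorted() plus itertools.groupby() stand in for Hadoop's shuffle, and the sample records are invented for illustration.

from itertools import groupby
from operator import itemgetter

# Invented sample records: (ngram, year, count)
records = [
    ("the cat", 1999, 14),
    ("cat the", 1999, 7002),
    ("the cheshire cat", 1999, 563),
]

def map_fn(record):
    (ngram, year, count) = record
    words = ngram.split()
    # ensure word1 is the lexicographically first word
    (word1, word2) = sorted([words[0], words[-1]])
    yield ((word1, word2, year), count)

# map stage
pairs = [kv for record in records for kv in map_fn(record)]
# "shuffle": bring identical keys together
pairs.sort(key=itemgetter(0))
# reduce stage
for key, group in groupby(pairs, key=itemgetter(0)):
    print(key, sum(count for _, count in group))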
Native Java
28
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class NgramsDriver extends Configured implements Tool {
public int run(String[] args) throws Exception {
Job job = new Job(getConf());
job.setJarByClass(getClass());
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(NgramsMapper.class);
job.setCombinerClass(NgramsReducer.class);
job.setReducerClass(NgramsReducer.class);
job.setOutputKeyClass(TextTriple.class);
job.setOutputValueClass(IntWritable.class);
job.setNumReduceTasks(10);
return job.waitForCompletion(true) ? 0 : 1;
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new NgramsDriver(), args);
System.exit(exitCode);
}
}
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.log4j.Logger;
public class NgramsMapper extends Mapper<LongWritable, Text, TextTriple, IntWritable> {
private Logger LOG = Logger.getLogger(getClass());
private int expectedTokens;
@Override
protected void setup(Context context) throws IOException, InterruptedException {
String inputFile = ((FileSplit) context.getInputSplit()).getPath().getName();
LOG.info("inputFile: " + inputFile);
Pattern c = Pattern.compile("(\\d+)gram");
Matcher m = c.matcher(inputFile);
m.find();
expectedTokens = Integer.parseInt(m.group(1));
return;
}
@Override
public void map(LongWritable key, Text value, Context context)
throws IOException, InterruptedException {
String[] data = value.toString().split("\t");
if (data.length < 3) {
return;
}
String[] ngram = data[0].split("\\s+");
String year = data[1];
IntWritable count = new IntWritable(Integer.parseInt(data[2]));
if (ngram.length != this.expectedTokens) {
return;
}
// build keyOut
List<String> triple = new ArrayList<String>(3);
triple.add(ngram[0]);
triple.add(ngram[expectedTokens - 1]);
Collections.sort(triple);
triple.add(year);
TextTriple keyOut = new TextTriple(triple);
context.write(keyOut, count);
}
}
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;
public class NgramsReducer extends Reducer<TextTriple, IntWritable, TextTriple, IntWritable> {
@Override
protected void reduce(TextTriple key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable value : values) {
sum += value.get();
}
context.write(key, new IntWritable(sum));
}
}
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
public class TextTriple implements WritableComparable<TextTriple> {
private Text first;
private Text second;
private Text third;
public TextTriple() {
set(new Text(), new Text(), new Text());
}
public TextTriple(List<String> list) {
set(new Text(list.get(0)),
new Text(list.get(1)),
new Text(list.get(2)));
}
public void set(Text first, Text second, Text third) {
this.first = first;
this.second = second;
this.third = third;
}
public void write(DataOutput out) throws IOException {
first.write(out);
second.write(out);
third.write(out);
}
public void readFields(DataInput in) throws IOException {
first.readFields(in);
second.readFields(in);
third.readFields(in);
}
@Override
public int hashCode() {
return first.hashCode() * 163 + second.hashCode() * 31 + third.hashCode();
}
@Override
public boolean equals(Object obj) {
if (obj instanceof TextTriple) {
TextTriple tt = (TextTriple) obj;
return first.equals(tt.first) && second.equals(tt.second) && third.equals(tt.third);
}
return false;
}
@Override
public String toString() {
return first + "\t" + second + "\t" + third;
}
public int compareTo(TextTriple other) {
int comp = first.compareTo(other.first);
if (comp != 0) {
return comp;
}
comp = second.compareTo(other.second);
if (comp != 0) {
return comp;
}
return third.compareTo(other.third);
}
}
Native Java
• Maximum flexibility
• Fastest performance
• Native to Hadoop
• Most difficult to write
29
Hadoop Streaming
30
hadoop jar hadoop-streaming-*.jar \
    -input path/to/input \
    -output path/to/output \
    -mapper "grep WARN"
Hadoop Streaming: features
• Canonical method for using any executable as
mapper/reducer
• Includes shell commands, like grep
• Transparent communication with Hadoop through
stdin/stdout
• Key boundaries manually detected in reducer
• Built-in with Hadoop: should require no additional
framework installation
• Developer must decide how to encode more
complicated objects (e.g., JSON) or binary data
31
Hadoop Streaming
32
goto code
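The deck jumps to external code here; as a stand-in, a minimal sketch of the n-gram job as a Streaming mapper/reducer pair (my sketch, not necessarily the repo's version; it assumes tab-separated (ngram, year, count, ...) records as in the Java listing above):

#!/usr/bin/env python
# mapper.py -- reads raw n-gram records from stdin, emits
# tab-separated key/value lines for the shuffle.
import sys

for line in sys.stdin:
    data = line.rstrip("\n").split("\t")
    if len(data) < 3:
        continue
    ngram = data[0].split()
    (word1, word2) = sorted([ngram[0], ngram[-1]])
    print("%s\t%s\t%s\t%s" % (word1, word2, data[1], data[2]))

#!/usr/bin/env python
# reducer.py -- Streaming delivers sorted key/value lines;
# key boundaries must be detected manually.
import sys

(last_key, total) = (None, 0)
for line in sys.stdin:
    (word1, word2, year, count) = line.rstrip("\n").split("\t")
    key = (word1, word2, year)
    if last_key is not None and key != last_key:
        print("%s\t%s\t%s\t%d" % (last_key + (total,)))
        total = 0
    last_key = key
    total += int(count)
if last_key is not None:
    print("%s\t%s\t%s\t%d" % (last_key + (total,)))

These would be launched with the hadoop jar invocation shown above, passing the scripts via -mapper, -reducer, and -file.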
mrjob
33
from mrjob.job import MRJob
from mrjob.protocol import RawProtocol

class NgramNeighbors(MRJob):
    # specify input/intermed/output serialization
    # default output protocol is JSON; here we set it to text
    OUTPUT_PROTOCOL = RawProtocol

    def mapper(self, key, line):
        pass

    def combiner(self, key, counts):
        pass

    def reducer(self, key, counts):
        pass

if __name__ == '__main__':
    # sets up a runner, based on command line options
    NgramNeighbors.run()
mrjob: features
• Abstracted MapReduce interface
• Handles complex Python objects
• Multi-step MapReduce workflows
• Extremely tight AWS integration
• Easily choose to run locally, on Hadoop cluster, or on
EMR
• Actively developed; great documentation
34
mrjob
35
goto code
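For reference, a hedged sketch of what the skeleton above might look like filled in (the mapper/reducer bodies are mine, not the repo's):

from mrjob.job import MRJob
from mrjob.protocol import RawProtocol

class NgramNeighbors(MRJob):
    OUTPUT_PROTOCOL = RawProtocol

    def mapper(self, _, line):
        # input: ngram TAB year TAB count [TAB ...]
        data = line.split('\t')
        if len(data) < 3:
            return
        ngram = data[0].split()
        (word1, word2) = sorted([ngram[0], ngram[-1]])
        yield (word1, word2, data[1]), int(data[2])

    def combiner(self, key, counts):
        yield key, sum(counts)

    def reducer(self, key, counts):
        # RawProtocol expects string key and value on output
        yield '\t'.join(key), str(sum(counts))

if __name__ == '__main__':
    NgramNeighbors.run()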
mrjob: serialization
36
class MyMRJob(mrjob.job.MRJob):
    INPUT_PROTOCOL = mrjob.protocol.RawValueProtocol
    INTERNAL_PROTOCOL = mrjob.protocol.JSONProtocol
    OUTPUT_PROTOCOL = mrjob.protocol.JSONProtocol
(these three are the defaults)

Available protocols:
RawProtocol / RawValueProtocol
JSONProtocol / JSONValueProtocol
PickleProtocol / PickleValueProtocol
ReprProtocol / ReprValueProtocol
Custom protocols can be written.
No current support for binary serialization schemes.
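A custom protocol is just an object with read() and write() methods; a minimal sketch (the tab-separated format here is an arbitrary choice for illustration):

class TabSeparatedProtocol(object):
    """Hypothetical custom mrjob protocol: key and value as
    tab-separated text."""

    def read(self, line):
        # one input line in -> (key, value) out
        key, value = line.split('\t', 1)
        return key, value

    def write(self, key, value):
        # (key, value) in -> one output line out
        return '%s\t%s' % (key, value)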
luigi
• Full-fledged workflow management, task
scheduling, dependency resolution tool in Python
(similar to Apache Oozie)
• Built-in support for Hadoop by wrapping Streaming
• Not as fully-featured as mrjob for Hadoop, but easily
customizable
• Internal serialization through repr/eval
• Actively developed at Spotify
• README is good but documentation is lacking
37
luigi
38
goto code
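Again the deck defers to external code; a hedged sketch of the n-gram job as a luigi JobTask, assuming the luigi API current at the time of this talk (luigi.hadoop and luigi.hdfs later moved under luigi.contrib):

import luigi
import luigi.hadoop
import luigi.hdfs

class InputText(luigi.ExternalTask):
    """N-gram data already sitting in HDFS."""
    path = luigi.Parameter()

    def output(self):
        return luigi.hdfs.HdfsTarget(self.path)

class NgramNeighbors(luigi.hadoop.JobTask):
    source = luigi.Parameter()
    destination = luigi.Parameter()

    def requires(self):
        return InputText(self.source)

    def output(self):
        return luigi.hdfs.HdfsTarget(self.destination)

    def mapper(self, line):
        data = line.split('\t')
        if len(data) < 3:
            return
        ngram = data[0].split()
        (word1, word2) = sorted([ngram[0], ngram[-1]])
        yield (word1, word2, data[1]), int(data[2])

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    luigi.run()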
The cluster used for benchmarking
• 5 virtual machines
• 4 CPUs
• 10 GB RAM
• 100 GB disk
• CentOS 6.2
• CDH4 (Hadoop 2)
• 20 map tasks
• 10 reduce tasks
• Python 2.6
39
(Unscientific) performance comparison
40
(Unscientific) performance comparison
41
[Chart annotation: Streaming has the lowest overhead]
(Unscientific) performance comparison
42
[Chart annotation: JSON SerDe overhead]
Feature comparison
43
Feature comparison
44
45
Questions?
What is Spark?
• Started in 2009 as an academic project in the AMPLab at
UC Berkeley; now an ASF project with >100 contributors
• In-memory distributed execution engine
• Operates on Resilient Distributed Datasets (RDDs)
• Provides richer distributed computing primitives for
various problems
• Can support SQL, stream processing, ML, graph
computation
• Supports Scala, Java, and Python
48
Spark uses a general DAG scheduler
• Application-aware scheduler
• Uses locality for both disk and memory
• Partitioning-aware to avoid shuffles
• Can rewrite and optimize the graph based on analysis
[Figure: example RDD lineage DAG over RDDs A–G (join, union, groupBy, map), cut into Stages 1–3 at shuffle boundaries; shaded boxes mark cached data partitions]
Operations on RDDs
50
[Table: RDD transformations and actions (Zaharia 2011)]
Apache Spark
51
file = spark.textFile("hdfs://...")
errors = file.filter(lambda line: "ERROR" in line)
# Count all the errors
errors.count()
# Count errors mentioning MySQL
errors.filter(lambda line: "MySQL" in line).count()
# Fetch the MySQL errors as an array of strings
errors.filter(lambda line: "MySQL" in line).collect()
val points = spark.textFile(...).map(parsePoint).cache()
var w = Vector.random(D) // current separating plane
for (i <- 1 to ITERATIONS) {
val gradient = points.map(p =>
(1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
).reduce(_ + _)
w -= gradient
}
println("Final separating plane: " + w)
Log filtering (Python)
Logistic regression (Scala)
Apache Spark
52
goto code
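As a stand-in for the external code, the n-gram aggregation sketched in PySpark (my sketch; sc is the SparkContext the pyspark shell provides, and the HDFS paths are placeholders):

def parse(line):
    data = line.split('\t')
    ngram = data[0].split()
    (word1, word2) = sorted([ngram[0], ngram[-1]])
    return ((word1, word2, data[1]), int(data[2]))

counts = (sc.textFile("hdfs:///path/to/ngrams")
            .map(parse)
            .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("hdfs:///path/to/output")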
What’s Impala?
• Interactive SQL
• Typically 4-65x faster than the latest Hive (observed up to 100x faster)
• Responses in seconds instead of minutes (sometimes sub-second)
• ANSI-92 standard SQL queries with HiveQL
• Compatible SQL interface for existing Hadoop/CDH applications
• Based on industry standard SQL
• Natively on Hadoop/HBase storage and metadata
• Flexibility, scale, and cost advantages of Hadoop
• No duplication/synchronization of data and metadata
• Local processing to avoid network bottlenecks
• Separate runtime from batch processing
• Hive, Pig, MapReduce are designed and great for batch
• Impala is purpose-built for low-latency SQL queries on Hadoop
53
Cloudera Impala
54
SELECT cosmic as snp_id,
vcf_chrom as chr,
vcf_pos as pos,
sample_id as sample,
vcf_call_gt as genotype,
sample_affection as phenotype
FROM
hg19_parquet_snappy_join_cached_partitioned
WHERE
COSMIC IS NOT NULL AND
dbSNP IS NULL AND
sample_study = "breast_cancer" AND
VCF_CHROM = "16";
Impala Architecture: Planner
• Example: query with join and aggregation
SELECT state, SUM(revenue)
FROM HdfsTbl h JOIN HbaseTbl b ON (...)
GROUP BY 1 ORDER BY 2 desc LIMIT 10
[Figure: distributed query plan — HDFS scan and HBase scan feed a hash join with pre-aggregation at the DataNodes and region servers; exchanges stream partial results to the coordinator for the final aggregation and TopN]
55
Impala User-defined Functions (UDFs)
• Tuple => Scalar value
• Substring
• sin, cos, pow, …
• Machine-learning models
• Supports Hive UDFs (Java)
• Highly unpleasant
• Impala (native) UDFs
• C++ interface designed for efficiency
• Similar to Postgres UDFs
• Runs any LLVM-compiled code
56
LLVM compiler infrastructure
57
LLVM: C++ example
58
bool StringEq(FunctionContext* context,
const StringVal& arg1,
const StringVal& arg2) {
if (arg1.is_null != arg2.is_null)
return false;
if (arg1.is_null)
return true;
if (arg1.len != arg2.len)
return false;
return (arg1.ptr == arg2.ptr) ||
memcmp(arg1.ptr, arg2.ptr, arg1.len) == 0;
}
LLVM: IR output
59
; ModuleID = '<stdin>'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx10.7.0"
%"class.impala_udf::FunctionContext" = type { %"class.impala::FunctionContextImpl"* }
%"class.impala::FunctionContextImpl" = type opaque
%"struct.impala_udf::StringVal" = type { %"struct.impala_udf::AnyVal", i32, i8* }
%"struct.impala_udf::AnyVal" = type { i8 }
; Function Attrs: nounwind readonly ssp uwtable
define zeroext i1 @_Z8StringEqPN10impala_udf15FunctionContextERKNS_9StringValES4_(%"class.impala_udf::FunctionContext"* nocapture %context, %"struct.impala_udf::StringVal"*
nocapture %arg1, %"struct.impala_udf::StringVal"* nocapture %arg2) #0 {
entry:
%is_null = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg1, i64 0, i32 0, i32 0
%0 = load i8* %is_null, align 1, !tbaa !0, !range !3
%is_null1 = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg2, i64 0, i32 0, i32 0
%1 = load i8* %is_null1, align 1, !tbaa !0, !range !3
%cmp = icmp eq i8 %0, %1
br i1 %cmp, label %if.end, label %return
if.end: ; preds = %entry
%tobool = icmp eq i8 %0, 0
br i1 %tobool, label %if.end7, label %return
if.end7: ; preds = %if.end
%len = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg1, i64 0, i32 1
%2 = load i32* %len, align 4, !tbaa !4
%len8 = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg2, i64 0, i32 1
%3 = load i32* %len8, align 4, !tbaa !4
%cmp9 = icmp eq i32 %2, %3
br i1 %cmp9, label %if.end11, label %return
if.end11: ; preds = %if.end7
%ptr = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg1, i64 0, i32 2
%4 = load i8** %ptr, align 8, !tbaa !5
%ptr12 = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg2, i64 0, i32 2
%5 = load i8** %ptr12, align 8, !tbaa !5
%cmp13 = icmp eq i8* %4, %5
br i1 %cmp13, label %return, label %lor.rhs
lor.rhs: ; preds = %if.end11
%conv17 = sext i32 %2 to i64
%call = tail call i32 @memcmp(i8* %4, i8* %5, i64 %conv17)
%cmp18 = icmp eq i32 %call, 0
br label %return
LLVM compiler infrastructure
60
[Diagram: Numba as a Python frontend to LLVM]
Iris data and BigML
61
def predict_species_orig(sepal_width=None,
                         petal_length=None,
                         petal_width=None):
    """ Predictor for species from model/52952081035d07727e01d836
    Predictive model by BigML - Machine Learning Made Easy
    """
    if (petal_width is None):
        return u'Iris-virginica'
    if (petal_width > 0.8):
        if (petal_width <= 1.75):
            if (petal_length is None):
                return u'Iris-versicolor'
            if (petal_length > 4.95):
                if (petal_width <= 1.55):
                    return u'Iris-virginica'
                if (petal_width > 1.55):
                    if (petal_length > 5.45):
                        return u'Iris-virginica'
                    if (petal_length <= 5.45):
                        return u'Iris-versicolor'
            if (petal_length <= 4.95):
                if (petal_width <= 1.65):
                    return u'Iris-versicolor'
                if (petal_width > 1.65):
                    return u'Iris-virginica'
        if (petal_width > 1.75):
            if (petal_length is None):
                return u'Iris-virginica'
            if (petal_length > 4.85):
                return u'Iris-virginica'
            if (petal_length <= 4.85):
                if (sepal_width is None):
                    return u'Iris-virginica'
                if (sepal_width <= 3.1):
                    return u'Iris-virginica'
                if (sepal_width > 3.1):
                    return u'Iris-versicolor'
    if (petal_width <= 0.8):
        return u'Iris-setosa'
Impala + Numba
62
goto code
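The Impala+Numba integration was pre-alpha at the time (next slide), so there is no stable API to show; as a stand-in, a sketch of the Numba half alone: numba.jit compiles a plain Python function to LLVM-backed native code on first call. The function and inputs here are invented for illustration.

import numpy as np
from numba import jit

@jit
def near_count(years, counts, lo, hi):
    # sum the counts whose year falls in [lo, hi]
    total = 0
    for i in range(len(years)):
        if lo <= years[i] <= hi:
            total += counts[i]
    return total

years = np.array([1999, 2000, 2001, 2008])
counts = np.array([14, 7002, 563, 220])
print(near_count(years, counts, 2000, 2008))  # 7785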
Impala + Numba
• Still pre-alpha
• Significantly faster execution thanks to native LLVM
• Significantly easier to write UDFs
63
64
Conclusions
65
If you have access to a Hadoop cluster and you want a
one-off quick-and-dirty job…
Hadoop Streaming
66
If you want an expressive Pythonic interface to build
complex, regular ETL workflows…
Luigi
67
If you want to integrate Hadoop with other regular
processes…
Luigi
68
If you don’t have access to Hadoop and want to try
stuff out…
mrjob
69
If you’re heavily using AWS…
mrjob
70
If you want to work interactively…
PySpark
71
If you want to do in-memory analytics…
PySpark
72
If you want to do anything…*
PySpark
73
If you want the ease of Python with high performance…
Impala + Numba
74
If you want to write Python UDFs for SQL queries…
Impala + Numba
75
Code:
https://github.com/laserson/rock-health-python
Blog post:
http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/
Slides:
http://www.slideshare.net/urilaserson/
76

More Related Content

What's hot

20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introductionXuan-Chao Huang
 
Mutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable WorldMutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable WorldLester Martin
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceeakasit_dpu
 
Serverless Data Platform
Serverless Data PlatformServerless Data Platform
Serverless Data PlatformShu-Jeng Hsieh
 
A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark ClustersA Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark ClustersDataWorks Summit/Hadoop Summit
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?sudhakara st
 
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim BaltagiHadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim BaltagiSlim Baltagi
 
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemWhy Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemCloudera, Inc.
 
August 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieAugust 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieYahoo Developer Network
 
Securing data in hybrid environments using Apache Ranger
Securing data in hybrid environments using Apache RangerSecuring data in hybrid environments using Apache Ranger
Securing data in hybrid environments using Apache RangerDataWorks Summit
 
Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingGreat Wide Open
 
Intro To Hadoop
Intro To HadoopIntro To Hadoop
Intro To HadoopBill Graham
 
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Uwe Printz
 
GDPR compliance application architecture and implementation using Hadoop and ...
GDPR compliance application architecture and implementation using Hadoop and ...GDPR compliance application architecture and implementation using Hadoop and ...
GDPR compliance application architecture and implementation using Hadoop and ...DataWorks Summit
 
Deep Learning with Spark and GPUs
Deep Learning with Spark and GPUsDeep Learning with Spark and GPUs
Deep Learning with Spark and GPUsDataWorks Summit
 
Top Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for FresherTop Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for FresherJanBask Training
 
Faster Faster Faster! Datamarts with Hive at Yahoo
Faster Faster Faster! Datamarts with Hive at YahooFaster Faster Faster! Datamarts with Hive at Yahoo
Faster Faster Faster! Datamarts with Hive at YahooMithun Radhakrishnan
 

What's hot (19)

20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction
 
LinkedIn
LinkedInLinkedIn
LinkedIn
 
Mutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable WorldMutable Data in Hive's Immutable World
Mutable Data in Hive's Immutable World
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
 
Serverless Data Platform
Serverless Data PlatformServerless Data Platform
Serverless Data Platform
 
A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark ClustersA Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim BaltagiHadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
 
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemWhy Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
 
August 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieAugust 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache Oozie
 
Securing data in hybrid environments using Apache Ranger
Securing data in hybrid environments using Apache RangerSecuring data in hybrid environments using Apache Ranger
Securing data in hybrid environments using Apache Ranger
 
Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed Debugging
 
Intro To Hadoop
Intro To HadoopIntro To Hadoop
Intro To Hadoop
 
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
Introduction to the Hadoop Ecosystem (IT-Stammtisch Darmstadt Edition)
 
GDPR compliance application architecture and implementation using Hadoop and ...
GDPR compliance application architecture and implementation using Hadoop and ...GDPR compliance application architecture and implementation using Hadoop and ...
GDPR compliance application architecture and implementation using Hadoop and ...
 
Enterprise Grade Streaming under 2ms on Hadoop
Enterprise Grade Streaming under 2ms on HadoopEnterprise Grade Streaming under 2ms on Hadoop
Enterprise Grade Streaming under 2ms on Hadoop
 
Deep Learning with Spark and GPUs
Deep Learning with Spark and GPUsDeep Learning with Spark and GPUs
Deep Learning with Spark and GPUs
 
Top Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for FresherTop Hadoop Big Data Interview Questions and Answers for Fresher
Top Hadoop Big Data Interview Questions and Answers for Fresher
 
Faster Faster Faster! Datamarts with Hive at Yahoo
Faster Faster Faster! Datamarts with Hive at YahooFaster Faster Faster! Datamarts with Hive at Yahoo
Faster Faster Faster! Datamarts with Hive at Yahoo
 

Viewers also liked

Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with PythonDonald Miner
 
How to find the current active namenode in a Hadoop High Availability cluster
How to find the current active namenode in a Hadoop High Availability clusterHow to find the current active namenode in a Hadoop High Availability cluster
How to find the current active namenode in a Hadoop High Availability clusterDevopam Mittra
 
Python for Big Data Analytics
Python for Big Data AnalyticsPython for Big Data Analytics
Python for Big Data AnalyticsEdureka!
 
Python for Big Data Analytics
Python for Big Data AnalyticsPython for Big Data Analytics
Python for Big Data AnalyticsEdureka!
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Julien Le Dem
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Cloudera, Inc.
 

Viewers also liked (6)

Hadoop with Python
Hadoop with PythonHadoop with Python
Hadoop with Python
 
How to find the current active namenode in a Hadoop High Availability cluster
How to find the current active namenode in a Hadoop High Availability clusterHow to find the current active namenode in a Hadoop High Availability cluster
How to find the current active namenode in a Hadoop High Availability cluster
 
Python for Big Data Analytics
Python for Big Data AnalyticsPython for Big Data Analytics
Python for Big Data Analytics
 
Python for Big Data Analytics
Python for Big Data AnalyticsPython for Big Data Analytics
Python for Big Data Analytics
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
 

Similar to Python in the Hadoop Ecosystem (Rock Health presentation)

Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesCorley S.r.l.
 
Provenance for Data Munging Environments
Provenance for Data Munging EnvironmentsProvenance for Data Munging Environments
Provenance for Data Munging EnvironmentsPaul Groth
 
Mining Whole Museum Collections Datasets for Expanding Understanding of Colle...
Mining Whole Museum Collections Datasets for Expanding Understanding of Colle...Mining Whole Museum Collections Datasets for Expanding Understanding of Colle...
Mining Whole Museum Collections Datasets for Expanding Understanding of Colle...Matthew J Collins
 
Apache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringApache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringBADR
 
PuppetConf 2017: How People Actually Write Puppet- Gareth Rushgrove, Puppet
PuppetConf 2017: How People Actually Write Puppet- Gareth Rushgrove, PuppetPuppetConf 2017: How People Actually Write Puppet- Gareth Rushgrove, Puppet
PuppetConf 2017: How People Actually Write Puppet- Gareth Rushgrove, PuppetPuppet
 
GDG Cloud Taipei meetup #50 - Build go kit microservices at kubernetes with ...
GDG Cloud Taipei meetup #50 - Build go kit microservices at kubernetes  with ...GDG Cloud Taipei meetup #50 - Build go kit microservices at kubernetes  with ...
GDG Cloud Taipei meetup #50 - Build go kit microservices at kubernetes with ...KAI CHU CHUNG
 
High-level Programming Languages: Apache Pig and Pig Latin
High-level Programming Languages: Apache Pig and Pig LatinHigh-level Programming Languages: Apache Pig and Pig Latin
High-level Programming Languages: Apache Pig and Pig LatinPietro Michiardi
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
A Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataA Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataRobert Grossman
 
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioAlluxio, Inc.
 
Interview questions on Apache spark [part 2]
Interview questions on Apache spark [part 2]Interview questions on Apache spark [part 2]
Interview questions on Apache spark [part 2]knowbigdata
 
Reproducible Computational Research in R
Reproducible Computational Research in RReproducible Computational Research in R
Reproducible Computational Research in RSamuel Bosch
 
Golang and Domain Specific Languages - Lorenzo Fontana - Codemotion Rome 2017
Golang and Domain Specific Languages - Lorenzo Fontana - Codemotion Rome 2017Golang and Domain Specific Languages - Lorenzo Fontana - Codemotion Rome 2017
Golang and Domain Specific Languages - Lorenzo Fontana - Codemotion Rome 2017Codemotion
 
Big data using Hadoop, Hive, Sqoop with Installation
Big data using Hadoop, Hive, Sqoop with InstallationBig data using Hadoop, Hive, Sqoop with Installation
Big data using Hadoop, Hive, Sqoop with Installationmellempudilavanya999
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingTill Rohrmann
 
Cosmos, Big Data GE implementation in FIWARE
Cosmos, Big Data GE implementation in FIWARECosmos, Big Data GE implementation in FIWARE
Cosmos, Big Data GE implementation in FIWAREFernando Lopez Aguilar
 

Similar to Python in the Hadoop Ecosystem (Rock Health presentation) (20)

Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting Languages
 
Provenance for Data Munging Environments
Provenance for Data Munging EnvironmentsProvenance for Data Munging Environments
Provenance for Data Munging Environments
 
Mining Whole Museum Collections Datasets for Expanding Understanding of Colle...
Mining Whole Museum Collections Datasets for Expanding Understanding of Colle...Mining Whole Museum Collections Datasets for Expanding Understanding of Colle...
Mining Whole Museum Collections Datasets for Expanding Understanding of Colle...
 
Apache Hadoop - Big Data Engineering
Apache Hadoop - Big Data EngineeringApache Hadoop - Big Data Engineering
Apache Hadoop - Big Data Engineering
 
PuppetConf 2017: How People Actually Write Puppet- Gareth Rushgrove, Puppet
PuppetConf 2017: How People Actually Write Puppet- Gareth Rushgrove, PuppetPuppetConf 2017: How People Actually Write Puppet- Gareth Rushgrove, Puppet
PuppetConf 2017: How People Actually Write Puppet- Gareth Rushgrove, Puppet
 
GDG Cloud Taipei meetup #50 - Build go kit microservices at kubernetes with ...
GDG Cloud Taipei meetup #50 - Build go kit microservices at kubernetes  with ...GDG Cloud Taipei meetup #50 - Build go kit microservices at kubernetes  with ...
GDG Cloud Taipei meetup #50 - Build go kit microservices at kubernetes with ...
 
Go. Why it goes
Go. Why it goesGo. Why it goes
Go. Why it goes
 
Go. why it goes v2
Go. why it goes v2Go. why it goes v2
Go. why it goes v2
 
Fuzzing - Part 1
Fuzzing - Part 1Fuzzing - Part 1
Fuzzing - Part 1
 
High-level Programming Languages: Apache Pig and Pig Latin
High-level Programming Languages: Apache Pig and Pig LatinHigh-level Programming Languages: Apache Pig and Pig Latin
High-level Programming Languages: Apache Pig and Pig Latin
 
Handout3o
Handout3oHandout3o
Handout3o
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
A Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataA Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate Data
 
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
 
Interview questions on Apache spark [part 2]
Interview questions on Apache spark [part 2]Interview questions on Apache spark [part 2]
Interview questions on Apache spark [part 2]
 
Reproducible Computational Research in R
Reproducible Computational Research in RReproducible Computational Research in R
Reproducible Computational Research in R
 
Golang and Domain Specific Languages - Lorenzo Fontana - Codemotion Rome 2017
Golang and Domain Specific Languages - Lorenzo Fontana - Codemotion Rome 2017Golang and Domain Specific Languages - Lorenzo Fontana - Codemotion Rome 2017
Golang and Domain Specific Languages - Lorenzo Fontana - Codemotion Rome 2017
 
Big data using Hadoop, Hive, Sqoop with Installation
Big data using Hadoop, Hive, Sqoop with InstallationBig data using Hadoop, Hive, Sqoop with Installation
Big data using Hadoop, Hive, Sqoop with Installation
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 
Cosmos, Big Data GE implementation in FIWARE
Cosmos, Big Data GE implementation in FIWARECosmos, Big Data GE implementation in FIWARE
Cosmos, Big Data GE implementation in FIWARE
 

More from Uri Laserson

Petascale Genomics (Strata Singapore 20151203)
Petascale Genomics (Strata Singapore 20151203)Petascale Genomics (Strata Singapore 20151203)
Petascale Genomics (Strata Singapore 20151203)Uri Laserson
 
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Uri Laserson
 
Genomics Is Not Special: Towards Data Intensive Biology
Genomics Is Not Special: Towards Data Intensive BiologyGenomics Is Not Special: Towards Data Intensive Biology
Genomics Is Not Special: Towards Data Intensive BiologyUri Laserson
 
APIs and Synthetic Biology
APIs and Synthetic BiologyAPIs and Synthetic Biology
APIs and Synthetic BiologyUri Laserson
 
Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14)
Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14)Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14)
Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14)Uri Laserson
 
Hadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreHadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreUri Laserson
 

More from Uri Laserson (6)

Petascale Genomics (Strata Singapore 20151203)
Petascale Genomics (Strata Singapore 20151203)Petascale Genomics (Strata Singapore 20151203)
Petascale Genomics (Strata Singapore 20151203)
 
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)
 
Genomics Is Not Special: Towards Data Intensive Biology
Genomics Is Not Special: Towards Data Intensive BiologyGenomics Is Not Special: Towards Data Intensive Biology
Genomics Is Not Special: Towards Data Intensive Biology
 
APIs and Synthetic Biology
APIs and Synthetic BiologyAPIs and Synthetic Biology
APIs and Synthetic Biology
 
Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14)
Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14)Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14)
Numba-compiled Python UDFs for Impala (Impala Meetup 5/20/14)
 
Hadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreHadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant Store
 

Recently uploaded

Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 

Recently uploaded (20)

Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 

Python in the Hadoop Ecosystem (Rock Health presentation)

  • 1. 1 A Guide to Python Frameworks for Hadoop Uri Laserson laserson@cloudera.com 20 March 2014
  • 2. Goals for today 1. Easy to jump into Hadoop with Python 2. Describe 5 ways to use Python with Hadoop, batch and interactive 3. Guidelines for choosing Python framework 2
  • 4. About the speaker • Joined Cloudera late 2012 • Focus on life sciences/medical • PhD in BME/computational biology at MIT/Harvard (2005-2012) • Focused on genomics • Cofounded Good Start Genetics (2007-) • Applying next-gen DNA sequencing to genetic carrier screening 4
  • 5. About the speaker • No formal training in computer science • Never touched Java • Almost all work using Python 5
  • 6. 6
  • 7. Python frameworks for Hadoop • Hadoop Streaming • mrjob (Yelp) • dumbo • Luigi (Spotify) • hadoopy • pydoop • PySpark • happy • Disco • octopy • Mortar Data • Pig UDF/Jython • hipy • Impala + Numba 7
  • 8. Goals for Python framework 1. “Pseudocodiness”/simplicity 2. Flexibility/generality 3. Ease of use/installation 4. Performance 8
  • 9. Python frameworks for Hadoop • Hadoop Streaming • mrjob (Yelp) • dumbo • Luigi (Spotify) • hadoopy • pydoop • PySpark • happy • Disco • octopy • Mortar Data • Pig UDF/Jython • hipy • Impala + Numba 9
  • 10. Python frameworks for Hadoop • Hadoop Streaming • mrjob (Yelp) • dumbo • Luigi (Spotify) • hadoopy • pydoop • PySpark • happy abandoned? Jython-based • Disco not Hadoop • octopy not serious/not Hadoop • Mortar Data HaaS; support numpy, scipy, nltk, pip-installable in UDF • Pig UDF/Jython Pig is another talk; Jython limited • hipy Python syntactic sugar to construct Hive queries • Impala + Numba 10
  • 11. 11 An n-gram is a tuple of n words. Problem: aggregating the Google n-gram data http://books.google.com/ngrams
  • 12. 12 An n-gram is a tuple of n words. Problem: aggregating the Google n-gram data http://books.google.com/ngrams 1 2 3 4 5 6 7 8 ( ) 8-gram
  • 13. 13 "A partial differential equation is an equation that contains partial derivatives."
  • 14. 14 A partial differential equation is an equation that contains partial derivatives. A 1 partial 2 differential 1 equation 2 is 1 an 1 that 1 contains 1 derivatives. 1 1-grams
  • 15. 15 A partial differential equation is an equation that contains partial derivatives. A partial 1 partial differential 1 differential equation 1 equation is 1 is an 1 an equation 1 equation that 1 that contains 1 contains partial 1 partial derivatives. 1 2-grams
  • 16. 16 A partial differential equation is an equation that contains partial derivatives. A partial differential equation is 1 partial differential equation is an 1 differential equation is an equation 1 equation is an equation that 1 is an equation that contains 1 an equation that contains partial 1 equation that contains partial derivatives. 1 5-grams
  • 17. 17
  • 19. 19 flourished in 1993 2 2 2 flourished in 1998 2 2 1 flourished in 1999 6 6 4 flourished in 2000 5 5 5 flourished in 2001 1 1 1 flourished in 2002 7 7 3 flourished in 2003 9 9 4 flourished in 2004 22 21 13 flourished in 2005 37 37 22 flourished in 2006 55 55 38 flourished in 2007 99 98 76 flourished in 2008 220 215 118 fluid of 1899 2 2 1 fluid of 2000 3 3 1 fluid of 2002 2 1 1 fluid of 2003 3 3 1 fluid of 2004 3 3 3 2-gram year matches pages volumes
  • 20. 20 Compute how often two words are near each other in a given year. Two words are “near” if they are both present in a 2-, 3-, 4-, or 5-gram.
  • 21. 21 ...2-grams... (cat, the) 1999 14 (the, cat) 1999 7002 ...3-grams... (the, cheshire, cat) 1999 563 ...4-grams... ...5-grams... (the, cat, in, the, hat) 1999 1023 (the, dog, chased, the, cat) 1999 403 (cat, is, one, of, the) 1999 24 (cat, the) 1999 8006 (hat, the) 1999 1023 raw data aggregated results lexicographic ordering internal n-grams counted by smaller n-grams: • avoids double-counting • increases sensitivity (observed at least 40 times)
  • 22. What is Hadoop? • Ecosystem of tools • Core is the HDFS file system • Downloadable set of jars that can be run on any machine 22
  • 23. HDFS design assumptions • Based on Google File System • Files are large (GBs to TBs) • Failures are common • Massive scale means failures very likely • Disk, node, or network failures • Accesses are large and sequential • Files are append-only 23
  • 24. HDFS properties • Fault-tolerant • Gracefully responds to node/disk/network failures • Horizontally scalable • Low marginal cost • High-bandwidth 24 1 2 3 4 5 2 4 5 1 2 5 1 3 4 2 3 5 1 3 4 Input File HDFS storage distribution Node A Node B Node C Node D Node E
  • 26. MapReduce computation • Structured as 1. Embarrassingly parallel “map stage” 2. Cluster-wide distributed sort (“shuffle”) 3. Aggregation “reduce stage” • Data-locality: process the data where it is stored • Fault-tolerance: failed tasks automatically detected and restarted • Schema-on-read: data must not be stored conforming to rigid schema 26
  • 27. Pseudocode for MapReduce 27 def map(record): (ngram, year, count) = unpack(record) // ensure word1 has the lexicographically first word: (word1, word2) = sorted(ngram[first], ngram[last]) key = (word1, word2, year) emit(key, count) def reduce(key, values): emit(key, sum(values)) All source code available on GitHub: https://github.com/laserson/rock-health-python
  • 28. Native Java 28

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class NgramsDriver extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        Job job = new Job(getConf());
        job.setJarByClass(getClass());
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(NgramsMapper.class);
        job.setCombinerClass(NgramsReducer.class);
        job.setReducerClass(NgramsReducer.class);
        job.setOutputKeyClass(TextTriple.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(10);
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new NgramsDriver(), args);
        System.exit(exitCode);
    }
}

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.log4j.Logger;

public class NgramsMapper extends Mapper<LongWritable, Text, TextTriple, IntWritable> {

    private Logger LOG = Logger.getLogger(getClass());

    private int expectedTokens;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // infer n from the input file name, e.g., "2" from "...2gram..."
        String inputFile = ((FileSplit) context.getInputSplit()).getPath().getName();
        LOG.info("inputFile: " + inputFile);
        Pattern c = Pattern.compile("([\\d]+)gram");
        Matcher m = c.matcher(inputFile);
        m.find();
        expectedTokens = Integer.parseInt(m.group(1));
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] data = value.toString().split("\\t");
        if (data.length < 3) {
            return;
        }
        String[] ngram = data[0].split("\\s+");
        String year = data[1];
        IntWritable count = new IntWritable(Integer.parseInt(data[2]));
        if (ngram.length != this.expectedTokens) {
            return;
        }
        // build keyOut from the sorted first and last words, plus the year
        List<String> triple = new ArrayList<String>(3);
        triple.add(ngram[0]);
        triple.add(ngram[expectedTokens - 1]);
        Collections.sort(triple);
        triple.add(year);
        TextTriple keyOut = new TextTriple(triple);
        context.write(keyOut, count);
    }
}

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class NgramsReducer extends Reducer<TextTriple, IntWritable, TextTriple, IntWritable> {

    @Override
    protected void reduce(TextTriple key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

public class TextTriple implements WritableComparable<TextTriple> {

    private Text first;
    private Text second;
    private Text third;

    public TextTriple() {
        set(new Text(), new Text(), new Text());
    }

    public TextTriple(List<String> list) {
        set(new Text(list.get(0)), new Text(list.get(1)), new Text(list.get(2)));
    }

    public void set(Text first, Text second, Text third) {
        this.first = first;
        this.second = second;
        this.third = third;
    }

    public void write(DataOutput out) throws IOException {
        first.write(out);
        second.write(out);
        third.write(out);
    }

    public void readFields(DataInput in) throws IOException {
        first.readFields(in);
        second.readFields(in);
        third.readFields(in);
    }

    @Override
    public int hashCode() {
        return first.hashCode() * 163 + second.hashCode() * 31 + third.hashCode();
    }

    @Override
    public boolean equals(Object obj) {
        if (obj instanceof TextTriple) {
            TextTriple tt = (TextTriple) obj;
            return first.equals(tt.first) && second.equals(tt.second) && third.equals(tt.third);
        }
        return false;
    }

    @Override
    public String toString() {
        return first + "\t" + second + "\t" + third;
    }

    public int compareTo(TextTriple other) {
        int comp = first.compareTo(other.first);
        if (comp != 0) {
            return comp;
        }
        comp = second.compareTo(other.second);
        if (comp != 0) {
            return comp;
        }
        return third.compareTo(other.third);
    }
}
  • 29. Native Java 29
  • Maximum flexibility
  • Fastest performance
  • Native to Hadoop
  • Most difficult to write
  • 30. Hadoop Streaming 30

hadoop jar hadoop-streaming-*.jar \
    -input path/to/input \
    -output path/to/output \
    -mapper "grep WARN"
  • 31. Hadoop Streaming: features 31
  • Canonical method for using any executable as mapper/reducer
  • Includes shell commands, like grep
  • Transparent communication with Hadoop through stdin/stdout
  • Key boundaries manually detected in reducer (see the sketch below)
  • Built-in with Hadoop: should require no additional framework installation
  • Developer must decide how to encode more complicated objects (e.g., JSON) or binary data
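To make the stdin/stdout contract and the manual key-boundary detection concrete, here is a minimal sketch of the n-gram job as two Streaming scripts. The file names mapper.py and reducer.py are our choice, and the tab-separated ngram/year/count layout matches the data shown earlier; this is illustrative, not the talk's exact code.

#!/usr/bin/env python
# mapper.py: read raw n-gram records on stdin,
# emit "word1<tab>word2<tab>year<tab>count" on stdout
import sys

for line in sys.stdin:
    data = line.rstrip('\n').split('\t')
    if len(data) < 3:
        continue
    ngram = data[0].split()
    year = data[1]
    count = data[2]
    # ensure word1 is the lexicographically first word
    word1, word2 = sorted((ngram[0], ngram[-1]))
    print('\t'.join((word1, word2, year, count)))

#!/usr/bin/env python
# reducer.py: Streaming delivers lines sorted by key,
# so we detect key boundaries manually
import sys

current_key, total = None, 0
for line in sys.stdin:
    word1, word2, year, count = line.rstrip('\n').split('\t')
    key = (word1, word2, year)
    if key != current_key and current_key is not None:
        # key boundary: flush the previous key's sum
        print('\t'.join(current_key + (str(total),)))
        total = 0
    current_key = key
    total += int(count)
if current_key is not None:
    print('\t'.join(current_key + (str(total),)))

You would run it with something like the following; the -D flag makes the first three tab-separated fields the shuffle key, so all records for one (word1, word2, year) triple reach the reducer contiguously:

hadoop jar hadoop-streaming-*.jar \
    -D stream.num.map.output.key.fields=3 \
    -input path/to/input \
    -output path/to/output \
    -mapper mapper.py -reducer reducer.py \
    -file mapper.py -file reducer.py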
  • 33. mrjob 33

from mrjob.job import MRJob
from mrjob.protocol import RawProtocol

class NgramNeighbors(MRJob):

    # specify input/intermed/output serialization
    # default output protocol is JSON; here we set it to text
    OUTPUT_PROTOCOL = RawProtocol

    def mapper(self, key, line):
        pass

    def combiner(self, key, counts):
        pass

    def reducer(self, key, counts):
        pass

if __name__ == '__main__':
    # sets up a runner, based on command line options
    NgramNeighbors.run()
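As a hedged completion of this skeleton for the n-gram job (our sketch; the body logic mirrors the pseudocode from slide 27, and mrjob handles the key grouping that raw Streaming left to us):

from mrjob.job import MRJob
from mrjob.protocol import RawProtocol

class NgramNeighbors(MRJob):

    OUTPUT_PROTOCOL = RawProtocol

    def mapper(self, key, line):
        data = line.split('\t')
        ngram = data[0].split()
        year, count = data[1], int(data[2])
        word1, word2 = sorted((ngram[0], ngram[-1]))
        yield (word1, word2, year), count

    def combiner(self, key, counts):
        # pre-aggregate on the map side to cut shuffle volume
        yield key, sum(counts)

    def reducer(self, key, counts):
        # RawProtocol wants string key/value
        word1, word2, year = key
        yield '%s\t%s\t%s' % (word1, word2, year), str(sum(counts))

if __name__ == '__main__':
    NgramNeighbors.run()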
  • 34. mrjob: features 34
  • Abstracted MapReduce interface
  • Handles complex Python objects
  • Multi-step MapReduce workflows
  • Extremely tight AWS integration
  • Easily choose to run locally, on a Hadoop cluster, or on EMR
  • Actively developed; great documentation
  • 36. mrjob: serialization 36

class MyMRJob(mrjob.job.MRJob):
    INPUT_PROTOCOL = mrjob.protocol.RawValueProtocol
    INTERNAL_PROTOCOL = mrjob.protocol.JSONProtocol
    OUTPUT_PROTOCOL = mrjob.protocol.JSONProtocol

The settings above are the defaults. Available protocols:
  • RawProtocol / RawValueProtocol
  • JSONProtocol / JSONValueProtocol
  • PickleProtocol / PickleValueProtocol
  • ReprProtocol / ReprValueProtocol

Custom protocols can be written (see the sketch below). No current support for binary serialization schemes.
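For illustration, a hedged sketch of a custom protocol; mrjob only requires read/write methods, and the tab-separated layout here is our own choice:

class TSVProtocol(object):
    """Encode (key, value) pairs as 'key<tab>value' lines."""

    def read(self, line):
        # split only on the first tab so values may contain tabs
        key, value = line.split('\t', 1)
        return key, value

    def write(self, key, value):
        return '%s\t%s' % (key, value)

# use it by setting, e.g., OUTPUT_PROTOCOL = TSVProtocol on your MRJob subclass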
  • 37. luigi 37
  • Full-fledged workflow management, task scheduling, and dependency resolution tool in Python (similar to Apache Oozie)
  • Built-in support for Hadoop by wrapping Streaming (see the sketch below)
  • Not as fully-featured as mrjob for Hadoop, but easily customizable
  • Internal serialization through repr/eval
  • Actively developed at Spotify
  • README is good but documentation is lacking
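A hedged sketch of what a luigi Hadoop task can look like; module paths vary across luigi versions (older releases expose these under luigi.hadoop and luigi.hdfs), and the task names, parameters, and paths here are illustrative:

import luigi
from luigi.contrib.hadoop import JobTask
from luigi.contrib.hdfs import HdfsTarget

class NgramData(luigi.ExternalTask):
    """Marks the raw n-gram files already sitting in HDFS."""
    path = luigi.Parameter()

    def output(self):
        return HdfsTarget(self.path)

class NgramNeighbors(JobTask):
    input_path = luigi.Parameter()
    output_path = luigi.Parameter()

    def requires(self):
        # dependency resolution: run only after the input exists
        return NgramData(self.input_path)

    def output(self):
        return HdfsTarget(self.output_path)

    def mapper(self, line):
        data = line.split('\t')
        ngram = data[0].split()
        word1, word2 = sorted((ngram[0], ngram[-1]))
        yield (word1, word2, data[1]), int(data[2])

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    luigi.run()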
  • 39. The cluster used for benchmarking 39
  • 5 virtual machines, each with:
    • 4 CPUs
    • 10 GB RAM
    • 100 GB disk
  • CentOS 6.2
  • CDH4 (Hadoop 2)
  • 20 map tasks
  • 10 reduce tasks
  • Python 2.6
  • 48. What is Spark? 48
  • Started in 2009 as an academic project in the AMPLab at UC Berkeley; now an Apache project with >100 contributors
  • In-memory distributed execution engine
  • Operates on Resilient Distributed Datasets (RDDs)
  • Provides richer distributed computing primitives for various problems
  • Can support SQL, stream processing, ML, and graph computation
  • Supports Scala, Java, and Python
  • 49. Spark uses a general DAG scheduler
  • Application-aware scheduler
  • Uses locality for both disk and memory
  • Partitioning-aware to avoid shuffles
  • Can rewrite and optimize graph based on analysis
  [Diagram: RDD lineage graph of datasets A through G divided into Stages 1 through 3 at shuffle boundaries (groupBy, join, union); narrow transformations like map are pipelined within a stage, and cached data partitions are marked.]
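A hedged PySpark illustration of what the scheduler sees (paths and names are ours): narrow transformations like map are pipelined inside one stage, wide operations like reduceByKey introduce a shuffle and hence a stage boundary, and cache() marks partitions to keep in memory:

rdd = sc.textFile('hdfs:///path/to/data')
pairs = rdd.map(lambda line: (line.split('\t')[0], 1))   # narrow: stays in one stage
counts = pairs.reduceByKey(lambda a, b: a + b)           # wide: shuffle => new stage
counts.cache()                                           # keep these partitions in memory

counts.count()    # first action materializes and caches the RDD
counts.take(10)   # reuses the cached partitions, no recompute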
  • 51. Apache Spark 51

Log filtering (Python):

file = spark.textFile("hdfs://...")
errors = file.filter(lambda line: "ERROR" in line)
# Count all the errors
errors.count()
# Count errors mentioning MySQL
errors.filter(lambda line: "MySQL" in line).count()
# Fetch the MySQL errors as an array of strings
errors.filter(lambda line: "MySQL" in line).collect()

Logistic regression (Scala):

val points = spark.textFile(...).map(parsePoint).cache()
var w = Vector.random(D)  // current separating plane
for (i <- 1 to ITERATIONS) {
  val gradient = points.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final separating plane: " + w)
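And a hedged PySpark sketch of the same n-gram aggregation used throughout this talk, assuming a SparkContext named sc and the tab-separated input layout (paths are illustrative):

def to_pair(line):
    # ((word1, word2, year), count) from one raw n-gram record
    data = line.split('\t')
    ngram = data[0].split()
    word1, word2 = sorted((ngram[0], ngram[-1]))
    return ((word1, word2, data[1]), int(data[2]))

counts = (sc.textFile('hdfs:///path/to/ngrams')
            .map(to_pair)
            .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile('hdfs:///path/to/output')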
  • 53. What’s Impala? 53
  • Interactive SQL
    • Typically 4-65x faster than the latest Hive (observed up to 100x faster)
    • Responses in seconds instead of minutes (sometimes sub-second)
  • ANSI-92 standard SQL queries with HiveQL
    • Compatible SQL interface for existing Hadoop/CDH applications
    • Based on industry-standard SQL
  • Natively on Hadoop/HBase storage and metadata
    • Flexibility, scale, and cost advantages of Hadoop
    • No duplication/synchronization of data and metadata
    • Local processing to avoid network bottlenecks
  • Separate runtime from batch processing
    • Hive, Pig, MapReduce are designed and great for batch
    • Impala is purpose-built for low-latency SQL queries on Hadoop
  • 54. Cloudera Impala 54

SELECT
    cosmic AS snp_id,
    vcf_chrom AS chr,
    vcf_pos AS pos,
    sample_id AS sample,
    vcf_call_gt AS genotype,
    sample_affection AS phenotype
FROM hg19_parquet_snappy_join_cached_partitioned
WHERE COSMIC IS NOT NULL
    AND dbSNP IS NULL
    AND sample_study = "breast_cancer"
    AND VCF_CHROM = "16";
  • 55. Impala Architecture: Planner 55
  • Example: query with join and aggregation

SELECT state, SUM(revenue)
FROM HdfsTbl h JOIN HbaseTbl b ON (...)
GROUP BY 1 ORDER BY 2 DESC LIMIT 10

  [Diagram: distributed plan. HDFS scans run at DataNodes and HBase scans at region servers; each node performs a hash join, pre-aggregation, and local TopN, then exchanges feed the coordinator for the final aggregation and TopN.]
  • 56. Impala User-defined Functions (UDFs) 56
  • Tuple => scalar value
    • Substring
    • sin, cos, pow, …
    • Machine-learning models
  • Supports Hive UDFs (Java)
    • Highly unpleasurable to write
  • Impala (native) UDFs
    • C++ interface designed for efficiency
    • Similar to Postgres UDFs
    • Runs any LLVM-compiled code
  • 58. LLVM: C++ example 58

bool StringEq(FunctionContext* context,
              const StringVal& arg1,
              const StringVal& arg2) {
    if (arg1.is_null != arg2.is_null) return false;
    if (arg1.is_null) return true;
    if (arg1.len != arg2.len) return false;
    return (arg1.ptr == arg2.ptr) ||
           memcmp(arg1.ptr, arg2.ptr, arg1.len) == 0;
}
  • 59. LLVM: IR output 59

; ModuleID = '<stdin>'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
target triple = "x86_64-apple-macosx10.7.0"

%"class.impala_udf::FunctionContext" = type { %"class.impala::FunctionContextImpl"* }
%"class.impala::FunctionContextImpl" = type opaque
%"struct.impala_udf::StringVal" = type { %"struct.impala_udf::AnyVal", i32, i8* }
%"struct.impala_udf::AnyVal" = type { i8 }

; Function Attrs: nounwind readonly ssp uwtable
define zeroext i1 @_Z8StringEqPN10impala_udf15FunctionContextERKNS_9StringValES4_(%"class.impala_udf::FunctionContext"* nocapture %context, %"struct.impala_udf::StringVal"* nocapture %arg1, %"struct.impala_udf::StringVal"* nocapture %arg2) #0 {
entry:
  %is_null = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg1, i64 0, i32 0, i32 0
  %0 = load i8* %is_null, align 1, !tbaa !0, !range !3
  %is_null1 = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg2, i64 0, i32 0, i32 0
  %1 = load i8* %is_null1, align 1, !tbaa !0, !range !3
  %cmp = icmp eq i8 %0, %1
  br i1 %cmp, label %if.end, label %return

if.end:                                           ; preds = %entry
  %tobool = icmp eq i8 %0, 0
  br i1 %tobool, label %if.end7, label %return

if.end7:                                          ; preds = %if.end
  %len = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg1, i64 0, i32 1
  %2 = load i32* %len, align 4, !tbaa !4
  %len8 = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg2, i64 0, i32 1
  %3 = load i32* %len8, align 4, !tbaa !4
  %cmp9 = icmp eq i32 %2, %3
  br i1 %cmp9, label %if.end11, label %return

if.end11:                                         ; preds = %if.end7
  %ptr = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg1, i64 0, i32 2
  %4 = load i8** %ptr, align 8, !tbaa !5
  %ptr12 = getelementptr inbounds %"struct.impala_udf::StringVal"* %arg2, i64 0, i32 2
  %5 = load i8** %ptr12, align 8, !tbaa !5
  %cmp13 = icmp eq i8* %4, %5
  br i1 %cmp13, label %return, label %lor.rhs

lor.rhs:                                          ; preds = %if.end11
  %conv17 = sext i32 %2 to i64
  %call = tail call i32 @memcmp(i8* %4, i8* %5, i64 %conv17)
  %cmp18 = icmp eq i32 %call, 0
  br label %return
  • 61. Iris data and BigML 61

def predict_species_orig(sepal_width=None, petal_length=None, petal_width=None):
    """ Predictor for species from model/52952081035d07727e01d836

        Predictive model by BigML - Machine Learning Made Easy
    """
    if (petal_width is None):
        return u'Iris-virginica'
    if (petal_width > 0.8):
        if (petal_width <= 1.75):
            if (petal_length is None):
                return u'Iris-versicolor'
            if (petal_length > 4.95):
                if (petal_width <= 1.55):
                    return u'Iris-virginica'
                if (petal_width > 1.55):
                    if (petal_length > 5.45):
                        return u'Iris-virginica'
                    if (petal_length <= 5.45):
                        return u'Iris-versicolor'
            if (petal_length <= 4.95):
                if (petal_width <= 1.65):
                    return u'Iris-versicolor'
                if (petal_width > 1.65):
                    return u'Iris-virginica'
        if (petal_width > 1.75):
            if (petal_length is None):
                return u'Iris-virginica'
            if (petal_length > 4.85):
                return u'Iris-virginica'
            if (petal_length <= 4.85):
                if (sepal_width is None):
                    return u'Iris-virginica'
                if (sepal_width <= 3.1):
                    return u'Iris-virginica'
                if (sepal_width > 3.1):
                    return u'Iris-versicolor'
    if (petal_width <= 0.8):
        return u'Iris-setosa'
  • 63. Impala + Numba 63
  • Still pre-alpha
  • Significantly faster execution thanks to native LLVM compilation
  • Significantly easier to write UDFs (see the Numba sketch below)
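The Impala binding was pre-alpha at the time, so any UDF-registration API would be speculative; the Numba half is easy to show, though. A hedged sketch of compiling a scalar Python function to native code via LLVM (the function and its threshold reuse the iris example above purely for illustration):

from numba import jit

@jit
def predict_is_setosa(petal_width):
    # scalar decision, compiled to native code by Numba/LLVM
    if petal_width <= 0.8:
        return 1
    return 0

# the first call triggers compilation; later calls run the native version
print(predict_is_setosa(0.5))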
  • 65. 65 If you have access to a Hadoop cluster and you want a one-off quick-and-dirty job… Hadoop Streaming
  • 66. 66 If you want an expressive Pythonic interface to build complex, regular ETL workflows… Luigi
  • 67. 67 If you want to integrate Hadoop with other regular processes… Luigi
  • 68. 68 If you don’t have access to Hadoop and want to try stuff out… mrjob
  • 69. 69 If you’re heavily using AWS… mrjob
  • 70. 70 If you want to work interactively… PySpark
  • 71. 71 If you want to do in-memory analytics… PySpark
  • 72. 72 If you want to do anything…* PySpark
  • 73. 73 If you want the ease of Python with high performance… Impala + Numba
  • 74. 74 If you want to write Python UDFs for SQL queries… Impala + Numba
  • 76. 76

Editor's Notes

  1. Share my experiences starting out on Hadoop
  2. For those new to Hadoop
  3. Hipy is syntactic sugar for Hive
  4. less -S get_some_ngrams.py; hadoop fs -ls rock-health-python/ngrams
  5. Our actual computation
  6. Lexicographic ordering. External pairs only.
  7. Community is coalescing around HDFS
  8. Community is coalescing around HDFS
  9. Large blocks. Blocks replicated around.
  10. Two functions required. Just one of many engines. We’ll talk about 2 more later.
  11. Switching “hadoop” to “emr” sends job to Amazon instead.
  12. less -S get_some_ngrams.py; hadoop fs -ls rock-health-python/ngrams
  13. Software: Cloudera Enterprise, the platform for big data. A complete data management solution powered by Apache Hadoop: a collection of open source projects forms the foundation of the platform, and Cloudera wraps the open source core with additional software for system and data management plus technical support. Five attributes of Cloudera Enterprise:
    • Scalable: storage and compute in a single system brings computation to the data; scale capacity and performance linearly by adding nodes; proven at massive scale (tens of PB of data, millions of users).
    • Flexible: store any type of data (structured, unstructured, semi-structured) in its native format, with no conversion required and no loss of fidelity due to ETL. Fluid structuring: no single model or schema the data must conform to; decide how to look at the data at query time (if the attribute exists in the raw data, you can query against it), and optimize structure as desired using open source file formats like Avro and Parquet. Multiple forms of computation: batch processing (MapReduce, Hive, Pig, Java), interactive SQL (Impala, BI tools), interactive search, machine learning on large datasets (e.g., Apache Mahout), and math/statistics tools like SAS and R, with more to come.
    • Cost-effective: scale out on inexpensive, industry-standard hardware with fault tolerance built in; leverage cost structures with existing vendors; reduced data movement, fewer redundant copies of data, and less time spent migrating/managing; open source software is easy to acquire and prove the value/ROI.
    • Open: rapid innovation from large development communities with the most talented engineers from across the world; free to download and deploy, so you can demonstrate the value before a large-scale investment; no vendor lock-in. Cloudera’s open source strategy: if it stores or processes data, it’s open source; leading contributor to the Apache Hadoop ecosystem, defining the future of the platform together with the community.
    • Integrated: works with all your existing investments (databases and data warehouses, analytics and BI solutions, ETL tools, platforms and operating systems, hardware and networking equipment); over 700 partners, including the leaders in these market segments.
  14. (Duplicate of note 13.)
  15. Change #1
  16. Has 40 B rows. Scaled to 160 B rows, including joins.
  17. It’s easy enough. Lowest overhead.