A New Partnership for eScience
Bill Howe, UW
Ed Lazowska, UW
Garth Gibson, CMU
Christos Faloutsos, CMU
Peter Lee, CMU (DARPA)
Chris Mentzel, Moore
http://escience.washington.edu
3/12/09 Bill Howe, eScience Institute
The University of Washington
eScience Institute
• Rationale
  • The exponential increase in sensors is transitioning all fields of science and engineering from data-poor to data-rich
  • Techniques and technologies include sensors and sensor networks, databases, data mining, machine learning, visualization, cluster/cloud computing
  • If these techniques and technologies are not widely available and widely practiced, UW will cease to be competitive
• Mission
  • Help position the University of Washington at the forefront of research both in modern eScience techniques and technologies, and in the fields that depend upon them
• Strategy
  • Bootstrap a cadre of Research Scientists
  • Add faculty in key fields
  • Build out a “consultancy” of students and non-research staff
Staff and Funding
• Funding
  • $1M/year direct appropriation from WA State Legislature
  • $1.5M from Gordon and Betty Moore Foundation (joint with CMU)
  • Multiple proposals outstanding
• Staffing
  • Dave Beck, Research Scientist: Biosciences and software eng.
  • Jeff Gardner, Research Scientist: Astrophysics and HPC
  • Bill Howe, Research Scientist: Databases, visualization, DISC
  • Ed Lazowska, Director
  • Erik Lundberg (50%), Operations Director
  • Mette Peters, Health Sciences Liaison
  • Chance Reschke, Research Engineer: large scale computing platforms
  • …plus a senior faculty search underway
  • …plus a “consultancy” of students and professional staff
All science is reducing to a database problem
Old model: “Query the world” (Data acquisition coupled to a specific hypothesis)
New model: “Download the world” (Data acquired en masse, in support of many hypotheses)
• Astronomy: High-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
• Medicine: ubiquitous digital records, MRI, ultrasound
• Oceanography: high-resolution models, cheap sensors, satellites
• Biology: lab automation, high-throughput sequencing
“Increase data collection exponentially with FlowCam!”
The long tail is getting fatter:
notebooks become spreadsheets (MB),
spreadsheets become databases (GB),
databases become clusters (TB),
clusters become clouds (PB)
Figure: The Long Tail (data volume vs. rank). Labeled points: CERN (~15 PB/year), LSST (~100 PB), PanSTARRS (~40 PB), SDSS (~100 TB), CARMEN (~50 TB), ocean modelers, seismologists, microbiologists, <spreadsheet users>
Researchers with growing data management challenges but limited resources for cyberinfrastructure
• No dedicated IT staff
• Over-reliance on inadequate but familiar tools
“The future is already here. It’s just not very
evenly distributed.” -- William Gibson
Case Study: Armbrust Lab
Armbrust Lab Tech Roadmap
Figure: tools placed by scalability (workstation/server to cluster/cloud) and specialization (general tasks to specific tasks), grouped as Past, Present, and Soon. Tools shown: ClustalW, MAQ, Excel, NCBI BLAST, Phred/Phrap, CloudBurst, CLC Genomics Machine, Hadoop/Dryad, Parallel Databases, Azure, AWS, WebBlast*, RDBMS, R, PPlacer*, AnnoJ, BioPython, other tools, and an open “?”
What Does Scalable Mean?
• Operationally:
  • In the past: “Works even if data doesn’t fit in main memory”
  • Now: “Can make use of 1000s of cheap computers”
• Formally:
  • In the past: polynomial time and space. If you have N data items, you must do no more than N^k operations
  • Soon: log-linear time and linear space. If you have N data items, you must do no more than N log(N) operations
• Soon, you’ll only get one pass at the data
• So you had better make that one pass count
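To make the "one pass" constraint concrete, here is a minimal Python sketch that computes a count, mean, and variance while touching each item exactly once and keeping only constant state. It uses Welford's running-update method, which is my choice for illustration and not something the slides name; the sample readings are made up.

```python
def one_pass_stats(stream):
    """Return (count, mean, population variance) after a single pass over stream."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / count if count else 0.0
    return count, mean, variance


# A generator can be consumed only once, which mimics "one pass at the data".
# The values are hypothetical sample readings.
readings = (float(v) for v in [2, 4, 4, 4, 5, 5, 7, 9])
count, mean, variance = one_pass_stats(readings)
print(count, mean, variance)
```

The state kept between items is three numbers, so memory use does not grow with N.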
A Goal: Cross-Scale Solutions
• Gracefully scale up
  • from files to databases to cluster to cloud
  • from MB to GB to TB to PB
• “Gracefully” means:
  • logical data independence
  • no expensive ETL migration projects
• “Gracefully” means: everyone can use it
  • Hackers / Computational Scientists
  • Lab/Field Scientists
  • The Public
  • K12
  • Legislators
 | Data Model | Operations | Services
GPL | * | * | None for free
Workflow | * | arbitrary boxes-and-arrows | typing, provenance, Pegasus-style resource mapping, task parallelism
SQL / Relational Algebra | Relations | Select, Project, Join, Aggregate, … | optimization, physical data independence, indexing, parallelism
MapReduce | [(key,value)] | Map, Reduce | massive data parallelism, fault tolerance, scheduling
Pig | Nested Relations | RA-like, with Nest/Flatten | optimization, monitoring, scheduling
DryadLINQ | IQueryable, IEnumerable | RA + Apply + Partitioning | typing, massive data parallelism, fault tolerance
MPI | Arrays/Matrices | 70+ ops | data parallelism, full control
MapReduce
• Many tasks process big data, produce big data
• Want to use hundreds or thousands of CPUs
  • ... but this needs to be easy
• Parallel databases exist, but require DBAs and $$$$
  • …and do not easily scale to thousands of computers
• MapReduce is a lightweight framework, providing:
  • Automatic parallelization and distribution
  • Fault-tolerance
  • I/O scheduling
  • Status and monitoring
A complete DryadLINQ program
// One parsed line of the web log.
public class LogEntry {
    public string user, ip;
    public string page;
    public LogEntry(string line) {
        string[] fields = line.Split(' ');
        this.user = fields[8];
        this.ip = fields[9];
        this.page = fields[5];
    }
}
// Number of accesses to one page by one user.
public class UserPageCount {
    public string user, page;
    public int count;
    public UserPageCount(string usr, string page, int cnt) {
        this.user = usr;
        this.page = page;
        this.count = cnt;
    }
}
PartitionedTable<string> logs =
    PartitionedTable.Get<string>(@"file:…logfile.pt");
// Skip comment lines and parse the rest.
var logentries =
    from line in logs
    where !line.StartsWith("#")
    select new LogEntry(line);
// Keep only the accesses made by user "ulfar".
var user =
    from access in logentries
    where access.user.EndsWith(@"ulfar")
    select access;
// Count accesses per page.
var accesses =
    from access in user
    group access by access.page into pages
    select new UserPageCount("ulfar", pages.Key, pages.Count());
// Keep .htm pages, most frequently accessed first.
var htmAccesses =
    from access in accesses
    where access.page.EndsWith(".htm")
    orderby access.count descending
    select access;
htmAccesses.ToPartitionedTable(@"file:…results.pt");
slide source: Christophe Poulain, MSR
Relational Databases
Pre-relational DBMS brittleness: if your data changed, your application often broke.
Early RDBMS were buggy and slow (and often reviled), but required only 5% of the application code.
Figure: files and pointers, then relations (physical data independence), then views (logical data independence)
“Activities of users at terminals and
most application programs should
remain unaffected when the internal
representation of data is changed and
even when some aspects of the
external representation are changed.”
Key Idea: Programs that manipulate tabular
data exhibit an algebraic structure allowing
reasoning and manipulation independently
of physical data representation
Relational Databases
• Databases are especially, but not exclusively, effective at “Needle in Haystack” problems:
  • Extracting small results from big datasets
  • Transparently provide “old style” scalability
  • Your query will always* finish, regardless of dataset size.
  • Indexes are easily built and automatically used when appropriate
CREATE INDEX seq_idx ON sequence(seq);
SELECT seq
FROM sequence
WHERE seq = 'GATTACGATATTA';
*almost
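The claim that indexes are "automatically used when appropriate" can be checked in a small experiment. The sketch below uses Python's built-in sqlite3 module as a stand-in engine (an assumption: the slides do not name a specific database), creates the same index, runs the same query, and asks the planner how it would execute it. The three sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sequence (id INTEGER PRIMARY KEY, seq TEXT)")
conn.executemany(
    "INSERT INTO sequence (seq) VALUES (?)",
    [("GATTACGATATTA",), ("CCGTTAGGCATAA",), ("TTGACCATGGCAT",)],  # hypothetical data
)
conn.execute("CREATE INDEX seq_idx ON sequence(seq)")

query = "SELECT seq FROM sequence WHERE seq = ?"
matches = conn.execute(query, ("GATTACGATATTA",)).fetchall()

# EXPLAIN QUERY PLAN returns rows whose last column describes each plan step.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, ("GATTACGATATTA",)).fetchall()
plan_text = " ".join(row[-1] for row in plan)

print(matches)
print(plan_text)
```

The query text never mentions seq_idx; whether the index is used is the planner's decision, which is the point of physical data independence.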
Key Idea: Data Independence
Figure: files and pointers, then relations (physical data independence), then views (logical data independence)
SELECT *
FROM my_sequences;
SELECT seq
FROM ncbi_sequences
WHERE seq = 'GATTACGATATTA';
f = fopen("table_file", "rb");
fseek(f, 10030440, SEEK_SET);
while (true) {
    fread(buf, 1, 8192, f);
    if (memcmp(buf, "GATTACGATATTA", 13) == 0) {
        . . .
Key Idea: An Algebra of Tables
Figure: a query plan built from select, project, and two join operators
Other operators: aggregate, union, difference, cross product
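As a rough illustration of the algebra, the three operators named on the slide can be written as small Python functions over lists of dictionaries. This is a sketch only: the table contents and column names (order_id, item, qty, price) are invented for the example, and real engines implement these operators very differently.

```python
def select(rows, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


def project(rows, columns):
    """Keep only the named columns of every row."""
    return [{c: row[c] for c in columns} for row in rows]


def join(left, right, key):
    """Equi-join on a shared column, using a hash table built on the right input."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Hypothetical tables.
orders = [
    {"order_id": 1, "item": "pipette", "qty": 2},
    {"order_id": 2, "item": "flask", "qty": 1},
    {"order_id": 3, "item": "pipette", "qty": 5},
]
items = [
    {"item": "pipette", "price": 4},
    {"item": "flask", "price": 9},
]

result = project(
    select(join(orders, items, "item"), lambda r: r["qty"] > 1),
    ["order_id", "price"],
)
print(result)
```

Because every operator takes tables and returns a table, the calls compose freely, which is what makes algebraic rewriting possible.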
Key Idea: Algebraic Optimization
N = ((z*2)+((z*3)+0))/1
Algebraic Laws:
1. (+) identity: x+0 = x
2. (/) identity: x/1 = x
3. (*) distributes: (n*x+n*y) = n*(x+y)
4. (*) commutes: x*y = y*x
Apply rules 1, 3, 4, 2:
N = (2+3)*z
two operations instead of five, no division operator
Same idea works with the Relational Algebra!
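The rewrite above can be mechanized. The following Python sketch represents expressions as nested tuples and applies the four laws bottom-up; the tuple encoding and the helper names (simplify, count_ops, evaluate) are assumptions made for illustration, not part of the talk.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul, "/": operator.truediv}


def is_product(expr):
    return isinstance(expr, tuple) and expr[0] == "*"


def simplify(expr):
    """Rewrite an expression tree bottom-up using the four laws."""
    if not isinstance(expr, tuple):
        return expr  # a variable name or a number
    op, left, right = expr
    left, right = simplify(left), simplify(right)
    if op == "+" and right == 0:  # law 1: x+0 = x
        return left
    if op == "/" and right == 1:  # law 2: x/1 = x
        return left
    if op == "+" and is_product(left) and is_product(right):
        # Laws 3 and 4: try both operand orders to find a common factor.
        for a, b in ((left[1], left[2]), (left[2], left[1])):
            for c, d in ((right[1], right[2]), (right[2], right[1])):
                if a == c:
                    return ("*", ("+", b, d), a)
    return (op, left, right)


def count_ops(expr):
    if not isinstance(expr, tuple):
        return 0
    return 1 + count_ops(expr[1]) + count_ops(expr[2])


def evaluate(expr, env):
    if isinstance(expr, tuple):
        op, left, right = expr
        return OPS[op](evaluate(left, env), evaluate(right, env))
    return env[expr] if isinstance(expr, str) else expr


# N = ((z*2)+((z*3)+0))/1
original = ("/", ("+", ("*", "z", 2), ("+", ("*", "z", 3), 0)), 1)
optimized = simplify(original)
print(optimized, count_ops(original), count_ops(optimized))
```

A relational optimizer works on the same principle, with laws about select, project, and join in place of laws about arithmetic.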
Shared Nothing Parallel Databases
• Teradata
• Greenplum
• Netezza
• Aster Data Systems
• DataAllegro (Microsoft)
• Vertica
• MonetDB (recently commercialized as “Vectorwise”)
Case Study: Astrophysics Simulation
N-body Astrophysics Simulation
• 15 years in dev
• 10^9 particles
• Gravity
• Months to run
• 7.5 million CPU hours
• 500 timesteps
• Big Bang to now
Simulations from Tom Quinn’s Lab, work by Sarah Loebman, YongChul Kwon, Bill Howe, Jeff Gardner, Magda Balazinska
Q1: Find Hot Gas
SELECT id
FROM gas
WHERE temp > 150000
Figure: Single Node: Query 1 (dataset sizes 169 MB, 1.4 GB, 36 GB)
Figure: Multiple Nodes: Query 1 (Database Z)
Figure: Multiple Nodes: Query 2 (Database Z)
Q4: Gas Deletion
SELECT gas1.id
FROM gas1
FULL OUTER JOIN gas2
ON gas1.id = gas2.id
WHERE gas2.id IS NULL
Particles removed between two timesteps
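One detail of this query is easy to get wrong: in SQL, a comparison with NULL using = never evaluates to true, so unmatched rows have to be found with IS NULL. The sketch below shows the difference using Python's sqlite3 module with made-up particle ids. It swaps in a LEFT OUTER JOIN, which returns the same rows for this filter and runs on SQLite versions that lack FULL OUTER JOIN; that substitution is mine, not the slide's.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE gas1 (id INTEGER PRIMARY KEY);  -- earlier timestep
    CREATE TABLE gas2 (id INTEGER PRIMARY KEY);  -- later timestep
    INSERT INTO gas1 VALUES (1), (2), (3), (4);
    INSERT INTO gas2 VALUES (1), (3), (5);
    """
)

template = """
    SELECT gas1.id
    FROM gas1
    LEFT OUTER JOIN gas2 ON gas1.id = gas2.id
    WHERE gas2.id {test}
    ORDER BY gas1.id
"""

# Particles present in gas1 with no partner in gas2.
removed = [row[0] for row in conn.execute(template.format(test="IS NULL"))]
# The "= NULL" form compares against an unknown value and matches nothing.
never_true = [row[0] for row in conn.execute(template.format(test="= NULL"))]

print(removed, never_true)
```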
Figure: Single Node: Query 4
Figure: Multiple Nodes: Query 4
Ease of Use
star43 = FOREACH rawGas43 GENERATE $0 AS pid:long;
star60 = FOREACH rawGas60 GENERATE $0 AS pid:long;
groupedGas = COGROUP star43 BY pid, star60 BY pid;
selectedGas = FOREACH groupedGas GENERATE
    FLATTEN((IsEmpty(star43) ? null : star43)) AS s43,
    FLATTEN((IsEmpty(star60) ? null : star60)) AS s60;
destroyed = FILTER selectedGas BY s60 IS NULL;
Visualization and Mashups
Dancing with Data
Data explosion, again
• Data growth is outpacing Moore’s Law
Why?
  • Cost of acquisition has dropped through the floor
  • Every pairwise comparison of datasets generates a new dataset -- N^2 growth
• So: Scalable analysis is necessary
• But: Scalable analysis is hard
It’s not just the size….
• Corollary: # of apps scales as N^2
  • Every pairwise comparison motivates a new application
• To keep up, we need to
  • entrain new programmers,
  • make existing programmers more productive,
  • or both
Satellite Images + Crime Incidence Reports
Twitter Feed + Flickr Stream
Zooplankton and Temperature
<Vis movie>
Why Visualization?
• High bandwidth of the human visual cortex
• Query-writing presumes a precise goal
• Try this in SQL: “What does the salt wedge look like?”
Data Product Ensembles
source: Antonio Baptista, Center for Coastal Margin Observation and Prediction
Example: Find matching sequences
• Given a set of sequences
• Find all sequences equal to “GATTACGATATTA”
Example System: Teradata
AMP = unit of parallelism
Example System: Teradata
SELECT *
FROM Orders o, Lines i
WHERE o.item = i.item
AND o.date = today()
Query plan: join (o.item = i.item) of [select (date = today()) over scan Order o] with [scan Item i]
Find all orders from today, along with the items ordered
Example System: Teradata
Figure: on each of AMP 1, AMP 2, and AMP 3: scan Order o, then select date=today(), then hash h(item) to redistribute rows across AMP 1, AMP 2, and AMP 3
Example System: Teradata
Figure: on each of AMP 1, AMP 2, and AMP 3: scan Item i, then hash h(item) to redistribute rows across AMP 1, AMP 2, and AMP 3
Example System: Teradata
Figure: each AMP runs a local join on o.item = i.item
• AMP 1 contains all orders and all lines where hash(item) = 1
• AMP 2 contains all orders and all lines where hash(item) = 2
• AMP 3 contains all orders and all lines where hash(item) = 3
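The redistribution steps can be simulated in a few lines of Python. This is only a sketch of the hash-partitioned join idea, not Teradata's implementation: the hash function, the row layout, and the sample data are all assumptions, and the date filter is applied in one place for brevity where the slides apply it on each AMP before the shuffle.

```python
from collections import defaultdict

NUM_AMPS = 3


def amp_for(item):
    # Deterministic stand-in for h(item). Python's built-in hash() of a str
    # is salted per process, so it is avoided here.
    return sum(item.encode()) % NUM_AMPS


def redistribute(rows):
    """Send every row to the AMP chosen by hashing its join key."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[amp_for(row["item"])].append(row)
    return partitions


def parallel_join(orders, lines, today):
    todays_orders = [o for o in orders if o["date"] == today]  # select
    order_parts = redistribute(todays_orders)
    line_parts = redistribute(lines)
    result = []
    for amp in range(NUM_AMPS):
        # Each AMP joins only its own partitions; matching keys share an AMP.
        local_lines = defaultdict(list)
        for line in line_parts[amp]:
            local_lines[line["item"]].append(line)
        for order in order_parts[amp]:
            for line in local_lines[order["item"]]:
                result.append({**order, **line})
    return result


# Hypothetical data.
orders = [
    {"order_id": 1, "item": "flask", "date": "2009-03-12"},
    {"order_id": 2, "item": "pipette", "date": "2009-03-11"},
    {"order_id": 3, "item": "beaker", "date": "2009-03-12"},
]
lines = [
    {"item": "flask", "price": 9},
    {"item": "pipette", "price": 4},
    {"item": "beaker", "price": 6},
]
joined = parallel_join(orders, lines, today="2009-03-12")
print(sorted(joined, key=lambda row: row["order_id"]))
```

Since both inputs are partitioned with the same hash function, no AMP ever needs a row held by another AMP during the join itself.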
MapReduce Programming Model
• Input & Output: each a set of key/value pairs
• Programmer specifies two functions:
  map (in_key, in_value) -> list(out_key, intermediate_value)
    • Processes input key/value pair
    • Produces set of intermediate pairs
  reduce (out_key, list(intermediate_value)) -> list(out_value)
    • Combines all intermediate values for a particular key
    • Produces a set of merged output values (usually just one)
Inspired by primitives from functional programming languages such as Lisp, Scheme, and Haskell
slide source: Google, Inc.
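The two signatures above are enough to write a toy, single-process driver. The Python sketch below is an in-memory illustration of the programming model only: it has none of the distribution, scheduling, or fault tolerance of a real framework, and the function name map_reduce and the hot/cool example (borrowing the 150000 temperature threshold from the hot-gas query) are assumptions made for this example.

```python
from collections import defaultdict


def map_reduce(inputs, mapper, reducer):
    """Run mapper over (in_key, in_value) pairs, group by out_key, then reduce."""
    groups = defaultdict(list)
    for in_key, in_value in inputs:
        for out_key, intermediate_value in mapper(in_key, in_value):
            groups[out_key].append(intermediate_value)
    return {key: list(reducer(key, values)) for key, values in groups.items()}


def mapper(file_name, temperatures):
    # Emit one (label, 1) pair per particle.
    for temp in temperatures:
        yield ("hot" if temp > 150000 else "cool"), 1


def reducer(label, ones):
    # Merge all intermediate values for one key into a single output value.
    yield sum(ones)


# Hypothetical input splits: (file name, list of particle temperatures).
inputs = [("part-0", [120000, 180000, 90000]), ("part-1", [200000, 151000])]
result = map_reduce(inputs, mapper, reducer)
print(result)
```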
Abridged Declaration of Independence
A Declaration By the Representatives of the United States of America, in General Congress Assembled.
When in the course of human events it becomes necessary for a people to advance from that subordination in
which they have hitherto remained, and to assume among powers of the earth the equal and independent station
to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind
requires that they should declare the causes which impel them to the change.
We hold these truths to be self-evident; that all men are created equal and independent; that from that equal
creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and
the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just
power from the consent of the governed; that whenever any form of government shall become destructive of
these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's
foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect
their safety and happiness. Prudence indeed will dictate that governments long established should not be
changed for light and transient causes: and accordingly all experience hath shewn that mankind are more
disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are
accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing
invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to
throw off such government and to provide new guards for future security. Such has been the patient sufferings
of the colonies; and such is now the necessity which constrains them to expunge their former systems of
government. the history of his present majesty is a history of unremitting injuries and usurpations, among which
no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object
the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid
world, for the truth of which we pledge a faith yet unsullied by falsehood.
Example: Document Processing
Example: Word length histogram
How many “big”, “medium”, and “small” words are used?
Big = Yellow = 10+ letters
Medium = Red = 5..9 letters
Small = Blue = 2..4 letters
Tiny = Pink = 1 letter
Split the document into chunks and process each chunk on a different computer (Chunk 1, Chunk 2)
Counts for one chunk: (yellow, 20), (red, 71), (blue, 93), (pink, 6)
Each chunk is handled by its own map task: Map Task 1 (204 words), Map Task 2 (190 words)
Map Task 1 emits (key, value) pairs: (yellow, 17), (red, 77), (blue, 107), (pink, 3)
Map task 1 emits: (yellow, 17), (red, 77), (blue, 107), (pink, 3)
Map task 2 emits: (yellow, 20), (red, 71), (blue, 93), (pink, 6)
“Shuffle step” groups the pairs by key for the reduce tasks:
  yellow: 17, 20
  red: 77, 71
  blue: 93, 107
  pink: 6, 3
Reduce tasks output: (yellow, 37), (red, 148), (blue, 200), (pink, 9)
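The word-length histogram can be reproduced end to end in Python. In this sketch each map task returns a Counter keyed by color and the reduce step sums the per-chunk counters. Splitting on whitespace and stripping punctuation is my assumption about how words are counted, so running it on the full text may not reproduce the slide's exact numbers; the function names are likewise invented.

```python
from collections import Counter
import string


def color_for(word):
    """Bucket a word by length: 10+ yellow, 5..9 red, 2..4 blue, 1 pink."""
    n = len(word)
    if n >= 10:
        return "yellow"
    if n >= 5:
        return "red"
    if n >= 2:
        return "blue"
    return "pink"


def map_task(chunk):
    """Count the words of one chunk by color."""
    words = (w.strip(string.punctuation) for w in chunk.split())
    return Counter(color_for(w) for w in words if w)


def reduce_task(partial_counts):
    """Sum the per-chunk counts for every key."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


text = ("When in the course of human events it becomes necessary "
        "for a people to advance")
words = text.split()
chunks = [" ".join(words[:8]), " ".join(words[8:])]  # two chunks, two map tasks
histogram = reduce_task(map_task(chunk) for chunk in chunks)
print(dict(histogram))

# The reduce step applied to the per-task counts reported on the slides.
slide_total = reduce_task([
    Counter(yellow=17, red=77, blue=107, pink=3),
    Counter(yellow=20, red=71, blue=93, pink=6),
])
print(dict(slide_total))
```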
New Example: What does this do?
map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
        EmitIntermediate(w, 1);
reduce(String output_key, Iterator intermediate_values):
    // output_key: word
    // output_values: ????
    int result = 0;
    for each v in intermediate_values:
        result += v;
    Emit(result);
slide source: Google, Inc.
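One way to answer the question is to transliterate the pseudocode into Python and run it. In the sketch below, map_fn and reduce_fn mirror the two functions line for line, while the word_count driver and the two tiny documents are assumptions added so the example runs by itself. Read this way, the program counts how many times each word occurs across all documents.

```python
from collections import defaultdict


def map_fn(document_name, contents):
    # Emit (word, 1) for every word in the document.
    for word in contents.split():
        yield word, 1


def reduce_fn(word, counts):
    # Sum the 1s emitted for this word.
    result = 0
    for v in counts:
        result += v
    return result


def word_count(documents):
    """Hypothetical driver: group intermediate pairs by word, then reduce."""
    intermediate = defaultdict(list)
    for name, contents in documents.items():
        for word, one in map_fn(name, contents):
            intermediate[word].append(one)
    return {word: reduce_fn(word, ones) for word, ones in intermediate.items()}


counts = word_count({"a.txt": "to be or not to be", "b.txt": "to do"})
print(counts)
```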
Before RDBMS: if your data changed, your application broke.
Early RDBMS were buggy and slow (and often reviled), but
required only 5% of the application code.
“Activities of users at terminals and most application programs
should remain unaffected when the internal representation of data
is changed and even when some aspects of the external
representation are changed.” -- E.F. Codd, 1970
Key Ideas: Programs that manipulate tabular data exhibit an
algebraic structure allowing reasoning and manipulation
independently of physical data representation
Relational Database Management Systems (RDBMS)
MapReduce is a Nascent Database Engine
Access Methods and
Scheduling:
Query Language:
Query Optimizer:
Pig Latin
Graphics taken from: hadoop.apache.org and research.yahoo.com/node/90
MapReduce and Hadoop
• MR introduced by Google
  • Published paper in OSDI 2004
  • MR: high-level programming model and implementation for large-scale parallel data processing
• Hadoop
  • Open source MR implementation
  • Yahoo!, Facebook, New York Times
operators:
• LOAD
• STORE
• FILTER
• FOREACH … GENERATE
• GROUP
binary operators:
• JOIN
• COGROUP
• UNION
other support:
• UDFs
• COUNT
• SUM
• AVG
• MIN/MAX
Additional operators:
http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.htm
A Query Language for MR: Pig Latin
• High-level, SQL-like dataflow language for MR
• Goal: Sweet spot between SQL and MR
  • Applies SQL-like, high-level language constructs to accomplish low-level MR programming.
New Task: k-mer Similarity
• Given a set of database sequences and a set of query sequences
• Return the top N similar pairs, where similarity is defined as the number of common k-mers
Pig Latin program
D = LOAD 'db_sequences.fasta' USING FASTA() AS (did, dsequence);
Q = LOAD 'query_sequences.fasta' USING FASTA() AS (qid, qsequence);
Kd = FOREACH D GENERATE did, FLATTEN(kmers(7, dsequence));
Kq = FOREACH Q GENERATE qid, FLATTEN(kmers(7, qsequence));
R = JOIN Kd BY kmer, Kq BY kmer;
G = GROUP R BY (qid, did);
C = FOREACH G GENERATE qid, did, COUNT(kmer) AS score;
T = FILTER C BY score > 4;
STORE T INTO 'seqs.txt';
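For readers who want to try the k-mer similarity task without a cluster, here is a single-machine Python sketch of the same dataflow: extract k-mers, join on k-mer, group by (qid, did), and count. It assumes "common k-mers" means distinct shared k-mers and returns the top n pairs in place of filtering by a score threshold; the sequences and the function names kmers and top_pairs are made up for the example.

```python
from collections import Counter


def kmers(k, sequence):
    """Return the set of distinct length-k substrings of sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def top_pairs(db, queries, k, n):
    """Score every (query id, database id) pair by shared k-mers; keep the top n."""
    index = {}  # k-mer -> ids of database sequences containing it
    for did, seq in db.items():
        for kmer in kmers(k, seq):
            index.setdefault(kmer, []).append(did)
    scores = Counter()
    for qid, seq in queries.items():
        for kmer in kmers(k, seq):
            for did in index.get(kmer, []):  # the join on k-mer
                scores[(qid, did)] += 1      # group by (qid, did) and count
    return scores.most_common(n)


# Hypothetical sequences; k=3 keeps the example readable.
db = {"d1": "GATTACA", "d2": "CCCGGG"}
queries = {"q1": "ATTACG"}
best = top_pairs(db, queries, k=3, n=1)
print(best)
```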
New Task: Alignment
• RMAP alignment implemented in Hadoop
  Michael Schatz, CloudBurst: highly sensitive read mapping with MapReduce, Bioinformatics 25(11), April 2009
• Goal: Align reads to a reference genome
• Overview:
  • Map: Split reads and reference into k-mers
  • Reduce: for matching k-mers, find end-to-end alignments (seed and extend)
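The seed-and-extend outline can be illustrated with a toy Python version. This is a simplified sketch and not CloudBurst's algorithm: it seeds with every k-mer of each read, extends by counting mismatches only (no insertions or deletions), and runs in one process. The function names, the parameters k and max_mismatches, and the sequences are all assumptions for the example.

```python
from collections import defaultdict


def seeds(sequence, k):
    """Yield (k-mer, offset) for every position in sequence."""
    for offset in range(len(sequence) - k + 1):
        yield sequence[offset:offset + k], offset


def align_reads(reference, reads, k, max_mismatches):
    # "Map": key every k-mer of the reference and of the reads by its sequence.
    by_kmer = defaultdict(lambda: {"ref": [], "read": []})
    for kmer, pos in seeds(reference, k):
        by_kmer[kmer]["ref"].append(pos)
    for read_id, read in reads.items():
        for kmer, offset in seeds(read, k):
            by_kmer[kmer]["read"].append((read_id, offset))

    # "Reduce": for each shared k-mer, extend the seed to the whole read.
    hits = set()
    for group in by_kmer.values():
        for pos in group["ref"]:
            for read_id, offset in group["read"]:
                read = reads[read_id]
                start = pos - offset  # where the read would begin
                if start < 0 or start + len(read) > len(reference):
                    continue
                window = reference[start:start + len(read)]
                mismatches = sum(a != b for a, b in zip(read, window))
                if mismatches <= max_mismatches:
                    hits.add((read_id, start, mismatches))
    return sorted(hits)


# Hypothetical data: r1 is reference[4:12] with its last base changed.
reference = "ACGTTGCATGGCTAAC"
reads = {"r1": "TGCATGGA", "r2": "GGGGGGGG"}
alignments = align_reads(reference, reads, k=4, max_mismatches=1)
print(alignments)
```

Grouping by k-mer is what makes the problem fit MapReduce: all reference positions and read offsets that share a seed meet in the same reduce call.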
MapReduce Overhead
Elastic MapReduce
• Custom Jar
  • Java
• Streaming
  • Any language that can read/write stdin/stdout
• Pig
  • Simple data flow language
• Hive
  • SQL
Taxonomy of Parallel Architectures
Easiest to program, but $$$$
Scales to 1000s of computers

DMCC Future of Trade Web3 - Special Edition
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 

A New Partnership for Advancing eScience Techniques and Technologies

  • 1. A New Partnership for eScience. Bill Howe, UW; Ed Lazowska, UW; Garth Gibson, CMU; Christos Faloutsos, CMU; Peter Lee, CMU (DARPA); Chris Mentzel, Moore
  • 2.
  • 4. 3/12/09 Bill Howe, eScience Institute4
  • 5. 3/12/09 Bill Howe, eScience Institute5
  • 6. The University of Washington eScience Institute
      Rationale
        - The exponential increase in sensors is transitioning all fields of science and engineering from data-poor to data-rich
        - Techniques and technologies include sensors and sensor networks, databases, data mining, machine learning, visualization, and cluster/cloud computing
        - If these techniques and technologies are not widely available and widely practiced, UW will cease to be competitive
      Mission
        - Help position the University of Washington at the forefront of research both in modern eScience techniques and technologies, and in the fields that depend upon them
      Strategy
        - Bootstrap a cadre of Research Scientists
        - Add faculty in key fields
        - Build out a “consultancy” of students and non-research staff
  • 7. Staff and Funding
      Funding
        - $1M/year direct appropriation from WA State Legislature
        - $1.5M from Gordon and Betty Moore Foundation (joint with CMU)
        - Multiple proposals outstanding
      Staffing
        - Dave Beck, Research Scientist: Biosciences and software eng.
        - Jeff Gardner, Research Scientist: Astrophysics and HPC
        - Bill Howe, Research Scientist: Databases, visualization, DISC
        - Ed Lazowska, Director
        - Erik Lundberg (50%), Operations Director
        - Mette Peters, Health Sciences Liaison
        - Chance Reschke, Research Engineer: large scale computing platforms
        - …plus a senior faculty search underway
        - …plus a “consultancy” of students and professional staff
  • 8. All science is reducing to a database problem
      Old model: “Query the world” (data acquisition coupled to a specific hypothesis)
      New model: “Download the world” (data acquired en masse, in support of many hypotheses)
        - Astronomy: high-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
        - Medicine: ubiquitous digital records, MRI, ultrasound
        - Oceanography: high-resolution models, cheap sensors, satellites
        - Biology: lab automation, high-throughput sequencing
      “Increase data collection exponentially with FlowCam!”
  • 9. The Long Tail
      The long tail is getting fatter: notebooks become spreadsheets (MB), spreadsheets become databases (GB), databases become clusters (TB), clusters become clouds (PB)
      Figure: data volume plotted against rank. Labeled points: CERN (~15PB/year), LSST (~100PB), PanSTARRS (~40PB), SDSS (~100TB), CARMEN (~50TB), ocean modelers, seismologists, microbiologists, <spreadsheet users>
      Researchers with growing data management challenges but limited resources for cyberinfrastructure
        - No dedicated IT staff
        - Over-reliance on inadequate but familiar tools
      “The future is already here. It’s just not very evenly distributed.” (William Gibson)
  • 10. Case Study: Armbrust Lab
  • 11. Armbrust Lab Tech Roadmap
      Figure: tools placed by scalability (workstation/server to cluster/cloud) and specialization (specific tasks to general tasks), grouped as Past, Present, Soon
      Tools shown: ClustalW, MAQ, Excel, NCBI BLAST, Phred/Phrap, CloudBurst, CLC Genomics Machine, Hadoop/Dryad, Parallel Databases, Azure, AWS, WebBlast*, RDBMS, R, PPlacer*, AnnoJ, BioPython, other tools, “?”
  • 12. What Does Scalable Mean?
      Operationally:
        - In the past: “Works even if data doesn’t fit in main memory”
        - Now: “Can make use of 1000s of cheap computers”
      Formally:
        - In the past: polynomial time and space. If you have N data items, you must do no more than N^k operations
        - Soon: log-linear time and linear space. If you have N data items, you must do no more than N log(N) operations
      Soon, you’ll only get one pass at the data, so you had better make that one pass count
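To make the "one pass at the data" constraint concrete, here is a minimal sketch of a single-pass computation. It uses Welford's running update for mean and variance; that algorithm is chosen here for illustration and is not taken from the slides, and the function name `one_pass_stats` is hypothetical.

```python
from typing import Iterable, Tuple


def one_pass_stats(values: Iterable[float]) -> Tuple[int, float, float]:
    """Return (count, mean, population variance), reading each value exactly once."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / count if count else 0.0
    return count, mean, variance
```

Because the function only iterates, it accepts a generator or a file reader as readily as a list, and it uses constant memory however large N is.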
  • 13. A Goal: Cross-Scale Solutions
      Gracefully scale up
        - from files to databases to cluster to cloud
        - from MB to GB to TB to PB
      “Gracefully” means:
        - logical data independence
        - no expensive ETL migration projects
      “Gracefully” means: everyone can use it
        - Hackers / Computational Scientists
        - Lab/Field Scientists
        - The Public
        - K12
        - Legislators
  • 14. (Table)
      (approach) | Data Model | Operations | Services
      GPL | * | * | None for free
      Workflow | * | arbitrary boxes-and-arrows | typing, provenance, Pegasus-style resource mapping, task parallelism
      SQL / Relational Algebra | Relations | Select, Project, Join, Aggregate, … | optimization, physical data independence, indexing, parallelism
      MapReduce | [(key,value)] | Map, Reduce | massive data parallelism, fault tolerance, scheduling
      Pig | Nested Relations | RA-like, with Nest/Flatten | optimization, monitoring, scheduling
      DryadLINQ | IQueryable, IEnumerable | RA + Apply + Partitioning | typing, massive data parallelism, fault tolerance
      MPI | Arrays/Matrices | 70+ ops | data parallelism, full control
  • 15. MapReduce
        - Many tasks process big data, produce big data
        - Want to use hundreds or thousands of CPUs
        - ... but this needs to be easy
        - Parallel databases exist, but require DBAs and $$$$
        - …and do not easily scale to thousands of computers
      MapReduce is a lightweight framework, providing:
        - Automatic parallelization and distribution
        - Fault-tolerance
        - I/O scheduling
        - Status and monitoring
  • 16. A complete DryadLINQ program (slide source: Christophe Poulain, MSR)
      public class LogEntry {
          public string user, ip;
          public string page;
          public LogEntry(string line) {
              string[] fields = line.Split(' ');
              this.user = fields[8];
              this.ip = fields[9];
              this.page = fields[5];
          }
      }
      public class UserPageCount {
          public string user, page;
          public int count;
          public UserPageCount(string usr, string page, int cnt) {
              this.user = usr;
              this.page = page;
              this.count = cnt;
          }
      }
      PartitionedTable<string> logs = PartitionedTable.Get<string>(@"file:…logfile.pt");
      var logentries = from line in logs
                       where !line.StartsWith("#")
                       select new LogEntry(line);
      var user = from access in logentries
                 where access.user.EndsWith(@"ulfar")
                 select access;
      var accesses = from access in user
                     group access by access.page into pages
                     select new UserPageCount("ulfar", pages.Key, pages.Count());
      var htmAccesses = from access in accesses
                        where access.page.EndsWith(".htm")
                        orderby access.count descending
                        select access;
      htmAccesses.ToPartitionedTable(@"file:…results.pt");
  • 17. Relational Databases
      Pre-relational DBMS brittleness: if your data changed, your application often broke.
      Early RDBMS were buggy and slow (and often reviled), but required only 5% of the application code.
      Diagram: files and pointers -> (physical data independence) -> relations -> (logical data independence) -> views
      “Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.”
      Key Idea: Programs that manipulate tabular data exhibit an algebraic structure allowing reasoning and manipulation independently of physical data representation
  • 18. Relational Databases
      Databases are especially, but not exclusively, effective at “Needle in Haystack” problems:
        - Extracting small results from big datasets
        - Transparently provide “old style” scalability
        - Your query will always* finish, regardless of dataset size.
        - Indexes are easily built and automatically used when appropriate
      CREATE INDEX seq_idx ON sequence(seq);
      SELECT seq FROM sequence WHERE seq = 'GATTACGATATTA';
      *almost
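One way to see "automatically used when appropriate" in practice is to ask a database for its query plan. The sketch below uses Python's standard `sqlite3` module with an in-memory database; SQLite stands in for whatever RDBMS the slide had in mind, and the helper names are hypothetical. The table, column, and index names are the ones from the slide.

```python
import sqlite3


def build_sequence_db(sequences):
    """Create an in-memory table of sequences with an index on the seq column."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sequence (seq TEXT)")
    conn.executemany("INSERT INTO sequence (seq) VALUES (?)", [(s,) for s in sequences])
    conn.execute("CREATE INDEX seq_idx ON sequence(seq)")
    return conn


def query_plan(conn, target):
    """Return the planner's textual description of how it would run the lookup."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT seq FROM sequence WHERE seq = ?", (target,)
    ).fetchall()
    # The last column of each row holds the human-readable plan detail.
    return " ".join(str(row[-1]) for row in rows)


def find(conn, target):
    """Run the needle-in-haystack lookup itself."""
    cursor = conn.execute("SELECT seq FROM sequence WHERE seq = ?", (target,))
    return [row[0] for row in cursor]
```

The query text never mentions `seq_idx`; the planner picks the index on its own, which is the point of the slide. The exact wording of the plan varies between SQLite versions.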
  • 19. Key Idea: Data Independence
      Diagram: files and pointers -> (physical data independence) -> relations -> (logical data independence) -> views
      Views: SELECT * FROM my_sequences
      Relations: SELECT seq FROM ncbi_sequences WHERE seq = 'GATTACGATATTA';
      Files and pointers:
          f = fopen('table_file');
          fseek(10030440);
          while (True) {
              fread(&buf, 1, 8192, f);
              if (buf == GATTACGATATTA) {
                  . . .
  • 20. Key Idea: An Algebra of Tables
      Diagram operators: select, project, join, join
      Other operators: aggregate, union, difference, cross product
  • 21. Key Idea: Algebraic Optimization
      N = ((z*2)+((z*3)+0))/1
      Algebraic Laws:
        1. (+) identity: x+0 = x
        2. (/) identity: x/1 = x
        3. (*) distributes: (n*x+n*y) = n*(x+y)
        4. (*) commutes: x*y = y*x
      Apply rules 1, 3, 4, 2: N = (2+3)*z
      Two operations instead of five, no division operator
      Same idea works with the Relational Algebra!
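The rewrite on this slide can be sketched as a tiny rule-based optimizer. The representation (nested tuples of the form `(op, left, right)`) and the function names are assumptions made for illustration. A real query optimizer applies the same idea to relational operators rather than arithmetic.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul, "/": operator.truediv}


def simplify(expr):
    """Rewrite an expression tree bottom-up using the four laws from the slide."""
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr
    left, right = simplify(left), simplify(right)
    if op == "+" and right == 0:   # law 1: x + 0 = x
        return left
    if op == "/" and right == 1:   # law 2: x / 1 = x
        return left
    if (op == "+" and isinstance(left, tuple) and isinstance(right, tuple)
            and left[0] == "*" and right[0] == "*" and left[1] == right[1]):
        # law 3: n*x + n*y = n*(x + y); law 4 then moves the sum to the front
        n, x, y = left[1], left[2], right[2]
        return ("*", ("+", x, y), n)
    return (op, left, right)


def count_ops(expr):
    """Number of operators in the tree."""
    if not isinstance(expr, tuple):
        return 0
    return 1 + count_ops(expr[1]) + count_ops(expr[2])


def evaluate(expr, env):
    """Evaluate a tree, looking variable names up in env."""
    if isinstance(expr, tuple):
        op, left, right = expr
        return OPS[op](evaluate(left, env), evaluate(right, env))
    return env[expr] if isinstance(expr, str) else expr
```

Writing the slide's expression as `("/", ("+", ("*", "z", 2), ("+", ("*", "z", 3), 0)), 1)` and passing it to `simplify` gives the tree for `(2+3)*z`, and `count_ops` confirms the drop from five operators to two.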
  • 22. Shared Nothing Parallel Databases
        - Teradata
        - Greenplum
        - Netezza
        - Aster Data Systems
        - DataAllegro (Microsoft)
        - Vertica
        - MonetDB (recently commercialized as “Vectorwise”)
  • 24. N-body Astrophysics Simulation
        - 15 years in dev
        - 10^9 particles
        - Gravity
        - Months to run
        - 7.5 million CPU hours
        - 500 timesteps
        - Big Bang to now
      Simulations from Tom Quinn’s Lab, work by Sarah Loebman, YongChul Kwon, Bill Howe, Jeff Gardner, Magda Balazinska
  • 25. Q1: Find Hot Gas
      SELECT id FROM gas WHERE temp > 150000
  • 26. Single Node: Query 1 (chart; dataset sizes 169 MB, 1.4 GB, 36 GB)
  • 27. Multiple Nodes: Query 1 (chart; Database Z)
  • 29. Q4: Gas Deletion (particles removed between two timesteps)
      SELECT gas1.id FROM gas1 FULL OUTER JOIN gas2 ON gas1.id = gas2.id WHERE gas2.id IS NULL
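The gas-deletion query can be tried on toy data with Python's standard `sqlite3` module. This is a sketch under two assumptions: the one-column table layout is invented for the example, and a LEFT OUTER JOIN replaces the FULL OUTER JOIN because older SQLite builds lack the latter. Since only rows with no match in `gas2` survive the filter, the two joins should return the same ids here. The NULL test has to be written `IS NULL`, because a comparison with `= NULL` is never true in SQL.

```python
import sqlite3


def deleted_particles(ids_t1, ids_t2):
    """Return ids present at the first timestep but absent from the second."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE gas1 (id INTEGER PRIMARY KEY)")
    conn.execute("CREATE TABLE gas2 (id INTEGER PRIMARY KEY)")
    conn.executemany("INSERT INTO gas1 VALUES (?)", [(i,) for i in ids_t1])
    conn.executemany("INSERT INTO gas2 VALUES (?)", [(i,) for i in ids_t2])
    rows = conn.execute(
        """
        SELECT gas1.id
        FROM gas1 LEFT OUTER JOIN gas2 ON gas1.id = gas2.id
        WHERE gas2.id IS NULL
        ORDER BY gas1.id
        """
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]
```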
  • 32. Ease of Use
      star43 = FOREACH rawGas43 GENERATE $0 AS pid:long;
      star60 = FOREACH rawGas60 GENERATE $0 AS pid:long;
      groupedGas = COGROUP star43 BY pid, star60 BY pid;
      selectedGas = FOREACH groupedGas GENERATE
          FLATTEN((IsEmpty(star43) ? null : star43)) AS s43,
          FLATTEN((IsEmpty(star60) ? null : star60)) AS s60;
      destroyed = FILTER selectedGas BY s60 IS NULL;
  • 34. Data explosion, again
      Data growth is outpacing Moore’s Law. Why?
        - Cost of acquisition has dropped through the floor
        - Every pairwise comparison of datasets generates a new dataset: N^2 growth
      So: Scalable analysis is necessary
      But: Scalable analysis is hard
  • 35. It’s not just the size…
      Corollary: # of apps scales as N^2
        - Every pairwise comparison motivates a new application
      To keep up, we need to
        - entrain new programmers,
        - make existing programmers more productive,
        - or both
  • 36. Satellite Images + Crime Incidence Reports
  • 37. Twitter Feed + Flickr Stream
  • 38. Zooplankton and Temperature <Vis movie>
  • 39. Why Visualization?
        - High bandwidth of the human visual cortex
        - Query-writing presumes a precise goal
        - Try this in SQL: “What does the salt wedge look like?”
  • 40. Data Product Ensembles (source: Antonio Baptista, Center for Coastal Margin Observation and Prediction)
  • 41. Example: Find matching sequences
        - Given a set of sequences
        - Find all sequences equal to “GATTACGATATTA”
  • 42. Example System: Teradata
      AMP = unit of parallelism
  • 43. Example System: Teradata
      Find all orders from today, along with the items ordered
      SELECT * FROM Orders o, Lines i WHERE o.item = i.item AND o.date = today()
      Plan: join (o.item = i.item) of [select (date = today()) over scan Order o] with [scan Item i]
  • 44. Example System: Teradata
      On each of AMP 1, AMP 2, AMP 3: scan Order o, then select date=today(), then hash h(item); the hashed rows are redistributed across AMP 1, AMP 2, AMP 3
  • 45. Example System: Teradata
      On each of AMP 1, AMP 2, AMP 3: scan Item i, then hash h(item); the hashed rows are redistributed across AMP 1, AMP 2, AMP 3
  • 46. Example System: Teradata
      On each of AMP 1, AMP 2, AMP 3: join o.item = i.item
        - AMP 1 contains all orders and all lines where hash(item) = 1
        - AMP 2 contains all orders and all lines where hash(item) = 2
        - AMP 3 contains all orders and all lines where hash(item) = 3
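The three Teradata slides describe a hash-partitioned parallel join. The sketch below simulates it sequentially in plain Python and is not Teradata's implementation. The row fields `order_id` and `price` and the byte-sum hash are invented for the example; Python's built-in `hash` is avoided because it is randomized per process for strings.

```python
from collections import defaultdict

NUM_AMPS = 3  # three units of parallelism, as drawn on the slides


def amp_for(item):
    """Pick the AMP that owns an item; a stand-in for the slides' h(item)."""
    return sum(item.encode("utf-8")) % NUM_AMPS


def parallel_join(order_fragments, line_fragments, today):
    """Join orders and lines on item after redistributing both sides by hash.

    Each argument is a list with one list of row dicts per AMP.
    """
    orders_at = defaultdict(list)
    lines_at = defaultdict(list)
    # Step 1: each AMP scans its local orders, filters on date, ships rows by hash.
    for fragment in order_fragments:
        for order in fragment:
            if order["date"] == today:
                orders_at[amp_for(order["item"])].append(order)
    # Step 2: each AMP scans its local lines and ships them with the same hash.
    for fragment in line_fragments:
        for line in fragment:
            lines_at[amp_for(line["item"])].append(line)
    # Step 3: every AMP joins only the rows it received.
    results = []
    for amp in range(NUM_AMPS):
        lines_by_item = defaultdict(list)
        for line in lines_at[amp]:
            lines_by_item[line["item"]].append(line)
        for order in orders_at[amp]:
            for line in lines_by_item[order["item"]]:
                results.append((order["order_id"], order["item"], line["price"]))
    return sorted(results)
```

Because both tables are redistributed with the same hash function, matching rows always land on the same AMP, so the local joins in step 3 need no communication.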
  • 47. MapReduce Programming Model (slide source: Google, Inc.)
      Input & Output: each a set of key/value pairs
      Programmer specifies two functions:
        - map (in_key, in_value) -> list(out_key, intermediate_value)
            Processes input key/value pair
            Produces set of intermediate pairs
        - reduce (out_key, list(intermediate_value)) -> list(out_value)
            Combines all intermediate values for a particular key
            Produces a set of merged output values (usually just one)
      Inspired by primitives from functional programming languages such as Lisp, Scheme, and Haskell
  • 48. 3/12/09 Bill Howe, eScience Institute48 Abridged Declaration of Independence A Declaration By the Representatives of the United States of America, in General Congress Assembled. When in the course of human events it becomes necessary for a people to advance from that subordination in which they have hitherto remained, and to assume among powers of the earth the equal and independent station to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the change. We hold these truths to be self-evident; that all men are created equal and independent; that from that equal creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just power from the consent of the governed; that whenever any form of government shall become destructive of these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect their safety and happiness. Prudence indeed will dictate that governments long established should not be changed for light and transient causes: and accordingly all experience hath shewn that mankind are more disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to throw off such government and to provide new guards for future security. 
Such has been the patient sufferings of the colonies; and such is now the necessity which constrains them to expunge their former systems of government. the history of his present majesty is a history of unremitting injuries and usurpations, among which no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid world, for the truth of which we pledge a faith yet unsullied by falsehood. Example: Document Processing
  • 49. Example: Word length histogram (document text as on slide 48)
      How many “big”, “medium”, and “small” words are used?
  • 50. Example: Word length histogram (document text as on slide 48)
        - Big = Yellow = 10+ letters
        - Medium = Red = 5..9 letters
        - Small = Blue = 2..4 letters
        - Tiny = Pink = 1 letter
  • 51. Example: Word length histogram (document text as on slide 48)
      Split the document into chunks (Chunk 1, Chunk 2) and process each chunk on a different computer
  • 52. Example: Word length histogram (document text as on slide 48)
      Map task outputs as (key, value) pairs:
        - Map Task 1 (204 words): (yellow, 17) (red, 77) (blue, 107) (pink, 3)
        - Map Task 2 (190 words): (yellow, 20) (red, 71) (blue, 93) (pink, 6)
  • 53. Example: Word length histogram (document text as on slide 48)
        - Map task 1 output: (yellow, 17) (red, 77) (blue, 107) (pink, 3)
        - Map task 2 output: (yellow, 20) (red, 71) (blue, 93) (pink, 6)
        - “Shuffle step” groups values by key: (yellow, 17) (yellow, 20); (red, 77) (red, 71); (blue, 93) (blue, 107); (pink, 6) (pink, 3)
        - Reduce tasks output: (yellow, 37) (red, 148) (blue, 200) (pink, 9)
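The word length histogram can be sketched as three small Python functions, one per phase. The function names and the tokenization rule (split on whitespace, strip trailing punctuation) are assumptions, so counts from this sketch need not match the slide's numbers exactly on the full text. The length classes are the ones from the slides.

```python
from collections import defaultdict


def bucket(word):
    """Color bucket for a word, using the length classes from the slides."""
    n = len(word)
    if n >= 10:
        return "yellow"  # big
    if n >= 5:
        return "red"     # medium
    if n >= 2:
        return "blue"    # small
    return "pink"        # tiny


def map_task(chunk):
    """Count the words of one chunk per bucket and emit (key, value) pairs."""
    counts = defaultdict(int)
    for token in chunk.split():
        word = token.strip(".,;:")
        if word:
            counts[bucket(word)] += 1
    return sorted(counts.items())


def shuffle(*map_outputs):
    """Group the values emitted by all map tasks under their key."""
    grouped = defaultdict(list)
    for output in map_outputs:
        for key, value in output:
            grouped[key].append(value)
    return grouped


def reduce_task(grouped):
    """Sum the grouped values for each key."""
    return {key: sum(values) for key, values in grouped.items()}
```

Feeding the two map outputs listed on the slides through `shuffle` and `reduce_task` reproduces the slide's totals of 37, 148, 200, and 9.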
  • 54. New Example: What does this do? (slide source: Google, Inc.)
      map(String input_key, String input_value):
          // input_key: document name
          // input_value: document contents
          for each word w in input_value:
              EmitIntermediate(w, 1);
      reduce(String output_key, Iterator intermediate_values):
          // output_key: word
          // output_values: ????
          int result = 0;
          for each v in intermediate_values:
              result += v;
          Emit(result);
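One way to answer the slide's question is to run the pseudocode. Below is a Python transliteration with a sequential stand-in for the framework. `run_mapreduce` is a hypothetical helper, not part of any MapReduce library, and it skips everything a real system does about distribution and fault tolerance.

```python
from collections import defaultdict


def map_fn(input_key, input_value):
    # input_key: document name (unused here); input_value: document contents
    for word in input_value.split():
        yield word, 1


def reduce_fn(output_key, intermediate_values):
    # output_key: a word; intermediate_values: the values emitted for that word
    yield sum(intermediate_values)


def run_mapreduce(documents, map_fn, reduce_fn):
    """Sequential stand-in for the framework: map, group by key, then reduce."""
    groups = defaultdict(list)
    for name, contents in documents.items():
        for key, value in map_fn(name, contents):
            groups[key].append(value)
    return {key: list(reduce_fn(key, values)) for key, values in groups.items()}
```

Calling `run_mapreduce({"d1": "to be or not to be"}, map_fn, reduce_fn)` and inspecting the result shows what the two functions compute together.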
  • 55. Relational Database Management Systems (RDBMS)
      Before RDBMS: if your data changed, your application broke.
      Early RDBMS were buggy and slow (and often reviled), but required only 5% of the application code.
      “Activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed.” -- E.F. Codd, 1970
      Key idea: Programs that manipulate tabular data exhibit an algebraic structure that allows reasoning and manipulation independently of the physical data representation.
  • 56. MapReduce is a Nascent Database Engine
      Labels on the slide: Access Methods and Scheduling; Query Language; Query Optimizer; Pig Latin (graphics not shown)
      Graphics taken from: hadoop.apache.org and research.yahoo.com/node/90
  • 57. MapReduce and Hadoop
        • MR introduced by Google
            • Published paper in OSDI 2004
            • MR: high-level programming model and implementation for large-scale parallel data processing
        • Hadoop
            • Open source MR implementation
            • Yahoo!, Facebook, New York Times
  • 58. A Query Language for MR: Pig Latin
        • High-level, SQL-like dataflow language for MR
        • Goal: Sweet spot between SQL and MR
        • Applies SQL-like, high-level language constructs to accomplish low-level MR programming.
      Operators: LOAD, STORE, FILTER, FOREACH … GENERATE, GROUP
      Binary operators: JOIN, COGROUP, UNION
      Other support: UDFs, COUNT, SUM, AVG, MIN/MAX
      Additional operators: http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.htm
  • 59. New Task: k-mer Similarity
        • Given a set of database sequences and a set of query sequences
        • Return the top N similar pairs, where similarity is defined as the number of common k-mers
  • 60. Pig Latin program
      -- FASTA() and kmers() are user-defined functions (UDFs), not Pig built-ins
      D = LOAD 'db_sequences.fasta' USING FASTA() AS (did, dsequence);
      Q = LOAD 'query_sequences.fasta' USING FASTA() AS (qid, qsequence);
      Kd = FOREACH D GENERATE did, FLATTEN(kmers(7, dsequence)) AS kmer;
      Kq = FOREACH Q GENERATE qid, FLATTEN(kmers(7, qsequence)) AS kmer;
      R = JOIN Kd BY kmer, Kq BY kmer;
      G = GROUP R BY (qid, did);
      C = FOREACH G GENERATE FLATTEN(group) AS (qid, did), COUNT(R) AS score;
      T = FILTER C BY score > 4;
      STORE T INTO 'seqs.txt';
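For readers without a Pig installation, the same dataflow can be sketched in plain Python. This is an illustration under stated assumptions, not the code behind the slide: `kmers` and `top_similar_pairs` are hypothetical names, sequences are passed as in-memory dictionaries rather than FASTA files, and the score counts every matching pair of k-mer occurrences, which is what a join followed by a count produces. Unlike the Pig program, which keeps pairs scoring above a threshold, this version returns the top N pairs as the task statement asks.

```python
import heapq
from collections import Counter, defaultdict

def kmers(k, sequence):
    """All overlapping substrings of length k, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def top_similar_pairs(db, queries, k, n):
    # Index database k-mers, playing the role of the JOIN's build side.
    index = defaultdict(list)
    for did, dsequence in db.items():
        for kmer in kmers(k, dsequence):
            index[kmer].append(did)

    # JOIN on k-mer, then GROUP BY (qid, did) and COUNT.
    scores = Counter()
    for qid, qsequence in queries.items():
        for kmer in kmers(k, qsequence):
            for did in index.get(kmer, ()):
                scores[(qid, did)] += 1

    # Keep the n highest-scoring (query, database) pairs.
    return heapq.nlargest(n, scores.items(), key=lambda item: item[1])

db = {"d1": "ACGTACGT", "d2": "TTTTGGGG"}
queries = {"q1": "ACGTAC", "q2": "GGGG"}
top = top_similar_pairs(db, queries, k=3, n=2)
```

Counting distinct shared k-mers instead would be an equally reasonable reading of "number of common k-mers"; that would mean building sets rather than lists.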
  • 61. New Task: Alignment
        • RMAP alignment implemented in Hadoop
          Michael Schatz, CloudBurst: highly sensitive read mapping with MapReduce, Bioinformatics 25(11), April 2009
        • Goal: Align reads to a reference genome
        • Overview:
            • Map: Split reads and reference into k-mers
            • Reduce: for matching k-mers, find end-to-end alignments (seed and extend)
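The map and reduce roles in the overview can be illustrated with a toy seed-and-extend in Python. This is a sketch under simplifying assumptions and is not CloudBurst's implementation: all function names are hypothetical, there is a single short reference string, and "extend" here only counts mismatches position by position (no insertions or deletions).

```python
from collections import defaultdict

def map_seeds(source, seq_id, sequence, k):
    """Map: emit (k-mer, (source, id, offset)) for every k-mer."""
    for offset in range(len(sequence) - k + 1):
        yield sequence[offset:offset + k], (source, seq_id, offset)

def reduce_extend(seeds, reference, reads, max_mismatches):
    """Reduce: for one shared k-mer, extend each reference/read seed pair
    to an end-to-end alignment of the read and keep the good ones."""
    ref_hits = [s for s in seeds if s[0] == "ref"]
    read_hits = [s for s in seeds if s[0] == "read"]
    for _, _, ref_offset in ref_hits:
        for _, read_id, read_offset in read_hits:
            start = ref_offset - read_offset  # where the read would begin
            read = reads[read_id]
            window = reference[start:start + len(read)] if start >= 0 else ""
            if len(window) != len(read):
                continue  # the read would hang off an end of the reference
            mismatches = sum(a != b for a, b in zip(window, read))
            if mismatches <= max_mismatches:
                yield read_id, start, mismatches

def align(reference, reads, k, max_mismatches):
    groups = defaultdict(list)  # stands in for the shuffle step
    for kmer, value in map_seeds("ref", "ref", reference, k):
        groups[kmer].append(value)
    for read_id, read in reads.items():
        for kmer, value in map_seeds("read", read_id, read, k):
            groups[kmer].append(value)
    hits = set()  # several seeds can rediscover the same alignment
    for seeds in groups.values():
        hits.update(reduce_extend(seeds, reference, reads, max_mismatches))
    return sorted(hits)

reference = "ACGTTGCAAGGCTA"
reads = {"r1": "TTGCAA", "r2": "AGGATA"}
alignments = align(reference, reads, k=3, max_mismatches=1)
```

Each entry in `alignments` is a `(read_id, start, mismatches)` tuple. The set is needed because every shared k-mer of a read that aligns well leads back to the same alignment.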
  • 62. MapReduce Overhead (figure not shown)
  • 63. Elastic MapReduce
        • Custom Jar
            • Java
        • Streaming
            • Any language that can read/write stdin/stdout
        • Pig
            • Simple data flow language
        • Hive
            • SQL
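The Streaming option works with any language because the contract is plain text: a mapper reads input lines and writes tab-separated key/value lines, the framework sorts those lines by key, and a reducer reads the sorted lines. The Python sketch below shows a word-count mapper and reducer written against that convention. The function names and the in-memory driver at the bottom are illustrative only; the sort merely imitates what the framework would do between the two stages.

```python
import io
from itertools import groupby

def mapper(lines, out):
    """Write one 'word<TAB>1' line per word read."""
    for line in lines:
        for word in line.split():
            out.write(f"{word}\t1\n")

def reducer(lines, out):
    """Sum the counts per word; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        out.write(f"{word}\t{total}\n")

# Local stand-in for the framework: map, sort by key, reduce.
mapped = io.StringIO()
mapper(["to be or not", "to be"], mapped)
shuffled = sorted(mapped.getvalue().splitlines(keepends=True))
reduced = io.StringIO()
reducer(shuffled, reduced)
result = reduced.getvalue()
```

In an actual streaming job the two functions would live in separate scripts and be called as `mapper(sys.stdin, sys.stdout)` and `reducer(sys.stdin, sys.stdout)`.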
  • 64. Taxonomy of Parallel Architectures
      Easiest to program, but $$$$
      Scales to 1000s of computers

Editor's notes

  1. My name is Bill Howe. I’m not Ed Lazowska. In all fields of science, data is starting to come in faster than it can be analyzed, so we need to advance and proliferate computational technologies in sensor networking, databases and data mining, visualization, machine learning, and cluster/cloud computing. And if we don’t, we see UW losing its competitive edge. The mission of the eScience Institute is to prevent that from happening. So by the animation loophole, there we go. Funding! We have $1M from the state, and we just got a nice award from the Moore Foundation, and several proposals outstanding. People! We have a fantastic team: Dave Beck in Biosciences, Jeff Gardner in Astrophysics and HPC, myself in Databases, Ed and Erik, Mette Peters in Health Sciences, and Chance Reschke in large-scale computing platforms. And there’s our URL: escience.washington
  2. Drowning in data; starving for information. We’re at war with these engineering companies. FlowCAM is bragging about the amount of data they can spray out of their device. How to use this enormous data stream to answer scientific questions is someone else’s problem.
  3. The long tail of eScience -- huge number of scientists who struggle with data management, but do not have access to IT resources -- no clusters, no system administrators, no programmers, and no computer scientists. They rely on spreadsheets, email, and maybe a shared file system. Their data challenges have more to do with heterogeneity than size: tens of spreadsheets from different sources. However: the long tail is becoming the fat tail. Tens of spreadsheets are growing to hundreds, and the number of records in each goes from hundreds to thousands. How many of you know someone who was forced to split a large spreadsheet into multiple files in order to get around the 65k record limit in certain versions of Excel? Further, medium data (gigabytes) becomes big data (terabytes). Ocean modelers are moving from regional-focus to meso-scale simulations to global simulations.
  4. Armbrust Lab combines lab-based and field-based studies to address basic questions about the function of marine ecosystems.
  5. Asterisk/underlined (*) indicates custom software developed in the Armbrust Lab. Blue: Traditional tools for “basement” bioinformatics -- individual scientists. Orange: Increased centralization, economies of scale, shared resources. Deployed in the Armbrust Lab. Yellow: Third-party tools developed for scalable bioinformatics. Purple: Emerging tools under evaluation for convenient petascale bioinformatics. Through a collaboration with the eScience Institute (under funding review by Moore Foundation!) Thanks to advances in sensors, sequencing instruments, and algorithms, the field of bioinformatics is moving away from "single-task" software that operates on datasets that fit on a single computer in favor of flexible, "multi-purpose" frameworks that can operate on datasets that span clusters of computers. In our lab, we have deployed a variety of flexible tools, and have developed our own software to streamline our scientific process and reduce the overall "time to insight". (maybe talk about WebBlast and PPlacer here.) Observing that the amount of data collected is doubling every year (outpacing even Moore's Law!), we are also collaborating with the UW eScience Institute to explore ways we can harness emerging technologies for massively parallel data analysis involving hundreds or thousands of machines. Some of these frameworks involve "cloud computing" -- the use of computational infrastructure provided, inexpensively, by "big players" in software and computing -- Amazon, Microsoft, Google. [Maybe more on the eScience Institute?]
  6. Dial down the expressiveness but dial up the programming and execution services
  7. It turns out that you can express a wide variety of computations using only a handful of operators.
  12. Two nodes slower than one, four nodes slower than 8 -- shows overhead of providing parallel processing
  16. Data products are the currency of scientific and statistical communication with the public. Ex: Obama map. Ex: Mars Rover pictures generate 218M hits in 24 hrs. But: Datasets are growing too big and too complex to view through a few static images. Scientists want to create interactive visualizations that allow others to explore their results. Ex: NASA 3D with Photosynth. Ex: CAMERA. Ex:
  17. On the order of hundreds of points. Manual browsing.
  18. This movie was rendered offline, but it’s increasingly important to be able to create visualizations on the fly to allow interactive exploration of large datasets.
  19. Visualization is a more efficient way to query data -- you can browse and explore. But you need to be able to switch back and forth between interactive browsing and symbolic querying
  20. Climatology is long-term average
  21. Want to know the makeup of the text by word length. For example, we’d like to know how many words have greater than 10 characters. We’d also like to know how many words have between 5 and 9 characters, between 2 and 4 and those with just 1 character. Map will read in text and tag each word as a different color depending on the length of the word.
  23. Motivating Map task and intuition behind map…. Think of map as a group by. Distribution of word lengths
  26. It provides a means of describing data with its natural structure only--that is, without superimposing any additional structure for machine representation purposes. Accordingly, it provides a basis for a high level data language which will yield maximal independence between programs on the one hand and machine representation on the other.
  27. So these two different views of the world, RDBMS and MapReduce, are not really different at all -- just different feature sets along a continuum of data processing. As evidence: Teradata, Greenplum, Netezza, Aster Data Systems, Dataupia, Vertica, MonetDB
  28. The Hadoop implementation is based on details in the 2004 MR paper.
  29. You don’t have to write separate map and reduce functions; Pig will take care of that for you, as well as optimize for you. This is by no means an exhaustive list of operators.
  30. You don’t have to write separate map and reduce functions; Pig will take care of that for you, as well as optimize for you.
  31. The goal here is to make Shared Nothing Architectures easier to program.