Democratizing Data Science
in the Cloud
Bill Howe, Ph.D.
Associate Director and Senior Data Science Fellow, eScience Institute
Affiliate Associate Professor, Computer Science & Engineering
11/1/2016 Bill Howe, UW 1
Cloud Data Management is about
sharing resources between tenants
We’re interested in new services powered by sharing
more than infrastructure – schema, data, queries
Why?
Example: JBOT* Open Data systems
Google
Fusion
Tables
Entrepreneurship
1) “Data once guarded for assumed but untested
reasons is now open, and we're seeing benefits.”
-- Nigel Shadbolt, Open Data Institute
2) Need to help “non-specialists within an
organization use data that had been the
realm of programmers and DB admins”
-- Benjamin Romano, Xconomy
“Businesses are now using data the way
scientists always have”
-- Jeff Hammerbacher
Mt. Sinai, formerly Cloudera
*Just a Bunch of Tables
Data, data, data
Kevin Merritt
CEO
Socrata
Deep Dhillon
CTO
Socrata
Q Q Q
….
Control Plane /
Infrastructure
Data Plane /
Database sys.
Application /
schema, data,
query logs
Q Q Q
….
Benefits: Significantly reduced management overhead
Challenges: security, scheduling, SLAs, isolation
Virtualization
Control Plane /
Infrastructure
Data Plane /
Database sys.
Application /
schema, data,
query logs
Q Q Q
….
DB-as-a-Service
Benefits: Significantly reduced management overhead
Challenges: security, scheduling, SLAs, isolation
Control Plane /
Infrastructure
Data Plane /
Database sys.
Application /
schema, data,
query logs
Q Q Q
….
JBOT* Query-as-a-Service Systems
Goal:
smart cross-tenant services,
trained on everyone’s data
• Metadata inference and data curation
• Query recommendation via common idioms
• Data discovery – e.g., “find me things to join with”
• Visualization recommendation
• Semi-automatic integration services
Control Plane /
Infrastructure
Data Plane /
Database sys.
Application /
schema, data,
query logs
*Just a Bunch of Tables
Example Service: Automated Data Curation
Microarray samples submitted to the Gene Expression Omnibus
Curation is fast becoming the
bottleneck to data sharing
Maxim Grechkin
Hoifung Poon
Example Service: Automated Data Curation
Maxim Grechkin
Hoifung Poon
Goal: Repair metadata for genetic
datasets using the content of the data, the
structure of an associated ontology, the
abstract of the paper, and everything else.
Deep Neural Network
Tissue Type Labels
Innovations in transfer learning,
poor training data, etc.
Paper
Abstract
Example Service: Automated Data Curation
Maxim Grechkin
Hoifung Poon
Iterative co-learning between a text-based classifier and an
expression-based classifier: both models improve by
training on each other's results
• SQLShare: Query-as-a-Service
• VizDeck: Visualization recommendation
• Myria: Big Data Ecosystems
VizDeck
Some Cloud Data Systems
1) Upload data “as is”
Cloud-hosted, secure; no
need to install or design a
database; no pre-defined
schema; schema inference;
some integration
2) Write Queries
Right in your browser,
writing views on top of
views on top of views ...
SELECT hit, COUNT(*) AS cnt
FROM tigrfam_surface
GROUP BY hit
ORDER BY cnt DESC
3) Share the results
Make them public, tag them,
share with specific colleagues –
anyone with access can query
http://sqlshare.escience.washington.edu
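The three steps above can be sketched locally with Python's built-in sqlite3 module. This is only an illustration of the workflow, not SQLShare itself: the table name tigrfam_surface and the column hit come from the slide, while the sample rows and the view names hit_counts and top_hits are made up for the example.

```python
import sqlite3

# In-memory database standing in for the hosted service (assumption).
conn = sqlite3.connect(":memory:")

# 1) Upload data "as is": a single untyped-looking column, no schema design.
rows = [("TIGR00001",), ("TIGR00002",), ("TIGR00001",),
        ("TIGR00003",), ("TIGR00001",)]  # hypothetical sample data
conn.execute("CREATE TABLE tigrfam_surface (hit TEXT)")
conn.executemany("INSERT INTO tigrfam_surface VALUES (?)", rows)

# 2) Write queries as views, and views on top of views.
conn.execute("""
    CREATE VIEW hit_counts AS
    SELECT hit, COUNT(*) AS cnt
    FROM tigrfam_surface
    GROUP BY hit
""")
conn.execute("""
    CREATE VIEW top_hits AS
    SELECT hit, cnt FROM hit_counts WHERE cnt > 1
""")

# 3) Anyone with access to a view can query it like a table.
top_hits = conn.execute(
    "SELECT hit, cnt FROM top_hits ORDER BY cnt DESC"
).fetchall()
print(top_hits)
```

Because each derived result is a view rather than a copy, changes to the uploaded table flow through to every view built on it.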
SIGMOD 2016
SELECT x.strain, x.chr, x.region as snp_region, x.start_bp as snp_start_bp
, x.end_bp as snp_end_bp, w.start_bp as nc_start_bp, w.end_bp as nc_end_bp
, w.category as nc_category
, CASE WHEN (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
THEN x.end_bp - x.start_bp + 1
WHEN (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
THEN x.end_bp - w.start_bp + 1
WHEN (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
THEN w.end_bp - x.start_bp + 1
END AS len_overlap
FROM [koesterj@washington.edu].[hotspots_deserts.tab] x
INNER JOIN [koesterj@washington.edu].[table_noncoding_positions.tab] w
ON x.chr = w.chr
WHERE (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
OR (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
OR (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
ORDER BY x.strain, x.chr ASC, x.start_bp ASC
Non-programmers can write very complex queries
(rather than relying on staff programmers)
Example: Computing the overlaps of two sets of blast results
We see thousands of
queries written by
non-programmers
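The CASE expression in the query above computes the overlap length between two inclusive base-pair intervals case by case. As a point of comparison, here is a minimal sketch of the same idea in closed form. The function name and argument names are hypothetical, and this is an alternative formulation rather than the query author's method.

```python
def overlap_length(start_a: int, end_a: int, start_b: int, end_b: int) -> int:
    """Number of positions shared by two inclusive intervals (0 if disjoint)."""
    # The overlap runs from the later start to the earlier end.
    return max(0, min(end_a, end_b) - max(start_a, start_b) + 1)


# First interval entirely inside the second.
inside = overlap_length(10, 20, 5, 30)
# Second interval starts inside the first and runs past its end.
right = overlap_length(10, 20, 15, 30)
# Second interval ends inside the first.
left = overlap_length(10, 20, 5, 12)
# No overlap at all.
none = overlap_length(10, 20, 21, 30)
print(inside, right, left, none)
```

The closed form also covers the case where the second interval lies strictly inside the first, without needing an extra branch.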
The SQLShare Corpus:
A multi-year log of hand-written analytics queries
Queries 24275
Views 4535
Tables 3891
Users 591
SIGMOD 2016
Shrainik Jain
https://uwescience.github.io/sqlshare
A SQL “learner”
http://uwescience.github.io/sqlshare/
Latent Idioms for Schema-Independent Query Recommendation
Background on Word2Vec, GloVe:
Map each term in a corpus to a vector in a high-dimensional space based on its co-occurrences.
Linear relationships between these vectors appear to capture remarkable semantic properties.
:
SELECT COUNT(*) FROM [candrzejowiec@yahoo.com].[table_Firearms.txt]
SELECT COUNT (HiLo) FROM [roula.cardaras@gmail.com].[table_MUK.csv]
SELECT count(*) FROM [leslie@westerncatholic.org].[Depth_combined]
select count(Wave_Height) from [christa.kohnert@gmail.com].[Join]
SELECT count(*) FROM [wenjunh@washington.edu].[ecoli_nogaps_1.csv]
SELECT Count(*) FROM [latcron@gmail.com].[TargetTrackFeatures.csv]
SELECT count(*) FROM [billhowe].[sunrise sunset times 2009 - 2011]
SELECT Count(*) FROM [bifxcore@gmail.com].[table_ec_pdb_genus.csv]
SELECT count(*) FROM [whitead@washington.edu].[ecoli_nogaps_1.csv]
SELECT COUNT(*) FROM [ribalet@washington.edu].[Tokyo_0_merged.csv]
SELECT COUNT(*) FROM [dhalperi@washington.edu].[SPID_GOnumber.txt]
SELECT COUNT (species) FROM [bigbananatopdog@gmail.com].[Orthosia]
SELECT COUNT (species) FROM [bigbananatopdog@gmail.com].[Leucania]
:
Apply the same trick to the SQLShare corpus, cluster the results
A not-very-interesting cluster:
Latent SQL Idioms
:
SELECT COUNT(*) FROM [ajw123@washington.edu].[table_proteins.csv] WHERE species LIKE 'Homo sapiens
SELECT count (*) FROM [1029880@apps.nsd.org].[Task 5] where (Hashtags_In_Text) Like '%Phailin%'
SELECT count (*) FROM [1029880@apps.nsd.org].[Task 5] where (Hashtags_In_Text) Like '%Phailin%'
SELECT Count (*) FROM [kzoehayes@gmail.com].[Dated_Join] WHERE Category = 'Warm'
SELECT COUNT (*) FROM [ethanknight08@gmail.com].[table_PopulationV2.txt] WHERE Column1='Country'
SELECT COUNT(*) FROM [missmelupton@gmail.com].[table_pHWaterTemp] WHERE TempCategory='normal'
SELECT COUNT(*) FROM [1004387@apps.nsd.org].[no retweete] WHERE hashtags_in_text LIKE '%#odisha
:
Another not-very-interesting cluster:
We see other clusters that seem to capture more basics: “union,”
“group by with one grouping column,” “left outer join,” “string
manipulation,” etc.
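To give a feel for what "schema-independent" means here, the sketch below groups queries by their shape after stripping table names and literals. It is a deliberately crude stand-in: it uses hand-written normalization rules rather than the learned embeddings and clustering described on the slides, and the function name template is hypothetical. The sample queries are taken from the clusters shown above.

```python
import re
from collections import defaultdict


def template(query: str) -> str:
    """Strip schema-specific details so that only the query's shape remains."""
    q = re.sub(r"\[[^\]]*\]", "ID", query)    # bracketed identifiers
    q = re.sub(r"'[^']*'", "STR", q)           # string literals
    q = re.sub(r"\b\d+(\.\d+)?\b", "NUM", q)   # numeric literals
    q = re.sub(r"\s*\(\s*", "(", q)            # "COUNT (*)" -> "COUNT(*)"
    return re.sub(r"\s+", " ", q).strip().lower()


queries = [
    "SELECT COUNT(*) FROM [candrzejowiec@yahoo.com].[table_Firearms.txt]",
    "SELECT count(*) FROM [billhowe].[sunrise sunset times 2009 - 2011]",
    "SELECT Count (*) FROM [kzoehayes@gmail.com].[Dated_Join] WHERE Category = 'Warm'",
]

# Queries with the same template land in the same group.
clusters = defaultdict(list)
for q in queries:
    clusters[template(q)].append(q)

for shape, members in clusters.items():
    print(len(members), shape)
```

Unbracketed column names such as Category survive this normalization, which is one reason a learned representation is more attractive than fixed rules.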
Latent SQL Idioms
Latent SQL Idioms
More interesting examples:
select floor(latitude/0.7)*0.7 as latbin
, floor(longitude/0.7)*0.7 as lonbin
, species
FROM [koenigk92@gmail.com].[All3col]
select distinct case when patindex('%[0-9]%', [protein]) = 1 -- first char is number
and charindex(',', [protein]) = 0 -- and no comma present
then [protein]
else substring([protein], patindex('%[0-9]%', [protein]),
charindex(',', [protein])-patindex('%[0-9]%', [protein]))
end as [protein d1124],
[tot indep spectra] as [tot spectra d1124]
from [emmats@washington.edu].[d1_file124.txt]
Parsing a common
bioinformatics file format
Expressions for binning
space and time columns
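The binning idiom floor(x / w) * w from the first query above snaps each value down to the lower edge of a bin of width w. A minimal sketch of the same expression in Python follows; the observations and the helper name bin_floor are hypothetical.

```python
import math
from collections import Counter


def bin_floor(value: float, width: float) -> float:
    """Snap a value down to the lower edge of its bin: floor(x / w) * w."""
    return math.floor(value / width) * width


# Hypothetical (latitude, longitude, species) observations.
observations = [
    (47.65, -122.30, "orca"),
    (47.80, -122.45, "orca"),
    (48.10, -122.90, "seal"),
]

# The slide uses a bin width of 0.7; 0.5 is used here only because its
# bin edges are exactly representable in floating point.
width = 0.5
counts = Counter(
    (bin_floor(lat, width), bin_floor(lon, width), species)
    for lat, lon, species in observations
)
print(counts)
```

Note that floor rounds toward negative infinity, so negative longitudes fall into the bin below them rather than the bin closer to zero.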
MYRIA: BIG DATA POLYSTORES
Q Q Q
….
Control Plane /
Infrastructure
Data Plane /
Database sys.
Application /
schema, data,
query logs
Q Q Q
….
Polystore Ecosystems: “Software Defined Databases”
Data Plane /
Database sys.
Application /
schema, data,
query logs
RDBMS HPC / Linear Algebra Graphs
Polystore
Execution
Plan
move
data
execute
query
Polystore
Execution
Plan
Tables KeyVal Arrays Graphs
Myria Algebra
Tables KeyVal Arrays Graphs
Spark Accumulo CombBLAS GraphX
Parallel
Algebra
Logical
Algebra
RACO
Relational Algebra COmpiler
CombBLAS
API
Spark
API
Accumulo Graph
API
rewrite
rules
Array
Algebra
MyriaL
Services: visualization, logging, discovery, history, browsing
Orchestration
https://github.com/uwescience/raco
https://metanautix.com/tr/01_big_data_techniques_for_media_graphics.pdf
Ollie Lo, Los Alamos National Lab
CurGood = SCAN(public:adhoc:sc_points);
DO
mean = [FROM CurGood EMIT val=AVG(v)];
std = [FROM CurGood EMIT val=STDEV(v)];
NewBad = [FROM CurGood WHERE ABS(CurGood.v - mean) > 2 * std EMIT *];
CurGood = CurGood - NewBad;
continue = [FROM NewBad EMIT COUNT(NewBad.v) > 0];
WHILE continue;
DUMP(CurGood);
Sigma-clipping, V0
CurGood = P;
sum = [FROM CurGood EMIT SUM(val)];
sumsq = [FROM CurGood EMIT SUM(val*val)];
cnt = [FROM CurGood EMIT COUNT(*)];
NewBad = [];
DO
-- subtract only the newly removed points from the running aggregates
sum = sum - [FROM NewBad EMIT SUM(val)];
sumsq = sumsq - [FROM NewBad EMIT SUM(val*val)];
cnt = cnt - [FROM NewBad EMIT COUNT(*)];
mean = sum / cnt;
std = sqrt(1/(cnt*(cnt-1)) * (cnt * sumsq - sum*sum));
NewBad = FILTER([ABS(val-mean) > 2 * std], CurGood);
CurGood = CurGood - NewBad;
WHILE NewBad != {}
Sigma-clipping, V1: Incremental
Points = SCAN(public:adhoc:sc_points);
aggs = [FROM Points EMIT _sum=SUM(v), sumsq=SUM(v*v), cnt=COUNT(v)];
newBad = []
bounds = [FROM Points EMIT lower=MIN(v), upper=MAX(v)];
DO
new_aggs = [FROM newBad EMIT _sum=SUM(v), sumsq=SUM(v*v), cnt=COUNT(v)];
aggs = [FROM aggs, new_aggs EMIT _sum=aggs._sum - new_aggs._sum,
sumsq=aggs.sumsq - new_aggs.sumsq, cnt=aggs.cnt - new_aggs.cnt];
stats = [FROM aggs EMIT mean=_sum/cnt,
std=SQRT(1.0/(cnt*(cnt-1)) * (cnt * sumsq - _sum * _sum))];
newBounds = [FROM stats EMIT lower=mean - 2 * std, upper=mean + 2 * std];
tooLow = [FROM Points, bounds, newBounds WHERE newBounds.lower > v
AND v >= bounds.lower EMIT v=Points.v];
tooHigh = [FROM Points, bounds, newBounds WHERE newBounds.upper < v
AND v <= bounds.upper EMIT v=Points.v];
newBad = UNIONALL(tooLow, tooHigh);
bounds = newBounds;
continue = [FROM newBad EMIT COUNT(v) > 0];
WHILE continue;
output = [FROM Points, bounds WHERE Points.v > bounds.lower AND
Points.v < bounds.upper EMIT v=Points.v];
DUMP(output);
Sigma-clipping, V2
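The three MyriaL versions above differ mainly in how much work each iteration repeats. The sketch below restates the idea in plain Python under some assumptions: a 2-sigma threshold, the sample standard deviation, and in-memory lists instead of distributed relations. The function names are hypothetical. The first function recomputes the statistics from scratch on every pass, as in V0; the second keeps a running sum, sum of squares, and count, and updates them from the removed points only, as in V1 and V2.

```python
import math


def sigma_clip_naive(values, k=2.0):
    """V0-style: recompute mean and std from scratch on every pass."""
    good = list(values)
    while len(good) >= 2:
        n = len(good)
        mean = sum(good) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in good) / (n - 1))
        kept = [v for v in good if abs(v - mean) <= k * std]
        if len(kept) == n:  # nothing was clipped, so we have converged
            break
        good = kept
    return good


def sigma_clip_incremental(values, k=2.0):
    """V1/V2-style: maintain running aggregates across iterations."""
    good = list(values)
    total = sum(good)
    total_sq = sum(v * v for v in good)
    count = len(good)
    while count >= 2:
        mean = total / count
        var = (count * total_sq - total * total) / (count * (count - 1))
        std = math.sqrt(max(var, 0.0))  # guard against tiny negative rounding
        bad = [v for v in good if abs(v - mean) > k * std]
        if not bad:
            break
        # Update the aggregates using only the removed points.
        total -= sum(bad)
        total_sq -= sum(v * v for v in bad)
        count -= len(bad)
        good = [v for v in good if abs(v - mean) <= k * std]
    return good


data = [10, 11, 9, 10, 12, 8, 10, 100]  # hypothetical readings, one outlier
clipped_naive = sigma_clip_naive(data)
clipped_incremental = sigma_clip_incremental(data)
print(clipped_naive, clipped_incremental)
```

The incremental form only pays off when the removed set is small relative to the surviving set, which is the usual situation after the first pass or two.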
Dominik Moritz
EuroVis 15
Empower the end user to do
performance profiling, debugging, etc.
Diagnosing problems
Source node
Destination node
Dominik Moritz
EuroVis 15
Some ongoing work
• “from scratch” polystore optimizer
– Columbia-style, with some ideas from PL community
• Anecdotal Optimization
– Infer optimization decisions based on coarse-grained experimental
results from unreliable sources (blogs, literature)
– “System X is 2X faster than System Y on PageRank”
• Benchmarking Linear Algebra Systems vs. Databases
– HPC community thinks they are 1000X faster; they aren’t
– DB community thinks they are competitive; they aren’t
• Query compilation
– Bridge the gap between MPI and DB
• New query language Kamooks blending arrays and relations
Query compilation for distributed processing
pipeline
as
parallel
code
parallel compiler
machine
code
[Myers ’14]
pipeline
fragment
code
pipeline
fragment
code
sequential
compiler
machine
code
[Crotty ’14, Li ’14, Seo ’14, Murray ‘11]
sequential
compiler
RADISH
ICS 16
Brandon
Myers
1% selection microbenchmark, 20GB
Avoid long code paths
ICS 16
Brandon
Myers
Q2 SP2Bench, 100M triples, multiple self-joins
Communication optimization
ICS 16
Brandon
Myers
Graph Patterns
• SP2Bench, 100 million triples
• Queries compiled to a PGAS C++ language layer, then
compiled again by a low-level PGAS compiler
• One of Myria’s supported back ends
• Comparison with Shark/Spark, which itself has been shown to
be 100X faster than Hadoop-based systems
• …plus PageRank, Naïve Bayes, and more
RADISH
ICS 16
Brandon
Myers
ICS 15
RADISH
ICS 16
Brandon
Myers
TPC-H
Some ongoing work
• “from scratch” polystore optimizer
– Columbia-style, with some ideas from PL community
• Anecdotal Optimization
– Infer optimization decisions based on coarse-grained experimental
results from unreliable sources (blogs, literature)
– “System X is 2X faster than System Y on PageRank”
• Benchmarking Linear Algebra Systems vs. Databases
– HPC community thinks they are 1000X faster; they aren’t
– DB community thinks they are competitive; they aren’t
• Query compilation
– “Software-defined Databases”
– Bridge the gap between MPI and DB
• New query language Kamooks blending arrays and relations
select A.i, B.k, sum(A.val*B.val)
from A, B
where A.j = B.j
group by A.i, B.k
Matrix multiply in RA
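The query above can be run as-is on any SQL engine once each matrix is stored as (row, column, value) triples. The sketch below does so with Python's built-in sqlite3 module on two small, made-up 2 x 2 matrices; only the ORDER BY clause is added, to make the output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (i INTEGER, j INTEGER, val REAL)")
conn.execute("CREATE TABLE B (j INTEGER, k INTEGER, val REAL)")

# A = [[1, 2], [0, 3]] and B = [[4, 0], [5, 6]].
# Zero entries are simply not stored, which is what makes this sparse.
conn.executemany("INSERT INTO A VALUES (?, ?, ?)",
                 [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)])
conn.executemany("INSERT INTO B VALUES (?, ?, ?)",
                 [(0, 0, 4.0), (1, 0, 5.0), (1, 1, 6.0)])

# Join on the shared dimension j, then sum the products per output cell.
product = conn.execute("""
    SELECT A.i, B.k, SUM(A.val * B.val)
    FROM A, B
    WHERE A.j = B.j
    GROUP BY A.i, B.k
    ORDER BY A.i, B.k
""").fetchall()
print(product)
```

The join enumerates exactly the non-zero products, so the work done scales with the number of matching non-zero pairs rather than with the dense dimensions.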
Matrix multiply
Complexity of matrix multiply
n = number of rows
m = number of non-zeros
[Plot: complexity exponent vs. sparsity exponent ($r$ such that $m = n^r$)]
• naïve sparse algorithm: $mn$
• best known sparse algorithm: $m^{0.7} n^{1.2} + n^2$
• best known dense algorithm: $n^{2.38}$
lots of room here
slide adapted from R. Yuster and U. Zwick, Fast Sparse Matrix Multiplication
BLAS vs. SpBLAS vs. SQL (10k)
off the shelf
database
15X
20k X 20k matrix multiply by sparsity
CombBLAS, MyriaX, Radish
50k X 50k matrix multiply by sparsity
CombBLAS, MyriaX, Radish
Filter to upper left corner of result matrix
select AB.i, C.m, sum(AB.val*C.val)
from
(select A.i, B.k, sum(A.val*B.val)
from A, B
where A.j = B.j
group by A.i, B.k
) AB,
C
where AB.k = C.k
group by AB.i, C.m
A x B x C
select A.i, C.m, sum(A.val*B.val*C.val)
from A, B, C
where A.j = B.j
and B.k = C.k
group by A.i, C.m
A(i, j, val)
B(j, k, val)
C(k, m, val)
take three sparse
matrices
Now compute
multiway hypercube join:
O (|A|/p + |B|/p^2 + |C|/p)
Group by:
~O (N)
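One way to read the cost expression above is as the per-server load when the servers form a p x p grid indexed by (hash of j, hash of k): each B tuple goes to exactly one cell, while each A tuple is replicated along the k axis and each C tuple along the j axis. The sketch below simulates that shuffle in a single process. It is an illustration under those assumptions, not Myria's implementation; the function name is hypothetical, and integer indices with a modulo hash are assumed.

```python
from collections import defaultdict
from itertools import product


def hypercube_multiply(A, B, C, p):
    """Compute A x B x C for sparse triples A(i,j,v), B(j,k,v), C(k,m,v)."""
    servers = {cell: {"A": [], "B": [], "C": []}
               for cell in product(range(p), repeat=2)}

    # Shuffle phase: one pass over each input, no intermediate result.
    for (i, j, v) in A:                      # replicate along the k axis
        for hk in range(p):
            servers[(j % p, hk)]["A"].append((i, j, v))
    for (j, k, v) in B:                      # exactly one destination
        servers[(j % p, k % p)]["B"].append((j, k, v))
    for (k, m, v) in C:                      # replicate along the j axis
        for hj in range(p):
            servers[(hj, k % p)]["C"].append((k, m, v))

    # Local three-way join on each server, then a global group-by sum.
    result = defaultdict(float)
    for frag in servers.values():
        for (i, j, a) in frag["A"]:
            for (j2, k, b) in frag["B"]:
                if j != j2:
                    continue
                for (k2, m, c) in frag["C"]:
                    if k == k2:
                        result[(i, m)] += a * b * c
    return dict(result)


# Hypothetical inputs: A = [[1, 2], [0, 3]], B = [[4, 0], [5, 6]], C = [[1, 0], [2, 0]].
A = [(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)]
B = [(0, 0, 4.0), (1, 0, 5.0), (1, 1, 6.0)]
C = [(0, 0, 1.0), (1, 0, 2.0)]
abc = hypercube_multiply(A, B, C, p=2)
print(abc)
```

Every matching (A, B, C) combination meets on exactly one server, because the B tuple that anchors it lives in only one cell, so no product is counted twice.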
But wait, there’s more…..
2 seconds,
balanced
Hypercube
shuffle
Partitioned
hash join
43 seconds,
tons of skew
Task: self-multiply with 1M non-zeros
Seung-Hee Bae
Scalable Graph Clustering
Version 1
Parallelize Best-known
Serial Algorithm
ICDM 2013
Version 2
Free 30% improvement
for any algorithm
TKDD 2014 SC 2015
Version 3
Distributed approx.
algorithm, 1.5B edges
http://escience.washington.edu
http://myria.cs.washington.edu
http://uwescience.github.io/sqlshare/
VIZDECK: VISUALIZATION
RECOMMENDATION
“Data Triage” Pipeline
SAS
Excel
XML
CSV
SQL Azure
Files Tables Views
parse /
extract
“relational
analysis”
visual
analysis
Visualizations
SIGMOD 11
SSDBM 13
SIGMOD 16
sqlshare.escience.washington.edu
CHI 12
SIGMOD 12
iConference 13
SSDBM 11
CiSE 13
SSDBM 15
[Bar chart: Task Completion Rate / Time, All Questions, comparing Fusion, VizDeck, ManyEyes, and Tableau]
CHI 13
Visualization Recommendation
• Model each “vizlet” as a triple
(x_column, y_column, vizlet_type)
• Extract features from each column
(f1x, f2x,…, fNx, f1y, f2y, …, fNy, vizlet_type)
• Interpret each “promotion” as a yes vote and each “discard” as a
no vote
• Train a (simple) model to predict vizlet type from features
• Recommend highest-scoring vizlets
• Add a diversity term to prevent a bunch of similar plots
• Incorporate score modifiers defined by the vizlet designer
– “My bar chart looks best when there are about 5 bars.”
– “My timeseries plot ignores null values”
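The ranking step in the list above can be sketched as a greedy selection with a diversity penalty. The scores are assumed to come from a model trained on promote and discard votes; here they are hard-coded, hypothetical numbers. The penalty weight, the overlap measure (same vizlet type or same x column), and the function name recommend are likewise assumptions made for illustration, not VizDeck's actual scoring.

```python
from typing import NamedTuple


class Vizlet(NamedTuple):
    x_column: str
    y_column: str
    vizlet_type: str


def recommend(scores, k, diversity_penalty=0.3):
    """Greedily pick k vizlets, penalising overlap with those already chosen."""
    chosen = []
    remaining = dict(scores)
    while remaining and len(chosen) < k:
        def adjusted(v):
            # Count shared type and shared x column against each chosen vizlet.
            overlap = sum(
                (v.vizlet_type == c.vizlet_type) + (v.x_column == c.x_column)
                for c in chosen
            )
            return remaining[v] - diversity_penalty * overlap

        best = max(remaining, key=adjusted)
        chosen.append(best)
        del remaining[best]
    return chosen


# Hypothetical model scores for three candidate vizlets.
scores = {
    Vizlet("date", "temperature", "timeseries"): 0.90,
    Vizlet("date", "salinity", "timeseries"): 0.85,
    Vizlet("species", "count", "bar"): 0.70,
}
picked = recommend(scores, k=2)
print(picked)
```

Without the penalty the two time series would fill both slots; with it, the bar chart takes the second slot even though its raw score is lower.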
Example of a Learned Rule (1)
low x-entropy => bad scatter plot
bad scatter plot good scatter plot
Example of a Learned Rule (2)
low x-entropy => histogram
bad scatter plot good histogram
Example of a Learned Rule (3)
high x-periodicity => timeseries plot
(periodicity = 1 / variance in gap length between successive values)
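The two column features named in these learned rules can be computed in a few lines. The sketch below assumes Shannon entropy over the distinct values of a column, and reads "successive values" as values in sorted order with the population variance of the gaps; both readings are assumptions, as are the function names.

```python
import math
from collections import Counter
from statistics import pvariance


def entropy(values):
    """Shannon entropy (in bits) of the empirical value distribution."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def periodicity(values):
    """1 / variance of the gaps between successive sorted values.

    Requires at least two values. Perfectly regular spacing has zero
    variance and is reported as infinity.
    """
    ordered = sorted(values)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    var = pvariance(gaps)
    return math.inf if var == 0 else 1.0 / var


low_entropy = entropy(["a", "a", "a", "a"])    # a single repeated value
high_entropy = entropy(["a", "b", "c", "d"])   # four equally likely values
regular = periodicity([1, 2, 3, 4])            # evenly spaced samples
irregular = periodicity([0, 1, 3])             # gaps of 1 and 2
print(low_entropy, high_entropy, regular, irregular)
```

A column with low entropy on the x axis piles points onto a few positions, which is why it makes a poor scatter plot but a reasonable histogram.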
Voyager
Kanit “Ham” Wongsuphasawat Dominik Moritz
InfoVis 15
Within the first few queries, you’ve
touched all the tables.
SIGMOD 2016
Shrainik Jain
http://uwescience.github.io/sqlshare/

Más contenido relacionado

La actualidad más candente

Big Data in NATO and Your Role
Big Data in NATO and Your RoleBig Data in NATO and Your Role
Big Data in NATO and Your Role
Jay Gendron
 

La actualidad más candente (20)

Science Data, Responsibly
Science Data, ResponsiblyScience Data, Responsibly
Science Data, Responsibly
 
Big Data Talent in Academic and Industry R&D
Big Data Talent in Academic and Industry R&DBig Data Talent in Academic and Industry R&D
Big Data Talent in Academic and Industry R&D
 
Urban Data Science at UW
Urban Data Science at UWUrban Data Science at UW
Urban Data Science at UW
 
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
 
From Text to Data to the World: The Future of Knowledge Graphs
From Text to Data to the World: The Future of Knowledge GraphsFrom Text to Data to the World: The Future of Knowledge Graphs
From Text to Data to the World: The Future of Knowledge Graphs
 
Data Science: Not Just For Big Data
Data Science: Not Just For Big DataData Science: Not Just For Big Data
Data Science: Not Just For Big Data
 
Thoughts on Knowledge Graphs & Deeper Provenance
Thoughts on Knowledge Graphs  & Deeper ProvenanceThoughts on Knowledge Graphs  & Deeper Provenance
Thoughts on Knowledge Graphs & Deeper Provenance
 
Big Data in NATO and Your Role
Big Data in NATO and Your RoleBig Data in NATO and Your Role
Big Data in NATO and Your Role
 
Machines are people too
Machines are people tooMachines are people too
Machines are people too
 
International Collaboration Networks in the Emerging (Big) Data Science
International Collaboration Networks in the Emerging (Big) Data ScienceInternational Collaboration Networks in the Emerging (Big) Data Science
International Collaboration Networks in the Emerging (Big) Data Science
 
The Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture DataThe Roots: Linked data and the foundations of successful Agriculture Data
The Roots: Linked data and the foundations of successful Agriculture Data
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
 
Towards research data knowledge graphs
Towards research data knowledge graphsTowards research data knowledge graphs
Towards research data knowledge graphs
 
The Challenge of Deeper Knowledge Graphs for Science
The Challenge of Deeper Knowledge Graphs for ScienceThe Challenge of Deeper Knowledge Graphs for Science
The Challenge of Deeper Knowledge Graphs for Science
 
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
Biomedical Clusters, Clouds and Commons - DePaul Colloquium Oct 24, 2014
 
Tutorial Data Management and workflows
Tutorial Data Management and workflowsTutorial Data Management and workflows
Tutorial Data Management and workflows
 
Data science presentation
Data science presentationData science presentation
Data science presentation
 
Minimal viable-datareuse-czi
Minimal viable-datareuse-cziMinimal viable-datareuse-czi
Minimal viable-datareuse-czi
 
Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)Big Data, The Community and The Commons (May 12, 2014)
Big Data, The Community and The Commons (May 12, 2014)
 
An Ecosystem for Linked Humanities Data
An Ecosystem for Linked Humanities DataAn Ecosystem for Linked Humanities Data
An Ecosystem for Linked Humanities Data
 

Destacado

Yo puedo ser escritor
Yo puedo ser escritorYo puedo ser escritor
Yo puedo ser escritor
dec-admin
 
338.mejores alumnos, por un méxico mejor
338.mejores alumnos,  por un méxico mejor338.mejores alumnos,  por un méxico mejor
338.mejores alumnos, por un méxico mejor
dec-admin
 
190.pitufos verdes
190.pitufos verdes190.pitufos verdes
190.pitufos verdes
dec-admin
 
353.el reciclaje
353.el reciclaje353.el reciclaje
353.el reciclaje
dec-admin
 
55. por una alimentación balanceada
55. por una alimentación balanceada55. por una alimentación balanceada
55. por una alimentación balanceada
dec-admin
 
432. la contaminación y sus consecuencias en el entorno
432. la contaminación y sus consecuencias en el entorno432. la contaminación y sus consecuencias en el entorno
432. la contaminación y sus consecuencias en el entorno
dec-admin
 
Grupo No. 4 de Exposision de Habilitación Modulo II
Grupo No. 4 de Exposision de Habilitación Modulo II Grupo No. 4 de Exposision de Habilitación Modulo II
Grupo No. 4 de Exposision de Habilitación Modulo II
drzaberkis1
 
Nada se tira todo se transforma
Nada se tira todo se transformaNada se tira todo se transforma
Nada se tira todo se transforma
dec-admin
 
CV-Jainak-10.08.2016
CV-Jainak-10.08.2016CV-Jainak-10.08.2016
CV-Jainak-10.08.2016
Atul Jain
 
414. cambio climático
414. cambio climático414. cambio climático
414. cambio climático
dec-admin
 
Proyecto el parque
Proyecto el parqueProyecto el parque
Proyecto el parque
dec-admin
 
Para cambiar solo tienes que hacerlo
Para cambiar solo tienes que hacerloPara cambiar solo tienes que hacerlo
Para cambiar solo tienes que hacerlo
dec-admin
 

Destacado (16)

Yo puedo ser escritor
Yo puedo ser escritorYo puedo ser escritor
Yo puedo ser escritor
 
338.mejores alumnos, por un méxico mejor
338.mejores alumnos,  por un méxico mejor338.mejores alumnos,  por un méxico mejor
338.mejores alumnos, por un méxico mejor
 
190.pitufos verdes
190.pitufos verdes190.pitufos verdes
190.pitufos verdes
 
353.el reciclaje
353.el reciclaje353.el reciclaje
353.el reciclaje
 
55. por una alimentación balanceada
55. por una alimentación balanceada55. por una alimentación balanceada
55. por una alimentación balanceada
 
Enumerated data types
Enumerated data typesEnumerated data types
Enumerated data types
 
432. la contaminación y sus consecuencias en el entorno
432. la contaminación y sus consecuencias en el entorno432. la contaminación y sus consecuencias en el entorno
432. la contaminación y sus consecuencias en el entorno
 
Grupo No. 4 de Exposision de Habilitación Modulo II
Grupo No. 4 de Exposision de Habilitación Modulo II Grupo No. 4 de Exposision de Habilitación Modulo II
Grupo No. 4 de Exposision de Habilitación Modulo II
 
Nada se tira todo se transforma
Nada se tira todo se transformaNada se tira todo se transforma
Nada se tira todo se transforma
 
CV-Jainak-10.08.2016
CV-Jainak-10.08.2016CV-Jainak-10.08.2016
CV-Jainak-10.08.2016
 
414. cambio climático
414. cambio climático414. cambio climático
414. cambio climático
 
Trabajo practico n° 3
Trabajo practico n° 3Trabajo practico n° 3
Trabajo practico n° 3
 
Educación (presentación inf1032)
Educación (presentación inf1032)Educación (presentación inf1032)
Educación (presentación inf1032)
 
Proyecto el parque
Proyecto el parqueProyecto el parque
Proyecto el parque
 
Disena1 hm
Disena1 hmDisena1 hm
Disena1 hm
 
Para cambiar solo tienes que hacerlo
Para cambiar solo tienes que hacerloPara cambiar solo tienes que hacerlo
Para cambiar solo tienes que hacerlo
 

Similar a Democratizing Data Science in the Cloud

An Overview of VIEW
An Overview of VIEWAn Overview of VIEW
An Overview of VIEW
Shiyong Lu
 
Presentation_BigData_NenaMarin
Presentation_BigData_NenaMarinPresentation_BigData_NenaMarin
Presentation_BigData_NenaMarin
n5712036
 
The “Big Data” Ecosystem at LinkedIn
The “Big Data” Ecosystem at LinkedInThe “Big Data” Ecosystem at LinkedIn
The “Big Data” Ecosystem at LinkedIn
Kun Le
 
Towords a cloud computing research agendapdf
Towords a cloud computing research agendapdfTowords a cloud computing research agendapdf
Towords a cloud computing research agendapdf
hajlaoui jaleleddine
 
Towords a cloud computing research agendapdf
Towords a cloud computing research agendapdfTowords a cloud computing research agendapdf
Towords a cloud computing research agendapdf
hajlaoui jaleleddine
 
Cdocumentsandsettingsuser1desktop2 dbmsexamples-091012013049-phpapp01
Cdocumentsandsettingsuser1desktop2 dbmsexamples-091012013049-phpapp01Cdocumentsandsettingsuser1desktop2 dbmsexamples-091012013049-phpapp01
Cdocumentsandsettingsuser1desktop2 dbmsexamples-091012013049-phpapp01
Raza Baloch
 

Similar a Democratizing Data Science in the Cloud (20)

BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...
BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...
BUILDING BETTER PREDICTIVE MODELS WITH COGNITIVE ASSISTANCE IN A DATA SCIENCE...
 
Poster
PosterPoster
Poster
 
An Overview of VIEW
An Overview of VIEWAn Overview of VIEW
An Overview of VIEW
 
SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail Science
SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail ScienceSQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail Science
SQL is Dead; Long Live SQL: Lightweight Query Services for Long Tail Science
 
A New Partnership for Cross-Scale, Cross-Domain eScience
A New Partnership for Cross-Scale, Cross-Domain eScienceA New Partnership for Cross-Scale, Cross-Domain eScience
A New Partnership for Cross-Scale, Cross-Domain eScience
 
Presentation_BigData_NenaMarin
Presentation_BigData_NenaMarinPresentation_BigData_NenaMarin
Presentation_BigData_NenaMarin
 
The “Big Data” Ecosystem at LinkedIn
The “Big Data” Ecosystem at LinkedInThe “Big Data” Ecosystem at LinkedIn
The “Big Data” Ecosystem at LinkedIn
 
The "Big Data" Ecosystem at LinkedIn
The "Big Data" Ecosystem at LinkedInThe "Big Data" Ecosystem at LinkedIn
The "Big Data" Ecosystem at LinkedIn
 
Towords a cloud computing research agendapdf
Towords a cloud computing research agendapdfTowords a cloud computing research agendapdf
Towords a cloud computing research agendapdf
 
Towords a cloud computing research agendapdf
Towords a cloud computing research agendapdfTowords a cloud computing research agendapdf
Towords a cloud computing research agendapdf
 
Measurement and modeling of the web and related data sets
Measurement and modeling of the web and related data setsMeasurement and modeling of the web and related data sets
Measurement and modeling of the web and related data sets
 
Microsoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMicrosoft Azure Big Data Analytics
Microsoft Azure Big Data Analytics
 
MSBI and Data WareHouse techniques by Quontra
MSBI and Data WareHouse techniques by Quontra MSBI and Data WareHouse techniques by Quontra
MSBI and Data WareHouse techniques by Quontra
 
Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22
 
The Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
The Enterprise Guide to Building a Data Mesh - Introducing SpecMeshThe Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
The Enterprise Guide to Building a Data Mesh - Introducing SpecMesh
 
DBMS an Example
DBMS an ExampleDBMS an Example
DBMS an Example
 
Cdocumentsandsettingsuser1desktop2 dbmsexamples-091012013049-phpapp01
Cdocumentsandsettingsuser1desktop2 dbmsexamples-091012013049-phpapp01Cdocumentsandsettingsuser1desktop2 dbmsexamples-091012013049-phpapp01
Cdocumentsandsettingsuser1desktop2 dbmsexamples-091012013049-phpapp01
 
SC10 project slides
SC10 project slidesSC10 project slides
SC10 project slides
 
Big Data Session 1.pptx
Big Data Session 1.pptxBig Data Session 1.pptx
Big Data Session 1.pptx
 
Introduction to D3.js
Introduction to D3.jsIntroduction to D3.js
Introduction to D3.js
 

Más de University of Washington

Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)
University of Washington
 
Enabling Collaborative Research Data Management with SQLShare
Enabling Collaborative Research Data Management with SQLShareEnabling Collaborative Research Data Management with SQLShare
Enabling Collaborative Research Data Management with SQLShare
University of Washington
 

Más de University of Washington (17)

Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)
 
Data Responsibly: The next decade of data science

Democratizing Data Science in the Cloud

  • 1. Democratizing Data Science in the Cloud Bill Howe, Ph.D. Associate Director and Senior Data Science Fellow, eScience Institute Affiliate Associate Professor, Computer Science & Engineering 11/1/2016 Bill Howe, UW 1
  • 2. Cloud Data Management is about sharing resources between tenants. We’re interested in new services powered by sharing more than infrastructure: schema, data, queries.
  • 3. Why? Example: JBOT* Open Data systems, e.g., Google Fusion Tables. Entrepreneurship. 1) “Data once guarded for assumed but untested reasons is now open, and we're seeing benefits.” -- Nigel Shadbolt, Open Data Institute. 2) Need to help “non-specialists within an organization use data that had been the realm of programmers and DB admins” -- Benjamin Romano, Xconomy. “Businesses are now using data the way scientists always have” -- Jeff Hammerbacher, Mt. Sinai, formerly Cloudera. *Just a Bunch of Tables
  • 4. Data, data, data. Kevin Merritt, CEO, Socrata; Deep Dhillon, CTO, Socrata.
  • 5. Q Q Q …. Control Plane / Infrastructure Data Plane / Database sys. Application / schema, data, query logs
  • 6. Q Q Q …. Virtualization. Benefits: significantly reduced management overhead. Challenges: security, scheduling, SLAs, isolation. Control Plane / Infrastructure; Data Plane / Database sys.; Application / schema, data, query logs.
  • 7. Q Q Q …. DB-as-a-Service. Benefits: significantly reduced management overhead. Challenges: security, scheduling, SLAs, isolation. Control Plane / Infrastructure; Data Plane / Database sys.; Application / schema, data, query logs.
  • 8. Q Q Q …. JBOT* Query-as-a-Service Systems. Goal: smart cross-tenant services, trained on everyone’s data. • Metadata inference and data curation • Query recommendation via common idioms • Data discovery, e.g., “find me things to join with” • Visualization recommendation • Semi-automatic integration services. Control Plane / Infrastructure; Data Plane / Database sys.; Application / schema, data, query logs. *Just a Bunch of Tables
  • 9. Example Service: Automated Data Curation. Microarray samples submitted to the Gene Expression Omnibus. Curation is fast becoming the bottleneck to data sharing. Maxim Gretchkin, Hoifung Poon
  • 10. Example Service: Automated Data Curation Maxim Gretchkin Hoifung Poon Goal: Repair metadata for genetic datasets using the content of the data, the structure of an associated ontology, the abstract of the paper, and everything else. Deep Neural Network Tissue Type Labels Innovations in transfer learning, poor training data, etc. Paper Abstract
  • 11. Example Service: Automated Data Curation. Maxim Gretchkin, Hoifung Poon. Iterative co-learning between a text-based classifier and an expression-based classifier: both models improve by training on each other’s results.
  • 12. Some Cloud Data Systems • SQLShare: Query-as-a-Service • VizDeck: Visualization recommendation • Myria: Big Data Ecosystems
  • 13. 1) Upload data “as is”: cloud-hosted, secure; no need to install or design a database; no pre-defined schema; schema inference; some integration. 2) Write queries: right in your browser, writing views on top of views on top of views ... SELECT hit, COUNT(*) AS cnt FROM tigrfam_surface GROUP BY hit ORDER BY cnt DESC 3) Share the results: make them public, tag them, share with specific colleagues; anyone with access can query. http://sqlshare.escience.washington.edu
  • 14. http://sqlshare.escience.washington.edu
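The upload-then-query workflow on this slide can be illustrated with a small sketch. It is not SQLShare's implementation: it uses Python's built-in `sqlite3` as a stand-in database, and the helper names `infer_type` and `load_as_is`, the naive type-inference rule, and the sample column `score` are all assumptions made for illustration.

```python
import csv
import io
import sqlite3


def infer_type(values):
    """Guess a column type from its string values (a deliberately naive rule)."""
    for caster, name in ((int, "INTEGER"), (float, "REAL")):
        try:
            for v in values:
                caster(v)
            return name
        except ValueError:
            continue
    return "TEXT"


def load_as_is(conn, table, csv_text):
    """Create a table from CSV text with no schema declared up front."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    types = [infer_type([r[i] for r in data]) for i in range(len(header))]
    cols = ", ".join(f'"{h}" {t}' for h, t in zip(header, types))
    conn.execute(f'CREATE TABLE "{table}" ({cols})')
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', data)
    return dict(zip(header, types))


def build_views(conn):
    """Stack a view on top of a view, as in step 2 of the slide."""
    conn.execute(
        "CREATE VIEW hit_counts AS "
        "SELECT hit, COUNT(*) AS cnt FROM tigrfam_surface GROUP BY hit"
    )
    conn.execute(
        "CREATE VIEW repeated_hits AS "
        "SELECT hit, cnt FROM hit_counts WHERE cnt > 1 ORDER BY cnt DESC"
    )
```

Calling `load_as_is(conn, "tigrfam_surface", text)` followed by `build_views(conn)` gives a queryable `repeated_hits` view without anyone having designed a schema first.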
  • 16. Non-programmers can write very complex queries (rather than relying on staff programmers). We see thousands of queries written by non-programmers. Example: computing the overlaps of two sets of BLAST results: SELECT x.strain, x.chr, x.region as snp_region, x.start_bp as snp_start_bp , x.end_bp as snp_end_bp, w.start_bp as nc_start_bp, w.end_bp as nc_end_bp , w.category as nc_category , CASE WHEN (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp) THEN x.end_bp - x.start_bp + 1 WHEN (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp) THEN x.end_bp - w.start_bp + 1 WHEN (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp) THEN w.end_bp - x.start_bp + 1 END AS len_overlap FROM [koesterj@washington.edu].[hotspots_deserts.tab] x INNER JOIN [koesterj@washington.edu].[table_noncoding_positions.tab] w ON x.chr = w.chr WHERE (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp) OR (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp) OR (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp) ORDER BY x.strain, x.chr ASC, x.start_bp ASC
  • 17. The SQLShare Corpus: A multi-year log of hand-written analytics queries. Queries: 24,275; Views: 4,535; Tables: 3,891; Users: 591. SIGMOD 2016, Shrainik Jain. https://uwescience.github.io/sqlshare
  • 19. Latent Idioms for Schema-Independent Query Recommendation. Background on Word2Vec, GloVe: map each term in a corpus to a vector in a high-dimensional space based on its co-occurrences. Linear relationships between these vectors appear to capture remarkable semantic properties.
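The interval arithmetic inside the CASE expression can be written compactly outside SQL. The sketch below is an illustration only: the function name `overlap_length` is hypothetical, and it uses the standard max/min formulation for closed intervals rather than reproducing the query's three branches. One difference worth noting: when w lies strictly inside x, this version returns the length of w, whereas the second CASE branch in the query would return `x.end_bp - w.start_bp + 1`.

```python
def overlap_length(x_start, x_end, w_start, w_end):
    """Length in base pairs of the overlap of two closed intervals, 0 if disjoint."""
    lo = max(x_start, w_start)   # overlap starts at the later of the two starts
    hi = min(x_end, w_end)       # overlap ends at the earlier of the two ends
    return max(0, hi - lo + 1)   # +1 because both endpoints are inclusive
```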
  • 20. : SELECT COUNT(*) FROM [candrzejowiec@yahoo.com].[table_Firearms.txt] SELECT COUNT (HiLo) FROM [roula.cardaras@gmail.com].[table_MUK.csv] SELECT count(*) FROM [leslie@westerncatholic.org].[Depth_combined] select count(Wave_Height) from [christa.kohnert@gmail.com].[Join] SELECT count(*) FROM [wenjunh@washington.edu].[ecoli_nogaps_1.csv] SELECT Count(*) FROM [latcron@gmail.com].[TargetTrackFeatures.csv] SELECT count(*) FROM [billhowe].[sunrise sunset times 2009 - 2011] SELECT Count(*) FROM [bifxcore@gmail.com].[table_ec_pdb_genus.csv] SELECT count(*) FROM [whitead@washington.edu].[ecoli_nogaps_1.csv] SELECT COUNT(*) FROM [ribalet@washington.edu].[Tokyo_0_merged.csv] SELECT COUNT(*) FROM [dhalperi@washington.edu].[SPID_GOnumber.txt] SELECT COUNT (species) FROM [bigbananatopdog@gmail.com].[Orthosia] SELECT COUNT (species) FROM [bigbananatopdog@gmail.com].[Leucania] : Apply the same trick to the SQLShare corpus, cluster the results A not-very-interesting cluster: Latent SQL Idioms
  • 21. : SELECT COUNT(*) FROM [ajw123@washington.edu].[table_proteins.csv] WHERE species LIKE 'Homo sapiens SELECT count (*) FROM [1029880@apps.nsd.org].[Task 5] where (Hashtags_In_Text) Like '%Phailin%' SELECT count (*) FROM [1029880@apps.nsd.org].[Task 5] where (Hashtags_In_Text) Like '%Phailin%' SELECT Count (*) FROM [kzoehayes@gmail.com].[Dated_Join] WHERE Category = 'Warm' SELECT COUNT (*) FROM [ethanknight08@gmail.com].[table_PopulationV2.txt] WHERE Column1='Country' SELECT COUNT(*) FROM [missmelupton@gmail.com].[table_pHWaterTemp] WHERE TempCategory='normal' SELECT COUNT(*) FROM [1004387@apps.nsd.org].[no retweete] WHERE hashtags_in_text LIKE '%#odisha : Another not-very-interesting cluster: We see other clusters that seem to capture more basics: “union,” “group by with one grouping column,” “left outer join,” “string manipulation,” etc. Latent SQL Idioms
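To see why queries like those in the two clusters above end up grouped together regardless of schema, here is a much simpler stand-in for the embedding-based approach: strip the schema-specific tokens and compare what is left. This is not the Word2Vec/GloVe method described in the talk; the function name `idiom_template`, the placeholders `<id>` and `<str>`, and the regular expressions are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def idiom_template(sql):
    """Reduce a query to a schema-independent template."""
    s = re.sub(r"'[^']*'", "<str>", sql)        # string literals
    s = re.sub(r"\[[^\]]*\]", "<id>", s)        # bracketed identifiers
    s = re.sub(r"<id>(\.<id>)+", "<id>", s)     # [owner].[table] becomes one token
    s = re.sub(r"\s+", " ", s).strip().lower()  # whitespace and case
    s = re.sub(r"\s*\(\s*", "(", s)             # "count (" becomes "count("
    s = re.sub(r"\s*\)", ")", s)
    return s


def group_by_idiom(queries):
    """Group queries that share a template."""
    groups = defaultdict(list)
    for q in queries:
        groups[idiom_template(q)].append(q)
    return dict(groups)
```

Exact-match templates only recover the most basic idioms; the appeal of the learned embeddings is that they can also place near-identical queries close together.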
  • 22. Latent SQL Idioms More interesting examples: select floor(latitude/0.7)*0.7 as latbin , floor(longitude/0.7)*0.7 as lonbin , species FROM [koenigk92@gmail.com].[All3col] select distinct case when patindex('%[0-9]%', [protein]) = 1 -- first char is number and charindex(',', [protein]) = 0 -- and no comma present then [protein] else substring([protein], patindex('%[0-9]%', [protein]), charindex(',', [protein])-patindex('%[0-9]%', [protein])) end as [protein d1124], [tot indep spectra] as [tot spectra d1124] from [emmats@washington.edu].[d1_file124.txt] Parsing a common bioinformatics file format Expressions for binning space and time columns
  • 23. MYRIA: BIG DATA POLYSTORES
  • 24. Q Q Q …. Control Plane / Infrastructure Data Plane / Database sys. Application / schema, data, query logs
  • 25. Q Q Q …. Polystore Ecosystems: “Software Defined Databases” Data Plane / Database sys. Application / schema, data, query logs RDBMS HPC / Linear Algebra Graphs
  • 29. Spark Accumulo CombBLAS GraphX Parallel Algebra Logical Algebra RACO Relational Algebra COmpiler CombBLAS API Spark API Accumulo Graph API rewrite rules Array Algebra MyriaL Services: visualization, logging, discovery, history, browsing Orchestration https://github.com/uwescience/raco
  • 30.
  • 32. Ollie Lo, Los Alamos National Lab
  • 33. Sigma-clipping, V0: CurGood = SCAN(public:adhoc:sc_points); DO mean = [FROM CurGood EMIT val=AVG(v)]; std = [FROM CurGood EMIT val=STDEV(v)]; NewBad = [FROM CurGood WHERE ABS(CurGood.v - mean) > 2 * std EMIT *]; CurGood = CurGood - NewBad; continue = [FROM NewBad EMIT COUNT(NewBad.v) > 0]; WHILE continue; DUMP(CurGood);
  • 34. Sigma-clipping, V1 (incremental): CurGood = P; sum = [FROM CurGood EMIT SUM(val)]; sumsq = [FROM CurGood EMIT SUM(val*val)]; cnt = [FROM CurGood EMIT CNT(*)]; NewBad = []; DO sum = sum - [FROM NewBad EMIT SUM(val)]; sumsq = sumsq - [FROM NewBad EMIT SUM(val*val)]; cnt = cnt - [FROM NewBad EMIT CNT(*)]; mean = sum / cnt; std = sqrt(1/(cnt*(cnt-1)) * (cnt * sumsq - sum*sum)); NewBad = FILTER([ABS(val-mean)>std], CurGood); CurGood = CurGood - NewBad; WHILE NewBad != {}
  • 35. Sigma-clipping, V2: Points = SCAN(public:adhoc:sc_points); aggs = [FROM Points EMIT _sum=SUM(v), sumsq=SUM(v*v), cnt=COUNT(v)]; newBad = [] bounds = [FROM Points EMIT lower=MIN(v), upper=MAX(v)]; DO new_aggs = [FROM newBad EMIT _sum=SUM(v), sumsq=SUM(v*v), cnt=COUNT(v)]; aggs = [FROM aggs, new_aggs EMIT _sum=aggs._sum - new_aggs._sum, sumsq=aggs.sumsq - new_aggs.sumsq, cnt=aggs.cnt - new_aggs.cnt]; stats = [FROM aggs EMIT mean=_sum/cnt, std=SQRT(1.0/(cnt*(cnt-1)) * (cnt * sumsq - _sum * _sum))]; newBounds = [FROM stats EMIT lower=mean - 2 * std, upper=mean + 2 * std]; tooLow = [FROM Points, bounds, newBounds WHERE newBounds.lower > v AND v >= bounds.lower EMIT v=Points.v]; tooHigh = [FROM Points, bounds, newBounds WHERE newBounds.upper < v AND v <= bounds.upper EMIT v=Points.v]; newBad = UNIONALL(tooLow, tooHigh); bounds = newBounds; continue = [FROM newBad EMIT COUNT(v) > 0]; WHILE continue; output = [FROM Points, bounds WHERE Points.v > bounds.lower AND Points.v < bounds.upper EMIT v=Points.v]; DUMP(output);
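For readers unfamiliar with MyriaL, the V0 loop can be sketched in plain Python. This is an illustration of the same iterative idea, not Myria code: the function name `sigma_clip` and the parameter `k` (the clipping threshold in standard deviations, 2 on the slide) are assumptions, and the sample standard deviation is used as the counterpart of `STDEV`.

```python
import statistics


def sigma_clip(values, k=2.0):
    """Repeatedly drop points more than k standard deviations from the mean."""
    good = list(values)
    while len(good) >= 2:                      # stdev needs at least two points
        mean = statistics.fmean(good)
        std = statistics.stdev(good)           # recomputed from scratch each pass
        kept = [v for v in good if abs(v - mean) <= k * std]
        if len(kept) == len(good):             # nothing newly bad: converged
            break
        good = kept
    return good
```

Each pass rescans every surviving point to recompute the mean and standard deviation, which is the cost the later versions of the slide set out to avoid.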
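The incremental idea behind V1 and V2 is to keep running aggregates (sum, sum of squares, count) and subtract the contribution of newly rejected points instead of recomputing statistics over all survivors. A Python sketch under stated assumptions follows: the name `sigma_clip_incremental` and the parameter `k` are hypothetical, and unlike the slide, which assigns `bounds = newBounds` directly, this version only ever tightens the bounds so that the surviving set is always exactly the points inside them.

```python
import math


def sigma_clip_incremental(values, k=2.0):
    """Sigma-clipping that updates sum, sum of squares, and count incrementally."""
    points = list(values)
    if len(points) < 2:
        return points
    total = sum(points)
    total_sq = sum(v * v for v in points)
    count = len(points)
    lower, upper = min(points), max(points)    # inclusive bounds of surviving points
    while count >= 2:
        mean = total / count
        # Sample variance from the running aggregates, as in the slide's formula.
        var = (count * total_sq - total * total) / (count * (count - 1))
        std = math.sqrt(max(var, 0.0))         # guard against tiny negative round-off
        new_lower, new_upper = mean - k * std, mean + k * std
        # Only points between the old and new bounds can be newly bad.
        newly_bad = [v for v in points
                     if lower <= v < new_lower or new_upper < v <= upper]
        if not newly_bad:
            break
        total -= sum(newly_bad)
        total_sq -= sum(v * v for v in newly_bad)
        count -= len(newly_bad)
        lower, upper = max(lower, new_lower), min(upper, new_upper)
    return [v for v in points if lower <= v <= upper]
```

In a database setting the scan for newly bad points is a range selection, so each iteration touches only the band between the old and new bounds rather than the whole table.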
  • 36. Dominik Moritz EuroVis 15 Empower the end user to do performance profiling, debugging, etc.
  • 38. Some ongoing work • “From scratch” polystore optimizer: Columbia-style, with some ideas from the PL community • Anecdotal Optimization: infer optimization decisions based on coarse-grained experimental results from unreliable sources (blogs, literature), e.g., “System X is 2X faster than System Y on PageRank” • Benchmarking Linear Algebra Systems vs. Databases: the HPC community thinks they are 1000X faster; they aren’t. The DB community thinks they are competitive; they aren’t • Query compilation: bridge the gap between MPI and DB • New query language Kamooks blending arrays and relations
  • 40.
  • 42. Query compilation for distributed processing pipeline as parallel code parallel compiler machine code [Myers ’14] pipeline fragment code pipeline fragment code sequential compiler machine code [Crotty ’14, Li ’14, Seo ’14, Murray ‘11] sequential compiler
  • 44. 1% selection microbenchmark, 20GB. Avoid long code paths. ICS 16, Brandon Myers
  • 45. Q2 SP2Bench, 100M triples, multiple self-joins. Communication optimization. ICS 16, Brandon Myers
  • 46. Graph Patterns • SP2Bench, 100 million triples • Queries compiled to a PGAS C++ language layer, then compiled again by a low-level PGAS compiler • One of Myria’s supported back ends • Comparison with Shark/Spark, which itself has been shown to be 100X faster than Hadoop-based systems • …plus PageRank, Naïve Bayes, and more. RADISH, ICS 16, Brandon Myers
  • 47. ICS 15 RADISH ICS 16 Brandon Myers TPC-H
  • 48. Some ongoing work • “From scratch” polystore optimizer: Columbia-style, with some ideas from the PL community • Anecdotal Optimization: infer optimization decisions based on coarse-grained experimental results from unreliable sources (blogs, literature), e.g., “System X is 2X faster than System Y on PageRank” • Benchmarking Linear Algebra Systems vs. Databases: the HPC community thinks they are 1000X faster; they aren’t. The DB community thinks they are competitive; they aren’t • Query compilation: “Software-defined Databases”; bridge the gap between MPI and DB • New query language Kamooks blending arrays and relations
  • 49. select A.i, B.k, sum(A.val*B.val) from A, B where A.j = B.j group by A.i, B.k Matrix multiply in RA Matrix multiply
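The relational form of matrix multiply on this slide can be run as-is on any SQL engine. The sketch below uses Python's built-in `sqlite3` purely as a convenient stand-in (it is not one of the systems benchmarked in the talk); the function name `sparse_matmul` and the coordinate-list input format are assumptions made for illustration.

```python
import sqlite3


def sparse_matmul(a_entries, b_entries):
    """Multiply two sparse matrices stored as (row, col, val) triples using SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE A (i INTEGER, j INTEGER, val REAL)")
    conn.execute("CREATE TABLE B (j INTEGER, k INTEGER, val REAL)")
    conn.executemany("INSERT INTO A VALUES (?, ?, ?)", a_entries)
    conn.executemany("INSERT INTO B VALUES (?, ?, ?)", b_entries)
    # Join on the shared dimension j, then sum the products per output cell (i, k).
    rows = conn.execute(
        "SELECT A.i, B.k, SUM(A.val * B.val) "
        "FROM A, B WHERE A.j = B.j "
        "GROUP BY A.i, B.k ORDER BY A.i, B.k"
    ).fetchall()
    conn.close()
    return rows
```

Only non-zero entries are stored, so the join never touches zero cells; that is the sense in which the query is a naïve sparse algorithm.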
  • 50. Complexity of matrix multiply (slide adapted from Zwick; R. Yuster and U. Zwick, Fast Sparse Matrix Multiplication). n = number of rows, m = number of non-zeros. Plot of complexity exponent against sparsity exponent (r s.t. m = n^r): naïve sparse algorithm, mn; best known sparse algorithm, m^0.7 n^1.2 + n^2; best known dense algorithm, n^2.38. Lots of room here.
  • 51. BLAS vs. SpBLAS vs. SQL (10k) off the shelf database 15X
  • 52. 20k X 20k matrix multiply by sparsity. CombBLAS, MyriaX, Radish
  • 53. 50k X 50k matrix multiply by sparsity. CombBLAS, MyriaX, Radish. Filter to upper left corner of result matrix
  • 54. A x B x C: take three sparse matrices A(i, j, val), B(j, k, val), C(k, m, val). Nested pairwise form: select AB.i, C.m, sum(AB.val*C.val) from (select A.i, B.k, sum(A.val*B.val) as val from A, B where A.j = B.j group by A.i, B.k ) AB, C where AB.k = C.k group by AB.i, C.m; single multiway form: select A.i, C.m, sum(A.val*B.val*C.val) from A, B, C where A.j = B.j and B.k = C.k group by A.i, C.m; now compute multiway hypercube join: O (|A|/p + |B|/p^2 + |C|/p). Group by: ~O (N). But wait, there’s more…..
  • 55. Task: self-multiply with 1M non-zeros. Hypercube shuffle: 2 seconds, balanced. Partitioned hash join: 43 seconds, tons of skew.
  • 56. Seung-Hee Bae: Scalable Graph Clustering. Version 1 Parallelize Best-known Serial Algorithm ICDM 2013 Version 2 Free 30% improvement for any algorithm TKDD 2014 SC 2015 Version 3 Distributed approx. algorithm, 1.5B edges
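That the nested pairwise query and the single three-way join compute the same product can be checked on a small example. The sketch below again uses `sqlite3` as a stand-in engine; the function name `chain_product` is hypothetical, and the inner aggregate is given the alias `val` so that `AB.val` resolves. It says nothing about the hypercube shuffle itself, which is a property of the parallel execution plan and not of the SQL text.

```python
import sqlite3

NESTED = (
    "SELECT AB.i, C.m, SUM(AB.val * C.val) FROM "
    "(SELECT A.i AS i, B.k AS k, SUM(A.val * B.val) AS val "
    " FROM A, B WHERE A.j = B.j GROUP BY A.i, B.k) AB, C "
    "WHERE AB.k = C.k GROUP BY AB.i, C.m ORDER BY AB.i, C.m"
)
MULTIWAY = (
    "SELECT A.i, C.m, SUM(A.val * B.val * C.val) FROM A, B, C "
    "WHERE A.j = B.j AND B.k = C.k GROUP BY A.i, C.m ORDER BY A.i, C.m"
)


def chain_product(a_entries, b_entries, c_entries):
    """Return (nested_result, multiway_result) for A x B x C over sparse triples."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE A (i INTEGER, j INTEGER, val REAL)")
    conn.execute("CREATE TABLE B (j INTEGER, k INTEGER, val REAL)")
    conn.execute("CREATE TABLE C (k INTEGER, m INTEGER, val REAL)")
    conn.executemany("INSERT INTO A VALUES (?, ?, ?)", a_entries)
    conn.executemany("INSERT INTO B VALUES (?, ?, ?)", b_entries)
    conn.executemany("INSERT INTO C VALUES (?, ?, ?)", c_entries)
    nested = conn.execute(NESTED).fetchall()
    multiway = conn.execute(MULTIWAY).fetchall()
    conn.close()
    return nested, multiway
```

The two forms agree because summation distributes over the join; the difference is that the multiway form materializes no intermediate AB relation, which leaves the optimizer free to pick a different shuffle strategy.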
  • 59. “Data Triage” Pipeline. SAS Excel XML CSV SQL Azure Files Tables Views parse / extract “relational analysis” visual analysis Visualizations SIGMOD 11 SSDBM 13 SIGMOD 16 sqlshare.escience.washington.edu CHI 12 SIGMOD 12 iConference 13 SSDBM 11 CiSE 13 SSDBM 15
  • 60.
  • 61.
  • 62.
  • 63. video. [Chart: Task Completion Rate / Time - All Questions, comparing Fusion, VizDeck, ManyEyes, Tableau.] CHI 13
  • 64. Visualization Recommendation • Model each “vizlet” as a triple (x_column, y_column, vizlet_type) • Extract features from each column (f1x, f2x,…, fNx, f1y, f2y, …, fNy, vizlet_type) • Interpret each “promotion” as a yes vote and each “discard” as a no vote • Train a (simple) model to predict vizlet type from features • Recommend highest-scoring vizlets • Add a diversity term to prevent a bunch of similar plots • Incorporate score modifiers defined by the vizlet designer, e.g., “My bar chart looks best when there are about 5 bars.” or “My timeseries plot ignores null values”
  • 65. Example of a Learned Rule (1): low x-entropy => bad scatter plot. [Images: bad scatter plot, good scatter plot]
  • 66. Example of a Learned Rule (2): low x-entropy => histogram. [Images: bad scatter plot, good histogram]
  • 67. Example of a Learned Rule (3): high x-periodicity => timeseries plot (periodicity = 1 / variance in gap length between successive values)
  • 68. Voyager. Kanit “Ham” Wongsuphasawat, Dominik Moritz. InfoVis 15
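The two column features named in these learned rules are easy to compute. The sketch below is an illustration, not VizDeck's code: the function names are hypothetical, entropy is taken as Shannon entropy over distinct values, and the choice of population variance (and of returning infinity for perfectly regular gaps) is an assumption, since the slide only gives the one-line definition of periodicity.

```python
import math
import statistics
from collections import Counter


def column_entropy(values):
    """Shannon entropy (in bits) of the distribution of distinct values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def column_periodicity(values):
    """1 / variance of the gaps between successive sorted values."""
    ordered = sorted(values)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    if not gaps:
        return 0.0                      # fewer than two values: no gaps to measure
    var = statistics.pvariance(gaps)
    return math.inf if var == 0 else 1.0 / var
```

A column of evenly spaced timestamps has zero gap variance and therefore very high periodicity, which matches the rule favoring a timeseries plot; a column with only a few repeated values has low entropy, which matches the rule favoring a histogram over a scatter plot.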
  • 69. Within the first few queries, you’ve touched all the tables. SIGMOD 2016 Shrainik Jain http://uwescience.github.io/sqlshare/

Editor's notes

  1. Let me give you a brief example of a project a little further upstream that the incubation program can provide access to. This work is in the space of open data sharing platforms, along with Socrata here in Seattle, products from Google and Microsoft, and a number of other companies. Two observations motivate the products in this space. First, there’s a movement toward open data that has researchers, government agencies, and even companies exposing their data assets online for use by others for reasons of transparency, efficiency, and accountability. Even for commercial data, there are marketplaces emerging to facilitate the buying and selling of data. All of these use cases need new technology. So that’s one reason. Second, if you’re going to use someone else’s data, you need it to be as accessible as possible. In particular, you need to help data analysts use data that “had previously been the realm of programmers and DB administrators” (here I’m quoting Benjamin Romano from an Xconomy article about Socrata). SQLShare is an open data system, but it emphasizes rich data manipulation rather than just fetch and retrieval, interoperability with external tools and existing databases, local or cloud deployments, and built-in services for data integration, profiling, and visualization. Ginger mentioned this system in her talk; we have maintained a production deployment here on campus for three years focusing on science users. Our observation is that science use cases are a predictor for commercial use cases: businesses are beginning to use data the same way scientists always have. They collect it aggressively, torture it with analytics, and use it to make predictions about the world. So we think if we can handle these difficult science use cases, we will also be addressing a significant commercial problem.
  2. Solutions are emerging, powered by the open data movement. Socrata, a local Seattle company, has built a very successful business of helping cities jailbreak their data, and is now engaged in climbing the application stack to support analytics and visualization. Essentially every URL of the form data.yourcity.gov is powered by Socrata’s technology. Data, People, and Infrastructure.
  3. and you can extend this model to the database layer to help share services like backup, recovery, caching, load balancing
  4. If you go up a level, you have what you might call “query as a service” – you’re querying your own data, but you might query others peoples data as well. And, even if you want to remain logically isolated, you can still benefit from services that are powered by mining everyone’s schema, data, and workload. For example, query recommendation in the past assumed a fixed schema. With this model, you can recommend “idioms” across different schemas. You can discover public datasets to join with, like Alon worked on with Fusion Tables You can recommend visualizations automatically You can automatically infer and attach metadata – semi-automatic data curation. A big globally shared data lake “Precision Medicine for Databases”
  5. So we developed SQLShare to support a very simple workflow. You can upload data “as is” from spreadsheets or anything else. It’s in the cloud, so there is no need to install or design a database. You can immediately begin writing queries, right in your browser, and put queries on top of queries on top of queries. Then you can share the results online: your colleagues can browse the science questions and see the SQL that answers each one. Key ideas to get data in: a) Use the cloud to avoid having to install and run a database. b) Give up on the schema: just throw your data in “as is” and do “lazy integration.” c) Use some magic to automate parsing, integration, recommendations, and more. Key ideas to get data out: a) Associate science questions (in English) with each SQL query, which makes them easy to understand and easy to find. b) Saving and reusing queries is a first-class requirement. Given an example, it’s easy to modify it into an “adjacent” query. c) Expose the whole system through a REST API to make it easy to bring new client applications online.
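The “queries on top of queries” workflow can be sketched with ordinary SQL views. The example below uses Python's built-in `sqlite3` module as a stand-in, not SQLShare itself; the table, column, and view names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical "as is" upload: every column arrives as text, no schema design.
conn.execute("CREATE TABLE samples (station TEXT, depth TEXT, nitrate TEXT)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?, ?)",
    [("A", "5", "1.2"), ("A", "50", "3.4"), ("B", "5", "n/a"), ("B", "50", "2.8")],
)

# First saved query: clean up types lazily, only where this analysis needs them.
conn.execute("""
    CREATE VIEW clean_samples AS
    SELECT station,
           CAST(depth AS INTEGER) AS depth,
           CAST(nitrate AS REAL) AS nitrate
    FROM samples
    WHERE nitrate <> 'n/a'
""")

# Second saved query, written on top of the first one.
conn.execute("""
    CREATE VIEW station_means AS
    SELECT station, AVG(nitrate) AS mean_nitrate
    FROM clean_samples
    GROUP BY station
""")

rows = conn.execute(
    "SELECT station, mean_nitrate FROM station_means ORDER BY station"
).fetchall()
```

The raw upload is never modified. Each view is a named, reusable query that a colleague could copy and adjust into an “adjacent” one.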
  6. There are lots of features you can imagine here. Anything you can do with a YouTube video, you should be able to do with a query: share it, rate it, “more like this”, recommendations. We are exploring some of these.
  7. We see non-programmers who write these wonderful 40-line queries. This one does interval queries on genomic sequences. She doesn’t write any R, any Python, but she can do this, and she’s no longer dependent on staff programmers.
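To give a flavor of what an interval query looks like, here is a much smaller illustration than the 40-line query on the slide. It is a sketch only: the `genes` and `reads` tables and their contents are invented, and it runs on Python's built-in `sqlite3` rather than on SQLShare.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE genes (name TEXT, chrom TEXT, start_pos INTEGER, end_pos INTEGER);
    CREATE TABLE reads (read_id TEXT, chrom TEXT, start_pos INTEGER, end_pos INTEGER);
    INSERT INTO genes VALUES ('geneA', 'chr1', 100, 200), ('geneB', 'chr1', 500, 650);
    INSERT INTO reads VALUES ('r1', 'chr1', 150, 180), ('r2', 'chr1', 190, 260),
                             ('r3', 'chr1', 300, 350), ('r4', 'chr2', 120, 160);
""")

# Two closed intervals on the same chromosome overlap when each one starts
# before the other one ends.
overlaps = conn.execute("""
    SELECT g.name, COUNT(*) AS n_reads
    FROM genes AS g
    JOIN reads AS r
      ON r.chrom = g.chrom
     AND r.start_pos <= g.end_pos
     AND r.end_pos >= g.start_pos
    GROUP BY g.name
    ORDER BY g.name
""").fetchall()
```

Only `geneA` appears in the result, with two overlapping reads; `geneB` has no overlapping reads and drops out of the inner join.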
  8. If you go up a level, you have what you might call “query as a service” – you’re querying your own data, but you might query other people’s data as well. And, even if you want to remain logically isolated, you can still benefit from services that are powered by mining everyone’s schema, data, and workload. For example, query recommendation in the past assumed a fixed schema. With this model, you can recommend “idioms” across different schemas. You can discover public datasets to join with, like Alon worked on with Fusion Tables. You can recommend visualizations automatically. You can automatically infer and attach metadata – semi-automatic data curation. A big, globally shared data lake.
  9. Express these plans. Optimize these plans. Compile these plans. Execute these plans.
  10. Express these plans. Optimize these plans. Compile these plans. Execute these plans.
  11. So our approach is to model this overlap in capabilities as its own language. We start
  12. We hoist
  13. NOTES: What optimizations does this enable? With better semantics on a hash table join with UDFs, we can do redundant computation elimination and code motion from the UDF.
  14. Can you just run this in a database and expect good performance? Of course not. But is it a fundamentally bad idea to run it this way? Maybe not.
  15. This is the complexity of three matrix multiply algorithms plotted against the sparsi – a naïve sparse
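For context, one common way to express a sparse matrix multiply relationally is a join followed by a group-by. Whether this is the exact formulation measured on the slide is an assumption; the sketch below only shows the general technique, using Python's built-in `sqlite3` and invented tables `A` and `B` in coordinate form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Each matrix is stored as (row, column, value) triples; zeros are omitted.
    CREATE TABLE A (i INTEGER, k INTEGER, val REAL);
    CREATE TABLE B (k INTEGER, j INTEGER, val REAL);
    -- A = [[1, 0], [0, 2]], B = [[0, 3], [4, 0]]
    INSERT INTO A VALUES (0, 0, 1.0), (1, 1, 2.0);
    INSERT INTO B VALUES (0, 1, 3.0), (1, 0, 4.0);
""")

# C[i][j] = sum over k of A[i][k] * B[k][j]: the join pairs up matching k,
# and the aggregate sums the products for each output cell.
product = conn.execute("""
    SELECT A.i, B.j, SUM(A.val * B.val) AS val
    FROM A JOIN B ON A.k = B.k
    GROUP BY A.i, B.j
    ORDER BY A.i, B.j
""").fetchall()
```

Only nonzero entries are stored and joined, so the work done depends on how sparse the inputs are rather than on the full matrix dimensions.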
  16. Now let’s do
  17. If you can automate, you can precompute speculatively.
  18. On big data: interactive visualization on the web may or may not be feasible for really big data, but a good model for visualization recommendation can enable speculative generation.
  19. Why do we care about lifetime? Table usage predictions for caching and partitioning. Move from reactive to proactive physical design services. Query idioms are consistent, while the data is fleeting. Not exact queries as in a streaming system, but the “methods” are reused over and over. Extracting and optimizing these idioms across tenants is our goal.
  20. And processing power, either as raw processor speed or via novel multi-core and many-core architectures, is also continuing to increase exponentially…
  21. … but human cognitive capacity is remaining constant. How can computing technologies help scientists make sense out of these vast and complex data sets?
  22. We want to give a little background on our project before we launch into it, so we will discuss the problem we are trying to solve. Essentially, we want to remove the speed bump of data handling for scientists.