Democratizing Data Science in the Cloud
1. Democratizing Data Science
in the Cloud
Bill Howe, Ph.D.
Associate Director and Senior Data Science Fellow, eScience Institute
Affiliate Associate Professor, Computer Science & Engineering
11/1/2016 Bill Howe, UW 1
2.
Cloud Data Management is about
sharing resources between tenants
We’re interested in new services powered by sharing
more than infrastructure – schema, data, queries
3. Why?
Example: JBOT* Open Data systems
Google Fusion Tables
Entrepreneurship
1) “Data once guarded for assumed but untested
reasons is now open, and we're seeing benefits.”
-- Nigel Shadbolt, Open Data Institute
2) Need to help “non-specialists within an
organization use data that had been the
realm of programmers and DB admins”
-- Benjamin Romano, Xconomy
“Businesses are now using data the way
scientists always have”
-- Jeff Hammerbacher
Mt. Sinai, formerly Cloudera
*Just a Bunch of Tables
8.
JBOT* Query-as-a-Service Systems
Goal:
smart cross-tenant services,
trained on everyone’s data
• Metadata inference and data curation
• Query recommendation via common idioms
• Data discovery – e.g., “find me things to join with”
• Visualization recommendation
• Semi-automatic integration services
[Diagram: three layers of sharing]
Control Plane / Infrastructure
Data Plane / Database system
Application / schema, data, query logs
*Just a Bunch of Tables
9. Example Service: Automated Data Curation
Microarray samples submitted to the Gene Expression Omnibus
Curation is fast becoming the
bottleneck to data sharing
Maxim Gretchkin, Hoifung Poon
10. Example Service: Automated Data Curation
Maxim Gretchkin, Hoifung Poon
Goal: Repair metadata for genetic
datasets using the content of the data, the
structure of an associated ontology, the
abstract of the paper, and everything else.
[Diagram: paper abstract → deep neural network → tissue type labels]
Innovations in transfer learning, coping with poor training data, etc.
11. Example Service: Automated Data Curation
Maxim Gretchkin, Hoifung Poon
Iterative co-learning between the text-based classifier and the
expression-based classifier: both models improve by
training on each other's results
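The co-learning loop described above can be sketched generically. Everything here (the `ToyClassifier`, the confidence threshold) is a hypothetical stand-in for the deep text- and expression-based models in the actual work:

```python
from collections import Counter

class ToyClassifier:
    """Stand-in for the text- or expression-based model: predicts the
    majority label seen in training, with confidence = its frequency.
    (Hypothetical; the real system uses deep networks.)"""
    def fit(self, examples):                # examples: [(sample, label), ...]
        counts = Counter(lbl for _, lbl in examples)
        self.majority, n = counts.most_common(1)[0]
        self.conf = n / len(examples)
    def predict(self, sample):
        return self.majority, self.conf

def co_train(clf_text, clf_expr, labeled, unlabeled, rounds=3, threshold=0.6):
    """Each round, confident predictions from one view become new
    training examples for the other view, and vice versa."""
    text_train, expr_train = list(labeled), list(labeled)
    for _ in range(rounds):
        clf_text.fit(text_train)
        clf_expr.fit(expr_train)
        for sample in list(unlabeled):
            lbl_t, conf_t = clf_text.predict(sample)
            lbl_e, conf_e = clf_expr.predict(sample)
            if conf_t >= threshold:         # text model teaches the expression model
                expr_train.append((sample, lbl_t))
                unlabeled.remove(sample)
            elif conf_e >= threshold:       # and the reverse
                text_train.append((sample, lbl_e))
                unlabeled.remove(sample)
    return text_train, expr_train
```

The key property the loop illustrates: each model only ever adds labels to the *other* model's training set, so the two views cross-check each other.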
12. Some Cloud Data Systems
• SQLShare: Query-as-a-Service
• VizDeck: Visualization recommendation
• Myria: Big Data Ecosystems
13. 1) Upload data “as is”
Cloud-hosted, secure; no
need to install or design a
database; no pre-defined
schema; schema inference;
some integration
2) Write Queries
Right in your browser,
writing views on top of
views on top of views ...
SELECT hit, COUNT(*) AS cnt
FROM tigrfam_surface
GROUP BY hit
ORDER BY cnt DESC
3) Share the results
Make them public, tag them,
share with specific colleagues –
anyone with access can query
http://sqlshare.escience.washington.edu
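Step 1's "no pre-defined schema; schema inference" can be illustrated with a minimal type-inference pass over an uploaded CSV. This is a sketch, not SQLShare's production inference:

```python
import csv, io

def infer_type(values):
    """Guess the narrowest SQL type that fits every non-empty value."""
    def ok(cast):
        try:
            for v in values:
                if v != "":
                    cast(v)
            return True
        except ValueError:
            return False
    if ok(int):
        return "INTEGER"
    if ok(float):
        return "REAL"
    return "TEXT"

def infer_schema(csv_text):
    """Map each column of a headered CSV to an inferred SQL type."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    cols = list(zip(*data))   # transpose rows into per-column value lists
    return {name: infer_type(col) for name, col in zip(header, cols)}
```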
16. SELECT x.strain, x.chr, x.region as snp_region, x.start_bp as snp_start_bp
, x.end_bp as snp_end_bp, w.start_bp as nc_start_bp, w.end_bp as nc_end_bp
, w.category as nc_category
, CASE WHEN (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
THEN x.end_bp - x.start_bp + 1
WHEN (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
THEN x.end_bp - w.start_bp + 1
WHEN (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
THEN w.end_bp - x.start_bp + 1
END AS len_overlap
FROM [koesterj@washington.edu].[hotspots_deserts.tab] x
INNER JOIN [koesterj@washington.edu].[table_noncoding_positions.tab] w
ON x.chr = w.chr
WHERE (x.start_bp >= w.start_bp AND x.end_bp <= w.end_bp)
OR (x.start_bp <= w.start_bp AND w.start_bp <= x.end_bp)
OR (x.start_bp <= w.end_bp AND w.end_bp <= x.end_bp)
ORDER BY x.strain, x.chr ASC, x.start_bp ASC
Non-programmers can write very complex queries
(rather than relying on staff programmers)
Example: Computing the overlaps of two sets of blast results
We see thousands of
queries written by
non-programmers
17. The SQLShare Corpus:
A multi-year log of hand-written analytics queries
Queries: 24,275
Views: 4,535
Tables: 3,891
Users: 591
SIGMOD 2016
Shrainik Jain
https://uwescience.github.io/sqlshare
19. Latent Idioms for Schema-Independent Query Recommendation
Background on Word2Vec and GloVe: map each term in a corpus to a vector in a high-dimensional space based on its co-occurrences. Linear relationships between these vectors appear to capture remarkable semantic properties.
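A toy illustration of applying the co-occurrence idea to SQL text. This is a sketch only: the tokenizer is a simplification, and the actual work trains learned embeddings over a much larger corpus rather than using raw counts:

```python
from collections import defaultdict
import math, re

def tokenize(sql):
    # Normalize away table names and string literals so that idioms,
    # not specific schemas, dominate the representation.
    sql = re.sub(r"\[[^\]]*\](\.\[[^\]]*\])?", "TABLE", sql)
    sql = re.sub(r"'[^']*'", "LIT", sql)
    return re.findall(r"[A-Za-z_*]+|\(|\)", sql.upper())

def embed(queries, window=2):
    """Map each token to its co-occurrence count vector over the corpus
    (the counting idea underlying word2vec/GloVe, minus the training)."""
    vecs = defaultdict(lambda: defaultdict(int))
    for q in queries:
        toks = tokenize(q)
        for i, t in enumerate(toks):
            for j in range(max(0, i - window), min(len(toks), i + window + 1)):
                if i != j:
                    vecs[t][toks[j]] += 1
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Clustering queries by the similarity of their token vectors is what surfaces the "idiom" clusters shown on the next slides.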
20.
SELECT COUNT(*) FROM [candrzejowiec@yahoo.com].[table_Firearms.txt]
SELECT COUNT (HiLo) FROM [roula.cardaras@gmail.com].[table_MUK.csv]
SELECT count(*) FROM [leslie@westerncatholic.org].[Depth_combined]
select count(Wave_Height) from [christa.kohnert@gmail.com].[Join]
SELECT count(*) FROM [wenjunh@washington.edu].[ecoli_nogaps_1.csv]
SELECT Count(*) FROM [latcron@gmail.com].[TargetTrackFeatures.csv]
SELECT count(*) FROM [billhowe].[sunrise sunset times 2009 - 2011]
SELECT Count(*) FROM [bifxcore@gmail.com].[table_ec_pdb_genus.csv]
SELECT count(*) FROM [whitead@washington.edu].[ecoli_nogaps_1.csv]
SELECT COUNT(*) FROM [ribalet@washington.edu].[Tokyo_0_merged.csv]
SELECT COUNT(*) FROM [dhalperi@washington.edu].[SPID_GOnumber.txt]
SELECT COUNT (species) FROM [bigbananatopdog@gmail.com].[Orthosia]
SELECT COUNT (species) FROM [bigbananatopdog@gmail.com].[Leucania]
:
Apply the same trick to the SQLShare corpus, cluster the results
A not-very-interesting cluster:
Latent SQL Idioms
21.
SELECT COUNT(*) FROM [ajw123@washington.edu].[table_proteins.csv] WHERE species LIKE 'Homo sapiens
SELECT count (*) FROM [1029880@apps.nsd.org].[Task 5] where (Hashtags_In_Text) Like '%Phailin%'
SELECT count (*) FROM [1029880@apps.nsd.org].[Task 5] where (Hashtags_In_Text) Like '%Phailin%'
SELECT Count (*) FROM [kzoehayes@gmail.com].[Dated_Join] WHERE Category = 'Warm'
SELECT COUNT (*) FROM [ethanknight08@gmail.com].[table_PopulationV2.txt] WHERE Column1='Country'
SELECT COUNT(*) FROM [missmelupton@gmail.com].[table_pHWaterTemp] WHERE TempCategory='normal'
SELECT COUNT(*) FROM [1004387@apps.nsd.org].[no retweete] WHERE hashtags_in_text LIKE '%#odisha
:
Another not-very-interesting cluster:
We see other clusters that seem to capture more basics: “union,”
“group by with one grouping column,” “left outer join,” “string
manipulation,” etc.
Latent SQL Idioms
22. Latent SQL Idioms
More interesting examples:
select floor(latitude/0.7)*0.7 as latbin
, floor(longitude/0.7)*0.7 as lonbin
, species
FROM [koenigk92@gmail.com].[All3col]
select distinct case when patindex('%[0-9]%', [protein]) = 1 -- first char is number
and charindex(',', [protein]) = 0 -- and no comma present
then [protein]
else substring([protein], patindex('%[0-9]%', [protein]),
charindex(',', [protein])-patindex('%[0-9]%', [protein]))
end as [protein d1124],
[tot indep spectra] as [tot spectra d1124]
from [emmats@washington.edu].[d1_file124.txt]
Parsing a common
bioinformatics file format
Expressions for binning
space and time columns
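The binning idiom in the first example above is just "snap a coordinate to the lower edge of a fixed-width bin"; in Python:

```python
import math

def bin_value(v, width):
    """The floor(v/width)*width idiom from the query above: snap a
    spatial or temporal coordinate to the lower edge of its bin."""
    return math.floor(v / width) * width
```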
38. Some ongoing work
• “from scratch” polystore optimizer
– Columbia-style, with some ideas from PL community
• Anecdotal Optimization
– Infer optimization decisions based on coarse-grained experimental
results from unreliable sources (blogs, literature)
– “System X is 2X faster than System Y on PageRank”
• Benchmarking Linear Algebra Systems vs. Databases
– HPC community thinks they are 1000X faster; they aren’t
– DB community thinks they are competitive; they aren’t
• Query compilation
– Bridge the gap between MPI and DB
• New query language Kamooks blending arrays and relations
42. Query compilation for distributed processing
Two compilation strategies:
(a) Compile the whole pipeline as parallel code with a parallel compiler into machine code [Myers '14].
(b) Split the pipeline into fragment code, compiled per fragment by a sequential compiler into machine code [Crotty '14, Li '14, Seo '14, Murray '11].
44.
1% selection microbenchmark, 20GB
Avoid long code paths
ICS '16, Brandon Myers
45.
Q2 SP2Bench, 100M triples, multiple self-joins
Communication optimization
ICS '16, Brandon Myers
46. Graph Patterns
• SP2Bench, 100 million triples
• Queries compiled to a PGAS C++ language layer, then
compiled again by a low-level PGAS compiler
• One of Myria’s supported back ends
• Comparison with Shark/Spark, which itself has been shown to
be 100X faster than Hadoop-based systems
• …plus PageRank, Naïve Bayes, and more
RADISH
ICS '16, Brandon Myers
48. Some ongoing work
• “from scratch” polystore optimizer
– Columbia-style, with some ideas from PL community
• Anecdotal Optimization
– Infer optimization decisions based on coarse-grained experimental
results from unreliable sources (blogs, literature)
– “System X is 2X faster than System Y on PageRank”
• Benchmarking Linear Algebra Systems vs. Databases
– HPC community thinks they are 1000X faster; they aren’t
– DB community thinks they are competitive; they aren’t
• Query compilation
– “Software-defined Databases”
– Bridge the gap between MPI and DB
• New query language Kamooks blending arrays and relations
49. select A.i, B.k, sum(A.val*B.val)
from A, B
where A.j = B.j
group by A.i, B.k
Matrix multiply in relational algebra (RA)
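The equivalence is easy to check on a toy example with an in-memory SQLite database (illustrative only; the systems actually benchmarked are CombBLAS, MyriaX, and Radish):

```python
import sqlite3

# The relational matrix multiply above, run on a toy pair of sparse
# matrices stored as (i, j, val) triples -- the coordinate representation.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (i INT, j INT, val REAL);
    CREATE TABLE B (j INT, k INT, val REAL);
""")
# A = [[1, 2], [0, 3]] and B = [[4, 0], [5, 6]], zeros omitted.
conn.executemany("INSERT INTO A VALUES (?,?,?)",
                 [(0, 0, 1), (0, 1, 2), (1, 1, 3)])
conn.executemany("INSERT INTO B VALUES (?,?,?)",
                 [(0, 0, 4), (1, 0, 5), (1, 1, 6)])

product = conn.execute("""
    SELECT A.i, B.k, SUM(A.val * B.val)
    FROM A, B
    WHERE A.j = B.j
    GROUP BY A.i, B.k
""").fetchall()
# product holds the non-zero triples of A x B
```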
50. Complexity of matrix multiply (slide adapted from R. Yuster and U. Zwick, Fast Sparse Matrix Multiplication), with n = number of rows and m = number of non-zeros, plotted against the sparsity exponent r (such that m = n^r):
• naïve sparse algorithm: mn
• best known sparse algorithm: m^0.7 n^1.2 + n^2
• best known dense algorithm: n^2.38
There is lots of room between the naïve and the best known sparse bounds.
52.
20k X 20k matrix multiply by sparsity
CombBLAS, MyriaX, Radish
53.
50k X 50k matrix multiply by sparsity
CombBLAS, MyriaX, Radish
Filter to upper left corner of result matrix
54. select AB.i, C.m, sum(AB.val*C.val)
from
(select A.i, B.k, sum(A.val*B.val)
from A, B
where A.j = B.j
group by A.i, B.k
) AB,
C
where AB.k = C.k
group by AB.i, C.m
A x B x C
select A.i, C.m, sum(A.val*B.val*C.val)
from A, B, C
where A.j = B.j
and B.k = C.k
group by A.i, C.m
A(i, j, val)
B(j, k, val)
C(k, m, val)
Take these three sparse matrices and compute the multiway hypercube join: per-worker cost O(|A|/p + |B|/p^2 + |C|/p), followed by the group-by: ~O(N).
But wait, there’s more…..
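A sketch of how the hypercube join routes tuples (hypothetical helper; val fields omitted from the tuples for brevity):

```python
def hypercube_assign(A, B, C, p):
    """Route tuples of A(i,j), B(j,k), C(k,m) onto a p x p grid of workers:
    hash(j) picks the row, hash(k) picks the column. B lands on exactly one
    worker; A is replicated across a row and C down a column, giving the
    per-worker input O(|A|/p + |B|/p^2 + |C|/p) quoted above (p^2 workers)."""
    workers = {(r, c): {"A": [], "B": [], "C": []}
               for r in range(p) for c in range(p)}
    for i, j in A:
        r = hash(j) % p
        for c in range(p):                     # replicate along the row
            workers[r, c]["A"].append((i, j))
    for j, k in B:
        workers[hash(j) % p, hash(k) % p]["B"].append((j, k))
    for k, m in C:
        c = hash(k) % p
        for r in range(p):                     # replicate down the column
            workers[r, c]["C"].append((k, m))
    return workers
```

Every joinable (A, B, C) triple is guaranteed to be co-located on the worker at (hash(j) % p, hash(k) % p), so each worker can join its inputs locally with no further communication.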
63. video
[Chart: Task Completion Rate / Time, all questions, comparing Fusion, VizDeck, ManyEyes, and Tableau; y-axis 0.0 to 1.2]
CHI '13
64. Visualization Recommendation
• Model each “vizlet” as a triple
(x_column, y_column, vizlet_type)
• Extract features from each column
(f1x, f2x,…, fNx, f1y, f2y, …, fNy, vizlet_type)
• Interpret each “promotion” as a yes vote and each “discard” as a
no vote
• Train a (simple) model to predict vizlet type from features
• Recommend highest-scoring vizlets
• Add a diversity term to prevent a bunch of similar plots
• Incorporate score modifiers defined by the vizlet designer
– “My bar chart looks best when there are about 5 bars.”
– “My timeseries plot ignores null values”
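The diversity term can be illustrated as a greedy re-ranking step (a hypothetical simplification; the scores here stand in for the trained model's output):

```python
def recommend(candidates, k=2, diversity=0.3):
    """candidates: (vizlet_type, model_score) pairs. Greedily pick the
    top k, discounting each repeat of an already-chosen vizlet type --
    the diversity term mentioned above."""
    chosen, counts = [], {}
    remaining = list(candidates)
    while remaining and len(chosen) < k:
        best = max(remaining,
                   key=lambda c: c[1] - diversity * counts.get(c[0], 0))
        remaining.remove(best)
        chosen.append(best[0])
        counts[best[0]] = counts.get(best[0], 0) + 1
    return chosen
```

With the penalty on, a slightly lower-scoring bar chart beats a second near-duplicate scatter plot; with it off, the top-k are all the same type.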
65. Example of a Learned Rule (1)
low x-entropy => bad scatter plot
bad scatter plot / good scatter plot
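The x-entropy feature behind this rule is plain Shannon entropy over the column's value distribution:

```python
import math
from collections import Counter

def entropy(col):
    """Shannon entropy (bits) of a column's value distribution. Near zero
    means the x values pile onto a few points, so a scatter plot collapses
    into vertical stripes -- the learned rule above."""
    n = len(col)
    return -sum((c / n) * math.log2(c / n) for c in Counter(col).values())
```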
66. Example of a Learned Rule (2)
low x-entropy => histogram
bad scatter plot good histogram
67. Example of a Learned Rule (3)
high x-periodicity => timeseries plot
(periodicity = 1 / variance in gap length between successive values)
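A direct reading of the slide's definition as code (hypothetical; the production feature extraction surely differs in detail, e.g. how the zero-variance case is handled):

```python
def periodicity(xs):
    """The rule's feature: 1 / variance of the gaps between successive
    sorted values. Evenly spaced timestamps give zero gap variance,
    i.e. infinite periodicity, represented here as float('inf')."""
    xs = sorted(xs)
    gaps = [b - a for a, b in zip(xs, xs[1:])]
    mean = sum(gaps) / len(gaps)
    var = sum((g - mean) ** 2 for g in gaps) / len(gaps)
    return 1 / var if var else float("inf")
```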
69. Within the first few queries, you’ve
touched all the tables.
SIGMOD 2016
Shrainik Jain
http://uwescience.github.io/sqlshare/
Editor's notes
Let me give you a brief example of a project a little further upstream that the incubation program can provide access to.
This work is in a space of open data sharing platforms, along with Socrata here in Seattle, products from Google and Microsoft, and a number of other companies.
Two observations motivate the products in this space:
First, there’s a movement toward open data that has researchers, government agencies, and even companies exposing their data assets online for use by others for reasons of transparency, efficiency, accountability. Even for commercial data, there are marketplaces emerging to facilitate the buying and selling of data. All of these use cases need new technology. So that’s one reason.
Second, if you’re going to use someone else’s data, you need it to be as accessible as possible. In particular, you need to help data analysts use data that “had previously been the realm of programmers and DB administrators” – here I’m quoting Benjamin Romano in an Xconomy article about Socrata.
SQLShare is an open data system, but emphasizes rich data manipulation rather than just fetch and retrieval, interoperability with external tools and existing databases, local or cloud deployments, and built-in services for data integration, profiling, and visualization.
Ginger mentioned this system in her talk – we have maintained a production deployment here on campus for three years focusing on science users. Our observation is that science use cases are a predictor for commercial use cases – businesses are beginning to use data the same way scientists always have – they collect it aggressively, torture it with analytics, use it to make predictions about the world. So we think if we can handle these difficult science use cases that we will also be addressing a significant commercial problem.
Solutions are emerging, powered by the open data movement.
Socrata, a local Seattle company, has built a very successful business of helping cities jailbreak their data, and are now engaged in climbing the application stack to support analytics and visualization.
Essentially every url of the form data.yourcity.gov is powered by Socrata’s technology
Data, People, and Infrastructure
and you can extend this model to the database layer to help share services like backup, recovery, caching, load balancing
If you go up a level, you have what you might call “query as a service” – you’re querying your own data, but you might query others peoples data as well.
And, even if you want to remain logically isolated, you can still benefit from services that are powered by mining everyone’s schema, data, and workload.
For example, query recommendation in the past assumed a fixed schema. With this model, you can recommend “idioms” across different schemas.
You can discover public datasets to join with, like Alon worked on with Fusion Tables
You can recommend visualizations automatically
You can automatically infer and attach metadata – semi-automatic data curation.
A big globally shared data lake
“Precision Medicine for Databases”
So we developed SQLShare to support a very simple workflow: you can upload data “as is” from spreadsheets or anything. It’s in the cloud, so no need to install or design a database.
You can immediately begin writing queries, right in your browser, and put queries on top of queries on top of queries.
Then you can share the results online: your colleagues can browse the science questions and see the SQL that answers them.
----
Key ideas to get data in: a) Use the cloud to avoid having to install and run a database. b) Give up on the schema -- just throw your data in "as is" and do "lazy integration.”
c) Use some magic to automate parsing, integration, recommendations, and more.
Key ideas to get data out:
a) Associate science questions (in English) with each SQL query -- makes them easy to understand and easy to find. b) Saving and reusing queries is a first-class requirement. Given examples, it's easy to modify one into an "adjacent" query.
c) Expose the whole system through a REST API to make it easy to bring new client applications online.
Lots of features you can imagine here – anything you can do with a youtube video, you should be able to do with a query: share it, rate it, “more like this”, recommendations,
We are exploring some of these.
We see non-programmers who write these wonderful 40-line queries.
This one does interval queries on genomic sequences. She doesn’t write any R, any Python, but she can do this, and she’s no longer dependent on staff programmers.
Express these plans
Optimize these plans
Compile these plans
Execute these plans
So our approach is to model this overlap in capabilities as its own language.
We start
We hoist
NOTES:
Optimizations enable?
with better semantics on a hash table join with UDFs, can do redundant computation elimination, code motion from UDF
Can you just run this in a database and expect good performance. Of course not.
But is this a fundamentally bad idea to run it this way?
Maybe not.
This is the complexity of three matrix multiply algorithms plotted against the sparsity.
Now let’s do
If you can automate, you can precompute speculatively.
On Big Data: interactive, on the web may or may not be feasible for big big data, but a good model for visualization recommendation can enable speculative generation.
Why do we care about lifetime?
Table usage predictions for caching and partitioning. Move from reactive to proactive physical design services.
Query idioms are consistent, while the data is fleeting. Not exact queries as in a streaming system, but the “methods” are reused over and over.
Extracting and optimizing these idioms across tenants is our goal.
And processing power, either as raw processor speed or via novel multi-core and many-core architectures, is also continuing to increase exponentially…
… but human cognitive capacity is remaining constant. How can computing technologies help scientists make sense out of these vast and complex data sets?
We want to give a little background of our project before we launch into it, so we will discuss the problem we are trying to solve.
Essentially, we want to remove the speed-bump of data handling from the scientists.