A quick, off-the-cuff talk about why I think SQL is good for scientists. Please send me notes correcting my Python, arguing, or asking for more information! And see the tutorial at: http://uwescience.github.io/sqlshare
1. A case for teaching SQL to scientists
Daniel Halperin
#w2tbac @SESYNC 2013-07-09
2. SQL: think like data
• SQL is a Language for expressing Queries over Structured data.
• vs. Python/R, SQL is
  • strictly less powerful
  • better for concisely, clearly, and efficiently expressing data manipulation
• ... and anecdotally, “many” scripts written by scientists just manipulate data
3. Claim 1: SQL is Concise & Clear
• English questions often translate directly into SQL
• Scripting languages have a lot of language overhead: boilerplate syntax that obscures the question being asked
• Let’s see some (admittedly biased) examples
4. What does this code do?
with open('file.txt') as input_file:
    cnt = 0
    for line in input_file:
        cnt += 1
print(cnt)
5. What does this code do?
with open('file.txt') as input_file:
    cnt = 0
    for line in input_file:
        cnt += 1
print(cnt)
SELECT COUNT(*) AS cnt
FROM file
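To try this comparison yourself, here is a minimal sketch using Python's built-in sqlite3 module. The in-memory table named file, its single line column, and the sample lines are assumptions for illustration; the slides do not say how file.txt was loaded into a database.

```python
import sqlite3

# Hypothetical stand-in for the lines of file.txt.
lines = ["a 1", "b 2", "c 3"]

# Load the lines into an in-memory table (assumed schema: one text column).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE file (line TEXT)")
conn.executemany("INSERT INTO file VALUES (?)", [(l,) for l in lines])

# The SQL from the slide.
(sql_cnt,) = conn.execute("SELECT COUNT(*) AS cnt FROM file").fetchone()

# The Python loop from the slide, run over the same lines.
cnt = 0
for line in lines:
    cnt += 1

print(sql_cnt, cnt)
```

Both counts should agree; the SQL version states the question and leaves the looping to the database.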
6. What does this code do?
with open('file.txt') as input_file:
    for line in input_file:
        if int(line.split()[3]) > 5:
            print(line, end='')
7. What does this code do?
with open('file.txt') as input_file:
    for line in input_file:
        if int(line.split()[3]) > 5:
            print(line, end='')
SELECT *
FROM file
WHERE value > 5
8. What does this code do?
SELECT value, SUM(counts) AS tot_count
FROM file
GROUP BY value
9. What does this code do?
from collections import defaultdict

with open('file.txt') as input_file:
    tot_counts = defaultdict(int)
    for line in input_file:
        fields = line.split()
        tot_counts[fields[3]] += int(fields[4])
for value in tot_counts:
    print(value, tot_counts[value])
SELECT value, SUM(counts) AS tot_count
FROM file
GROUP BY value
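As a rough check that the two versions above compute the same thing, here is a sketch that runs both on the same made-up rows. The column names a, b, c and the sample data are assumptions; only value and counts appear in the slides. Unlike the slide's script, the keys are converted to integers here so the two results can be compared directly.

```python
import sqlite3
from collections import defaultdict

# Hypothetical whitespace-separated rows: field 3 is `value`, field 4 is `counts`.
rows = ["x y z 3 10", "x y z 7 20", "x y z 3 5"]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file (a TEXT, b TEXT, c TEXT, value INTEGER, counts INTEGER)"
)
conn.executemany("INSERT INTO file VALUES (?, ?, ?, ?, ?)", [r.split() for r in rows])

# SQL version: one (value, tot_count) pair per group.
sql_totals = dict(
    conn.execute("SELECT value, SUM(counts) AS tot_count FROM file GROUP BY value")
)

# Python version: accumulate per-value totals in a dictionary.
py_totals = defaultdict(int)
for r in rows:
    fields = r.split()
    py_totals[int(fields[3])] += int(fields[4])

print(sql_totals == dict(py_totals))
```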
10. What does this code do?
SELECT census.county,
1.0 * electoral.votes / census.population AS voting_rate
FROM electoral, census
WHERE electoral.county = census.county
11. What does this code do?
SELECT census.county,
1.0 * electoral.votes / census.population AS voting_rate
FROM electoral, census
WHERE electoral.county = census.county
<Complicated stuff with dictionaries>
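For a sense of what the "complicated stuff with dictionaries" might look like, here is one possible sketch of the same join in plain Python. The county names and numbers are made up, and the layout (one dictionary per table, keyed by county) is an assumption, not the code from the talk.

```python
# Hypothetical per-county data; the numbers are made up for illustration.
census = {"Adams": 1000, "Baker": 2500}                  # county -> population
electoral = {"Adams": 400, "Baker": 1500, "Clark": 90}   # county -> votes

voting_rate = {}
for county, votes in electoral.items():
    # Mimic the join condition: keep only counties present in both tables.
    if county in census:
        voting_rate[county] = votes / census[county]

print(voting_rate)
```

Even this small version has to decide how to index each table and what to do with unmatched counties, which the SQL join condition expresses in one line.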
12. Claim 2: SQL is Efficient
Scaling up your data
• What happens when Python/R data doesn’t fit in memory? The script crashes, or you rewrite it as much more complicated code
• Most databases automatically and transparently spill to disk, and are heavily optimized for performance
13. Claim 2: SQL is Efficient
Say you inherit a really well-engineered Python script
./highly_optimized_code.py < TB.dataset > GB.result
14. Claim 2: SQL is Efficient
Say you inherit a really well-engineered Python script
./highly_optimized_code.py < TB.dataset > GB.result
But you are only interested in a small fraction of the result
./simple_data_filter.py < GB.result > MB.answer
15. Claim 2: SQL is Efficient
Say you inherit a really well-engineered Python script
./highly_optimized_code.py < TB.dataset > GB.result
But you are only interested in a small fraction of the result
./simple_data_filter.py < GB.result > MB.answer
Your options:
1) Dive into the complex code and modify its internals to filter inside
2) Suffer the long running time of the first program
16. Claim 2: SQL is Efficient
CREATE VIEW their_query AS
SELECT <... their code ...>
FROM terabyte_dataset
Gives their query a name, but doesn’t execute it!
17. Claim 2: SQL is Efficient
CREATE VIEW their_query AS
SELECT <... their code ...>
FROM terabyte_dataset
Gives their query a name, but doesn’t execute it!
SELECT *
FROM their_query
WHERE <... your filter ...>
The database combines both queries and optimizes them together!
18. Claim 2: SQL is Efficient
CREATE VIEW their_query AS
SELECT <... their code ...>
FROM terabyte_dataset
Gives their query a name, but doesn’t execute it!
SELECT *
FROM their_query
WHERE <... your filter ...>
The database combines both queries and optimizes them together!
Fast!
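The view pattern above can be tried on a small scale with Python's built-in sqlite3 module. This is only a sketch: the id and value columns, the doubling "query", and the filter are hypothetical stand-ins for the elided code, and how aggressively a filter is pushed into a view depends on the database engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Tiny stand-in for the terabyte dataset (assumed columns: id, value).
conn.execute("CREATE TABLE terabyte_dataset (id INTEGER, value INTEGER)")
conn.executemany(
    "INSERT INTO terabyte_dataset VALUES (?, ?)", [(i, i * i) for i in range(10)]
)

# "Their" query, given a name as a view. Creating the view computes nothing.
conn.execute(
    "CREATE VIEW their_query AS "
    "SELECT id, value * 2 AS doubled FROM terabyte_dataset"
)

# "Your" filter, layered on top of the view; the engine plans both together.
result = conn.execute(
    "SELECT * FROM their_query WHERE id >= 8 ORDER BY id"
).fetchall()

print(result)
```

The point of the design is that the filter is written against the view's name, without touching or re-running the original query's code.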
19. SQL for Science
• UW’s SQLShare - open, view-oriented, web database service
• Easy data import, public & private sharing, permalinks (DOI support coming)
• Use a series of views instead of scripts for:
  • data cleaning, transformation, integration
  • simple stats, analytics, format conversion
  • provenance and publishing
  • mashups: integrated with R, Sage, etc.
20. escience.washington.edu/sqlshare
“An undergraduate student and I are working with gigabytes of tabular data derived from analysis of protein surfaces. Previously, we were using huge directory trees and plain text files. Now we can accomplish a 10 minute 100 line script in 1 line of SQL.”
- Andrew D White, grad student in UW Chem Eng
“I have had two students who are struggling with R come up and tell me how much more they like working in SQLShare.”
- Robin Kodner, as asst professor at Western Washington U
"That [SQL query that finished in 1 second] took me a week [manually in Excel]!"
- Robin Kodner, as postdoc at UW Oceanography
* yes, we need (and are interested in) more than anecdotes!!