A case for teaching SQL to scientists
Daniel Halperin
#w2tbac @SESYNC 2013-07-09
SQL: think like data
• SQL is a Language for expressing Queries over Structured data.
• Compared with Python/R, SQL is
  • strictly less powerful
  • better for concisely, clearly, and efficiently expressing data manipulation
• ... and anecdotally, “many” scripts written by scientists just manipulate data
Claim 1: SQL is Concise & Clear
• English questions often translate directly into SQL
• Scripting languages carry a lot of language overhead (boilerplate syntax)
• Let’s see some (admittedly biased) examples
with open('file.txt') as input_file:
    cnt = 0
    for line in input_file:
        cnt += 1
print(cnt)
What does this code do?
SELECT COUNT(*) AS cnt
FROM file
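To try the SQL version, the text file first has to be loaded as a table. A minimal sketch, assuming Python's built-in sqlite3 module, an in-memory database, and a hypothetical one-column layout (`line`) for the table `file`; none of these details come from the slides.

```python
import sqlite3

def count_rows(lines):
    """Load each line as one row of a table named `file`, then count rows in SQL."""
    conn = sqlite3.connect(":memory:")  # assumption: throwaway in-memory database
    conn.execute("CREATE TABLE file (line TEXT)")  # hypothetical one-column schema
    conn.executemany("INSERT INTO file VALUES (?)", ((text,) for text in lines))
    (cnt,) = conn.execute("SELECT COUNT(*) AS cnt FROM file").fetchone()
    conn.close()
    return cnt

sample = ["a 1", "b 2", "c 3"]  # stand-in for the lines of file.txt
print(count_rows(sample))
```

The loading step is extra work for a one-off count; it pays off once several questions are asked of the same table.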
with open('file.txt') as input_file:
    for line in input_file:
        if int(line.split()[3]) > 5:
            print(line, end='')
What does this code do?
SELECT *
FROM file
WHERE value > 5
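The SQL filter relies on field 3 of each line having a name, `value`. A minimal sketch of that mapping, assuming sqlite3, whitespace-separated rows with at least five fields, and hypothetical names `c0`, `c1`, `c2` for the fields the slides never name:

```python
import sqlite3

def load_table(rows):
    """Create an in-memory table `file` from whitespace-separated text rows."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: the slides only name `value` (field 3) and `counts` (field 4).
    conn.execute(
        "CREATE TABLE file (c0 TEXT, c1 TEXT, c2 TEXT, value INTEGER, counts INTEGER)"
    )
    conn.executemany(
        "INSERT INTO file VALUES (?, ?, ?, ?, ?)",
        (line.split()[:5] for line in rows),
    )
    return conn

rows = ["a b c 3 10", "d e f 7 20", "g h i 9 30"]  # made-up sample data
conn = load_table(rows)
selected = conn.execute("SELECT * FROM file WHERE value > 5").fetchall()
print(selected)
```

Declaring `value` as INTEGER lets SQLite convert the text "7" to the number 7 on insert, which replaces the explicit `int(...)` call in the Python loop.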
What does this code do?
from collections import defaultdict

with open('file.txt') as input_file:
    tot_counts = defaultdict(int)
    for line in input_file:
        fields = line.split()
        tot_counts[fields[3]] += int(fields[4])
for value in tot_counts:
    print(value, tot_counts[value])
SELECT value, SUM(counts) AS tot_count
FROM file
GROUP BY value
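One way to check that the two versions agree is to run both on the same rows. A minimal sketch, assuming sqlite3, made-up sample rows, and the same hypothetical column names `c0`, `c1`, `c2` for the unnamed fields; unlike the slide, the Python side converts `value` to an integer so the keys compare equal.

```python
import sqlite3
from collections import defaultdict

rows = ["a b c 3 10", "d e f 7 20", "g h i 3 5"]  # made-up sample data

# Python version from the slide, collected into a dict instead of printed.
tot_counts = defaultdict(int)
for line in rows:
    fields = line.split()
    tot_counts[int(fields[3])] += int(fields[4])

# SQL version, run against an in-memory table with a hypothetical schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file (c0 TEXT, c1 TEXT, c2 TEXT, value INTEGER, counts INTEGER)"
)
conn.executemany("INSERT INTO file VALUES (?, ?, ?, ?, ?)",
                 (line.split() for line in rows))
sql_totals = dict(conn.execute(
    "SELECT value, SUM(counts) AS tot_count FROM file GROUP BY value"
))

print(sql_totals == dict(tot_counts))
```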
What does this code do?
SELECT census.county,
       electoral.votes / census.population AS voting_rate
FROM electoral, census
WHERE electoral.county = census.county
Python equivalent: <Complicated stuff with dictionaries>
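The slide leaves the dictionary version to the imagination. One possible shape, as a sketch only: it assumes each table is a list of (county, number) pairs with one row per county, and the function name and sample figures are made up for illustration.

```python
def voting_rates(electoral, census):
    """Join two lists of (county, number) pairs on county, like the SQL query."""
    population = dict(census)  # county -> population lookup table
    rates = {}
    for county, votes in electoral:
        if county in population:  # inner join: skip counties missing from census
            rates[county] = votes / population[county]
    return rates

electoral = [("King", 500), ("Pierce", 200), ("Nowhere", 10)]  # made-up numbers
census = [("King", 1000), ("Pierce", 800)]
print(voting_rates(electoral, census))
```

Two caveats about the comparison: Python 3's `/` is true division, whereas several SQL engines divide two integer columns as integers, so the SQL may need a cast to produce a fractional rate. And if a county appeared more than once in either table, the SQL join would return every matching pair, while this dictionary keeps only one entry per county.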
Claim 2: SQL is Efficient
Scaling up your data
• What happens when Python/R data doesn’t fit in memory? Crash, or rewrite as much more complicated code
• Most databases automatically and transparently spill to disk, and are heavily optimized for performance
Say you inherit a really well-engineered Python script
./highly_optimized_code.py < TB.dataset > GB.result
but are only interested in a small fraction of the result
./simple_data_filter.py < GB.result > MB.answer
Your options:
1) Dive into the complex code and modify its internals to filter inside
2) Suffer the long running time of the first program
CREATE VIEW their_query AS
SELECT <... their code ...>
FROM terabyte_dataset;
-- Gives their query a name, but doesn’t execute it!
SELECT *
FROM their_query
WHERE <... your filter ...>;
-- The database combines both queries and optimizes them together. Fast!
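The view pattern can be tried at small scale. A minimal sketch, assuming sqlite3 and an in-memory table; the columns `id` and `reading`, the doubling query, and the `id < 3` filter are hypothetical stand-ins for "their code" and "your filter", and a 1000-row table stands in for the terabyte dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE terabyte_dataset (id INTEGER, reading REAL)")
conn.executemany("INSERT INTO terabyte_dataset VALUES (?, ?)",
                 [(i, i * 0.5) for i in range(1000)])

# Stand-in for "their code": defining the view stores the query, it does not run it.
conn.execute("""
    CREATE VIEW their_query AS
    SELECT id, reading * 2 AS doubled
    FROM terabyte_dataset
""")

# Stand-in for "your filter": the view and the filter are submitted as one query.
answer = conn.execute("SELECT * FROM their_query WHERE id < 3").fetchall()
print(answer)

# Ask SQLite how it intends to run the combined query.
for step in conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM their_query WHERE id < 3"):
    print(step)
```

How well the filter is folded into the view's query depends on the engine and on the query; the plan output is the place to check what actually happens.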
SQL for Science
• UW’s SQLShare: open, view-oriented, web database service
• Easy data import, public & private sharing, permalinks (DOI support coming)
• Use a series of views instead of scripts for:
  • data cleaning, transformation, integration
  • simple stats, analytics, format conversion
  • provenance and publishing
  • mashups: integrated with R, Sage, etc.
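"A series of views instead of scripts" might look like the following. This is a sketch under stated assumptions, using sqlite3 rather than SQLShare itself; the table `raw_samples`, its columns, and the two views are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_samples (site TEXT, temp_c TEXT)")
conn.executemany("INSERT INTO raw_samples VALUES (?, ?)",
                 [(" A ", "10.5"), ("a", "11.5"), ("B", "n/a"), ("b", "20.0")])

# Step 1 (cleaning): normalise site names and drop unparseable readings.
conn.execute("""
    CREATE VIEW cleaned AS
    SELECT UPPER(TRIM(site)) AS site, CAST(temp_c AS REAL) AS temp_c
    FROM raw_samples
    WHERE raw_samples.temp_c <> 'n/a'
""")

# Step 2 (simple stats): built on the cleaned view, not on the raw table.
conn.execute("""
    CREATE VIEW site_means AS
    SELECT site, AVG(temp_c) AS mean_temp_c
    FROM cleaned
    GROUP BY site
""")

means = conn.execute("SELECT * FROM site_means ORDER BY site").fetchall()
print(means)
```

Each view records one step, so the chain from raw table to final numbers doubles as a record of how the result was derived.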
escience.washington.edu/sqlshare
“An undergraduate student and I are working with gigabytes of tabular data derived from analysis of protein surfaces. Previously, we were using huge directory trees and plain text files. Now we can accomplish a 10 minute 100 line script in 1 line of SQL.”
- Andrew D White, grad student in UW Chem Eng
“I have had two students who are struggling with R come up and tell me how much more they like working in SQLShare.”
- Robin Kodner, as asst professor at Western Washington U
"That [SQL query that finished in 1 second] took me a week [manually in Excel]!"
- Robin Kodner, as postdoc at UW Oceanography
* yes, we need (and are interested in) more than anecdotes!!
SQL can do more than you think (here vs R)
