Bill Howe, PhD
Director of Research, Scalable Data Analytics
University of Washington eScience Institute
Big Data Curricula at the University of Washington eScience Institute
8/7/2013 Bill Howe, UW 1
“It’s a great time to be a data geek.”
-- Roger Barga, Microsoft Research
“The greatest minds of my generation are trying
to figure out how to make people click on ads”
-- Jeff Hammerbacher, co-founder, Cloudera
1. Theory (last 2000 yrs)
2. Experiment (last 200 yrs)
3. Simulation (last 50 yrs)
4. Data-Driven Discovery (last 5 yrs)
The University of Washington
eScience Institute
• Rationale
– The exponential increase in sensors is transitioning all fields of science
and engineering from data-poor to data-rich
– As a result, the techniques and technologies of data science must be
widely practiced and widely adopted
• Mission
– Advance the forefront of research both in modern data science
techniques and technologies, and in the fields that depend upon them
• Strategy
– Provide an umbrella organization for Big Data activities at UW and
beyond (new curricula, collaborations, funding sources, hiring practices)
– Bootstrap a national network of partners and peer institutes
– Attract, develop, and retain “Pi-shaped people”
π-shaped researchers
Broad in many areas; deep in at least two
UW Data Science Education Efforts
Audiences:
• Students, CS/Informatics: undergrads, grads
• Students, Non-Major: undergrads, grads
• Non-Students: professionals, researchers
Programs:
• UWEO Data Science Certificate
• Graduate Certificate in Big Data
• CS Data Management Courses
• eScience workshops
• Intro to data programming
• eScience Masters (planned)
• MOOC: Intro to Data Science
• Incubator: On-the-job training
Previous courses:
Scientific Data Management, Graduate CS, Summer 2006, Portland State University
Scientific Data Management, Graduate CS, Spring 2010, University of Washington
Three Activities
• Massively Open Online Course
• New PhD Tracks in Big Data
• An Incubator for Data Science Projects
• Other activities I won’t discuss
– Undergraduate “Data Wizardry” Courses
– 2-day Bootcamps in Python, SQL, GitHub, …
– Certificate Programs in Data Science
– Hackathons
• 8600 completed all programming assignments
• 7000 earned a certificate
Syllabus
• Data Science Landscape (~1 week)
• Data Manipulation at Scale
– Relational Databases (~1 week)
– MapReduce (~1 week)
– NoSQL (~1 week)
• Analytics
– Statistics Pearls (~1 week)
– Machine Learning Pearls (~1 week)
• Visualization (~1 week)
[figure: “This Course” positioned along four axes: tools vs. abstractions; desktop vs. cloud; structures vs. statistics; hackers vs. analysts]
What are the abstractions of data science?
“Data Jujitsu”
“Data Wrangling”
“Data Munging”
Translation: “We have no idea what this is all about”
[axis: tools vs. abstractions]
What are the abstractions of data science?
• matrices and linear algebra?
• relations and relational algebra?
• objects and methods?
• files and scripts?
• data frames and functions?
[axis: tools vs. abstractions]
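To make the contrast between candidate abstractions concrete, here is a minimal sketch that computes the same per-group average twice: once in a "files and scripts" style with explicit loops, and once in a "relations and relational algebra" style with SQL. The `readings` data and both function names are hypothetical, chosen only for illustration; they are not from the talk.

```python
import sqlite3
from collections import defaultdict

# Hypothetical toy data: (station, temperature) readings.
readings = [("A", 10.0), ("A", 14.0), ("B", 7.0), ("B", 9.0), ("B", 11.0)]


def mean_by_station_script(rows):
    """'Files and scripts' style: explicit loops and dictionaries."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for station, temp in rows:
        totals[station] += temp
        counts[station] += 1
    return {s: totals[s] / counts[s] for s in totals}


def mean_by_station_relational(rows):
    """'Relations' style: state the result declaratively in SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (station TEXT, temp REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    # Each result row is a (station, average) pair, so dict() builds the map.
    result = dict(
        conn.execute("SELECT station, AVG(temp) FROM readings GROUP BY station")
    )
    conn.close()
    return result


if __name__ == "__main__":
    print(mean_by_station_script(readings))
    print(mean_by_station_relational(readings))
```

Both functions return the same mapping. The script spells out how to compute it, while the SQL version states only what is wanted and leaves the execution strategy to the database.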
Data Access Hitting a Wall
Current practice based on data download (FTP/GREP)
Will not scale to the datasets of tomorrow
• You can GREP 1 MB in a second
• You can GREP 1 GB in a minute
• You can GREP 1 TB in 2 days
• You can GREP 1 PB in 3 years
• Oh, and 1 PB ~ 5,000 disks
• At some point you need
– indices to limit search
– parallel data search and analysis
• This is where databases can help
• You can FTP 1 MB in 1 sec
• You can FTP 1 GB / min (~1$)
• … 2 days and 1K$
• … 3 years and 1M$
[axis: desktop vs. cloud]
[slide src: Jim Gray]
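The GREP figures above can be checked with a little arithmetic. The sketch below is an illustration under stated assumptions, not part of the original slide: it uses decimal units (1 MB = 10^6 bytes), a 365-day year, and a hypothetical `scan_seconds` helper that assumes a linear scan splits evenly across workers.

```python
# Assumptions (not from the slide): decimal units and a 365-day year.
MB, GB, TB, PB = 10**6, 10**9, 10**12, 10**15
DAY = 24 * 3600
YEAR = 365 * DAY

# Figures quoted on the slide: (label, bytes scanned, seconds taken).
SLIDE_FIGURES = [
    ("1 MB in 1 second", MB, 1),
    ("1 GB in 1 minute", GB, 60),
    ("1 TB in 2 days", TB, 2 * DAY),
    ("1 PB in 3 years", PB, 3 * YEAR),
]


def implied_throughput_mb_per_s(num_bytes, seconds):
    """Sequential scan rate implied by one of the slide's figures."""
    return num_bytes / seconds / MB


def scan_seconds(num_bytes, mb_per_s, workers=1):
    """Time for a full linear scan, assuming it splits evenly across workers."""
    return num_bytes / (mb_per_s * MB * workers)


if __name__ == "__main__":
    for label, num_bytes, seconds in SLIDE_FIGURES:
        rate = implied_throughput_mb_per_s(num_bytes, seconds)
        print(f"{label}: about {rate:.1f} MB/s")
    # Hypothetical parallel setup: 1,000 workers at 10 MB/s each.
    print(scan_seconds(PB, 10, workers=1000) / DAY, "days for 1 PB")
```

Under these assumptions, every figure implies a scan rate somewhere between roughly 1 and 17 MB/s. The wall is therefore data volume, not tool speed. Spreading the same scan over 1,000 hypothetical workers at 10 MB/s each cuts 1 PB from years to a little over a day, which is the case for the parallel search the slide calls for. Indices attack the other factor by reducing how many bytes have to be read at all.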
US faces shortage of 140,000 to 190,000 people “with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.”
-- McKinsey Global Institute
[axis: hackers vs. analysts]
Three types of tasks:
1) Preparing to run a model
– Gathering, cleaning, integrating, restructuring, transforming, loading, filtering, deleting, combining, merging, verifying, extracting, shaping, massaging
– “80% of the work” -- Aaron Kimball
2) Running the model
3) Interpreting the results
– “The other 80% of the work” -- Aaron Kimball
[axis: structures vs. statistics]
New PhD Track: “Big Data U”
• Open to all departments
• New courses to “level the playing field”
– “Molecular Biology for Computer Scientists” offered this Fall
• Dual advising in two disciplines
• Joint projects leading to multiple theses
– Each methods thesis will include domain impact component
– Each domain thesis will include methods impact component
• Contribution to a shared cyberinfrastructure
– Software engineering experience as a side effect
• “Application Assistantships”
– Like RAs and TAs; focused on solving a concrete problem
Magda Balazinska
Carlos Guestrin
Data Science Incubator: Motivation
• We need the right people
– We produce “builders,” but 99% of them go to industry to
“make people click on ads”
– They aren’t motivated by writing papers
– No viable career path in the academy
• We need the right processes
– Hands-on, extended, intensive experience is required to
produce π-shaped people
– Data-driven discovery requires intensive collaboration
Science Domains
Stats, Computer Science, Applied Math
• “Where’s the funding?”
• “How does this help me write a paper in my field?”
• Thin collaborations; nobody to work on the short-term, high-risk, high-impact “triage” projects
• “Does method X work on dataset Y?”
Domain Labs
Research Programmers
• Expensive; doesn’t scale
• “Code Monkey” – No viable career path
• Can’t attract top people
• No sharing, no community, no cross-pollination
Data Science Incubator: Structure
• Recruit top-flight data science talent
• Give them autonomy to select collaborations and projects
• Promote them according to “altmetrics” and project impact
– “Data Scientist” → “Senior Data Scientist” → “Technical Fellow”
– “Data Science Fellows”
• Perhaps non-tenure, but 3-5 year commitments
• Funded with contributions from Academic units, IT,
Libraries, and soft money
Data Science Incubator: Seed Grants
• Domain researchers submit Seed Grant applications
for short, intensive 1-6 month projects
– Reviewed by the Data Scientists themselves
• Awardees send 1+ students, postdocs, staff, or faculty
to come and physically sit in the incubator space X
days per week for the project duration
– Application may or may not include funding for the student
Domain Labs
Incubator
• Data Scientists have their own identity and prestige
• Cross-pollination between disciplines
• Awardees leave with skills and knowledge; become “disciples”
MOOC “Introduction to Data Science”:
https://www.coursera.org/course/datasci
Certificate program:
http://www.pce.uw.edu/courses/data-science-intro
http://escience.washington.edu
billhowe@cs.washington.edu

Más contenido relacionado

La actualidad más candente

Making Biomedical Research More Like Airbnb
Making Biomedical Research More Like AirbnbMaking Biomedical Research More Like Airbnb
Making Biomedical Research More Like AirbnbPhilip Bourne
 
Wire Workshop: Overview slides for ArchiveHub Project
Wire Workshop: Overview slides for ArchiveHub ProjectWire Workshop: Overview slides for ArchiveHub Project
Wire Workshop: Overview slides for ArchiveHub Projectmwe400
 
Complicating the Question of Access (and Value) with University Press Publica...
Complicating the Question of Access (and Value) with University Press Publica...Complicating the Question of Access (and Value) with University Press Publica...
Complicating the Question of Access (and Value) with University Press Publica...Micah Altman
 
Information is beautiful
Information is beautifulInformation is beautiful
Information is beautifulMargaret Lawson
 
Towards a Platform for Global Health
Towards a Platform for Global HealthTowards a Platform for Global Health
Towards a Platform for Global HealthPhilip Bourne
 
The role of libraries and information professionals during the Big Data Era/ ...
The role of libraries and information professionals during the Big Data Era/ ...The role of libraries and information professionals during the Big Data Era/ ...
The role of libraries and information professionals during the Big Data Era/ ...African Open Science Platform
 
The NIH Commons: A Cloud-based Training Environment
The NIH Commons: A Cloud-based Training EnvironmentThe NIH Commons: A Cloud-based Training Environment
The NIH Commons: A Cloud-based Training EnvironmentPhilip Bourne
 
Moving Forward with Open Data Science - SWOT Analysis
Moving Forward with Open Data Science - SWOT AnalysisMoving Forward with Open Data Science - SWOT Analysis
Moving Forward with Open Data Science - SWOT AnalysisPhilip Bourne
 
Internet Archives and Social Science Research - Yeungnam University
Internet Archives and Social Science Research - Yeungnam UniversityInternet Archives and Social Science Research - Yeungnam University
Internet Archives and Social Science Research - Yeungnam Universitymwe400
 
Health Policy and Management as it Relates to Big Data
Health Policy and Management as it Relates to Big DataHealth Policy and Management as it Relates to Big Data
Health Policy and Management as it Relates to Big DataPhilip Bourne
 
BD2K @ NIH - A Vision Through 2020
BD2K @ NIH - A Vision Through 2020BD2K @ NIH - A Vision Through 2020
BD2K @ NIH - A Vision Through 2020Philip Bourne
 
Bw dave pattern lidp
Bw dave pattern lidpBw dave pattern lidp
Bw dave pattern lidpgregynog
 
Cal Poly - Data Management: Who knew it was a hot topic?
Cal Poly - Data Management: Who knew it was a hot topic?Cal Poly - Data Management: Who knew it was a hot topic?
Cal Poly - Data Management: Who knew it was a hot topic?Carly Strasser
 
Memory Connected
Memory ConnectedMemory Connected
Memory ConnectedLi Ding
 

La actualidad más candente (19)

25
2525
25
 
Making Biomedical Research More Like Airbnb
Making Biomedical Research More Like AirbnbMaking Biomedical Research More Like Airbnb
Making Biomedical Research More Like Airbnb
 
20
2020
20
 
Wire Workshop: Overview slides for ArchiveHub Project
Wire Workshop: Overview slides for ArchiveHub ProjectWire Workshop: Overview slides for ArchiveHub Project
Wire Workshop: Overview slides for ArchiveHub Project
 
Complicating the Question of Access (and Value) with University Press Publica...
Complicating the Question of Access (and Value) with University Press Publica...Complicating the Question of Access (and Value) with University Press Publica...
Complicating the Question of Access (and Value) with University Press Publica...
 
2015 Kno.e.sis Center Annual Review
2015 Kno.e.sis Center Annual Review2015 Kno.e.sis Center Annual Review
2015 Kno.e.sis Center Annual Review
 
Information is beautiful
Information is beautifulInformation is beautiful
Information is beautiful
 
Towards a Platform for Global Health
Towards a Platform for Global HealthTowards a Platform for Global Health
Towards a Platform for Global Health
 
The role of libraries and information professionals during the Big Data Era/ ...
The role of libraries and information professionals during the Big Data Era/ ...The role of libraries and information professionals during the Big Data Era/ ...
The role of libraries and information professionals during the Big Data Era/ ...
 
The NIH Commons: A Cloud-based Training Environment
The NIH Commons: A Cloud-based Training EnvironmentThe NIH Commons: A Cloud-based Training Environment
The NIH Commons: A Cloud-based Training Environment
 
Moving Forward with Open Data Science - SWOT Analysis
Moving Forward with Open Data Science - SWOT AnalysisMoving Forward with Open Data Science - SWOT Analysis
Moving Forward with Open Data Science - SWOT Analysis
 
Internet Archives and Social Science Research - Yeungnam University
Internet Archives and Social Science Research - Yeungnam UniversityInternet Archives and Social Science Research - Yeungnam University
Internet Archives and Social Science Research - Yeungnam University
 
Health Policy and Management as it Relates to Big Data
Health Policy and Management as it Relates to Big DataHealth Policy and Management as it Relates to Big Data
Health Policy and Management as it Relates to Big Data
 
BD2K @ NIH - A Vision Through 2020
BD2K @ NIH - A Vision Through 2020BD2K @ NIH - A Vision Through 2020
BD2K @ NIH - A Vision Through 2020
 
Bw dave pattern lidp
Bw dave pattern lidpBw dave pattern lidp
Bw dave pattern lidp
 
Cal Poly - Data Management: Who knew it was a hot topic?
Cal Poly - Data Management: Who knew it was a hot topic?Cal Poly - Data Management: Who knew it was a hot topic?
Cal Poly - Data Management: Who knew it was a hot topic?
 
Memory Connected
Memory ConnectedMemory Connected
Memory Connected
 
The African Open Science Platform/Susan Veldsman
The African Open Science Platform/Susan VeldsmanThe African Open Science Platform/Susan Veldsman
The African Open Science Platform/Susan Veldsman
 
Today's Data Grow Tomorrow's Citizens
Today's Data Grow Tomorrow's CitizensToday's Data Grow Tomorrow's Citizens
Today's Data Grow Tomorrow's Citizens
 

Similar a Big Data Curricula at the UW eScience Institute, JSM 2013

2013 DataCite Summer Meeting - DOIs and Supercomputing (Terry Jones - Oak Rid...
2013 DataCite Summer Meeting - DOIs and Supercomputing (Terry Jones - Oak Rid...2013 DataCite Summer Meeting - DOIs and Supercomputing (Terry Jones - Oak Rid...
2013 DataCite Summer Meeting - DOIs and Supercomputing (Terry Jones - Oak Rid...datacite
 
Teaching Data Science to Undergraduate Students
Teaching Data Science to Undergraduate StudentsTeaching Data Science to Undergraduate Students
Teaching Data Science to Undergraduate StudentsNicole Vasilevsky
 
Scientific Software Challenges and Community Responses
Scientific Software Challenges and Community ResponsesScientific Software Challenges and Community Responses
Scientific Software Challenges and Community ResponsesDaniel S. Katz
 
XLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and MyriaXLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and MyriaUniversity of Washington
 
Immersive informatics - research data management at Pitt iSchool and Carnegie...
Immersive informatics - research data management at Pitt iSchool and Carnegie...Immersive informatics - research data management at Pitt iSchool and Carnegie...
Immersive informatics - research data management at Pitt iSchool and Carnegie...Keith Webster
 
The Rise of the Data Journal
The Rise of the Data JournalThe Rise of the Data Journal
The Rise of the Data JournalMarieke Guy
 
Yafei (debbie) Liang resume
Yafei (debbie) Liang resume  Yafei (debbie) Liang resume
Yafei (debbie) Liang resume YafeiDebbieLiang
 
Changing the Curation Equation: A Data Lifecycle Approach to Lowering Costs a...
Changing the Curation Equation: A Data Lifecycle Approach to Lowering Costs a...Changing the Curation Equation: A Data Lifecycle Approach to Lowering Costs a...
Changing the Curation Equation: A Data Lifecycle Approach to Lowering Costs a...SEAD
 
Scientific Software - what happens after the grant?
Scientific Software - what happens after the grant?Scientific Software - what happens after the grant?
Scientific Software - what happens after the grant?James Howison
 
Introduction to Big Data and Data Science
Introduction to Big Data and Data ScienceIntroduction to Big Data and Data Science
Introduction to Big Data and Data ScienceFeyzi R. Bagirov
 
What is eScience, and where does it go from here?
What is eScience, and where does it go from here?What is eScience, and where does it go from here?
What is eScience, and where does it go from here?Daniel S. Katz
 
20160414 23 Research Data Things
20160414 23 Research Data Things20160414 23 Research Data Things
20160414 23 Research Data ThingsKatina Toufexis
 
2017-09-08 skunkworks q&a information session v1.0 distr
2017-09-08 skunkworks q&a information session v1.0 distr2017-09-08 skunkworks q&a information session v1.0 distr
2017-09-08 skunkworks q&a information session v1.0 distrddm314
 

Similar a Big Data Curricula at the UW eScience Institute, JSM 2013 (20)

2013 DataCite Summer Meeting - DOIs and Supercomputing (Terry Jones - Oak Rid...
2013 DataCite Summer Meeting - DOIs and Supercomputing (Terry Jones - Oak Rid...2013 DataCite Summer Meeting - DOIs and Supercomputing (Terry Jones - Oak Rid...
2013 DataCite Summer Meeting - DOIs and Supercomputing (Terry Jones - Oak Rid...
 
Teaching Data Science to Undergraduate Students
Teaching Data Science to Undergraduate StudentsTeaching Data Science to Undergraduate Students
Teaching Data Science to Undergraduate Students
 
NISO Virtual Conference Scientific Data Management: Caring for Your Instituti...
NISO Virtual Conference Scientific Data Management: Caring for Your Instituti...NISO Virtual Conference Scientific Data Management: Caring for Your Instituti...
NISO Virtual Conference Scientific Data Management: Caring for Your Instituti...
 
00-01 DSnDA.pdf
00-01 DSnDA.pdf00-01 DSnDA.pdf
00-01 DSnDA.pdf
 
Scientific Software Challenges and Community Responses
Scientific Software Challenges and Community ResponsesScientific Software Challenges and Community Responses
Scientific Software Challenges and Community Responses
 
Sept 18 NISO Webinar: Research Data Curation, Part 2: Libraries and Big Data ...
Sept 18 NISO Webinar: Research Data Curation, Part 2: Libraries and Big Data ...Sept 18 NISO Webinar: Research Data Curation, Part 2: Libraries and Big Data ...
Sept 18 NISO Webinar: Research Data Curation, Part 2: Libraries and Big Data ...
 
XLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and MyriaXLDB South America Keynote: eScience Institute and Myria
XLDB South America Keynote: eScience Institute and Myria
 
Immersive informatics - research data management at Pitt iSchool and Carnegie...
Immersive informatics - research data management at Pitt iSchool and Carnegie...Immersive informatics - research data management at Pitt iSchool and Carnegie...
Immersive informatics - research data management at Pitt iSchool and Carnegie...
 
The Rise of the Data Journal
The Rise of the Data JournalThe Rise of the Data Journal
The Rise of the Data Journal
 
Yafei liang resume
Yafei liang resumeYafei liang resume
Yafei liang resume
 
Yafei (debbie) Liang resume
Yafei (debbie) Liang resume  Yafei (debbie) Liang resume
Yafei (debbie) Liang resume
 
Changing the Curation Equation: A Data Lifecycle Approach to Lowering Costs a...
Changing the Curation Equation: A Data Lifecycle Approach to Lowering Costs a...Changing the Curation Equation: A Data Lifecycle Approach to Lowering Costs a...
Changing the Curation Equation: A Data Lifecycle Approach to Lowering Costs a...
 
Big Data
Big Data Big Data
Big Data
 
Scientific Software - what happens after the grant?
Scientific Software - what happens after the grant?Scientific Software - what happens after the grant?
Scientific Software - what happens after the grant?
 
Introduction to Big Data and Data Science
Introduction to Big Data and Data ScienceIntroduction to Big Data and Data Science
Introduction to Big Data and Data Science
 
What is eScience, and where does it go from here?
What is eScience, and where does it go from here?What is eScience, and where does it go from here?
What is eScience, and where does it go from here?
 
20160414 23 Research Data Things
20160414 23 Research Data Things20160414 23 Research Data Things
20160414 23 Research Data Things
 
Yafei liang resume
Yafei liang resume Yafei liang resume
Yafei liang resume
 
Yafei liang resume
Yafei liang resume Yafei liang resume
Yafei liang resume
 
2017-09-08 skunkworks q&a information session v1.0 distr
2017-09-08 skunkworks q&a information session v1.0 distr2017-09-08 skunkworks q&a information session v1.0 distr
2017-09-08 skunkworks q&a information session v1.0 distr
 

Más de University of Washington

Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)University of Washington
 
Data Responsibly: The next decade of data science
Data Responsibly: The next decade of data scienceData Responsibly: The next decade of data science
Data Responsibly: The next decade of data scienceUniversity of Washington
 
Thoughts on Big Data and more for the WA State Legislature
Thoughts on Big Data and more for the WA State LegislatureThoughts on Big Data and more for the WA State Legislature
Thoughts on Big Data and more for the WA State LegislatureUniversity of Washington
 
The Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore EnvironmentsThe Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore EnvironmentsUniversity of Washington
 
Big Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD ModelsBig Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD ModelsUniversity of Washington
 
Data, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data ScienceData, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data ScienceUniversity of Washington
 
Data Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionData Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionUniversity of Washington
 
The Other HPC: High Productivity Computing
The Other HPC: High Productivity ComputingThe Other HPC: High Productivity Computing
The Other HPC: High Productivity ComputingUniversity of Washington
 
Big Data Talent in Academic and Industry R&D
Big Data Talent in Academic and Industry R&DBig Data Talent in Academic and Industry R&D
Big Data Talent in Academic and Industry R&DUniversity of Washington
 
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe University of Washington
 
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)University of Washington
 
Myria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) ScientistsMyria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) ScientistsUniversity of Washington
 
Enabling Collaborative Research Data Management with SQLShare
Enabling Collaborative Research Data Management with SQLShareEnabling Collaborative Research Data Management with SQLShare
Enabling Collaborative Research Data Management with SQLShareUniversity of Washington
 
Virtual Appliances, Cloud Computing, and Reproducible Research
Virtual Appliances, Cloud Computing, and Reproducible ResearchVirtual Appliances, Cloud Computing, and Reproducible Research
Virtual Appliances, Cloud Computing, and Reproducible ResearchUniversity of Washington
 

Más de University of Washington (20)

Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)Database Agnostic Workload Management (CIDR 2019)
Database Agnostic Workload Management (CIDR 2019)
 
Data Responsibly: The next decade of data science
Data Responsibly: The next decade of data scienceData Responsibly: The next decade of data science
Data Responsibly: The next decade of data science
 
Thoughts on Big Data and more for the WA State Legislature
Thoughts on Big Data and more for the WA State LegislatureThoughts on Big Data and more for the WA State Legislature
Thoughts on Big Data and more for the WA State Legislature
 
The Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore EnvironmentsThe Other HPC: High Productivity Computing in Polystore Environments
The Other HPC: High Productivity Computing in Polystore Environments
 
Big Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD ModelsBig Data + Big Sim: Query Processing over Unstructured CFD Models
Big Data + Big Sim: Query Processing over Unstructured CFD Models
 
Data, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data ScienceData, Responsibly: The Next Decade of Data Science
Data, Responsibly: The Next Decade of Data Science
 
Democratizing Data Science in the Cloud
Democratizing Data Science in the CloudDemocratizing Data Science in the Cloud
Democratizing Data Science in the Cloud
 
Science Data, Responsibly
Science Data, ResponsiblyScience Data, Responsibly
Science Data, Responsibly
 
Data Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data InteractionData Science, Data Curation, and Human-Data Interaction
Data Science, Data Curation, and Human-Data Interaction
 
The Other HPC: High Productivity Computing
The Other HPC: High Productivity ComputingThe Other HPC: High Productivity Computing
The Other HPC: High Productivity Computing
 
Intro to Data Science Concepts
Intro to Data Science ConceptsIntro to Data Science Concepts
Intro to Data Science Concepts
 
Big Data Talent in Academic and Industry R&D
Big Data Talent in Academic and Industry R&DBig Data Talent in Academic and Industry R&D
Big Data Talent in Academic and Industry R&D
 
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
Big Data Middleware: CIDR 2015 Gong Show Talk, David Maier, Bill Howe
 
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
MMDS 2014: Myria (and Scalable Graph Clustering with RelaxMap)
 
Myria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) ScientistsMyria: Analytics-as-a-Service for (Data) Scientists
Myria: Analytics-as-a-Service for (Data) Scientists
 
eResearch New Zealand Keynote
eResearch New Zealand KeynoteeResearch New Zealand Keynote
eResearch New Zealand Keynote
 
Data science curricula at UW
Data science curricula at UWData science curricula at UW
Data science curricula at UW
 
Enabling Collaborative Research Data Management with SQLShare
Enabling Collaborative Research Data Management with SQLShareEnabling Collaborative Research Data Management with SQLShare
Enabling Collaborative Research Data Management with SQLShare
 
Virtual Appliances, Cloud Computing, and Reproducible Research
Virtual Appliances, Cloud Computing, and Reproducible ResearchVirtual Appliances, Cloud Computing, and Reproducible Research
Virtual Appliances, Cloud Computing, and Reproducible Research
 
End-to-End eScience
End-to-End eScienceEnd-to-End eScience
End-to-End eScience
 

Último

Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Big Data Curricula at the UW eScience Institute, JSM 2013

  • 1. Bill Howe, PhD Director of Research, Scalable Data Analytics University of Washington eScience Institute Big Data Curricula at the University of Washington eScience Institute 8/7/2013 Bill Howe, UW 1
  • 2. “It’s a great time to be a data geek.” -- Roger Barga, Microsoft Research “The greatest minds of my generation are trying to figure out how to make people click on ads” -- Jeff Hammerbacher, co-founder, Cloudera
  • 3. 1. Theory (last 2000 yrs) 2. Experiment (last 200 yrs) 3. Simulation (last 50 yrs) 4. Data-Driven Discovery (last 5 yrs)
  • 4. The University of Washington eScience Institute • Rationale – The exponential increase in sensors is transitioning all fields of science and engineering from data-poor to data-rich – As a result, the techniques and technologies of data science must be widely practiced and widely adopted • Mission – Advance the forefront of research both in modern data science techniques and technologies, and in the fields that depend upon them • Strategy – Provide an umbrella organization for Big Data activities at UW and beyond (new curricula, collaborations, funding sources, hiring practices) – Bootstrap a national network of partners and peer institutes – Attract, develop, and retain “Pi-shaped people”
  • 5. π-shaped researchers Broad in many areas; deep in at least two
  • 6. UW Data Science Education Efforts. Audiences: Students (CS/Informatics undergrads and grads; Non-Major undergrads and grads) and Non-Students (professionals, researchers). Offerings: UWEO Data Science Certificate; Graduate Certificate in Big Data; CS Data Management Courses; eScience workshops; Intro to data programming; eScience Masters (planned); MOOC: Intro to Data Science; Incubator: On-the-job-training. Previous courses: Scientific Data Management, Graduate CS, Summer 2006, Portland State University; Scientific Data Management, Graduate CS, Spring 2010, University of Washington
  • 7. Three Activities • Massively Open Online Course • New PhD Tracks in Big Data • An Incubator for Data Science Projects • Other activities I won’t discuss – Undergraduate “Data Wizardry” Courses – 2-day Bootcamps in Python, SQL, GitHub, … – Certificate Programs in Data Science – Hackathons
  • 8. Three Activities • Massively Open Online Course • New PhD Tracks in Big Data • An Incubator for Data Science Projects • Other activities I won’t discuss – Undergraduate “Data Wizardry” Courses – 2-day Bootcamps in Python, SQL, GitHub, … – Certificate Programs in Data Science – Hackathons
  • 10. • 8600 completed all programming assignments • 7000 earned a certificate
  • 11.
  • 12. Syllabus • Data Science Landscape (~1 week) • Data Manipulation at Scale – Relational Databases (~1 week) – MapReduce (~1 week) – NoSQL (~1 week) • Analytics – Statistics Pearls (~1 week) – Machine Learning Pearls (~1 week) • Visualization (~1 week)
  • 13. This Course (axes: tools vs. abstr.; desk vs. cloud; structs vs. stats; hackers vs. analysts)
  • 14. What are the abstractions of data science? “Data Jujitsu” “Data Wrangling” “Data Munging” Translation: “We have no idea what this is all about” (tools vs. abstr.)
  • 15. What are the abstractions of data science? matrices and linear algebra? relations and relational algebra? objects and methods? files and scripts? data frames and functions? (tools vs. abstr.)
  • 16. Data Access Hitting a Wall. Current practice based on data download (FTP/GREP) will not scale to the datasets of tomorrow • You can GREP 1 MB in a second • You can GREP 1 GB in a minute • You can GREP 1 TB in 2 days • You can GREP 1 PB in 3 years • Oh!, and 1 PB ~5,000 disks • At some point you need indices to limit search, parallel data search and analysis • This is where databases can help • You can FTP 1 MB in 1 sec • You can FTP 1 GB / min (~1$) • … 2 days and 1K$ • … 3 years and 1M$ (desk vs. cloud) [slide src: Jim Gray]
  • 17. US faces shortage of 140,000 to 190,000 people “with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.” -- McKinsey Global Institute (hackers vs. analysts)
  • 18. Three types of tasks: 1) Preparing to run a model: gathering, cleaning, integrating, restructuring, transforming, loading, filtering, deleting, combining, merging, verifying, extracting, shaping, massaging (“80% of the work” -- Aaron Kimball) 2) Running the model 3) Interpreting the results (“The other 80% of the work” -- Aaron Kimball) (structs vs. stats)
  • 19. Three Activities • Massively Open Online Course • New PhD Tracks in Big Data • An Incubator for Data Science Projects • Other activities I won’t discuss – Undergraduate “Data Wizardry” Courses – 2-day Bootcamps in Python, SQL, GitHub, … – Certificate Programs in Data Science – Hackathons
  • 20. New PhD Track: “Big Data U” • Open to all departments • New courses to “level the playing field” – “Molecular Biology for Computer Scientists” offered this Fall • Dual advising in two disciplines • Joint projects leading to multiple theses – Each methods thesis will include domain impact component – Each domain thesis will include methods impact component • Contribution to a shared cyberinfrastructure – Software engineering experience as a side effect • “Application Assistantships” – Like RAs and TAs; focused on solving a concrete problem. Magda Balazinska, Carlos Guestrin
  • 21. Three Activities • Massively Open Online Course • New PhD Tracks in Big Data • An Incubator for Data Science • Other activities I won’t discuss – Undergraduate “Data Wizardry” Courses – 2-day Bootcamps in Python, SQL, GitHub, … – Certificate Programs in Data Science – Hackathons
  • 22. Data Science Incubator: Motivation • We need the right people – We produce “builders,” but 99% of them go to industry to “make people click on ads” – They aren’t motivated by writing papers – No viable career path in the academy • We need the right processes – Hands-on, extended, intensive experience is required to produce π-shaped people – Data-driven discovery requires intensive collaboration
  • 23. Science Domains; Stats, Computer Science, Applied Math • “Where’s the funding?” • “How does this help me write a paper in my field?” • Thin collaborations; nobody to work on the short-term, high-risk, high-impact “triage” projects • “Does method X work on dataset Y?”
  • 24. Domain Labs Research Programmers • Expensive; doesn’t scale • “Code Monkey” – No viable career path • Can’t attract top people • No sharing, no community, no cross-pollination
  • 25. Data Science Incubator: Structure • Recruit top-flight data science talent • Give them autonomy to select collaborations and projects • Promote them according to “altmetrics” and project impact – “Data Scientist” → “Senior Data Scientist” → “Technical Fellow” – “Data Science Fellows” • Perhaps non-tenure, but 3-5 year commitments • Funded with contributions from Academic units, IT, Libraries, and soft money
  • 26. Data Science Incubator: Seed Grants • Domain researchers submit Seed Grant applications for short, intensive 1-6 month projects – Reviewed by the Data Scientists themselves • Awardees send 1+ students, postdocs, staff, or faculty to come and physically sit in the incubator space X days per week for the project duration – Application may or may not include funding for the student
  • 27. Domain Labs Incubator • Data Scientists have their own identity and prestige • Cross-pollination between disciplines • Awardees leave with skills and knowledge; become “disciples”
  • 28. Domain Labs Incubator • Data Scientists have their own identity and prestige • Cross-pollination between disciplines • Awardees leave with skills and knowledge; become “disciples”
  • 29. Three Activities • Massively Open Online Course • New PhD Tracks in Big Data • An Incubator for Data Science • Other activities I won’t discuss – Undergraduate “Data Wizardry” Courses – 2-day Bootcamps in Python, SQL, GitHub, … – Certificate Programs in Data Science – Hackathons
  • 30. MOOC “Introduction to Data Science”: https://www.coursera.org/course/datasci Certificate program: http://www.pce.uw.edu/courses/data-science-intro http://escience.washington.edu billhowe@cs.washington.edu
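
The scan-time figures on slide 16 can be approximated with simple arithmetic. The sketch below is illustrative only: the 10 MB/s sustained throughput is an assumption chosen for round numbers, not a figure from the slide, and the helper names `scan_seconds` and `humanize` are hypothetical.

```python
# Back-of-the-envelope time for one sequential pass over a dataset
# (a sketch under stated assumptions, not a benchmark).
MB, GB, TB, PB = 10**6, 10**9, 10**12, 10**15

# ASSUMPTION: 10 MB/s sustained read throughput; not taken from the slide.
ASSUMED_BYTES_PER_SECOND = 10 * MB


def scan_seconds(size_bytes, bytes_per_second=ASSUMED_BYTES_PER_SECOND):
    """Seconds needed for one full sequential scan of size_bytes."""
    return size_bytes / bytes_per_second


def humanize(seconds):
    """Render a duration in the largest unit that keeps the value >= 1."""
    for unit, length in (("years", 365 * 86400), ("days", 86400), ("minutes", 60)):
        if seconds >= length:
            return f"{seconds / length:.1f} {unit}"
    return f"{seconds:.1f} seconds"


if __name__ == "__main__":
    for label, size in (("1 MB", MB), ("1 GB", GB), ("1 TB", TB), ("1 PB", PB)):
        print(f"{label}: {humanize(scan_seconds(size))}")
```

Because scan time grows linearly with data size, a fixed throughput that feels instant at megabytes turns into years at petabytes, which is the slide's argument for indices and parallel search.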

Editor's notes

  1. Observe the world vs. observe the data; instruments vs. algorithms
  2. So in part as an attempt to relate “eScience” and “data science,” and in part to make sure the idea of data science wasn’t completely taken over by the machine learning people, we ran a massively open online course last Spring called Introduction to Data Science. We taught Scalable Databases, MapReduce, Statistics, Machine Learning, Visualization.
  3. “Data Jujitsu,” “Data Wrangling,” “Data Munging”
  4. Our collaborators tell us that loading data into memory with R is the major bottleneck. It actually changes the science they can do: “I would say that we can start answering questions about macro-ecology (study of relationships between organisms and their environment at large spatial scales).”