Hacking academia for fun and profit
Thoughts on succeeding in academia despite doing good software
Keynote I gave at the Scipyconf Argentina 2014 conference
The advancement of science is a noble cause, and academia a fierce battlefield for tenure. Software is seen as a mere technicality, not worth a line on an academic CV. I claim that, on the opposite software, is the new medium of scientific method. I claim that succeeding in academia can be achieved not despite writing good software but via such an accomplishment. The key is to choose the right battles and to win them.
What is the emerging role of software in the scientific workflow? Which are the software challenges that can have impact? How to balance software quality assurance and the quick turn-around random-walk of research? What does "good design" mean for research software? What Python patterns can boost productivity and reuse in exploratory scientific computing?
I will try to answer these questions, based on my personal experience of growing up to become an academic Pythonista.
Devoxx UK 2024 - Going serverless with Quarkus, GraalVM native images and AWS...
Succeeding in academia despite doing good_software
1. Hacking academia for fun and profit
Thoughts on succeeding in academia despite doing good software
Varoquaux
Ga¨el
Strong is the power
of the dark side
2. Hacking academia for fun and profit
Thoughts on succeeding in academia despite doing good software
Varoquaux
Ga¨el
Strong is the power
of the dark side
with Python
3. Hacking academia for fun and profit
Thoughts on succeeding in academia despite doing good software
Varoquaux
Ga¨el
Strong is the power
of the dark side
despite good
software?
11. Growing up as a geek scientist
I did a PhD in
quantum physics
G Varoquaux 5
12. Growing up as a geek scientist
I did a PhD in
quantum physics
Vacuum (leaks)
Electronics (shorts)
Lasers (mis-alignment)
Best training ever
for agile project
managementG Varoquaux 6
13. Growing up as a geek scientist
I did a PhD in
quantum physics
Vacuum (leaks)
Electronics (shorts)
Lasers (mis-alignment)
Computers were only one
of the many moving parts
Matlab
Instrument control
G Varoquaux 7
14. Growing up as a geek scientist
I did a PhD in
quantum physics
Vacuum (leaks)
Electronics (shorts)
Lasers (mis-alignment)
Computers were only one
of the many moving parts
Matlab
Instrument controlShaped my vision
of computing as a
means to an end
G Varoquaux 7
17. Success
2011
Tenured researcher
in computer science
Today
Growing team with
data science
rock stars
How / why did I switch?
Fernando Perez (IPython), Prabhu
Ramachandran (Mayavi), Eric Jones
(Enthought), Travis Oliphant (Numpy)...
G Varoquaux 8
18. Success
2011
Tenured researcher
in computer science
Today
Growing team with
data science
rock stars
How / why did I switch?
Fernando Perez (IPython), Prabhu
Ramachandran (Mayavi), Eric Jones
(Enthought), Travis Oliphant (Numpy)...
Learning fast is more impor-
tant than knowing something
Bye bye physics
G Varoquaux 8
19. And now...
What I do nowadays:
Machine learning to understand brain function
Cognitive neuroscience:
Link neural activity to thoughts and cognition
G Varoquaux 9
20. And now...
What I do nowadays:
Machine learning to understand brain function
Learn a bilateral link
between brain activity and cognitive function
G Varoquaux 9
21. Software is the new medium of scientific
method
Galileo’s notes
G Varoquaux 10
22. The scientific method
1. Make conjectures
2. Derive prediction
3. Carry experiments
4. Confirm or infirm conjectures
G Varoquaux 11
23. The scientific method
1. Make conjectures
2. Derive prediction
3. Carry experiments
4. Confirm or infirm conjectures
Software is everywhere
Data-mining
Computational models
Computer-controled experiments
Data analysis
G Varoquaux 11
24. The scientific method
1. Make conjectures
2. Derive prediction
3. Carry experiments
4. Confirm or infirm conjectures
Code is often the very language in
which predictions are expressed
Models are now more complex than a simple
formula or sentence
G Varoquaux 11
25. Enabling falsification: reproducible science
Replicating
A 3rd party redoing the work
Code and data made available
Reproducing
New analysis on different data / code coming to the
same conclusion
Reusing
Applying the approach to a new problem
Let us enable reusable research
G Varoquaux 12
26. Enabling falsification: reproducible science
Replicating
A 3rd party redoing the work
Code and data made available
Reproducing
New analysis on different data / code coming to the
same conclusion
Reusing
Applying the approach to a new problem
Let us enable reusable research
Arguments for BSD license
No strings attached
Can tinker with it
G Varoquaux 12
27. Reusable science ñ evidence accumulation
Accumulation of scientific knowledge
and learning formal representations
G Varoquaux 13
28. Reusable science ñ evidence accumulation
Accumulation of scientific knowledge
and learning formal representations
Akin to a review paper of the field
But a mathematical model is more testable
Machine learning:
engineering knowledge from data
G Varoquaux 13
29. Reusable science ñ evidence accumulation
Accumulation of scientific knowledge
and learning formal representations
Akin to a review paper of the field
But a mathematical model is more testable
“A theory is a good theory if it satisfies two requirements:
It must accurately describe a large class of observa-
tions on the basis of a model that contains only a few
arbitrary elements, and it must make definite predic-
tions about the results of future observations.”
Stephen Hawking, A Brief History of Time.
G Varoquaux 13
31. The advancement of knowledge
Imagine a circle that contains human knowledge
Courtesy of Matt Might, via Stefan van der Waalt
G Varoquaux 15
32. The advancement of knowledge
By the time you finish elementary school, you know a little
Courtesy of Matt Might, via Stefan van der Waalt
G Varoquaux 15
33. The advancement of knowledge
High school takes you a little bit further
Courtesy of Matt Might, via Stefan van der Waalt
G Varoquaux 15
34. The advancement of knowledge
With a bachelors degree, you gain a speciality
Courtesy of Matt Might, via Stefan van der Waalt
G Varoquaux 15
35. The advancement of knowledge
A master’s degree deepens this speciality
Courtesy of Matt Might, via Stefan van der Waalt
G Varoquaux 15
36. The advancement of knowledge
Research papers take you to the edge of human knowledge
Courtesy of Matt Might, via Stefan van der Waalt
G Varoquaux 15
37. The advancement of knowledge
Once you are at the boundary, you focus
Courtesy of Matt Might, via Stefan van der Waalt
G Varoquaux 15
38. The advancement of knowledge
You push at the boundary for a few years
Courtesy of Matt Might, via Stefan van der Waalt
G Varoquaux 15
39. The advancement of knowledge
And one day it yields
Courtesy of Matt Might, via Stefan van der Waalt
G Varoquaux 15
40. The advancement of knowledge
That dent you’ve made, is called a PhD
Courtesy of Matt Might, via Stefan van der Waalt
G Varoquaux 15
41. The advancement of knowledge
Of course, the world looks different to you now
Courtesy of Matt Might, via Stefan van der Waalt
G Varoquaux 15
42. The advancement of knowledge
But don’t forget the big picture
PhD
Courtesy of Matt Might, via Stefan van der Waalt
G Varoquaux 15
43. The advancement of knowledge
This is an optimistic view
Biology
Maths
Computer science
Physics
Economy
Literature
History
G Varoquaux 15
44. The advancement of knowledge
This is an optimistic view
Biology
Maths
Computer science
Physics
Economy
Literature
History
I want to
be there
G Varoquaux 15
46. Translationnal computional science
Computational science
The use of computers and mathematical models to
address scientific research
Translationnal science
In medecine: bring bench science to medical practice
G Varoquaux 16
47. Translationnal computional science
Computational science
The use of computers and mathematical models to
address scientific research
Translationnal science
In medecine: bring bench science to medical practice
Translational
computational science?
G Varoquaux 16
48. Pick a problem to work on
Take the “easy” route
There needs to be a market screeming for the
software (in academia and in industry)
Refine your vision
Pull, not push
Design driven be need
G Varoquaux 17
51. Pick the right battles: viable projects
Project idea
A software implementing:
i) machine learning
and ii) neuroimaging
and iii) a graphical user interface
and iv) 3D plotting
G Varoquaux 19
52. Pick the right battles: viable projects
Project idea
A software implementing:
i) machine learning
and ii) neuroimaging
and iii) a graphical user interface
and iv) 3D plotting
G Varoquaux 19
53. Pick the right battles: viable projects
Project idea
A software implementing:
i) machine learning
and ii) neuroimaging
and iii) a graphical user interface
and iv) 3D plotting
Define project scope and vision
Break down projects by expertise
Don’t solve hard problems
Know the software landscape
Don’t target markets that will not
yield contributors
Need a vision = elevator pitch
G Varoquaux 19
54. Pick the right battles: viable projects
Project idea
A software implementing:
i) machine learning
and ii) neuroimaging
and iii) a graphical user interface
and iv) 3D plotting
Define project scope and vision
Break down projects by expertise
Don’t solve hard problems
Know the software landscape
Don’t target markets that will not
yield contributors
Need a vision = elevator pitch
Your research (PhD) probably does not qualify
ñ need to cherry-pick contributions
G Varoquaux 19
55. Open source and community development
Code maintenance too expensive to be alone
scikit-learn „ 300 email/month nipy „ 45 email/month
joblib „ 45 email/month mayavi „ 30 email/month
“Hey Gael, I take it you’re too
busy. That’s okay, I spent a day
trying to install XXX and I think
I’ll succeed myself. Next time
though please don’t ignore my
emails, I really don’t like it. You
can say, ‘sorry, I have no time to
help you.’ Just don’t ignore.”
G Varoquaux 20
56. Open source and community development
Code maintenance too expensive to be alone
scikit-learn „ 300 email/month nipy „ 45 email/month
joblib „ 45 email/month mayavi „ 30 email/month
Your “benefits” come from a fraction of the code
Data loading? Maybe?
Standard algorithms? Nah
Share the common code...
...to avoid dying under code
Code becomes less precious with time
And somebody might contribute features
G Varoquaux 20
57. Community development in scikit-learn
Huge feature set:
benefits of a large team
Project growth:
More than 200 contributors
„ 12 core contributors
1 full-time INRIA programmer
from the start
Estimated cost of development: $ 6 millions
COCOMO model,
http://www.ohloh.net/p/scikit-learn
G Varoquaux 21
58. Communities: many eyes makes code fast
L. Buitinck, O. Grisel, A. Joly, G. Louppe, J. Nothman, P. Prettenhofer
G Varoquaux 22
60. What’s in a scientific-computing environment
G Varoquaux 24
61. The scientific workflow agile
Interaction...
Ñ script...
Ñ module...
ý interaction again...
Consolidation,
progressively
Low tech and short
turn-around times
G Varoquaux 25
62. Choose your weapons
Python, what else?
Interactive language
Easy to read / write
General purpose
G Varoquaux 26
63. Choose your weapons
Python, what else?
Interactive language
Easy to read / write
General purpose
Old virtual machine /
compiler
Younger languages
promissing (Julia)
but will they get
adoption beyond science?
G Varoquaux 26
64. Choose your weapons
Python, what else?
+Numpy arrays
Shoe-horn your data in a
numpy array, and you’ve won
personnally disappointed
that pandas drifted away
G Varoquaux 26
65. Software architecture for science
“Scriptability” is paramount
In an application: MVC (model, view, controller)
Model
Numerical or
data-processing
core
View
Ouput: graphs,
or files
Must enable
headless use
Controller
Input: dialogs,
or an API
Avoid input as files:
not expressive
Dialogs should never be far from the code
Dialog generation: traits, IPython widgets
Reactive programming:
dialogs modify object, and the model updates
Don’t own the main
G Varoquaux 27
66. Software architecture for science
“Scriptability” is paramount
In an application: MVC (model, view, controller)
Model
Numerical or
data-processing
core
View
Ouput: graphs,
or files
Must enable
headless use
Controller
Input: dialogs,
or an API
Avoid input as files:
not expressive
Dialogs should never be far from the code
Dialog generation: traits, IPython widgets
Reactive programming:
dialogs modify object, and the model updates
Don’t own the main
In Mayavi: script generation for free
G Varoquaux 27
68. You need quality
Quality will give you users
Bugs give you bad rap
Quality will give you developers
Contribute to learn and improve
Quality will make your developers happy
People need to be proud of their work
Do less, do better
Goes against the grant-system incentive
G Varoquaux 29
69. Quality: what & how
Great documentation
Simplify, but don’t dumb down
Focus on what the user is trying to solve
Great APIs
Example-based development
If something is hard to explain, rethink the concepts
Limit the number of different concepts and objects
Consistency, consistency, consistency
Good numerics
Write tests based on mathematical properties
When a user finds an instability, write a new test
G Varoquaux 30
70. Quality: what & how
Great documentation
Simplify, but don’t dumb down
Focus on what the user is trying to solve
Great APIs
Example-based development
If something is hard to explain, rethink the concepts
Limit the number of different concepts and objects
Consistency, consistency, consistency
Good numerics
Write tests based on mathematical properties
When a user finds an instability, write a new test
Quality enables reuse
Beyond mere reproducibility
G Varoquaux 30
72. Be productive
“If you spend too much time thinking about a
thing, you’ll never get it done.” — Bruce Lee
G Varoquaux 31
73. Limited resources
Limited resources are good
Need success in the short term, not the long term
The startup culture: fail fast
Quickly identify non-viable projects
The simpest solution that works is the best
G Varoquaux 32
74. Short cycles, limited ambitions
Keep coming back to your users
Release early, release often
G Varoquaux 33
75. Simplicity
Complexity increase superlinearly
[An Experiment on Unit Increase in Problem Complexity,
Woodfield 1979]
25% increase in problem complexity
ñ 100% increase in code complexity
G Varoquaux 34
76. Simplicity
Complexity increase superlinearly
[An Experiment on Unit Increase in Problem Complexity,
Woodfield 1979]
25% increase in problem complexity
ñ 100% increase in code complexity
The 80/20 rule
80% of the usecases can be solved
with 20% of the lines of code
Avoid feature creep
G Varoquaux 34
77. Simplicity
Complexity increase superlinearly
[An Experiment on Unit Increase in Problem Complexity,
Woodfield 1979]
25% increase in problem complexity
ñ 100% increase in code complexity
The 80/20 rule
80% of the usecases can be solved
with 20% of the lines of code
Avoid feature creep
Use objects sparingly
Don’t use classes for the sake of it
G Varoquaux 34
79. Software engineering good practices
Version control
Use git + github
Unit testing
If it’s not tested, it’s broken or soon will be.
Make a package,
with controlled dependencies and compilation
...
G Varoquaux 36
80. Research ‰ production
Need to adapt software-engineering principles
Ó
Good naming is free
Use functions, not scripts
Version control is very cheap
Tests are more expensive... Considering if goals
stabilizes
Build chains are hard
Go down the chain as your research progress
You can think of shipping a software only if it was
viable to go completely down the chain
G Varoquaux 37
82. Mayavi: 3D visualization in Python
Success factors
Building upon VTK Great power
Component model (UI)
Internals open to the world
ñ from interaction to scripting
Limiting factors
Building upon VTK A lot of complexity
Codebase too complex and object-oriented
(bound to VTK)
Users of GUIs do not turn into developers
Composition is an API killer
G Varoquaux 39
83. Mayavi: 3D visualization in Python
Success factors
Building upon VTK Great power
Component model (UI)
Internals open to the world
ñ from interaction to scripting
Limiting factors
Building upon VTK A lot of complexity
Codebase too complex and object-oriented
(bound to VTK)
Users of GUIs do not turn into developers
Composition is an API killer
G Varoquaux 39
84. joblib: computational workflow patterns
Parallel for loop
>>> from joblib import Parallel, delayed
>>> Parallel(n jobs=2)(delayed(sqrt)(i**2)
... for i in range(8))
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
On-demand dispatch to ease memory consumption
Threading and processes backends
G Varoquaux 40
85. joblib: computational workflow patterns
Parallel for loop
>>> from joblib import Parallel, delayed
>>> Parallel(n jobs=2)(delayed(sqrt)(i**2)
... for i in range(8))
[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
Memoize pattern
mem = joblib.Memory(cachedir=’.’)
g = mem.cache(f)
b = g(a) # computes a using f
c = g(a) # retrieves results from store
G Varoquaux 40
86. joblib: computational workflow patterns
Success factors
Simplicity of use
Patterns we really, really need (pull not push)
G Varoquaux 41
87. joblib: computational workflow patterns
Success factors
Simplicity of use
Patterns we really, really need (pull not push)
Limiting factor
Vision of the project unclear
Positioning with regards to landscape unclear
(IPython, where are you headed?)
Tricky code inside
G Varoquaux 41
88. scikit-learn: machine learning in Python
Success factors
Right project vision
Machine learning without learning the machinery
Black box that can be opened
Right trade-off between ”just works” and versatility
(think Apple vs Linux)
We’re not going to solve all the problems for you
I don’t solve hard problems
Feature-engineering, domain-specific cases...
Python is a programming language. Use it.
Cover all the 80% usecases in one package
G Varoquaux 42
89. scikit-learn: machine learning in Python
Success factors
Right project vision
High-level programming
- Optimize algorithmes, not for loops
- Know perfectly Numpy and scipy
All significant data should be in arrays
Avoid memory copies, rely on blas/lapack
- Use Cython, quad not C/C++
G Varoquaux 42
91. scikit-learn: machine learning in Python
Success factors
Right project vision
High-level programming
Good API design
- separate data from operations
- Object API exposes a data-processing language
fit, predict, transform, score, partial fit
- Instantiated without data but with all parameters
G Varoquaux 42
92. scikit-learn: machine learning in Python
Success factors
Right project vision
High-level programming
Good API design
Great community
- Github + code review
G Varoquaux 42
93. scikit-learn: machine learning in Python
Success factors
Right project vision
High-level programming
Good API design
Great community
Great documentation
G Varoquaux 42
94. scikit-learn: machine learning in Python
Success factors
Right project vision
High-level programming
Good API design
Great community
Great documentation
Limiting factors
Tricky numerical code
Our own success ñ huge volume
G Varoquaux 42
95. @GaelVaroquaux
Succeeding in academia despite doing good software
1 Game the system
It’s about convincing a tenure committee
Code must contribute to a scientific problem
96. @GaelVaroquaux
Succeeding in academia despite doing good software
1 Game the system
2 Not all battles can be fought
Make sure that there is a market
Don’t solve hard problems
Problems that matter for science and industry
97. @GaelVaroquaux
Succeeding in academia despite doing good software
1 Game the system
2 Not all battles can be fought
3 Make good software
That actually answers scientists needs
With quality, software engineering
Relying on a communauty
Usability matters
98. @GaelVaroquaux
Succeeding in academia despite doing good software
1 Game the system
2 Not all battles can be fought
3 Make good software
Now go out, and code!