Coding for science and innovation

Slides for my keynote at SciPy 2017

https://youtu.be/eVDDL6tgsv8

Computing has been driving a revolution in how science and technology can solve new problems. Python has grown to be a central player in this game, from computational physics to data science. I would like to explore some lessons learned doing science with Python as well as writing Python libraries for science. What ingredients do scientists need? What technical and project-management choices drove the success of the projects I've been involved with? How do these demands and offers shape our ecosystem?

In this talk, I'd like to share a few thoughts on how we code for science and innovation, with the modest goal of changing the world.

Coding for science and innovation

  1. 1. Coding for science and innovation to change the world! Gaël Varoquaux
  2. 2. Science The process of discovering knowledge and mechanisms Computing is a central part of how we do science G Varoquaux 2
  3. 3. Science The process of discovering knowledge and mechanisms Computing is a central part of how we do science Science + Computers = Computational science Nuclear physics Fluid dynamics Chemistry G Varoquaux 2
  4. 4. Science The process of discovering knowledge and mechanisms Computing is a central part of how we do science Science + Computers = Computational science Psychology G Varoquaux 2
  5. 5. Science The process of discovering knowledge and mechanisms Computing is a central part of how we do science Science + Computers = Computational science Psychology Marketing Data science: using data to acquire insights G Varoquaux 2
  6. 6. Science The process of discovering knowledge and mechanisms “Science is not a political construct or a belief system. Scientific progress depends on openness, transparency, and the free flow of ideas and people.” (Dr. Rush Holt, CEO of AAAS, testimony to the House Committee on Science, Space, and Technology, Feb 8, 2017) G Varoquaux 3
  7. 7. Science The process of discovering knowledge and mechanisms Science helps shape society Growth in a time of debt [Reinhart & Rogoff 2010]: Wrong conclusions due to flawed Excel processing ⇒ Public debt blamed for financial crisis (Osborne UK MP) Autism and vaccines: forged study [Wakefield et al, Lancet 1998] ⇒ Drop in vaccination, measles outbreak Loss of trust in science is very costly G Varoquaux 3
  8. 8. Innovation Putting the right technology to the right use G Varoquaux 4
  9. 9. Innovation Putting the right technology to the right use Light bulb: Invented ∼ 1835 by Lindsay Extra progress: vacuum pumps (Swan ∼ 1880) Economics: availability of electric power ⇒ Edison’s company G Varoquaux 4
  10. 10. Innovation Putting the right technology to the right use Light bulb: Invented ∼ 1835 by Lindsay Extra progress: vacuum pumps (Swan ∼ 1880) Economics: availability of electric power ⇒ Edison’s company Outbox: company digitizing physical mail But citizens aren’t the USPS customers, junk mailers are ⇒ No cooperation from USPS, Outbox dies Power balances drive innovation as much as technology G Varoquaux 4
  11. 11. Coding for science and innovation: Computing is the new electricity: a driver for change With new data sources, it reaches beyond physics & engineering G Varoquaux 5
  12. 12. Coding for science and innovation: 1 Coding as a scientist 2 Building software for science 3 An ecosystem G Varoquaux 6
  13. 13. 1 Coding as a scientist G Varoquaux 7
  14. 14. 1 Data in brain sciences The mental world cognition, emotions autism, depression Historically studied via verbal interactions Psychology G Varoquaux 8
  15. 15. 1 Data in brain sciences The mental world cognition, emotions autism, depression Historically studied via verbal interactions The brain an organ: neurons, firing Imaging brain activity Quantitative data G Varoquaux 8
  16. 16. 1 One example of our work: biomarkers of Autism [Abraham...Varoquaux, 2017] Comparing the brain activity of many subjects Supervised machine learning to discriminate Autism G Varoquaux 9
  17. 17. 1 One example of our work: biomarkers of Autism [Abraham...Varoquaux, 2017] 1. Extract brain networks Unsupervised feature learning complex model fit to 1Tb data G Varoquaux 9
  18. 18. 1 One example of our work: biomarkers of Autism [Abraham...Varoquaux, 2017] 1. Extract brain networks 2. Per-subject connections Information geometry, Lie algebra... G Varoquaux 9
  19. 19. 1 One example of our work: biomarkers of Autism [Abraham...Varoquaux, 2017] 1. Extract brain networks 2. Per-subject connections 3. Supervised learning Scikit-learn G Varoquaux 9
  20. 20. 1 One example of our work: biomarkers of Autism [Abraham...Varoquaux, 2017] 1. Extract brain networks 2. Per-subject connections 3. Supervised learning Scikit-learn Limits to impact: Cannot outperform clinicians that define Autism/Control Psychiatrists unhappy with current blurry definition But not ready to accept black-box algorithmic definition G Varoquaux 9
  21. 21. 1 One example of our work: biomarkers of Autism [Abraham...Varoquaux, 2017] 1. Extract brain networks 2. Per-subject connections 3. Supervised learning Scikit-learn Limits to impact: Cannot outperform clinicians that define Autism/Control Psychiatrists unhappy with current blurry definition But not ready to accept black-box algorithmic definition Lots of moving parts Practitioners need to make the tools theirs G Varoquaux 9
  22. 22. 1 A quest for trust: reproducible research “if it’s not open and verifiable by others, it’s not science, or engineering, or whatever it is you call what we do” (V. Stodden, The scientific method in practice) Computational reproducibility: Automate everything Control the environment G Varoquaux 10
  23. 23. 1 Automate everything Just a simple matter of programming G Varoquaux 11
  24. 24. 1 Automate everything... Some operations work better with a human in the loop Scientific research is an iterative process Tension between needs for interaction and replay G Varoquaux 11
  25. 25. 1 Automate everything... Some operations work better with a human in the loop Scientific research is an iterative process Tension between needs for interaction and replay Mayavi Reflexivity between dialogs and objects Record mode G Varoquaux 11
  26. 26. 1 Automate everything... Some operations work better with a human in the loop Scientific research is an iterative process Tension between needs for interaction and replay Jupyter, and its widgets: Exploring the space between interaction and code G Varoquaux 11
  27. 27. 1 Beyond computational reproducibility Make every computational step reproducible, and good science will emerge G Varoquaux 12
  28. 28. 1 Beyond computational reproducibility Make every computational step reproducible, and good science will emerge Estimating the reproducibility of psychological science [Science 2015] 36% of effects replicate Reasons: Statistical challenges: analysis degrees of freedom Weak incentives: winner’s curse in publication Seldom computational reproducibility G Varoquaux 12
  29. 29. 1 Beyond computational reproducibility Make every computational step reproducible, and good science will emerge Estimating the reproducibility of psychological science [Science 2015] 36% of effects replicate Reasons: Statistical challenges: analysis degrees of freedom Weak incentives: winner’s curse in publication Seldom computational reproducibility I think that reproducibility is a misnomer. What matters is that operations be verifiable or reusable. G Varoquaux 12
  30. 30. In practice, the best way to improve research is to use the right (conceptual) tools. G Varoquaux 13
  31. 31. 1 Managing complexity In practice, the best way to improve research is to use the right (conceptual) tools. The everyday roadblock is cognitive load Machine learning, brain anatomy, psychology R, Python, shell scripts Funding agencies, reviewer 3, courting VCs G Varoquaux 14
  32. 32. Coding as a scientist Final code should be auditable, ideally reusable Tension between interactive computing & automating Main enemy: cognitive overload G Varoquaux 15
  33. 33. Coding as a scientist Final code should be auditable, ideally reusable Tension between interactive computing & automating Main enemy: cognitive overload In the industry Reusable Verifiable? Not for Silicon Valley, but in insurance, healthcare, banking... Moving data-scientist code to production? Software projects going over budget? G Varoquaux 15
  34. 34. Code quality in exploratory work Use pyflakes in your editor seriously Coding convention, good naming Version control Use git + github Code review Unit testing If it’s not tested, it’s broken or soon will be Make a package controlled dependencies and compilation ... G Varoquaux 16
  35. 35. Code quality in exploratory work (listed by increasing cost) Use pyflakes in your editor seriously Coding convention, good naming Version control Use git + github Code review Unit testing If it’s not tested, it’s broken or soon will be Make a package controlled dependencies and compilation ... Avoid premature software engineering G Varoquaux 16
  36. 36. Code quality in exploratory work (listed by increasing cost) Use pyflakes in your editor seriously Coding convention, good naming Version control Use git + github Code review Unit testing If it’s not tested, it’s broken or soon will be Make a package controlled dependencies and compilation ... Avoid premature software engineering Over versus under engineering Goal is generating insights / moving in new spaces Experimentation for intuitions and proofs of concepts ⇒ new ideas As the path becomes clear: consolidation is great for that Heavy engineering too early freezes bad ideas G Varoquaux 16
  37. 37. 2 Building software for science The point of view of the developer Libraries are what enables us to scale: Abstractions reduce cognitive load Code reuse gets us further G Varoquaux 17
  38. 38. 2 Examples of such libraries scikit-learn Make research in machine-learning models and algorithms usable by people who do not understand them nilearn Make it easy to answer neuroimaging problems with them G Varoquaux 18
  39. 39. 2 Examples of such libraries scikit-learn Make research in machine-learning models and algorithms usable by people who do not understand them Challenges: Variety of that space Statistical concepts coding concepts nilearn Make it easy to answer neuroimaging problems with them Challenges: Onboarding technology-averse users G Varoquaux 18
  40. 40. 2 Tools that reduce cognitive overload It’s a design problem G Varoquaux 19
  41. 41. 2 Tools that reduce cognitive overload Jonathan Ive, an industrial designer, is #4 at Apple Code different. G Varoquaux 20
  42. 42. 2 Some API design principles for the scipy stack Consistency, consistency, consistency Functions are easier to understand than classes A library should hinge on a small number of concepts Common data containers make the ecosystem stronger Each function should have one and only one purpose Code for interfaces, but don’t overdo duck typing Properties are for impedance matching Shallow is better than deep Error messages matter Be Pythonic G Varoquaux 21
  43. 43. 2 Some API design principles for the scipy stack Consistency, consistency, consistency np.save(file, obj) vs pickle.dump(obj, file); fmin(..., maxiter=10) vs lsq_linear(..., max_iter=10): inconsistency creates cognitive overload Functions are easier to understand than classes A library should hinge on a small number of concepts Common data containers make the ecosystem stronger Each function should have one and only one purpose Code for interfaces, but don’t overdo duck typing Properties are for impedance matching Shallow is better than deep Error messages matter Be Pythonic G Varoquaux 22
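As an illustration of the consistency principle on this slide, here is a minimal sketch of two root finders that share the same argument order and the same keyword names. The functions `bisect_root` and `secant_root` are hypothetical and are not part of SciPy; only the idea of aligned signatures comes from the slide.

```python
def bisect_root(func, low, high, max_iter=50, tol=1e-8):
    """Hypothetical bisection solver: func must change sign on [low, high]."""
    for _ in range(max_iter):
        mid = (low + high) / 2
        if abs(func(mid)) < tol:
            return mid
        # Keep the half of the interval on which the sign changes.
        if func(low) * func(mid) < 0:
            high = mid
        else:
            low = mid
    return (low + high) / 2


def secant_root(func, low, high, max_iter=50, tol=1e-8):
    """Hypothetical secant solver with the same signature as bisect_root."""
    x0, x1 = low, high
    for _ in range(max_iter):
        f0, f1 = func(x0), func(x1)
        # Stop on convergence, or when the secant slope would be zero.
        if abs(f1) < tol or f1 == f0:
            break
        x0, x1 = x1, x1 - f1 * (x1 - x0) / (f1 - f0)
    return x1


def parabola(x):
    return x * x - 2


# Same positional order and same keyword names: swapping solvers is a
# one-word change, with nothing new to memorize.
root_a = bisect_root(parabola, 0.0, 2.0, max_iter=100)
root_b = secant_root(parabola, 0.0, 2.0, max_iter=100)
```

Because both functions accept `max_iter` and `tol` in the same positions, a user who has learned one has effectively learned the other.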
  44. 44. 2 Some API design principles for the scipy stack Consistency, consistency, consistency Functions are easier to understand than classes Objects have hidden states, Objects have no universal interface, entry point, output A library should hinge on a small number of concepts Common data containers make the ecosystem stronger Each function should have one and only one purpose Code for interfaces, but don’t overdo duck typing Properties are for impedance matching Shallow is better than deep Error messages matter Be Pythonic G Varoquaux 23
  45. 45. 2 Some API design principles for the scipy stack Consistency, consistency, consistency Functions are easier to understand than classes A library should hinge on a small number of concepts How much do usage patterns carry over across the library? Common data containers make the ecosystem stronger Each function should have one and only one purpose Code for interfaces, but don’t overdo duck typing Properties are for impedance matching Shallow is better than deep Error messages matter Be Pythonic G Varoquaux 24
  46. 46. 2 Some API design principles for the scipy stack Consistency, consistency, consistency Functions are easier to understand than classes A library should hinge on a small number of concepts Common data containers make the ecosystem stronger Facilitates working with multiple libraries together Easier to get up to speed with a given library Each function should have one and only one purpose Code for interfaces, but don’t overdo duck typing Properties are for impedance matching Shallow is better than deep Error messages matter Be Pythonic G Varoquaux 25
  47. 47. 2 Some API design principles for the scipy stack Consistency, consistency, consistency Functions are easier to understand than classes A library should hinge on a small number of concepts Common data containers make the ecosystem stronger Each function should have one and only one purpose Change of behavior depending on input type Code for interfaces, but don’t overdo duck typing Properties are for impedance matching Shallow is better than deep Error messages matter Be Pythonic G Varoquaux 26
  48. 48. 2 Some API design principles for the scipy stack Consistency, consistency, consistency Functions are easier to understand than classes A library should hinge on a small number of concepts Common data containers make the ecosystem stronger Each function should have one and only one purpose Code for interfaces, but don’t overdo duck typing Interfaces define objects Incompatible behaviors lead to bugs (eg np.matrix) Properties are for impedance matching Shallow is better than deep Error messages matter Be Pythonic G Varoquaux 27
  49. 49. 2 Some API design principles for the scipy stack Consistency, consistency, consistency Functions are easier to understand than classes A library should hinge on a small number of concepts Common data containers make the ecosystem stronger Each function should have one and only one purpose Code for interfaces, but don’t overdo duck typing Properties are for impedance matching Properties obfuscate the data model of the object Properties can create hidden compute costs Shallow is better than deep Error messages matter Be Pythonic G Varoquaux 28
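The hidden compute cost of properties can be sketched as follows. The `Samples` class and its `n_sorts` counter are hypothetical, written only to make the cost visible; they do not come from any library named in the talk.

```python
class Samples:
    """Hypothetical container used to show the cost hidden behind a property."""

    def __init__(self, values):
        self.values = list(values)
        self.n_sorts = 0  # counts how often the data gets sorted

    @property
    def median(self):
        # Reads like a stored attribute, but sorts the data on every access.
        self.n_sorts += 1
        ordered = sorted(self.values)
        mid = len(ordered) // 2
        if len(ordered) % 2:
            return ordered[mid]
        return (ordered[mid - 1] + ordered[mid]) / 2


samples = Samples([3, 1, 2])
# Three innocent-looking attribute reads trigger three full sorts.
readings = [samples.median, samples.median, samples.median]
```

An explicit method such as `compute_median()` would signal the cost at the call site, which is one reading of why the slide reserves properties for impedance matching.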
  50. 50. 2 Some API design principles for the scipy stack Consistency, consistency, consistency Functions are easier to understand than classes A library should hinge on a small number of concepts Common data containers make the ecosystem stronger Each function should have one and only one purpose Code for interfaces, but don’t overdo duck typing Properties are for impedance matching Shallow is better than deep Objects are understood by their surface Composition creates cognitive overload Error messages matter Be Pythonic G Varoquaux 29
  51. 51. 2 Some API design principles for the scipy stack Consistency, consistency, consistency Functions are easier to understand than classes A library should hinge on a small number of concepts Common data containers make the ecosystem stronger Each function should have one and only one purpose Code for interfaces, but don’t overdo duck typing Properties are for impedance matching Shallow is better than deep Error messages matter Explain the problem Print the offending value Be Pythonic G Varoquaux 30
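A minimal sketch of the two pieces of advice on this slide, explaining the problem and printing the offending value. The function `validate_fraction` and its parameter name `test_size` are assumptions chosen for illustration.

```python
def validate_fraction(test_size):
    """Hypothetical validator: return test_size as a float in the open interval (0, 1)."""
    if isinstance(test_size, bool) or not isinstance(test_size, (int, float)):
        # Explain what was expected and show what was actually received.
        raise TypeError(
            "test_size must be a number strictly between 0 and 1, "
            f"got {test_size!r} of type {type(test_size).__name__}."
        )
    if not 0 < test_size < 1:
        raise ValueError(
            "test_size must be strictly between 0 and 1 because it is "
            f"a fraction of the data, got {test_size!r}."
        )
    return float(test_size)


try:
    validate_fraction(1.5)
except ValueError as error:
    message = str(error)
```

The message names the parameter, states the constraint and its reason, and echoes the value, so the user does not need a debugger to find out what went wrong.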
  52. 52. 2 Some API design principles for the scipy stack Consistency, consistency, consistency Functions are easier to understand than classes A library should hinge on a small number of concepts Common data containers make the ecosystem stronger Each function should have one and only one purpose Code for interfaces, but don’t overdo duck typing Properties are for impedance matching Shallow is better than deep Error messages matter Be Pythonic Avoid syntax hacks G Varoquaux 31
  53. 53. 2 Scikit-learn API Scikit-learn cheat sheet Scikit-learn Fit and predict >>> estimator = Estimator(param1=param1) >>> estimator.fit(X_train, y_train) >>> y_test = estimator.predict(X_test) Transform data >>> X_red = estimator.transform(X_test) G Varoquaux 32
  54. 54. 2 Scikit-learn API Scikit-learn cheat sheet Scikit-learn Fit and predict >>> estimator = Estimator(param1=param1) >>> estimator.fit(X_train, y_train) >>> y_test = estimator.predict(X_test) Transform data >>> X_red = estimator.transform(X_test) The estimator is a “contract” (slightly more elaborate than above) It has created an ecosystem of packages Based on duck-typing, not inheritance G Varoquaux 32
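The duck-typed contract can be sketched without importing scikit-learn at all. Both `MeanPredictor` and `fit_and_predict` are hypothetical names, and this is a simplified illustration that ignores the rest of the real estimator contract, which the slide notes is slightly more elaborate.

```python
class MeanPredictor:
    """Hypothetical estimator: no base class, only the fit/predict methods."""

    def __init__(self, shift=0.0):
        self.shift = shift  # hyperparameters are set in the constructor

    def fit(self, X, y):
        # State learned from data is stored with a trailing underscore.
        self.mean_ = sum(y) / len(y)
        return self  # returning self allows method chaining

    def predict(self, X):
        return [self.mean_ + self.shift for _ in X]


def fit_and_predict(estimator, X_train, y_train, X_test):
    """Works with any object that quacks like an estimator."""
    return estimator.fit(X_train, y_train).predict(X_test)


X_train = [[0.0], [1.0], [2.0]]
y_train = [1.0, 2.0, 3.0]
X_test = [[3.0], [4.0]]
y_pred = fit_and_predict(MeanPredictor(), X_train, y_train, X_test)
```

The helper `fit_and_predict` never checks the type of its argument, which is how third-party packages can plug into tooling built around this interface.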
  55. 55. 2 numpy arrays [Figure: array memory layout] ndarray: Abstraction over pointers & operations Contract: the memory layout IMHO, gone too far in number of methods (163) The array protocol makes it easy to quack like an array PS: The ecosystem needs categorical dtypes in numpy G Varoquaux 33
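One way to "quack like an array" is to implement the `__array__` method, one of NumPy's interoperability hooks. The sketch below assumes that this is the protocol the slide has in mind; the `Temperatures` class is hypothetical.

```python
import numpy as np


class Temperatures:
    """Hypothetical container that is not an ndarray subclass."""

    def __init__(self, celsius):
        self._celsius = list(celsius)

    def __array__(self, dtype=None, copy=None):
        # NumPy calls this hook whenever it needs an ndarray view of the object.
        return np.array(self._celsius, dtype=dtype)


readings = Temperatures([18.5, 21.0, 19.5])
as_array = np.asarray(readings)   # conversion goes through __array__
mean_value = np.mean(readings)    # NumPy functions accept the object directly
```

The container keeps its own small interface while remaining usable anywhere NumPy first converts its input to an array.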
  56. 56. 2 Example-driven development The 3-liner as the new cool Teaching others Teaching yourself Write examples that solve end problems Iterate on your API until these are simple Mayavi scikit-learn nilearn G Varoquaux 34
  57. 57. 2 Example-driven development The 3-liner as the new cool Teaching others Teaching yourself Write examples that solve end problems Iterate on your API until these are simple Mayavi scikit-learn nilearn User flow on the scikit-learn website: Examples G Varoquaux 34
  58. 58. 2 Example-driven development The 3-liner as the new cool Teaching others Teaching yourself Write examples that solve end problems Iterate on your API until these are simple Mayavi scikit-learn nilearn User flow on the nilearn website: Examples G Varoquaux 34
  59. 59. 2 Example-driven development The 3-liner as the new cool Teaching others Teaching yourself Write examples that solve end problems Iterate on your API until these are simple Mayavi scikit-learn nilearn Sphinx-gallery: compiling scripts in an examples gallery G Varoquaux 34
  60. 60. 2 Example-driven development The 3-liner as the new cool Teaching others Teaching yourself Write examples that solve end problems Iterate on your API until these are simple Mayavi scikit-learn nilearn Sphinx-gallery: compiling scripts in an examples gallery Restructured text formatting Capturing outputs Links to function docs +Creates Jupyter notebooks G Varoquaux 34
  61. 61. 2 Example-driven development The 3-liner as the new cool Teaching others Teaching yourself Write examples that solve end problems Iterate on your API until these are simple Mayavi scikit-learn nilearn Sphinx-gallery: compiling scripts in an examples gallery Insert links to examples containing a function G Varoquaux 34
  62. 62. 2 Building great documentation Focus on explaining concepts (hint: write plain English) Less is more: prioritize, avoid redundancy Code examples must be short (link to full tutorial examples) Links everywhere: users will land at the wrong place Teach with the docs Plan for maintenance of docs: Continuous integration Check links Runs examples Doctests G Varoquaux 35
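The "Doctests" item in the maintenance plan can be sketched with the standard-library `doctest` module. The function `rescale` and the helper `run_doc_examples` are hypothetical; real projects would more likely let their test runner or documentation build collect doctests.

```python
import doctest


def rescale(values, factor=2):
    """Multiply every value by ``factor``.

    >>> rescale([1, 2, 3])
    [2, 4, 6]
    >>> rescale([1.5], factor=4)
    [6.0]
    """
    return [value * factor for value in values]


def run_doc_examples(func):
    """Run the examples in func's docstring and return (failed, attempted)."""
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    for test in finder.find(func, name=func.__name__,
                            globs={func.__name__: func}):
        runner.run(test)
    results = runner.summarize(verbose=False)
    return results.failed, results.attempted


doc_status = run_doc_examples(rescale)
```

Running such a check in continuous integration means an example in the documentation fails loudly as soon as the code it describes changes behavior.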
  63. 63. 2 Reusable science scikit-learn is the new machine-learning textbook nilearn is the new neuroimaging review article Experiments reproduced at each commit eg: brain reading nilearn.github.io/auto_examples/02_decoding/plot_miyawaki_reconstruction.html G Varoquaux 36
  64. 64. 2 Reusable science scikit-learn is the new machine-learning textbook nilearn is the new neuroimaging review article Experiments reproduced at each commit eg: brain reading nilearn.github.io/auto_examples/02_decoding/plot_miyawaki_reconstruction.html Resource intensive CI: Data ⇒ Fight for good open data Computation ⇒ Find good algorithms and tradeoffs Forces us to distill the literature (as a review) G Varoquaux 36
  65. 65. 2 Reusable science scikit-learn is the new machine-learning textbook nilearn is the new neuroimaging review article Experiments reproduced at each commit eg: brain reading nilearn.github.io/auto_examples/02_decoding/plot_miyawaki_reconstruction.html Package development consolidates science and moves it outside the lab G Varoquaux 36
  66. 66. 3 An ecosystem A bird’s eye view on scientific packages G Varoquaux 37
  67. 67. 3 Packages of the Python ecosystem [Plot: number of PyPI downloads (10^4 to 10^9) against package rank (1 to 10000), log-log axes] A small number of packages are used by many 1/f distribution, preferential attachment G Varoquaux 38
  68. 68. 3 Packages of the Python ecosystem [Plot: number of PyPI downloads (10^4 to 10^9) against package rank (1 to 10000), log-log axes; labelled packages: simplejson #1, six #2, setuptools #3, numpy #49, scikit-learn #110, joblib #431, nilearn #2877] A small number of packages are used by many 1/f distribution, preferential attachment nilearn relies on scikit-learn & joblib that rely on numpy... G Varoquaux 38
  69. 69. 3 Standing on the shoulders of maintainers May 31st: pip broken https://github.com/pypa/setuptools/pull/1043 Left-pad: How left-padding strings broke the Internet A JavaScript package for left-padding strings was removed from node’s package manager, breaking all the websites that depended on it. G Varoquaux 39
  70. 70. 3 Dependencies Beyond installation, a challenge is to ensure package versions play well together: correctness of the code Breakage of backward compatibility yields irreconcilable dependencies G Varoquaux 40
  71. 71. 3 Dependencies and their upgrade It’s a fact: users hate upgrading If it ain’t broken, don’t fix it even if it is, apparently G Varoquaux 41
  72. 72. 3 Declaring undependence? Monolithic packages with no dependencies... But: Scaling is hard Complexity grows as square of codebase size [Woodfield 1979] User support grows with userbase size G Varoquaux 42
  73. 73. 3 Core software is infrastructure Everybody uses it everyday In industry, education, & research G Varoquaux 43
  74. 74. 3 Core software is infrastructure Everybody uses it everyday In industry, education, & research It needs maintenance Like roads (or OpenSSL, to prevent Heartbleed) Central infrastructure packages are “boring” They are understaffed and underfunded References: “Roads and Bridges” Ford Foundation report Excellent talk by Heather Miller https://www.youtube.com/watch?v=17yy5BwIiTw G Varoquaux 43
  75. 75. @GaelVaroquaux Coding for science and innovation New science High value of bringing new methods to a field ⇒ Enable domain-specialists Rapid iteration, but with automation & consolidation Software tools Scientists are limited by cognitive load ⇒ Design of API and documentation in libraries Libraries make science reproducible and reusable An ecosystem Central packages hold the ecosystem together Thanks to: the scipy community
