Do Your Homework! Writing tests for Data Science and Stochastic Code - David Waterman

To productionize data science work (and have it taken seriously by software engineers, CTOs, clients, or the open source community), you need to write tests! Except… how can you test code that performs nondeterministic tasks like natural language parsing and modeling? This talk presents an approach to testing probabilistic functions in code, illustrated with concrete examples written for Pytest.

Do Your Homework! Writing Tests for Data Science and Natural Language Processing
David Waterman
github.com/drwaterman/pydatadctesting

Agenda
1. Why test your code?
2. Our problem: text analysis, Natural Language Processing
3. Practical Implementation

How Does Homework Work?
Why give students homework problems and the matching solutions?

If you know where you are going, you can tell if you are moving in the right direction

Many of our daily tasks are like homework problems

Why Test Your Code
For others:
▪ Reduces bugs
▪ Improves feedback loops
▪ Speeds up iteration
▪ Makes code more reusable
▪ Increases confidence in the system
For you:
▪ Earns trust and confidence in you and your work
▪ Earns respect from devs and engineers (for whom it’s not optional)
▪ Allows you to submit to open source projects
▪ Probably required for acceptance into an external code base

Challenges
The structure of text is domain specific. Regular people don’t talk like this:
• Per the long-term agreement, WABCO will supply its single-piston air disc brake (ADB) technology, MAXX, for the manufacturing of Hyundai’s new medium-duty trucks, which are expected to start from August 2019.
• Timken has become the sole supplier of needle roller bearings to Volkswagen Transmission.
• Dana Corp. has begun supplying Ford Motor Co. with its thermoplastic cylinder-head-cover modules for the automaker's 3.0-liter Duratec V-6 engine.

Our Approach: “Gold” Tests
▪ Human reads the text
▪ Identifies relationship from text
▪ Puts relationship into machine-friendly format (JSON, YAML)
▪ Writes a test for the relationship
▪ Writes and rewrites code until it passes the test
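
The "gold" workflow above could look roughly like the following minimal sketch. The gold record layout, the `extract_supplier_relation` function, and its regular expression are hypothetical stand-ins chosen for illustration, not code from the talk's repository; the sentence is one of the examples from the Challenges slide.

```python
import json
import re

# Hypothetical gold record: a human read the sentence and wrote down the
# relationship it expresses in a machine-friendly format (JSON here).
GOLD_JSON = """
[
  {
    "text": "Timken has become the sole supplier of needle roller bearings to Volkswagen Transmission.",
    "expected": {
      "supplier": "Timken",
      "product": "needle roller bearings",
      "customer": "Volkswagen Transmission"
    }
  }
]
"""

# Illustrative pattern only; real extraction code would be far more general.
PATTERN = re.compile(
    r"^(?P<supplier>.+?) has become the sole supplier of "
    r"(?P<product>.+?) to (?P<customer>.+?)\.$"
)


def extract_supplier_relation(text):
    """Return the supplier relationship found in text, or None."""
    match = PATTERN.match(text)
    return match.groupdict() if match else None


def test_gold_relations():
    # Each gold record is a homework problem with its known solution.
    for record in json.loads(GOLD_JSON):
        assert extract_supplier_relation(record["text"]) == record["expected"]
```

Saved in a file such as `test_gold.py` (an assumed name), pytest would collect `test_gold_relations` automatically. Keeping the gold records as data means a new sentence becomes a new test case without touching the test code.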

Recommendation: Pytest as your framework
➕ More Pythonic
➕ Easy to write fast - less boilerplate
➕ Can still run unittests, doctests, and nose
➕ Readable, pretty output (including HTML reports)
➕ Great documentation & guides
➖ It’s not a builtin

What to Test
Code
▪ Expected output
▪ Invalid input
▪ Edge cases
Data
▪ Data is valid
▪ Types are correct
▪ Missing values are handled correctly
▪ Format is correct
Models
▪ Produces expected results
▪ Can be used to benchmark
▪ Monitor for model drift
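
As a rough illustration of the Data and Models checks listed above, here is a minimal sketch that uses only the standard library. The toy dataset, the threshold "model", and the `MIN_ACCURACY` value are all assumptions made up for this example. Fixing the random seed keeps the stochastic parts reproducible, and the accuracy floor doubles as a simple benchmark.

```python
import random

MIN_ACCURACY = 0.9  # hypothetical benchmark the model must keep meeting


def make_dataset(n, seed):
    """Generate toy labelled points: the label is 1 when x > 0.5."""
    rng = random.Random(seed)  # seeded, so every run sees the same data
    xs = [rng.random() for _ in range(n)]
    return [(x, int(x > 0.5)) for x in xs]


def fit_threshold(train):
    """Toy 'model': the midpoint between the two class means."""
    zeros = [x for x, y in train if y == 0]
    ones = [x for x, y in train if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2


def accuracy(threshold, data):
    """Fraction of points whose predicted label matches the true label."""
    return sum(int(x > threshold) == y for x, y in data) / len(data)


def test_data_is_valid():
    data = make_dataset(200, seed=1)
    # Types and value ranges are what downstream code expects.
    assert all(isinstance(x, float) and 0.0 <= x < 1.0 for x, _ in data)
    assert all(y in (0, 1) for _, y in data)


def test_model_meets_benchmark():
    model = fit_threshold(make_dataset(200, seed=1))
    # Assert a floor rather than an exact score, so small retraining
    # differences do not break the test but a real regression does.
    assert accuracy(model, make_dataset(200, seed=2)) >= MIN_ACCURACY
```

Running the same benchmark test against fresh data on a schedule is one way such a check could be reused to watch for model drift.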

EXAMPLES
Repo: https://github.com/drwaterman/pydatadctesting

Pytest features useful for Data Science
▪ Fixtures – For when you need something repeatedly over multiple tests (loading test data, making a connection, preprocessing data)
▪ Skip and Xfail – For when you know what to test for but the code doesn’t pass yet
▪ Comparing images/plots – Available in matplotlib
▪ Benchmarking a model
Some Nice Pytest Options
Save your pytest command line arguments in a shell script
pytest --html=test-logs/testreport.html --self-contained-html --cov=my_module --cov-report term-missing -r aPp test
▪ --html: Where to save the HTML test report (requires the pytest-html plugin)
▪ --self-contained-html: Save everything in one HTML file (no external CSS, etc.)
▪ --cov=: What modules to include in the coverage report (requires the pytest-cov plugin)
▪ --cov-report term-missing: Terminal report w/ missing line numbers
▪ -r aPp: Display test results summary at the end
▪ test: The location in which to run the tests

CONCLUSION
▪ Pytest is easy!
▪ Start now
▪ It will earn you trust and respect
▪ It is possible to use it even if your code is stochastic
Time for questions!
