1. Correlation Does Not Mean Causation
Testing insights into DataOps, Big Data Analytics, and AI
Peter Varhol
2. About me
• International speaker and writer
• Graduate degrees in Math, CS, Psychology
• Technology communicator
• AWS certified
• Former university professor, tech journalist
• Cat owner and distance runner
• peter@petervarhol.com
3. What You Will Learn
• How AI systems make the determinations they do based on data.
• Why big data is so important in analytics and AI.
• What we are actually learning when we work with AI and analytics
systems.
4. Agenda
• The Evolution of Data
• The Role of DataOps
• Logistics of Big Data
• Using data to train machine learning
systems
• Bias in data
• Summary
5. The Evolution of Data
• Thirty years ago
• Hardware was king
• Twenty years ago
• Software ruled the roost
• Ten years ago
• Hardware and software went to the cloud
• Today
• Nothing matters but data
6. How Did This Happen?
• Prices fell with commodity hardware
• Storage became much less expensive
• We developed better software abstractions
• Operating systems became standardized
• Nicholas Carr was wrong – software did matter
• Business decision-makers became comfortable with data
• “Gut feel” is no longer an acceptable basis for decision-making
7. How Did This Happen?
• Storage is cheap
• We can easily store and retrieve terabytes of data
• Processing power is fast
• It doesn’t take long to operate on large datasets
• Data can produce information
• Decision-making became more refined
8. What Does This Mean?
• The business is now using data as an integral part of decision-making
• That data is often in real time
• Data is also critical to machine learning applications
• IT has to keep data up to date and clean
• Old data is worse than useless
• We need a data pipeline similar to DevOps
• Turn data into information seamlessly
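As one illustration of keeping data up to date, a pipeline step can separate fresh records from stale ones before anything downstream consumes them. The sketch below is a minimal example under assumptions: the `updated_at` field, the seven-day threshold, and the function name `split_by_freshness` are hypothetical and not taken from any specific DataOps tool.

```python
from datetime import datetime, timedelta, timezone


def split_by_freshness(records, now, max_age):
    """Separate records into fresh and stale lists based on their timestamp."""
    fresh, stale = [], []
    for record in records:
        # "updated_at" is an assumed field name holding a timezone-aware datetime
        age = now - record["updated_at"]
        (fresh if age <= max_age else stale).append(record)
    return fresh, stale


if __name__ == "__main__":
    # The current time is passed in explicitly so the check is repeatable
    now = datetime(2024, 1, 10, tzinfo=timezone.utc)
    records = [
        {"id": 1, "updated_at": datetime(2024, 1, 9, tzinfo=timezone.utc)},
        {"id": 2, "updated_at": datetime(2023, 11, 1, tzinfo=timezone.utc)},
    ]
    fresh, stale = split_by_freshness(records, now, timedelta(days=7))
    print("fresh:", [r["id"] for r in fresh])
    print("stale:", [r["id"] for r in stale])
```

Passing `now` as a parameter, rather than reading the clock inside the function, keeps the step testable and repeatable.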
9. What is DataOps?
• Data collection is a natural part of business operations
• No out-of-cycle effort required
• Data collection, storage, workflow, integration, and analytics
deployment in a consistent, repeatable process
• Plus data about your data
10. DataOps Versus DevOps
• Data can be designed to follow flow principles similar to DevOps
• Process
• Automation
• Data production and workflow are important to effective data
consumption
• Cross-functional teams are essential in both
• Developers, testers
• DBAs, report writers
11. Why Would We Want To?
• Many teams don’t know how to handle big data
• Defining a practice provides guidance
• We need information in real time
• We can’t wait for the next monthly report
• It helps companies better understand their data
• Data is now front and center
12. Principles of DataOps
• Individuals and interactions over processes and tools
• Working analytics over comprehensive documentation
• Customer collaboration over contract negotiation
• Experimentation, iteration, and feedback over extensive upfront
design
• Cross-functional ownership of operations over siloed responsibilities
• https://www.dataopsmanifesto.org/
13. Why We Need DataOps
• Data is a valuable commodity
• It’s not simply a byproduct of our work
• We need to reap intelligence from data
• In real time
• With a standard process
• We must get it into the hands of those who need it
• For analysis
• For decision-making
14. Why We Need DataOps
• Auditability – Versioning every output and input, from source data to
data science experiments to trained model, means that you can show
exactly how the model was created and where it was implemented.
• Reliability – Deploy quickly but with increased consistency and quality
• Repeatability – Automating ensures a repeatable process
• Productivity – Providing a self-service environment with access to
curated data sets
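The auditability point can be sketched in code: fingerprint every input so a trained model can later be traced back to the exact data and parameters behind it. This is a minimal sketch, assuming records are JSON-serializable; the manifest layout and the names `fingerprint` and `build_manifest` are hypothetical, not part of any particular MLOps product.

```python
import hashlib
import json


def fingerprint(value):
    """Return a SHA-256 hex digest of any JSON-serializable value."""
    # sort_keys gives a canonical key order, so equal data hashes equally
    canonical = json.dumps(value, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_manifest(source_records, model_params):
    """Record which data and parameters produced a model (hypothetical layout)."""
    return {
        "data_sha256": fingerprint(source_records),
        "params_sha256": fingerprint(model_params),
        "row_count": len(source_records),
    }


if __name__ == "__main__":
    rows = [{"store": "A", "sales": 120}, {"store": "B", "sales": 95}]
    params = {"learning_rate": 0.1, "epochs": 5}
    print(json.dumps(build_manifest(rows, params), indent=2))
```

Storing a manifest like this next to each trained model is one way to show later which inputs it was built from; any change to a single value in the source data produces a different digest.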
15. We’re All In This Together
• Data is a team sport
• DBA
• Report writer
• Data scientist/analyst
• Ops person
• Tester
• And more
16. The Logistics of Big Data
• We get data from a variety of sources
• Our own databases
• Measurement of processes
• Natural and social science
• A single source is no longer enough
• We tie together sales, weather reports, more
• This can’t be done manually
17. Big Data and Machine Learning
• Our intelligent systems learn through data
• The more data, the better (usually)
• Algorithms manipulate the data to draw a conclusion
• It can seem like intelligence because that’s how we make decisions
• Your algorithms are your competitive advantage
• And the better your data, the more effective your algorithms
18. Using Data to Train Machine Learning Systems
• Big Data is used for “training” machine learning systems
• Data is fed through a series of nonlinear algorithms that adjust parameters in
response
• We tend to believe it is infallible
• Um, no
• Data is only as good as how we select and collect it
• And results are only as good as the data
19. The Limitations of Data
• Data is typically a sample or representation of a real-world
circumstance
• Not necessarily exact
• And not necessarily correct
• Data can be misinterpreted
• That doesn’t mean what you think it means
20. Bias and Machine Learning Systems
• Worst of all, data can be biased
• It may not accurately and consistently represent the problem domain
• That’s a problem
• And all data is biased in some way
• And we need to understand our data bias
21. Where Do Biases Come From?
• Data selection
• We choose training data that represents only one segment of the domain
• We limit our training data to certain times or seasons
• We overrepresent one population
• Or
• The problem domain has subtly changed
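One way to look for selection bias is to compare how often each segment appears in the training data with how often it appears in the domain. The sketch below assumes the expected population shares are known; the segment labels and the function name `representation_gaps` are hypothetical.

```python
from collections import Counter


def representation_gaps(training_groups, population_shares):
    """Return training share minus population share for each group."""
    counts = Counter(training_groups)
    total = len(training_groups)
    gaps = {}
    for group, expected in population_shares.items():
        observed = counts.get(group, 0) / total if total else 0.0
        # Positive gap: overrepresented in training; negative: underrepresented
        gaps[group] = observed - expected
    return gaps


if __name__ == "__main__":
    # Hypothetical segments: 80% of training rows come from summer
    training = ["summer"] * 80 + ["winter"] * 20
    population = {"summer": 0.5, "winter": 0.5}
    for group, gap in representation_gaps(training, population).items():
        print(f"{group}: {gap:+.2f}")
```

A large gap does not prove the model is biased, but it flags a segment worth examining before training.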
22. Where Do Biases Come From?
• Latent bias
• Concepts become incorrectly correlated
• Correlation does not mean causation
• But it is high enough to believe
• We could be promoting stereotypes
• This describes Amazon’s problem
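A small simulation can show how two concepts become strongly correlated without either causing the other. In this sketch a third variable drives both; the variables (temperature, ice cream sales, sunburn cases) and all coefficients are invented for illustration only.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def partial_correlation(xs, ys, zs):
    """Correlation of xs and ys after removing the linear effect of zs."""
    r_xy, r_xz, r_yz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def simulate(n=2000, seed=0):
    """Generate made-up data where temperature drives both other variables."""
    rng = random.Random(seed)
    temperature = [rng.uniform(10, 35) for _ in range(n)]
    ice_cream = [3 * t + rng.gauss(0, 5) for t in temperature]
    sunburn = [2 * t + rng.gauss(0, 5) for t in temperature]
    return temperature, ice_cream, sunburn


if __name__ == "__main__":
    temperature, ice_cream, sunburn = simulate()
    print("raw correlation:", round(pearson(ice_cream, sunburn), 3))
    print("controlling for temperature:",
          round(partial_correlation(ice_cream, sunburn, temperature), 3))
```

The raw correlation is high even though neither variable influences the other in the simulation; once temperature is accounted for, the relationship largely disappears. A learning system that sees only the first number has no way to tell the difference.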
23. Where Do Biases Come From?
• Interaction bias
• We may focus on keywords that users apply incorrectly
• Users incorporate slang or unusual words
• “That’s bad, man”
• The story of Microsoft Tay
• It wasn’t bad, it was trained that way
24. Why Does Bias Matter?
• Wrong answers
• Often with no recourse
• Subtle discrimination (legal or illegal)
• And no one knows it
• Suboptimal results
• We’re not getting it right often enough
• Although bias may also have value
25. Delivering in the Clutch
• Machines treat all events as equal
• Humans recognize the importance of some events
• And sometimes can rise to the occasion
• There is no mechanism for code to do this
• We could have data and algorithms to recognize
the importance of a specific event
• But the software cannot “improve” its answer
• This is less a bias than an inherent weakness
26. The Human in the Loop
• We don’t understand complex software systems
• Disasters often happen because software behaves in unexpected
ways
• Human oversight may prevent disasters, or wrong decisions
• Can we overcome human bias?
• The problem is that machines respond too quickly
• In many cases, there is not enough time for human oversight
• Aircraft, autonomous vehicles need to respond instantly
27. Where Testing Fits In
• Data must be accurate
• How do we make it so?
• Humans need to be proactive
• We test – objectively
• We anticipate
• Bias
• Wrong answers
• Puzzles
• Ensuring data represents the problem domain
28. How to Test
• Many scenarios
• Hundreds or thousands
• With detailed documentation
• Edge cases
• The data may not be there for them
• Think outside the box
• Try to create a model from test results
• I understand how this works
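The scenario-driven approach above can be sketched as a table of named cases, including edge cases, run against the system under test. Everything here is hypothetical: `toy_discount_model` is a stand-in for a trained model, and the scenarios and expected values are invented for illustration.

```python
def run_scenarios(model, scenarios):
    """Run each scenario through the model and collect the mismatches."""
    failures = []
    for name, inputs, expected in scenarios:
        try:
            actual = model(**inputs)
        except Exception as exc:  # a crash on an edge case is also a finding
            actual = f"error: {type(exc).__name__}"
        if actual != expected:
            failures.append((name, expected, actual))
    return failures


def toy_discount_model(order_total, loyalty_years):
    """Hypothetical stand-in for a trained model: returns a discount percent."""
    return min(20, loyalty_years * 2) if order_total > 0 else 0


# Each scenario: (documented name, inputs, expected result)
SCENARIOS = [
    ("typical customer", {"order_total": 100, "loyalty_years": 3}, 6),
    ("new customer", {"order_total": 50, "loyalty_years": 0}, 0),
    ("edge: very long loyalty", {"order_total": 80, "loyalty_years": 40}, 20),
    ("edge: refund order", {"order_total": -30, "loyalty_years": 5}, 0),
    ("edge: missing loyalty data", {"order_total": 60, "loyalty_years": None}, 0),
]


if __name__ == "__main__":
    for name, expected, actual in run_scenarios(toy_discount_model, SCENARIOS):
        print(f"FAILED {name}: expected {expected!r}, got {actual!r}")
```

In practice the table would hold hundreds or thousands of scenarios; the value lies in the documented names and in the edge cases for which no training data may exist.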
29. Conclusions
• Data is central to all applications
• Big data is the norm
• Managed by DataOps
• But data can’t make our decisions for us
• Put data in its proper role
• But the burden is on us
• How can we respond when response time is in seconds?
DataOps Principles
1. Continually satisfy your customer:
Our highest priority is to satisfy the customer through the early and continuous delivery of valuable analytic insights from a couple of minutes to weeks.
2. Value working analytics:
We believe the primary measure of data analytics performance is the degree to which insightful analytics are delivered, incorporating accurate data, atop robust frameworks and systems.
3. Embrace change:
We welcome evolving customer needs, and in fact, we embrace them to generate competitive advantage. We believe that the most efficient, effective, and agile method of communication with customers is face-to-face conversation.
4. It's a team sport:
Analytic teams will always have a variety of roles, skills, favorite tools, and titles. A diversity of backgrounds and opinions increases innovation and productivity.
5. Daily interactions:
Customers, analytic teams, and operations must work together daily throughout the project.
6. Self-organize:
We believe that the best analytic insight, algorithms, architectures, requirements, and designs emerge from self-organizing teams.
7. Reduce heroism:
As the pace and breadth of need for analytic insights ever increases, we believe analytic teams should strive to reduce heroism and create sustainable and scalable data analytic teams and processes.
8. Reflect:
Analytic teams should fine-tune their operational performance by self-reflecting, at regular intervals, on feedback provided by their customers, themselves, and operational statistics.
9. Analytics is code:
Analytic teams use a variety of individual tools to access, integrate, model, and visualize data. Fundamentally, each of these tools generates code and configuration which describes the actions taken upon data to deliver insight.
10. Orchestrate:
The beginning-to-end orchestration of data, tools, code, environments, and the analytic team's work is a key driver of analytic success.
11. Make it reproducible:
Reproducible results are required and therefore we version everything: data, low-level hardware and software configurations, and the code and configuration specific to each tool in the toolchain.
12. Disposable environments:
We believe it is important to minimize the cost for analytic team members to experiment by giving them easy to create, isolated, safe, and disposable technical environments that reflect their production environment.
13. Simplicity:
We believe that continuous attention to technical excellence and good design enhances agility; likewise simplicity--the art of maximizing the amount of work not done--is essential.
14. Analytics is manufacturing:
Analytic pipelines are analogous to lean manufacturing lines. We believe a fundamental concept of DataOps is a focus on process-thinking aimed at achieving continuous efficiencies in the manufacture of analytic insight.
15. Quality is paramount:
Analytic pipelines should be built with a foundation capable of automated detection of abnormalities (jidoka) and security issues in code, configuration, and data, and should provide continuous feedback to operators for error avoidance (poka yoke).
16. Monitor quality and performance:
Our goal is to have performance, security and quality measures that are monitored continuously to detect unexpected variation and generate operational statistics.
17. Reuse:
We believe a foundational aspect of analytic insight manufacturing efficiency is to avoid the repetition of previous work by the individual or team.
18. Improve cycle times:
We should strive to minimize the time and effort to turn a customer need into an analytic idea, create it in development, release it as a repeatable production process, and finally refactor and reuse that product.
Auditability – Versioning every output and input, from source data to data science experiments to trained model, means that you can show exactly how the model was created and where it was implemented.
Reliability – Incorporating MLOps lets you deploy not just quickly but with increased consistency and quality.
Repeatability – Automating every process helps ensure a repeatable process, including how the machine learning model is deployed, evaluated, trained, and versioned.
Productivity – Providing a self-service environment with access to curated data sets allows data scientists and data engineers to waste less time on invalid or missing data and move faster.
Read more: https://techbullion.com/a-basic-guide-to-understanding-machine-learning-operations/?utm_content=151262637&utm_medium=social&utm_source=linkedin&hss_channel=lcp-28618310