Correlation Does Not Mean
Causation
Testing insights into DataOps, Big Data Analytics, and AI
Peter Varhol
About me
• International speaker and writer
• Graduate degrees in Math, CS, Psychology
• Technology communicator
• AWS certified
• Former university professor, tech journalist
• Cat owner and distance runner
• peter@petervarhol.com
What You Will Learn
• How AI systems make the determinations they do based on data.
• Why big data is so important in analytics and AI.
• What we are actually learning when we work with AI and analytics
systems.
Agenda
• The Evolution of data
• The Role of DataOps
• Logistics of Big Data
• Using data to train machine learning
systems
• Bias in data
• Summary
The Evolution of Data
• Thirty years ago
• Hardware was king
• Twenty years ago
• Software ruled the roost
• Ten years ago
• Hardware and software went to the cloud
• Today
• Nothing matters but data
How Did This Happen?
• Prices fell with commodity hardware
• Storage became much less expensive
• We developed better software abstractions
• Operating systems became standardized
• Nicholas Carr was wrong – software did matter
• Business decision-makers became comfortable with data
• “Gut feel” is no longer an acceptable basis for decision-making
How Did This Happen?
• Storage is cheap
• We can easily store and retrieve terabytes of data
• Processing power is fast
• It doesn’t take long to operate on large datasets
• Data can produce information
• Decision-making became more refined
What Does This Mean?
• The business is now using data as an integral part of decision-making
• That data is often in real time
• Data is also critical to machine learning applications
• IT has to keep data up to date and clean
• Old data is worse than useless
• We need a data pipeline similar to DevOps
• Data → Information, seamlessly
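
The point about keeping data up to date can be made concrete with a simple freshness check. This is a minimal sketch in Python, assuming each record carries an `updated_at` timestamp; the field name, the `stale_records` helper, and the 30-day limit are illustrative assumptions, not part of the talk.

```python
from datetime import datetime, timedelta, timezone


def stale_records(records, now, max_age):
    """Return the records whose `updated_at` is older than `max_age`."""
    return [r for r in records if now - r["updated_at"] > max_age]


# A fixed clock keeps the example repeatable; a pipeline would use the current time.
now = datetime(2024, 1, 31, tzinfo=timezone.utc)
records = [
    {"id": 1, "updated_at": datetime(2024, 1, 30, tzinfo=timezone.utc)},
    {"id": 2, "updated_at": datetime(2023, 11, 1, tzinfo=timezone.utc)},
]

# Hypothetical policy: anything untouched for more than 30 days is stale.
stale = stale_records(records, now, max_age=timedelta(days=30))
print([r["id"] for r in stale])
```

A check like this could run as one automated step in the pipeline, so stale data is flagged before it reaches a report or a model.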
What is DataOps?
• Data collection is a natural part of business operations
• No out-of-cycle effort required
• Data collection, storage, workflow, integration, and analytics
deployment in a consistent, repeatable process
• Plus data about your data
DataOps Versus DevOps
• Data can be designed to follow flow principles similar to DevOps
• Process
• Automation
• Data production and workflow are important to effective data
consumption
• Cross-functional teams are essential in both
• Developers, testers
• DBAs, report writers
Why Would We Want To?
• Many teams don’t know how to handle big data
• Defining a practice provides guidance
• We need information in real time
• We can’t wait for the next monthly report
• It helps companies better understand their data
• Data is now front and center
Principles of DataOps
• Individuals and interactions over processes and tools
• Working analytics over comprehensive documentation
• Customer collaboration over contract negotiation
• Experimentation, iteration, and feedback over extensive upfront
design
• Cross-functional ownership of operations over siloed responsibilities
• https://www.dataopsmanifesto.org/
Why We Need DataOps
• Data is a valuable commodity
• It’s not simply a byproduct of our work
• We need to reap intelligence from data
• In real time
• With a standard process
• We must get it into the hands of those who need it
• For analysis
• For decision-making
Why We Need DataOps
• Auditability – Versioning every output and input, from source data to
data science experiments to trained model, means that you can show
exactly how the model was created and where it was implemented.
• Reliability – Deploy quickly but with increased consistency and quality
• Repeatability – Automating ensures a repeatable process
• Productivity – Providing a self-service environment with access to
curated data sets
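
One way to picture the auditability point is a lineage record that ties a trained model back to the exact inputs it was built from. The Python sketch below is only an illustration: the `fingerprint` and `lineage_record` functions and the record fields are hypothetical, not taken from any particular DataOps tool, and the byte strings stand in for real files.

```python
import hashlib
import json


def fingerprint(payload: bytes) -> str:
    """Return a SHA-256 hex digest that identifies this exact content."""
    return hashlib.sha256(payload).hexdigest()


def lineage_record(source_data: bytes, training_code: bytes, model: bytes) -> dict:
    """Build a hypothetical audit record linking inputs to the output model."""
    return {
        "source_data": fingerprint(source_data),
        "training_code": fingerprint(training_code),
        "model": fingerprint(model),
    }


# Illustrative stand-ins for files that would normally be read from disk.
record = lineage_record(b"id,sales\n1,100\n", b"fit(data)", b"model-weights-v1")
changed = lineage_record(b"id,sales\n1,101\n", b"fit(data)", b"model-weights-v1")

print(json.dumps(record, indent=2))
# Any change to the source data yields a different fingerprint.
print(record["source_data"] != changed["source_data"])
```

Storing a record like this next to each deployed model is one possible way to show later which data and code produced it.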
We’re All In This Together
• Data is a team sport
• DBA
• Report writer
• Data scientist/analyst
• Ops person
• Tester
• And more
The Logistics of Big Data
• We get data from a variety of sources
• Our own databases
• Measurement of processes
• Natural and social science
• A single source is no longer enough
• We tie together sales, weather reports, more
• This can’t be done manually
Big Data and Machine Learning
• Our intelligent systems learn through data
• The more data, the better (usually)
• Algorithms manipulate the data to draw a conclusion
• It can seem like intelligence because that’s how we make decisions
• Your algorithms are your competitive advantage
• And the better your data, the more effective your algorithms
Using Data to Train Machine Learning
Systems
• Big Data is used for “training” machine learning systems
• Data is fed through a series of nonlinear algorithms that adjust parameters in
response
• We tend to believe it infallible
• Um, no
• Data is only as good as how we select and collect it
• And results are only as good as the data
The Limitations of Data
• Data is typically a sample or representation of a real-world
circumstance
• Not necessarily exact
• And not necessarily correct
• Data can be misinterpreted
• That doesn’t mean what you think it means
Bias and Machine Learning Systems
• Worst of all, data can be biased
• It may not accurately and consistently represent the problem domain
• That’s a problem
• And all data is biased in some way
• And we need to understand our data bias
Where Do Biases Come From?
• Data selection
• We choose training data that represents only one segment of the domain
• We limit our training data to certain times or seasons
• We overrepresent one population
• Or
• The problem domain has subtly changed
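
Selection bias of the kind listed above can sometimes be caught by comparing how often each group appears in the training data with how often it is expected to appear in the real domain. The following Python sketch assumes the expected shares are known, which is often the hard part in practice; the `representation_gaps` helper, the season labels, and the 10% tolerance are all illustrative assumptions.

```python
from collections import Counter


def representation_gaps(samples, expected_shares, tolerance=0.10):
    """Compare each group's share of the training data with its expected share.

    Returns {group: (observed_share, expected_share)} for groups whose
    observed share differs from the expected one by more than `tolerance`.
    """
    total = len(samples)
    if total == 0:
        raise ValueError("samples must not be empty")
    counts = Counter(samples)
    gaps = {}
    for group, expected in expected_shares.items():
        observed = counts.get(group, 0) / total
        if abs(observed - expected) > tolerance:
            gaps[group] = (observed, expected)
    return gaps


# Hypothetical training set: the season each record was collected in.
training_seasons = ["summer"] * 70 + ["spring"] * 20 + ["autumn"] * 10
# Assumption: the real domain is spread evenly across four seasons.
expected = {"spring": 0.25, "summer": 0.25, "autumn": 0.25, "winter": 0.25}

gaps = representation_gaps(training_seasons, expected)
for season, (observed, wanted) in sorted(gaps.items()):
    print(f"{season}: observed {observed:.2f}, expected {wanted:.2f}")
```

Here summer is overrepresented and winter is missing entirely, which matches the "certain times or seasons" case on the slide. A check like this cannot detect a domain that has subtly changed; that needs fresh reference data.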
Where Do Biases Come From?
• Latent bias
• Concepts become incorrectly correlated
• Correlation does not mean causation
• But it is high enough to believe
• We could be promoting stereotypes
• This describes Amazon’s problem
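
A small simulation shows how two things can be strongly correlated without either one causing the other. The Python sketch below uses synthetic data only: a hidden common cause (labeled temperature here) drives two outcomes that have no direct link. The variable names and noise levels are made up for illustration and are unrelated to the examples named on the slide.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(42)  # fixed seed so the run is repeatable
n = 2000

# A hidden common cause (hypothetical example: daily temperature).
temperature = [rng.gauss(0, 1) for _ in range(n)]
# Two outcomes that each depend on temperature but not on each other.
ice_cream_sales = [t + 0.3 * rng.gauss(0, 1) for t in temperature]
sunburn_cases = [t + 0.3 * rng.gauss(0, 1) for t in temperature]

raw = pearson(ice_cream_sales, sunburn_cases)

# Remove the known effect of the common cause and correlate what is left.
sales_residual = [s - t for s, t in zip(ice_cream_sales, temperature)]
burn_residual = [b - t for b, t in zip(sunburn_cases, temperature)]
adjusted = pearson(sales_residual, burn_residual)

print(f"raw correlation: {raw:.2f}")
print(f"after accounting for temperature: {adjusted:.2f}")
```

The raw correlation is high, yet it largely disappears once the common cause is accounted for. A model trained only on the two outcomes has no way to know that, which is how a correlation that is "high enough to believe" ends up encoded as if it were a rule.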
Where Do Biases Come From?
• Interaction bias
• We may focus on keywords that users apply incorrectly
• Users incorporate slang or unusual words
• “That’s bad, man”
• The story of Microsoft Tay
• It wasn’t bad, it was trained that way
Why Does Bias Matter?
• Wrong answers
• Often with no recourse
• Subtle discrimination (legal or illegal)
• And no one knows it
• Suboptimal results
• We’re not getting it right often enough
• Although bias may also have value
Delivering in the Clutch
• Machines treat all events as equal
• Humans recognize the importance of some events
• And sometimes can rise to the occasion
• There is no mechanism for code to do this
• We could have data and algorithms to recognize
the importance of a specific event
• But the software cannot “improve” its answer
• This is less a bias than an inherent weakness
The Human in the Loop
• We don’t understand complex software systems
• Disasters often happen because software behaves in unexpected
ways
• Human oversight may prevent disasters, or wrong decisions
• Can we overcome human bias?
• The problem is that machines respond too quickly
• In many cases, there is not enough time for human oversight
• Aircraft, autonomous vehicles need to respond instantly
Where Testing Fits In
• Data must be accurate
• How do we make it so?
• Humans need to be proactive
• We test – objectively
• We anticipate
• Bias
• Wrong answers
• Puzzles
• Ensuring data represents the problem domain
How to Test
• Many scenarios
• Hundreds or thousands
• With detailed documentation
• Edge cases
• The data may not be there for them
• Think outside the box
• Try to create a model from test results
• I understand how this works
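
The "many scenarios" and "edge cases" advice maps naturally onto a table of documented scenarios that is run against the system under test. The Python sketch below assumes a hypothetical decision function, `approve_loan`, standing in for a trained model; the scenarios, the 0.4 threshold, and the expected answers are invented for illustration.

```python
def approve_loan(income: float, debt: float) -> bool:
    """Hypothetical stand-in for a trained model's decision function."""
    if income <= 0:
        return False
    return debt / income < 0.4


# Each scenario: (description, inputs, expected decision).
scenarios = [
    ("typical applicant", {"income": 50_000, "debt": 10_000}, True),
    ("heavily indebted", {"income": 50_000, "debt": 30_000}, False),
    ("edge: zero income", {"income": 0, "debt": 0}, False),
    ("edge: exactly at threshold", {"income": 100, "debt": 40}, False),
    ("edge: negative debt (bad data)", {"income": 50_000, "debt": -1}, False),
]


def run_scenarios(model, cases):
    """Run every scenario and return the ones where the model disagrees."""
    failures = []
    for description, inputs, expected in cases:
        actual = model(**inputs)
        if actual != expected:
            failures.append((description, inputs, expected, actual))
    return failures


failures = run_scenarios(approve_loan, scenarios)
for description, inputs, expected, actual in failures:
    print(f"FAIL {description}: {inputs} expected {expected}, got {actual}")
print(f"{len(scenarios) - len(failures)} of {len(scenarios)} scenarios passed")
```

In this sketch the only failure is the bad-data edge case, an input the stand-in model was never designed for. Keeping the description next to each case doubles as the detailed documentation the slide asks for, and the table can grow to hundreds or thousands of rows without changing the runner.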
Conclusions
• Data is central to all applications
• Big data is the norm
• Managed by DataOps
• But data can’t make our decisions for us
• Put data in its proper role
• But the burden is on us
• How can we respond when response time is in seconds?
Thank You
• Peter Varhol
peter@petervarhol.com


Editor's notes

  1. DataOps Principles
     1. Continually satisfy your customer: Our highest priority is to satisfy the customer through the early and continuous delivery of valuable analytic insights from a couple of minutes to weeks.
     2. Value working analytics: We believe the primary measure of data analytics performance is the degree to which insightful analytics are delivered, incorporating accurate data, atop robust frameworks and systems.
     3. Embrace change: We welcome evolving customer needs, and in fact, we embrace them to generate competitive advantage. We believe that the most efficient, effective, and agile method of communication with customers is face-to-face conversation.
     4. It's a team sport: Analytic teams will always have a variety of roles, skills, favorite tools, and titles. A diversity of backgrounds and opinions increases innovation and productivity.
     5. Daily interactions: Customers, analytic teams, and operations must work together daily throughout the project.
     6. Self-organize: We believe that the best analytic insight, algorithms, architectures, requirements, and designs emerge from self-organizing teams.
     7. Reduce heroism: As the pace and breadth of need for analytic insights ever increases, we believe analytic teams should strive to reduce heroism and create sustainable and scalable data analytic teams and processes.
     8. Reflect: Analytic teams should fine-tune their operational performance by self-reflecting, at regular intervals, on feedback provided by their customers, themselves, and operational statistics.
     9. Analytics is code: Analytic teams use a variety of individual tools to access, integrate, model, and visualize data. Fundamentally, each of these tools generates code and configuration which describes the actions taken upon data to deliver insight.
     10. Orchestrate: The beginning-to-end orchestration of data, tools, code, environments, and the analytic teams work is a key driver of analytic success.
     11. Make it reproducible: Reproducible results are required and therefore we version everything: data, low-level hardware and software configurations, and the code and configuration specific to each tool in the toolchain.
     12. Disposable environments: We believe it is important to minimize the cost for analytic team members to experiment by giving them easy to create, isolated, safe, and disposable technical environments that reflect their production environment.
     13. Simplicity: We believe that continuous attention to technical excellence and good design enhances agility; likewise simplicity--the art of maximizing the amount of work not done--is essential.
     14. Analytics is manufacturing: Analytic pipelines are analogous to lean manufacturing lines. We believe a fundamental concept of DataOps is a focus on process-thinking aimed at achieving continuous efficiencies in the manufacture of analytic insight.
     15. Quality is paramount: Analytic pipelines should be built with a foundation capable of automated detection of abnormalities (jidoka) and security issues in code, configuration, and data, and should provide continuous feedback to operators for error avoidance (poka yoke).
     16. Monitor quality and performance: Our goal is to have performance, security and quality measures that are monitored continuously to detect unexpected variation and generate operational statistics.
     17. Reuse: We believe a foundational aspect of analytic insight manufacturing efficiency is to avoid the repetition of previous work by the individual or team.
     18. Improve cycle times: We should strive to minimize the time and effort to turn a customer need into an analytic idea, create it in development, release it as a repeatable production process, and finally refactor and reuse that product.
  2. Why We Need DataOps
     • Auditability – Versioning every output and input, from source data to data science experiments to trained model, means that you can show exactly how the model was created and where it was implemented.
     • Reliability – Incorporating MLOps gives you the ability not just to deploy quickly but to do so with increased consistency and quality.
     • Repeatability – Automating every process helps ensure it is repeatable, including how the machine learning model is deployed, evaluated, trained, and versioned.
     • Productivity – Providing a self-service environment with access to curated data sets allows data scientists and data engineers to waste less time on invalid or missing data and move faster.
     Read more: https://techbullion.com/a-basic-guide-to-understanding-machine-learning-operations/?utm_content=151262637&utm_medium=social&utm_source=linkedin&hss_channel=lcp-28618310