
Real World End to End Machine Learning Pipeline

8,313 views

The purpose of this presentation is to highlight what end-to-end machine learning looks like in a real-world enterprise. It is meant to give insight to aspiring data scientists whose courses or education in ML focused mostly on ML algorithms rather than the end-to-end pipeline.

The architecture and components mentioned in Slide 11 will be discussed in detail in a series of posts on LinkedIn over the next few months.

To get updates, follow me on LinkedIn or search/follow the hashtag #end2endDS. Posts will start in August 2019 and continue until September 2019.

Published in: Data & Analytics

  1. End to End Machine Learning for Aspiring Data Scientist
     - Srivatsan Srinivasan
     https://www.linkedin.com/in/srivatsan-srinivasan-b8131b/
  2. Before you proceed: stop, read, and proceed on your own terms
     This presentation is not a complaint about online courses and academics. It highlights the difference between what these courses teach and what enterprises need.
     Doing data science has its own set of challenges and multiple failure points. Some of what I will share on LinkedIn covers those failure points in detail and how to overcome them.
     If you aspire to work in data science, this presentation and the series of posts I will share over the next few months will take you through the end-to-end machine learning cycle in a typical organization.
     -> Use this information to fill in the skills that can get you closer to industry needs.
     -> Use this content to define your own strategy for landing a job in the enterprise world.
     You can search for posts using the hashtag #end2endDS on LinkedIn, or follow me on LinkedIn to get updates as I post: https://www.linkedin.com/in/srivatsan-srinivasan-b8131b/
     Content on this topic will be posted between 29 July and 27 September 2019. The frequency depends purely on the bandwidth I have. On average you can expect one or at most two posts a week.
     I will also summarize key takeaways in an article and update this presentation over time.
     Not every data scientist needs to be an expert in the entire ML pipeline, but it is good to know the process. Happy learning!
  3. Courses vs Enterprise need
  4. What most online courses and academics focus on
     • ML code: XGBoost, CNN, SVM, Regression, KNN, Neural Networks, Random Forest
     • Statistical techniques
     • Basic data analysis
  5. What an enterprise production solution looks like
     Image: Hidden debt in machine learning
  6. If you view the “Data Science Hierarchy of Needs” below as hill climbing, academia puts you on top of the hill, while the real world teaches you that the climb itself is the most difficult part.
     Image source: Hackernoon
  7. Education (Courses/Academics) vs Enterprise
     • Education: Focus on model accuracy and usage of algorithms. Enterprise: Focus on deployment/integration, with a balance between accuracy and explainability.
     • Education: Focus on increasing the complexity of models for better accuracy. Enterprise: Keep it as simple as possible for as long as possible.
     • Education: Data mostly comes in a single file or a few files. Enterprise: Data comes from multiple enterprise systems and needs to be integrated, cross-referenced, and summarized.
     • Education: Data size is typically small to medium. Enterprise: Data size ranges from medium to very large.
     • Education: Data is typically 80% clean. Enterprise: Data is 80% noisy.
     • Education: Limited tools. Enterprise: More tools + DevOps + cloud + other crap.
     • Education: Do it at a decent pace. Enterprise: Agile (not now, don’t make me talk).
  8. For most online courses:
     Data Science = ML Code + Some Data Analysis
     In reality:
     Data Science = ML Code + Data Analysis + Data Collection + Data Engineering + Software Engineering + DevOps + BI Engineer + Product Manager
     Note: If you come from a premier institute that addresses all of this reality, please feel free to exit the presentation.
  9. 5 Biggest Challenges for Enterprises Deploying ML Solutions
     • Data collection
     • Deploying and reproducing the model in production
     • Model monitoring
     • Keeping the model relevant by adapting to changing business scenarios
     • Communicating and interpreting model output for various stakeholders
  10. Components of End to End Machine Learning
  11. Components of End to End Machine Learning Pipeline in Real World
      Components shown in the diagram:
      • Problem Definition
      • Business Understanding
      • Data Understanding
      • Model Integration and SLA understanding
      • Data Collection
      • Data Analysis/Cleaning
      • Data Organization and Transformation
      • Data Validation/Anomalies detection
      • Feature Engineering
      • Model Training (iterative; some steps might be optional on a case-by-case basis)
      • Model Evaluation and Validation
      • Model Explanation (Local and Global)
      • Model Deployment
      • Model Monitoring
      • Model Drift Analysis
      • Data Drift Analysis
      • Model Re-calibration (some steps might be optional on a case-by-case basis)
      • Health Dashboard, Reports & Alerts
      • Model Management and Governance
      • Data Management
      • Model and Application Logging
      • Pipeline Orchestrator
      • Infrastructure/DevOps/Automation
  12. ML Components and Skills/Role mapping
      Each entry reads: Component: primary responsibility; secondary responsibility (where listed).
      • Problem Definition: Business Owner, AI Champion; secondary: Product Owner
      • Business Understanding: Product Owner, Business Owner, AI Champion; secondary: ML Engineer
      • Data Understanding: Data Engineer, ML Engineer, Product Owner; secondary: Business Owner/Analyst
      • Model Integration and SLA understanding: ML Engineer, Data Engineer, Software Engineer; secondary: Business Owner, Product Owner
      • Data Collection: Data Engineer, Data Analyst
      • Data Analysis/Cleaning: Data Engineer, Data Analyst
      • Data Organization/Transformation: Data Engineer, ML Engineer; secondary: Data Analyst
      • Data Validation/Anomaly Detection: Data Analyst, Data Engineer
      • Feature Engineering: ML Engineer; secondary: Data Engineer
      • Model Training: ML Engineer
      • Model Evaluation/Validation: ML Engineer; secondary: Business Owner, Model Governance team
      • Model Monitoring: Operations Engineer, ML Engineer; secondary: BI Engineer
      • Model Deployment: Software Engineer, Data Engineer, ML Engineer
      • Data Drift/Model Drift: Operations Engineer, ML Engineer; secondary: BI Engineer, ML Engineer
      • Dashboard/Reports: BI Engineer; secondary: Business Owner, Product Owner
      Note: Depending on the size of the ML project, one person might play multiple roles, or a single role might require multiple people. Some roles might also be part-time, and some components can be built as a capability that is leveraged across projects.
  13. Most of the role definitions on the previous slide can be found online, so let me talk about the AI Champion, since not much is written about that role.
      The AI Champion (Head of Analytics, or sometimes the CAO) is responsible for driving intelligent insights backed by data science capability within the enterprise. The AI Champion also owns the resulting ROI or impact numbers from delivering intelligent solutions, and leads the data science team by developing policies and strategies and by propagating a culture of experimentation and research. The AI Champion and team are also responsible for working with business stakeholders on planning, identifying, prioritizing, and implementing AI use cases.
      You can find more details here: https://www.linkedin.com/pulse/identifying-prioritizing-artificial-intelligence-use-cases-srivatsan
      This role is most relevant in mid-size to large organizations that have multiple use cases to deliver. There the AI Champion helps the enterprise prioritize use cases that are a fit for AI and that generate substantial business value.
  14. A Few Components of End to End ML Explained
      (I will cover each in more detail in my LinkedIn posts)
  15. Data Collection
      • Data is typically collected and centralized from a variety of sources into a data lake, a data warehouse, or another enterprise data ecosystem
      • Data is sourced from high-volume transactional systems such as ERP and sales, or from high-velocity sources such as IoT devices and POS systems
      • Data takes a variety of shapes: structured, semi-structured, and unstructured
      • Data arrives in a variety of forms: batch, streaming, API, alternate data, etc.
      • Ingesting data is only one part of the puzzle. Data also needs to be cataloged, secured, and governed
      Further Reading: https://www.linkedin.com/pulse/think-data-first-before-being-ai-srivatsan-srinivasan
      “Define an efficient data strategy that is simple to implement and helps accelerate your AI strategy”
  16. Data Analysis and Validation
      Inspect and clean data to discover useful information that can help in modeling an AI-driven intelligent solution. The purpose of data analysis and validation is to understand:
      • What are the characteristics of my data, and what does my data look like?
      • Are there any outliers or errors in the data?
      • How does each independent variable relate to the target variable?
      • Has the data evolved (drifted) away from the assumptions the model was trained on? Base statistics from the analysis phase are compared against production inference data to answer this.
      Further Reading: https://www.linkedin.com/pulse/tensorflow-extended-tfx-data-analysis-validation-drift-srinivasan/
      “Understanding your data is a key step to insight”
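The outlier and drift questions above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names, the sample numbers, the z-score threshold of 3, and the drift threshold of 2 standard deviations are all hypothetical choices, not something the slides prescribe.

```python
import statistics

def baseline_stats(values):
    """Summarize a numeric feature as seen at training time."""
    return {"mean": statistics.fmean(values), "stdev": statistics.stdev(values)}

def find_outliers(values, stats, z_threshold=3.0):
    """Return values lying more than z_threshold baseline stdevs from the baseline mean."""
    limit = z_threshold * stats["stdev"]
    return [v for v in values if abs(v - stats["mean"]) > limit]

def mean_shift(production_values, stats):
    """Shift of the production mean, measured in baseline standard deviations."""
    return abs(statistics.fmean(production_values) - stats["mean"]) / stats["stdev"]

# Hypothetical training-time values for one feature.
training = [10.0, 11.0, 9.0, 10.5, 9.5, 10.0, 10.2, 9.8]
stats = baseline_stats(training)

# Hypothetical values seen at inference time.
production = [13.0, 12.5, 13.5, 12.8, 13.2]

outliers = find_outliers([10.1, 25.0, 9.9], stats)
shift = mean_shift(production, stats)
drifted = shift > 2.0  # assumed threshold, tune per feature
```

The baseline statistics are computed once on training data and stored, so the same numbers can be reused against every batch of production data.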
  17. Data Organization and Transformation
      Data collected from source systems into the data ecosystem is typically at a granular level and not directly consumable by an ML model. Sources are also spread across multiple domains. Taking marketing as an example, data might be spread across customer, product, transaction, and loyalty systems.
      Data organization and transformation makes data consumable for ML models and accessible for self-service.
      Raw data, typically terabytes in size, is cleansed and aggregated into a form that can be fed into the model directly. This is where most of the heavy lifting happens, in close collaboration between business, data engineers, ML engineers, and data analysts.
      Diagram: Integrate -> Explore -> Aggregate -> Model -> Deploy -> Monitor. Data shrinks from Raw Data (TB-PB) to Model Input Data (MB-GB) to Insight (KB). Effort labels: 60% (Data Engineering and Data Analyst), 40% (ML Engineer, Data Engineer and Software Engineer).
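As a rough illustration of turning granular records into model input, the sketch below rolls hypothetical transaction rows up to one row per customer and joins in a loyalty tier from a second source. The field names (`customer_id`, `amount`, `loyalty_tier`) and the data are invented for the example. A real pipeline would run this kind of aggregation in a distributed engine rather than in-memory Python.

```python
from collections import defaultdict

# Hypothetical granular records from a transaction system.
transactions = [
    {"customer_id": "c1", "amount": 20.0},
    {"customer_id": "c2", "amount": 5.0},
    {"customer_id": "c1", "amount": 30.0},
]

# Hypothetical lookup from a separate loyalty system.
loyalty = {"c1": "gold", "c2": "basic"}

def build_model_input(transactions, loyalty):
    """Aggregate transactions to one feature row per customer."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for txn in transactions:
        totals[txn["customer_id"]] += txn["amount"]
        counts[txn["customer_id"]] += 1

    rows = []
    for customer_id in sorted(totals):
        rows.append({
            "customer_id": customer_id,
            "txn_count": counts[customer_id],
            "total_spend": totals[customer_id],
            "avg_spend": totals[customer_id] / counts[customer_id],
            # Cross-reference the second source; default when no match exists.
            "loyalty_tier": loyalty.get(customer_id, "unknown"),
        })
    return rows

model_input = build_model_input(transactions, loyalty)
```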
  18. Model Deployment
      A few key things to remember while deploying models to production or integrating models with business processes:
      • Training/deployment skew: models developed on historical sources might have to be deployed in a streaming flow or at the edge of the network or on devices
      • Not everything can be “flask’ed” or exposed as a service. The deployment scenario varies with the technology used in the business process, the inference SLA, etc.
      • Keep the model pipeline as simple as possible. Avoid spaghetti pipeline code
      • When implementing the deployment framework, provision for experimenting with new models: champion/challenger or A/B testing based model deployment and analysis
      • Training/deployment skew: features that are hard to compute at inference time, or features that were forward-computed at training time (this may not sound sensible, but trust me, I have seen enterprises make this mistake)
      Further Reading: https://www.linkedin.com/pulse/ml-model-deployment-considerations-srivatsan-srinivasan/ https://www.linkedin.com/pulse/integrating-machine-learning-models-within-matured-srinivasan/
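One way to catch forward-computed features before deployment is to record, for each feature, when its value actually becomes known and compare that with the moment a prediction is needed. The sketch below assumes such timestamps are available. The function name and the feature names are hypothetical.

```python
from datetime import datetime, timedelta

def leaky_features(feature_available_at, prediction_time):
    """Return names of features whose values are only known after prediction time."""
    return sorted(
        name
        for name, available_at in feature_available_at.items()
        if available_at > prediction_time
    )

# Hypothetical example: score an order at checkout time.
checkout = datetime(2019, 8, 1, 12, 0, 0)
feature_available_at = {
    "basket_value": checkout,                          # known at checkout
    "customer_tenure_days": checkout - timedelta(days=1),
    "days_until_return": checkout + timedelta(days=10),  # only known in the future
}

leaks = leaky_features(feature_available_at, checkout)
```

Any name in `leaks` is a feature that looked fine on historical data but cannot be computed when the model is called in production.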
  19. Model Monitoring
      Machine learning today is essential for running some of our critical business processes. ML is deployed in decision making, supplementing or replacing humans, and because it is making decisions it needs to be monitored continuously.
      Ongoing monitoring of ML models is essential to evaluate whether the assumptions the model was developed on still hold and whether the model is performing as intended. A model can drift due to changes in business assumptions, changes or issues with data, or market conditions that call for adjustment, among others.
      Ongoing monitoring highlights scenarios where the model might need re-calibration. For some business processes that can be yearly, for others as frequently as daily.
      Plan for monitoring models continuously -> alert on drift in data, concept, or model. Business today evolves rapidly, and the assumptions on which models are trained quickly become invalid. You want to know before your models start making wrong predictions.
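A drift alert can be as simple as comparing the binned distribution of a feature or model score at training time with the same bins in production. The sketch below uses the Population Stability Index (PSI), which is one common choice and not something the slides prescribe. The bin counts are invented, and the 0.25 alert threshold is a widely quoted rule of thumb that should be treated as an assumption.

```python
import math

def population_stability_index(expected_counts, actual_counts, eps=1e-6):
    """PSI between two histograms that share the same bins."""
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    score = 0.0
    for expected, actual in zip(expected_counts, actual_counts):
        # Floor the proportions so empty bins do not produce log(0).
        expected_pct = max(expected / expected_total, eps)
        actual_pct = max(actual / actual_total, eps)
        score += (actual_pct - expected_pct) * math.log(actual_pct / expected_pct)
    return score

# Hypothetical score histogram at training time vs. last week in production.
training_bins = [50, 30, 20]
production_bins = [20, 30, 50]

psi = population_stability_index(training_bins, production_bins)
should_alert = psi > 0.25  # assumed threshold
```

Running this on a schedule per feature, and wiring `should_alert` into whatever alerting the team already uses, gives an early signal before prediction quality visibly degrades.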
  20. Other Key Components to Succeed in Enterprise Machine Learning
      • Structured and modularized code base
      • Experiment tracking for reproducibility
      • Version control of ML code, data, and experiment results
      • DevOps for both infrastructure and model deployment
      • Orchestrator for data and model pipelines
      • Logging critical deployment runtime information and making it searchable
  21. Food for Thought
  22. Food for thought #1 - Various Points of Failure in the ML Lifecycle
      The machine learning cycle is not complete after deployment. The model needs to be monitored continuously, and you should be prepared for failure at any point before or after the modeling exercise.
      • Failure during experimentation. This is actually the ideal case, since you find the problem early.
      • Failure during development by not thinking about the real-world inference scenario, for example using features that are hard or impossible to compute during inference
      • Failure post deployment, where a model does not generate the business value it was supposed to
      • Failure post deployment to keep up with the ever-changing data landscape. These models need frequent re-calibration or some form of continuous learning
      • Failure to use the right performance metrics. Think about what makes your business succeed, not what makes the model succeed
      Further Reading – Reasons why ML projects fail: https://www.linkedin.com/pulse/top-reasons-why-artificial-intelligence-projects-fail-srinivasan/
  23. Food for thought #2 - Infrastructure
      Enterprises hiring artificial intelligence and machine learning experts without the right infrastructure and tools is like “hiring astronauts to drive a bullock cart”.
      Building data science capability within an enterprise must be thought through from the ground up, starting with the selection of the silicon chip. Data engineering and ML processes are typically compute and memory intensive, and large datasets make that planning even more important. Data scientists typically perform hundreds of iterations to arrive at the right algorithm, hyperparameters, and metrics. Not having the right infrastructure can derail an enterprise's move into machine learning.
      Plan for infrastructure with the right kind of hardware (GPU, CPU, HPC, etc.), technologies (Hadoop, Kubernetes, etc.), and tools (Spark ML, TensorFlow, scikit, etc.) that can distribute ML/DL pipelines for faster hypothesis testing and value generation.
      Cloud is a very good alternative for accelerating the ML journey: you can spin up compute on demand and tear it down when it is not needed.
      Further Reading – https://www.linkedin.com/pulse/accelerating-artificial-intelligence-initiatives-srivatsan-srinivasan/
  24. Food for thought #3 - Cloud for AI/ML
      Cloud is a key component of the AI/ML journey, especially for enterprises that need agility to meet the huge compute demand of ML jobs.
      Key benefits cloud provides:
      • Scale - Instant access to hundreds of compute instances
      • Speed - Easy availability of specialized devices (GPU/TPU) that help accelerate AI development
      • Cloud AI APIs - A quick jump start on complex activities rather than building from scratch. For cases like speech-to-text or language translation, an enterprise might also lack the data to build models as accurate as those available in the cloud
      • Cloud AutoML - Train high-quality models specific to business needs with citizen data scientists or even business users
      • Cloud Bursting - With advances in hybrid cloud, start small in a local data center and use the cloud to scale AI compute
      Further Reading – https://www.linkedin.com/pulse/artificial-intelligence-google-cloud-platform-srivatsan-srinivasan/ https://www.linkedin.com/pulse/data-analytics-google-cloud-platform-srivatsan-srinivasan/
  25. Food for thought #4 - Stay simple as long as possible
      You fit a simple model and accuracy is low. Do you immediately jump to complex models? Try the two steps below before moving to trendy and complex algorithms.
      Follow your model output -> Listen to what your algorithm's metrics say. Drill down into misclassified cases and see whether you can find any interesting pattern.
      Be curious and creative with your data -> Look for patterns or relationships in the data that can influence your model outcome. A lot can be solved by proper EDA and feature engineering.
      If you are still not meeting the performance targets, move to complex models in increments. The steps you performed above are still relevant and can feed your complex models to sharpen the decision boundary.
      In some critical business processes, a simple model at 84% might be a better choice than a complex model at 86%.
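Drilling into misclassifications often starts with a simple breakdown of the error rate by some attribute of the data. The sketch below assumes each scored record carries `predicted` and `actual` labels plus a hypothetical `channel` attribute. All names and values are made up for illustration.

```python
from collections import defaultdict

def error_rate_by_segment(records, segment_key):
    """Fraction of misclassified records within each value of segment_key."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for record in records:
        segment = record[segment_key]
        totals[segment] += 1
        if record["predicted"] != record["actual"]:
            errors[segment] += 1
    return {segment: errors[segment] / totals[segment] for segment in totals}

# Hypothetical scored records from a validation set.
records = [
    {"channel": "web", "predicted": 1, "actual": 1},
    {"channel": "web", "predicted": 0, "actual": 0},
    {"channel": "web", "predicted": 1, "actual": 1},
    {"channel": "web", "predicted": 0, "actual": 1},
    {"channel": "store", "predicted": 1, "actual": 0},
    {"channel": "store", "predicted": 0, "actual": 1},
]

rates = error_rate_by_segment(records, "channel")
```

A segment whose error rate stands far above the rest (here the hypothetical `store` channel) is a hint that a missing feature or a data issue, not model complexity, may be the real problem.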
  26. Food for thought #5 - Data Science and Agile
      There are a lot of misconceptions about using Agile for data science. Data science outcomes depend on continuous experimentation, whereas Agile focuses on early and continuous delivery throughout the development lifecycle.
      The first thing to remember is that Agile is a set of guiding principles, not a methodology set in stone. Agile can be tailored to one’s unique data science needs.
      Here is one way of doing data science, especially the machine learning part, in an Agile way:
      • Don't set strict deliverables at the end of every sprint
      • Use daily/weekly meetings only to surface blockers, not to report daily status
      • As soon as you have a working model with decent accuracy (say every sprint or two), put it in private beta mode. Private beta mode, or dark mode, is where the model generates output that is not acted on. This lets you monitor the model against real-world data and test its reliability
      • Keep updating the private beta as you build models with better performance
      • Launch the private beta model to a small percentage of live traffic. Collect feedback based on the response from end users
      • Keep increasing the volume of transactions routed to the model at frequent intervals until all traffic is diverted and the feedback/outcome targets are met
      In the real world there are scenarios where an ML model does not deliver the value seen during the training/evaluation phase. In such cases agile delivery allows machine learning projects to stay value and outcome focused and to achieve project objectives in a timely manner.
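The private beta and gradual ramp-up steps can be sketched as a small routing function. This is a minimal illustration, assuming requests carry a stable identifier and that models are plain callables. `in_rollout`, `decide`, and the two toy models are hypothetical names, and hashing the request id into 100 buckets is just one way to get a repeatable traffic split.

```python
import hashlib

def in_rollout(request_id, percent):
    """Deterministically place a request into the first `percent` of 100 buckets."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent

def decide(request_id, features, current_model, beta_model, percent, shadow_log):
    """Always score with the beta model, but act on its output only for rolled-out traffic."""
    beta_output = beta_model(features)
    # Dark mode: the beta output is recorded for later comparison even when not acted on.
    shadow_log.append((request_id, beta_output))
    if in_rollout(request_id, percent):
        return beta_output
    return current_model(features)

# Hypothetical stand-ins for the existing process and the new model.
def current_model(features):
    return "approve" if features["amount"] < 100 else "review"

def beta_model(features):
    return "approve" if features["amount"] < 150 else "review"

shadow_log = []
dark = decide("req-1", {"amount": 120}, current_model, beta_model, 0, shadow_log)
live = decide("req-1", {"amount": 120}, current_model, beta_model, 100, shadow_log)
```

With `percent` at 0 the model runs purely in dark mode. Raising it step by step moves live traffic over, and the same request id always lands in the same bucket, which keeps the experience consistent for a given request or user.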
  27. Food for thought #5 - Myth v/s Fact
      • Myth: If your tabular data is big in size, switch to deep learning. Traditional ML will not work.
        Fact: Traditional ML algorithms can scale to large datasets. There are distributed frameworks that can train your model on large datasets and learn from them very effectively. Choose technology based on your business and data needs.
      • Myth: Machine learning will eventually replace existing rules in legacy systems.
        Fact: Think of ML initially as a technology for complementing your legacy rules. You can reduce the complexity of rules by introducing an ML solution. It can eventually replace them, but it is always better to have some deterministic rules complementing your probabilistic ML models.
      • Myth: Machine learning is the new “magic wand” for making your business process smart and intelligent.
        Fact: Do not take a non-ML problem and try to fit ML into it. Use ML when you believe it will add value to the business process. You can also make your business process smart with advanced analytics or statistical techniques.
      • Myth: AutoML will replace and automate data science work.
        Fact: Data science is more than what AutoML can currently do. AutoML will be an assistant to data scientists, taking care of the boring parts of the job so they can focus on delivering business value.
      Further Reading on AutoML – https://www.linkedin.com/pulse/fear-data-scientist-called-autophobia-srivatsan-srinivasan/
  28. To Summarize
      • Plan for investing in the right infrastructure (GPU, CPU, cloud) to accelerate the model development process
      • Only 20% or less of the actual pipeline is ML code
  29. Thank you, and stay tuned on LinkedIn for more on the End to End Data Science Pipeline
      Follow or search the hashtag #end2endDS on LinkedIn to get updates