Learning from Data
Busy Professional’s guide to machine learning
@govindk
http://govindkanshi.wordpress.com
Agenda
• What we know
• What we do not know
• Process
• What to measure
• Challenge with Model
• Challenge with Data
• Resources
• Software
• Books
What we know
• Reports made from data
• KPIs made of data
• Dashboards made of data
• They all measure known metrics, questions
What we do not know
• Will this person turn delinquent in x years, based on their profile
(age/income/background…)
• Which kind of process or machine will fail
• Which people/things are similar to each other – find me a pattern
• How to prevent people from being readmitted to hospital
• Why – because we do not know the question, and
databases/applications do not have out-of-the-box functionality for it.
We are already using applied ML results
• Mails get despammed
• Kinect recognizes our gestures
• Facebook recognizes our photos
• Siri/Cortana – recognize our voice commands
• Watson used some
• Search uses many
• Recommendations are right in your face
So then
• Learn from data
• How
• Create a model of the data
• Test the model for error and use it
Unsupervised
• Clustering
• Customer segmentation
• Topic identification
• Number of algorithms
• Hierarchical (distance as the measure – generally Euclidean)
• Agglomerative (start with n groups and keep merging them)
• Single link (merge 2 at a time) vs. divisive (start with a single group – break it down)
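The agglomerative, single-link idea on this slide can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names and the five 2-D points are made up, and the brute-force search is far slower than what a real library would do.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_link_agglomerative(points, k):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [[i] for i in range(len(points))]  # start: one cluster per point
    while len(clusters) > k:
        best = None  # (distance, index_a, index_b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single link: cluster distance = distance of the closest pair
                d = min(euclidean(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return [sorted(c) for c in clusters]

# Made-up points: two tight pairs and one outlier.
points = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.1), (5.2, 4.9), (9.0, 1.0)]
groups = single_link_agglomerative(points, 3)
```

Here `groups` holds lists of point indices, one list per cluster. Choosing `k` is still up to the caller, which is the "how many clusters" challenge raised later in the deck.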
Simple way
• Group folks on
• Height
• What you eat
• Where you are from (state)
• Next time a new person comes in – let us predict
Demos
• USArrests Data
• Wine Data
Challenges and next steps
• How many groups/clusters
• How many mis-groupings (Evaluation)
• Associating topics, and what to do after clustering
• Once clusters are formed – someone can name them
• Now run supervised methods on data to learn more
Supervised learning
• Given a label L for a set of attributes (a1, a2, a3, …)
• Learn the model which can predict the label based on attributes
Simple way to understand Classification
• Let us say we are labelled North Indian or South Indian
• How
• Attributes (language, food, movie language, music …)
• Basically learning the link between
• Observed data X and
• A variable y, usually called the target or label.
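One of the simplest ways to see "learning the link between X and y" is a 1-nearest-neighbour classifier: it predicts the label of the closest labelled example. The sketch below is an illustration only; the function name, the numeric attributes and the labels "A"/"B" are all made up, and it is not the method used in the demos.

```python
def predict_1nn(train_X, train_y, x):
    """Return the label of the training row closest to x (squared Euclidean)."""
    best_label, best_dist = None, float("inf")
    for row, label in zip(train_X, train_y):
        dist = sum((a - b) ** 2 for a, b in zip(row, x))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Made-up numeric attributes (a1, a2) with made-up labels.
X = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 7.5)]
y = ["A", "A", "B", "B"]

label_near_a = predict_1nn(X, y, (2.0, 1.5))
label_near_b = predict_1nn(X, y, (8.5, 8.0))
```

Categorical attributes such as language or food would first need to be encoded as numbers before a distance like this makes sense.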
Supervised
• Data
• One dataset for training which has label
• One dataset for testing
• Example
• Classification (spam, order data, disease data, Kinect gesture)
• Classification
• binary vs. multiclass
• Regression (sales)
• Ranking
• Search
• Predictive maintenance
• Recommendation
• Netflix - Netflix competition = SVD
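The "one dataset for training, one for testing" point under Data can be sketched as a simple shuffled hold-out split. The helper name, the 25% test fraction and the toy rows are assumptions chosen for illustration.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical labelled rows: (attributes, label)
rows = [((i, i % 3), "spam" if i % 2 else "ham") for i in range(20)]
train, test = train_test_split(rows)
```

The model is fitted on `train` only; `test` is kept aside so the error estimate is not flattered by data the model has already seen.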
Demos
• Trees
• DecisionTree – Python (show train and test, validation)
• Decision tree – R
• BigML (network dependent)
• Challenge –
• one input every time
Few more terms to overcome data issues
• Bagging (used with tree models) (variance reduction)
• Train an ensemble of models from bootstrap samples
• Get a vote amongst models
• Class predicted by the majority of the models wins
• Get an average if outputs are scores or probabilities
• * Bootstrap – denotes a different random sample of the dataset, drawn with replacement
• Boosting (mainly bias reduction)
• Like bagging, but penalizes and learns from misclassifications
• Challenge of assigning “weights” to misclassified instances to penalize them
• Start with a higher weight, say 1, and keep reducing it till the error comes down
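The bagging recipe above (bootstrap samples, one model per sample, majority vote) can be sketched with a deliberately tiny base model. Everything here is an assumption for illustration: the base model is a 1-D threshold rule ("stump") rather than a full tree, and the data and function names are made up.

```python
import random
from collections import Counter

def fit_stump(sample):
    """Fit a 1-D rule 'predict 1 if x > t' with the fewest training errors."""
    best_t, best_err = None, None
    for t, _ in sample:  # candidate thresholds are the observed x values
        err = sum((x > t) != bool(label) for x, label in sample)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t

def bagged_stumps(data, n_models=25, seed=0):
    """Train one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(data) for _ in data]) for _ in range(n_models)]

def vote(thresholds, x):
    """Majority vote across the ensemble."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]

# Made-up (x, label) pairs: small x is class 0, large x is class 1.
data = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0),
        (3.0, 1), (3.5, 1), (4.0, 1), (4.5, 1)]
ensemble = bagged_stumps(data)
```

If the models output scores instead of classes, the vote would be replaced by an average, as the slide notes.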
Demo
• RandomForest
• Each tree is trained on n samples drawn from the N training samples; at each decision node it randomly selects
m input features from the total M input features (m ~ M^0.5) and chooses the best split among them.
Finally, each tree in the forest votes for the result.
• Evaluation
• Apply a loss function to margins (penalize misclassification, reward positive margins)
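The two sources of randomness in a random forest, sampling rows and sampling features, can be sketched as below. This only shows the sampling step, not tree growing. The helper name is hypothetical, and it assumes the bootstrap sample has the same size as the training set (n = N), which is a common choice but not stated on the slide.

```python
import math
import random

def draw_tree_inputs(n_rows, n_features, rng):
    """Pick bootstrap rows for one tree and candidate features for one node."""
    rows = [rng.randrange(n_rows) for _ in range(n_rows)]  # with replacement
    m = max(1, round(math.sqrt(n_features)))               # m ~ M^0.5
    features = rng.sample(range(n_features), m)            # without replacement
    return rows, features

rng = random.Random(42)
rows, features = draw_tree_inputs(n_rows=100, n_features=16, rng=rng)
```

In a full implementation the feature subset is redrawn at every decision node, while the bootstrap rows stay fixed for the whole tree.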
Regression
• Explain the relationship between two variables (dependent vs
independent)
• Linear - y = W0 + W1x1 + W2x2 + … (simple linear regression uses a single x)
• Estimate the weights to predict y
• Multivariate
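"Estimate the weights to predict y" has a closed-form answer in the single-variable case. The sketch below uses ordinary least squares for y = W0 + W1x; the function name and the four data points are made up for illustration.

```python
def fit_simple_linear(xs, ys):
    """Ordinary least squares for y = w0 + w1 * x; returns (w0, w1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w1 = sxy / sxx              # slope
    w0 = mean_y - w1 * mean_x   # intercept
    return w0, w1

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]   # made-up data lying exactly on y = 1 + 2x
w0, w1 = fit_simple_linear(xs, ys)
```

With several input variables the same idea applies, but the weights are usually found with linear algebra routines rather than by hand.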
Demos
• Excel
• SimpleLinear – R
• RandomForest – Wine
• Evaluate by applying loss function to residuals
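"Applying a loss function to residuals" can be made concrete with two common choices, root mean squared error and mean absolute error. The numbers below are made up; which loss is appropriate depends on how much large errors should be punished.

```python
import math

def residuals(y_true, y_pred):
    """Residual = actual value minus predicted value."""
    return [t - p for t, p in zip(y_true, y_pred)]

def rmse(y_true, y_pred):
    """Root mean squared error: large residuals count disproportionately."""
    r = residuals(y_true, y_pred)
    return math.sqrt(sum(e * e for e in r) / len(r))

def mae(y_true, y_pred):
    """Mean absolute error: every unit of error counts the same."""
    r = residuals(y_true, y_pred)
    return sum(abs(e) for e in r) / len(r)

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 11.0]   # made-up predictions
```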
What to measure
• Data
• Cross Validation
• n-fold cross-validation
• Leave-one-out validation
• Hold out
• End of the day – how much data is enough, and is there bias in the data (only certain kinds of labels)
• Model Results
• Contingency table (false negatives & false positives are bad)
• ROC & AUC (coverage curve) (true positive rate vs false positive rate)
• Precision/Recall (from search world)
• F-measure
• Lift (not interested in accuracy on the entire dataset, want it for the top 5% or 10% of the dataset)
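The n-fold cross-validation listed under Data can be sketched as an index generator: the data is cut into k folds and each fold serves as the test set once. The function name is made up, and the sketch assumes the rows were already shuffled.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 3))
loo_folds = list(k_fold_indices(5, 5))  # k == n gives leave-one-out
```

Hold-out validation is the special case where only one such split is made and used once.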
Is the model working right?
              Predicted +ve   Predicted -ve   Total
Actual +ve         40              15           55
Actual -ve          5              40           45
Total              45              55          100
Precision = TP / (TP + FP) = 40/45 ≈ 0.89
Recall = TP / (TP + FN) = 40/55 ≈ 0.73
F-measure (harmonic mean) = 2/((1/prec) + (1/rec)) = 0.80
Accuracy = (TP + TN) / total = (40 + 40) / (40 + 15 + 5 + 40) = 0.80
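The arithmetic for this contingency table can be checked with a few lines of Python. The variable names are chosen here for illustration; the four counts are the ones in the table (40 true positives, 15 false negatives, 5 false positives, 40 true negatives).

```python
tp, fn = 40, 15   # actual positive row: predicted +ve, predicted -ve
fp, tn = 5, 40    # actual negative row: predicted +ve, predicted -ve

precision = tp / (tp + fp)                     # of predicted positives, how many are right
recall = tp / (tp + fn)                        # of actual positives, how many were found
f_measure = 2 / (1 / precision + 1 / recall)   # harmonic mean of the two
accuracy = (tp + tn) / (tp + fn + fp + tn)     # all correct predictions over all cases
```

Note that accuracy counts both true positives and true negatives in the numerator, so it uses the diagonal of the table, not the row or column totals.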
How much accuracy is enough
Lift – How much better than random guessing
Lift and accuracy are not necessarily correlated
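Lift can be sketched as the positive rate among the top-scored fraction of cases divided by the positive rate overall, so a value of 1.0 means "no better than random guessing". The function name, scores and labels below are made up for illustration.

```python
def lift_at(scores, labels, fraction):
    """Positive rate in the top-scored fraction divided by the overall rate."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(len(ranked) * fraction))
    top_rate = sum(label for _, label in ranked[:n_top]) / n_top
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate

# Made-up model scores and true labels (1 = positive), 3 positives out of 10.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

lift_top_20 = lift_at(scores, labels, 0.2)
lift_all = lift_at(scores, labels, 1.0)
```

In this toy data both cases in the top 20% are positive, against 30% overall, so the model is a bit more than three times better than random on that slice.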
Challenge with Model
• Overfitting
• Avoid Bias and have less variance
• Use Regularization
• L1 (Lasso)
• L2 (Ridge)
• If time permits show the alpha effect
• Look for “overfitting model” , “bias and variance”
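The effect of the regularization strength alpha is easiest to see with a single weight and no intercept, where both penalties have closed-form solutions. This is a sketch under that simplifying assumption: ridge uses the L2 penalty (alpha * w^2) and lasso the L1 penalty (alpha * |w|). The function names and data are made up.

```python
def ridge_weight(xs, ys, alpha):
    """Minimise sum((y - w*x)^2) + alpha * w^2 for a single weight w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)   # shrinks towards 0 but never reaches it

def lasso_weight(xs, ys, alpha):
    """Minimise sum((y - w*x)^2) + alpha * |w| for a single weight w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    shrunk = max(abs(sxy) - alpha / 2.0, 0.0)   # soft-thresholding
    return (shrunk if sxy >= 0 else -shrunk) / sxx

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # made-up data lying exactly on y = 2x

ridge_path = [ridge_weight(xs, ys, a) for a in (0.0, 14.0, 1000.0)]
lasso_path = [lasso_weight(xs, ys, a) for a in (0.0, 28.0, 56.0)]
```

As alpha grows, the ridge weight keeps shrinking smoothly, while the lasso weight hits exactly zero once alpha is large enough. That is why lasso is often used to drop features.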
Challenge with Data
• Categorical, ordinal, quantitative
• Measures – mean, median, variance, std deviation, range, shape (skewness)
• Always observe to get “feel”/smell of data
• Discretize/Thresholding (convert quantitative feature)
• Missing feature(s) –
• What do you do – fill in with the median or average
• Data encoding
• Create new from existing vs encode in different way
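Two of the data fixes on this slide, filling in missing values with the median and discretizing a quantitative feature with thresholds, can be sketched as follows. The function names, the income figures and the two thresholds are made up for illustration.

```python
from statistics import median

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def discretize(values, thresholds):
    """Map each value to the number of thresholds it exceeds (its bin index)."""
    return [sum(v > t for t in thresholds) for v in values]

incomes = [30.0, None, 45.0, 120.0, None, 60.0]   # made-up numbers
filled = impute_median(incomes)
bins = discretize(filled, thresholds=[40.0, 80.0])
```

The median is less sensitive than the average to outliers such as the 120.0 above, which is one reason to look at the shape of the data before choosing a fill value.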
Feature engineering
• Feature selection
• Intuition, testing correlation
• Subset (Start small and increase) based on some error function
• Feature extraction
• New k dimensions – as combination of older d dimensions
• Linear
• PCA (find the variance by projecting – explains impact of outliers)
• LDA (supervised method for dimensionality reduction for classification)
• FA (Factor Analysis), Multidimensional Scaling (distance between points)
• Isomap (geodesic distance) and Locally Linear Embedding (LLE), both nonlinear
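The "testing correlation" route to feature selection can be sketched by ranking features on the absolute Pearson correlation with the target. This is one simple filter, not the only way to select features; the function names, feature names and values are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rank_features(columns, target):
    """Sort feature names by absolute correlation with the target, strongest first."""
    scored = {name: pearson(col, target) for name, col in columns.items()}
    return sorted(scored, key=lambda name: abs(scored[name]), reverse=True)

# Made-up feature columns and target.
columns = {
    "a1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "a2": [2.0, 1.0, 4.0, 3.0, 5.0],
    "a3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
target = [1.1, 2.0, 2.9, 4.2, 5.0]
ranking = rank_features(columns, target)
```

Correlation only captures linear relationships with one feature at a time, so the subset approach on the slide (start small and grow while an error function improves) is a useful complement.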
What we could not cover
• Mechanisms
• Reinforcement Learning (punishment/rewards to learn better)
• Algorithm types
• Perceptron (backpropagation, SOM, ..)
• SVM
• LDA and friends for unstructured world
• Regression (OLS, logistic, stepwise, MARS)
• Regularization (ridge/lasso)
• Trees (GBM, C4.5, ID3…)
• Bayesian
• Kernel (radial)
• Deep learning (DBN, Boltzmann..)
• Clustering (Expectation Maximization)
• Recommendation
• Probability (distributions) & Linear Algebra
• Constraint Solving and Optimization (Solver, OpenSolver..)
Tools
• R
• Scikit
• Theano
• Weka
• KNIME
• Recommender (.net….)
• DataTau
• BigML
• WiseIO
• Skytree
• SAS/SPSS
• YHatr
Books
• Bishop
• Alpaydin
• John Foreman
• PyMC – Search query (Bayesian-Methods-for-Hackers)
• Scikit –
• jakevdp – “scikit jake 2014 tutorial”
• Olivier – “scikit olivier grisel tutorial”
• Recommender (http://mymedialite.net/) – Zeno Gantner
What you will be doing
• Data
• Touch/feel (visualize), breathe it in
• Cleaning, scaling/normalization
• Selecting
• Algorithm (choose the task)
• Classification
• Regression
• Ranking (recommendation, search results)
• Amongst
• Evaluate Algorithm against each other & refine/calibrate
• AUC, ROC, RMSE etc…
If time & net permits Yhatr demo
• Because you need to deploy, test & use the model
• Yhatr provides good hosting (theirs, or host your own)
Thanks for your time
• Please fill the evaluation form
• See you next time

Más contenido relacionado

La actualidad más candente

Machine Learning Introduction
Machine Learning IntroductionMachine Learning Introduction
Machine Learning IntroductionPranav Prakash
 
Active learning lecture
Active learning lectureActive learning lecture
Active learning lectureazuring
 
[Eestec] Machine Learning online seminar 1, 12 2016
[Eestec] Machine Learning online seminar 1, 12 2016[Eestec] Machine Learning online seminar 1, 12 2016
[Eestec] Machine Learning online seminar 1, 12 2016Grigoris C
 
Evaluating algorithms using Item Response Theory
Evaluating algorithms using Item Response TheoryEvaluating algorithms using Item Response Theory
Evaluating algorithms using Item Response TheoryCSIRO
 
What is Machine Learning?
What is Machine Learning?What is Machine Learning?
What is Machine Learning?SwiftKeyComms
 
Ml1 introduction to-supervised_learning_and_k_nearest_neighbors
Ml1 introduction to-supervised_learning_and_k_nearest_neighborsMl1 introduction to-supervised_learning_and_k_nearest_neighbors
Ml1 introduction to-supervised_learning_and_k_nearest_neighborsankit_ppt
 
Making Machine Learning Work in Practice - StampedeCon 2014
Making Machine Learning Work in Practice - StampedeCon 2014Making Machine Learning Work in Practice - StampedeCon 2014
Making Machine Learning Work in Practice - StampedeCon 2014StampedeCon
 
Predict the Oscars with Data Science
Predict the Oscars with Data SciencePredict the Oscars with Data Science
Predict the Oscars with Data ScienceThinkful
 
Machine learning module 2
Machine learning module 2Machine learning module 2
Machine learning module 2Gokulks007
 
Math problem solving service
Math problem solving serviceMath problem solving service
Math problem solving serviceChaejungMaeng
 
Cikm 2013 - Beyond Data From User Information to Business Value
Cikm 2013 - Beyond Data From User Information to Business ValueCikm 2013 - Beyond Data From User Information to Business Value
Cikm 2013 - Beyond Data From User Information to Business ValueXavier Amatriain
 
Modeling and Aggregation of Complex Annotations
Modeling and Aggregation of Complex AnnotationsModeling and Aggregation of Complex Annotations
Modeling and Aggregation of Complex AnnotationsAlexander Braylan
 
Introduction to Machine Learning Classifiers
Introduction to Machine Learning ClassifiersIntroduction to Machine Learning Classifiers
Introduction to Machine Learning ClassifiersFunctional Imperative
 
pattern classification
pattern classificationpattern classification
pattern classificationRanjan Ganguli
 
BIG2016- Lessons Learned from building real-life user-focused Big Data systems
BIG2016- Lessons Learned from building real-life user-focused Big Data systemsBIG2016- Lessons Learned from building real-life user-focused Big Data systems
BIG2016- Lessons Learned from building real-life user-focused Big Data systemsXavier Amatriain
 
Predict the Oscars with Data Science
Predict the Oscars with Data SciencePredict the Oscars with Data Science
Predict the Oscars with Data ScienceCarlos Edo
 

La actualidad más candente (20)

Machine Learning Introduction
Machine Learning IntroductionMachine Learning Introduction
Machine Learning Introduction
 
Active learning
Active learningActive learning
Active learning
 
Active learning lecture
Active learning lectureActive learning lecture
Active learning lecture
 
[Eestec] Machine Learning online seminar 1, 12 2016
[Eestec] Machine Learning online seminar 1, 12 2016[Eestec] Machine Learning online seminar 1, 12 2016
[Eestec] Machine Learning online seminar 1, 12 2016
 
LR2. Summary Day 2
LR2. Summary Day 2LR2. Summary Day 2
LR2. Summary Day 2
 
L15. Machine Learning - Black Art
L15. Machine Learning - Black ArtL15. Machine Learning - Black Art
L15. Machine Learning - Black Art
 
Evaluating algorithms using Item Response Theory
Evaluating algorithms using Item Response TheoryEvaluating algorithms using Item Response Theory
Evaluating algorithms using Item Response Theory
 
What is Machine Learning?
What is Machine Learning?What is Machine Learning?
What is Machine Learning?
 
Ml1 introduction to-supervised_learning_and_k_nearest_neighbors
Ml1 introduction to-supervised_learning_and_k_nearest_neighborsMl1 introduction to-supervised_learning_and_k_nearest_neighbors
Ml1 introduction to-supervised_learning_and_k_nearest_neighbors
 
Making Machine Learning Work in Practice - StampedeCon 2014
Making Machine Learning Work in Practice - StampedeCon 2014Making Machine Learning Work in Practice - StampedeCon 2014
Making Machine Learning Work in Practice - StampedeCon 2014
 
Predict the Oscars with Data Science
Predict the Oscars with Data SciencePredict the Oscars with Data Science
Predict the Oscars with Data Science
 
Machine learning module 2
Machine learning module 2Machine learning module 2
Machine learning module 2
 
Math problem solving service
Math problem solving serviceMath problem solving service
Math problem solving service
 
Cikm 2013 - Beyond Data From User Information to Business Value
Cikm 2013 - Beyond Data From User Information to Business ValueCikm 2013 - Beyond Data From User Information to Business Value
Cikm 2013 - Beyond Data From User Information to Business Value
 
Modeling and Aggregation of Complex Annotations
Modeling and Aggregation of Complex AnnotationsModeling and Aggregation of Complex Annotations
Modeling and Aggregation of Complex Annotations
 
Introduction to Machine Learning Classifiers
Introduction to Machine Learning ClassifiersIntroduction to Machine Learning Classifiers
Introduction to Machine Learning Classifiers
 
[系列活動] 機器學習速遊
[系列活動] 機器學習速遊[系列活動] 機器學習速遊
[系列活動] 機器學習速遊
 
pattern classification
pattern classificationpattern classification
pattern classification
 
BIG2016- Lessons Learned from building real-life user-focused Big Data systems
BIG2016- Lessons Learned from building real-life user-focused Big Data systemsBIG2016- Lessons Learned from building real-life user-focused Big Data systems
BIG2016- Lessons Learned from building real-life user-focused Big Data systems
 
Predict the Oscars with Data Science
Predict the Oscars with Data SciencePredict the Oscars with Data Science
Predict the Oscars with Data Science
 

Destacado

Azure provisioning at your control
Azure provisioning at your controlAzure provisioning at your control
Azure provisioning at your controlGovind Kanshi
 
Choosing right data store & processing
Choosing right data store & processingChoosing right data store & processing
Choosing right data store & processingGovind Kanshi
 
Mtc learnings from isv & enterprise interaction
Mtc learnings from isv & enterprise  interactionMtc learnings from isv & enterprise  interaction
Mtc learnings from isv & enterprise interactionGovind Kanshi
 
AzureML – zero to hero
AzureML – zero to heroAzureML – zero to hero
AzureML – zero to heroGovind Kanshi
 
The Six Highest Performing B2B Blog Post Formats
The Six Highest Performing B2B Blog Post FormatsThe Six Highest Performing B2B Blog Post Formats
The Six Highest Performing B2B Blog Post FormatsBarry Feldman
 
The Outcome Economy
The Outcome EconomyThe Outcome Economy
The Outcome EconomyHelge Tennø
 

Destacado (6)

Azure provisioning at your control
Azure provisioning at your controlAzure provisioning at your control
Azure provisioning at your control
 
Choosing right data store & processing
Choosing right data store & processingChoosing right data store & processing
Choosing right data store & processing
 
Mtc learnings from isv & enterprise interaction
Mtc learnings from isv & enterprise  interactionMtc learnings from isv & enterprise  interaction
Mtc learnings from isv & enterprise interaction
 
AzureML – zero to hero
AzureML – zero to heroAzureML – zero to hero
AzureML – zero to hero
 
The Six Highest Performing B2B Blog Post Formats
The Six Highest Performing B2B Blog Post FormatsThe Six Highest Performing B2B Blog Post Formats
The Six Highest Performing B2B Blog Post Formats
 
The Outcome Economy
The Outcome EconomyThe Outcome Economy
The Outcome Economy
 

Similar a Learning from data

Creativity and Curiosity - The Trial and Error of Data Science
Creativity and Curiosity - The Trial and Error of Data ScienceCreativity and Curiosity - The Trial and Error of Data Science
Creativity and Curiosity - The Trial and Error of Data ScienceDamianMingle
 
Mini datathon - Bengaluru
Mini datathon - BengaluruMini datathon - Bengaluru
Mini datathon - BengaluruKunal Jain
 
Data science
Data scienceData science
Data scienceallytech
 
Data Mining - The Big Picture!
Data Mining - The Big Picture!Data Mining - The Big Picture!
Data Mining - The Big Picture!Khalid Salama
 
Machine learning Introduction
Machine learning IntroductionMachine learning Introduction
Machine learning IntroductionDong Guo
 
ML SFCSE.pptx
ML SFCSE.pptxML SFCSE.pptx
ML SFCSE.pptxNIKHILGR3
 
DataAnalyticsIntroduction and its ci.pptx
DataAnalyticsIntroduction and its ci.pptxDataAnalyticsIntroduction and its ci.pptx
DataAnalyticsIntroduction and its ci.pptxPrincePatel272012
 
Machinr Learning and artificial_Lect1.pdf
Machinr Learning and artificial_Lect1.pdfMachinr Learning and artificial_Lect1.pdf
Machinr Learning and artificial_Lect1.pdfSaketBansal9
 
Altron presentation on Emerging Technologies: Data Science and Artificial Int...
Altron presentation on Emerging Technologies: Data Science and Artificial Int...Altron presentation on Emerging Technologies: Data Science and Artificial Int...
Altron presentation on Emerging Technologies: Data Science and Artificial Int...Robert Williams
 
The Machine Learning Workflow with Azure
The Machine Learning Workflow with AzureThe Machine Learning Workflow with Azure
The Machine Learning Workflow with AzureIvo Andreev
 
Data mining Basics and complete description onword
Data mining Basics and complete description onwordData mining Basics and complete description onword
Data mining Basics and complete description onwordSulman Ahmed
 
Machine learning for IoT - unpacking the blackbox
Machine learning for IoT - unpacking the blackboxMachine learning for IoT - unpacking the blackbox
Machine learning for IoT - unpacking the blackboxIvo Andreev
 

Similar a Learning from data (20)

Creativity and Curiosity - The Trial and Error of Data Science
Creativity and Curiosity - The Trial and Error of Data ScienceCreativity and Curiosity - The Trial and Error of Data Science
Creativity and Curiosity - The Trial and Error of Data Science
 
machine learning
machine learningmachine learning
machine learning
 
Machine learning
Machine learning Machine learning
Machine learning
 
Data analytics, a (short) tour
Data analytics, a (short) tourData analytics, a (short) tour
Data analytics, a (short) tour
 
Mini datathon - Bengaluru
Mini datathon - BengaluruMini datathon - Bengaluru
Mini datathon - Bengaluru
 
Mini datathon
Mini datathonMini datathon
Mini datathon
 
Data science
Data scienceData science
Data science
 
Data Mining - The Big Picture!
Data Mining - The Big Picture!Data Mining - The Big Picture!
Data Mining - The Big Picture!
 
AL slides.ppt
AL slides.pptAL slides.ppt
AL slides.ppt
 
Machine learning Introduction
Machine learning IntroductionMachine learning Introduction
Machine learning Introduction
 
PPT s09-machine vision-s2
PPT s09-machine vision-s2PPT s09-machine vision-s2
PPT s09-machine vision-s2
 
ML SFCSE.pptx
ML SFCSE.pptxML SFCSE.pptx
ML SFCSE.pptx
 
DataAnalyticsIntroduction and its ci.pptx
DataAnalyticsIntroduction and its ci.pptxDataAnalyticsIntroduction and its ci.pptx
DataAnalyticsIntroduction and its ci.pptx
 
Machinr Learning and artificial_Lect1.pdf
Machinr Learning and artificial_Lect1.pdfMachinr Learning and artificial_Lect1.pdf
Machinr Learning and artificial_Lect1.pdf
 
Altron presentation on Emerging Technologies: Data Science and Artificial Int...
Altron presentation on Emerging Technologies: Data Science and Artificial Int...Altron presentation on Emerging Technologies: Data Science and Artificial Int...
Altron presentation on Emerging Technologies: Data Science and Artificial Int...
 
Ml - A shallow dive
Ml  - A shallow diveMl  - A shallow dive
Ml - A shallow dive
 
The Machine Learning Workflow with Azure
The Machine Learning Workflow with AzureThe Machine Learning Workflow with Azure
The Machine Learning Workflow with Azure
 
Unit-1.ppt
Unit-1.pptUnit-1.ppt
Unit-1.ppt
 
Data mining Basics and complete description onword
Data mining Basics and complete description onwordData mining Basics and complete description onword
Data mining Basics and complete description onword
 
Machine learning for IoT - unpacking the blackbox
Machine learning for IoT - unpacking the blackboxMachine learning for IoT - unpacking the blackbox
Machine learning for IoT - unpacking the blackbox
 

Último

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 

Último (20)

The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 

Learning from data

  • 1. Learning from Data Busy Professional’s guide to machine learning @govindk http://govindkanshi.wordpress.com
  • 2. Agenda • What we know • What we do not know • Process • What to measure • Challenge with Model • Challenge with Data • Resources • Software • Books
  • 3. What we know • Reports made from data • KPIs made of data • Dashboards made of data • They all measure known metrics, questions
  • 4. What we do not know • Will this person turn delinquent in x years based on his profile (age/income/background…) • Which kind of process, machine will fail • Which people/things are similar to each other – find me a pattern • Prevent people from readmission into Hospital • Why - because we do not know the question and database/applications do not have oob functionality.
  • 5. We are already using applied ML results • Mails get despammed • Kinect recognizes our gestures • Facebook recognizes our photos • Siri/Cortana – recognize our voice commands • Watson used some • Search uses many • Recommendation is there in face
  • 6. So then • Learn from data • How • Create a model of the data • Test the model for error and use it
  • 7. Unsupervised • Clustering • Customer segmentation • Topic identification • Number of algorithms • Hierarchical (distance as measure – generally Euclidian ) • Agglomerative ( start with n groups and start merging them) • Single Link (2 at time) vs divisive (start single – break it down)
  • 8. Simple way • Group folks on • Height • What you eat • Where you are from (state) • Next time a new person comes in – let us predict
  • 10. Challenges and next steps • How many groups/clusters • How many miss-groupings (Evaluation) • Associate Topics & after Clustering what • Once clusters are formed – some one can name them • Now run supervised methods on data to learn more
  • 11. Supervised learning • Given a label L for a attributes (a1,a2,a3..) • Learn the model which can predict the label based on attributes
  • 12. Simple way to understand Classification • Let us say we are labelled north indian, south indian • How • Attributes (language, food, movie language, music …) • Basically learning the link between • An observed data X and • A variable y usually called target or labels.
  • 13. Supervised • Data • One dataset for training which has label • One dataset for testing • Example • Classification (spam, order data, disease data, Kinect gesture) • Classification • binary vs. multiclass • Regression (sales) • Ranking • Search • Predictive maintenance • Recommendation • Netflix - Netflix competition = SVD
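The "one dataset for training, one for testing" split can be sketched as a simple hold-out. The helper name `hold_out_split`, the 25% test fraction, and the toy rows are assumptions for illustration only.

```python
import random


def hold_out_split(rows, test_fraction=0.25, seed=0):
    """Shuffle labelled rows and hold back a fraction for testing."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = rows[:]         # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


# Hypothetical labelled rows: (attribute, label).
rows = [(i, i % 2 == 0) for i in range(8)]
train, test = hold_out_split(rows)
```

The model is fitted on `train` only; `test` is used once at the end to estimate how it behaves on unseen data.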
  • 14. Demos • Trees • DecisionTree – Python (show train and test, validation) • Decision tree – R • BigML (network dependent) • Challenge – • one input every time
  • 15. Few more terms to overcome data issues • Bagging – (used with tree models) (variance reduction) • Train an ensemble of models from Bootstrap samples • Get a vote amongst models • Class predicted by the majority of the models wins • Get an average if outputs are scores or probabilities • * Bootstrap – denotes a different random sample of the dataset, drawn with replacement • Boosting (mainly bias reduction) • Like Bagging but penalizes & learns from misclassification • Challenge of assigning “weights” to misclassified instances to penalize them • Start with a higher weight, say 1, and keep reducing till error comes down
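Bagging as described here (bootstrap samples, one model per sample, majority vote) can be sketched with a deliberately tiny base model. The one-threshold "stump", the function names, and the toy data are assumptions for illustration; the demos in the talk used real tree learners.

```python
import random
from collections import Counter


def fit_stump(sample):
    """Pick the threshold on x that misclassifies the fewest sample points.

    The stump predicts True when x > threshold.
    """
    best_errors, best_threshold = None, None
    for threshold, _ in sample:
        errors = sum((x > threshold) != label for x, label in sample)
        if best_errors is None or errors < best_errors:
            best_errors, best_threshold = errors, threshold
    return best_threshold


def bagged_predict(data, x, n_models=25, seed=0):
    """Train n_models stumps on bootstrap samples and return the majority vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        # Bootstrap: draw len(data) points with replacement.
        sample = [rng.choice(data) for _ in data]
        votes.append(x > fit_stump(sample))
    return Counter(votes).most_common(1)[0][0]


# Hypothetical 1-D data: label is True for the larger values.
data = [(1, False), (2, False), (3, False), (4, False),
        (6, True), (7, True), (8, True), (9, True)]
low = bagged_predict(data, 0.5)
high = bagged_predict(data, 9.5)
```

For scores or probabilities, the vote would be replaced by an average over the models.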
  • 16. Demo • RandomForest • Each tree is trained on n training samples drawn out of N; at each decision node of the tree, it randomly selects m input features from the total M input features (m ~ M^0.5) and chooses the split among them. Finally, each tree in the forest votes for the result. • Evaluation • Apply a loss function to margins (penalize misclassification, reward positive margins)
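The per-node feature sampling (m ~ M^0.5) is the part that distinguishes a random forest from plain bagging of trees. A minimal sketch of just that step follows; the function name `features_for_split` is hypothetical, and a real implementation would then search for the best split among the returned features.

```python
import math
import random


def features_for_split(n_features, rng):
    """Randomly choose about sqrt(M) of the M feature indices for one tree node."""
    m = max(1, round(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), m))  # sample without replacement


rng = random.Random(0)
picked = features_for_split(16, rng)  # 4 distinct indices out of 0..15
```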
  • 17. Regression • Explain the relationship between variables (dependent vs. independent) • Linear – y = W0 + W1x1 + W2x2 + … (simple linear regression is the one-input case, y = W0 + W1x1) • Estimate the weights to predict y • Multivariate
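For the one-input case, estimating the weights has a closed form (ordinary least squares). The sketch below assumes that setting; `fit_line` and the sample data are illustrative names and values, not taken from the talk's demos.

```python
def fit_line(xs, ys):
    """Least-squares estimates of W0 (intercept) and W1 (slope) for y = W0 + W1*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    w1 = covariance / variance
    w0 = mean_y - w1 * mean_x
    return w0, w1


# Hypothetical data generated from y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
w0, w1 = fit_line(xs, ys)
```

With more than one input the same idea applies, but the weights are usually found with a linear algebra routine rather than by hand.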
  • 18. Demos • Excel • SimpleLinear -R • RandomForest – Wine • Evaluate by applying loss function to residuals
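"Applying a loss function to residuals" can be made concrete with two common choices, root mean squared error and mean absolute error. The function names and the numbers below are assumptions for illustration.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error: penalizes large residuals more heavily."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    return sqrt(sum(r * r for r in residuals) / len(residuals))


def mae(actual, predicted):
    """Mean absolute error: every unit of residual counts the same."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical values: residuals are 1, 0, -1, 0.
actual = [3, 5, 7, 9]
predicted = [2, 5, 8, 9]
error_rmse = rmse(actual, predicted)
error_mae = mae(actual, predicted)
```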
  • 19. What to measure • Data • Cross Validation • n-fold cross-validation • Leave-one-out validation • Hold out • Eod – how much data is enough, is there bias in data (only certain kind of labels) • Model Results • Contingency table (false negatives & false positives are bad) • ROC & AUC (coverage curve) (true positives vs. false positives) • Precision/Recall (from search world) • F-measure • Lift (not interested in accuracy on entire dataset, want for 5%, 10% of dataset)
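n-fold cross-validation and leave-one-out differ only in the number of folds. A minimal sketch of the index bookkeeping, with the hypothetical helper `k_fold_indices`, might look like this:

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, 3))         # 3-fold: test folds of size 4, 3, 3
leave_one_out = list(k_fold_indices(5, 5))  # k == n gives leave-one-out
```

In practice the rows are usually shuffled first, so that the folds do not follow the order in which the data was collected.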
  • 20. Is Model working right
                   Predicted +ve   Predicted -ve   Total
      Actual +ve        40              15           55
      Actual -ve         5              40           45
      Total             45              55          100
    • Precision = 40/45 • Recall = 40/55 • F measure (harmonic mean) = 2/((1/prec) + (1/rec)) • Accuracy = (TP (40) + TN (40)) / (40+15+5+40) = 80/100 • How much accuracy is enough • Lift – how much better than random guessing • Lift and accuracy are not necessarily correlated
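The numbers on this slide can be checked with a few lines of Python. Accuracy counts true positives plus true negatives over all 100 cases. The lift line is an assumption about which definition is meant: here it is the positive rate among predicted positives divided by the overall positive rate.

```python
# Counts from the contingency table on the slide.
tp, fn = 40, 15  # actual +ve: predicted +ve / predicted -ve
fp, tn = 5, 40   # actual -ve: predicted +ve / predicted -ve
total = tp + fn + fp + tn

precision = tp / (tp + fp)          # 40/45
recall = tp / (tp + fn)             # 40/55
f_measure = 2 / ((1 / precision) + (1 / recall))  # harmonic mean
accuracy = (tp + tn) / total        # 80/100

# Assumed definition of lift: how much richer in positives the predicted-positive
# group is than the dataset as a whole (i.e. than random guessing).
base_rate = (tp + fn) / total       # 55/100
lift = precision / base_rate

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"F={f_measure:.3f} accuracy={accuracy:.3f} lift={lift:.3f}")
```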
  • 21. Challenge with Model • Overfitting • Aim for low bias and low variance • Use Regularization • L1 (Lasso) • L2 (Ridge) • If time permits show the alpha effect • Look for “overfitting model”, “bias and variance”
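The "alpha effect" can be seen in the simplest possible setting: one feature, no intercept. Under that assumption the ridge (L2) and lasso (L1) weights have closed forms, for the objectives 0.5*sum((y - w*x)^2) + 0.5*alpha*w^2 and 0.5*sum((y - w*x)^2) + alpha*|w| respectively. The function names and data are illustrative only.

```python
import math


def ridge_weight(xs, ys, alpha):
    """L2 penalty: the weight shrinks smoothly towards zero as alpha grows."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / (sxx + alpha)


def lasso_weight(xs, ys, alpha):
    """L1 penalty: soft-thresholding, the weight becomes exactly zero for large alpha."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    shrunk = max(abs(sxy) - alpha, 0.0)
    return math.copysign(shrunk, sxy) / sxx


# Hypothetical data from y = 2x, so the unregularized weight is 2.
xs = [1, 2, 3]
ys = [2, 4, 6]
for alpha in (0, 14, 28, 40):
    print(alpha, ridge_weight(xs, ys, alpha), lasso_weight(xs, ys, alpha))
```

Ridge never quite reaches zero, while lasso drops the feature entirely once alpha is large enough, which is why lasso is also used for feature selection.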
  • 22. Challenge with Data • Categorical, ordinal, quantitative • Measures – mean, median, variance, std deviation, range, shape (skewness) • Always observe to get “feel”/smell of data • Discretize/Thresholding (convert quantitative feature) • Missing feature(s) – • What do you do – median, avg • Data encoding • Create new from existing vs encode in different way
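Two of the steps on this slide, filling missing values with the median and discretizing a quantitative feature, can be sketched with the standard library. The helper names, the `None` marker for missing values, and the age bands are assumptions for illustration.

```python
from statistics import median


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def discretize(value, thresholds, labels):
    """Map a number to a band; labels has one more entry than thresholds."""
    for threshold, label in zip(thresholds, labels):
        if value < threshold:
            return label
    return labels[-1]


# Hypothetical ages with two missing entries.
ages = [25, None, 40, 31, None, 58]
filled = impute_median(ages)  # median of 25, 31, 40, 58 is 35.5
bands = [discretize(a, [30, 50], ["young", "middle", "senior"]) for a in filled]
```

The median is less sensitive to outliers than the average, which matters when the feature is skewed.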
  • 23. Feature engineering • Feature selection • Intuition, testing correlation • Subset (start small and increase) based on some error function • Feature extraction • New k dimensions – as combination of older d dimensions • Linear • PCA (find the variance by projecting – explains impact of outliers) • LDA (supervised method for dimension reduction for classification) • FA (Factor Analysis), Multidimensional Scaling (distance between points) • IsoMap (geodesic distance) and Locally Linear Embedding (LLE), both of which are non-linear
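"Testing correlation" as a first pass at feature selection can be sketched by ranking features by the absolute Pearson correlation with the target. The feature names and values below are made up for the example, and the helper assumes no feature is constant (which would give zero variance).

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical features and target.
target = [1, 2, 3, 4, 5]
features = {
    "income": [2, 4, 6, 8, 10],    # perfectly correlated with the target
    "shoe_size": [3, 1, 4, 1, 5],  # weakly correlated
}
ranked = sorted(features, key=lambda name: abs(pearson(features[name], target)), reverse=True)
```

Correlation only captures linear, one-feature-at-a-time relationships, so it complements rather than replaces the subset search based on an error function.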
  • 24. What we could not cover • Mechanisms • Reinforcement Learning (punishment/rewards to learn better) • Algorithm types • Perceptron (backpropagation, SOM, ..) • SVM • LDA and friends for unstructured world • Regression (OLS, logistic, stepwise, MARS) • Regularization (ridge/lasso) • Trees (GBM, C4.5, ID3…) • Bayesian • Kernel (radial) • Deep learning (DBN, Boltzmann..) • Clustering (Expectation Maximization) • Recommendation • Probability (distributions) & Linear Algebra • Constraint Solving and Optimization (Solver, OpenSolver..)
  • 25. Tools • R • Scikit • Theano • Weka • KNIME • Recommender (.net….) • DataTau • BigML • WiseIO • Skytree • SAS/SPSS • YHatr
  • 26. Books • Bishop • Alpaydin • John Foreman • PyMC – Search query (Bayesian-Methods-for-Hackers) • Scikit – • jakevdp – “scikit jake 2014 tutorial” • Olivier – “scikit olivier grisel tutorial” • Recommender (http://mymedialite.net/) – Zeno Gantner
  • 27. What you will be doing • Data • Touch/feel (visualize), breathe it in • Cleaning, scaling/normalization • Selecting • Algorithm (choose the task) • Classification • Regression • Ranking (recommendation, search results) • Amongst • Evaluate algorithms against each other & refine/calibrate • AUC, ROC, RMSE etc…
  • 28. If time & net permits Yhatr demo • Because you need to deploy,test & use the model • Yhatr provides good host (theirs and host your own)
  • 29. Thanks for your time • Please fill the evaluation form • See you next time