How To Interpret Model Performance With Cost Functions
A publication of
Introduction
Cost functions are directly related to the performance of data mining and
predictive models. They matter because there are many ways to design a
machine learning algorithm and many ways to interpret its performance.
This cost function series will help the analyst understand what underpins
each algorithm and what each one is useful for. We go deep
into the statistical and mathematical properties of each cost
function and explore their similarities and differences.
We hope you will have a deeper understanding of a variety of cost functions
available for classification and regression models, and be able to interpret
your models from an expert point of view.
Topics to Cover
Cost Functions for Regression Problems:
Least Squares Deviation
Least Absolute Deviation
Huber-M Cost

Cost Functions for Classification Problems:
Precision and Recall
Measuring Performance with ROC Curve
Gains and Lift
Logistic Function and Cost
Multinomial Classification - Expected Cost
Multinomial Classification - Log Likelihood
Understanding Cost Functions
The Supervised Learning problem:
1. Collection of n p-dimensional feature vectors: {xi}, i = 1, ..., n
2. Collection of observed responses: {yi}, i = 1, ..., n

Goal: Construct a response surface (hypothesis): h(x)
Cost Functions describe how well a response surface h(x) fits the
available data, in essence, the goodness of fit (on a given data set): J(yi, h(xi))

Things to keep in mind:
• Smaller values of the cost function correspond to a better fit
• Machine learning goal: construct h(x) such that J is minimized
• In regression, h(x) is usually directly interpretable as the predicted
  response
Least Squares Deviation Cost for
Regression Problems
Defined as:

J = (1/n) Σ_{i=1..n} (yi − h(xi))^2

The error between what is observed and what is predicted
(the difference between yi and h(xi)) is the basic
element of the LSD cost function.
Also known as ‘Mean Squared Error’.

Advantage: LSD has nice mathematical and statistical
properties.
Disadvantage: LSD is sensitive to outliers, because squaring
magnifies large residuals.
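As a rough illustration of the outlier issue, the sketch below computes the LSD cost as a mean of squared residuals. The function name `lsd_cost` and the toy numbers are made up for this example and do not come from the slides.

```python
def lsd_cost(y, h):
    """Mean of squared residuals between observed y and predicted h."""
    if len(y) != len(h):
        raise ValueError("y and h must have the same length")
    return sum((yi - hi) ** 2 for yi, hi in zip(y, h)) / len(y)


# Toy data (hypothetical): every prediction is off by exactly 1.
predicted = [2.0, 4.0, 6.0, 8.0]
observed = [3.0, 5.0, 7.0, 9.0]
clean_cost = lsd_cost(observed, predicted)

# Same data, but the last observation is an outlier (residual of 11).
observed_outlier = [3.0, 5.0, 7.0, 19.0]
outlier_cost = lsd_cost(observed_outlier, predicted)

print(clean_cost, outlier_cost)
```

A single unusual observation moves the cost from 1.0 to 31.0 here, because its residual enters the sum squared.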
Least Absolute Deviation for
Regression Problems
Defined as:

J = (1/n) Σ_{i=1..n} |yi − h(xi)|

More robust to outliers than LSD (it reduces
the impact of outliers by using the absolute value of the residual instead of
its square).
The absolute value term in LAD can cause some
computational challenges, because it is not differentiable at zero.
A third type of cost function combines these
two (Least Squares Deviation and Least Absolute Deviation)
into an intermediate form, known as Huber-M cost.
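One way to see the robustness of LAD is to fit a single constant to data that contains an outlier: the constant that minimizes the mean absolute deviation is the median, while the mean minimizes the squared version. The sketch below checks this on made-up numbers; `lad_cost` is a hypothetical helper name.

```python
import statistics


def lad_cost(y, h):
    """Mean of absolute residuals between observed y and predicted h."""
    if len(y) != len(h):
        raise ValueError("y and h must have the same length")
    return sum(abs(yi - hi) for yi, hi in zip(y, h)) / len(y)


# Toy responses (hypothetical) with one large outlier.
values = [1.0, 2.0, 3.0, 4.0, 100.0]

median_fit = statistics.median(values)  # barely moved by the outlier
mean_fit = statistics.mean(values)      # pulled toward the outlier

cost_at_median = lad_cost(values, [median_fit] * len(values))
cost_at_mean = lad_cost(values, [mean_fit] * len(values))

print(median_fit, mean_fit)
print(cost_at_median, cost_at_mean)
```

The median stays near the bulk of the data, and predicting it gives a lower LAD cost than predicting the mean.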
Huber-M Costs for Regression
Problems
Defined as:

J = (1/n) Σ_{i=1..n} L(yi − h(xi)), where
L(r) = r^2 / 2 if |r| ≤ δ
L(r) = δ (|r| − δ/2) if |r| > δ

Combines the best qualities of LSD and LAD losses.
When residuals are small, Huber-M uses the quadratic penalty (LSD).
When a residual exceeds a certain threshold δ, Huber-M switches
to the linear penalty (LAD), so large residuals are penalized
linearly rather than quadratically.
Parameter δ is usually set automatically to a specific percentile of
absolute residuals.
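The switch between the two penalties can be sketched as follows. This uses the common textbook form of the Huber loss (quadratic term scaled by 1/2 so the two pieces meet at δ), which is an assumption about the exact scaling. The helper `delta_from_percentile` and its nearest-rank rule are also hypothetical; the slides do not say which percentile or method is used.

```python
def huber_loss(residual, delta):
    """Quadratic penalty for |residual| <= delta, linear penalty beyond it."""
    size = abs(residual)
    if size <= delta:
        return 0.5 * residual ** 2
    return delta * (size - 0.5 * delta)


def huber_cost(y, h, delta):
    """Mean Huber loss over all observations."""
    return sum(huber_loss(yi - hi, delta) for yi, hi in zip(y, h)) / len(y)


def delta_from_percentile(y, h, pct=90):
    """Hypothetical helper: nearest-rank percentile of absolute residuals."""
    abs_residuals = sorted(abs(yi - hi) for yi, hi in zip(y, h))
    rank = max(1, (pct * len(abs_residuals) + 99) // 100)  # integer ceiling
    return abs_residuals[rank - 1]


# Toy data (hypothetical): absolute residuals are 1, 2, ..., 10.
observed = [float(v) for v in range(1, 11)]
predicted = [0.0] * 10

delta = delta_from_percentile(observed, predicted, pct=90)
print(delta, huber_cost(observed, predicted, delta))
```

With this form the loss is continuous at the threshold: a residual exactly equal to δ costs δ²/2 under either branch.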
The Binary Classification Problem
Binary classification: predicting an outcome that can take only two
distinct values (Yes/No, +/–, 0/1)
The observed response y takes only two possible values: + and –
(+ and – are used here by convention)
We need to define a relationship between h(x) and y
Use the decision rule: predict + if h(x) > t, otherwise predict –
The decision rule is always governed by a user-defined threshold, t. When the
response exceeds this threshold, a positive prediction is made.
Otherwise, a negative prediction is made for the response variable.
The Binary Classification Problem
Summary:
When constructing a response surface, a threshold
needs to be introduced to make predictions.
Classification of positives and negatives will depend on
what threshold is designated.
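A minimal sketch of the thresholding step, assuming the model produces a numeric score h(x) for each observation. The function name `classify` and the scores are invented for illustration.

```python
def classify(score, threshold):
    """Predict '+' when the score exceeds the threshold, '-' otherwise."""
    return "+" if score > threshold else "-"


# Hypothetical model scores h(x) for four observations.
scores = [0.1, 0.4, 0.6, 0.9]

low_threshold = [classify(s, 0.3) for s in scores]
high_threshold = [classify(s, 0.7) for s in scores]

print(low_threshold)
print(high_threshold)
```

The same scores yield three positive predictions at t = 0.3 and only one at t = 0.7, which is why every measure that follows is stated for a given threshold.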
Evaluating Prediction Success with
Precision and Recall
Evaluate performance by focusing on how well we capture the
“+” group (assumed to be the events of interest) for a given
threshold.
After running a model, a prediction success table (also known
as a ‘confusion matrix’) can be created:

The table can contain four outcomes: true-positive (tp), false-positive
(fp), false-negative (fn), true-negative (tn).
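The four outcomes can be counted directly from paired actual and predicted labels. The sketch below is one simple way to do it; `confusion_counts` and the label lists are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Count tp, fp, fn, tn for labels given as '+' or '-'."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if p == "+" and a == "+":
            counts["tp"] += 1
        elif p == "+" and a == "-":
            counts["fp"] += 1
        elif p == "-" and a == "+":
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


# Toy labels (hypothetical).
actual = ["+", "+", "-", "-", "+", "-"]
predicted = ["+", "-", "+", "-", "+", "-"]

table = confusion_counts(actual, predicted)
print(table)
```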
Evaluating Prediction Success with
Precision and Recall
Measure of Success #1: Precision
The ratio of true-positives to the sum of true-positives and false-positives:
Precision = tp / (tp + fp)

Precision focuses on the group of positive predictions: what fraction
of that group is actually positive.
For instance, if a model flags 1,000 transactions as suspected fraud, precision
is the fraction of those flagged transactions that are actually fraudulent.
If precision is 0.5, about 500 of the flagged transactions would be expected
to be fraudulent.
Evaluating Prediction Success with
Precision and Recall
Measure of Success #2: Recall (Sensitivity)
The ratio of true-positives to the sum of true-positives and false-negatives:
Recall = tp / (tp + fn)

Keeping with our fraud example, we know we have captured some
fraudulent transactions in the group predicted as fraud. Recall tells us what
fraction of all the fraudulent transactions that exist were captured.
Precision and recall are related: both change as the threshold is
varied.

In an ideal case, precision = 1.0 and recall = 1.0 (meaning there are no
misclassifications). If precision is 1.0, false-positives are equal to zero. If
recall is 1.0, false-negatives are equal to zero.
Unfortunately, precision and recall usually cannot both be maximized at once:
improving one tends to come at the expense of the other.
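The trade-off can be seen by computing both measures at two thresholds on the same scores. This is a small sketch on made-up data; `precision_recall` is a hypothetical helper, and returning 0.0 for an undefined ratio is a convention chosen only for this example.

```python
def precision_recall(scores, actual, threshold):
    """Return (precision, recall) for the rule: predict positive if score > threshold.

    `actual` holds 1 for the "+" class and 0 for the "-" class.
    """
    tp = sum(1 for s, a in zip(scores, actual) if s > threshold and a == 1)
    fp = sum(1 for s, a in zip(scores, actual) if s > threshold and a == 0)
    fn = sum(1 for s, a in zip(scores, actual) if s <= threshold and a == 1)
    # Convention for this sketch: report 0.0 when a ratio is undefined.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy scores and true classes (hypothetical).
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.20]
actual = [1, 1, 0, 1, 0, 0]

at_low = precision_recall(scores, actual, 0.5)
at_high = precision_recall(scores, actual, 0.8)

print(at_low)
print(at_high)
```

Raising the threshold from 0.5 to 0.8 removes the one false positive (precision rises from 0.75 to 1.0) but also drops a true positive (recall falls from 1.0 to 2/3).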
Measuring Performance with the
ROC Curve
Receiver Operating Characteristic (ROC) Curve:
A curve that characterizes the performance of the classifier in general as we
sweep over the range of different thresholds.
ROC measures how well you capture positives, negatives, and the balance
between these two response groups.
The performance of any binary classifier can be represented by an ROC
Curve.
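Sweeping the threshold can be sketched as below. Each threshold gives one point with the false-positive rate (fraction of negatives wrongly flagged) on one axis and the true-positive rate (recall) on the other, which is the usual ROC convention. The function name `roc_points` and the data are hypothetical, and the sketch assumes both classes are present.

```python
def roc_points(scores, actual):
    """Return (false-positive rate, true-positive rate) pairs, one per threshold.

    `actual` holds 1 for the "+" class and 0 for the "-" class.
    """
    positives = sum(actual)
    negatives = len(actual) - positives
    # Each distinct score is used as a threshold, plus one below all scores.
    thresholds = sorted(set(scores), reverse=True) + [float("-inf")]
    points = []
    for t in thresholds:
        tp = sum(1 for s, a in zip(scores, actual) if s > t and a == 1)
        fp = sum(1 for s, a in zip(scores, actual) if s > t and a == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Toy scores and true classes (hypothetical).
scores = [0.9, 0.8, 0.7, 0.6]
actual = [1, 0, 1, 0]

curve = roc_points(scores, actual)
print(curve)
```

The curve starts at (0, 0), where nothing is predicted positive, and ends at (1, 1), where everything is.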
Measuring Performance with Gains
and Lift
If you have a good model, for instance in a direct marketing campaign,
people who received a higher score are more likely to respond. Therefore, when
you identify the target group, you pick the people who
were scored the highest. This is the underlying framework of gains and lift.

The plot of sensitivity versus support is called the Gains curve.
What are the optimal gains that can be achieved? That question involves the
concept of base rate, which is the number of true-positives
plus false-negatives divided by the sample size.
So in our direct marketing example, if you expect 1% of the people contacted
to respond, then your base rate is 1%.
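A gains curve can be sketched by ranking cases from highest to lowest score and tracking how many positives have been captured. The sketch assumes that "support" means the fraction of all cases selected and that lift is sensitivity divided by support; both are common usages but are assumptions here, as are the function name and the data.

```python
def gains_curve(scores, actual):
    """Return (support, sensitivity) after selecting the top 1, 2, ..., n cases.

    `actual` holds 1 for responders and 0 for non-responders.
    """
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(actual)
    n = len(actual)
    curve = []
    captured = 0
    for k, (_, label) in enumerate(ranked, start=1):
        captured += label
        curve.append((k / n, captured / total_positives))
    return curve


# Toy campaign (hypothetical): 10 people, 3 responders.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
actual = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

curve = gains_curve(scores, actual)
base_rate = sum(actual) / len(actual)

support, sensitivity = curve[1]      # after selecting the top 2 people
lift_at_top_2 = sensitivity / support

print(base_rate, support, sensitivity, lift_at_top_2)
```

A perfect ranking would capture every responder once support equals the base rate (0.3 here), which is the best gains curve achievable. This toy model needs support 0.4 to get there.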
Direct Interpretation of Response
Using Logistic Function
Instead of taking the predicted response of the function as a scored value,
we focus on the direct interpretation of the probability of a positive
outcome.
Define: p= Prob (y= “+”) where p is the probability that y has a positive
outcome. The positive outcome is the event in focus (e.g. positive
responder to direct marketing campaign)
Probability has a characteristic of always being defined between 0 and 1.
However, probability can be converted to log odds, which no longer has
that constraint.

Define: Log-odds h = log(p / (1 − p))

Log-odds can be positive or negative, which makes it convenient for
modeling.
Probability can be expressed in terms of log-odds:
p = 1 / (1 + e^(−h))
When probability is plotted in terms of log-odds, it maps the
entire positive and negative infinity range of values to an interval
between 0 and 1 that can be interpreted as a probability space.
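The two directions of this conversion can be checked numerically. In the sketch below, `log_odds` and `probability` are illustrative names for the logit and logistic functions, with h denoting the log-odds.

```python
import math


def log_odds(p):
    """Convert a probability strictly between 0 and 1 to log-odds."""
    return math.log(p / (1 - p))


def probability(h):
    """Convert log-odds h back to a probability with the logistic function."""
    return 1 / (1 + math.exp(-h))


# A few log-odds values spanning negative to positive.
for h in [-4.0, -1.0, 0.0, 1.0, 4.0]:
    print(h, round(probability(h), 4))

round_trip = probability(log_odds(0.8))
print(round_trip)
```

Log-odds of 0 corresponds to a probability of 0.5, and large negative or positive log-odds approach 0 and 1 without reaching them.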
Probability can be converted to log-odds and vice versa. The
following graph establishes the nature of this transformability:
Multinomial Classification - Expected Cost
Multinomial classification means working with more than two
classes. Instead of trying to predict one of two classes,
you are working with k classes.
This is the most general form of the classification problem.

One approach, known as expected cost, focuses on how well a classification
model assigns observations to one class or another.
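The slides do not give a formula here, so the following is only one common formulation, stated as an assumption: weight each cell of a k-by-k prediction success table by a user-supplied misclassification cost and average over all observations. The function name, the table, and the cost values are all invented for illustration.

```python
def expected_cost(confusion, cost):
    """Average cost per observation.

    confusion[i][j]: number of cases with true class i predicted as class j.
    cost[i][j]: cost of predicting class j when the true class is i.
    """
    total = sum(sum(row) for row in confusion)
    weighted = sum(
        confusion[i][j] * cost[i][j]
        for i in range(len(confusion))
        for j in range(len(confusion[i]))
    )
    return weighted / total


# Toy 3-class table (hypothetical), 100 observations in total.
confusion = [
    [50, 5, 5],
    [10, 20, 0],
    [0, 2, 8],
]

# Unit costs: every misclassification costs 1, correct predictions cost 0.
unit_cost = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]

# Unequal costs (hypothetical): some mistakes are more expensive than others.
custom_cost = [[0, 1, 5], [1, 0, 1], [2, 1, 0]]

print(expected_cost(confusion, unit_cost))
print(expected_cost(confusion, custom_cost))
```

With unit costs the result is simply the misclassification rate. Unequal costs let the analyst express that some errors matter more than others.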
Multinomial Classification - Log Likelihood
The log-likelihood cost function imposes a
stricter standard: we are interested not only in
the model's classification performance, but also in the actual predicted
probabilities of the response classes.
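To illustrate why this is stricter, the sketch below compares two hypothetical models that predict the same classes but with different confidence. It uses the average negative log of the probability assigned to the true class, a standard form of the log-likelihood cost. The exact scaling used in the slides is not shown, so treat it as an assumption.

```python
import math


def neg_log_likelihood(actual, probs):
    """Average negative log-probability assigned to the true class.

    actual[i]: index of the true class for observation i.
    probs[i]: list of predicted class probabilities for observation i.
    """
    return -sum(math.log(p[y]) for y, p in zip(actual, probs)) / len(actual)


# Three observations, one from each of k = 3 classes (hypothetical).
actual = [0, 1, 2]

# Both models pick the correct class every time, with different confidence.
confident = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
hesitant = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4]]

cost_confident = neg_log_likelihood(actual, confident)
cost_hesitant = neg_log_likelihood(actual, hesitant)

print(cost_confident, cost_hesitant)
```

Both models would have identical prediction success tables, yet the log-likelihood cost favors the one whose predicted probabilities are closer to the observed outcomes.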
Conclusion
There are many different ways to
evaluate the performance of a
classification model.
In the end, it depends on the type of
data on hand and the goals that are
in mind. There are a host of
evaluation techniques available but
it is up to you, the data analyst, to
decide what is ultimately relevant.
