Credibility: Evaluating What's Been Learned
Training and Testing
We measure the success of a classification procedure by its error rate (or the equivalent success rate). Measuring the success rate on the training set is highly optimistic; the error rate on the training set is called the resubstitution error. We therefore use a separate test set to estimate the true error rate, and the test set must be independent of the training set. Sometimes a third set, the validation set, is used to tune the classification technique. Holding out part of the data for testing, so that it is not used for training, is called the holdout procedure.
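A minimal sketch of the holdout procedure, assuming Python with scikit-learn; the dataset, classifier, and split ratio are illustrative choices, not something the slides prescribe:

```python
# Holdout sketch: split data into independent training and test sets,
# then compare resubstitution error with test-set error.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Hold out one third of the data for testing; it is never used for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=0, stratify=y)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("resubstitution error:", 1 - model.score(X_train, y_train))  # optimistic
print("test error:          ", 1 - model.score(X_test, y_test))    # realistic
```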
Predicting performance
Expected success rate = 100 − error rate (when the error rate is also a percentage). What we really want is the true success rate. Suppose the observed success rate is f = s/n, where s is the number of successes out of a total of n test instances. For large n, f follows a normal distribution. We can then bound the true success rate p at a chosen confidence level. For example, if f = 75%, then p lies in [73.2%, 76.7%] with 80% confidence.
Predicting performance
From basic statistics we know that the mean of f is p and its variance is p(1 − p)/n. To use the standard normal distribution we transform f to have mean 0 and standard deviation 1. Suppose the confidence is c% and we want to calculate p: we use the two-tailed property of the normal distribution, and since the total area under the curve is taken as 100%, the area left out in the two tails is 100 − c.
Predicting performance
After the algebraic manipulation, the bounds on the true success rate are:

p = ( f + z²/(2N) ± z·√( f/N − f²/N + z²/(4N²) ) ) / ( 1 + z²/N )

Here p is the true success rate, f the observed success rate, N the number of instances, and z the factor read from a normal distribution table for the 100 − c tail area.
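As a sketch, the interval can be computed directly; assuming Python with scipy, the 1000-instance example below reproduces the [73.2%, 76.7%] bound quoted earlier:

```python
# Confidence interval for the true success rate p, given an observed
# success rate f over N test instances (normal approximation).
import math
from scipy.stats import norm

def true_success_interval(f, N, confidence=0.80):
    # Two-tailed: the (100 - c)% left out is split equally between both tails.
    z = norm.ppf(1 - (1 - confidence) / 2)
    centre = f + z * z / (2 * N)
    spread = z * math.sqrt(f / N - f * f / N + z * z / (4 * N * N))
    denom = 1 + z * z / N
    return (centre - spread) / denom, (centre + spread) / denom

# f = 75% over 1000 instances at 80% confidence -> roughly (0.732, 0.767)
print(true_success_interval(0.75, 1000, 0.80))
```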
Cross validation
We use cross-validation when the amount of data is small and we still need independent training and test sets from it. It is important that each class is represented in its actual proportions in both sets; this is called stratification. The standard technique is stratified 10-fold cross-validation: the instances are divided into 10 folds, and in each of 10 iterations a different fold is used for testing while the remaining 9 are used for training; the errors of the 10 iterations are averaged. Problem: computationally intensive.
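A sketch of stratified 10-fold cross-validation, assuming Python with scikit-learn; the dataset and classifier are arbitrary illustrations:

```python
# Stratified 10-fold cross-validation: each fold preserves the class
# proportions; errors from the 10 iterations are averaged.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
errors = []
for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True,
                                           random_state=0).split(X, y):
    model = GaussianNB().fit(X[train_idx], y[train_idx])
    errors.append(1 - model.score(X[test_idx], y[test_idx]))
print("estimated error rate:", np.mean(errors))
```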
Other estimates
Leave-one-out, steps: one instance is left out for testing and the rest are used for training; this is repeated for every instance and the errors are averaged. Advantage: the largest possible training sets are used. Disadvantages: computationally intensive, and it cannot be stratified.
Other estimates
0.632 bootstrap: a dataset of n instances is sampled n times, with replacement, to give another dataset of n instances. Some instances are repeated in the second set, and the instances never drawn (about 36.8% of them) form the test set. The error is defined as: e = 0.632 × (error on test instances) + 0.368 × (error on training instances).
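A sketch of one bootstrap iteration under these definitions, assuming Python with numpy and scikit-learn (dataset and classifier again illustrative):

```python
# 0.632 bootstrap: sample n instances with replacement for training;
# the instances never drawn form the test set.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
n = len(X)
rng = np.random.default_rng(0)
boot = rng.integers(0, n, size=n)          # indices drawn with replacement
test = np.setdiff1d(np.arange(n), boot)    # ~36.8% of instances never drawn

model = GaussianNB().fit(X[boot], y[boot])
e_test = 1 - model.score(X[test], y[test])
e_train = 1 - model.score(X[boot], y[boot])
e = 0.632 * e_test + 0.368 * e_train       # combined error estimate
print("0.632 bootstrap error:", e)
```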
Comparing data mining methods
So far we have dealt with performance prediction; now we look at methods for comparing algorithms, to see which one did better. We can't directly compare error rates, because they may have been calculated on different data sets. To compare algorithms we need a statistical test. We use Student's t-test, which tells us whether the mean errors of two algorithms differ at a given confidence level.
Comparing data mining methods
We use the paired t-test, a slight modification of Student's t-test. Paired t-test, assuming effectively unlimited data: draw k data sets; run cross-validation with each technique to get the respective outcomes x1, x2, …, xk and y1, y2, …, yk; let mx be the mean of the x values and my the mean of the y values, and let di = xi − yi. The t-statistic is

t = md / √(σd² / k)

where md is the mean of the differences di and σd² is their variance.
Comparing data mining methods
The value of k gives the degrees of freedom (k − 1), from which we look up the critical value z for the chosen confidence level. If t ≤ −z or t ≥ z, the two means differ significantly; otherwise we cannot reject the null hypothesis that the two means are the same.
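A sketch of the whole procedure on made-up per-dataset error rates, assuming Python with scipy; scipy's built-in paired test is shown as a cross-check:

```python
# Paired t-test on per-dataset error rates x_i and y_i of two schemes;
# the numbers below are invented for illustration.
import numpy as np
from scipy import stats

x = np.array([0.12, 0.15, 0.10, 0.14, 0.11, 0.13, 0.16, 0.12, 0.14, 0.13])
y = np.array([0.14, 0.16, 0.12, 0.15, 0.13, 0.14, 0.18, 0.13, 0.15, 0.15])
d = x - y
k = len(d)
t = d.mean() / np.sqrt(d.var(ddof=1) / k)   # t-statistic with k-1 dof
z = stats.t.ppf(0.995, df=k - 1)            # two-tailed, 99% confidence
print("t =", t, "critical value =", z)
print("differ significantly" if abs(t) >= z
      else "cannot reject the null hypothesis")

# scipy's built-in paired test gives the same statistic:
print(stats.ttest_rel(x, y))
```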
Predicting Probabilities
So far we have considered schemes whose predictions are simply correct or incorrect; this corresponds to the 0–1 loss function. Now we deal with measuring success for algorithms that output a probability distribution, e.g. Naïve Bayes.
Predicting Probabilities
Quadratic loss function: for a single instance there are k outcomes, or classes, with predicted probability vector p1, p2, …, pk. The actual outcome vector is a1, a2, …, ak, where the component for the actual outcome is 1 and the rest are 0. We minimize the quadratic loss function

Σj (pj − aj)²

The minimum is achieved when the probability vector is the true probability vector.
Predicting Probabilities
Informational loss function: given by −log2(pi), where i is the actual class; the minimum is again reached at the true probabilities. Differences between quadratic and informational loss: quadratic loss takes all the class probabilities into account, while informational loss depends only on the probability assigned to the actual class; quadratic loss is bounded, with a maximum value of 2, while informational loss is unbounded and can grow to infinity.
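A sketch comparing the two losses on one instance, assuming Python with numpy and a made-up three-class probability vector:

```python
# Quadratic vs informational loss on a single instance with three classes;
# the true class is the second one.
import numpy as np

p = np.array([0.2, 0.7, 0.1])     # predicted probability vector (invented)
a = np.array([0.0, 1.0, 0.0])     # actual outcome vector

quadratic_loss = np.sum((p - a) ** 2)          # bounded above by 2
informational_loss = -np.log2(p[a == 1.0][0])  # unbounded as p -> 0
print(quadratic_loss, informational_loss)      # 0.14  ~0.515
```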
Counting the cost
Different outcomes may have different costs. For example, in a loan decision the cost of lending to a defaulter is far greater than the lost-business cost of refusing a loan to a non-defaulter. For a two-class prediction the outcomes are: true positive (yes correctly predicted as yes), false negative (yes predicted as no), false positive (no predicted as yes), and true negative (no correctly predicted as no).
Counting the cost
True positive rate: TP/(TP + FN). False positive rate: FP/(FP + TN). Overall success rate: number of correct classifications / total number of classifications. Error rate = 1 − success rate. In the multiclass case we have a confusion matrix; the slide compares an actual predictor's matrix with that of a random predictor.
Counting the cost
These are the actual and the random outcomes of a three-class problem; the diagonal represents the successfully classified cases. Kappa statistic = (D-observed − D-chance) / (D-perfect − D-chance). Here kappa = (140 − 82)/(200 − 82) ≈ 49.2%. Kappa measures the agreement between the predicted and observed categorizations of a dataset while correcting for agreement that occurs by chance. It does not take cost into account.
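A sketch of these quantities, assuming Python; the two-class counts are invented, while the kappa numbers are the ones from this slide:

```python
# Rates from a two-class confusion matrix (invented counts), and the kappa
# statistic for the three-class example above (140 observed correct,
# 82 expected by chance, 200 instances in total).
TP, FN, FP, TN = 60, 10, 20, 110
print("TP rate:", TP / (TP + FN))
print("FP rate:", FP / (FP + TN))
print("success rate:", (TP + TN) / (TP + FN + FP + TN))

d_observed, d_chance, d_perfect = 140, 82, 200
kappa = (d_observed - d_chance) / (d_perfect - d_chance)
print("kappa:", kappa)   # ~0.492, i.e. 49.2%
```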
Classification with costs
Example cost matrices: with 0 on the diagonal and 1 everywhere else, a cost matrix simply counts errors. Instead of success rate we measure average cost per prediction, and we try to minimize it. Expected cost of predicting a class: the dot product of the vector of class probabilities with the appropriate column of the cost matrix.
Classification with costs
Steps to take cost into account at prediction time: first use a learning method that yields a probability vector (such as Naïve Bayes); then multiply the probability vector by each column of the cost matrix in turn, giving an expected cost for each class/column; finally select the class with the minimum expected cost (or the maximum, if the matrix encodes benefits).
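A sketch of this procedure, assuming Python with numpy; the probability vector and cost matrix are invented for illustration:

```python
# Cost-sensitive prediction: the expected cost of predicting each class is
# the dot product of the probability vector with that class's column of the
# cost matrix.
import numpy as np

probs = np.array([0.7, 0.3])      # P(good), P(bad), e.g. from Naive Bayes
cost = np.array([[0, 1],          # rows: actual class, columns: predicted
                 [5, 0]])         # predicting "good" for a "bad" costs 5

expected_costs = probs @ cost     # one expected cost per predicted class
print(expected_costs)             # [1.5, 0.7]
print("predict class", np.argmin(expected_costs))
```

Note how the minimum-expected-cost class here is the less probable one: the heavy cost of misclassifying a defaulter outweighs the higher probability of the other class.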
Cost sensitive learning
So far we have included the cost factor only during evaluation; we can also incorporate costs into the learning phase of a method. One way is to change the proportions of instances in the training set to reflect the costs: for example, replicating the instances of a particular class makes the learned model commit fewer errors on that class.
Lift Charts
In practice, costs are rarely known. In marketing terminology the increase in response rate is referred to as the lift factor. We compare probable scenarios to make decisions, and a lift chart allows visual comparison. Example: a promotional mail-out to 1,000,000 households. Mailing to all gives a 0.1% response (1,000 responses); a data mining tool identifies a subset of 100,000 households of which 0.4% respond (400 responses), a lift factor of 4.
Lift Charts
Steps to calculate the lift factor: decide on a sample size; arrange the data in decreasing order of the predicted probability of the chosen (positive) class; compute the sample success proportion = number of positive instances / sample size, and the lift factor = sample success proportion / overall data success proportion. Calculating the lift factor for different sample sizes yields the lift chart.
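A sketch of the calculation, assuming Python with numpy and made-up scores and labels:

```python
# Lift factor: sort by predicted probability of the positive class and
# compare the sample's success proportion with the overall one.
import numpy as np

scores = np.array([0.9, 0.8, 0.75, 0.6, 0.5, 0.4, 0.3, 0.2, 0.15, 0.1])
labels = np.array([1,   1,   0,    1,   0,   0,   1,   0,   0,    0])

order = np.argsort(-scores)            # decreasing predicted probability
data_rate = labels.mean()              # overall success proportion
for size in (2, 5, 10):
    sample_rate = labels[order[:size]].mean()
    print(size, "lift factor:", sample_rate / data_rate)
```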
Lift Charts
A hypothetical lift chart (figure).
Lift Charts
In the lift chart we would like to operate toward the upper left corner. The diagonal line is the curve for random samples, drawn without using the sorted data; any good selection keeps the lift curve above the diagonal.
ROC Curves
ROC stands for receiver operating characteristic. Difference from lift charts: the Y axis shows the percentage of true positives, and the X axis shows the percentage of false positives in the sample. A ROC curve from a single test set is jagged; it can be smoothed by cross-validation.
ROC Curves
A ROC curve (figure).
ROC Curves
Ways to generate ROC curves with cross-validation (refer to the previous figure). First way: obtain the probability estimates from the different folds; sort the data in decreasing order of the probability of the yes class; for each point on the X axis (a given number of no instances), find the corresponding number of yes instances in each fold; average the yes counts over the folds and plot the result.
ROC Curves
Second way: obtain the probability estimates from the different folds and sort as before, but plot a ROC curve for each fold individually and then average the curves.
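A sketch of the pooled ("first way") construction, assuming Python with scikit-learn; the dataset and classifier are illustrative, and sklearn's roc_curve stands in for the manual counting described above:

```python
# Building a ROC curve from cross-validated probabilities of the positive
# ("yes") class: pool the out-of-fold estimates, then sort and plot.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_curve

X, y = load_breast_cancer(return_X_y=True)
# Pooled out-of-fold probability of the positive class over 10 folds.
probs = cross_val_predict(GaussianNB(), X, y, cv=10,
                          method="predict_proba")[:, 1]
fpr, tpr, thresholds = roc_curve(y, probs)  # false vs true positive rates
print(list(zip(fpr, tpr))[:5])
```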
ROC Curves
ROC curves for two schemes (figure).
ROC Curves
Reading the previous ROC curves: for a small, focused sample, use method A; for a large one, use method B; in between, choose between A and B with appropriate probabilities.
Recall – precision curves
For a search query: recall = number of relevant documents retrieved / total number of relevant documents; precision = number of relevant documents retrieved / total number of documents retrieved.
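A sketch with invented counts, assuming Python:

```python
# Recall and precision for a hypothetical query that retrieves 20 documents,
# 15 of them relevant, out of 30 relevant documents in total.
retrieved_and_relevant = 15
relevant_total = 30
retrieved_total = 20

recall = retrieved_and_relevant / relevant_total       # 0.5
precision = retrieved_and_relevant / retrieved_total   # 0.75
print(recall, precision)
```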
A summary
Different measures are used to evaluate the trade-off between false positives and false negatives.
Cost curves
Cost curves plot expected costs directly. Example for the case of uniform costs, i.e. plain error (figure).
Cost curves
Example with non-uniform costs (figure).
Cost curves
C[+|−] is the cost of predicting + when the instance is −; C[−|+] is the cost of predicting − when the instance is +.
Minimum Description Length Principle
The description length is defined as the space required to describe a theory plus the space required to describe the theory's mistakes. Here the theory is the classifier and the mistakes are its errors on the training data. We try to minimize the description length: the MDL theory is the one that compresses the data the most, i.e. to compress a data set we generate a model and then store the model together with its mistakes. We therefore need to compute two things: the size of the model, and the space needed to encode the errors.
Minimum Description Length Principle
The second quantity is easy: just use the informational loss function. For the first we need a method to encode the model. Writing L[T] for the "length" of the theory and L[E|T] for the training set encoded with respect to the theory, we minimize L[T] + L[E|T].
Minimum Description Length Principle
MDL and clustering: the description length of the theory is the number of bits needed to encode the clusters, e.g. the cluster centers; the description length of the data given the theory encodes each instance's cluster membership and its position relative to the cluster, e.g. the distance to the cluster center. This works if the coding scheme uses less code space for small numbers than for large ones.
Visit more self-help tutorials
Pick a tutorial of your choice and browse through it at your own pace. The tutorials section is free, self-guiding and does not involve any additional support. Visit us at www.dataminingtools.net
