Deriving Knowledge from Data at Scale
• Opening Discussion 30 minutes
Review Discussion…
• Hands-On with Decision Trees 30 minutes
• Ensembles, Random Forests 60 minutes
• Data Science Modelling 30 minutes
Model performance evaluation…
• Machine Learning Boot Camp ~60 minutes
Clustering, k-Means…
Deriving Knowledge from Data at Scale
• Optional Reading: Data Science Weekly (2)
• Two Homework Assignments, due next Wednesday
1. One is described in the lecture notes
2. Two is uploaded to the class Catalyst site
• Key Points to Understand, review and discuss
1. Ensembles, the techniques of Bagging and Boosting
2. Random Forests
3. Clustering, specifically K-Means Clustering
What will your data science workflow be? (not having one is a fail…)
Deriving Knowledge from Data at Scale
gender  age  smoker  eye color  lung cancer
male    19   yes     green      no
female  44   yes     gray       yes
male    49   yes     blue       yes
male    12   no      brown      no
female  37   no      brown      no
female  60   no      brown      yes
male    44   no      blue       no
female  27   yes     brown      no
female  51   yes     green      yes
female  81   yes     gray       no
male    22   yes     brown      no
male    29   no      blue       no
male    77   yes     gray       ?
male    19   yes     green      ?
female  44   no      gray       ?
Deriving Knowledge from Data at Scale
(Same 12 labeled records and 3 unlabeled records as on the previous slide.)
Labeled records → Train → ML Model
Deriving Knowledge from Data at Scale
(Same 12 labeled records as on the previous slides.)
Labeled records → Train → ML Model → predicted lung cancer for the new records:
gender  age  smoker  eye color  lung cancer (predicted)
male    77   yes     gray       yes
male    19   yes     green      no
female  44   no      gray       no
Deriving Knowledge from Data at Scale
Define Objective → Access and Understand the Data → Pre-processing → Feature and/or Target construction
1. Define the objective and quantify it with a metric, optionally with constraints.
This typically requires domain knowledge.
2. Collect and understand the data, and deal with the vagaries and biases in the data
acquisition (missing data, outliers due to errors in the data collection process,
more sophisticated biases due to the data collection procedure, etc.).
3. Frame the problem in terms of a machine learning problem (classification,
regression, ranking, clustering, forecasting, outlier detection, etc.); some
combination of domain knowledge and ML knowledge is useful.
4. Transform the raw data into a “modeling dataset”, with features, weights,
targets, etc., which can be used for modeling. Feature construction can often
be improved with domain knowledge. The target must be identical to (or a very
good proxy for) the quantitative metric identified in step 1.
Deriving Knowledge from Data at Scale
Train/Test split → Feature selection → Model training → Model scoring → Evaluation
5. Train, test and evaluate, taking care to control
bias/variance and to ensure the metrics are
reported with the right confidence intervals
(cross-validation helps here). Be vigilant
against target leaks, which typically lead to
unbelievably good test metrics. This is the
ML-heavy step.
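Reporting a metric with an interval can be sketched as follows. This is a minimal illustration, assuming per-fold accuracies are already available (the `fold_scores` values are made up) and using a normal approximation, which is only a rough guide because cross-validation folds share training data.

```python
import math
import statistics


def mean_with_interval(scores, z=1.96):
    """Return (mean, low, high) using a normal approximation.

    z=1.96 corresponds to a roughly 95% interval; this is an
    approximation, since cross-validation folds are not independent.
    """
    mean = statistics.mean(scores)
    # Standard error of the mean from the sample standard deviation.
    stderr = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, mean - z * stderr, mean + z * stderr


# Hypothetical per-fold accuracies from a 5-fold cross-validation run.
fold_scores = [0.71, 0.74, 0.69, 0.73, 0.72]
mean, low, high = mean_with_interval(fold_scores)
print(f"accuracy = {mean:.3f} (approx. 95% CI {low:.3f} to {high:.3f})")
```

A test score far outside anything the folds produced is a hint to look for a target leak before celebrating.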
Deriving Knowledge from Data at Scale
Data Science Workflow.pdf
Develop your own for defining
and evaluating project
opportunities…
Deriving Knowledge from Data at Scale
Example 1: Amazon, big spenders. The target of the competition was
to predict which customers would spend a lot of money, using their
past purchases. The data consisted of transaction data in
different categories. But a winning model identified that ‘Free
shipping = True’ was an excellent predictor.
Leakage: “Free Shipping = True” was recorded simultaneously with the sale,
which is a no-no. We can only use data from beforehand to
predict the future.
Deriving Knowledge from Data at Scale
Example 2: Cancer patients plotted by Patient ID – what happened?
What could you do to improve this?...
Deriving Knowledge from Data at Scale
Winning a competition on leakage is easier than building good models.
But even if you don’t explicitly understand and game the leakage,
your model will do it for you. Either way, leakage is a huge problem.
• You need a strict temporal cutoff: remove all information just prior to the
event of interest.
• There has to be a timestamp on every entry, and you need to keep it.
• The best practice is to start from scratch with clean, raw data after careful
consideration.
• You need to know how the data was created! I (try to) work only with data I
pulled and prepared myself…
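The temporal-cutoff rule in the bullets above can be sketched in a few lines. This is an illustration only: the record layout, field names, and dates are hypothetical, and the cutoff is taken to be the time of the event being predicted.

```python
from datetime import datetime


def features_before_cutoff(records, cutoff):
    """Keep only records timestamped strictly before the cutoff."""
    return [r for r in records if r["timestamp"] < cutoff]


# Hypothetical transaction log for one customer; field names are illustrative.
records = [
    {"timestamp": datetime(2014, 1, 5), "event": "purchase", "amount": 20.0},
    {"timestamp": datetime(2014, 2, 9), "event": "purchase", "amount": 35.0},
    {"timestamp": datetime(2014, 3, 1), "event": "free_shipping", "amount": 0.0},
]

# The event of interest (the sale being predicted) happens on this date,
# so anything recorded at or after it would leak the outcome.
event_time = datetime(2014, 3, 1)
usable = features_before_cutoff(records, event_time)
print([r["event"] for r in usable])
```

The filter only works because every record carries a timestamp, which is why the slide insists on keeping it.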
Deriving Knowledge from Data at Scale
To avoid overfitting, we cross-validate and we cut down on the complexity of the model to
begin with. Here’s a standard picture (although keep in mind we generally work in
high-dimensional space and don’t have a pretty picture to look at).
Deriving Knowledge from Data at Scale
The art in data science
The science in data science
some
evaluation metric
rigorous
testing and experimentation to either validate or refute
Deriving Knowledge from Data at Scale
Given
We need to determine
evaluation metrics
Deriving Knowledge from Data at Scale
[Figure: learning curve of Accuracy (y-axis, 0.65 to 0.79) against # Training Records (x-axis, 0 to 3000)]
Accuracy on test data stabilizes above 1000 training samples
Deriving Knowledge from Data at Scale
Review: Decision Tree
1. Automatically selects features
2. Able to handle a large number of features
3. Handles numeric, nominal, and missing values
4. Easy to ensemble (Random Forest, Boosted DT)
5. I can romance on DTs for hours …
Deriving Knowledge from Data at Scale
Assignment of drug to a patient
Blood pressure
  high → Drug A
  normal → Age
    ≤ 40 → Drug A
    > 40 → Drug B
  low → Drug B
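The drug-assignment tree can be written as nested conditionals. This is a minimal sketch, assuming the branch layout high → Drug A, normal → test on age (≤ 40 → Drug A, > 40 → Drug B), low → Drug B; that layout is read from the flattened slide diagram and the function name is illustrative.

```python
def assign_drug(blood_pressure, age):
    """Walk the tree: blood pressure at the root, age under 'normal'."""
    if blood_pressure == "high":
        return "Drug A"
    if blood_pressure == "low":
        return "Drug B"
    if blood_pressure == "normal":
        # Only the 'normal' branch needs a second test, on age.
        return "Drug A" if age <= 40 else "Drug B"
    raise ValueError(f"unknown blood pressure level: {blood_pressure!r}")


print(assign_drug("normal", 35))
print(assign_drug("normal", 62))
```

Each root-to-leaf path reads as one rule, which is a large part of why decision trees are easy to explain.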
Deriving Knowledge from Data at Scale
[Figure sequence: the same scatter of "+" training points, shown across five slides]
Once regions are chosen, class probabilities are easy
to calculate: for example, p_m = 5/6 for the region shown.
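The class probability of a region is just the proportion of each class among the training points that fall in it. The sketch below assumes a hypothetical region m holding six points, five of them labeled "+", which reproduces the 5/6 on the slide.

```python
from collections import Counter


def region_class_probabilities(labels):
    """Return {class: proportion} for the labels falling in one region."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}


# Hypothetical region m containing six training points, five labeled '+'.
region_m = ["+", "+", "+", "+", "+", "-"]
probs = region_class_probabilities(region_m)
print(probs["+"])  # proportion of '+' in the region, 5/6
```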
Deriving Knowledge from Data at Scale
Decision Trees & Weka
Deriving Knowledge from Data at Scale
When completed, submit this assignment to
the dropbox for homework Lecture 4
Deriving Knowledge from Data at Scale
Your task for this assignment: Design a simple, low-cost sensor that can distinguish
between red wine and white wine.
Your sensor must correctly distinguish between red and white wine for at least 95% of the
samples in a set of 6497 test samples of red and white wine.
Your technology is capable of sensing the following wine attributes:
- Fixed acidity
- Volatile acidity
- Citric acid
- Residual sugar
- Chlorides
- Density
- Free sulphur dioxide
- Total sulphur dioxide
- Sulphates
- pH
- Alcohol
To keep your sensor cheap and simple, you need to sense as few of these attributes as
possible to meet the 95% requirement.
Question: Which attributes should your sensor be capable of measuring?
Deriving Knowledge from Data at Scale
1. Go to our class website, Lecture 4, and download the associated homework files
2. Read WineQuality.pdf.
3. Open the RedWhiteWine.arff file in Weka, and remove the quality attribute, which you
will not need for this assignment.
4. Run J48 with default set-up to see what kind of percent correct classification results
you get using all attributes.
5. Remove attributes to find the minimum number of attributes needed to meet
the 95% correct classification requirement.
Deriving Knowledge from Data at Scale
Use these buttons to
simplify the task of
removing attributes
Deriving Knowledge from Data at Scale
(Paste a screenshot showing your
minimum attribute set here)
Deriving Knowledge from Data at Scale
(Paste a screenshot showing
your results for your minimum
attribute set here)
Deriving Knowledge from Data at Scale
• Opening Discussion 30 minutes
Review Discussion…
• Ensembles, Random Forests 60 minutes
• Data Science Modelling 30 minutes
Model performance evaluation…
• Machine Learning Boot Camp ~60 minutes
Clustering, k-Means…
• Close
Deriving Knowledge from Data at Scale
bagging
Decision trees
Deriving Knowledge from Data at Scale
Diversity of Opinion
Independence
Decentralization
Aggregation
Deriving Knowledge from Data at Scale
Ensemble Classification
Deriving Knowledge from Data at Scale
Given
Method
Goal
Deriving Knowledge from Data at Scale
The basic idea:
Randomly draw datasets with replacement from the
training data, each sample the same size as the original training set
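Drawing bootstrap samples takes only a few lines. This is a minimal sketch using the standard library; the ten-record `training_data` list and the seed are placeholders.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    return [rng.choice(data) for _ in data]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
training_data = list(range(10))  # stand-in for ten training records
samples = [bootstrap_sample(training_data, rng) for _ in range(3)]
for s in samples:
    print(sorted(s))
```

Because sampling is with replacement, each sample usually repeats some records and omits others, which is what makes the resulting models differ.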
Deriving Knowledge from Data at Scale
Training
• Regression
• Classification
Deriving Knowledge from Data at Scale
Two examples of random decisions in RFs
Deriving Knowledge from Data at Scale
Ellipsoid separation → two categories, two predictors
Single tree decision boundary vs. 100 bagged trees
Deriving Knowledge from Data at Scale
Random Forest Classifier
Training data: N examples, M features
1. Create bootstrap samples from the training data
2. Construct a decision tree from each bootstrap sample
3. At each node, in choosing the split feature, choose only among m < M features
4. Take the majority vote
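The Random Forest Classifier slides above (bootstrap samples, one tree per sample, only m < M candidate features at a split, majority vote) can be sketched in plain Python. To stay short, this illustration uses one-node trees (stumps), so the random feature choice happens once per tree rather than at every node; the toy data, names, and parameter values are all made up.

```python
import random
from collections import Counter


def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows, feature_ids):
    """Pick the (feature, threshold) among feature_ids with fewest errors.

    A stump is a one-node tree; real forests grow deeper trees and repeat
    the random feature choice at every node.
    """
    best = None
    for f in feature_ids:
        for threshold in sorted({x[f] for x, _ in rows}):
            left = [y for x, y in rows if x[f] <= threshold]
            right = [y for x, y in rows if x[f] > threshold]
            if not left or not right:
                continue
            l_lab, r_lab = majority(left), majority(right)
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, l_lab, r_lab)
    if best is None:  # no valid split: always predict the majority label
        label = majority([y for _, y in rows])
        return (0, float("inf"), label, label)
    return best[1:]


def predict_stump(stump, x):
    f, threshold, l_lab, r_lab = stump
    return l_lab if x[f] <= threshold else r_lab


def fit_forest(rows, n_trees, m, rng):
    """Bootstrap the rows, then fit each stump on m < M random features."""
    n_features = len(rows[0][0])
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]        # bootstrap sample
        feature_ids = rng.sample(range(n_features), m)   # m of M features
        forest.append(fit_stump(sample, feature_ids))
    return forest


def predict_forest(forest, x):
    """Majority vote over the individual tree predictions."""
    return majority([predict_stump(s, x) for s in forest])


# Toy data: N = 8 examples, M = 3 features (the third is noise).
rows = [
    ((1.0, 1.2, 0.5), "a"), ((1.5, 0.8, 0.1), "a"),
    ((0.7, 1.1, 0.9), "a"), ((1.2, 0.9, 0.4), "a"),
    ((3.0, 3.1, 0.6), "b"), ((3.4, 2.8, 0.2), "b"),
    ((2.9, 3.3, 0.8), "b"), ((3.2, 3.0, 0.3), "b"),
]
forest = fit_forest(rows, n_trees=25, m=2, rng=random.Random(1))
print(predict_forest(forest, (1.1, 1.0, 0.5)))
```

Restricting each split to a random subset of features is what decorrelates the trees; bagging alone would let every tree split on the same strongest feature.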
Deriving Knowledge from Data at Scale
Random Forests
Deriving Knowledge from Data at Scale
Consensus
Independence
Decentralization
Aggregation
Deriving Knowledge from Data at Scale
Diversity of Opinion
private information
Independence
Decentralization
Aggregation
Deriving Knowledge from Data at Scale
Decision Trees and Decision Forests
A forest is an ensemble of trees. The trees are all slightly different from one another.
[Figure: a general tree structure, with root node 0, internal (split) nodes 1 to 6, and terminal (leaf) nodes 7 to 14]
[Figure: a decision tree whose nodes ask "Is top part blue?", "Is bottom part green?" and "Is bottom part blue?"]
Deriving Knowledge from Data at Scale
Decision Forest Model: the randomness model
1) Bagging (randomizing the training set)
The full training set
The randomly sampled subset of training data made available for the tree t
Forest training
Deriving Knowledge from Data at Scale
Decision Forest Model: the randomness model
2) Randomized node optimization (RNO)
T: the full set of all possible node test parameters
T_j ⊂ T: for each node, the set of randomly sampled features
ρ = |T_j|: randomness control parameter.
For ρ = |T|: no randomness and maximum tree correlation.
For ρ = 1: max randomness and minimum tree correlation.
The effect of ρ: small value of ρ, little tree correlation; large value of ρ, large tree correlation.
Node weak learner
Node test params
Node training
Deriving Knowledge from Data at Scale
Decision Forest Model: training and information gain
Before split
Information gain
Shannon’s entropy
Node training
(for categorical, non-parametric distributions)
Split 1, Split 2
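The slide's formulas did not survive extraction, so the sketch below uses the standard definitions: Shannon entropy of the empirical class distribution, and information gain as the entropy before the split minus the size-weighted entropy of the children. The two candidate splits are made-up examples.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of the empirical class distribution."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(parent, children):
    """Entropy before the split minus the size-weighted entropy after it."""
    total = len(parent)
    after = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - after


before = ["a"] * 4 + ["b"] * 4
split_1 = [["a", "a", "a", "a"], ["b", "b", "b", "b"]]   # separates classes
split_2 = [["a", "a", "b", "b"], ["a", "a", "b", "b"]]   # separates nothing
print(information_gain(before, split_1))
print(information_gain(before, split_2))
```

Node training picks the candidate split with the larger gain, so the first split would be chosen here.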
Deriving Knowledge from Data at Scale
Why we prune…
Deriving Knowledge from Data at Scale
Classification Forest
Training data in feature space
?
?
?
Entropy of a discrete distribution
with
Classification tree
training
Obj. funct. for node j (information gain)
Training node j
Output is categorical
Input data point
Node weak learner
Predictor model (class posterior)
Model specialization for classification
( is feature response)
(discrete set)
Deriving Knowledge from Data at Scale
Classification Forest: the weak learner model
Node weak learner
Node test params
Splitting data at node j
Weak learner: axis aligned Weak learner: oriented line Weak learner: conic section
Examples of weak learners
Feature response
for 2D example.
With a generic line in homog. coordinates.
Feature response
for 2D example.
With a matrix representing a conic.
Feature response
for 2D example.
In general may select only a very small subset of features
With or
Deriving Knowledge from Data at Scale
Classification Forest: the prediction model
What do we do at the leaf?
leaf
leaf
leaf
Prediction model: probabilistic
Deriving Knowledge from Data at Scale
Classification Forest: the ensemble model
Tree t=1 t=2 t=3
Forest output probability
The ensemble model
Deriving Knowledge from Data at Scale
Training different trees in the forest
Testing different trees in the forest
(2 videos in this page)
Classification Forest: effect of the weak learner model
Parameters: T=200, D=2, weak learner = aligned, leaf model = probabilistic
Three concepts to keep in mind:
• “Accuracy of prediction”
• “Quality of confidence”
• “Generalization”
Training points
Deriving Knowledge from Data at Scale
Classification Forest: with >2 classes
Training different trees in the forest
Testing different trees in the forest
Parameters: T=200, D=3, weak learner = conic, leaf model = probabilistic
(2 videos in this page)
Training points
Deriving Knowledge from Data at Scale
Classification Forest: effect of tree depth
max tree depth, D: from underfitting (small D) to overfitting (large D)
T=200, D=3, w. l. = conic | T=200, D=6, w. l. = conic | T=200, D=15, w. l. = conic
Predictor model = prob. (3 videos in this page)
Training points: 4-class mixed
Deriving Knowledge from Data at Scale
Classification Forest: analysing generalization
Parameters: T=200, D=13, w. l. = conic, predictor = prob.
(3 videos in this page)
Training points: 4-class spiral | Training pts: 4-class spiral, large gaps | Tr. pts: 4-class spiral, larger gaps
Testing posteriors
Deriving Knowledge from Data at Scale
Q
Deriving Knowledge from Data at Scale
10 Minute Break…
Deriving Knowledge from Data at Scale
Data Acquisition → Data Exploration → Pre-processing → Feature and Target construction → Train/Test split → Feature selection → Model training → Model scoring → Evaluation → Compare metrics
(The diagram shows Model scoring and Evaluation twice, both feeding Compare metrics.)
Deriving Knowledge from Data at Scale
Model Scoring (subject for today…)
Deriving Knowledge from Data at Scale
[Figure: the data split into folds 1, 2, 3, 4, …, (k-1), k, each fold marked Train or Test]
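Reading the figure as k-fold cross-validation, the fold bookkeeping can be sketched as below. This is an illustration under that assumption: the folds are contiguous and unshuffled, whereas in practice the records are usually shuffled first.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread the remainder so fold sizes differ by at most one.
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


folds = list(k_fold_indices(n=10, k=5))
for train, test in folds:
    print("test:", test, "train:", train)
```

Every record is used for testing exactly once and for training k-1 times.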
Deriving Knowledge from Data at Scale
• Class
• Score
Deriving Knowledge from Data at Scale
Confusion matrix: True Label against Predicted Label
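A confusion matrix is a count of (true label, predicted label) pairs. The sketch below uses made-up labels and prints true labels as rows and predicted labels as columns; that orientation is a choice made here, not something the slide specifies.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels):
    """Count (true, predicted) pairs; keys are (true_label, predicted_label)."""
    return Counter(zip(true_labels, predicted_labels))


# Hypothetical labels, made up for illustration.
y_true = ["yes", "yes", "no", "no", "no", "yes"]
y_pred = ["yes", "no", "no", "no", "yes", "yes"]
cm = confusion_matrix(y_true, y_pred)

classes = ["yes", "no"]
print("true\\pred", *classes)
for t in classes:
    print(t, *[cm[(t, p)] for p in classes])
```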
Deriving Knowledge from Data at Scale
Performance Metrics
Percent Reduction in Error
• 80% accuracy = 20% error
• Suppose learning increases accuracy from 80% to 90%:
error is reduced from 20% to 10%
• That is a 50% reduction in error
• 99.90% to 99.99% = 90% reduction in error
• 50% to 75% = 50% reduction in error
• Can be applied to many other measures
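The arithmetic in these bullets is the drop in error divided by the old error. A short sketch (the function name is illustrative) that reworks the three examples from the slide:

```python
def percent_error_reduction(old_accuracy, new_accuracy):
    """Relative drop in error rate, as a percentage of the old error."""
    old_error = 1.0 - old_accuracy
    new_error = 1.0 - new_accuracy
    return 100.0 * (old_error - new_error) / old_error


print(round(percent_error_reduction(0.80, 0.90), 6))
print(round(percent_error_reduction(0.9990, 0.9999), 6))
print(round(percent_error_reduction(0.50, 0.75), 6))
```

The second case shows why the measure is useful: a gain of 0.09 accuracy points near the ceiling removes nine tenths of the remaining errors.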
Deriving Knowledge from Data at Scale
Performance Metrics
Precision and Recall
• Typically used in document retrieval
• Precision:
– what fraction of the returned documents are correct
– precision (threshold)
• Recall:
– what fraction of the positives the model returns
– recall (threshold)
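Both measures depend on the score threshold above which an item counts as returned. A minimal sketch with made-up scores and labels; returning 0.0 when nothing is returned (or nothing is positive) is a convention chosen here.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when items scoring >= threshold are returned.

    labels are booleans: True means the item is actually positive.
    """
    returned = [label for score, label in zip(scores, labels) if score >= threshold]
    true_positives = sum(returned)
    total_positives = sum(labels)
    precision = true_positives / len(returned) if returned else 0.0
    recall = true_positives / total_positives if total_positives else 0.0
    return precision, recall


# Hypothetical model scores and ground truth, made up for illustration.
scores = [0.95, 0.80, 0.70, 0.40, 0.20]
labels = [True, True, False, True, False]
for threshold in (0.9, 0.5, 0.1):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
```

Lowering the threshold returns more items, so recall can only rise while precision typically falls.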
Deriving Knowledge from Data at Scale
Next week we will go deeper with ROC
curves, kappa, lift charts, etc…
Deriving Knowledge from Data at Scale
• Opening Discussion 30 minutes
Review Discussion…
• Ensembles, Random Forests 60 minutes
• Break 5 minutes
• Data Science Modelling 30 minutes
Model performance evaluation…
• Machine Learning Boot Camp ~60 minutes
Clustering, k-Means…
• Close
Deriving Knowledge from Data at Scale
similar
unsupervised learning
data exploration
Deriving Knowledge from Data at Scale
Grouping objects so that the objects within a group are
similar to one another and different from (or unrelated to)
the objects in other groups
Inter-cluster distances are maximized
Intra-cluster distances are minimized
Deriving Knowledge from Data at Scale
• Outliers: objects that do not belong to any cluster
outlier analysis
cluster
outliers
Deriving Knowledge from Data at Scale
data reduction
natural clusters useful outlier detection
Deriving Knowledge from Data at Scale
How many clusters?
Two Clusters, Four Clusters, Six Clusters
Deriving Knowledge from Data at Scale
hierarchical
partitional
Deriving Knowledge from Data at Scale
Original Points A Partitional Clustering
Deriving Knowledge from Data at Scale
[Figure panels, each over points p1 to p4: Traditional Hierarchical Clustering, Traditional Dendrogram, Non-traditional Hierarchical Clustering, Non-traditional Dendrogram]
Deriving Knowledge from Data at Scale
Single Linkage: minimum distance
Complete Linkage: maximum distance
Average Linkage: average distance
Ward's method: minimization of within-cluster variance
Centroid method: distance between centres
Non-overlapping vs. Overlapping
Hierarchical vs. Non-hierarchical
Agglomerative vs. Divisive
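Four of the linkage rules above differ only in how they reduce the point-to-point distances between two clusters to a single number. The sketch below assumes Euclidean distance and made-up 2D points; Ward's method is left out because it scores a merge by the change in within-cluster variance rather than by a pairwise distance.

```python
import math


def centroid(cluster):
    """Component-wise mean of the points in a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def linkage_distance(a, b, method):
    """Distance between clusters a and b under the named linkage rule."""
    pairwise = [math.dist(p, q) for p in a for q in b]
    if method == "single":      # closest pair
        return min(pairwise)
    if method == "complete":    # farthest pair
        return max(pairwise)
    if method == "average":     # mean over all cross-cluster pairs
        return sum(pairwise) / len(pairwise)
    if method == "centroid":    # distance between cluster centres
        return math.dist(centroid(a), centroid(b))
    raise ValueError(f"unknown method: {method}")


a = [(0.0, 0.0), (0.0, 1.0)]
b = [(3.0, 0.0), (4.0, 0.0)]
for method in ("single", "complete", "average", "centroid"):
    print(method, round(linkage_distance(a, b, method), 3))
```

An agglomerative procedure repeatedly merges the two clusters with the smallest such distance, so the choice of rule changes which clusters merge first.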
Deriving Knowledge from Data at Scale
d(x, y): the distance between x and y. d is a metric if:
• d(i, j) ≥ 0 (non-negativity)
• d(i, i) = 0 (isolation)
• d(i, j) = d(j, i) (symmetry)
• d(i, j) ≤ d(i, h) + d(h, j) (triangle inequality)
real,
boolean, categorical, ordinal
Deriving Knowledge from Data at Scale
p = 2, L2 Euclidean distance:
$d(x, y) = \sqrt{|x_1 - y_1|^2 + |x_2 - y_2|^2 + \dots + |x_d - y_d|^2}$
weighted distance:
$d(x, y) = \sqrt{w_1 |x_1 - y_1|^2 + w_2 |x_2 - y_2|^2 + \dots + w_d |x_d - y_d|^2}$
$d(x, y) = w_1 x_1 y_1 + w_2 x_2 y_2 + \dots + w_d x_d y_d$
Deriving Knowledge from Data at Scale
Q1 Q2 Q3 Q4 Q5 Q6
X 1 0 0 1 1 1
Y 0 1 1 0 1 0
• Jaccard similarity between binary vectors X and Y:
$JSim(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$
• Jaccard distance between binary vectors X and Y:
Jdist(X,Y) = 1 - JSim(X,Y)
• Example:
• JSim = 1/6
• Jdist = 5/6
Deriving Knowledge from Data at Scale
• Lp Minkowski distance, with parameter p:
$L_p(x, y) = \left(|x_1 - y_1|^p + |x_2 - y_2|^p + \dots + |x_d - y_d|^p\right)^{1/p} = \left(\sum_{i=1}^{d} |x_i - y_i|^p\right)^{1/p}$
p = 1, L1 Manhattan (or city block):
$L_1(x, y) = |x_1 - y_1| + |x_2 - y_2| + \dots + |x_d - y_d| = \sum_{i=1}^{d} |x_i - y_i|$
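The Minkowski family is one function with p as a parameter. A minimal sketch with made-up points, showing the Manhattan (p = 1) and Euclidean (p = 2) cases:

```python
def minkowski(x, y, p):
    """L_p distance: (sum_i |x_i - y_i|**p) ** (1/p), for p >= 1."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


x, y = (0.0, 0.0), (3.0, 4.0)
print(minkowski(x, y, 1))  # Manhattan
print(minkowski(x, y, 2))  # Euclidean
```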
Deriving Knowledge from Data at Scale
centroid
Deriving Knowledge from Data at Scale
[Figure: three scatter panels on x-y axes: Original Points, Optimal Clustering, Sub-optimal Clustering]
Deriving Knowledge from Data at Scale
[Figure: six scatter panels on x-y axes, Iteration 1 to Iteration 6]
Deriving Knowledge from Data at Scale
 

$SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist^2(m_i, x)$
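The SSE above sums, over the K clusters C_i, the squared distance from each point x to the cluster's representative point m_i. The sketch below is a plain k-means loop that reports that quantity, assuming Euclidean distance, the cluster mean as m_i, and randomly chosen initial centroids; the points and seed are made up, and a different seed can end in a different (possibly sub-optimal) clustering.

```python
import math
import random


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def assign(points, centroids):
    """Group points by the index of their nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters


def sse(clusters, centroids):
    """Sum over clusters of squared distances from points to their centroid."""
    return sum(math.dist(m, x) ** 2 for m, c in zip(centroids, clusters) for x in c)


def k_means(points, k, rng, max_iter=100):
    centroids = rng.sample(points, k)          # random initial centroids
    for _ in range(max_iter):
        clusters = assign(points, centroids)
        # Keep the old centroid if a cluster ends up empty.
        updated = [mean_point(c) if c else m for c, m in zip(clusters, centroids)]
        if updated == centroids:               # converged
            break
        centroids = updated
    return centroids, assign(points, centroids)


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
centroids, clusters = k_means(points, k=2, rng=random.Random(0))
print(centroids, sse(clusters, centroids))
```

Running the loop from several random starts and keeping the result with the lowest SSE is a common way to reduce the risk of a poor initialization.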
Deriving Knowledge from Data at Scale
[Figure: five scatter panels on x-y axes, Iteration 1 to Iteration 5]
Deriving Knowledge from Data at Scale
• Boolean Values
• Categories
Deriving Knowledge from Data at Scale
That’s all for tonight….

Model evaluation in the land of deep learningModel evaluation in the land of deep learning
Model evaluation in the land of deep learningPramit Choudhary
 
Best Selenium certification course
Best Selenium certification courseBest Selenium certification course
Best Selenium certification courseKumarNaik21
 
Machine Learning 2 deep Learning: An Intro
Machine Learning 2 deep Learning: An IntroMachine Learning 2 deep Learning: An Intro
Machine Learning 2 deep Learning: An IntroSi Krishan
 
What Managers Need to Know about Data Science
What Managers Need to Know about Data ScienceWhat Managers Need to Know about Data Science
What Managers Need to Know about Data ScienceAnnie Flippo
 
Semi-Supervised Insight Generation from Petabyte Scale Text Data
Semi-Supervised Insight Generation from Petabyte Scale Text DataSemi-Supervised Insight Generation from Petabyte Scale Text Data
Semi-Supervised Insight Generation from Petabyte Scale Text DataTech Triveni
 
AI-900 - Fundamental Principles of ML.pptx
AI-900 - Fundamental Principles of ML.pptxAI-900 - Fundamental Principles of ML.pptx
AI-900 - Fundamental Principles of ML.pptxkprasad8
 
Comparison of Top Data Mining(Final)
Comparison of Top Data Mining(Final)Comparison of Top Data Mining(Final)
Comparison of Top Data Mining(Final)Sanghun Kim
 
Operationalizing Machine Learning
Operationalizing Machine LearningOperationalizing Machine Learning
Operationalizing Machine LearningAgileThought
 
Data Science Salon: Introduction to Machine Learning - Marketing Use Case
Data Science Salon: Introduction to Machine Learning - Marketing Use CaseData Science Salon: Introduction to Machine Learning - Marketing Use Case
Data Science Salon: Introduction to Machine Learning - Marketing Use CaseFormulatedby
 
Data Science Salon Miami Presentation
Data Science Salon Miami PresentationData Science Salon Miami Presentation
Data Science Salon Miami PresentationGreg Werner
 
AI and ML Skills for the Testing World Tutorial
AI and ML Skills for the Testing World TutorialAI and ML Skills for the Testing World Tutorial
AI and ML Skills for the Testing World TutorialTariq King
 

Similar a Barga Data Science lecture 4 (20)

B4UConference_machine learning_deeplearning
B4UConference_machine learning_deeplearningB4UConference_machine learning_deeplearning
B4UConference_machine learning_deeplearning
 
Choosing a Machine Learning technique to solve your need
Choosing a Machine Learning technique to solve your needChoosing a Machine Learning technique to solve your need
Choosing a Machine Learning technique to solve your need
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
 
AI Orange Belt - Session 4
AI Orange Belt - Session 4AI Orange Belt - Session 4
AI Orange Belt - Session 4
 
Data preparation and processing chapter 2
Data preparation and processing chapter  2Data preparation and processing chapter  2
Data preparation and processing chapter 2
 
Cloudera Data Science Challenge 3 Solution by Doug Needham
Cloudera Data Science Challenge 3 Solution by Doug NeedhamCloudera Data Science Challenge 3 Solution by Doug Needham
Cloudera Data Science Challenge 3 Solution by Doug Needham
 
Kx for wine tasting
Kx for wine tastingKx for wine tasting
Kx for wine tasting
 
Machinr Learning and artificial_Lect1.pdf
Machinr Learning and artificial_Lect1.pdfMachinr Learning and artificial_Lect1.pdf
Machinr Learning and artificial_Lect1.pdf
 
AI Orange Belt - Session 2
AI Orange Belt - Session 2AI Orange Belt - Session 2
AI Orange Belt - Session 2
 
Model evaluation in the land of deep learning
Model evaluation in the land of deep learningModel evaluation in the land of deep learning
Model evaluation in the land of deep learning
 
Best Selenium certification course
Best Selenium certification courseBest Selenium certification course
Best Selenium certification course
 
Machine Learning 2 deep Learning: An Intro
Machine Learning 2 deep Learning: An IntroMachine Learning 2 deep Learning: An Intro
Machine Learning 2 deep Learning: An Intro
 
What Managers Need to Know about Data Science
What Managers Need to Know about Data ScienceWhat Managers Need to Know about Data Science
What Managers Need to Know about Data Science
 
Semi-Supervised Insight Generation from Petabyte Scale Text Data
Semi-Supervised Insight Generation from Petabyte Scale Text DataSemi-Supervised Insight Generation from Petabyte Scale Text Data
Semi-Supervised Insight Generation from Petabyte Scale Text Data
 
AI-900 - Fundamental Principles of ML.pptx
AI-900 - Fundamental Principles of ML.pptxAI-900 - Fundamental Principles of ML.pptx
AI-900 - Fundamental Principles of ML.pptx
 
Comparison of Top Data Mining(Final)
Comparison of Top Data Mining(Final)Comparison of Top Data Mining(Final)
Comparison of Top Data Mining(Final)
 
Operationalizing Machine Learning
Operationalizing Machine LearningOperationalizing Machine Learning
Operationalizing Machine Learning
 
Data Science Salon: Introduction to Machine Learning - Marketing Use Case
Data Science Salon: Introduction to Machine Learning - Marketing Use CaseData Science Salon: Introduction to Machine Learning - Marketing Use Case
Data Science Salon: Introduction to Machine Learning - Marketing Use Case
 
Data Science Salon Miami Presentation
Data Science Salon Miami PresentationData Science Salon Miami Presentation
Data Science Salon Miami Presentation
 
AI and ML Skills for the Testing World Tutorial
AI and ML Skills for the Testing World TutorialAI and ML Skills for the Testing World Tutorial
AI and ML Skills for the Testing World Tutorial
 

Último

CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxolyaivanovalion
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...shambhavirathore45
 

Último (20)

CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 

Barga Data Science lecture 4

  • 1. Deriving Knowledge from Data at Scale
  • 2.
      • Opening Discussion (30 minutes): Review Discussion…
      • Hands-On with Decision Trees (30 minutes)
      • Ensembles, Random Forests (60 minutes)
      • Data Science Modelling (30 minutes): Model performance evaluation…
      • Machine Learning Boot Camp (~60 minutes): Clustering, k-Means…
  • 3.
      • Optional Reading: Data Science Weekly (2)
      • Two Homework Assignments, due next Wednesday
          1. One is described in the lecture notes
          2. Two is uploaded to the class Catalyst site
      • Key Points to Understand, review and discuss
          1. Ensembles, the techniques of Bagging and Boosting
          2. Random Forests
          3. Clustering, specifically K-Means Clustering
      What will your data science workflow be? (not having one is a fail…)
  • 4. Twelve labeled rows, followed by three unlabeled rows to predict:
        gender  age  smoker  eye color  lung cancer
        male    19   yes     green      no
        female  44   yes     gray       yes
        male    49   yes     blue       yes
        male    12   no      brown      no
        female  37   no      brown      no
        female  60   no      brown      yes
        male    44   no      blue       no
        female  27   yes     brown      no
        female  51   yes     green      yes
        female  81   yes     gray       no
        male    22   yes     brown      no
        male    29   no      blue       no
        male    77   yes     gray       ?
        male    19   yes     green      ?
        female  44   no      gray       ?
  • 5. The same table as slide 4, with the step “Train ML Model” added.
  • 6. The same table as slide 4 with “Train ML Model”; the three unlabeled rows are now filled in as yes, no, no.
  • 7. Define Objective; Access and Understand the Data; Pre-processing; Feature and/or Target construction.
      1. Define the objective and quantify it with a metric, optionally with constraints. This typically requires domain knowledge.
      2. Collect and understand the data, and deal with the vagaries and biases in data acquisition (missing data, outliers due to errors in the data collection process, more sophisticated biases due to the data collection procedure, etc.).
      3. Frame the problem as a machine learning problem (classification, regression, ranking, clustering, forecasting, outlier detection, etc.). Some combination of domain knowledge and ML knowledge is useful.
      4. Transform the raw data into a “modeling dataset” with features, weights, targets, etc., which can be used for modeling. Feature construction can often be improved with domain knowledge. The target must be identical to (or a very good proxy for) the quantitative metric identified in step 1.
  • 8. Feature selection; Model training; Model scoring; Evaluation; Train/Test split.
      5. Train, test and evaluate, taking care to control bias/variance and to report metrics with the right confidence intervals (cross-validation helps here). Be vigilant against target leaks, which typically lead to unbelievably good test metrics. This is the ML-heavy step.
  • 9. Data Science Workflow.pdf: develop your own for defining and evaluating project opportunities…
  • 10. Example 1: Amazon, big spenders. The target of the competition was to predict, from past purchases, which customers would spend a lot of money. The data consisted of transaction data in different categories, but a winning model identified that ‘Free shipping = True’ was an excellent predictor. Leakage: “Free Shipping = True” was simultaneous with the sale, which is a no-no… We can only use data from beforehand to predict the future…
  • 11. Example 2: Cancer patients plotted by Patient ID. What happened? What could you do to improve this?...
  • 12. Winning a competition by exploiting leakage is easier than building good models. But even if you don’t explicitly understand and game the leakage, your model will do it for you. Either way, leakage is a huge problem.
      • You need a strict temporal cutoff: remove all information just prior to the event of interest.
      • There has to be a timestamp on every entry, and you need to keep it.
      • The best practice is to start from scratch with clean, raw data after careful consideration.
      • You need to know how the data was created! I (try to) work only with data I pulled and prepared myself…
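The temporal-cutoff rule on slide 12 can be sketched in a few lines. This is a minimal illustration, not the lecture's own code: the record layout, the field names (`timestamp`, `event`, `amount`) and the dates are hypothetical, chosen only to show that anything stamped at or after the event of interest is dropped before feature construction.

```python
from datetime import datetime


def before_cutoff(records, cutoff):
    """Keep only records timestamped strictly before the cutoff."""
    return [r for r in records if r["timestamp"] < cutoff]


# Hypothetical transaction log; the last entry is simultaneous with the
# event being predicted, so it must not be used as a feature.
records = [
    {"timestamp": datetime(2014, 1, 5), "event": "purchase", "amount": 20.0},
    {"timestamp": datetime(2014, 2, 1), "event": "purchase", "amount": 35.0},
    {"timestamp": datetime(2014, 3, 1), "event": "free_shipping", "amount": 0.0},
]
cutoff = datetime(2014, 3, 1)  # time of the event of interest (assumed)
usable = before_cutoff(records, cutoff)
```

Keeping the timestamp on every record, as the slide asks, is what makes this filter possible at all.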
  • 13. To avoid overfitting, we cross-validate and we cut down on the complexity of the model to begin with. Here’s a standard picture (although keep in mind that we generally work in high-dimensional space and don’t have a pretty picture to look at).
  • 15. The art in data science / The science in data science / some evaluation metric / rigorous testing and experimentation to either validate or refute
  • 16. Given / We need to determine / evaluation metrics
  • 21. [Plot: accuracy (y-axis, 0.65 to 0.79) against number of training records (x-axis, 0 to 3000).] Accuracy on test data stabilizes above 1000 training samples.
  • 22. Review: Decision Tree
      1. Automatically selects features
      2. Able to handle a large number of features
      3. Numeric, nominal, missing
      4. Easy to ensemble (Random Forest, Boosted DT)
      5. I can romance on DTs for hours…
  • 23. [Decision tree figure: internal nodes Blood pressure (branches: high, normal, low) and Age (branches: ≤ 40, > 40); leaves Drug A and Drug B.] Assignment of drug to a patient.
  • 32. [Scatter plot of points partitioned into regions.] p_m = 5/6. Once regions are chosen, class probabilities are easy to calculate.
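To make the 5/6 on slide 32 concrete: once a region (leaf) is fixed, the class probability is simply the fraction of training points of that class that fall inside it. The sketch below assumes a region holding five "+" points and one point of another class; the "-" label is hypothetical, since only the "+" markers survive in the transcript.

```python
from collections import Counter


def class_probabilities(labels):
    """Return the empirical class distribution of the points in one region."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}


# Assumed contents of one region: five "+" points and one other point.
region = ["+", "+", "+", "+", "+", "-"]
probs = class_probabilities(region)  # probs["+"] is 5/6
```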
  • 33. Decision Trees & Weka
  • 54. When completed, submit this assignment to the dropbox for homework Lecture 4.
  • 55. Your task for this assignment: Design a simple, low-cost sensor that can distinguish between red wine and white wine. Your sensor must correctly distinguish between red and white wine for at least 95% of the samples in a set of 6497 test samples of red and white wine. Your technology is capable of sensing the following wine attributes:
      - Fixed acidity
      - Volatile acidity
      - Citric acid
      - Residual sugar
      - Chlorides
      - Density
      - Free sulphur dioxide
      - Total sulphur dioxide
      - Sulphates
      - pH
      - Alcohol
      To keep your sensor cheap and simple, you need to sense as few of these attributes as possible to meet the 95% requirement. Question: Which attributes should your sensor be capable of measuring?
  • 56.
      1. Go to our class website, Lecture 4, and download the associated homework files.
      2. Read WineQuality.pdf.
      3. Open the RedWhiteWine.arff file in Weka, and remove the quality attribute, which you will not need for this assignment.
      4. Run J48 with the default set-up to see what percent correct classification you get using all attributes.
      5. Remove attributes to find the minimum number of attributes needed to meet the 95% correct classification requirement.
  • 57. You can use these buttons to simplify the task of removing attributes.
  • 58. (Paste a screenshot showing your minimum attribute set here)
  • 59. (Paste a screenshot showing your results for your minimum attribute set here)
  • 60.
      • Opening Discussion (30 minutes): Review Discussion…
      • Ensembles, Random Forests (60 minutes)
      • Data Science Modelling (30 minutes): Model performance evaluation…
      • Machine Learning Boot Camp (~60 minutes): Clustering, k-Means…
      • Close
  • 61. Bagging, decision trees
  • 66. Diversity of Opinion, Independence, Decentralization, Aggregation
  • 67. Ensemble Classification
  • 68. Given, Method, Goal
  • 69. The basic idea: randomly draw datasets with replacement from the training data, each sample the same size as the original training set.
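The bootstrap draw described on slide 69 can be sketched as follows. This is a minimal illustration using only the standard library; the toy `training_data`, the seed and the number of samples drawn (5) are arbitrary assumptions.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]


rng = random.Random(0)            # fixed seed so the sketch is repeatable
training_data = list(range(12))   # stand-in for 12 training rows
samples = [bootstrap_sample(training_data, rng) for _ in range(5)]
```

Because the draw is with replacement, each sample typically repeats some rows and omits others, which is what makes the models trained on them differ.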
  • 70. Training: Regression, Classification
  • 78. Two examples of random decisions in RFs
  • 81. Ellipsoid separation: two categories, two predictors. Single tree decision boundary; 100 bagged trees.
  • 82. Random Forest Classifier. Training data: N examples, M features.
  • 83. Random Forest Classifier. Create bootstrap samples from the training data (N examples, M features).
  • 84. Random Forest Classifier. Construct a decision tree.
  • 85. Random Forest Classifier. At each node, when choosing the split feature, choose only among m < M features.
  • 86. Random Forest Classifier. Create a decision tree from each bootstrap sample.
  • 87. Random Forest Classifier. Take the majority vote.
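Two of the steps in the Random Forest Classifier slides are easy to show in isolation: restricting each node to a random subset of m < M features, and combining the trees' outputs by majority vote. The sketch below is an illustration under assumptions, not a full forest: the function names are hypothetical, M = 11 and m = 3 are arbitrary, and the per-tree predictions are made up.

```python
import random
from collections import Counter


def candidate_features(n_features, m, rng):
    """Pick m distinct feature indices out of n_features for one node's split search."""
    return rng.sample(range(n_features), m)


def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)
subset = candidate_features(n_features=11, m=3, rng=rng)

# Hypothetical predictions from five trees for one input.
vote = majority_vote(["red", "white", "red", "red", "white"])
```

In a real forest the feature subset is redrawn at every node of every tree, and each tree is grown on its own bootstrap sample.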
  • 88. Random Forests
  • 89. Consensus, Independence, Decentralization, Aggregation
  • 90. Diversity of Opinion (private information), Independence, Decentralization, Aggregation
  • 92. Decision Trees and Decision Forests. A forest is an ensemble of trees. The trees are all slightly different from one another. [Figure: a general tree structure with nodes numbered 0 to 14, showing the root node, internal (split) nodes and terminal (leaf) nodes.] [Figure: a decision tree with the node tests “Is top part blue?”, “Is bottom part green?” and “Is bottom part blue?”.]
  • 93. Decision Forest Model: the randomness model. 1) Bagging (randomizing the training set). Forest training: from the full training set, a randomly sampled subset of training data is made available for the tree t.
  • 94. Decision Forest Model: the randomness model. 2) Randomized node optimization (RNO). Node training: from the full set of all possible node test parameters, a set of features is randomly sampled for each node (labels: node weak learner, node test params). A randomness control parameter sets the amount of randomness: at one extreme there is no randomness and maximum tree correlation; at the other, maximum randomness and minimum tree correlation. The effect of the parameter: a small value gives little tree correlation; a large value gives large tree correlation.
  • 95. Decision Forest Model: training and information gain. Node training: information gain, with Shannon’s entropy (for categorical, non-parametric distributions). [Figure: class distributions before the split and after Split 1 and Split 2.]
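The formulas on slide 95 did not survive extraction, so the sketch below uses the standard definitions rather than the slide's exact notation: Shannon entropy of the class distribution at a node, and information gain as the parent's entropy minus the size-weighted entropy of the children. The toy labels and the two candidate splits are assumptions for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of the empirical class distribution."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


parent = ["a"] * 4 + ["b"] * 4
split1 = [["a"] * 4, ["b"] * 4]                          # separates the classes fully
split2 = [["a", "a", "b", "b"], ["a", "a", "b", "b"]]    # leaves both children mixed
gain1 = information_gain(parent, split1)
gain2 = information_gain(parent, split2)
```

A node would prefer the first split here, since it yields the larger gain.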
  • 96. Why we prune…
  • 97. Classification Forest. Model specialization for classification: input data point; output is categorical (discrete set); node weak learner (feature response); predictor model (class posterior). Classification tree training: objective function for node j (information gain), using the entropy of a discrete distribution. [Figure: training data in feature space with query points marked “?”; training node j.]
  • 98. Classification Forest: the weak learner model. Node weak learner with node test params, splitting data at node j. Examples of weak learners, each with a feature response for a 2D example: axis aligned; oriented line (a generic line in homogeneous coordinates); conic section (a matrix representing a conic). In general, a weak learner may select only a very small subset of features.
  • 99. Classification Forest: the prediction model. What do we do at the leaf? Prediction model: probabilistic.
  • 100. Classification Forest: the ensemble model. Trees t=1, t=2, t=3; forest output probability.
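The ensemble formula on slide 100 is an image and is not recoverable from the transcript. One common way to form a forest output probability, assumed here, is to average the class posteriors returned by the individual trees. The three per-tree posteriors below are made-up numbers.

```python
def forest_posterior(tree_posteriors):
    """Average per-tree class posteriors into a single forest posterior."""
    n_trees = len(tree_posteriors)
    classes = tree_posteriors[0].keys()
    return {c: sum(p[c] for p in tree_posteriors) / n_trees for c in classes}


# Hypothetical leaf posteriors from trees t=1, t=2, t=3 for one input point.
trees = [
    {"red": 0.9, "white": 0.1},
    {"red": 0.6, "white": 0.4},
    {"red": 0.3, "white": 0.7},
]
posterior = forest_posterior(trees)
```

Averaging keeps the output a valid probability distribution and lets confident trees outweigh uncertain ones, unlike a hard majority vote.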
  • 101. Deriving Knowledge from Data at Scale Training different trees in the forest Testing different trees in the forest (2 videos in this page) Classification Forest: effect of the weak learner model Parameters: T=200, D=2, weak learner = aligned, leaf model = probabilistic • “Accuracy of prediction” • “Quality of confidence” • “Generalization” Three concepts to keep in mind: Training points
  • 102. Deriving Knowledge from Data at Scale Classification Forest: with >2 classes Training different trees in the forest Testing different trees in the forest Parameters: T=200, D=3, weak learner = conic, leaf model = probabilistic (2 videos in this page) Training points
  • 103. Deriving Knowledge from Data at Scale Classification Forest: effect of tree depth. Max tree depth D, ranging from underfitting (small D) to overfitting (large D). Training points: 4-class mixed. Panels: T=200, D=3, w. l. = conic; T=200, D=6, w. l. = conic; T=200, D=15, w. l. = conic. Predictor model = prob. (3 videos on this page)
  • 104. Deriving Knowledge from Data at Scale Classification Forest: analysing generalization. Parameters: T=200, D=13, w. l. = conic, predictor = prob. Training points: 4-class spiral; 4-class spiral, large gaps; 4-class spiral, larger gaps. Testing posteriors (3 videos on this page)
  • 105. Deriving Knowledge from Data at Scale Q
  • 106. Deriving Knowledge from Data at Scale
  • 107. Deriving Knowledge from Data at Scale 10 Minute Break…
  • 108. Deriving Knowledge from Data at Scale • Opening Discussion 30 minutes Review Discussion… • Ensembles, Random Forests 60 minutes • Data Science Modelling 30 minutes Model performance evaluation… • Machine Learning Boot Camp ~60 minutes Clustering, k-Means… • Close
  • 109. Deriving Knowledge from Data at Scale Data Acquisition → Data Exploration → Pre-processing → Feature and Target construction → Train/Test split → Feature selection → Model training → Model scoring → Evaluation → Compare metrics
  • 110. Deriving Knowledge from Data at Scale Model Scoring (subject for today…)
  • 111. Deriving Knowledge from Data at Scale
  • 112. Deriving Knowledge from Data at Scale
  • 113. Deriving Knowledge from Data at Scale Folds: 1, 2, 3, 4, …, (k-1), k. Legend: Train, Test
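The fold diagram on this slide (blocks 1 through k, marked Train and Test) looks like k-fold cross-validation, although the transcript does not name it. Under that assumption, a minimal sketch of how the index sets could be generated follows; `k_fold_indices` is a hypothetical helper, and it uses contiguous folds without shuffling for clarity.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
```

Each sample lands in the test set exactly once and in the training set k-1 times. In practice the data would usually be shuffled (or stratified by class) before splitting.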
  • 114. Deriving Knowledge from Data at Scale • Class • Score
  • 115. Deriving Knowledge from Data at Scale True Label Predicted Label Confusion matrix
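As a small illustration of the true-label versus predicted-label table, the sketch below counts label pairs for six made-up examples. The labels and the `confusion_matrix` helper are illustrative only.

```python
from collections import Counter

def confusion_matrix(true_labels, predicted_labels):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(true_labels, predicted_labels))

# Hypothetical labels for six examples.
y_true = ["yes", "yes", "no", "no", "yes", "no"]
y_pred = ["yes", "no", "no", "yes", "yes", "no"]
cm = confusion_matrix(y_true, y_pred)

# Accuracy is the share of counts on the diagonal (true == predicted).
accuracy = sum(cm[(c, c)] for c in set(y_true)) / len(y_true)
```

The off-diagonal cells, here `cm[("yes", "no")]` and `cm[("no", "yes")]`, separate the two kinds of mistakes that a single accuracy number hides.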
  • 116. Deriving Knowledge from Data at Scale
  • 117. Deriving Knowledge from Data at Scale
  • 118. Deriving Knowledge from Data at Scale Performance Metrics: Percent Reduction in Error • 80% accuracy = 20% error • Suppose learning increases accuracy from 80% to 90%: error is reduced from 20% to 10%, a 50% reduction in error • 99.90% to 99.99% = 90% reduction in error • 50% to 75% = 50% reduction in error • Can be applied to many other measures
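The arithmetic on this slide can be checked with a few lines. The sketch assumes accuracies are given as fractions and that error is simply one minus accuracy; `error_reduction` is a hypothetical helper name.

```python
def error_reduction(old_accuracy, new_accuracy):
    """Fraction of the old error removed when accuracy improves."""
    old_error = 1.0 - old_accuracy
    new_error = 1.0 - new_accuracy
    return (old_error - new_error) / old_error

# The three accuracy improvements quoted on the slide.
examples = [(0.80, 0.90), (0.9990, 0.9999), (0.50, 0.75)]
reductions = [error_reduction(old, new) for old, new in examples]
```

Up to floating-point rounding, the three results are 0.5, 0.9 and 0.5, matching the slide: a gain of only 0.09 accuracy points near the top of the scale removes far more of the remaining error than a 10-point gain from 80%.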
  • 119. Deriving Knowledge from Data at Scale Performance Metrics: Precision and Recall • Typically used in document retrieval • Precision: how many of the returned documents are correct; a function of the decision threshold, precision(threshold) • Recall: how many of the positives the model returns; a function of the decision threshold, recall(threshold)
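To make the dependence on the threshold concrete, here is a minimal sketch that treats every item scoring at or above the threshold as "returned". The scores, labels, and the convention of reporting 1.0 when a denominator is empty are all assumptions made for illustration.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when items scoring >= threshold are returned."""
    returned = [label for score, label in zip(scores, labels) if score >= threshold]
    true_positives = sum(returned)
    total_positives = sum(labels)
    # Convention (an assumption): report 1.0 when a denominator is empty.
    precision = true_positives / len(returned) if returned else 1.0
    recall = true_positives / total_positives if total_positives else 1.0
    return precision, recall

# Hypothetical model scores and ground-truth relevance (1 = relevant).
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
labels = [1, 0, 1, 1, 0]
strict = precision_recall(scores, labels, threshold=0.75)
loose = precision_recall(scores, labels, threshold=0.3)
```

With this toy data, lowering the threshold from 0.75 to 0.3 returns more items and lifts recall from 1/3 to 1.0; precision here also moves, from 0.5 to 0.75, which shows that the two measures have to be read together at each threshold.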
  • 120. Deriving Knowledge from Data at Scale Performance Metrics Precision and Recall
  • 121. Deriving Knowledge from Data at Scale
  • 122. Deriving Knowledge from Data at Scale Next week we will go deeper with ROC curves, kappa, lift charts, etc…
  • 123. Deriving Knowledge from Data at Scale
  • 124. Deriving Knowledge from Data at Scale • Opening Discussion 30 minutes Review Discussion… • Ensembles, Random Forests 60 minutes • Break 5 minutes • Data Science Modelling 30 minutes Model performance evaluation… • Machine Learning Boot Camp ~60 minutes Clustering, k-Means… • Close
  • 125. Deriving Knowledge from Data at Scale similar unsupervised learning data exploration
  • 126. Deriving Knowledge from Data at Scale Clustering: grouping objects so that the objects within a group are similar to one another and different from (or unrelated to) the objects in other groups. Inter-cluster distances are maximized; intra-cluster distances are minimized
  • 127. Deriving Knowledge from Data at Scale • Outliers: objects that do not belong to any cluster; outlier analysis; cluster outliers
  • 128. Deriving Knowledge from Data at Scale data reduction natural clusters useful outlier detection
  • 129. Deriving Knowledge from Data at Scale How many clusters? Two Clusters; Four Clusters; Six Clusters
  • 130. Deriving Knowledge from Data at Scale hierarchical partitional
  • 131. Deriving Knowledge from Data at Scale Original Points A Partitional Clustering
  • 132. Deriving Knowledge from Data at Scale Figure (points p1, p2, p3, p4): Traditional Hierarchical Clustering; Non-traditional Hierarchical Clustering; Traditional Dendrogram; Non-traditional Dendrogram
  • 133. Deriving Knowledge from Data at Scale
  • 134. Deriving Knowledge from Data at Scale Single Linkage: minimum distance • Complete Linkage: maximum distance • Average Linkage: average distance • Ward's method: minimization of within-cluster variance • Centroid method: distance between centres. Taxonomy: non-overlapping vs. overlapping; hierarchical vs. non-hierarchical; agglomerative vs. divisive
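The linkage criteria listed on this slide differ only in how they summarise the distances between two clusters. The sketch below, assuming Euclidean distance and two made-up clusters, computes four of them; Ward's method is left out because it compares variances rather than pairwise distances. All function names are illustrative.

```python
import math
from itertools import product

def pairwise(a, b):
    """All Euclidean distances between points of cluster a and cluster b."""
    return [math.dist(p, q) for p, q in product(a, b)]

def single_linkage(a, b):
    return min(pairwise(a, b))       # closest pair

def complete_linkage(a, b):
    return max(pairwise(a, b))       # farthest pair

def average_linkage(a, b):
    d = pairwise(a, b)
    return sum(d) / len(d)           # mean over all pairs

def centroid_distance(a, b):
    def centre(points):
        return tuple(sum(c) / len(points) for c in zip(*points))
    return math.dist(centre(a), centre(b))

# Hypothetical clusters lying on the x axis.
a = [(0.0, 0.0), (1.0, 0.0)]
b = [(3.0, 0.0), (5.0, 0.0)]
```

For these two clusters the single, complete, average and centroid distances are 2, 5, 3.5 and 3.5. An agglomerative procedure would merge, at each step, the pair of clusters with the smallest value of whichever criterion is chosen.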
  • 135. Deriving Knowledge from Data at Scale A distance d(x, y) between objects x and y is a metric if it satisfies: • d(i, j) ≥ 0 (non-negativity) • d(i, i) = 0 (isolation) • d(i, j) = d(j, i) (symmetry) • d(i, j) ≤ d(i, h) + d(h, j) (triangle inequality). Attribute types: real, boolean, categorical, ordinal
  • 136. Deriving Knowledge from Data at Scale p = 2, L2 Euclidean distance: $d(x, y) = \sqrt{|x_1 - y_1|^2 + |x_2 - y_2|^2 + \dots + |x_d - y_d|^2}$. Weighted distance: $d(x, y) = \sqrt{w_1 |x_1 - y_1|^2 + w_2 |x_2 - y_2|^2 + \dots + w_d |x_d - y_d|^2}$ and $d(x, y) = w_1 |x_1 - y_1| + w_2 |x_2 - y_2| + \dots + w_d |x_d - y_d|$
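A minimal sketch of the weighted distances on this slide, assuming non-negative per-feature weights; the function names, points and weights are made up for illustration. Setting every weight to 1 gives back the ordinary Euclidean and Manhattan distances.

```python
import math

def weighted_euclidean(x, y, w):
    """Square root of the weighted sum of squared coordinate differences."""
    return math.sqrt(sum(wi * abs(xi - yi) ** 2 for xi, yi, wi in zip(x, y, w)))

def weighted_manhattan(x, y, w):
    """Weighted sum of absolute coordinate differences."""
    return sum(wi * abs(xi - yi) for xi, yi, wi in zip(x, y, w))

# Hypothetical 2-feature points.
x, y = (1.0, 2.0), (4.0, 6.0)
plain = weighted_euclidean(x, y, (1.0, 1.0))         # unit weights: ordinary L2
first_only = weighted_euclidean(x, y, (1.0, 0.0))    # second feature ignored
city_block = weighted_manhattan(x, y, (2.0, 0.5))    # first feature emphasised
```

Weights are one way to stop a feature with a large numeric range from dominating the distance, or to encode that some features matter more for the grouping.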
  • 137. Deriving Knowledge from Data at Scale Binary vectors over Q1 to Q6: X = (1, 0, 0, 1, 1, 1), Y = (0, 1, 1, 0, 1, 0) • Jaccard similarity between binary vectors X and Y: $JSim(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$ • Jaccard distance between binary vectors X and Y: $Jdist(X, Y) = 1 - JSim(X, Y)$ • Example: JSim = 1/6, Jdist = 5/6
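The slide's example can be reproduced directly. The sketch treats each vector as the set of positions holding a 1, so the intersection counts positions where both are 1 and the union counts positions where at least one is 1. Returning 1.0 for two all-zero vectors is an assumption, since the ratio is undefined there.

```python
def jaccard_similarity(x, y):
    """|X ∩ Y| / |X ∪ Y| for binary vectors given as sequences of 0/1."""
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    # Assumption: two all-zero vectors are treated as identical.
    return both / either if either else 1.0

def jaccard_distance(x, y):
    return 1.0 - jaccard_similarity(x, y)

# X and Y over Q1..Q6, as on the slide.
X = [1, 0, 0, 1, 1, 1]
Y = [0, 1, 1, 0, 1, 0]
similarity = jaccard_similarity(X, Y)
distance = jaccard_distance(X, Y)
```

Only Q5 is 1 in both vectors while every question is 1 in at least one of them, which gives the slide's 1/6 and 5/6. Positions where both vectors are 0 never enter the calculation, which is why Jaccard suits sparse boolean data.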
  • 138. Deriving Knowledge from Data at Scale • Lp Minkowski distance with parameter p: $L_p(x, y) = \left(|x_1 - y_1|^p + |x_2 - y_2|^p + \dots + |x_d - y_d|^p\right)^{1/p} = \left(\sum_{i=1}^{d} |x_i - y_i|^p\right)^{1/p}$ • p = 1, L1 Manhattan (or city block) distance: $L_1(x, y) = |x_1 - y_1| + |x_2 - y_2| + \dots + |x_d - y_d| = \sum_{i=1}^{d} |x_i - y_i|$
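A short sketch of the Minkowski family, assuming p ≥ 1 and equal-length numeric vectors; the helper name and the example points are illustrative.

```python
def minkowski(x, y, p):
    """L_p distance: p-th root of the summed p-th powers of coordinate differences."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

# Hypothetical points forming a 3-4-5 right triangle.
x, y = (0.0, 0.0), (3.0, 4.0)
manhattan = minkowski(x, y, 1)   # walk along the axes: 3 + 4
euclidean = minkowski(x, y, 2)   # straight line: sqrt(9 + 16)
```

The same pair of points is 7 apart under L1 and 5 apart under L2, so the choice of p changes which points count as neighbours and therefore which clusters form.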
  • 139. Deriving Knowledge from Data at Scale centroid
  • 140. Deriving Knowledge from Data at Scale
  • 141. Deriving Knowledge from Data at Scale Figure (axes x, y): Original Points; Optimal Clustering; Sub-optimal Clustering
  • 142. Deriving Knowledge from Data at Scale Figure (axes x, y): panels Iteration 1 through Iteration 6
  • 143. Deriving Knowledge from Data at Scale Figure (axes x, y): panels Iteration 1 through Iteration 6
  • 144. Deriving Knowledge from Data at Scale Sum of squared error: $SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist^2(m_i, x)$
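The transcript keeps only the SSE formula and the iteration plots, not the algorithm text. The sketch below assumes the standard k-means iteration (assign each point to its nearest centroid, then move each centroid to the mean of its points) with Euclidean distance, and then evaluates SSE for the result. The function names, the toy points and the starting centroids are illustrative.

```python
import math

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points]

def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new = []
    for i, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == i]
        if members:
            new.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new.append(old)  # an empty cluster keeps its previous centroid
    return new

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update until the centroids stop moving."""
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = update(points, labels, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels

def sse(points, centroids, labels):
    """Sum over clusters of squared distances from each point to its centroid."""
    return sum(math.dist(p, centroids[label]) ** 2 for p, label in zip(points, labels))

# Hypothetical data: two tight pairs of points, with K = 2 starting centroids.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
centroids, labels = kmeans(points, centroids=[(0.0, 0.0), (4.0, 0.0)])
total = sse(points, centroids, labels)
```

Because the result depends on the starting centroids, a common practice is to run the procedure from several initialisations and keep the one with the lowest SSE.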
  • 145. Deriving Knowledge from Data at Scale Figure (axes x, y): panels Iteration 1 through Iteration 5
  • 146. Deriving Knowledge from Data at Scale Figure (axes x, y): panels Iteration 1 through Iteration 5
  • 147. Deriving Knowledge from Data at Scale
  • 148. Deriving Knowledge from Data at Scale
  • 149. Deriving Knowledge from Data at Scale • Boolean Values • Categories
  • 150. Deriving Knowledge from Data at Scale
  • 151. Deriving Knowledge from Data at Scale
  • 152. Deriving Knowledge from Data at Scale That’s all for tonight….