Understanding Feature Space in Machine Learning
Alice Zheng, Dato
September 9, 2015
1
2
My journey so far
Applied machine learning
(Data science)
Build ML tools
Shortage of experts
and good tools.
3
Why machine learning?
Model data.
Make predictions.
Build intelligent
applications.
4
The machine learning pipeline
I fell in love the instant I laid
my eyes on that puppy. His
big eyes and playful tail, his
soft furry paws, …
Raw data
Features
Models
Predictions
Deploy in
production
Feature = numeric representation of raw data
6
Representing natural text
It is a puppy and it is
extremely cute.
What’s important?
Phrases? Specific
words? Ordering?
Subject, object, verb?
Classify:
puppy or not?
Raw Text
{“it”:2,
“is”:2,
“a”:1,
“puppy”:1,
“and”:1,
“extremely”:1,
“cute”:1 }
Bag of Words
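A minimal sketch of how the counts above could be produced. The tokenizer (lowercase, letters only) and the helper name `bag_of_words` are assumptions for illustration; the slide does not say how the text is split.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count word occurrences, ignoring case and punctuation (assumed tokenizer)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


counts = bag_of_words("It is a puppy and it is extremely cute.")
print(dict(counts))
```

Word order is discarded entirely: any reordering of the sentence yields the same counts.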
7
Representing natural text
It is a puppy and it is
extremely cute.
Classify:
puppy or not?
Raw Text Bag of Words
it 2
they 0
I 0
am 0
how 0
puppy 1
and 1
cat 0
aardvark 0
cute 1
extremely 1
… …
Sparse vector
representation
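A sketch of the step from word counts to a vector, under the assumption of a small fixed vocabulary (the `VOCABULARY` list and both helper names are hypothetical). Most vocabulary entries are zero for any one document, so it can be enough to store only the nonzero positions.

```python
from collections import Counter

# Hypothetical vocabulary; a real one would hold every word seen in the corpus.
VOCABULARY = ["it", "they", "i", "am", "how", "puppy", "and",
              "cat", "aardvark", "cute", "extremely"]


def to_dense(counts, vocabulary):
    """One coordinate per vocabulary word, zero when the word is absent."""
    return [counts.get(word, 0) for word in vocabulary]


def to_sparse(counts, vocabulary):
    """Keep only nonzero coordinates as {position: count}."""
    index = {word: i for i, word in enumerate(vocabulary)}
    return {index[w]: c for w, c in counts.items() if w in index}


counts = Counter("it is a puppy and it is extremely cute".split())
dense = to_dense(counts, VOCABULARY)
sparse = to_sparse(counts, VOCABULARY)
print(dense)
print(sparse)
```

Words outside the vocabulary ("is" and "a" here) are simply dropped in this sketch.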
8
Representing images
Image source: “Recognizing and learning object categories,”
Li Fei-Fei, Rob Fergus, Antonio Torralba, ICCV 2005–2009.
Raw image:
millions of RGB triplets,
one for each pixel
Classify:
person or animal?
Raw Image Bag of Visual Words
9
Representing images
Classify:
person or animal?
Raw Image Deep learning features
3.29
-15
-5.24
48.3
1.36
47.1
-1.92
36.5
2.83
95.4
-19
-89
5.09
37.8
Dense vector
representation
10
Feature space in machine learning
• Raw data → high-dimensional vectors
• Collection of data points → point cloud in feature space
• Model = geometric summary of point cloud
• Feature engineering = creating features of the appropriate
granularity for the task
Crudely speaking, mathematicians fall into two
categories: the algebraists, who find it easiest to
reduce all problems to sets of numbers and
variables, and the geometers, who understand the
world through shapes.
-- Masha Gessen, “Perfect Rigor”
12
Algebra vs. Geometry
Algebra: $a^2 + b^2 = c^2$
Geometry: [Figure: triangle with sides labeled a, b, c]
Pythagorean Theorem
(Euclidean space)
13
Visualizing a sphere in 2D
$x^2 + y^2 = 1$
Pythagorean theorem: $a^2 + b^2 = c^2$
[Figure: circle of radius 1 on x and y axes, with a triangle labeled a, b, c]
14
Visualizing a sphere in 3D
$x^2 + y^2 + z^2 = 1$
[Figure: sphere of radius 1 on x, y, and z axes]
15
Visualizing a sphere in 4D
$x^2 + y^2 + z^2 + t^2 = 1$
[Figure: x, y, and z axes, each marked at 1]
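The pictures stop at three axes, but the equation does not. As a rough sketch (the function name and the Gaussian sampling approach are illustrative choices, not from the slides), the same "squared coordinates sum to 1" condition can be checked in any number of dimensions:

```python
import math
import random


def random_point_on_unit_sphere(dim, rng):
    """Draw Gaussian coordinates, then rescale so the squared coordinates sum to 1."""
    coords = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in coords))
    return [c / norm for c in coords]


rng = random.Random(0)
for dim in (2, 3, 4, 1000):
    point = random_point_on_unit_sphere(dim, rng)
    # Sum of squares is 1 (up to floating-point error) regardless of dimension.
    print(dim, round(sum(c * c for c in point), 6))
```

This is the sense in which algebra keeps working after geometric intuition runs out.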
16
Why are we looking at spheres?
[Figure: pairs of shapes joined by equals signs]
Poincaré Conjecture:
All physical objects without holes
are “equivalent” to a sphere.
17
The power of higher dimensions
• A sphere in 4D can model the birth and death process of
physical objects
• Point clouds = approximate geometric shapes
• High dimensional features can model many things
Visualizing Feature Space
19
The challenge of high dimension geometry
• Feature space can have hundreds to millions of
dimensions
• In high dimensions, our geometric imagination is limited
- Algebra comes to our aid
20
Visualizing bag-of-words
[Figure: the document plotted as a point on puppy and cute axes, each marked at 1]
I have a puppy and
it is extremely cute
it 1
they 0
I 1
am 0
how 0
puppy 1
and 1
cat 0
aardvark 0
zebra 0
cute 1
extremely 1
… …
21
Visualizing bag-of-words
[Figure: three documents plotted on puppy, cute, and extremely axes, each marked at 1]
I have a puppy and
it is extremely cute
I have an extremely
cute cat
I have a cute
puppy
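To make the picture concrete, here is a small sketch that computes the coordinates of the three documents on the puppy, cute, and extremely axes. Whitespace tokenization and the helper name `project` are assumptions for illustration.

```python
AXES = ("puppy", "cute", "extremely")

documents = [
    "I have a puppy and it is extremely cute",
    "I have an extremely cute cat",
    "I have a cute puppy",
]


def project(document, axes):
    """Coordinates of a document: the count of each axis word in its text."""
    words = document.lower().split()
    return tuple(words.count(axis) for axis in axes)


points = [project(doc, AXES) for doc in documents]
for doc, point in zip(documents, points):
    print(point, doc)
```

Each document becomes one point; every other word in the vocabulary would add one more axis to the same space.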
22
Document point cloud
word 1
word 2
23
What is a model?
• Model = mathematical “summary” of data
• What’s a summary?
- A geometric shape
24
Classification model
Feature 2
Feature 1
Decide between two classes
25
Clustering model
Feature 2
Feature 1
Group data points tightly
26
Regression model
Target
Feature
Fit the target values
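As one illustration of a model as a geometric summary, the sketch below fits a straight line to a handful of points with ordinary least squares. The data and the function name `fit_line` are made up for the example; the slides do not prescribe a fitting method.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for one feature and one target."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical point cloud lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

Four points are summarized by two numbers, which together describe a line through the cloud.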
Visualizing Feature Engineering
28
When does bag-of-words fail?
[Figure: bag-of-words point cloud on puppy, cat, and have axes, showing the documents “I have a puppy”, “I have a cat”, “I have a kitten”, and “I have a dog and I have a pen”]
Task: find a surface that separates
documents about dogs vs. cats
Problem: the word “have” adds fluff
instead of information
29
Improving on bag-of-words
• Idea: “normalize” word counts so that popular words
are discounted
• Term frequency (tf) = number of times a term
appears in a document
• Inverse document frequency of word (idf) = log(N / number of documents containing the word)
• N = total number of documents
• Tf-idf count = tf × idf
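A minimal sketch of these definitions, assuming idf = log(N / number of documents containing the word). That form is an inference: it reproduces the values log 4 and log 1 = 0 used in the deck's four-document example. The function names are hypothetical.

```python
import math
from collections import Counter


def idf(word, documents):
    """log(N / number of documents containing the word).

    Assumes the word occurs in at least one document.
    """
    n_containing = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / n_containing)


def tf_idf(document, documents):
    """Weight each word in a document by tf * idf."""
    counts = Counter(document)
    return {word: tf * idf(word, documents) for word, tf in counts.items()}


corpus = [text.split() for text in (
    "i have a puppy",
    "i have a cat",
    "i have a kitten",
    "i have a dog and i have a pen",
)]

weights = tf_idf(corpus[0], corpus)
print(weights)
```

Words that appear in every document ("i", "have", "a") get weight 0, while "puppy", which appears in one document out of four, gets log 4.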
30
From BOW to tf-idf
[Figure: the documents “I have a puppy”, “I have a cat”, “I have a kitten”, and “I have a dog and I have a pen” on puppy, cat, and have axes]
idf(puppy) = log 4
idf(cat) = log 4
idf(have) = log 1 = 0
31
From BOW to tf-idf
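[Figure: tf-idf point cloud on puppy, cat, and have axes. “I have a puppy” sits at log 4 on the puppy axis and “I have a cat” at log 4 on the cat axis; “I have a dog and I have a pen” and “I have a kitten” sit together; a decision surface is drawn through the cloud]
tfidf(puppy) = log 4
tfidf(cat) = log 4
tfidf(have) = 0
Tf-idf flattens
uninformative
dimensions in the
BOW point cloud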
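The flattening can be seen by computing each document's coordinates on the puppy, cat, and have axes before and after weighting. This is a sketch only: it assumes idf = log(N / number of documents containing the word) and whitespace tokenization, and the helper names are hypothetical.

```python
import math

AXES = ("puppy", "cat", "have")

corpus = [text.split() for text in (
    "i have a puppy",
    "i have a cat",
    "i have a kitten",
    "i have a dog and i have a pen",
)]


def bow_point(doc):
    """Raw word counts along each axis."""
    return tuple(doc.count(axis) for axis in AXES)


def idf(word):
    """Assumed idf form; every axis word occurs somewhere in this corpus."""
    containing = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / containing)


def tfidf_point(doc):
    """Counts along each axis, scaled by that axis word's idf."""
    return tuple(doc.count(axis) * idf(axis) for axis in AXES)


bow_points = [bow_point(doc) for doc in corpus]
tfidf_points = [tfidf_point(doc) for doc in corpus]
for before, after in zip(bow_points, tfidf_points):
    print(before, "->", after)
```

After weighting, the have coordinate is 0 for every document, so the cloud collapses onto the plane spanned by puppy and cat.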
32
Entry points of feature engineering
• Start from data and task
- What’s the best text representation for classification?
• Start from modeling method
- What kind of features does k-means assume?
- What does linear regression assume about the data?
33
That’s not all, folks!
• There’s a lot more to feature engineering:
- Feature normalization
- Feature transformations
- “Regularizing” models
- Learning the right features
• Dato is hiring! jobs@dato.com
alicez@dato.com @RainyData


Editor's notes

  1. Features sit between raw data and model. They can make or break an application.