Words in Space
A Visual Exploration of Distance, Documents, and
Distributions for Text Analysis
PyData NYC
2018
Dr. Rebecca Bilbro
Head of Data Science, ICX Media
Co-creator, Scikit-Yellowbrick
Author, Applied Text Analysis with Python
@rebeccabilbro
Machine Learning Review
The Machine Learning Problem:
Given a set of n samples of data such that each sample is
represented by more than a single number (e.g. multivariate
data that has several attributes or features), create a model
that is able to predict unknown properties of each sample.
Spatial interpretation:
Given data points in a bounded,
high dimensional space, define
regions of decisions for any point
in that space.
Instances are composed of features that make up our dimensions.
Feature space is the n-dimensional space where our variables live (not
including the target).
Feature extraction is the art of creating a space with decision
boundaries.
Example
Target
Y ≡ Thickness of car tires after some testing period
Variables
$X_1$ ≡ distance travelled in test
$X_2$ ≡ time duration of test
$X_3$ ≡ amount of chemical C in tires
The feature space is $\mathbb{R}^3$, or more accurately, the positive octant of
$\mathbb{R}^3$, as all the X variables can only be positive quantities.
Domain knowledge about tires might suggest that the speed the vehicle was
moving at is important, hence we generate another variable, $X_4$ (this is the
feature extraction part):
$X_4 = X_1 / X_2$ ≡ the speed of the vehicle during testing.
This extends our old feature space into a new one, the positive part of $\mathbb{R}^4$.
A mapping is a function, $\phi$, from $\mathbb{R}^3$ to $\mathbb{R}^4$:
$\phi(x_1, x_2, x_3) = \left(x_1, x_2, x_3, \frac{x_1}{x_2}\right)$
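The mapping ϕ above can be sketched in a few lines of Python. This is only an illustration: the function name `phi` and the sample numbers are made up here, not taken from the talk.

```python
def phi(x1, x2, x3):
    """Map (distance, duration, chemical) from R^3 to R^4 by appending speed."""
    if x2 <= 0:
        raise ValueError("test duration x2 must be positive")
    return (x1, x2, x3, x1 / x2)


# Hypothetical test run: 120 distance units over 2 time units, 0.3 units of chemical C.
point = phi(120.0, 2.0, 0.3)
print(point)  # (120.0, 2.0, 0.3, 60.0)
```

The extracted feature adds a dimension without adding any new measurement; it only makes a relationship between existing variables explicit.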
Modeling Non-Numeric Data
Real-world data is often not
represented numerically
out of the box (e.g. text,
images), therefore some
transformation must be
applied in order to do
machine learning.
Tricky Part
Machine learning relies on our ability to imagine data as
points in space, where the relative closeness of any two
is a measure of their similarity.
So...when we transform those non-numeric features into
numeric ones, how should we quantify the distance
between instances?
Many ways of quantifying “distance” (or similarity)
[Figure: distance metrics, with two callouts: “often the default for numeric data” and “common rule of thumb for text data”]
With text, our choice of distance metric is very
important! Why?
Challenges of Modeling Text Data
● Very high dimensional
○ One dimension for every word (token) in the corpus!
● Sparsely distributed
○ Documents vary in length!
○ Most instances (documents) may be mostly zeros!
● Has some features that are more important than others
○ E.g. the “of” dimension vs. the “basketball” dimension when clustering sports articles.
● Has some feature variations that matter more than others
○ E.g. freq(tree) vs. freq(horticulture) in classifying gardening books.
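To make the dimensionality and sparsity points concrete, here is a small sketch that builds token-count vectors by hand. The three-document corpus is invented for illustration, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

# Toy corpus, invented for illustration only.
corpus = [
    "the tree in the garden",
    "horticulture is the art of garden cultivation",
    "basketball is a sport",
]

# One dimension per distinct token in the corpus.
vocabulary = sorted({token for doc in corpus for token in doc.split()})


def vectorize(doc):
    """Return the token-count vector of doc over the corpus vocabulary."""
    counts = Counter(doc.split())
    return [counts[token] for token in vocabulary]


vectors = [vectorize(doc) for doc in corpus]
zeros = sum(v.count(0) for v in vectors)
total = len(vocabulary) * len(vectors)
print(len(vocabulary), "dimensions;", zeros, "of", total, "entries are zero")
```

Even with three short documents, most entries are zero, and a frequent word such as "the" gets its own dimension alongside far more informative ones.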
Help!
Yellowbrick
● Extends the Scikit-Learn API.
● Enhances the model selection process.
● Tools for feature visualization, visual diagnostics, and visual steering.
● Not a replacement for other visualization libraries.
[Figure: Feature Analysis, Algorithm Selection, Hyperparameter Tuning]
Model selection is iterative, but can be steered!
TSNE (t-distributed Stochastic Neighbor
Embedding)
1. Apply SVD (or PCA) to reduce
dimensionality (for efficiency).
2. Embed vectors using probability
distributions from both the original
dimensionality and the decomposed
dimensionality.
3. Cluster and visualize similar
documents in a scatterplot.
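Step 2 can be written out using the standard t-SNE formulation. The notation is introduced here and does not come from the slides: $x_i$ are the original document vectors, $y_i$ the embedded low-dimensional points, $n$ the number of documents, $\sigma_i$ a per-point bandwidth, and $d$ the chosen distance metric.

```latex
\begin{align}
p_{j|i} &= \frac{\exp\!\left(-d(x_i, x_j)^2 / 2\sigma_i^2\right)}
                {\sum_{k \ne i} \exp\!\left(-d(x_i, x_k)^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n} \\
q_{ij} &= \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
               {\sum_{k \ne l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}} \\
\mathrm{KL}(P \parallel Q) &= \sum_{i \ne j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
\end{align}
```

The embedding is found by minimizing the KL divergence between the two distributions. The metric $d$ enters only through $p_{j|i}$, which is why swapping the metric changes which documents count as neighbors.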
Three Example Datasets
Hobbies corpus
● From the Baleen project
● 448 newspaper/blog articles
● 5 classes: gaming, cooking, cinema, books, sports
● Doc length (in words): 532 avg, 14564 max, 1 min
Farm Ads corpus
● From the UCI Repository
● 4144 ads represented as a list of metadata tags
● 2 classes: accepted, not accepted
● Doc length (in words): 270 avg, 5316 max, 1 min
Dresses Attributes Sales corpus
● From the UCI Repository
● 500 dresses represented as features: neckline, waistline, fabric, size, season
● Doc length (in words): 11 avg, 11 max, 11 min
Euclidean Distance
Euclidean distance is the straight-line distance between 2 points in Euclidean
(metric) space.
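from yellowbrick.text import TSNEVisualizer
# docs: vectorized documents (e.g. a TF-IDF matrix); labels: one class label per document
tsne = TSNEVisualizer(metric="euclidean")
tsne.fit(docs, labels)
tsne.poof()  # poof() is named show() in newer Yellowbrick releases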
[Figure: Doc 1 at (7, 14) and Doc 2 at (20, 19), plotted on axes running from 5 to 25]
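As a worked example, the distance between the two documents in the figure, Doc 1 at (7, 14) and Doc 2 at (20, 19), can be computed directly. The helper name `euclidean` is just for this sketch.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


doc1, doc2 = (7, 14), (20, 19)
print(euclidean(doc1, doc2))  # sqrt(13**2 + 5**2) = sqrt(194), about 13.93
```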
[Figure: t-SNE scatterplots using Euclidean distance. Panels: Hobbies Corpus, Ads Corpus, Dresses Corpus]
Cityblock (Manhattan) Distance
Manhattan distance between two points is computed as the sum of the absolute
differences of their Cartesian coordinates.
tsne = TSNEVisualizer(metric="cityblock")
tsne.fit(docs, labels)
tsne.poof()
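A minimal sketch of the computation, reusing the two example points from the Euclidean slide, (7, 14) and (20, 19). The helper name `manhattan` is an assumption of this sketch.

```python
def manhattan(a, b):
    """Sum of absolute coordinate differences between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


print(manhattan((7, 14), (20, 19)))  # |7 - 20| + |14 - 19| = 13 + 5 = 18
```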
[Figure: t-SNE scatterplots using Cityblock (Manhattan) distance. Panels: Hobbies Corpus, Ads Corpus, Dresses Corpus]
Chebyshev Distance
Chebyshev distance is the L∞-norm of the difference between two points, a special
case of the Minkowski distance where p goes to infinity. It is also known as
chessboard distance.
tsne = TSNEVisualizer(metric="chebyshev")
tsne.fit(docs, labels)
tsne.poof()
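In code, the L∞-norm reduces to taking the largest single-coordinate difference. The sketch below reuses the example points (7, 14) and (20, 19) from the Euclidean slide; the helper name `chebyshev` is chosen here for illustration.

```python
def chebyshev(a, b):
    """Largest absolute coordinate difference between two equal-length vectors."""
    return max(abs(x - y) for x, y in zip(a, b))


print(chebyshev((7, 14), (20, 19)))  # max(13, 5) = 13
```

The "chessboard" name comes from the king's move: a diagonal step costs the same as a straight one.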
[Figure: t-SNE scatterplots using Chebyshev distance. Panels: Hobbies Corpus, Ads Corpus, Dresses Corpus]
Minkowski Distance
Minkowski distance is a generalization of Euclidean, Manhattan, and Chebyshev
distance, and defines distance between points in a normed vector space as the
generalized Lp-norm of their difference.
tsne = TSNEVisualizer(metric="minkowski")
tsne.fit(docs, labels)
tsne.poof()
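The sketch below shows how one parameter p recovers the other three metrics, again using the example points (7, 14) and (20, 19). The helper name `minkowski` and its default of p = 2 (mirroring SciPy's `minkowski` default) are choices made for this illustration.

```python
def minkowski(a, b, p=2):
    """Generalized Lp distance between two equal-length vectors (p >= 1)."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


doc1, doc2 = (7, 14), (20, 19)
print(minkowski(doc1, doc2, p=1))   # Manhattan: 18.0
print(minkowski(doc1, doc2, p=2))   # Euclidean: about 13.93
print(minkowski(doc1, doc2, p=50))  # approaches Chebyshev (13) as p grows
```

Because the default here is p = 2, calling it without p gives the same result as Euclidean distance.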
[Figure: t-SNE scatterplots using Minkowski distance. Panels: Hobbies Corpus, Ads Corpus, Dresses Corpus]
Mahalanobis Distance
A multidimensional generalization
of the distance between a point
and a distribution of points.
tsne = TSNEVisualizer(metric="mahalanobis", method='exact')
tsne.fit(docs, labels)
tsne.poof()
Think: shifting and rescaling coordinates with respect to distribution. Can help find
similarities between different-length docs.
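The "shifting and rescaling" idea can be sketched for the 2-D case, where the covariance matrix can be inverted by hand. The function name `mahalanobis_2d` and the sample points are invented for this illustration; real document vectors would need a general matrix inverse.

```python
import math


def mahalanobis_2d(point, samples):
    """Mahalanobis distance from a 2-D point to the distribution of 2-D samples."""
    n = len(samples)
    mx = sum(x for x, _ in samples) / n
    my = sum(y for _, y in samples) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mx) ** 2 for x, _ in samples) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in samples) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in samples) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = point[0] - mx, point[1] - my
    # d^2 = [dx dy] * inverse(cov) * [dx dy]^T, with the 2x2 inverse written out.
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)


# Hypothetical samples spread twice as wide along x as along y; the mean is (2, 1).
samples = [(0, 0), (4, 0), (0, 2), (4, 2)]
along_wide_axis = mahalanobis_2d((4, 1), samples)    # 2 units from the mean along x
along_narrow_axis = mahalanobis_2d((2, 3), samples)  # 2 units from the mean along y
print(along_wide_axis, along_narrow_axis)
```

Both points are the same Euclidean distance from the mean, but the one displaced along the narrow axis is farther in Mahalanobis terms, because the distance is measured relative to the spread of the data.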
[Figure: t-SNE scatterplots using Mahalanobis distance. Panels: Hobbies Corpus, Ads Corpus, Dresses Corpus]
Cosine “Distance”
Cosine “distance” is one minus the cosine of the angle between two doc vectors
(the cosine itself is the similarity). The more parallel, the more similar.
Corrects for length variations (angles rather than magnitudes). Considers only
non-zero elements (efficient for sparse vectors!).
Note: Cosine distance is not technically a distance measure because it doesn’t
satisfy the triangle inequality.
tsne = TSNEVisualizer(metric="cosine")
tsne.fit(docs, labels)
tsne.poof()
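A minimal sketch of cosine distance, computed here as one minus the cosine of the angle between the vectors. The helper name and the toy count vectors are assumptions made for illustration.

```python
import math


def cosine_distance(a, b):
    """One minus the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


short_doc = (1, 2, 0)    # hypothetical token counts
long_doc = (10, 20, 0)   # same proportions, ten times the length
other_doc = (0, 0, 5)    # shares no tokens with the first two

print(cosine_distance(short_doc, long_doc))   # close to 0: same direction
print(cosine_distance(short_doc, other_doc))  # 1.0: orthogonal
```

The short and long documents are far apart by Euclidean distance but nearly identical by cosine distance, which is the length correction the slide describes.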
[Figure: t-SNE scatterplots using cosine “distance”. Panels: Hobbies Corpus, Ads Corpus, Dresses Corpus]
Canberra Distance
Canberra distance is a weighted version of Manhattan distance. It is often used for
data scattered around an origin, as it is biased for measures around the origin and
very sensitive for values close to zero.
tsne = TSNEVisualizer(metric="canberra")
tsne.fit(docs, labels)
tsne.poof()
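The sensitivity near zero is easy to see in a small sketch. Each coordinate's absolute difference is divided by the sum of the two absolute values; treating a 0/0 term as 0 is a common convention and an assumption of this example, as are the helper name and the sample values.

```python
def canberra(a, b):
    """Sum over coordinates of |x - y| / (|x| + |y|), skipping 0/0 terms."""
    total = 0.0
    for x, y in zip(a, b):
        denom = abs(x) + abs(y)
        if denom:  # a 0/0 term contributes nothing
            total += abs(x - y) / denom
    return total


# The same absolute gap of 0.001 in the first coordinate, near zero and far from it.
near_zero = canberra((0.001, 100), (0.002, 100))
far_from_zero = canberra((10.001, 100), (10.002, 100))
print(near_zero, far_from_zero)
```

The gap near zero contributes about a third of a unit, while the identical gap around 10 contributes almost nothing.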
[Figure: t-SNE scatterplots using Canberra distance. Panels: Hobbies Corpus, Ads Corpus, Dresses Corpus]
Jaccard Distance
Jaccard distance is one minus the Jaccard similarity, which is the size of the
intersection of two finite sets divided by the size of their union. More
effective for detecting things like document duplication.
tsne = TSNEVisualizer(metric="jaccard")
tsne.fit(docs, labels)
tsne.poof()
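A minimal sketch over sets of tokens, computing the distance as one minus intersection size over union size. The helper name and the two toy descriptions are invented for illustration.

```python
def jaccard_distance(a, b):
    """One minus |A ∩ B| / |A ∪ B| for two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)


doc_a = "red silk dress summer".split()
doc_b = "red cotton dress summer".split()
print(jaccard_distance(doc_a, doc_b))  # 3 shared of 5 distinct tokens: 1 - 3/5
```

Because only presence or absence matters, near-duplicate documents score close to 0 regardless of how often each token repeats.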
[Figure: t-SNE scatterplots using Jaccard distance. Panels: Hobbies Corpus, Ads Corpus, Dresses Corpus]
Hamming Distance
Hamming distance between two strings is the number of positions at which the
corresponding symbols are different. Measures minimum substitutions required to
change one string into the other.
tsne = TSNEVisualizer(metric="hamming")
tsne.fit(docs, labels)
tsne.poof()
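A minimal sketch of the count-based definition given above. The helper name and the example strings are illustrative; note that SciPy's `hamming` returns the fraction of differing positions rather than the count.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


print(hamming("karolin", "kathrin"))  # differs at positions 3, 4 and 5: 3
```

The equal-length requirement is a natural fit for fixed-width data such as the Dresses corpus, where every document has 11 words.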
[Figure: t-SNE scatterplots using Hamming distance. Panels: Hobbies Corpus, Ads Corpus, Dresses Corpus]
Other Yellowbrick Text Visualizers
● Intercluster Distance Maps
● Token Frequency Distribution
● Dispersion Plot
“Overview first, zoom and filter, then details-on-demand”
- Ben Shneiderman
Thank you!
