NLP for SEO
1. Paul Shapiro | @fighto | #TechSEOBoost
#TechSEOBoost | @CatalystSEM
THANK YOU TO THIS YEAR’S SPONSORS
NLP for SEO
Paul Shapiro, Catalyst
2. Paul Shapiro | @fighto | #TechSEOBoost
Breaking Down NLP for SEO
3. Paul Shapiro | @fighto | #TechSEOBoost
Paul Shapiro
Senior Partner, Head of SEO
@ Catalyst, a GroupM Agency
4. Paul Shapiro | @fighto | #TechSEOBoost
Assumptions & Prerequisites
• Familiarity with Python
• Familiarity with common data science libraries such as pandas and NumPy
• Familiarity with Jupyter Notebooks (optional)
• But no prior knowledge of NLP
5. Paul Shapiro | @fighto | #TechSEOBoost
Libraries Used in Examples
6. Paul Shapiro | @fighto | #TechSEOBoost
KNIME as an Alternative
https://www.knime.com
7. Paul Shapiro | @fighto | #TechSEOBoost
What is Natural Language Processing (NLP)?
8. Paul Shapiro | @fighto | #TechSEOBoost
What is NLP?
“NLP is a way for computers to analyze, understand, and derive
meaning from human language in a smart and useful way. By utilizing
NLP, developers can organize and structure knowledge to perform
tasks such as automatic summarization, translation, named entity
recognition, relationship extraction, sentiment analysis, speech
recognition, and topic segmentation.”
https://blog.algorithmia.com/introduction-natural-language-processing-nlp/
9. Paul Shapiro | @fighto | #TechSEOBoost
NLP: Old to New
• Linguistic Heuristics
• Statistics
• Machine Learning
10. Paul Shapiro | @fighto | #TechSEOBoost
Input: Parse Semi/Unstructured Text Data
https://github.com/niderhoff/nlp-datasets
11. Paul Shapiro | @fighto | #TechSEOBoost
Example Data Sources
• (Digital) Books
• CSVs, Excel, JSON, XML, etc.
• Word Docs/PDFs
• Web Pages (most relevant to SEO)
13. Paul Shapiro | @fighto | #TechSEOBoost
Text Pre-Processing
Tokenization
• Text must be broken into units aka tokens
• (Usually individual words)
14. Paul Shapiro | @fighto | #TechSEOBoost
Text Pre-Processing
We need to parse, clean, and prepare text data for both analysis
and conversion into a machine-interpretable format.
16. Paul Shapiro | @fighto | #TechSEOBoost
Text Pre-Processing
Noise and Junk Removal/Cleanup
• Punctuation and Special Characters
• Stop Words
• Common Abbreviations
• Common Character Cases
• Etc.
18. Paul Shapiro | @fighto | #TechSEOBoost
Tokenize & Remove Stop Words
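A minimal sketch of this step, using only the standard library so it runs anywhere. The stop-word list and the sample sentence are illustrative assumptions; in practice a library such as NLTK or spaCy would supply a fuller stop-word list and a more careful tokenizer.

```python
import re

# A tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "to", "and", "of", "in", "did", "why"}

def tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())

def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = tokenize("Why did the chicken cross the road?")
filtered = remove_stop_words(tokens)
print(filtered)
```

Lowercasing before the lookup also handles the "common character cases" cleanup in one pass.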
19. Paul Shapiro | @fighto | #TechSEOBoost
Expand Common Abbreviations
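One way this could look, as a sketch: a lookup table plus a regular expression that matches whole words only. The abbreviation table and sample sentence are hypothetical; a real project would build the table from its own corpus.

```python
import re

# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {
    "govt": "government",
    "dept": "department",
    "approx": "approximately",
}

# Match any abbreviation as a whole word, with an optional trailing period.
# Caveat: the period is swallowed even when it also ends the sentence.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b\.?",
    re.IGNORECASE,
)

def expand_abbreviations(text):
    """Replace each known abbreviation with its expanded form."""
    return _PATTERN.sub(lambda m: ABBREVIATIONS[m.group(1).lower()], text)

expanded = expand_abbreviations("The govt. dept is approx 2 miles away.")
print(expanded)
```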
20. Paul Shapiro | @fighto | #TechSEOBoost
Text Pre-Processing
Normalization and Standardization
• Stemming
• Lemmatization
21. Paul Shapiro | @fighto | #TechSEOBoost
Why Normalization? A Text Analytics Example
• Speeds up machine learning analysis
• Disambiguation
Say there are 500 jokes in our corpus that mention “Donald Trump”.
• 25 of those jokes include the word “economy”, 15 include the word “economic”, and 10 mention “world
economies”.
• All of these jokes have to do with both “economics” and “Donald Trump” but would turn up as 3
distinct co-occurrences.
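The arithmetic behind this example can be sketched as follows. The mapping table is a hand-written assumption standing in for a real stemmer or lemmatizer, and the total assumes no joke uses more than one of the three forms.

```python
from collections import Counter

# Co-occurrence counts with "Donald Trump", taken from the example above.
raw_counts = Counter({"economy": 25, "economic": 15, "economies": 10})

# Hypothetical normalization table (a stemmer or lemmatizer would produce this).
NORMALIZED = {"economy": "economy", "economic": "economy", "economies": "economy"}

# Merge the counts of every surface form under its normalized form.
merged = Counter()
for term, count in raw_counts.items():
    merged[NORMALIZED[term]] += count

print(len(raw_counts), "distinct terms before normalization")
print(dict(merged))
```

Three weak signals of 25, 15, and 10 become a single signal of 50, which is easier for both a human analyst and a model to pick up.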
22. Paul Shapiro | @fighto | #TechSEOBoost
Why Stemming and Pitfalls
• More basic method of reducing different forms of the same word to a common base
• Stemming chops off the end of the word to accomplish this
• Faster method
• Results in terms that are not real words:
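As a rough illustration of suffix chopping, here is a toy rule set. It is an assumption made for this example, not the Porter algorithm or any library's stemmer, but it shows how stems end up as non-words.

```python
# Toy suffix rules, checked in order (longest first). Illustrative only.
SUFFIXES = ("ies", "ic", "y")

def crude_stem(word, min_stem=3):
    """Chop the first matching suffix, keeping at least `min_stem` characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

stems = [crude_stem(w) for w in ("economy", "economic", "economies")]
print(stems)
```

All three forms collapse to "econom", which groups them correctly but is not a real word.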
24. Paul Shapiro | @fighto | #TechSEOBoost
Why Lemmatization and Pitfalls
• More sophisticated method of reducing different forms of the same word to a common base
• Lemmatization leverages vocabulary and grammar to infer the root of a word
• Requires part-of-speech tagging
• Slower but more accurate method
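A toy sketch of why the part-of-speech tag matters. The lookup table below is a hand-made assumption; a real lemmatizer (for example in NLTK or spaCy) relies on a full vocabulary and morphological rules.

```python
# Hypothetical lemma table keyed by (word, part of speech).
LEMMAS = {
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
    ("economies", "NOUN"): "economy",
    ("better", "ADJ"): "good",
}

def lemmatize(word, pos):
    """Return the lemma for a (word, POS) pair, or the lowercased word if unknown."""
    word = word.lower()
    return LEMMAS.get((word, pos), word)

print(lemmatize("meeting", "VERB"))    # "We are meeting today"
print(lemmatize("meeting", "NOUN"))    # "The meeting ran long"
print(lemmatize("economies", "NOUN"))
```

The same surface form maps to different lemmas depending on its tag, and the output is always a real word, unlike a stem.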
26. Paul Shapiro | @fighto | #TechSEOBoost
Information Extraction & Grouping
Getting more context
• N-Grams
• Parts of Speech Tagging
• Chunking/Chinking
• Named Entity Recognition
• Word Embeddings
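Of the techniques listed, n-grams are the simplest to sketch: slide a window of n tokens across the text. The sample tokens are an illustrative assumption.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["why", "did", "the", "chicken", "cross", "the", "road"]
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
```

Bigrams such as ("chicken", "cross") keep some of the word order that single tokens throw away.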
30. Paul Shapiro | @fighto | #TechSEOBoost
Named Entity Recognition
31. Paul Shapiro | @fighto | #TechSEOBoost
Word Embeddings: word2vec, GloVe
33. Paul Shapiro | @fighto | #TechSEOBoost
Statistical Feature Creation
• Leverage personal heuristics to create customized numeric
representations that you think could be used by a machine
learning model to make predictions
35. Paul Shapiro | @fighto | #TechSEOBoost
Example: Boolean Profanity
36. Paul Shapiro | @fighto | #TechSEOBoost
Example: Number of Profane Words
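Both profanity features can be sketched together. The word list uses mild placeholders and the sample jokes are invented for illustration; a real feature would use a curated profanity list.

```python
import re

# Placeholder word list (an assumption, kept deliberately mild).
PROFANE_WORDS = {"darn", "heck"}

def profane_count(text):
    """Count how many tokens in the text are on the profanity list."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(1 for t in tokens if t in PROFANE_WORDS)

def has_profanity(text):
    """Boolean feature: does the text contain any listed word?"""
    return profane_count(text) > 0

jokes = [
    "Well, heck, that darn cat did it again.",
    "Why did the chicken cross the road?",
]
features = [(has_profanity(j), profane_count(j)) for j in jokes]
print(features)
```

The boolean and the count carry different information, so it can be worth giving a model both columns.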
37. Paul Shapiro | @fighto | #TechSEOBoost
Feature Normalization
Box-Cox Power Transformations
• “A Box Cox transformation is a way to transform non-
normal dependent variables into a normal
shape. Normality is an important assumption for many
statistical techniques; if your data isn’t normal, applying a
Box-Cox means that you are able to run a broader
number of tests.”
https://www.statisticshowto.datasciencecentral.com/box-cox-transformation/
38. Paul Shapiro | @fighto | #TechSEOBoost
Box-Cox Power Transformation
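A minimal sketch of the transformation itself, with lambda passed in by hand. The sample values and the choice of lambda = 0.5 are assumptions; SciPy's `scipy.stats.boxcox` can estimate a suitable lambda from the data instead.

```python
import math

def box_cox(values, lam):
    """Apply the Box-Cox transform: (x**lam - 1) / lam, or log(x) when lam == 0."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]

# A right-skewed feature, e.g. a count with one large outlier.
skewed = [1, 4, 9, 100]
transformed = box_cox(skewed, 0.5)
print(transformed)
```

The gap between the largest value and the rest shrinks after the transform, which is the point of applying it to skewed features.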
39. Paul Shapiro | @fighto | #TechSEOBoost
Check Distribution with Histogram
45. Paul Shapiro | @fighto | #TechSEOBoost
N-Gram Vectorizer
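A standard-library sketch of what an n-gram count vectorizer produces: a shared vocabulary and one row of counts per document. The two sample documents are assumptions; scikit-learn's `CountVectorizer` with its `ngram_range` parameter is the usual tool for this.

```python
from collections import Counter

def ngram_counts(tokens, n_values=(1, 2)):
    """Count every n-gram in the token list for each n in n_values."""
    counts = Counter()
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

def vectorize(docs, n_values=(1, 2)):
    """Return a sorted vocabulary and a document-by-term count matrix."""
    per_doc = [ngram_counts(d.lower().split(), n_values) for d in docs]
    vocab = sorted(set().union(*per_doc))
    matrix = [[counts[term] for term in vocab] for counts in per_doc]
    return vocab, matrix

docs = ["the cat sat", "the cat ran"]
vocab, matrix = vectorize(docs)
print(vocab)
print(matrix)
```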
46. Paul Shapiro | @fighto | #TechSEOBoost
Let’s Talk About TF-IDF for a Moment
• Count Vectorizer looks at how many times a term or n-gram appears in a joke and
represents that count as a non-negative integer
• TF-IDF creates a score that considers how many times a term appears in a joke
as well as how often it appears across the entire corpus of jokes.
• Rarer words are deemed to be more important because they can be used to distinguish one joke from
another.
• Higher TF-IDF value = less common across the corpus (more distinctive)
• Lower TF-IDF value = more common across the corpus
47. Paul Shapiro | @fighto | #TechSEOBoost
TF-IDF Vectorizer
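A sketch of the basic TF-IDF calculation, using the textbook form tf × log(N / df). The sample documents are assumptions, and scikit-learn's `TfidfVectorizer` applies smoothing and normalization by default, so its numbers will differ from these.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per document using tf * log(N / df)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores

docs = ["the cat sat", "the dog sat", "the bird flew"]
scores = tf_idf(docs)
print(scores[0])
```

"the" appears in every document and scores zero, while "cat", which appears in only one, scores highest in the first document.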
48. Paul Shapiro | @fighto | #TechSEOBoost
Decision Trees
[Diagram: a decision tree for "Will [Sports Team] win?" that branches on "Player statistics are favorable?" and "Is the team they're playing historically better?", with Yes/No answers at each branch.]
49. Paul Shapiro | @fighto | #TechSEOBoost
Random Forest
[Diagram: two copies of the "Will [Sports Team] win?" decision tree side by side, each branching on "Player statistics are favorable?" and "Is the team they're playing historically better?" with Yes/No answers, showing a forest made of multiple trees.]
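The idea of many trees voting can be sketched with one-split trees (stumps), each trained on a bootstrap sample and one randomly chosen feature. The feature names and data below are invented for illustration, and real random forests grow deeper trees; scikit-learn's `RandomForestClassifier` is the practical choice.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature):
    """Find the threshold on one feature that misclassifies the fewest rows."""
    values = sorted({r[feature] for r in rows})
    majority = Counter(labels).most_common(1)[0][0]
    best = (len(rows) + 1, None, majority, majority)  # errors, threshold, left, right
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        for left, right in ((0, 1), (1, 0)):
            errors = sum(
                (left if r[feature] < threshold else right) != y
                for r, y in zip(rows, labels)
            )
            if errors < best[0]:
                best = (errors, threshold, left, right)
    _, threshold, left, right = best
    # With no usable threshold, fall back to the majority label of the sample.
    return lambda row: left if threshold is None or row[feature] < threshold else right

def fit_forest(rows, labels, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample and one random feature."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        feature = rng.randrange(len(rows[0]))
        trees.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature))
    return trees

def predict(trees, row):
    """Return the label that most trees vote for."""
    return Counter(tree(row) for tree in trees).most_common(1)[0][0]

# Hypothetical features: (player statistics score, opponent historical strength).
rows = [(0.9, 0.2), (0.8, 0.1), (0.7, 0.3), (0.85, 0.25),
        (0.2, 0.8), (0.3, 0.9), (0.1, 0.7), (0.25, 0.85)]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = win, 0 = loss

forest = fit_forest(rows, labels)
print(predict(forest, (0.95, 0.05)), predict(forest, (0.05, 0.95)))
```

Because every tree sees a slightly different sample and feature, a single badly trained tree is outvoted by the rest.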
50. Paul Shapiro | @fighto | #TechSEOBoost
Basic Machine Learning
53. Paul Shapiro | @fighto | #TechSEOBoost
How This Could Have Been Done Better
• Reduce overfitting
• Standardize features (mixing sparse and non-sparse data)
• Word embeddings for more context
• More sophisticated models
54. Paul Shapiro | @fighto | #TechSEOBoost
More Applications for SEO
• Creating performant content (joke example extrapolated)
• Predicting natural link earning potential
• Natural language generation, writing bits of content
• Semantic content optimization
• Site architecture design and taxonomy
• User flow creation
• Keyword research
• Etc.
55. Paul Shapiro | @fighto | #TechSEOBoost
How to Learn More, Resources
• https://web.stanford.edu/~jurafsky/slp3/
• https://www.kaggle.com/learn/overview
• https://towardsdatascience.com
• https://github.com/keon/awesome-nlp
56. Paul Shapiro | @fighto | #TechSEOBoost
LET’S REDEFINE TECHNICAL SEO
57. Paul Shapiro | @fighto | #TechSEOBoost
Thank You
–
Paul Shapiro, Senior Partner, Head of SEO, Catalyst
Paul.Shapiro@groupm.com
58. Paul Shapiro | @fighto | #TechSEOBoost
Thanks for Viewing the Slideshare!
–
Watch the Recording: https://youtube.com/session-example
Or
Contact us today to discover how Catalyst can deliver unparalleled SEO
results for your business. https://www.catalystdigital.com/