NLP for the Web
Dr. Matthew Peters
@mattthemathman
(with thanks to Rutu Mulkar, Erin Renshaw, Dan Lecocq,
Chris Whitten, Jay Leary and many others at Moz)
Moz is a SaaS company that sells SEO and Content Marketing software to
professional marketers
We crawl a lot: > 1 Billion pages / day
We’d like to extract structured information from these pages
Author identification
Keyword / topic extraction
Measure of page’s “Reach”
Measure of page’s “Reach” + structured information → most important topics, associate authors with areas of expertise, etc.
Most NLP tasks (parsing, POS tagging, Q&A, sentiment, etc.) focus exclusively on text content.
First two
paragraphs
http://www.seattletimes.com/seattle-news/transportation/berthas-delays-prompt-state-to-sue-tunnel-contractor/
[Venn diagram: NLP (POS tagging, question answering, language modeling, parsing, sentiment) overlapping Web (HTML, XML, Microdata (schema.org), CSS, JavaScript); interesting problems lie at the intersection]
3 extraction tasks
•  Main article
•  Keywords: Bertha, lawsuit, Seattle Tunnel Partners, STP, WSDOT, tunnel construction, etc.
•  Author
Outline
•  Motivation – what are some of the unique challenges and opportunities in doing NLP on the web?
•  Main article extraction / page de-chroming
•  Keyword extraction
•  Author identification
•  Conclusion
Challenges & Opportunities
C1: Pages have clutter
and unrelated text
Navigation aids
Ads
Links to
other articles
C2: Text segments can confuse NLP components
Text: Mike Lindblom Bertha: State sues - but even the lawsuit is delayed
NP chunks extracted by our chunker:
•  Mike Lindblom Bertha
•  State sues
•  the lawsuit
C3: The web has a lot of cruft
•  Many different standards and only partial adoption
•  Wide variety of templates and sites
•  Broken HTML
O1: Pages have additional structure
•  Web pages have attributes other than the visible text: URL string, page title, meta description, etc.
•  The HTML/XML has a tree structure we can use
O2: Hyperlinks
O3: CSS
<div class="article-columnist-name vcard">
<a class="author url fn" rel="author"
href="/author/mike-lindblom/">Mike Lindblom</a>
</div>
General approach
1.  Use HTML parser to represent page as a tree
2.  Split the tree into small pieces and analyze each piece
separately
3.  Run NLP pipelines or other machine learning models on these
small pieces to:
- focus attention on the important pieces
- extract structured information
- other task dependent objectives
4.  Need algorithms that efficiently process only raw HTML
(without JS, image, CSS, etc.)
Content extraction /
Web page de-chroming
Extract main
article content
(and optionally
comments) from
a web page
Main article
Dragnet
•  Combine diverse features with machine learning
•  Open source: https://github.com/seomoz/dragnet
•  v1 (2013): link/text density + CETR
blog: https://moz.com/devblog/dragnet-content-extraction-from-diverse-feature-sets/
paper: http://www2013.org/companion/p89.pdf
•  v2 (2015): added Readability
blog: https://moz.com/devblog/benchmarking-python-content-extraction-algorithms-dragnet-readability-goose-and-eatiht/
10,000 foot view
•  Split page into distinct visual elements called “blocks”
•  Use machine learning to classify each block as content or non-content
Inspired by:
Kohlschütter et al, Boilerplate Detection using Shallow Text Features, WSDM ’10
Weninger et al, CETR -- Content Extraction with Tag Ratios, WWW ’10
Readability: (https://github.com/buriy/python-readability)
Splitting the text into blocks
•  Use new lines in HTML (\n) (CETR)
•  Use subtrees (Readability)
•  Flatten HTML and break on <div>, <p>, <h1>, etc.
(Kohlschütter et al. and us)
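The third option can be sketched with the standard library alone. This is a minimal illustration, not Dragnet's actual blockifier: the `Blockifier` class, the `BLOCK_TAGS` set and the `link_chars` field are assumptions made for this example.

```python
from html.parser import HTMLParser

# Tags that close the current block and open a new one (illustrative choice).
BLOCK_TAGS = {"div", "p", "h1", "h2", "h3", "h4", "li", "td", "blockquote"}


class Blockifier(HTMLParser):
    """Flatten HTML into text blocks, counting anchor-text characters per block."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # finished blocks
        self._pieces = []       # text pieces of the block being built
        self._link_chars = 0    # characters of anchor text in the current block
        self._in_link = 0       # nesting depth of <a> tags
        self._skipping = 0      # nesting depth of <script>/<style>

    def flush(self):
        # Collapse whitespace and emit the block if it has any text.
        text = " ".join("".join(self._pieces).split())
        if text:
            self.blocks.append({"text": text, "link_chars": self._link_chars})
        self._pieces, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skipping += 1
        elif tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skipping = max(0, self._skipping - 1)
        elif tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)

    def handle_data(self, data):
        if self._skipping:
            return
        self._pieces.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


def blockify(html):
    parser = Blockifier()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any trailing text
    return parser.blocks
```

Each returned block keeps its text plus the amount of anchor text, which is enough to compute density features later without re-parsing the page.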
Text and link “density” are two important features.
High text density, low link density → more likely to be content
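A rough sketch of the two densities for a single block follows. The exact definitions are assumptions for illustration: text density is taken as words per line after wrapping at 80 characters, and link density as the fraction of words that sit inside anchor tags.

```python
import textwrap


def text_density(text, width=80):
    """Words per wrapped line; long prose fills lines, short menu items do not."""
    words = text.split()
    if not words:
        return 0.0
    lines = textwrap.wrap(text, width=width)
    return len(words) / len(lines)


def link_density(text, link_text):
    """Fraction of the block's words that appear inside <a> tags."""
    words = text.split()
    if not words:
        return 0.0
    return len(link_text.split()) / len(words)
```

A navigation block such as "Home About Contact" made entirely of links has link density 1.0 and a low text density, while a paragraph of prose has the opposite profile.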
•  Compute “smoothed tag ratio”: ratio of # text chars to # tags in each block, smoothed across neighboring blocks
•  Compute “smoothed absolute difference tag ratio”, dTR / dblock; captures the intuition that main content occurs together
•  Run k-means with 3 clusters on blocks, with one centroid always pinned to (0, 0)
•  Blocks in the (0, 0) cluster are non-content; the remainder are content
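The CETR-style steps can be sketched in plain Python. This is an illustration under stated assumptions, not the published implementation: a simple moving average stands in for the smoothing, the centroid initialization is a deterministic choice made for this example, and the function names are hypothetical.

```python
def smooth(values, radius=1):
    """Moving average over a window of 2*radius + 1, truncated at the ends."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - radius): i + radius + 1]
        out.append(sum(window) / len(window))
    return out


def cetr_features(chars, tags):
    """Return (smoothed tag ratio, smoothed absolute difference) per block."""
    # Text characters per tag; a block without tags is treated as having one.
    tr = smooth([c / max(t, 1) for c, t in zip(chars, tags)])
    last = len(tr) - 1
    diff = [abs(tr[min(i + 1, last)] - tr[i]) for i in range(len(tr))]
    return list(zip(tr, smooth(diff)))


def dist2(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def kmeans_pinned(points, k=3, iters=20):
    """k-means where centroid 0 stays at (0, 0); returns one label per point."""
    by_norm = sorted(points, key=lambda p: dist2(p, (0.0, 0.0)))
    step = len(by_norm) // k
    # Assumed initialization: the origin plus points taken from the far end.
    centroids = [(0.0, 0.0)] + [by_norm[-1 - j * step] for j in range(k - 1)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                  for p in points]
        for c in range(1, k):  # centroid 0 is never updated
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = (sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members))
    return labels


def cetr_is_content(chars, tags):
    """0-1 prediction per block: 1 unless the block falls in the (0, 0) cluster."""
    labels = kmeans_pinned(cetr_features(chars, tags))
    return [0 if lab == 0 else 1 for lab in labels]
```

Note that smoothing lets a short block adjacent to the article inherit a high ratio from its neighbors, so blocks at the boundary can be misclassified; that is one reason to treat the CETR prediction as a feature rather than a final answer.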
Encode CETR
predicted class as
0-1 feature
Use a simplified version of Readability:
•  Compute score for each subtree using:
- parent id/class attributes
- length of text
•  Find subtree with highest score
•  Block feature = maximum subtree score for all subtrees containing block
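A toy version of this scoring is sketched below. The `Node` class, the keyword patterns and the weight of 25 are assumptions chosen for illustration; they are not taken from the slides or from Dragnet.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical id/class hints; real implementations use longer lists.
POSITIVE = re.compile(r"article|body|content|entry|main|post|text", re.I)
NEGATIVE = re.compile(r"comment|footer|footnote|nav|sidebar|sponsor|widget", re.I)


@dataclass(eq=False)
class Node:
    id_class: str = ""                       # concatenated id and class attributes
    text: str = ""                           # text directly inside this node
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child


def subtree_text_length(node):
    return len(node.text) + sum(subtree_text_length(c) for c in node.children)


def subtree_score(node):
    """Score from the amount of text plus id/class hints on the subtree root."""
    score = subtree_text_length(node) / 100.0
    if POSITIVE.search(node.id_class):
        score += 25
    if NEGATIVE.search(node.id_class):
        score -= 25
    return score


def block_feature(leaf):
    """Maximum subtree score over the block and all of its ancestors."""
    best = subtree_score(leaf)
    node = leaf.parent
    while node is not None:
        best = max(best, subtree_score(node))
        node = node.parent
    return best
```

A block nested under an element with an article-like class inherits that subtree's high score, while a sidebar block can at best inherit the score of the page root.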
Random
Forest
Model performance
From 2013 paper (v1)
Task: Extract content
and comments
Keyword / topic extraction
Extract a ranked list of
keywords from a page with
relevancy score

(91, 'bertha')
(61, 'stp')
(59, 'state sues')
(44, 'tunnel')
(37, 'wsdot')
(30, 'tunnel construction')
(28, 'the seattle times')
(17, 'seattle tunnel partners')
(13, 'repair bertha')
(10, 'transportation lawsuit')
Prior work
Many prior papers on similar task
Most use small data sets (hundreds of labeled examples) → unsupervised + supervised methods
Wide range of previous approaches and almost always tailored to specific
type of document (academic papers, etc.)
Requirements for “gold standard” are fuzzy
Our approach:
Build a web specific algorithm to leverage unique aspects of domain
Combine many different features / approaches
Overcome data limitations and build complex model by gathering lots of
data automatically
[Pipeline diagram: Raw HTML → Generate Candidates → CANDIDATES → Rank Candidates → Ranked Topics]
Main article
Run dragnet to extract the main article content. Keep track of individual blocks and process each separately.
Text & Token normalization
•  Include web-specific logic in tokenizer / normalizer
•  A character can display as a dash but actually be the Unicode character U+2014
•  Need to special-case @twitter, email@domain.com, dates, etc.
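As a rough illustration of web-aware normalization and tokenization, the sketch below maps dash and quote variants to ASCII and keeps handles, e-mail addresses and simple numeric dates as single tokens. The character table and the regular expression are assumptions for this example, not the production rules.

```python
import re

# Map typographic dashes and quotes to plain ASCII (illustrative subset).
TRANSLATION = str.maketrans({
    "\u2012": "-", "\u2013": "-", "\u2014": "-", "\u2015": "-",
    "\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"',
})

TOKEN_RE = re.compile(r"""
    [\w.+-]+@[\w-]+(?:\.[\w-]+)+   # e-mail address
  | @\w+                           # Twitter-style handle
  | \d{1,2}/\d{1,2}/\d{2,4}        # simple numeric date
  | \w+(?:['-]\w+)*                # word, allowing inner apostrophes and hyphens
  | [^\w\s]                        # any other single symbol
""", re.VERBOSE)


def normalize(text):
    return text.translate(TRANSLATION)


def tokenize(text):
    return TOKEN_RE.findall(normalize(text))
```

The order of the alternatives matters: the e-mail and handle patterns are tried before the generic word pattern, so `mike@example.com` is not split at the `@`.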
[Pipeline diagram: Raw HTML → Generate Candidates (Parse/Dechroming → Normalize → Sentence/word tokenize) → CANDIDATES → Rank Candidates → Ranked Topics]
Processing individual blocks helps NP chunker
Block 1: Mike Lindblom
Block 2: Bertha: State sues - but even the lawsuit is delayed
NP chunks extracted by our chunker:
•  Mike Lindblom
•  Bertha
•  State sues
•  the lawsuit
Wikipedia lookup
Treat “Statue of Liberty”
as a single candidate
instead of splitting into
“Statue” “of Liberty”
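One simple way to do such a lookup is a greedy longest-match scan against a set of known titles. The sketch below assumes the titles are already loaded into memory; `merge_known_titles` and `max_len` are hypothetical names for this example.

```python
def merge_known_titles(tokens, titles, max_len=5):
    """Greedily merge the longest token span that matches a known title."""
    known = {t.lower() for t in titles}
    out, i = [], 0
    while i < len(tokens):
        # Try the longest span first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase.lower() in known:
                out.append(phrase)
                i += n
                break
        else:
            # No multi-token title starts here; keep the single token.
            out.append(tokens[i])
            i += 1
    return out
```

For example, `merge_known_titles(["the", "Statue", "of", "Liberty", "reopened"], {"Statue of Liberty"})` keeps "Statue of Liberty" together as one candidate.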
[Pipeline diagram: Raw HTML → Generate Candidates (Parse/Dechroming → Normalize → Sentence/word tokenize → POS tag/Noun phrase chunk → Wikipedia lookup) → CANDIDATES → Rank Candidates → Ranked Topics]
[Pipeline diagram: Raw HTML → Generate Candidates (Parse/Dechroming → Normalize → Sentence/word tokenize → POS tag/Noun phrase chunk → Wikipedia lookup) → CANDIDATES → Rank Candidates (features: Shallow, Occurrence, TF, QDR, POS, URL) → Ranked Topics]
Ranking model features
Shallow: relative position in document, number of tokens
Occurrence: does candidate occur in title, H1, meta description, etc
Term frequency: count of occurrences, average token count, sum(in degree),
etc
QDR: information retrieval motivated “query-document relevance” ranking
models. TF-IDF (term frequency X inverse document frequency),
probabilistic approaches, language models
POS tags: is the keyword a proper noun, etc
URL features: does the keyword appear in URL
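A few of these feature families can be sketched for a single candidate as follows. The feature names, the smoothed IDF formula and the hyphen-joined URL match are assumptions for illustration; `doc_freq` is a hypothetical mapping from phrase to the number of corpus documents containing it.

```python
import math


def candidate_features(candidate, doc_tokens, title, url, doc_freq, n_docs):
    """Shallow, occurrence, TF, TF-IDF and URL features for one candidate."""
    cand = candidate.lower().split()
    tokens = [t.lower() for t in doc_tokens]
    n = len(cand)
    # Start offsets where the candidate's tokens occur in the document.
    positions = [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == cand]
    tf = len(positions)
    # Smoothed inverse document frequency (one common variant).
    idf = math.log((1 + n_docs) / (1 + doc_freq.get(candidate.lower(), 0))) + 1
    return {
        "num_tokens": n,
        "first_position": positions[0] / len(tokens) if positions else 1.0,
        "in_title": candidate.lower() in title.lower(),
        "in_url": "-".join(cand) in url.lower(),
        "tf": tf,
        "tf_idf": tf * idf,
    }
```

The resulting dictionary is one row of the feature matrix; a classifier trained on such rows can then produce the relevance score used for ranking.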
[Pipeline diagram: Raw HTML → Generate Candidates (Parse/Dechroming → Normalize → Sentence/word tokenize → POS tag/Noun phrase chunk → Wikipedia lookup) → CANDIDATES → Rank Candidates (features: Shallow, Occurrence, TF, QDR, POS, URL → Classifier (probability of relevance)) → Ranked Topics]
Generating training data
List of high-volume keywords → Commercial search engine → Top 10 results → Crawl pages → Training data: HTML with relevant keyword
PU learning
Learning classifiers from only positive and unlabeled
data, Elkan and Noto, SIGKDD 2008
●  Most ML classifiers have both
positive and negative
examples in training data
●  We only have one keyword
per page that is relevant
(“positive”) and many others
that may or may not be
positive
●  Use result from this paper
applied to our data
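The core result of Elkan and Noto can be sketched in a few lines: train an ordinary classifier to separate labeled from unlabeled examples, estimate the constant c = p(labeled | positive) as the average score on held-out labeled positives, then divide scores by c. The function names below are hypothetical, and how the result was applied to the keyword data is not specified in the slides.

```python
def estimate_label_frequency(holdout_positive_scores):
    """Estimate c = p(labeled | positive) as the mean classifier score
    on held-out examples that are known (labeled) positives."""
    if not holdout_positive_scores:
        raise ValueError("need at least one held-out labeled positive")
    return sum(holdout_positive_scores) / len(holdout_positive_scores)


def positive_probability(score, c):
    """Convert p(labeled | x) into an estimate of p(positive | x) = score / c."""
    return min(1.0, score / c)
```

For example, if held-out labeled keywords receive an average score of 0.25, a candidate scored 0.1 by the labeled-versus-unlabeled classifier is assigned a relevance probability of about 0.4.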
Keyword extraction review
Resulting algorithm is:
●  Robust across different content types – worst case still extracts
reasonable topics
●  Reasonably fast, about 25 pages / second end-to-end
●  Subjectively outperforms other commercial APIs (e.g. Alchemy, etc).
●  In production for a year+, processed many millions of pages
Author extraction
Author
Extract a list of author names
(or an empty list if no authors)
from a given web page.
See: https://moz.com/devblog/web-page-author-extraction/
Do we need a ML algorithm for this?
(why isn’t this trivial?)
Heuristics do an adequate job:
●  The microformat rel="author" attribute in link tags (a) is commonly used
to specify the page author
●  Some sites specify page authors with a meta author tag.
●  Many sites use names like “author” or “byline” for class attributes in their
CSS.
Sometimes heuristics work well
<div class="article-columnist-name vcard">
<a class="author url fn" rel="author"
href="/author/mike-lindblom/">Mike Lindblom</a>
</div>
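The three heuristics can be sketched with the standard-library HTML parser. This is a minimal illustration, not the baseline evaluated in the talk; the `AuthorHeuristics` class and the exact class names it checks are assumptions.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "meta", "link", "input", "hr"}


class AuthorHeuristics(HTMLParser):
    """Collect author strings from meta tags, rel="author" and byline classes."""

    def __init__(self):
        super().__init__()
        self.authors = []
        self._depth = 0      # > 0 while inside a matching element
        self._buffer = []

    def _add(self, name):
        name = " ".join(name.split())
        if name and name not in self.authors:
            self.authors.append(name)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._depth:
            if tag not in VOID_TAGS:
                self._depth += 1   # track nesting inside the captured element
            return
        if tag == "meta" and (attrs.get("name") or "").lower() == "author":
            self._add(attrs.get("content") or "")
            return
        rel = (attrs.get("rel") or "").lower().split()
        classes = (attrs.get("class") or "").lower().split()
        if "author" in rel or "author" in classes or "byline" in classes:
            self._depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            self._add("".join(self._buffer))
            self._buffer = []

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def extract_authors(html):
    parser = AuthorHeuristics()
    parser.feed(html)
    parser.close()
    return parser.authors
```

On the markup above this returns `["Mike Lindblom"]`. On a byline such as `<p class="byline">by <a href="/x">John Timmer</a></p>` it returns `["by John Timmer"]`, which shows why the matched block still needs a step that isolates the name itself.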
But sometimes heuristics fall over
Some pages do not have any special markup for byline...
But sometimes heuristics fall over
Some pages have
misleading or wrong
markup
But sometimes heuristics fall over
Links to related stories and
sidebar bylines look nearly
identical to the main byline
Machine learning approach
Use machine learning approach
Crowd source labeled data using
Spare5
Approx 9,000 labeled pages
Case study
http://arstechnica.com/science/2016/07/algorithms-used-to-study-brain-activity-may-be-exaggerating-results/
[Pipeline diagram: Page HTML → Dragnet HTML Blockifier → Block representation → Block ranking model → Ranked blocks → Author chunker → Tags for highest ranked block]
Block ranking model
Combines NLP and web features
•  Tokens in block text (similar to bag-of-words classification)
•  Tokens in block HTML tag attributes (e.g. class="byline")
•  The HTML tags in block (e.g. many author names are links)
•  rel="author" and other markup inspired features
Put all features through Random Forest classifier that predicts probability a
block contains author
Block model performance
Overall block model is pretty good –
captures intuition that “bylines are easy
to spot”
Table lists Precision@K for whether a block actually contains the author’s name.
Author Chunker
Modified IOB tagger similar to NP chunker or POS tagger
3-class classification problem (Beginning of name, Inside name, Outside)
To make predictions at next token:
uni-, bi- and tri-gram tokens from previous/next few tokens
uni-, bi- and tri-gram POS tags
previous predicted IOB labels
HTML tags preceding and following the token
rel="author" and other markup inspired features
Overall, chunking the top block is 85.6% accurate.
Author Chunker
<p class="byline">
by
<a href="http://arstechnica.com/author/john-timmer/" rel="author">
<span>John Timmer</span>
</a>
- <span class="date">Jul 1, 2016 6:55 pm UTC</span>
</p>
Author Chunker using HTML features
by John Timmer - Jul 1

IN NNP NNP - NN CD

O ??
<p class="byline"> <a rel="author"><span> </span></a>
To make prediction here, we can use tokens,
POS tags and HTML structure between tokens
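The feature extraction for one token, and the decoding of predicted IOB labels into names, might look like the sketch below. The feature names, the `html_before` representation (the markup immediately preceding each token) and both function names are assumptions for illustration; the classifier itself is omitted.

```python
def token_features(tokens, pos_tags, html_before, prev_labels, i):
    """Features for predicting the IOB label of tokens[i]."""
    def at(seq, j, pad):
        return seq[j] if 0 <= j < len(seq) else pad

    prev_tok = at(tokens, i - 1, "<s>").lower()
    next_tok = at(tokens, i + 1, "</s>").lower()
    prev_pos = at(pos_tags, i - 1, "<s>")
    next_pos = at(pos_tags, i + 1, "</s>")
    return {
        "tok": tokens[i].lower(),
        "tok-1|tok": prev_tok + "|" + tokens[i].lower(),
        "tok|tok+1": tokens[i].lower() + "|" + next_tok,
        "pos": pos_tags[i],
        "pos-1|pos|pos+1": "|".join([prev_pos, pos_tags[i], next_pos]),
        "label-1": at(prev_labels, i - 1, "O"),   # previously predicted label
        "html_before": html_before[i],
        "rel_author": 'rel="author"' in html_before[i],
    }


def decode_iob(tokens, labels):
    """Turn per-token B/I/O labels into a list of name strings."""
    names, current = [], []
    for tok, lab in zip(tokens, labels):
        if lab == "B":
            if current:
                names.append(" ".join(current))
            current = [tok]
        elif lab == "I" and current:
            current.append(tok)
        else:
            if current:
                names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names
```

For the byline above, the token "John" is preceded by `<a rel="author"><span>`, so the `rel_author` feature fires exactly where the name begins, and labels `O B I O O O` decode to `["John Timmer"]`.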
Overall author model performance
Overall accuracy on test set is good, outperforming alternatives (heuristics, a commercial API, and an open-source Python library).
Conclusion
[Venn diagram: NLP (POS tagging, question answering, language modeling, parsing, sentiment) overlapping Web (HTML, XML, Microdata (schema.org), CSS, JavaScript)]

  • 1. NLP for the Web Dr. Matthew Peters @mattthemathman (with thanks to Rutu Mulkar, Erin Renshaw, Dan Lecocq, Chris Whitten, Jay Leary and many others at Moz)
  • 2. Moz is a SaaS company that sells SEO and Content Marketing software to professional marketers
  • 3. We crawl a lot: > 1 Billion pages / day We’d like to extract structured information from these pages
  • 4.
  • 8. Measure of page’s “Reach” + structured information à most important topics, associate authors with areas of expertise, etc
  • 9. Most NLP tasks (parsing, POS tagging, Q&A, sentiment, etc.) focus exclusively on text content.
  • 16. •  Motivation – what are some of the unique challenges and opportunities in doing NLP on the web? •  Main article extraction / page de-chroming •  Keyword extraction •  Author identification •  Conclusion Outline
  • 18. C1: Pages have clutter and unrelated text Navigation aids Ads Links to other articles
  • 19. C2: Text segments can confuse NLP components Mike Lindblom Bertha: State sues - but even the lawsuit is delayed Mike Lindblom Bertha State sues NP chunks extracted by our chunker the lawsuit
  • 20. •  Many different standards and only partial adoption •  Wide variety of templates and sites •  Broken HTML C3: The web has a lot of cruft
  • 21. •  Web page have attributes other then the visual text: URL string, page title, meta description, etc. •  The HTML/XML has a tree structure we can use O1: Pages have additional structure
  • 23. O3: CSS <div class="article-columnist-name vcard"> <a class="author url fn" rel="author" href="/author/mike-lindblom/">Mike Lindblom</a> </div>
  • 24. General approach 1.  Use HTML parser to represent page as a tree 2.  Split the tree into small pieces and analyze each piece separately 3.  Run NLP pipelines or other machine learning models on these small pieces to: - focus attention the important pieces - extract structured information - other task dependent objectives 4.  Need algorithms that efficiently process only raw HTML (without JS, image, CSS, etc.)
  • 25. Content extraction / Web page de-chroming
  • 26. Extract main article content (and optionally comments) from a web page Main article
  • 27. Dragnet •  Combine diverse features with machine learning •  Open source: https://github.com/seomoz/dragnet •  v1 (2013): link/text density + CETR: blog: https://moz.com/devblog/dragnet-content-extraction-from-diverse-feature-sets/ paper: http://www2013.org/companion/p89.pdf) •  v2: (2015): added Readability blog: https://moz.com/devblog/benchmarking-python-content-extraction-algorithms- dragnet-readability-goose-and-eatiht/
  • 28. 10,000 foot view •  Split page into distinct visual elements called “blocks” •  Use machine learning to classify each block as content or no content Inspired by: Kohlschütter et al, Boilerplate Detection using Shallow Text Features, WSDM ’10 Weninger et al, CETR -- Content Extraction with Tag Ratios, WWW ‘10 Readability: (https://github.com/buriy/python-readability)
  • 29. Splitting the text into blocks •  Use new lines in HTML (n) (CETR) •  Use subtrees (Readability) •  Flatten HTML and break on <div>, <p>, <h1>, etc. (Kohlschütter et al. and us)
  • 30.
  • 31.
  • 32.
  • 33. Text and link “density” two important features. High text density, low link density à more likely to be content
  • 34. Compute “smoothed tag ratio”, ratio of # tags to # chars Compute “smoothed absolute difference tag ratio”, dTR / dblock. Captures intuition that main content occurs together Run k-means with 3 clusters on blocks, with one centroid always pinned to (0, 0) Blocks in (0, 0) cluster are non-content, remainder content
  • 35. Encode CETR predicted class as 0-1 feature
  • 36. Use a simplified version of Readability: •  Compute score for each subtree using: - parent id/class attributes - length of text •  Find subtree with highest score •  Block feature = maximum subtree score for all subtrees containing block
  • 38. Model performance From 2013 paper (v1) Task: Extract content and comments
  • 41. Keyword / topic extraction
  • 42. Extract a ranked list of keywords from a page with relevancy score (91, 'bertha') (61, 'stp') (59, 'state sues') (44, 'tunnel') (37, 'wsdot') (30, 'tunnel construction') (28, 'the seattle times') (17, 'seattle tunnel partners') (13, 'repair bertha') (10, 'transportation lawsuit’)
  • 43. Prior work Many prior papers on similar task Most use small data sets (hundreds of labeled examples) à unsupervised + supervised methods Wide range of previous approaches and almost always tailored to specific type of document (academic papers, etc.) Requirements for “gold standard” are fuzzy Our approach: Build a web specific algorithm to leverage unique aspects of domain Combine many different features / approaches Overcome data limitations and build complex model by gathering lots of data automatically
  • 44. [Pipeline diagram] Raw HTML → Generate candidates → Rank candidates → Ranked topics
  • 45. Main article Run dragnet to extract the main article content. Keep track of individual blocks and process each separately.
  • 46. Text & token normalization: include web-specific logic in the tokenizer / normalizer
      •  A character that displays as a dash may actually be the Unicode character U+2014
      •  Need to special-case @twitter, email@domain.com, dates, etc.
  • 47. [Pipeline diagram] Raw HTML → Generate candidates (parse / dechroming → normalize → sentence / word tokenize) → Rank candidates → Ranked topics
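A minimal sketch of this kind of web-aware normalization and tokenization follows. The punctuation map and the regular expression are illustrative assumptions rather than the tokenizer described in the talk, and date handling is left out for brevity.

```python
import re
import unicodedata

# Map typographic punctuation to ASCII (illustrative, not exhaustive).
PUNCT_MAP = str.maketrans({
    "\u2014": "-",   # em dash
    "\u2013": "-",   # en dash
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote
    "\u201c": '"',   # left double quote
    "\u201d": '"',   # right double quote
})

# Alternatives are tried in order, so emails and handles win over words.
TOKEN_RE = re.compile(r"""
    [\w.+-]+@[\w-]+(?:\.[\w-]+)+   # email address
  | @\w+                           # Twitter-style handle
  | \w+(?:'\w+)?                   # word, optionally with an apostrophe
  | [^\w\s]                        # any other single punctuation mark
""", re.VERBOSE)


def normalize(text):
    """Apply Unicode NFKC normalization, then ASCII-fy punctuation."""
    return unicodedata.normalize("NFKC", text).translate(PUNCT_MAP)


def tokenize(text):
    """Normalize, then split into tokens with web-specific special cases."""
    return TOKEN_RE.findall(normalize(text))
```

For example, `tokenize("Bertha \u2014 ask @mattthemathman")` keeps the handle as one token and turns the dash into a plain hyphen.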
  • 48. Processing individual blocks helps the NP chunker
      Input text: Mike Lindblom Bertha: State sues - but even the lawsuit is delayed
      NP chunks extracted by our chunker: “Mike Lindblom Bertha”, “State sues”, “the lawsuit”
  • 49. Wikipedia lookup: treat “Statue of Liberty” as a single candidate instead of splitting it into “Statue” and “of Liberty”.
  • 50. [Pipeline diagram] Raw HTML → Generate candidates (parse / dechroming → normalize → sentence / word tokenize → POS tag / noun phrase chunk → Wikipedia lookup) → Rank candidates → Ranked topics
  • 51. [Pipeline diagram] Raw HTML → Generate candidates (parse / dechroming → normalize → sentence / word tokenize → POS tag / noun phrase chunk → Wikipedia lookup) → Rank candidates (features: Shallow, Occurrence, TF, QDR, POS, URL) → Ranked topics
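One simple way to get this behavior is a greedy longest-match lookup against a set of known titles. The sketch below assumes the titles are available as a lowercase in-memory set; `merge_known_phrases` and the `max_len` limit are hypothetical names and parameters for this example.

```python
def merge_known_phrases(tokens, titles, max_len=5):
    """Greedy longest-match: join token runs that appear in `titles`.

    `titles` is a set of lowercase phrases (for example, article titles).
    """
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest window first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase.lower() in titles:
                out.append(phrase)
                i += n
                break
        else:
            # No multi-token match starts here; keep the single token.
            out.append(tokens[i])
            i += 1
    return out
```

With `titles = {"statue of liberty"}`, the tokens of "We visited the Statue of Liberty today" come back with "Statue of Liberty" as one candidate.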
  • 52. Ranking model features
      •  Shallow: relative position in document, number of tokens
      •  Occurrence: does the candidate occur in the title, H1, meta description, etc.
      •  Term frequency: count of occurrences, average token count, sum(in degree), etc.
      •  QDR: information-retrieval-motivated “query-document relevance” ranking models: TF-IDF (term frequency × inverse document frequency), probabilistic approaches, language models
      •  POS tags: is the keyword a proper noun, etc.
      •  URL features: does the keyword appear in the URL
  • 53. [Pipeline diagram] Raw HTML → Generate candidates (parse / dechroming → normalize → sentence / word tokenize → POS tag / noun phrase chunk → Wikipedia lookup) → Rank candidates (features: Shallow, Occurrence, TF, QDR, POS, URL → classifier: probability of relevance) → Ranked topics
  • 54. Generating training data: List of high-volume keywords → Commercial search engine → Top 10 results → Crawl pages → Training data: HTML with relevant keyword
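As an example of one QDR-style feature, the sketch below scores a single-token candidate with TF-IDF. It uses a smoothed IDF variant chosen for this example; the `doc_freq` table, `n_docs`, and the `tf_idf` function name are assumptions, and multi-word candidates would need phrase-level counts instead.

```python
import math
from collections import Counter


def tf_idf(candidate, doc_tokens, doc_freq, n_docs):
    """Term frequency times a smoothed inverse document frequency.

    doc_freq maps a term to the number of corpus documents containing it;
    n_docs is the corpus size. Both come from a background corpus.
    """
    tf = Counter(doc_tokens)[candidate]
    idf = math.log((1 + n_docs) / (1 + doc_freq.get(candidate, 0))) + 1.0
    return tf * idf
```

A term that occurs in every document of the corpus gets an IDF of exactly 1.0 under this variant, so common words are not zeroed out but rare terms are weighted much higher.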
  • 55. PU learning
      Elkan and Noto, Learning classifiers from only positive and unlabeled data, SIGKDD 2008
      ●  Most ML classifiers have both positive and negative examples in training data
      ●  We only have one keyword per page that is relevant (“positive”) and many others that may or may not be positive
      ●  Use result from this paper applied to our data
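One result from Elkan and Noto is that, if labeled positives are a random sample of all positives, a classifier g(x) trained to separate labeled from unlabeled examples satisfies g(x) = c · p(relevant | x), where c is the probability that a positive gets labeled. The sketch below shows that calibration step only. Which result the talk actually applied is not stated, so treat this as an assumption; the function names are hypothetical.

```python
def estimate_label_frequency(scores_on_heldout_positives):
    """Estimate c = p(labeled | positive).

    Input: scores g(x) from a labeled-vs-unlabeled classifier, evaluated
    on held-out examples that are known (labeled) positives.
    """
    scores = list(scores_on_heldout_positives)
    if not scores:
        raise ValueError("need at least one held-out positive")
    return sum(scores) / len(scores)


def calibrate(g_x, c):
    """Convert g(x) = p(labeled | x) into an estimate of p(positive | x)."""
    return min(1.0, g_x / c)
```

The classifier itself can be any model that outputs probabilities; only its scores are rescaled.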
  • 57. Keyword extraction review
      Resulting algorithm is:
      ●  Robust across different content types: the worst case still extracts reasonable topics
      ●  Reasonably fast, about 25 pages / second end-to-end
      ●  Subjectively outperforms other commercial APIs (e.g. Alchemy)
      ●  In production for a year+, processed many millions of pages
  • 59. Author. Extract a list of author names (or an empty list if there are no authors) from a given web page. See: https://moz.com/devblog/web-page-author-extraction/
  • 60. Do we need an ML algorithm for this? (Why isn’t this trivial?)
      Heuristics do an adequate job:
      ●  The microformat rel="author" attribute in link tags (a) is commonly used to specify the page author
      ●  Some sites specify page authors with a meta author tag
      ●  Many sites use names like “author” or “byline” for class attributes in their CSS
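The three heuristics above can be sketched with the standard-library HTML parser. This is an illustrative baseline only: the `CLASS_HINTS` substrings, the `AuthorHeuristics` class, and `extract_authors` are assumptions for this example, not the heuristics evaluated in the talk.

```python
from html.parser import HTMLParser

CLASS_HINTS = ("author", "byline")  # assumed substrings in class names
VOID_TAGS = {"img", "br", "hr", "input", "link", "meta"}  # no end tag


class AuthorHeuristics(HTMLParser):
    """Collect author candidates from rel, meta, and class markup."""

    def __init__(self):
        super().__init__()
        self.authors = []
        self._capture_tag = None  # tag whose text is being collected
        self._depth = 0           # nesting depth of that same tag
        self._text = []

    def _add(self, name):
        name = " ".join(name.split())
        if name and name not in self.authors:
            self.authors.append(name)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            # Heuristic 2: <meta name="author" content="...">
            if (attrs.get("name") or "").lower() == "author":
                self._add(attrs.get("content") or "")
            return
        if self._capture_tag is not None:
            if tag == self._capture_tag:
                self._depth += 1
            return
        if tag in VOID_TAGS:
            return
        rel = (attrs.get("rel") or "").lower().split()
        classes = (attrs.get("class") or "").lower()
        # Heuristic 1: rel="author". Heuristic 3: class name hints.
        if "author" in rel or any(hint in classes for hint in CLASS_HINTS):
            self._capture_tag, self._depth, self._text = tag, 1, []

    def handle_endtag(self, tag):
        if tag == self._capture_tag:
            self._depth -= 1
            if self._depth == 0:
                self._add("".join(self._text))
                self._capture_tag = None

    def handle_data(self, data):
        if self._capture_tag is not None:
            self._text.append(data)


def extract_authors(html):
    parser = AuthorHeuristics()
    parser.feed(html)
    parser.close()
    return parser.authors
```

Note that a class match captures all of the element's text, so a byline such as "by John Timmer - Jul 1" comes back whole. Separating the name from the surrounding words needs a further step.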
  • 61. Sometimes heuristics work well <div class="article-columnist-name vcard"> <a class="author url fn" rel="author" href="/author/mike-lindblom/">Mike Lindblom</a> </div>
  • 62. But sometimes heuristics fall over Some pages do not have any special markup for byline...
  • 63. But sometimes heuristics fall over Some pages have misleading or wrong markup
  • 64. But sometimes heuristics fall over Links to related stories and sidebar bylines look nearly identical to the main byline
  • 65. Machine learning approach
      •  Crowdsource labeled data using Spare5
      •  Approx. 9,000 labeled pages
  • 67. [Pipeline diagram] Page HTML → Dragnet HTML Blockifier → Block representation → Block ranking model → Ranked blocks → Author chunker → Tags for highest ranked block
  • 68. Block ranking model
      Combines NLP and web features:
      •  Tokens in block text (similar to bag-of-words classification)
      •  Tokens in block HTML tag attributes (e.g. class="byline")
      •  The HTML tags in the block (e.g. many author names are links)
      •  rel="author" and other markup-inspired features
      Put all features through a Random Forest classifier that predicts the probability that a block contains the author.
  • 69. Block model performance
      Overall, the block model is pretty good: it captures the intuition that “bylines are easy to spot”.
      The table lists Precision@K, measuring whether the block actually contains the author’s name.
  • 70. Author Chunker
      Modified IOB tagger, similar to an NP chunker or POS tagger.
      3-class classification problem (Beginning of name, Inside name, Outside).
      Features used to make the prediction at the next token:
      •  uni-, bi- and tri-gram tokens from the previous/next few tokens
      •  uni-, bi- and tri-gram POS tags
      •  previous predicted IOB labels
      •  HTML tags preceding and following the token
      •  rel="author" and other markup-inspired features
      Overall, chunking the top-ranked block is 85.6% accurate.
  • 71. Author Chunker <p class="byline"> by <a href="http://arstechnica.com/author/john-timmer/" rel="author"> <span>John Timmer</span> </a> - <span class="date">Jul 1, 2016 6:55 pm UTC</span> </p>
  • 72. Author Chunker using HTML features
      Tokens:      by   John  Timmer  -   Jul  1
      POS tags:    IN   NNP   NNP     -   NN   CD
      IOB labels:  O    ??
      HTML structure: <p class="byline"> precedes “by”; <a rel="author"><span> ... </span></a> wraps the name
      To make the prediction at “??”, we can use tokens, POS tags and the HTML structure between tokens.
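Once every token in the top-ranked block has a Beginning / Inside / Outside label, turning the labels into author names is a small decoding step. The sketch below assumes the labels are the single letters "B", "I" and "O"; `decode_iob` is a hypothetical helper, and the tagger that produces the labels is not shown.

```python
def decode_iob(tokens, labels):
    """Turn per-token IOB labels into a list of name strings.

    "B" starts a name, "I" continues the current name, "O" is outside.
    An "I" with no open name is treated as outside.
    """
    names = []
    current = []
    for token, label in zip(tokens, labels):
        if label == "B":
            if current:
                names.append(" ".join(current))
            current = [token]
        elif label == "I" and current:
            current.append(token)
        else:
            if current:
                names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names
```

For the byline above, tokens `["by", "John", "Timmer", "-", "Jul", "1"]` with labels `["O", "B", "I", "O", "O", "O"]` decode to a single name, and a block with no "B" label yields an empty list.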
  • 73. Overall author model performance: overall accuracy on the test set is good, outperforming the alternatives (heuristics, a commercial API, and an open-source Python library).