My presentation at ODSC 2017. Video link: https://www.youtube.com/watch?v=Mwv6dSTYvN4&t=
The talk covers our AI engine for unsupervised full-site web extraction from millions of websites, and the supervised machine learning methods we use to fuse these distinct information sources and link them into our domain-specific probabilistic knowledge graph, with over 275 million facts mined so far. I also shared practical learnings on how we combine traditional information extraction techniques with recent advances in deep learning for a variety of NLP tasks, such as entity-level sentiment and relation extraction, on tens of millions of new documents a day across 15 languages.
3. Meltwater is the global leader in media intelligence
• 1,500 employees worldwide
• 28,000 corporate clients
• 50 offices across 6 continents
• Bootstrapped: no venture funding
• Founded 2001 in Oslo, Norway
• Headquarters in San Francisco
4. Big data company: We process 100 million documents and billions of searches every day
• Data capture
• Data enrichment
• Proprietary search engine
• Real-time analytics
5. We track leading performance
• Real-time analytics
• Client Satisfaction
• Industry Trends
• Competitive Intelligence
• Brand Perception
• Share of Voice
7. Under the hood
Ingestion:
• AI crawling for unstructured web
• Programmatic APIs for partnerships
• Over 100M documents every day
Media Intelligence applications:
• 1M complex Boolean queries configured
• Counters, aggregates, drill-downs, pivoting, regression
• Vertical search, news feed, media exposure, alerts based on trends & anomalies, influencers, etc.
Data Augmentation (15 languages):
• Text categorization (topic, language)
• Keyphrase extraction, summarization
• Sentiment analysis (entity, aspect level)
• Semantic hashing for near duplicate detection
Knowledge Management
• NER (person, location, organization, ...)
• NED ( https://en.wikipedia.org/wiki/Tim_Cook )
• Relation & event extraction
• Truth finding, link prediction, graph mining
8. Motivation for the platform
Access to Structured Data
• Make sure the data is clean, complete & normalized
• Make it relevant by connecting the dots
• Bring the methods close to the data
Data is deceitful. We need a systematic way to mine, propose, and explain possible insights:
• We need factual knowledge
• Combine (machine) learning and reasoning
Source: www.tylervigen.com (Spurious Correlations)
9.
• Streaming, Search, Analytics, APIs: building blocks to leverage the platform
• Data Enrichment Platform: enrich, analyze & build insights by interoperating with all major players
• Knowledge Graph: enable cognitive applications on top of our data by connecting the dots
• AI-Driven Data Acquisition: bring high-quality outside data to our repository with minimal human effort
Media Intelligence Apps, New Apps, Enterprise Solutions, Custom solutions, 3rd party Apps
PaaS, Outside Data, Context Building, Enrichment & Analysis, Service Layer
Global Monitoring, Distribute, Analyze & Report, Influence & Engage, Outside Insight, AI-Powered Reporting, Employee App, Freemium
100M documents ingested daily, 150 NLP/IR pipelines, 100's of billions of searches
10. A treasure trove of valuable external data
Online News, Share Price, Job Postings, Press Releases, Patent Filings, Social Media, Financial Filings, Real Estate Rates, App Downloads, Web Traffic, Unemployment, Oil Prices, Court Documents, Online Ad Spend, Blogs, Forums, Product Reviews, Interest Rates, Corporate Websites, Weather Data
12. The academic web
Typical comments about web data extraction:
• Microdata and the semantic web have solved the problem
• All the data is in web tables
• APIs provide all the structured data you need
13. The real web
Web data extraction is not a solved problem
• APIs are limited to large websites
• Web tables and microdata are marginal
• The real problem is not one-time extraction, but keeping the data up-to-date over time
14. AI crawling for Web data extraction
Traditional scraping requires a huge human effort:
• Code wrappers for each source, e.g., in Scrapy or MW's source configurations
• Visual testing and support tools (à la Connotate, Mozenda, …)
• Automatic scraping for a small number of fixed data types (à la Diffbot), e.g., Microdata
• Meltwater (old): ~50 "source engineers" maintaining manual wrappers
o Sources failing at a rate of hundreds per week, 1-2h to fix each source effectively
15. Web-Scale Wrapper Induction
We need to scale to the web
• Minimize supervision per source
But: we can afford prior knowledge
• About entities and attributes
• Mostly in the form of a known knowledge graph for domain knowledge
• Expressed as gazetteers or rules for local, textual information
• Higher-level rules or classifiers for complex structures
16. Web-Scale Wrapper Induction
Problem: application of prior knowledge is costly & noisy
• Wrapper induction to generalise to other pages of the site
• "Template" hypothesis
Solution: generate a "wrapper" program from examples
• Then apply it to all pages of a site
• Decide when to apply which extractor
Full-site extraction also needs to deal with
• Interactivity such as pagination & form filling (deep web)
• Detecting complex structures such as lists, tables, …
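As a rough illustration of the "template" hypothesis, the sketch below induces a wrapper from a few example pages: it looks for the DOM path whose text most often matches values already known from a gazetteer, then applies that path to an unseen page of the same site. The page snippets, the `byline` class, the author "Jane Doe", and all function names are hypothetical, and the tag-path wrapper is a deliberately simple stand-in, not the Fairhair.AI method.

```python
from collections import Counter
from html.parser import HTMLParser


class PathText(HTMLParser):
    """Collect (tag path, text) pairs for every non-empty text node.

    Assumes well-formed HTML without void elements such as <br>.
    """

    def __init__(self):
        super().__init__()
        self.stack = []
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        self.stack.append(f"{tag}.{css_class}" if css_class else tag)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def text_nodes(html):
    parser = PathText()
    parser.feed(html)
    parser.close()
    return parser.nodes


def induce_wrapper(example_pages, known_values):
    """Return the tag path that most often holds a known (gazetteer) value."""
    votes = Counter(
        path
        for page in example_pages
        for path, text in text_nodes(page)
        if text in known_values
    )
    return votes.most_common(1)[0][0] if votes else None


def apply_wrapper(path, page):
    """Extract every text node found at the induced path."""
    return [text for node_path, text in text_nodes(page) if node_path == path]


# Hypothetical pages from one site sharing a template.
PAGE_1 = '<html><body><div class="byline"><span>Doina Chiacu</span></div><p>Story one</p></body></html>'
PAGE_2 = '<html><body><div class="byline"><span>Jason Lange</span></div><p>Story two</p></body></html>'
PAGE_3 = '<html><body><div class="byline"><span>Jane Doe</span></div><p>Story three</p></body></html>'
KNOWN_AUTHORS = {"Doina Chiacu", "Jason Lange"}  # prior knowledge (gazetteer)

author_path = induce_wrapper([PAGE_1, PAGE_2], KNOWN_AUTHORS)
print(author_path, apply_wrapper(author_path, PAGE_3))
```

The prior knowledge is only needed on the example pages; once the path is induced, the wrapper extracts authors that are not in the gazetteer.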
17. Fairhair.AI Crawlers
Exploration
• Focused crawling
• Stop conditions
• Relational transducers
Template Discovery
• Data areas detection
• Record segmentation
• Attribute alignment
Form Understanding
• Labelling
• Classification
• Filling
Domain Modeling
• DOM annotation (dictionaries, regexes)
• Web phenomenology (forms, fields, labels, menus)
• Conceptual models
Framework for rule-based feature engineering, supporting quick turnaround for domain-specific rich features on top of a library of 2.5k pre-built features representing the structure, visual rendering, and textual content of a webpage, as well as the link structure and interaction patterns of the entire site.
18.
{
  "title": "White House vows to fight media 'tooth and nail' over Trump coverage; says it presented 'alternative facts'",
  "authors": [
    {
      "name": "Doina Chiacu",
      "socialHandlers": {
        "linkedin": "https://www.linkedin.com/in/doina-chiacu-2b2a9875",
        "twitter": "https://twitter.com/doinachiacu"
      }
    },
    {
      "name": "Jason Lange",
      "socialHandlers": {
        "twitter": "https://twitter.com/langejason"
      }
    }
  ],
  "datePublished": {
    "date": "2017-01-22",
    "time": "9:36PM"
  },
  "keywords": "Politics",
  "summary": "The White House vowed on Sunday to fight the news media “tooth and nail” over what it sees as unfair attacks, with a top adviser saying the Trump administration had presented “alternative facts” to counter low inauguration crowd estimates.",
  "siteHandlers": {
    "twitter": "@UnionLeader"
  },
  "ingress": "WASHINGTON — The White House vowed on Sunday to fight the news media “tooth and nail” over what it sees as unfair attacks, with a top adviser saying the Trump administration had presented “alternative facts” to counter low inauguration crowd estimates.",
  "images": [
    {
      "url": "http://www.unionleader.com/storyimage/UL/20170122/NEWS06/170129767/AR/0/AR-170129767.jpg",
      "type": "primary"
    }
  ],
  "engagements": [
    {
      "value": "4",
      "type": "comments"
    }
  ],
  "content": "On his first full day as president, Trump said he had a “running war” with the media and accused journalists of underestimating the number of people who turned out Friday for his swearing-in.\n\nWhite House officials made clear no truce was on the horizon on Sunday in television interviews that set a much harsher tone in the traditionally adversarial relationship between the White House and the press corps.\n\n“The point is not the crowd size. The point is the attacks and the attempt to delegitimize this president in one day. And we’re not going to sit around and take it,” Chief of Staff Reince Priebus said on “Fox News Sunday.”\n\nThe sparring with the media has dominated Trump’s first weekend in office, eclipsing debate over policy and Cabinet
19. Effects of AI Crawling
• 80-90% lower human effort, without loss in quality compared with state-of-the-art
• 10-100x more sources and domains than existing automated solutions and affordable supervised ones, e.g., 300k+ news sources, 1M+ company websites, job postings, press releases
• 3-10x more attributes
20. Information extraction
Identify and disambiguate mentions of entities of interest in a document.
Tokenizer + Splitter + PoS:
Tesla/NNP has/VBZ announced/VBN the/DT full/JJ acquisition/NN of/IN SolarCity/NNP which/WDT closed/VBD on/IN Monday/NNP morning/NN .
NER:
Tesla [ORG] has announced the full acquisition of SolarCity [ORG] which closed on Monday morning [DATETIME] .
NED candidates:
• Tesla: Tesla Science Center at Wardenclyffe; Tesla (Czechoslovak company); Tesla, Inc.
• SolarCity: SolarCity Corporation; City Solar AG
• Monday: Black Monday; Monday
21. Motivation
• Disambiguated entities are necessary to produce new relations for the graph via Relation Extraction (RE): without disambiguation, relations cannot be linked to the nodes in the graph
• Plain keyword search is extremely noisy
Example: Tesla has announced the full acquisition of SolarCity which closed on Monday morning .
• Tesla, Inc.: CEO Elon Musk, HQ Palo Alto, CA
• SolarCity Corporation: CEO Lyndon Rive, HQ San Mateo, CA
• ACQUISITION: acquire(Tesla_Inc, SolarCity_Corp)
22. NER architecture
● No feature engineering required
● Current SOTA uses DL
● Transfer learning to new domains with fewer training instances
● Better generalization
23. Character Representation using CNN
● Different embedding sizes (token, character)
● Optimizers: Adam, RMSProp, SGD
● Varying dropout rates
● Changing the number of hidden layers and units
● Different regularization techniques
● Average pooling vs. max pooling for CNN
● Varying filter and window length for CNN
Ref: End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF, Xuezhe Ma, Eduard Hovy. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016)
24. NED Architecture
Plain text:
Tesla has announced the full acquisition of SolarCity which closed on Monday morning.
NER annotated spans:
entity(0,5,ORG,"Tesla")
entity(44,53,ORG,"SolarCity")
entity(70,84,DATETIME,"Monday Morning")
NED: Leipzig University's AGDISTIS, using the HITS scoring algorithm over an indexed triple store
Input surface forms: Tesla, SolarCity, Monday Morning
Disambiguated entities:
Tesla |-> http://.../mw/Tesla_Inc
SolarCity |-> http://.../dbp/SolarCity
KG nodes: Tesla Inc., SolarCity Corporation
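To make the hand-off between NER and NED concrete, here is a minimal sketch of the annotated spans as character offsets into the plain text, with a check that each span really covers its surface form. The `Entity` type and `check_spans` helper are hypothetical names for illustration; only the offsets, labels, and sentence come from the slide.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    start: int    # inclusive character offset
    end: int      # exclusive character offset
    label: str
    surface: str


TEXT = "Tesla has announced the full acquisition of SolarCity which closed on Monday morning."

SPANS = [
    Entity(0, 5, "ORG", "Tesla"),
    Entity(44, 53, "ORG", "SolarCity"),
    Entity(70, 84, "DATETIME", "Monday Morning"),
]


def check_spans(text, spans):
    """Return True for each span whose offsets match its surface form (case-insensitive)."""
    return [text[e.start:e.end].lower() == e.surface.lower() for e in spans]


def named_entities(spans):
    """Surface forms handed to the NED step; the date span carries no entity to link."""
    return [e.surface for e in spans if e.label != "DATETIME"]


print(check_spans(TEXT, SPANS), named_entities(SPANS))
```

Keeping offsets rather than bare strings lets the disambiguated URI be attached back to the exact mention in the document.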
25. HITS Algorithm (link analysis, related to PageRank)
● It is query dependent: the hub and authority scores resulting from the link analysis are influenced by the search terms.
● As a corollary, it is executed at query time, not at indexing time, with the associated hit on performance that accompanies query-time processing.
● It computes two scores per document, hub and authority, as opposed to a single score.
● It is processed on a small subset of ‘relevant’ documents (a 'focused subgraph' or base set), not all documents as is the case with PageRank.
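The hub and authority computation can be sketched as a short power iteration over a focused subgraph. The toy graph below, its node names, and its edges are assumptions for illustration; how AGDISTIS actually builds its candidate graph from the triple store is not shown in the deck.

```python
import math


def hits(graph, iterations=50):
    """Compute hub and authority scores.

    graph: dict mapping each node to the list of nodes it links to.
    """
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: sum of hub scores of the nodes linking to this node.
        auth = {n: 0.0 for n in nodes}
        for src, targets in graph.items():
            for t in targets:
                auth[t] += hub[src]
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub: sum of authority scores of the nodes this node links to.
        hub = {n: sum(auth[t] for t in graph.get(n, ())) for n in nodes}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth


# Hypothetical focused subgraph of disambiguation candidates.
CANDIDATE_GRAPH = {
    "Elon_Musk": ["Tesla_Inc", "SolarCity_Corp"],
    "Tesla_Inc": ["SolarCity_Corp"],
    "Tesla_(Czechoslovak_company)": [],
    "SolarCity_Corp": [],
}

hub_scores, auth_scores = hits(CANDIDATE_GRAPH)
best = max(("Tesla_Inc", "Tesla_(Czechoslovak_company)"), key=auth_scores.get)
print(best)
```

In this toy graph the candidate that is connected to the other mentions in the sentence collects authority, while the isolated candidate stays at zero.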
26. Problems and solutions
Problem: AGDISTIS only receives named entities, thus its disambiguation context is limited
Solution: Complement the NER with a domain-specific (non-named) entity recognizer and verbs.
• We have a first version available for the business domain
• Bootstrapped from a large corpus of docs and a termbank generation algorithm
Example: Tim Cook [PER] is the CEO [BUSINESS TERM] of Apple [ORG]. I know Bayer [ORG] is a pharmaceutical [INDUSTRY] company [BUSINESS TERM].
Problem: AGDISTIS is language agnostic but the index is not. We were initially limited to English
Solution: Replicate the same process for other languages by mapping the index fields
• Not all languages have comprehensive DBpedia fields
• The English part of the index (the richest) probably always has to be present
28. Relation extraction using LSTMs
Example input: Google is competing with Microsoft
Network, from input to output:
• Input sentence
• Embedding layer (one vector per token)
• LSTM layer (one LSTM step per token)
• Dense layer
• Softmax output
29. Knowledge Graph
Goal: Infer high-level insights from a set of extracted events/facts.
• Relate facts
• Data mining
• Cognitive applications (higher-order reasoning)
• Contextual features
Sources feeding the FHAI Knowledge Graph: Editorial, Influencer DB, Job Postings, Press Releases, Company Database, SEC Filings, Patents, Social Media, Company Website, 3rd party providers
Entities: Companies, Brands, Products, Key people, Influencers
Relations: Competitor, Customer, Investment, Lawsuit/Litigation, Partnership, Supplier, Acquisition
Events: Funding Developments, Leadership Changes, New Offerings, Bankruptcy, Restructuring, Cost Cutting, Out/under performance, Expanding Operations, Compliance
30. Challenges
● Knowledge deduplication / integration
● Truth finding (contradictory facts)
● Confidence values
[Figure: extraction sources and their sizes: Text (301M), Document Object Model (1,280M), Tables (10M), Annotations (website metadata) (145M). Source: Xin Luna Dong (Google), PVLDB '14]
31. Graph embedding
• Given a graph (g), entities (e) and relations (r), produce a low-rank tensor factorization of the co-occurrence cube of all combinations of (e,r,e)
• Input:
• Graph
• Vector size
• Goal:
• Find vectors for all (e) and (r) that minimize the scoring function
• Output:
• Embedding vectors for (e) and (r).
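The scoring function itself is not reproduced in the deck text. As one illustration of a low-rank factorization over (e,r,e) triples, the sketch below uses a RESCAL-style bilinear score with a squared reconstruction loss against the 0/1 co-occurrence cube. The choice of RESCAL, the 2-dimensional vectors, and all names are assumptions for illustration, not the model used in the talk.

```python
from itertools import product


def score(e_s, w_r, e_o):
    """Bilinear score e_s^T W_r e_o for a (subject, relation, object) triple."""
    return sum(
        e_s[i] * w_r[i][j] * e_o[j]
        for i in range(len(e_s))
        for j in range(len(e_o))
    )


def reconstruction_loss(facts, entity_vecs, relation_mats):
    """Squared error between scores and the 0/1 cube over all (e, r, e) combinations."""
    loss = 0.0
    for s, r, o in product(entity_vecs, relation_mats, entity_vecs):
        target = 1.0 if (s, r, o) in facts else 0.0
        loss += (target - score(entity_vecs[s], relation_mats[r], entity_vecs[o])) ** 2
    return loss


# Hypothetical embeddings of vector size 2 that reconstruct the toy cube exactly.
ENTITY_VECS = {"Tesla_Inc": [1.0, 0.0], "SolarCity_Corp": [0.0, 1.0]}
RELATION_MATS = {"acquire": [[0.0, 1.0], [0.0, 0.0]]}
FACTS = {("Tesla_Inc", "acquire", "SolarCity_Corp")}

print(reconstruction_loss(FACTS, ENTITY_VECS, RELATION_MATS))
```

A training loop would adjust the vectors and matrices to drive this loss down; the asymmetric relation matrix is what lets the model score acquire(Tesla, SolarCity) differently from the reverse direction.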
32. Link prediction using embeddings
• Given a pair of entities (e1,e2), give a score for how probable it is that they have a relation
• Input:
• Embedding vectors of entities and relation r
• Annotated examples of true and false combinations of e,r,e
• Goal:
• Find a decision boundary that separates the true and false combinations
• E.g.: using a standard classifier (SVM or RandomForest), use embeddings as features
• Output:
• A probability score for e1,e2,r
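A minimal sketch of this setup: concatenate the embeddings of e1, r, and e2 into one feature vector and fit a classifier on annotated true/false triples. To stay dependency-free, the sketch swaps in a small hand-written logistic regression for the SVM or RandomForest named on the slide; the embeddings, entity names, and labels are made up for illustration.

```python
import math


def features(e1, r, e2):
    """Use the concatenated embeddings as the feature vector."""
    return list(e1) + list(r) + list(e2)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train(examples, labels, lr=0.5, epochs=200):
    """Plain stochastic gradient descent on the logistic loss."""
    weights = [0.0] * len(examples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(examples, labels):
            error = predict_proba(weights, bias, x) - target
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


# Hypothetical 2-dimensional embeddings.
EMB = {"Google": [1.0, 0.0], "Microsoft": [0.9, 0.1], "Oslo": [0.0, 1.0]}
REL = {"competitor": [0.5, 0.5]}

TRAIN = [
    (("Google", "competitor", "Microsoft"), 1),
    (("Microsoft", "competitor", "Google"), 1),
    (("Google", "competitor", "Oslo"), 0),
    (("Oslo", "competitor", "Microsoft"), 0),
]

X = [features(EMB[s], REL[r], EMB[o]) for (s, r, o), _ in TRAIN]
y = [label for _, label in TRAIN]
weights, bias = train(X, y)

p_true = predict_proba(weights, bias, features(EMB["Google"], REL["competitor"], EMB["Microsoft"]))
p_false = predict_proba(weights, bias, features(EMB["Google"], REL["competitor"], EMB["Oslo"]))
print(round(p_true, 3), round(p_false, 3))
```

Any classifier that outputs a probability can be dropped in at `train` and `predict_proba`; the embedding concatenation is the part that carries over.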
34. Link prediction combination
• Combine PRA (Path Ranking Algorithm) and embedding models to produce a superior link scoring/prediction algorithm.
• Achievement: significant improvement over SOTA in our experiments
35. Output: 275M Facts Mined (Distinct)
Entities:
Organization          11,171,077
Person                 1,708,796
Relation              Instances
Competition           33,327,137
Works At                 228,070
Investor (Company)        93,980
Founder                   67,515
Board Member              43,525
Acquisition               15,532
Investor (Person)         10,420
Sub-organization           4,214
37. NLP/IE Pipelines
A workflow for analyzing datasets (Batch & Real Time)
• The standard workflow fetches documents from the data lake
• Components: DL, language classifier, country classifier, topic classifier, router, NER1 ... NERn, NED1 ... NEDn, RE1 ... REn, DP, IR
• Supports multiway data flows, e.g., for ensembles of NERs
• Repositories: model repo, nlp-data repo, NED index repo, GS repo, registry of versioned SW
• Every time a new component SW version is registered, a scoring task against the GS is triggered
Example document metadata:
{
"topic": "business",
"language": "en",
"country": "us",
"section": "body"
}
LANGUAGE-COUNTRY: en-us, en-uk, sv-se, fr-fr, fr-ca, ...
CLASSIFIER TOPICS: Arts & Entertainment, Business, Demographic Groups, Environment & Nature, Events, Government & Politics, Health, Lifestyle, Living Things, Media, Science, Social Affairs, Sports, Technology
DOCUMENT SECTIONS: title, ingress, highlights, body, captions, quotations
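One way to picture the router is as a lookup from the classifier outputs (language, country, topic) to a registered component, falling back from the most specific key to a default. The registry contents and the function name below are hypothetical; the deck does not describe the actual routing rules.

```python
def select_component(meta, registry, default):
    """Pick the most specific registered component for a document's metadata."""
    language = meta.get("language")
    country = meta.get("country")
    topic = meta.get("topic")
    # Try the most specific key first, then progressively more general ones.
    for key in (
        (language, country, topic),
        (language, country, None),
        (language, None, None),
    ):
        if key in registry:
            return registry[key]
    return default


# Hypothetical registry: which NER model serves which slice of the stream.
NER_REGISTRY = {
    ("en", "us", "business"): "NER1",
    ("en", None, None): "NER2",
}

doc_meta = {"topic": "business", "language": "en", "country": "us", "section": "body"}
print(select_component(doc_meta, NER_REGISTRY, default="NERn"))
```

The same lookup could return a list of components instead of one name to support the multiway flows used for NER ensembles.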
38. Human in the loop
● Annotate text, entities, classification and custom HTML
● Task assignment, ranking, inter-annotator agreement
● Gold set creation for any structured data like NED, Knowledge Fusion
Very time-consuming
39. Data Programming
Our deep learning approach for extracting entity-relation tuples requires a huge amount of labelled data, which is expensive and time-consuming to produce.
With Snorkel, the goal is to write heuristics that programmatically generate training data.
Example: Facebook is competing with Google
Flow: potential relation mentions, heuristic (labelling) functions f1 ... fn, probabilistic training labels
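The idea of labelling functions can be sketched without any library: each heuristic votes on a candidate relation mention or abstains, and the votes are combined into a soft label. Snorkel fits a generative model over the labelling functions to weight them; the sketch below swaps in a plain unweighted vote instead, and the three heuristics are made-up examples rather than the ones used in the talk.

```python
import re

COMPETITOR, NOT_COMPETITOR, ABSTAIN = 1, 0, -1


def lf_competing_with(sentence, e1, e2):
    """Vote COMPETITOR when a 'compet...' word sits between the two entities."""
    pattern = re.escape(e1) + r".*\bcompet\w*\b.*" + re.escape(e2)
    return COMPETITOR if re.search(pattern, sentence) else ABSTAIN


def lf_rival_keyword(sentence, e1, e2):
    """Vote COMPETITOR on explicit rivalry keywords."""
    return COMPETITOR if re.search(r"\b(rival|versus)\b", sentence, re.I) else ABSTAIN


def lf_partnership(sentence, e1, e2):
    """Vote NOT_COMPETITOR when the sentence talks about a partnership."""
    return NOT_COMPETITOR if re.search(r"\bpartner\w*\b", sentence, re.I) else ABSTAIN


LABELLING_FUNCTIONS = [lf_competing_with, lf_rival_keyword, lf_partnership]


def probabilistic_label(sentence, e1, e2, lfs=LABELLING_FUNCTIONS):
    """Fraction of non-abstaining votes for COMPETITOR, or None if all abstain."""
    votes = [v for v in (lf(sentence, e1, e2) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return None
    return sum(votes) / len(votes)


print(probabilistic_label("Facebook is competing with Google", "Facebook", "Google"))
```

Mentions where every function abstains get no label and are simply left out of the generated training set.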
42. Aspect level sentiment extraction using character level LSTM
• Input text is processed as a sequence of UTF-8 encoded bytes.
• The hidden states of the model encode all the information the model has learned.
• Final cell states are used as the feature representation.
• Encoded output values range from -1 to 1.
• The mLSTM response lag is corrected using a reverse correlation method, which ensures the responses align with the corresponding text.
• The underlying lag-corrected mLSTM response to individual keyphrases is averaged to produce keyphrases with sentiment values.
System block diagram: Input text, mLSTM encoder, Aspect extractor, encoder lag correction, Aspect level sentiment
Multiplicative LSTM for sequence modelling. Krause et al., 2017
Learning to Generate Reviews and Discovering Sentiment. Radford et al., 2017
43. ALS extraction using character level LSTM
Network design
• Single-layer multiplicative LSTM with 4096 units
• Mini-batches of 128 subsequences of length 256
• 4 Pascal Titan X GPUs
• Training took approximately one month
• Trained on ~100M online reviews that are labelled
44. Unsupervised SAE
Large corpus of homogeneous documents (50k ~ 250k)
• same domain (use a classifier), preferably no bundles
Normalisation and tagging
• tokenisation (NUT specific)
• orthography normalisation (most common orthography)
• POS tagging (Hepple’s on TreeBank)
• NP chunking (Ramshaw – Mitchell)
NP Clustering
• head noun lemmatization (approx. last noun in NP)
• frequent head nouns -> aspect terms
Segmentation
• cPMI optimal parsing of an NP -> modifiers / multi-words
Generalisation and typing
• structured aspect patterns (SAP)
• entity, aspect term, qualifier, quantifier
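The NP clustering step lends itself to a short sketch: take the last noun of each noun-phrase chunk as its head, count heads across the corpus, and keep the frequent ones as aspect terms. The input format (token, POS) pairs, the `min_count` threshold, and the sample chunks are assumptions; lowercasing stands in for real lemmatization here.

```python
from collections import Counter


def head_noun(np_tokens):
    """Approximate the head of a noun phrase as its last noun.

    np_tokens: list of (token, Penn Treebank POS tag) pairs for one NP chunk.
    """
    nouns = [token.lower() for token, pos in np_tokens if pos.startswith("NN")]
    return nouns[-1] if nouns else None


def aspect_terms(noun_phrases, min_count=2):
    """Return head nouns that occur at least min_count times, most frequent first."""
    counts = Counter(h for h in map(head_noun, noun_phrases) if h)
    return [term for term, count in counts.most_common() if count >= min_count]


# Hypothetical NP chunks from a small set of product reviews.
NOUN_PHRASES = [
    [("the", "DT"), ("bright", "JJ"), ("display", "NN")],
    [("a", "DT"), ("dim", "JJ"), ("Display", "NN")],
    [("email", "NN"), ("alert", "NN"), ("notification", "NN")],
    [("small", "JJ"), ("fonts", "NNS")],
]

print(aspect_terms(NOUN_PHRASES))
```

On a corpus of 50k to 250k homogeneous documents the threshold would be far higher than 2; the value here only suits the toy input.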
45. Aspect sentiment extraction using character level LSTM
The filled markers indicate shifts in the LSTM response that are used to extract keyphrases in the text and their corresponding sentiment values from the LSTM response.
Example:
Aspect                      Sentiment
Display                     negative
Email alert notification    negative
Fonts                       negative
wallpaper                   negative
47. Connectors to serving systems
Data ingestion & insights delivery by setting up simple schema mappers
48. Involve users, entrepreneurs, and researchers
6 Data Science Hubs (co-working spaces)
✔ Sydney
✔ Berlin
✔ New York
✔ London
✔ San Francisco
✔ Singapore
Meltwater Entrepreneurial School of Technology
• HQ in Accra, Ghana
• Training program for African entrepreneurs
• Incubator (25+ startups)
• Networking hub
University collaborations
49.
• Streaming, Search, Analytics, APIs: building blocks to leverage the platform
• Data Enrichment Platform: enrich, analyze & build insights by interoperating with all major players
• Knowledge Graph: enable cognitive applications on top of our data by connecting the dots
• AI-Driven Data Acquisition: bring high-quality outside data to our repository with minimal human effort
Media Intelligence Apps, New Apps, Enterprise Solutions, Custom solutions, 3rd party Apps
PaaS, Outside Data, Context Building, Enrichment & Analysis, Service Layer
Global Monitoring, Distribute, Analyze & Report, Influence & Engage, Outside Insight, AI-Powered Reporting, Employee App, Freemium