The Semantic Web And The News

The Semantic Web and the News:
Exploitation and Adoption

Ken Ellis
Chief Scientist

Agenda

Intro to Daylife

 Exploiting the Semantic Web
Named Entities

Toolsets, issues


Adopting / Enabling

Others

Daylife


Daylife

A Platform for News Innovation:

A scalable solution for publishers of all sizes to generate more content
and more inventory – with no additional personnel costs

Daylife: What We Do
Aggregate Content

Licensed photos (Getty, AP, Reuters)

Articles (scraped, real-time)

Create Metadata

Topics (people, organizations, concepts)

Topic taxonomy, descriptions

Quotes with attribution

Photo identification

Relatedness

Authorship, sentiment analysis, etc.

Deliver to Clients

Web Sites / Modules / Data

Flexibility: API w/ 500 distinct queries

Novel search/ranking algorithms

Free API


[Wiki|DB]Pedia and Named Entities

We also want to collect content around a named entity
…and associate it with external data (Wikipedia, Freebase)

… for a lot of NE’s
(55k newsworthy ones last month)
[Chart: Articles Per Month (log scale, 1 to 1,000,000) vs. NE Rank (log scale, 1 to 100,000)]


Without getting swamped


Daylife and the Semantic Web

Wikipedia

website

API

Wikimedia dumps


DBPedia

Freebase

Partners

IPTC, NewsML


Clients

Proprietary metadata


Resources for News Organizations

Named Entities

vetting

disambiguation

aliases

prominence

Wikipedia

website

API

Wikimedia dumps

DBPedia

Freebase

Partners

IPTC, NewsML


Clients




But:
“… Now, team owner Kevin Buckler is looking to debut
in NASCAR Sprint Cup Series competition, when Mike
Wallace runs in Thursday's Gatorade Duel …”

Which Mike Wallace?

Mike_Wallace_(journalist)

Mike_Wallace_(NASCAR)


Two disambiguation approaches

Given an article and an extracted name, what Wikipedia entry does it map to?

Given a Wikipedia entry, what articles match?



Article First:

Wikimedia dumps and DBPedia

Filter for people, organizations, other NE

Construct weighted graph from links

Proxy for prominence (# edits, pageviews, dumps only)

Redirects & disambiguation pages

“Hillary Clinton” redirects to Hillary_Rodham_Clinton: a human decided the reference is unambiguous; likewise Usama/Osama

Identify names, possibly matching graph nodes

Select set of nodes that minimizes total distance

Perhaps factor in node prominence
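
The node-selection step above can be sketched as follows. This is a minimal illustration, not Daylife's implementation: the graph, its edge weights, the node names and the `disambiguate` helper are all made up for the example, and the exhaustive search over candidate combinations is only workable for a handful of names per article.

```python
from itertools import product

# Hypothetical link graph. Weights are distances: smaller means more closely linked.
EDGES = {
    ("Mike_Wallace_(NASCAR)", "NASCAR"): 1.0,
    ("Mike_Wallace_(NASCAR)", "Gatorade_Duel"): 2.0,
    ("Mike_Wallace_(journalist)", "CBS_News"): 1.0,
    ("Kevin_Buckler", "NASCAR"): 1.0,
    ("NASCAR", "Gatorade_Duel"): 1.0,
    ("CBS_News", "NASCAR"): 4.0,
}


def shortest_paths(edges):
    """All-pairs shortest distances (Floyd-Warshall) on an undirected graph."""
    nodes = sorted({n for pair in edges for n in pair})
    dist = {(a, b): (0.0 if a == b else float("inf")) for a in nodes for b in nodes}
    for (a, b), w in edges.items():
        dist[a, b] = dist[b, a] = min(dist[a, b], w)
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i, k] + dist[k, j] < dist[i, j]:
                    dist[i, j] = dist[i, k] + dist[k, j]
    return dist


def disambiguate(candidates, dist, prominence=None, weight=0.0):
    """Pick one node per mention so that total pairwise distance is smallest.

    candidates: mention text -> list of possible graph nodes.
    prominence: optional node -> score; a higher score lowers the cost.
    """
    prominence = prominence or {}
    mentions = list(candidates)
    best, best_cost = None, float("inf")
    for combo in product(*(candidates[m] for m in mentions)):
        cost = sum(dist[a, b] for i, a in enumerate(combo) for b in combo[i + 1:])
        cost -= weight * sum(prominence.get(n, 0.0) for n in combo)
        if cost < best_cost:
            best, best_cost = dict(zip(mentions, combo)), cost
    return best


candidates = {
    "Mike Wallace": ["Mike_Wallace_(journalist)", "Mike_Wallace_(NASCAR)"],
    "Kevin Buckler": ["Kevin_Buckler"],
    "NASCAR": ["NASCAR"],
    "Gatorade Duel": ["Gatorade_Duel"],
}
result = disambiguate(candidates, shortest_paths(EDGES))
print(result["Mike Wallace"])
```

With these invented weights the driver's node sits much closer to the other names in the article than the journalist's node does, so it is the one selected.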



[Figure: example link graph with nodes Mike Wallace (journalist), Mike Wallace (NASCAR), Kevin Buckler, NASCAR, Gatorade, Chicago Sun-Times, Chicago Bulls]

I made this up!


Another possibility: compare text of Wikipedia entry to the article

But:

Wikipedia entries largely historical, small fraction related to current events

Journalists, in providing context for lesser-known individuals, often mention a few other named entities
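
One simple way to compare an entry's text to an article is bag-of-words cosine similarity, sketched below. The similarity measure is a choice made for this example, not something the slides specify, and the two entry summaries are invented placeholder text rather than actual Wikipedia content.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lower-case the text and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def best_entry(article, entries):
    """Return the entry name whose text is most similar to the article."""
    vec = bag_of_words(article)
    return max(entries, key=lambda name: cosine(vec, bag_of_words(entries[name])))


# Placeholder summaries, written for this example only.
entries = {
    "Mike_Wallace_(journalist)": "American journalist and television correspondent for a news program",
    "Mike_Wallace_(NASCAR)": "American stock car racing driver who competes in NASCAR series",
}
article = (
    "Now, team owner Kevin Buckler is looking to debut in NASCAR Sprint Cup "
    "Series competition, when Mike Wallace runs in Thursday's Gatorade Duel"
)
print(best_entry(article, entries))
```

The caveats on the slide still apply: the match rests on only a few shared words, and an entry that is mostly historical may share almost no vocabulary with a current-events article.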


NE First approach:

Classifier for race car drivers, Wikipedia to identify names

Filter based on prominence

See EVRI taxonomical paths

http://www.evri.com/mainline-ui/jsp/index.jsf#searching-with-taxonomical-paths


NE First:

Tractable for a human (limited number of classifiers)

Better for low-recall high-precision


Article First:

Low editorial oversight

Best-guess


Neither is a complete solution

Not for locations


General Nits

Sticky Graffiti

Wikipedia can be updated real-time if you don’t like it

Some derived data sets can’t. Makes it our problem!

On-demand updates from Wikipedia API / HTML
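
An on-demand check against the Wikipedia API might look like the sketch below. It builds a MediaWiki `action=query` request for a page's latest revision timestamp and compares that with the time a local copy was taken. The helper names and the sample response are hypothetical; the response is hand-written in the shape the API returns, and no network request is made.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def last_revision_url(title):
    """URL asking the MediaWiki API for a page's latest revision timestamp."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "timestamp",
        "titles": title,
        "format": "json",
    }
    return API + "?" + urlencode(params)


def last_revision_time(response):
    """Extract the latest revision time from a parsed JSON response."""
    page = next(iter(response["query"]["pages"].values()))
    stamp = page["revisions"][0]["timestamp"]
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def needs_refresh(response, snapshot_time):
    """True if the page was edited after our local copy was taken."""
    return last_revision_time(response) > snapshot_time


# Hand-written sample response (page id and timestamp are placeholders).
sample = {
    "query": {
        "pages": {
            "12345": {
                "title": "Mike Wallace (journalist)",
                "revisions": [{"timestamp": "2009-03-05T12:00:00Z"}],
            }
        }
    }
}
snapshot = datetime(2009, 1, 1, tzinfo=timezone.utc)
print(last_revision_url("Mike_Wallace_(journalist)"))
print(needs_refresh(sample, snapshot))
```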

General Nits

Career Changes

Mike Wallace (journalist) becomes a NASCAR driver

Joe Wurzelbacher becomes a political pundit

Not a complete solution, but we knew that.

General Nits

Staleness

Infrequent Wikimedia dumps

GWB is still president?

DBPedia bad

Wikimedia dumps bad

Freebase good

Wikipedia HTML/API good

DBPedia, 3/5/09

Obscure Information

Clint Eastwood:

Is prominent, is a politician

Not a prominent politician


URI Stability

If this were 1981, unambiguous “George Bush”:


<rdf:RDF xmlns:rdf="http://www.w3.org/1981/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/George_Bush">
    <dc:title>George Bush</dc:title>
    <dc:publisher>Wikipedia</dc:publisher>
  </rdf:Description>
</rdf:RDF>

The NYTimes did this, and still does (API):

“George Bush” tag → George H. W. Bush


A lucky problem to have!


Resources

Named Entities

GUID’s!

tagging

associations (members of teams)

other data

Wikipedia

website

API

Wikimedia dumps

DBPedia

Freebase

Partners

IPTC, NewsML


Clients



Freebase
GUID’s are stable

Query by Wikipedia URI

http://www.freebase.com/api/service/mqlread?query={"query":{"*":null,"id":"/wikipedia/en/Mike_Wallace_$0028journalist$0029"}}

Easy-to-find redirects

GWB isn’t president

Professions vs. Types

Easier for topic tagging


Clint Eastwood still a politician

but: easier to tell he’s a minor one

multiple types/professions, not much political data


No good proxy for significance

cross-reference
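
The query-by-Wikipedia-URI request shown on this slide can be assembled as in the sketch below. The endpoint and the query envelope are taken from the slide. The key-escaping rule (anything other than ASCII letters, digits, `_` and `-` becomes `$` plus four hex digits) is an assumption inferred from the `$0028` and `$0029` in the slide's example, and both helper names are hypothetical. The code only builds the URL and sends nothing.

```python
import json
from urllib.parse import urlencode

# Endpoint as given on the slide.
MQLREAD = "http://www.freebase.com/api/service/mqlread"


def freebase_key(title):
    """Escape a Wikipedia title for use in a /wikipedia/en/ id.

    Assumed rule: keep ASCII letters, digits, "_" and "-";
    write every other character as "$" plus four upper-case hex digits.
    """
    return "".join(
        c if (c.isascii() and c.isalnum()) or c in "_-" else "$%04X" % ord(c)
        for c in title
    )


def mqlread_url(wikipedia_title):
    """Build the mqlread URL that asks for everything about one topic."""
    envelope = {
        "query": {"*": None, "id": "/wikipedia/en/" + freebase_key(wikipedia_title)}
    }
    return MQLREAD + "?" + urlencode({"query": json.dumps(envelope)})


print(freebase_key("Mike_Wallace_(journalist)"))
print(mqlread_url("Mike_Wallace_(journalist)"))
```

Unlike the URL on the slide, `urlencode` percent-encodes the JSON, which is the safer form to send over HTTP.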


Resources

Inter-agency standards

Newswire services

IPTC: photo information

NewsML: article information, topics

Wikipedia

website

API

Wikimedia dumps

DBPedia

Freebase

Partners

IPTC, NewsML


Clients



Interagency Metadata

Data:

 authorship
 location
 caption
 sometimes people, category
 NE’s hand-typed, often quickly
 RSS almost as good

Stripped

Matching problem, but STILL USEFUL

Resources

Q: “Can you use our metadata?”

A: “Sometimes”

Again, matching problem, but good for client-specific topics, still useful

Wikipedia

website

API

Wikimedia dumps

DBPedia

Freebase

Partners

IPTC, NewsML


Clients



Others Using the Semantic Web

Having an API

not the Semantic Web, but at least machine-friendly

eventually common, even for publishers


Publishing URI’s for Wikipedia, Freebase, IMDB, etc.

common among non-publishers

parasitic (not bad!)


Querying using the same URI’s

not so common

mutualistic



EVRI

API

Topics (mostly, all?) from Wikipedia

Probably taxonomic pathways, facets, derived from Wikipedia

Disambiguation based on above

Published Wikipedia URL’s

Can’t query by Wikipedia, other URI’s



Zemanta

Lots of Linked Data

API provides text markup


Developing (with others) a simplified RDFa-based semantic tagging standard


Calais (Thomson Reuters)

API extracts NE’s, other information

Provides Linked Data URI’s to others (one-way)

Provides their own endpoints

Not an aggregator

Eventual support for querying

Very clean!


The New York Times

Leading charge with publisher API

Their own tagging, great quality

Some major newspapers following suit

Other APIs: NewsGator, Inform, Outside.in

Slow Moves to Digital Access

Full-text RSS rare

API rare

Semantic Web standards rare

Wouldn’t it be great if:

You could ask for content about Mike_Wallace_(American_football)

They pointed you to other rich data sources


Wikipedia URI Lookup
A quick service to support lookup for Wikipedia URI’s

http://labs.daylife.com/wikipedia_topic_getInfo.php?uri=http://en.wikipedia.org/wiki/Mike_Wallace_(journalist)
or
http://labs.daylife.com/wikipedia_topic_getInfo.php?uri=Barack_Obama
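
A client for this lookup could build its request URL as in the sketch below. Only the endpoint and the `uri` parameter come from the slide; the helper names are hypothetical, and the response format is not described here, so the code stops at constructing the URL.

```python
from urllib.parse import urlencode

# Endpoint as given on the slide.
LOOKUP = "http://labs.daylife.com/wikipedia_topic_getInfo.php"
WIKI_PREFIX = "http://en.wikipedia.org/wiki/"


def lookup_url(topic):
    """Build a lookup URL from a full Wikipedia URI or a bare title."""
    return LOOKUP + "?" + urlencode({"uri": topic})


def wikipedia_title(topic):
    """Reduce a full Wikipedia URI to its title; leave bare titles alone."""
    return topic[len(WIKI_PREFIX):] if topic.startswith(WIKI_PREFIX) else topic


print(lookup_url("Barack_Obama"))
print(wikipedia_title("http://en.wikipedia.org/wiki/Mike_Wallace_(journalist)"))
```

`urlencode` percent-encodes a full URI passed as the parameter value, which differs from the unencoded form shown on the slide but is the conventional way to pass a URL inside a query string.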

Thank you

Web Site
http://www.daylife.com

Daylife API
http://developer.daylife.com

Labs
http://labs.daylife.com

Email
ken@daylife.com
