The Semantic Web and the News:
Exploitation and Adoption

Ken Ellis
Chief Scientist
Agenda

    Intro to Daylife

    Exploiting the Semantic Web

        Named Entities

        Toolsets, issues

    Adopting / Enabling

        Others

        Daylife

Daylife

    A Platform for News Innovation:

    A scalable solution for publishers of all sizes to generate more content and more inventory – with no additional personnel costs
Daylife: What We Do
    Aggregate Content

        Licensed photos (Getty, AP, Reuters)
    
        Articles (scraped, real-time)
    
    Create Metadata

        Topics (people, organizations, concepts)
    
        Topic taxonomy, descriptions
    
        Quotes with attribution
    
        Photo identification
    
        Relatedness
    
        Authorship, sentiment analysis, etc.
    
    Deliver to Clients

        Web Sites / Modules / Data
    
        Flexibility: API w/ 500 distinct queries
    
        Novel search/ranking algorithms
    
        Free API
    
[Wiki|DB]Pedia and Named Entities

 We also want to collect content around a named entity
…and associate it with external data (Wikipedia, Freebase)
[Wiki|DB]Pedia and Named Entities

    … for a lot of NE’s (55k newsworthy ones last month)

    [Figure: log-log plot of Articles Per Month (1 to 1,000,000) against NE Rank (1 to 100,000)]
[Wiki|DB]Pedia and Named Entities

    Without getting swamped

Daylife and the Semantic Web

    Wikipedia

        website
    
        API
    
        Wikimedia dumps
    

    DBPedia

    Freebase

    Partners

        IPTC, NewsML
    

    Clients

        Proprietary metadata
    
Resources for News Organizations

    Named Entities

        vetting

        disambiguation

        aliases

        prominence

    Wikipedia

        website

        API

        Wikimedia dumps

    DBPedia

    Freebase

    Partners

        IPTC, NewsML
    

    Clients

        Proprietary metadata
    
[Wiki|DB]Pedia and Named Entities

                        But:
“… Now, team owner Kevin Buckler is looking to debut
 in NASCAR Sprint Cup Series competition, when Mike
     Wallace runs in Thursday's Gatorade Duel …”

    Which Mike Wallace?

        Mike_Wallace_(journalist)
    
        Mike_Wallace_(NASCAR)
    


    Two disambiguation approaches

        Given an article, extracted name, what Wikipedia entry does it map to?

        Given a Wikipedia entry, what articles match?
    
[Wiki|DB]Pedia and Named Entities

    Articles First:

    Wikimedia dumps and DBPedia

        Filter for people, organizations, other NE
    
        Construct weighted graph from links
    
        Proxy for prominence (# edits, pageviews, dumps only)
    
        Redirects & disambiguation pages
    
            “Hillary Clinton” redirects to Hillary_Rodham_Clinton: a human decided the reference is unambiguous; likewise Usama/Osama


    Identify names, possibly matching graph nodes

    Select set of nodes that minimizes total distance

        Perhaps factor in node prominence
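
The article-first steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Daylife's implementation: the tiny link graph, its edge weights, and the function names `shortest_distances` and `disambiguate` are all made up for the example, and trying every combination of candidates is only practical for a handful of names.

```python
import heapq
from itertools import combinations, product

# Hypothetical weighted link graph (lower weight = stronger link).
# In practice this would be built from Wikimedia dump / DBPedia links.
GRAPH = {
    "Mike_Wallace_(journalist)": {"Chicago_Sun-Times": 1.0},
    "Chicago_Sun-Times": {"Mike_Wallace_(journalist)": 1.0, "Chicago_Bulls": 1.0},
    "Chicago_Bulls": {"Chicago_Sun-Times": 1.0, "Gatorade": 1.0},
    "Gatorade": {"Chicago_Bulls": 1.0, "NASCAR": 1.0},
    "NASCAR": {"Gatorade": 1.0, "Kevin_Buckler": 1.0, "Mike_Wallace_(NASCAR)": 1.0},
    "Kevin_Buckler": {"NASCAR": 1.0},
    "Mike_Wallace_(NASCAR)": {"NASCAR": 1.0},
}


def shortest_distances(graph, source):
    """Dijkstra: distance from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, weight in graph.get(node, {}).items():
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist


def disambiguate(graph, candidates):
    """Pick one node per extracted name so the total pairwise distance is minimal.

    candidates maps each extracted name to the graph nodes it might refer to.
    """
    names = list(candidates)
    nodes = {n for options in candidates.values() for n in options}
    dist = {n: shortest_distances(graph, n) for n in nodes}
    best, best_cost = None, float("inf")
    for choice in product(*(candidates[name] for name in names)):
        cost = sum(dist[a].get(b, float("inf")) for a, b in combinations(choice, 2))
        if cost < best_cost:
            best, best_cost = dict(zip(names, choice)), cost
    return best


candidates = {
    "Mike Wallace": ["Mike_Wallace_(journalist)", "Mike_Wallace_(NASCAR)"],
    "Kevin Buckler": ["Kevin_Buckler"],
    "NASCAR": ["NASCAR"],
    "Gatorade": ["Gatorade"],
}
result = disambiguate(GRAPH, candidates)
print(result["Mike Wallace"])
```

Node prominence is not modeled here; one option would be to subtract a per-node prominence bonus from the cost before comparing choices.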
    
[Wiki|DB]Pedia and Named Entities

    [Figure: link graph with nodes Mike Wallace (journalist), Chicago Sun-Times, Chicago Bulls, Gatorade, Kevin Buckler, NASCAR, and Mike Wallace (NASCAR)]

    I made this up!
[Wiki|DB]Pedia and Named Entities

    Another possibility: compare text of Wikipedia entry to the article

    But:

        Wikipedia entries largely historical, small fraction related to current events

        Journalists, in providing context for lesser-known individuals, often mention a few other named entities
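
For illustration, the text-comparison idea could look something like the sketch below. It uses plain bag-of-words cosine similarity, which is one common choice and not necessarily what was tried at Daylife. The entry texts are invented placeholders rather than real Wikipedia content, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercased word counts; a stand-in for real tokenization."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_entry(article_text, entries):
    """Return the title of the entry whose text is most similar to the article."""
    article = term_counts(article_text)
    return max(
        entries,
        key=lambda title: cosine_similarity(article, term_counts(entries[title])),
    )


# Placeholder entry texts, invented for this example.
entries = {
    "Mike_Wallace_(journalist)": "American journalist and television correspondent for a news program",
    "Mike_Wallace_(NASCAR)": "American race car driver competing in NASCAR series events",
}
article = "Team owner Kevin Buckler is looking to debut in NASCAR Sprint Cup Series competition"
print(best_entry(article, entries))
```

The caveat on the slide still applies: an entry that is mostly historical shares few terms with an article about this week's race, so the scores can be low and noisy.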
[Wiki|DB]Pedia and Named Entities

    NE First approach:

    Classifier for race car drivers, Wikipedia to identify names

        Filter based on prominence
    
        See EVRI taxonomical paths: http://www.evri.com/mainline-ui/jsp/index.jsf#searching-with-taxonomical-paths
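
A rough sketch of the NE-first flow, assuming a per-category classifier and a name list drawn from Wikipedia. Everything here is hypothetical: the keyword test stands in for a trained classifier, the prominence scores are invented, and `tag_articles` is not an existing API.

```python
def looks_like_racing(article_text):
    """Hypothetical stand-in for a trained 'race car driver' topic classifier."""
    keywords = ("nascar", "sprint cup", "daytona")
    text = article_text.lower()
    return any(keyword in text for keyword in keywords)


def tag_articles(articles, classifier, entities, min_prominence):
    """Tag entity names only in articles the classifier accepts.

    entities maps a display name to (Wikipedia title, prominence score);
    names below min_prominence are filtered out before matching.
    """
    names = {
        name: title
        for name, (title, prominence) in entities.items()
        if prominence >= min_prominence
    }
    tagged = {}
    for article_id, text in articles.items():
        if not classifier(text):
            continue
        hits = sorted(title for name, title in names.items() if name in text)
        if hits:
            tagged[article_id] = hits
    return tagged


# Invented data: names as they might come from a Wikipedia category listing.
DRIVERS = {
    "Mike Wallace": ("Mike_Wallace_(NASCAR)", 0.6),
    "Obscure Driver": ("Obscure_Driver", 0.01),
}
ARTICLES = {
    "a1": "Mike Wallace runs in Thursday's Gatorade Duel, a NASCAR Sprint Cup Series event.",
    "a2": "Mike Wallace interviewed the senator on television last night.",
}
tagged = tag_articles(ARTICLES, looks_like_racing, DRIVERS, min_prominence=0.1)
print(tagged)
```

Because the classifier gates the name match, the second article is never tagged as the driver, which is the high-precision, lower-recall trade-off of this approach.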
[Wiki|DB]Pedia and Named Entities

    NE First:

        Tractable for a human (limited number of classifiers)
    
        Better for low-recall high-precision
    


    Article First:

        Low editorial oversight
    
        Best-guess
    


    Neither is a complete solution

    Not for locations

[Wiki|DB]Pedia and Named Entities
General Nits

    Sticky Graffiti

        Wikipedia can be updated real-time if you don’t like it

        Some derived data sets can’t. Makes it our problem!

        On-demand updates from Wikipedia API / HTML
[Wiki|DB]Pedia and Named Entities
General Nits

    Career Changes

        Mike Wallace (journalist) becomes a NASCAR driver

        Joe Wurzelbacher becomes a political pundit

    Not a complete solution, but we knew that.
[Wiki|DB]Pedia and Named Entities
General Nits

    Staleness

    Infrequent Wikimedia dumps

    GWB is still president?

        DBPedia bad

        Wikimedia dumps bad

        Freebase good

        Wikipedia HTML/API good

    DBPedia, 3/5/09
[Wiki|DB]Pedia and Named Entities
    Obscure Information

    Clint Eastwood:

        Is prominent, is a politician
    
        Not a prominent politician
    
[Wiki|DB]Pedia and Named Entities
      URI Stability
  
      If this were 1981, unambiguous “George Bush”:

<rdf:RDF xmlns:rdf="http://www.w3.org/1981/02/22-rdf-syntax-ns#"
        xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://en.wikipedia.org/wiki/George_Bush">
    <dc:title>George Bush</dc:title>
    <dc:publisher>Wikipedia</dc:publisher>
  </rdf:Description>
</rdf:RDF>



      The NYTimes did this, and still does (API):
  
          “George Bush” tag -> George H. W. Bush
      

      A lucky problem to have!
  
Resources

    Named Entities

        GUID’s!

        tagging

        associations (members of teams)

        other data

    Wikipedia

        website

        API

        Wikimedia dumps

    DBPedia

    Freebase

    Partners

        IPTC, NewsML
    

    Clients

        Proprietary metadata
    
Freebase
    GUID’s are stable

    Query by Wikipedia URI

        http://www.freebase.com/api/service/mqlread?query={"query":{"*":null,"id":"/wikipedia/en/Mike_Wallace_$0028journalist$0029"}}

    Easy-to-find redirects

    GWB isn’t president

    Professions vs. Types

    Easier for topic tagging



    Clint Eastwood still a politician

        but: easier to tell he’s a minor one
    
        multiple types/professions, not much political data
    


    No good proxy for significance

        cross-reference
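
The "Query by Wikipedia URI" URL on this slide could be assembled along these lines. This is a sketch with assumptions: the escaping rule (`$` plus four hex digits for characters outside letters, digits, `_` and `-`) is inferred from the `$0028` and `$0029` in the slide's example, the exact set of characters left unescaped is a guess, and the helper names are hypothetical. The code only builds the URL and makes no request.

```python
import json
import re
from urllib.parse import quote

# Endpoint as shown on the slide.
MQLREAD = "http://www.freebase.com/api/service/mqlread"


def freebase_key(title):
    """Escape a Wikipedia title as a Freebase key.

    Assumption: every character outside [A-Za-z0-9_-] becomes '$' followed by
    its code point as four uppercase hex digits, e.g. '(' -> '$0028'.
    """
    return re.sub(r"[^A-Za-z0-9_-]", lambda m: "$%04X" % ord(m.group()), title)


def mqlread_url(wikipedia_title):
    """Build an mqlread URL asking for all properties of the topic."""
    envelope = {
        "query": {"*": None, "id": "/wikipedia/en/" + freebase_key(wikipedia_title)}
    }
    # Percent-encode the JSON so it is safe inside a query string.
    return MQLREAD + "?query=" + quote(json.dumps(envelope, separators=(",", ":")))


print(freebase_key("Mike_Wallace_(journalist)"))
print(mqlread_url("Mike_Wallace_(journalist)"))
```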
    
Resources

    Inter-agency standards

        Newswire services

        IPTC: photo information

        NewsML: article information, topics

    Wikipedia

        website

        API

        Wikimedia dumps

    DBPedia

    Freebase

    Partners

        IPTC, NewsML
    

    Clients

        Proprietary metadata
    
Interagency Metadata

    Data:

        authorship

        location

        caption

        sometimes people, category

        NE’s hand-typed, often quickly

    RSS almost as good

        Stripped

        Matching problem, but STILL USEFUL
Resources

    Q: “Can you use our metadata?”

    A: “Sometimes”

    Again, matching problem, but good for client-specific topics, still useful

    Wikipedia

        website

        API

        Wikimedia dumps

    DBPedia

    Freebase

    Partners

        IPTC, NewsML
    

    Clients

        Proprietary metadata
    
Others Using the Semantic Web

    Having an API

        not the Semantic Web, but at least machine-friendly
    
        eventually common, even for publishers
    



    Publishing URI’s for Wikipedia, Freebase, IMDB, etc.

        common among non-publishers
    
        parasitic (not bad!)
    



    Querying using the same URI’s

        not so common
    
        mutualistic
    
Others Using the Semantic Web

    EVRI

        API
    
        Topics (mostly, all?) from Wikipedia
    
        Probably taxonomic pathways, facets, derived from Wikipedia
    
        Disambiguation based on above
    
        Published Wikipedia URL’s
    
        Can’t query by Wikipedia, other URI’s
    
Others Using the Semantic Web

    Zemanta

        Lots of Linked Data
    
        API provides text markup
    


        Developing (with others) simplified RDFa-based semantic tagging standard
Others Using the Semantic Web

    Calais (Thomson Reuters)

        API extracts NE’s, other information
    
        Provides Linked Data URI’s to others (one-way)
    
        Provides their own endpoints
    
        Not an aggregator
    
        Eventual support for querying
    
        Very clean!
    
Others Using the Semantic Web
    The New York Times

        Leading charge with publisher API
    
        Their own tagging, great quality
    
        Some major newspapers following suit

        Other APIs: NewsGator, Inform, Outside.in
    Slow Moves to Digital Access

        Full-text RSS rare
    
        API rare
    
        Semantic Web standards rare
    
    Wouldn’t it be great if:

        You could ask for content about Mike_Wallace_(American_football)
    
        They pointed you to other rich data sources
    
Wikipedia URI Lookup
A quick service to support lookup for Wikipedia URI’s

    http://labs.daylife.com/wikipedia_topic_getInfo.php?uri=http://en.wikipedia.org/wiki/Mike_Wallace_(journalist)

    or

    http://labs.daylife.com/wikipedia_topic_getInfo.php?uri=Barack_Obama
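
A small sketch of building those lookup URLs from code. The endpoint and the `uri` parameter come from the slide; percent-encoding the parameter is standard query-string practice but an assumption about what the service expects, and `lookup_url` is a hypothetical helper. The response format is not described in the deck, so nothing is fetched or parsed here.

```python
from urllib.parse import quote

# Endpoint as shown on the slide.
LOOKUP = "http://labs.daylife.com/wikipedia_topic_getInfo.php"


def lookup_url(uri_or_title):
    """Build a lookup URL from a full Wikipedia URI or a bare page title."""
    # safe="" also encodes ':' and '/', so a full URI survives as one parameter.
    return LOOKUP + "?uri=" + quote(uri_or_title, safe="")


print(lookup_url("http://en.wikipedia.org/wiki/Mike_Wallace_(journalist)"))
print(lookup_url("Barack_Obama"))
```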
Thank you


            Web Site
            http://www.daylife.com

            Daylife API
            http://developer.daylife.com

            Labs
            http://labs.daylife.com

            Email
            ken@daylife.com
