1. Semantic annotation of text: techniques and applications. Prof. Luis Sanchez-Fernandez, Web Technologies Laboratory, University Carlos III of Madrid. http://webtlab.it.uc3m.es
3. Short history of the Web. 1990: Creation of the World Wide Web infrastructure at CERN by Tim Berners-Lee (HTTP, HTML, first Web client, first Web server). 1993: Mosaic, first graphical Web client. 1994: Netscape Navigator. 1996: Commercial use of the WWW becomes widespread. 1999: Tim Berners-Lee proposes the Semantic Web. 2002: Weblogs and RSS; Web 2.0. 6th October 2009: at least 8 billion indexable Web pages. 23rd September 2010: at least 15 billion indexable Web pages, according to http://www.worldwidewebsize.com/
4. The problem of information overload. The great success of the Web has led to one of its current problems: information overload. Finding and updating relevant information is difficult and time-consuming for people and companies. Example: keeping an up-to-date state of the art. Company employees can spend up to 20% of their working time searching the Web (Outsell Inc, 2002).
5. The Semantic Web proposal. The goal of the Semantic Web is to automate web tasks by enriching the current Web content with formal representations that enable better cooperation between humans and computers.
7. Máster interuniversitario en Ingeniería Telemática. RDF: “Resource Description Framework”. Goal of RDF (alternative views): a language for resource description in the Web; a language for formal representation of (parts of) the information available in a Web document (metadata). Formal => machine readable. Vocabulary defined with ontologies. What is a resource? Web content: Web pages, images, e-mails, files, … Resources mentioned in Web content: persons, locations, organizations, …
8. RDF basic principles. We want to represent a piece of information available in the Web that describes a resource. Each piece of metadata states a property that can be modelled as a (formal) statement, composed of: subject (the resource being described), predicate (a property of the resource), and object (the value of that property for the resource being described). Example: “http://www.example.org has a creator whose value is John Smith”.
9. RDF Model. An RDF model (a set of RDF statements) can be represented by means of a graph. For each statement: the subject is a node, the predicate is an arc, and the object is a node. Subject and predicate are resources. The object can be either a resource or a literal.
10. Example: “http://www.example.org has a creator whose value is John Smith”.
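The statement above can be sketched as a subject-predicate-object triple in plain Python, without any RDF library. The predicate URI (the Dublin Core `creator` property) is an assumption, since the slide only says "has a creator"; the `Resource`, `Literal`, and `Triple` types and the two helper functions are hypothetical names used for illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Resource:
    uri: str


@dataclass(frozen=True)
class Literal:
    value: str


@dataclass(frozen=True)
class Triple:
    subject: Resource                 # resource being described
    predicate: Resource               # property of the resource
    obj: Union[Resource, Literal]     # value of the property


# Assumption: "creator" is identified with the Dublin Core creator URI.
DC_CREATOR = Resource("http://purl.org/dc/elements/1.1/creator")

# An RDF model is a set of statements; here it holds the single example statement.
model = {
    Triple(Resource("http://www.example.org"), DC_CREATOR, Literal("John Smith")),
}


def graph_nodes(triples):
    """Subjects and objects are the nodes of the graph."""
    nodes = set()
    for t in triples:
        nodes.add(t.subject)
        nodes.add(t.obj)
    return nodes


def graph_arcs(triples):
    """Each statement contributes one arc, labelled with its predicate URI."""
    return {(t.subject, t.predicate.uri, t.obj) for t in triples}
```

In this sketch the object is a literal; replacing `Literal("John Smith")` with a `Resource` would make the arc point to another resource node instead.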
12. Ontologies: goal. An ontology is a formal, explicit specification of a shared conceptualization. An ontology defines the basic terms and relations comprising the vocabulary of a topic area, as well as rules that should be fulfilled by such terms and relations.
13. RDF Schema. An RDF vocabulary for: definition and description of properties; definition and description of classes. Can be used to define simple ontologies.
14. Properties in RDF Schema: rdfs:subPropertyOf, rdfs:range, rdfs:domain, rdfs:subClassOf.
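To illustrate what these four properties allow a program to infer, here is a minimal sketch in plain Python. It applies simplified versions of the corresponding RDFS rules and is not a complete RDFS reasoner. All `ex:` names (`ex:Governor`, `ex:Person`, `ex:State`, `ex:isGovernor`, `ex:holdsOffice`) are hypothetical and chosen only for illustration.

```python
# Triples are (subject, predicate, object) strings; all "ex:" names are hypothetical.
schema = {
    ("ex:Governor", "rdfs:subClassOf", "ex:Person"),
    ("ex:isGovernor", "rdfs:domain", "ex:Governor"),
    ("ex:isGovernor", "rdfs:range", "ex:State"),
    ("ex:isGovernor", "rdfs:subPropertyOf", "ex:holdsOffice"),
}
data = {("ex:GaryLocke", "ex:isGovernor", "ex:WST")}


def infer(schema, data):
    """Apply four simplified RDFS rules until no new triple appears."""
    triples = set(schema) | set(data)
    changed = True
    while changed:
        new = set()
        for s, p, o in triples:
            for s2, p2, o2 in triples:
                if p2 == "rdfs:domain" and s2 == p:
                    # The subject of a property belongs to the property's domain.
                    new.add((s, "rdf:type", o2))
                elif p2 == "rdfs:range" and s2 == p:
                    # The object of a property belongs to the property's range.
                    new.add((o, "rdf:type", o2))
                elif p2 == "rdfs:subPropertyOf" and s2 == p:
                    # A statement also holds for the super-property.
                    new.add((s, o2, o))
                elif p2 == "rdfs:subClassOf" and p == "rdf:type" and s2 == o:
                    # An instance of a class is an instance of its super-class.
                    new.add((s, "rdf:type", o2))
        changed = not new <= triples
        triples |= new
    return triples
```

Calling `infer(schema, data)` on this toy input adds, among others, the fact that `ex:GaryLocke` has type `ex:Person`, which follows from the domain rule combined with the sub-class rule.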
20. Generalities. Goal: extract semantic annotations from free text. Natural language is complex and ambiguous. Language dependent. Domain dependent applications: news, literature, e-mail, transcriptions of spoken dialogues. Some useful results can be achieved nowadays.
21. Taxonomy of semantic annotations. Content-based annotations: document categorization; named entities, e.g. Named Entity (Washington, location). Ontology-based domain annotations: concept and instance identification, e.g. <rdf:Description rdf:about='WST'> <rdf:type rdf:resource='State'/> </rdf:Description> <rdf:Description rdf:about='WDC'> <rdf:type rdf:resource='City'/> </rdf:Description>; relation extraction, e.g. isGovernor(GaryLocke, WST).
22. Basic techniques (i): Symbolic NLP. Based on the use of lexicons and grammar rules to process text. Example: “Barack Obama Elected President”. Lexical analysis: NP → Barack; NP → Obama; VBT → Elect; VBT → VBT + ‘ed’; NN → President. Parsing: S → NP NP* VBT NN. Semantic analysis: S → NP NP*(X) VBT(Elect) NN(Y) yields hasFunction(X, Y). Result: hasFunction(BarackObama, President).
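The three stages of this example (lexical analysis, parsing, semantic analysis) can be sketched in a few lines of Python. This is a toy illustration that hard-codes the slide's lexicon and its single grammar rule (a sentence made of one or more NP, then VBT, then NN); the function names are hypothetical, and a real symbolic NLP system would use much larger lexicons and grammars.

```python
# Lexicon from the slide: word -> part-of-speech tag.
LEXICON = {"Barack": "NP", "Obama": "NP", "Elect": "VBT", "President": "NN"}


def lexical_analysis(sentence):
    """Tag each token; the rule VBT + 'ed' maps an inflected verb to its base form."""
    tagged = []
    for token in sentence.split():
        if token in LEXICON:
            tagged.append((token, LEXICON[token]))
        elif token.endswith("ed") and LEXICON.get(token[:-2]) == "VBT":
            tagged.append((token[:-2], "VBT"))
        else:
            raise ValueError(f"unknown token: {token}")
    return tagged


def semantic_analysis(tagged):
    """Match one or more NP, then VBT(Elect), then NN; emit hasFunction(X, Y) or None."""
    tags = [tag for _, tag in tagged]
    n_np = 0
    while n_np < len(tags) and tags[n_np] == "NP":
        n_np += 1
    if n_np >= 1 and tags[n_np:] == ["VBT", "NN"] and tagged[n_np][0] == "Elect":
        x = "".join(word for word, _ in tagged[:n_np])   # X: the proper-noun sequence
        y = tagged[n_np + 1][0]                          # Y: the common noun
        return f"hasFunction({x}, {y})"
    return None


print(semantic_analysis(lexical_analysis("Barack Obama Elected President")))
# prints: hasFunction(BarackObama, President)
```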
23. Basic techniques (ii): Statistical NLP. Based on counting: finding frequent patterns that make the occurrence of a certain text feature likely. Use of extensive corpora. Example: “Washington”, when appearing in the same document as “Hollywood”, is likely to represent (Denzel Washington, actor), while “Washington”, when appearing in the same document as “Obama”, is likely to represent (Washington D.C., American capital). We can count the frequency of the different meanings of “Washington” when it appears in different contexts.
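The counting idea can be sketched as follows. The training examples are a tiny hand-made set invented for illustration (an assumption, not real corpus statistics), and the function names are hypothetical; a real system would count over an extensive annotated corpus.

```python
from collections import Counter, defaultdict

# Toy, hand-made examples (an assumption, not real corpus data):
# each pair is (context words seen in a document, meaning of "Washington" there).
TRAINING = [
    ({"Hollywood", "film"}, "Denzel Washington"),
    ({"Hollywood", "actor"}, "Denzel Washington"),
    ({"Obama", "Congress"}, "Washington D.C."),
    ({"Obama", "Hollywood"}, "Washington D.C."),
    ({"Obama", "Senate"}, "Washington D.C."),
]


def count_meanings(training):
    """For every context word, count how often each meaning was observed with it."""
    counts = defaultdict(Counter)
    for context, meaning in training:
        for word in context:
            counts[word][meaning] += 1
    return counts


def most_likely_meaning(counts, context):
    """Sum the counts over the context words and pick the most frequent meaning."""
    totals = Counter()
    for word in context:
        totals.update(counts.get(word, {}))
    return totals.most_common(1)[0][0] if totals else None
```

With these toy counts, a document containing "Hollywood" favours the actor and a document containing "Obama" favours the capital; a context with no known words yields `None`.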
27. Strategy II. Approach: find entities in the document. For each entity, identify candidate instances that are compatible with the entity name. Assign a ranking value to each candidate instance: 0 ≤ r ≤ 1. Greater ranking values indicate greater likelihood of occurrence.
28. Strategy III. Semantic coherence (in terms of ranking): “An instance would have a high ranking value if the instances that typically co-occur with it also have high ranking values”.
29. Strategy IV. We can add a vector E that accounts for other context information. The resulting equation is similar to Google PageRank.
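The slides do not reproduce the equation itself, so the following is only a sketch under an assumed PageRank-style form, r = d·A·r + (1 − d)·E, where r holds the ranking values, A holds the pairwise support weights between instances, E is the context vector, and d is a damping factor. The function name, the parameter `d`, its default value, and the toy inputs are all assumptions made for illustration.

```python
def rank_instances(A, E, d=0.85, iterations=50):
    """Iterate r <- d * A r + (1 - d) * E (assumed form, not taken from the slides).

    A[i][j] is the assumed weight with which instance j supports instance i;
    E[i] is the context value for instance i.
    """
    n = len(E)
    r = list(E)
    for _ in range(iterations):
        r = [d * sum(A[i][j] * r[j] for j in range(n)) + (1 - d) * E[i]
             for i in range(n)]
    return r


# Toy input: instances 0 and 1 support each other, instance 2 is unsupported.
A = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
E = [1 / 3, 1 / 3, 1 / 3]
ranking = rank_instances(A, E)
```

In this toy run the two mutually supporting instances end up with a higher ranking value than the isolated one, which is the semantic-coherence behaviour the strategy describes.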
30. Instance finder & filter. Alternative instance names are extracted by processing a Wikipedia dump: page titles, redirects, disambiguation pages, anchors. The names are indexed by Lucene. Candidate instances are obtained by querying Lucene. Candidate instances are weighted by combining Lucene scores and PageRank values. Filtering limits the maximum number of candidates.
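A rough sketch of the finder and filter steps follows. It swaps Lucene for a plain in-memory dictionary and uses made-up match scores and PageRank values; how the two scores are actually combined is not stated in the slides, so the multiplication below is an assumption. `ALIAS_INDEX`, `PAGERANK`, and `find_candidates` are hypothetical names.

```python
# Toy alias index standing in for the Lucene index (all values are made up):
# lower-cased surface name -> list of (candidate instance, text match score).
ALIAS_INDEX = {
    "washington": [
        ("Washington D.C.", 0.9),
        ("Denzel Washington", 0.8),
        ("Washington (state)", 0.7),
    ],
}

# Made-up values standing in for PageRank.
PAGERANK = {
    "Washington D.C.": 0.6,
    "Denzel Washington": 0.2,
    "Washington (state)": 0.5,
}


def find_candidates(name, max_candidates=2):
    """Look up candidates, weight them, and keep at most max_candidates of them."""
    hits = ALIAS_INDEX.get(name.lower(), [])
    # Assumption: the two scores are combined by multiplication.
    weighted = [(inst, score * PAGERANK.get(inst, 0.0)) for inst, score in hits]
    weighted.sort(key=lambda pair: pair[1], reverse=True)
    return weighted[:max_candidates]
```

With `max_candidates=2`, a query for "Washington" keeps the two highest-weighted candidates and drops the third, which is the role of the filtering step.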
38. $A^C$ computation. $\alpha^C_{ij} \approx P(I_i \mid I_j)$. Based on counting the co-occurrence of $I_i$ and $I_j$ in Wikipedia pages. Example: P(Pau Gasol | Lakers) = (number of pages where both Pau Gasol and Los Angeles Lakers are mentioned) / (total number of pages where Los Angeles Lakers are mentioned).
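The estimate above is a ratio of two page counts, which can be sketched directly. Each page is reduced to the set of instances it mentions; the toy pages are made-up data, and the function name is hypothetical.

```python
def cooccurrence_probability(pages, instance_i, instance_j):
    """Estimate P(I_i | I_j) as (#pages mentioning both) / (#pages mentioning I_j)."""
    with_j = [mentions for mentions in pages if instance_j in mentions]
    if not with_j:
        return 0.0  # I_j never occurs, so there is no evidence to count
    both = sum(1 for mentions in with_j if instance_i in mentions)
    return both / len(with_j)


# Toy pages (made-up data): each set lists the instances mentioned on one page.
PAGES = [
    {"Pau Gasol", "Los Angeles Lakers"},
    {"Los Angeles Lakers", "Kobe Bryant"},
    {"Pau Gasol", "Memphis Grizzlies"},
    {"Los Angeles Lakers", "Pau Gasol", "Kobe Bryant"},
    {"Los Angeles Lakers"},
]

p = cooccurrence_probability(PAGES, "Pau Gasol", "Los Angeles Lakers")
```

Note that the estimate is not symmetric: on these toy pages P(Pau Gasol | Lakers) is 2/4, while P(Lakers | Pau Gasol) is 2/3.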
39. $A^L$ computation. Based on direct links. Example: the Wikipedia page of Pau Gasol links to the Wikipedia page of Los Angeles Lakers. Initial idea: if $I_j$ links many times to $I_i$ and $I_j$ is likely to occur (it has a high ranking), then $I_i$ is also likely to occur. The Lucene score is used to compute $\alpha^L_{ij}$.
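The following sketch covers only the initial idea (weights derived from how often page j links to page i), not the Lucene-based computation that the slide says is actually used. Normalising by the total number of links on page j is an assumption, and the link data and function name are hypothetical.

```python
from collections import Counter


def link_weights(outgoing_links):
    """Return weights[(i, j)] = share of page j's outgoing links that point to page i."""
    weights = {}
    for j, targets in outgoing_links.items():
        counts = Counter(targets)
        total = sum(counts.values())
        for i, n in counts.items():
            weights[(i, j)] = n / total
    return weights


# Toy link data (made up): page -> list of link targets, repeats allowed.
LINKS = {
    "Pau Gasol": ["Los Angeles Lakers", "Los Angeles Lakers", "Spain"],
    "Los Angeles Lakers": ["Pau Gasol"],
}

weights = link_weights(LINKS)
```

Here the weight from "Pau Gasol" to "Los Angeles Lakers" is 2/3, so a high ranking for Pau Gasol would lend most of its support to the Lakers instance.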