6. Not very similar to pages like this
2nd DBpedia Meeting Leipzig 03.09.2014
7. DBpedia Extraction Framework
✔ “Wiki agnostic”
✔ Pluggable extractors
✔ Out-of-the-box support for common metadata
✗ Tuned for extraction in the main namespace (not File:)
✗ Many other challenges left
8. Challenges
✔ File metadata
✔ KML files
✔ Image Galleries
✔ Image Annotations
✔ Mappings Wiki
✔ Bootstrap community mappings
✔ Template Statistics
✔ Licensing
✔ Technical details I'll not go into
9. Out-of-the-box support
● Categories (skos)
● External links
● Geo-coordinates
● Raw infobox properties
● Labels
● PageIds / Revisions
● Links (internal / external)
● Mappings Wiki (with some tweaking / more on that later)
10. File metadata
● New Extractor
● New file class hierarchy
– dbo:File, dbo:Image, dbo:StillImage, dbo:MovingImage and dbo:Sound
Sample Output:
:Aeropetes.JPG a dbo:StillImage, dbo:Image, dbo:Document, dbo:File, dbo:Work ;
    dcterms:type dbo:StillImage ;
    dbo:fileExtension "jpg" ;
    dcterms:format "image/jpeg" ;
    dbo:fileURL commons-path:Aeropetes.JPG ;
    foaf:depiction commons-path:Aeropetes.JPG ;
    dbo:thumbnail commons-path:Aeropetes.JPG?width=300 .
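The slides do not show how the new extractor picks a class and MIME type, so the following is only a minimal sketch of the idea: derive both from the file extension. The `FILE_TYPES` table and the `file_metadata` function are hypothetical names for illustration, not part of the DBpedia Extraction Framework.

```python
# Hypothetical lookup table (an assumption, not the framework's own):
# lower-cased extension -> (MIME type, most specific dbo class).
FILE_TYPES = {
    "jpg": ("image/jpeg", "dbo:StillImage"),
    "png": ("image/png", "dbo:StillImage"),
    "svg": ("image/svg+xml", "dbo:StillImage"),
    "webm": ("video/webm", "dbo:MovingImage"),
    "ogg": ("audio/ogg", "dbo:Sound"),
}


def file_metadata(title):
    """Return (predicate, object) pairs for a Commons file title."""
    if "." not in title:
        # No extension: fall back to the root of the class hierarchy.
        return [("rdf:type", "dbo:File")]
    ext = title.rsplit(".", 1)[1].lower()
    if ext not in FILE_TYPES:
        # Unknown extension: keep the extension, but only claim dbo:File.
        return [("rdf:type", "dbo:File"), ("dbo:fileExtension", f'"{ext}"')]
    mime, cls = FILE_TYPES[ext]
    return [
        ("rdf:type", cls),
        ("dcterms:type", cls),
        ("dbo:fileExtension", f'"{ext}"'),
        ("dcterms:format", f'"{mime}"'),
    ]


for predicate, obj in file_metadata("Aeropetes.JPG"):
    print(":Aeropetes.JPG", predicate, obj)
```

Lower-casing the extension is what lets `Aeropetes.JPG` end up with the `"jpg"` value seen in the sample output.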
11. Image Galleries
● Attach each gallery item to the page resource
:Colorado dbo:hasGalleryItem
    :Colorado.JPG,
    :Denver_Colorado_Art.jpg,
    :ColoradoCenter1.jpg .
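As a rough illustration of attaching gallery items to the page resource, the sketch below pulls file names out of `<gallery>` blocks in wikitext and emits one `dbo:hasGalleryItem` triple per item. The function names and the regular expression are assumptions made for this example; the real extractor works on the framework's parsed page tree, which the slides do not show.

```python
import re

# Matches the body of every <gallery ...>...</gallery> block in a page.
GALLERY_RE = re.compile(r"<gallery[^>]*>(.*?)</gallery>", re.DOTALL | re.IGNORECASE)


def gallery_items(wikitext):
    """Return the file names listed in all gallery blocks, in page order."""
    items = []
    for body in GALLERY_RE.findall(wikitext):
        for line in body.splitlines():
            # Each gallery line is "File name|optional caption".
            name = line.split("|", 1)[0].strip()
            if not name:
                continue
            # Drop an optional "File:" or "Image:" namespace prefix.
            prefix, sep, rest = name.partition(":")
            if sep and prefix.strip().lower() in ("file", "image"):
                name = rest.strip()
            items.append(name.replace(" ", "_"))
    return items


def gallery_triples(page, wikitext):
    """Build (subject, predicate, object) triples for one page."""
    return [(f":{page}", "dbo:hasGalleryItem", f":{item}")
            for item in gallery_items(wikitext)]


sample = """<gallery>
File:Colorado.JPG|State view
Image:Denver Colorado Art.jpg
ColoradoCenter1.jpg
</gallery>"""

for triple in gallery_triples("Colorado", sample):
    print(*triple, ".")
```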
13. Image Annotations
● W3C Media Fragments recommendation
● Embed the annotation's bounding box in the URI
– ?width=15130&height=1886#xywh=pixel:10431,324,1670,1208> .
● Add descriptions in the new resource
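To make the URI scheme above concrete, here is a small sketch that builds an annotation URI with a `#xywh=pixel:x,y,w,h` fragment and reads the box back out. The `width`/`height` query parameters mirror the example on the slide; the base URI `http://example.org/Panorama.jpg` and both function names are placeholders invented for this illustration.

```python
from urllib.parse import urlsplit


def annotation_uri(file_uri, full_width, full_height, x, y, w, h):
    """Embed the annotated region in the URI as a spatial media fragment."""
    return (f"{file_uri}?width={full_width}&height={full_height}"
            f"#xywh=pixel:{x},{y},{w},{h}")


def parse_box(uri):
    """Return (unit, (x, y, w, h)) from an xywh media fragment."""
    fragment = urlsplit(uri).fragment
    key, _, value = fragment.partition("=")
    if key != "xywh":
        raise ValueError(f"not a spatial fragment: {fragment!r}")
    unit, sep, coords = value.partition(":")
    if not sep:
        # Media Fragments treats a missing unit as pixels.
        unit, coords = "pixel", value
    x, y, w, h = (int(part) for part in coords.split(","))
    return unit, (x, y, w, h)


# Placeholder base URI; the numbers are the ones shown on the slide.
uri = annotation_uri("http://example.org/Panorama.jpg",
                     15130, 1886, 10431, 324, 1670, 1208)
print(uri)
print(parse_box(uri))
```

Because the box lives in the fragment, each annotation gets its own resource URI that descriptions can be attached to, while the part before `#` still identifies the image.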
16. Licensing
● Automatically identified and imported ~360 license templates
● Use the mappings wiki
● Needed some hacking to make it work
– e.g. {{Self|GFDL|cc-by-sa-3.0,2.5,2.0,1.0}}
:Acraea_circeis.JPG dbo:license <http://creativecommons.org/publicdomain/mark/1.0/> .
:Antepipona_deflenda_-_2012-10-17.webm dbo:license <http://creativecommons.org/licenses/by-sa/3.0/> .
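The slides do not describe the "hacking" in detail. One plausible reading of `{{Self|GFDL|cc-by-sa-3.0,2.5,2.0,1.0}}` is that the comma-separated argument is shorthand for several versions of the same license. The sketch below expands it under that assumption and maps each name to a URI; the helper names, the regular expression, and the GFDL URI choice are illustrative, not taken from the project.

```python
import re

# Assumed target URI for GFDL; the slides do not show the one actually used.
GFDL_URI = "http://www.gnu.org/copyleft/fdl.html"
# Matches names such as "cc-by-sa-3.0" or "cc-by-2.5".
CC_NAME = re.compile(r"^cc-(by(?:-[a-z]{2})*)-(\d+\.\d+)$")


def expand_versions(arg):
    """'cc-by-sa-3.0,2.5' -> ['cc-by-sa-3.0', 'cc-by-sa-2.5']."""
    first, *others = [part.strip() for part in arg.split(",")]
    stem = first.rsplit("-", 1)[0]
    return [first] + [f"{stem}-{version}" for version in others]


def license_uri(name):
    """Map one license name to a URI, or None if it is not recognised."""
    lowered = name.lower()
    if lowered == "gfdl":
        return GFDL_URI
    match = CC_NAME.match(lowered)
    if match:
        return f"http://creativecommons.org/licenses/{match.group(1)}/{match.group(2)}/"
    return None


def self_template_licenses(template):
    """Collect license URIs from the positional arguments of a {{Self|...}} call."""
    inner = template.strip()[2:-2]
    _name, *args = inner.split("|")
    uris = []
    for arg in args:
        if "=" in arg:  # skip named parameters such as author=...
            continue
        for name in expand_versions(arg):
            uri = license_uri(name)
            if uri:
                uris.append(uri)
    return uris


for uri in self_template_licenses("{{Self|GFDL|cc-by-sa-3.0,2.5,2.0,1.0}}"):
    print(f":Example.jpg dbo:license <{uri}> .")
```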
17. KML Annotations attached to media
Attach raw KML data to resource with custom extractor
Sample Output:
:Yellowstone_1871b.jpg dbo:hasKMLData """
<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://earth.google.com/kml/2.2">
<GroundOverlay>
<name>Yorktown, Indiana (1878)</name>
<description>An 1878 map of Yorktown in Tippecanoe County, Indiana. Source: Kingman
Brothers' Combination Atlas Map of Tippecanoe County, Indiana, 1878.</description>
<color>99ffffff</color><Icon><href>BIG_LINK_HERE</href>
<viewBoundScale>0.75</viewBoundScale></Icon>
<LatLonBox>
<north>40.26126145890567</north><south>40.25777915632657</south>
<east>-86.77033439383223</east><west>-86.77398493316619</west>
<rotation>-1.123009884936565</rotation></LatLonBox>
</GroundOverlay></kml>"""^^rdfs:XMLLiteral .
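The extractor only attaches the raw KML, so any interpretation is left to the consumer. As a hedged example of what a consumer might do with the literal, the sketch below reads the `LatLonBox` of a ground overlay using Python's standard XML parser. The function name and the shortened sample string are assumptions for illustration; the namespace URI is the one from the sample output.

```python
import xml.etree.ElementTree as ET

# Namespace declared in the sample output above.
KML_NS = {"kml": "http://earth.google.com/kml/2.2"}

# Shortened copy of the sample overlay, kept only for this illustration.
SAMPLE_KML = (
    '<kml xmlns="http://earth.google.com/kml/2.2"><GroundOverlay>'
    "<name>Yorktown, Indiana (1878)</name>"
    "<LatLonBox>"
    "<north>40.26126145890567</north><south>40.25777915632657</south>"
    "<east>-86.77033439383223</east><west>-86.77398493316619</west>"
    "<rotation>-1.123009884936565</rotation>"
    "</LatLonBox></GroundOverlay></kml>"
)


def lat_lon_box(kml_text):
    """Return the first LatLonBox as a dict of floats, or None if absent."""
    # Parsing bytes lets the parser honour an XML encoding declaration.
    root = ET.fromstring(kml_text.encode("utf-8"))
    box = root.find(".//kml:LatLonBox", KML_NS)
    if box is None:
        return None
    # Strip the "{namespace}" part that ElementTree puts in front of tag names.
    return {child.tag.split("}", 1)[-1]: float(child.text) for child in box}


print(lat_lon_box(SAMPLE_KML))
```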
18. Remaining TODOs
● Nested templates are commonly used and cannot be handled by the mappings wiki at the moment
– e.g. media descriptions (although mapped) are missing
{{Information |Description= {{en|Logo of the [[w:en:DBpedia|DBpedia project]]}} {{fr|Logo du projet [[w:fr:DBpedia|DBpedia]]}}
● Annotation descriptions need some tweaking
– Need to render wikitext
● Publish it behind a SPARQL endpoint
● Provide Linked Data
– http://commons.dbpedia.org
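The nested-template item in the list above is easier to see with code. The sketch below is not how the mappings wiki works; it is a minimal, hypothetical brace-matching parser that shows what handling `{{Information |Description= {{en|...}} {{fr|...}} }}` would involve: splitting arguments only at the top level, then descending into the language templates inside `Description`. All function names are invented for this example.

```python
def split_top_level(body, sep="|"):
    """Split on sep, ignoring separators inside {{...}} or [[...]]."""
    parts, current, depth, i = [], [], 0, 0
    while i < len(body):
        pair = body[i:i + 2]
        if pair in ("{{", "[["):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("}}", "]]"):
            depth -= 1
            current.append(pair)
            i += 2
        elif body[i] == sep and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(body[i])
            i += 1
    parts.append("".join(current))
    return parts


def parse_template(text):
    """Return (name, [arguments]) for one complete {{...}} call."""
    text = text.strip()
    if not (text.startswith("{{") and text.endswith("}}")):
        raise ValueError("not a template call")
    name, *args = split_top_level(text[2:-2])
    return name.strip(), [arg.strip() for arg in args]


def top_level_templates(text):
    """Return every outermost {{...}} call found in text."""
    found, depth, start, i = [], 0, None, 0
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == "{{":
            if depth == 0:
                start = i
            depth += 1
            i += 2
        elif pair == "}}" and depth > 0:
            depth -= 1
            i += 2
            if depth == 0:
                found.append(text[start:i])
        else:
            i += 1
    return found


def descriptions(information):
    """Map language code -> raw wikitext description."""
    _name, args = parse_template(information)
    result = {}
    for arg in args:
        key, sep, value = arg.partition("=")
        if sep and key.strip().lower() == "description":
            for template in top_level_templates(value):
                lang, parts = parse_template(template)
                if parts:
                    result[lang] = parts[0]
    return result


sample = ("{{Information |Description= "
          "{{en|Logo of the [[w:en:DBpedia|DBpedia project]]}} "
          "{{fr|Logo du projet [[w:fr:DBpedia|DBpedia]]}} }}")
print(descriptions(sample))
```

The extracted values are still raw wikitext (note the `[[...]]` links), which is the same rendering problem the annotation-description item mentions.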
19. Thank You!
Special thanks to:
● Alexandru Todor (importing the License templates)
● Google Summer of Code for sponsoring this project (Gaurav Vaidya)
Questions?
Dataset: http://nl.dbpedia.org/downloads/commonswiki
Dataset samples: https://github.com/gaurav/commons-extraction