Study about OpenCalais API practical usage in
linked data context
Căciulă Maricel
„Faculty of Computer Science, A. I. Cuza University of Iasi”
Abstract. A presentation of OpenCalais. It gives a short description
of the web service API and presents some projects that use this
API. At the end, some personal ideas for using the API are presented.
Keywords: Web Service, API, resource management, linked data.
1 Introduction
OpenCalais is a project that makes your text more valuable. It identifies
named entities, facts and events in a text and returns the result in Resource
Description Framework (RDF) format. The project was initiated by Thomson Reuters,
and its initial aim was to eliminate the manual tagging step for publishers. Over
time, OpenCalais proved useful for improving the user search experience, and lately
it has been used to generate content hubs. OpenCalais is free to use and can be
accessed up to 40,000 times per day. It can be used in commercial and noncommercial
applications. The motivation behind the free usage is to improve Thomson Reuters'
natural language processing tools and to semantically link web content.
1.1 OpenCalais Web Service
OpenCalais can be accessed through a web service. It supports SOAP, REST and
HTTP traffic compression.
Access through SOAP uses the web method “Enlighten” at this
URL: http://api.opencalais.com/enlighten?wsdl
String Enlighten(String licenseID, String content, String paramsXML)
The parameters are described in the following table, as in the official Calais SOAP
documentation:
Field Name   Type     Definition                           Notes
licenseID    String   API access key                       Obtain through registration
content      String   Content to be annotated              Max input length is 100,000 characters
paramsXML    String   Processing and user directives and   Max parameters length is 16,000 characters
                      external metadata
Access through REST uses the following URL:
http://api.opencalais.com/enlighten/rest
licenseID=url-encoded-string&content=url-encoded-string&paramsXML=url-encoded-string
This argument line can be sent with GET, appended to the REST URL as a query string,
or with POST, placed in the request body.
There is a nice tutorial example on the official site:
http: //opencalais/files/HTMLform.zip
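The POST variant described above can be sketched in Python as follows. This is only a minimal illustration: the function name build_rest_request and the placeholder values are my own, and the request is built but not sent, so no license key is needed to try it.

```python
from urllib.parse import urlencode
from urllib.request import Request

REST_URL = "http://api.opencalais.com/enlighten/rest"  # URL quoted in the text


def build_rest_request(license_id: str, content: str, params_xml: str) -> Request:
    """Build (but do not send) a POST request for the REST endpoint."""
    # urlencode escapes every value, so the argument line has the
    # licenseID=...&content=...&paramsXML=... shape described in the text.
    body = urlencode({
        "licenseID": license_id,
        "content": content,
        "paramsXML": params_xml,
    }).encode("ascii")
    return Request(
        REST_URL,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


if __name__ == "__main__":
    # "YOUR-LICENSE-ID" and "<params/>" are placeholders, not real values.
    req = build_rest_request("YOUR-LICENSE-ID", "Thomson Reuters started OpenCalais.", "<params/>")
    print(req.get_method(), req.full_url)
    print(req.data.decode("ascii"))
    # To actually send it: urllib.request.urlopen(req).read()
```

Keeping the request construction separate from the network call makes it easy to inspect the encoded argument line before anything is sent.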
HTTP traffic compression is requested with gzip. The
client should add the “Accept-Encoding: gzip” header to the web request in order to
tell the server that the client can handle a gzip response.
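A minimal Python sketch of the client side of this exchange is given below. The helper names with_gzip and decode_body are hypothetical, and I assume that the server marks a compressed response with a “Content-Encoding: gzip” header, as is usual for HTTP.

```python
import gzip
from typing import Optional
from urllib.request import Request


def with_gzip(request: Request) -> Request:
    """Advertise that the client can handle a gzip-compressed response."""
    request.add_header("Accept-Encoding", "gzip")
    return request


def decode_body(raw: bytes, content_encoding: Optional[str]) -> str:
    """Decompress the body only when the response is marked as gzip."""
    if content_encoding and content_encoding.lower() == "gzip":
        raw = gzip.decompress(raw)
    return raw.decode("utf-8")
```

The value passed as content_encoding would come from the response headers, so an uncompressed answer is handled by the same code path.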
1.2 Open Calais API
The API input parameters are set in XML format. The parameters cover processing
directives, user directives and external metadata. The entire input XML (the
paramsXML argument) must be HTTP-encoded (escaped).
The following table, taken from the official OpenCalais API documentation, describes
the API input parameters:
Parameter               Section                 Definition                           Values                          Default
contentType             Processing Directives   Format of the input content          “TEXT/HTML”, “TEXT/XML”,        None
                                                                                     “TEXT/HTMLRAW”, “TEXT/RAW”
outputFormat            Processing Directives   Format of the returned results       “XML/RDF”, “Text/Simple”,       XML/RDF
                                                                                     “Text/Microformats”,
                                                                                     “Application/JSON”
reltagBaseURL           Processing Directives   Base URL to be put in rel-tag        <the base URL>, for example     None
                                                microformats                         “http://myblog.com/tag”
calculateRelativeScore  Processing Directives   Indicates whether the extracted      “true” or “false”               True
                                                metadata will include a relevance
                                                score for each unique entity
enableMetadataType      Processing Directives   Indicates whether the output will    “GenericRelations”,             None
                                                include Generic Relation             “SocialTags”,
                                                extraction (RDF) and/or SocialTags   “GenericRelations,SocialTags”
docRDFaccessible        Processing Directives   Indicates whether the entire         “true” or “false”               None
                                                XML/RDF document is saved in the
                                                Calais Linked Data repository
allowDistribution       User Directives         Indicates whether the extracted      “true” or “false”               False
                                                metadata can be distributed
allowSearch             User Directives         Indicates whether future searches    “true” or “false”               False
                                                can be performed on the extracted
                                                metadata
externalID              User Directives         User-generated ID for the            Any string                      None
                                                submission
Submitter               User Directives         Identifier for the content           Any string                      None
                                                submitter
The input content can be TEXT/HTML, TEXT/HTMLRAW, TEXT/XML or
TEXT/RAW. If the content type is not specified, Calais tries to detect the
type automatically.
OpenCalais uses English as its default language, but it also supports French and
Spanish. If the input text is shorter than 100 characters, the default language is
used.
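To make the parameter mechanism more concrete, here is a hedged Python sketch that assembles a paramsXML string and escapes it. The attribute names come from the table above, but the element names (params, processingDirectives, userDirectives) and the namespace URI are assumptions recalled from the public Calais documentation and should be checked there; the function names are my own.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

# ASSUMPTION: namespace and element names are not given in this paper;
# they are recalled from the Calais documentation and must be verified.
C_NS = "http://s.opencalais.com/1/pred/"
ET.register_namespace("c", C_NS)


def build_params_xml(processing: dict, user: dict) -> str:
    """Assemble processing and user directives into one XML string."""
    root = ET.Element(f"{{{C_NS}}}params")
    proc = ET.SubElement(root, f"{{{C_NS}}}processingDirectives")
    for name, value in processing.items():
        proc.set(f"{{{C_NS}}}{name}", value)
    usr = ET.SubElement(root, f"{{{C_NS}}}userDirectives")
    for name, value in user.items():
        usr.set(f"{{{C_NS}}}{name}", value)
    return ET.tostring(root, encoding="unicode")


def escape_params(params_xml: str) -> str:
    """Percent-encode the whole XML so it can travel as one argument."""
    return quote(params_xml, safe="")


if __name__ == "__main__":
    xml = build_params_xml(
        {"contentType": "TEXT/RAW", "outputFormat": "XML/RDF"},
        {"allowDistribution": "false", "allowSearch": "false"},
    )
    print(xml)
    print(escape_params(xml))
```

Building the XML with a library rather than by string concatenation avoids quoting mistakes in attribute values.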
The API can also be used over SSL, by accessing it through https.
1.3 Data structure
By default, OpenCalais returns the response in RDF format. The RDF header
includes a summary of all entities extracted from the text, sorted alphabetically
by entity type.
INFORMATION
For each unique element, the information includes the element type (for example
Company, Person or Acquisition), the attribute values and the ID of the unique
element.
If the Relevance feature is enabled, the resulting RDF also includes the
relevance score for each unique entity.
When an attribute value is referred to by its ID, the output includes a comment
containing the actual value, for easier understanding.
INSTANCES
According to the official documentation, the response contains one or more
individual instances (mentions) for each unique metadata element. Each element
instance includes the following:
c:docId: URI of the document this mention was detected in
c:subject: URI of the unique entity
c:detection: snippet of the input content where the metadata element was
identified
c:prefix: snippet of the input content that precedes the current instance
c:exact: snippet of the input content in the matched portion of text
c:offset: the character offset relative to the input content after it has been
converted into XML
c:length: length of the instance
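The instance fields listed above map naturally onto a small record type. The Python sketch below is only an illustration: the class name Instance, the method matches and the example.org URIs are hypothetical, and for simplicity I assume the offset can be applied directly to a plain-text input (the text states that it is relative to the content after conversion into XML).

```python
from dataclasses import dataclass


@dataclass
class Instance:
    doc_id: str     # c:docId, URI of the document
    subject: str    # c:subject, URI of the unique entity
    detection: str  # c:detection, snippet where the element was identified
    prefix: str     # c:prefix, snippet preceding the instance
    exact: str      # c:exact, the matched portion of text
    offset: int     # c:offset, character offset of the match
    length: int     # c:length, length of the instance

    def matches(self, content: str) -> bool:
        """Check that offset and length select `exact` in the content."""
        return content[self.offset:self.offset + self.length] == self.exact


if __name__ == "__main__":
    text = "Thomson Reuters started OpenCalais."
    mention = Instance(
        doc_id="http://example.org/doc/1",       # placeholder URI
        subject="http://example.org/entity/1",   # placeholder URI
        detection=text,
        prefix="",
        exact="Thomson Reuters",
        offset=0,
        length=15,
    )
    print(mention.matches(text))
```

A check like matches is useful when highlighting mentions in the original text, because it detects offsets that no longer line up with the content.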
1.4 OpenCalais and linked data
With the last significant update of OpenCalais, version 4.0, users are now
able to connect to the Linked Data web standard.
Linked Data is a method of exposing, sharing and connecting data through
dereferenceable URIs on the web.
To be compatible, OpenCalais respects the four principles of Linked Data:
- It uses URIs to identify things.
- It uses HTTP URIs so that these things can be referred to and looked up by
people and user agents.
- It provides useful information (a structured description, i.e. metadata) about
the thing when its URI is dereferenced.
- It includes links to other URIs in the exposed data to improve discovery of
other related information on the Web.
The image below shows the most recent linkage among the Linking
Open Data datasets; OpenCalais is one of them.
The Calais ecosystem is exposed via Linked Data endpoints, and when it extracts an
entity from a given text it also returns an entity URI. This URI is dereferenceable:
you can submit an HTTP request programmatically or through a browser and receive
in response useful information and links to other Linked Data and web assets.
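Dereferencing an entity URI programmatically can be sketched in Python as below. Two things here are assumptions and not statements from the documentation: that the endpoint honours an “Accept” header asking for RDF/XML, and the example.org URI, which only stands in for a URI returned by OpenCalais. The request is built but not sent.

```python
from urllib.request import Request


def build_dereference_request(entity_uri: str,
                              accept: str = "application/rdf+xml") -> Request:
    """Build a GET request that asks for a machine-readable description."""
    # ASSUMPTION: the server selects the representation from the Accept header.
    return Request(entity_uri, headers={"Accept": accept})


if __name__ == "__main__":
    # Placeholder URI; a real one comes from the OpenCalais response.
    req = build_dereference_request("http://example.org/entity/123")
    print(req.get_method(), req.full_url, req.get_header("Accept"))
    # To actually fetch it: urllib.request.urlopen(req).read()
```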
According to the official site, OpenCalais is currently linked to the
following assets:
- DBpedia
- Wikipedia
- Freebase
- Reuters.com
- GeoNames
- Shopping.com
- IMDB
- LinkedMDB
2 Practical usage
OpenCalais is used primarily for tagging blogs and WordPress articles. As its founder
says, the team noticed that the OpenCalais project is also used for other purposes,
such as creating content hubs.
OpenCalais can be used for:
Triage – Filter a large influx of content.
Workflow – Use the metadata returned from OpenCalais to route documents to the right
person/system.
Content Enhancement – OpenCalais can be the entry point to the huge world of
linked data.
Alerting – Allow advanced alerting, giving users the ability to interact more
naturally with the application.
Media Monitoring – A content feed (social media, press releases, news) can
be categorized and organized using OpenCalais.
Content Harmonization – Mix different sources of information that can be
integrated in a CMS (Content Management System).
Automated News Portal – Publish relevant information taken from different sources
after it is filtered using OpenCalais.
SEO – Improve search.
News Presentation – With consistent metadata extraction it is possible to create new
navigation and search tools on your site.
2.1 Blog tagging
As expected, one of the first implementations based on OpenCalais was designed
for bloggers. Tagaroo is a tool initiated by the OpenCalais team itself; it is a
plugin for WordPress. This tool improves your blog by enhancing
both the user experience and searchability. It analyzes your text as you are
writing and suggests intelligent tags for the things and events you are writing about.
A nice feature of this tool is that it uses the generated tags to automatically fetch
images from Flickr to include in your post.
Link : http://tagaroo.opencalais.com
Another site that uses OpenCalais for blog tagging is Al Jazeera English's new
blogging network. All posts in the new blog are semantically tagged using
OpenCalais for optimal search and navigation.
Link: http://blogs.aljazeera.net
I *heart* Sea is a hyperlocal news aggregation site that collects some of the best blogs
in Seattle. It uses OpenCalais to automatically tag the keywords of the blog posts it
aggregates, to make it easier to find related information.
Link: http://iheartsea.com
2.2 Press tagging
The new website of “The New Republic” uses an OpenCalais-enabled, Drupal-powered
Content Management System to increase editorial productivity and improve
search engine optimization.
Link: http://www.tnr.com
Slate Magazine's “News Dots” network visualizes the most recent topics in the
news as a concise network of related topics. Like a human social network, the news
tends to cluster around popular topics, and most stories are more closely related than
one might think. In the background, News Dots scans all the articles from major
publications and submits them to OpenCalais to identify the relevant people, places,
companies, topics, etc.
Link: http://slatest.slate.com/features/news_dots/default.htm
2.3 Media monitoring
Tattler is an open source topic monitoring tool for the Web. Tattler finds and
aggregates content from the web on topics users ask it to monitor. In the background it
uses OpenCalais together with other Semantic Web technologies. It mines news,
websites, blogs, multimedia sites and other social media like Twitter to find
mentions of the issues most relevant to the user's selected topics, making it easy for
the user to filter, organize, share and take action on content gathered from the Web.
Link : http://tattlerapp.com
Interceder is a social media monitoring tool that makes it easy to track trending topics
and search through the latest content from major news websites, blogs, Twitter and
YouTube.
Link: http://www.interceder.net
AskJot is a tool for analyzing web pages for keywords and displaying them as links to
search results from various services around the web. Behind the scenes, AskJot uses
OpenCalais, the NYT article search API, DBpedia, the Yahoo! Answers API, the Flickr
API and others.
2.4 Intelligent Content
Feedly is a Firefox plugin that brings to life user-selected inputs from Google
Reader, FriendFeed, Twitter, RSS feeds and others in an easy-to-read and engaging
magazine-style format. It uses OpenCalais and other semantic technologies for
clustering, linking and organizing the content in an intuitive fashion that is
nicely integrated into the browsing experience.
Link : http://www.feedly.com
OpenPublish is based on the Drupal platform. It is a next-generation CMS that
has been tailored to the needs of today's online publishers (magazines, newspapers,
journals, trade publications, broadcast and wire services). It uses metatagging from
OpenCalais to streamline content operations, automatically create topic hubs, and
recommend related articles and archived stories from the same authors.
Link: http://www.opensourceopenminds.com/openpublish
DocumentCloud was founded by The New York Times and ProPublica. DocumentCloud
is a unique online resource that offers public access to news reporters' original source
materials, including documents, media files and more. OpenCalais processes
materials available through DocumentCloud to make it easy for users to explore
connections between newsmakers, corporations, transactions and even quotations
across documents and across the full collection of sources.
Link : http://www.documentcloud.org
3 Personal ideas for OpenCalais API usage
When I tested the OpenCalais API with Document Viewer, I noticed that on short texts
the relevance is not accurate. Text taken from Twitter, for example, shows this:
OpenCalais adds irrelevant topics and social tags.
When I tested OpenCalais on a large text, it took several hours to process, which is
not acceptable. This means it is not reliable for books and other long texts.
Finally, I arrived at the conclusion that the optimal text is an article or
blog post that has more than 100 words and is shorter than 2 pages.
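This rule of thumb can be written as a small pre-check that runs before a text is submitted. The sketch below is illustrative only: the function name is_good_candidate is my own, and the figure of 500 words per page is an assumption, since the conclusion above speaks of pages and not of words. The 100,000 character limit is the one given in the SOAP parameter table.

```python
WORDS_PER_PAGE = 500  # ASSUMPTION: rough page size, not a figure from my tests
MAX_CHARS = 100_000   # maximum input length from the SOAP parameter table


def is_good_candidate(text: str, min_words: int = 100, max_pages: int = 2) -> bool:
    """Return True for texts longer than min_words and shorter than max_pages."""
    words = len(text.split())
    within_limit = len(text) <= MAX_CHARS
    return within_limit and min_words < words < max_pages * WORDS_PER_PAGE
```

Tweets fail the lower bound and books fail the upper one, which matches the two problems observed in the tests.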
3.1 Blog tagging and filter
Manual tagging of blogs is not always the best way to describe a post. I have seen
blogs that are not completely described, omitting some key words that could be
essential for finding what readers are interested in. Also, because people see things
in a personal way, they can tag the same post with different key words.
Essentially, the idea is to tag blogs using the semantic web (OpenCalais),
covering as many blogs as possible, added manually or through a crawler that
recursively adds new blogs (using the contacts of friends or of persons who added a
comment).
The easiest way to watch blogs is to use their RSS feeds. In this way we can gather
blogs from different sites in a standard way.
By creating a new service that gathers posts from blogs and tags the text using
OpenCalais, we can create a database of feeds. With such a database we can build a
site/application that enables a user to create a custom generated RSS feed
from the entire database.
This way a user can see the posts he is interested in from thousands of blogs. The
generated RSS feed can be consumed by the already existing applications for
RSS.
The original part of this idea is that you could see posts from thousands of common
blogs and filter them by semantic tags.
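The core of this idea, filtering tagged posts and publishing the result as a custom RSS feed, can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the names TaggedPost, filter_posts and to_rss are hypothetical, the tags are assumed to have been obtained from OpenCalais beforehand, and an in-memory list stands in for the database of feeds.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class TaggedPost:
    title: str
    link: str
    tags: set = field(default_factory=set)  # semantic tags for the post


def filter_posts(posts, wanted_tags):
    """Keep posts that carry at least one of the wanted tags (case-insensitive)."""
    wanted = {t.lower() for t in wanted_tags}
    return [p for p in posts if wanted & {t.lower() for t in p.tags}]


def to_rss(posts, feed_title, feed_link):
    """Serialize the selected posts as a simple RSS 2.0 document."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = "Posts filtered by semantic tags"
    for post in posts:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = post.title
        ET.SubElement(item, "link").text = post.link
        for tag in sorted(post.tags):
            ET.SubElement(item, "category").text = tag
    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    posts = [
        TaggedPost("Post A", "http://example.org/a", {"Iasi", "Romania"}),
        TaggedPost("Post B", "http://example.org/b", {"Seattle"}),
    ]
    print(to_rss(filter_posts(posts, ["romania"]), "My feed", "http://example.org/feed"))
```

Because the output is plain RSS, any existing feed reader can consume the generated feed without knowing about the semantic tagging behind it.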
3.2 Language abstraction
Another interesting idea is to abstract the language. Right now the languages
supported by OpenCalais are English, French and Spanish.
It would be interesting to semantically tag texts in Romanian or other languages.
I was thinking of integrating the Google Translate service with
OpenCalais.
The idea is to first translate the text from blogs or news and then to use OpenCalais
to tag it semantically.
This might not be fully reliable, as the translation may not be accurate, but for texts
longer than 100 words it will probably produce the most relevant tags correctly,
because the translation will render the key words well.
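The translate-then-tag pipeline can be sketched in Python with the two services passed in as plain functions. This is only an outline of the idea: tag_foreign_text is a hypothetical name, and the translate and tag callables are stand-ins that would have to wrap the real translation service and OpenCalais.

```python
from typing import Callable, List, Tuple


def tag_foreign_text(
    text: str,
    translate: Callable[[str], str],
    tag: Callable[[str], List[str]],
    min_words: int = 100,
) -> Tuple[List[str], bool]:
    """Translate the text, tag the translation, and flag short inputs."""
    english = translate(text)
    tags = tag(english)
    # Texts longer than min_words are expected to yield more reliable tags.
    likely_reliable = len(english.split()) > min_words
    return tags, likely_reliable


if __name__ == "__main__":
    # Stand-ins for the real services, for demonstration only.
    fake_translate = lambda s: s.replace("Universitatea", "University")
    fake_tag = lambda s: [w for w in s.split() if w.istitle()]
    print(tag_foreign_text("Universitatea din Iasi", fake_translate, fake_tag))
```

Passing the services as parameters keeps the pipeline independent of any particular translation API and makes it easy to test with stubs.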
4 References
1. http://opencalais.com
2. http://en.wikipedia.com/wiki/Linked_Data
3. http://facebook/note.php?none_id=160609314491
4. http://readwriteweb.com/archives/calais_4/linked_data.php
5. http://tagaroo.opencalais.com
6. http://viewer.opencalais.com
7. http://vator.tv/news/shows/2009-06-19-opencalais-makes-content-discoverable
8. http://video.google.com/videoplay?docid=1419547095322807081&ei=mUgRS7yIO5CW2wKs7ImKAg&q=opencalais#