Qedia – Natural Language Queries on DBPedia
Andreea-Georgiana Zbranca, Diana Andreea Gorea, Lucian Bentea
Faculty of Computer Science, “A.I. Cuza” University, Iași, Romania
Abstract. In this paper we present an application that allows users to
query DBPedia through natural language, which is more intuitive than
plain SPARQL.
1 Introduction
We present an application that is able to translate natural language phrases,
which conform to a certain basic grammar, into SPARQL queries that are then
run on the DBPedia knowledge base. For tagging the parts of speech of the
phrase, we have used a lexical analyzer implemented by Ian Barber and available
at http://phpir.com/part-of-speech-tagging. The syntactical analysis
is performed with respect to a basic grammar that we describe in the following
section. The resulting parse tree can also be interpreted as the RDF graph
corresponding to the given phrase. Furthermore, there are three types of phrases
that we allow to be used as a natural language query, which we also describe
below – one which is missing the subject, one which is missing the object and
one which is missing both. Based on these categories of phrases, we are able to
automatically generate the corresponding SPARQL queries which we run on the
DBPedia end-point. An additional SPARQL query collects statistics about the
nouns in the phrase (Section 2.3).
To increase flexibility, the graphical interface has been implemented in two
versions – a Web page version using the Zend framework and requiring Apache
or a similar local server to be running, and a Desktop version using the
PHP-GTK 2 library. To run the queries from within PHP we use the ARC library,
which is freely available for download from http://arc.semsol.org/. The results
returned by each query are displayed in both tabular and text form, along with
other statistics. The main RDF vocabularies used by DBPedia are automatically
included with each SPARQL query.
2 Parsing a Phrase
2.1 Algorithm
The query is given in natural language. The sentence is transformed into RDF
triples of the form Subject-Predicate-Object: once the parts of the sentence are
identified, the natural language query can be translated into a SPARQL query.
As input we get a phrase, and as output we obtain three arrays: nouns (which
also holds adjectives and adverbs), parents (giving, for each entry of nouns,
its parent in the parse tree) and verbs (the verbs that connect each entry to
its parent). The first step is to tag the parts of speech of the phrase; the
second is to identify the parts of the sentence and build the three arrays.
To build this parser we started from a tagger already implemented by Ian
Barber. This system uses a corpus whose words are hand-tagged with their part
of speech. Some examples of tags are: NN for noun, VB for verb, VBD for verb in
the past tense, JJ for adjective. We adapted his code to drop some words that
are unnecessary in the following steps, for example the word the, which is a
determiner of a noun. The output of this algorithm is the phrase tagged with
its parts of speech, e.g.
Input: The quick brown fox jumped over the lazy dog.
Output: The/DT quick/JJ brown/JJ fox/NN
jumped/VBD over/IN the/DT lazy/JJ dog/NN.
According to its description, the tagger was trained by analysing a corpus and
noting the frequencies of the different tags for a given word. More information,
together with the algorithm that we used for this step, can be found at
http://phpir.com/part-of-speech-tagging.
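The frequency-based idea described above can be illustrated with a minimal
sketch, written here in Python rather than in the PHP of the application. It is
not Ian Barber's code: it assumes a tiny hand-tagged corpus (TAGGED_CORPUS,
invented for illustration), and the names train and tag_phrase as well as the
fallback tag NN for unknown words are assumptions.

```python
from collections import Counter, defaultdict

# Toy hand-tagged corpus (hypothetical); a real tagger needs a large one.
TAGGED_CORPUS = [
    ("the", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
    ("jumped", "VBD"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
    ("dog", "NN"), ("brown", "NN"), ("brown", "JJ"),
]


def train(corpus):
    """Count how often each tag occurs for each (lower-cased) word."""
    counts = defaultdict(Counter)
    for word, tag in corpus:
        counts[word.lower()][tag] += 1
    return counts


def tag_phrase(phrase, counts, default="NN"):
    """Give each word its most frequent tag; unknown words get `default`."""
    tagged = []
    for word in phrase.rstrip(".?!").split():
        tags = counts.get(word.lower())
        best = tags.most_common(1)[0][0] if tags else default
        tagged.append(f"{word}/{best}")
    return " ".join(tagged)


counts = train(TAGGED_CORPUS)
print(tag_phrase("The quick brown fox jumped over the lazy dog.", counts))
```

In this toy corpus the word brown is seen twice as JJ and once as NN, so the
sketch tags it as an adjective.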
In the next step we take as input the phrase tagged by Ian Barber's algorithm
and we produce the three arrays described above. To parse the phrase we use a
simple grammar and build the parse tree of the phrase. All valid phrases must
conform to the following basic grammar:
Prop = Beg S P C
Beg = What | What does | What do
S = noun | S P.atr
C = noun | adjective | adverb | C P.atr
P.atr = that P C
P = verb
where the terminals are What, What does, What do, that, noun, adjective, adverb
and verb, and everything else is a non-terminal. An example of a phrase that
conforms to this grammar is the following:
What animal that has the color that is gray eats leaves
that belong to the species that is Eucalyptus?
The parse tree that we aim to generate is basically the RDF graph of this phrase
and is depicted in Figure 1.
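Because every P.atr expands to "that verb C", the part of the grammar after
Beg flattens to a regular pattern, which the following Python sketch checks.
It is an illustration only, not the authors' parser: it assumes that the
question word, determiners and prepositions have already been stripped, and
the one-letter encoding of token kinds and the name conforms are hypothetical.

```python
import re

# One letter per token kind: N = noun, A = adjective or adverb,
# V = verb, T = the word "that".
# S P C flattens to: noun (that verb C)* verb C (that verb C)*
PROP = re.compile(r"N(?:TV[NA])*V[NA](?:TV[NA])*")


def conforms(kinds):
    """Return True if the sequence of token kinds matches S P C."""
    return PROP.fullmatch("".join(kinds)) is not None


# animal that has color that is gray eats leaves
# that belong species that is Eucalyptus
example = list("NTVNTVAVNTVNTVN")
print(conforms(example))
```

The phrases of Section 2.2 that omit the subject or the object fall outside
this strict pattern and would need to be handled separately.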
We take the tagged phrase and remove all line breaks from it. We then build an
array of pairs of the form (word, tag). After that we check each tag: if it
denotes a noun, adjective or adverb, the word is added to our first array, which
therefore contains only nouns, adjectives and adverbs. In the same way we obtain
the array of verbs.
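The two steps above can be sketched as follows, in Python for illustration.
The function names split_tagged and nouns_and_verbs are hypothetical, and the
sketch assumes Penn-style tag prefixes (NN, JJ, RB, VB), of which only NN, JJ,
VB and VBD are named in the text.

```python
NOUN_LIKE = ("NN", "JJ", "RB")  # nouns, adjectives, adverbs (assumed prefixes)


def split_tagged(tagged):
    """Turn 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in tagged.split():  # split() also discards line breaks
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag.rstrip(".?!")))
    return pairs


def nouns_and_verbs(pairs):
    """Separate noun-like words from verbs, keeping their original order."""
    nouns = [word for word, tag in pairs if tag.startswith(NOUN_LIKE)]
    verbs = [word for word, tag in pairs if tag.startswith("VB")]
    return nouns, verbs


tagged = "The/DT quick/JJ brown/JJ fox/NN\njumped/VBD over/IN the/DT lazy/JJ dog/NN."
print(nouns_and_verbs(split_tagged(tagged)))
```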
[Figure 1: tree rooted at animal, with edges animal -has-> color -is-> gray and
animal -eats-> leaves -belong-> species -is-> Eucalyptus]
Fig. 1. RDF graph (parse tree) for the phrase: What animal that has the color that is
gray eats leaves that belong to the species that is Eucalyptus?
To build the parents array we go through the elements one by one and check
whether they are root nodes. When we find the root we search for the predicate
and split the phrase into two subtrees. According to our grammar, the predicate
lies between the root and the other subtree. If the phrase does not have a
subject, we put the symbol * in the first position of the array. If the phrase
does not have an object, we put the symbol # in the last position. In each
subtree we check, step by step, whether a noun is followed by the word that and
a verb; if so, the child of this noun is the first noun after that verb. The
parent of the root is 0. When we build the verbs array, we look up the verb that
stands between each child and its parent and put it into the array. In the
first position we put 0, because it corresponds to the root.
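One way to realise this procedure is a single left-to-right pass, sketched
below in Python. This is an illustrative reading of the description, not the
authors' implementation: the function name build_arrays and the token kinds
"noun", "verb" and "that" are assumptions, and "noun" stands for any noun,
adjective or adverb.

```python
def build_arrays(tokens):
    """Build the nouns, parents and verbs arrays from (word, kind) tokens."""
    nouns, parents, verbs = [], [], []

    def add(noun, parent, verb):
        nouns.append(noun)
        parents.append(parent)
        verbs.append(verb)

    root = last = None      # root node and most recently seen noun
    verb = parent = None    # pending verb and the noun it hangs from
    after_that = False
    for word, kind in tokens:
        if kind == "that":
            after_that = True
        elif kind == "verb":
            if root is None:            # no subject: * becomes the root
                root = last = "*"
                add("*", 0, 0)
            # "that verb" attaches to the previous noun, a bare verb to the root
            parent = last if after_that else root
            verb, after_that = word, False
        elif root is None:              # first noun is the root
            root = last = word
            add(word, 0, 0)
        else:
            if verb is None:
                raise ValueError(f"no verb links {word!r} to a parent")
            add(word, parent, verb)
            last, verb = word, None
    if verb is not None:                # trailing verb: object is missing
        add("#", parent, verb)
    return nouns, parents, verbs


tokens = [("has", "verb"), ("name", "noun"), ("that", "that"),
          ("is", "verb"), ("animal", "noun")]
print(build_arrays(tokens))
```

Parents are stored by name, as in the examples of the paper, so a phrase that
repeats the same noun would need indices instead.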
2.2 Accepted Types of Phrases
To test our application we used three types of phrases that can be translated
into SPARQL queries:
1. “What [property] has [subject]?”
translated into:
SELECT ?[property] WHERE {
:[subject] dbpedia2:[property] ?[property]
}
For example, the phrase “What abstract has Guitar?” generates the following
parse arrays:
nouns-array: abstract guitar
parents: 0 abstract
verbs: 0 has
and is translated into the SPARQL query:
SELECT ?abstract WHERE {
:Guitar dbpedia2:abstract ?abstract
}
2. “What has [property] that is [object]?”
translated into:
SELECT ?subject WHERE {
?subject dbpedia2:[property] "[object]"@en
}
For example, the phrase “What has name that is animal?” generates the
following parse arrays:
nouns-array: * name animal
parents: 0 * name
verbs: 0 has is
and is translated into the SPARQL query:
SELECT ?subject WHERE {
?subject dbpedia2:name "Animal"@en
}
3. “What has [property]?”
translated into:
SELECT ?subject ?object WHERE {
?subject dbpedia2:[property] ?object
}
For example, the phrase “What has regnum?” generates the following parse
arrays:
nouns-array: * regnum
parents: 0 *
verbs: 0 has
and is translated into the SPARQL query:
SELECT ?subject ?object WHERE {
?subject dbpedia2:regnum ?object
}
In this case, where both the subject and the object are missing, it is
advisable to limit the number of results returned by DBPedia, using the
LIMIT keyword, as in:
SELECT ?subject ?object WHERE {
?subject dbpedia2:regnum ?object
}
LIMIT 20
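The mapping from the nouns array to the three query shapes can be sketched as
follows, in Python for illustration. The function names to_sparql and cap are
hypothetical, and capitalising the resource and literal (guitar to Guitar) is
an assumption read off the examples above. The prefixes are assumed to be
declared elsewhere, and the object literal is not escaped.

```python
def cap(word):
    """Upper-case the first letter, as in the examples (assumption)."""
    return word[:1].upper() + word[1:]


def to_sparql(nouns, limit=20):
    """Translate a nouns array into one of the three accepted query types."""
    if nouns[0] != "*" and len(nouns) == 2:      # type 1: object missing
        prop, subject = nouns
        return (f"SELECT ?{prop} WHERE "
                f"{{ :{cap(subject)} dbpedia2:{prop} ?{prop} }}")
    if nouns[0] == "*" and len(nouns) == 3:      # type 2: subject missing
        _, prop, obj = nouns
        return (f"SELECT ?subject WHERE "
                f'{{ ?subject dbpedia2:{prop} "{cap(obj)}"@en }}')
    if nouns[0] == "*" and len(nouns) == 2:      # type 3: both missing
        prop = nouns[1]
        return (f"SELECT ?subject ?object WHERE "
                f"{{ ?subject dbpedia2:{prop} ?object }} LIMIT {limit}")
    raise ValueError("unsupported phrase type")


print(to_sparql(["*", "regnum"]))
```

Only the third type receives a LIMIT clause here, since it is the one that
leaves both ends of the triple unbound.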
2.3 Statistics
In order to obtain statistics, we go through the list of all nouns in the given
phrase and for each noun X we query the number of languages in which its
corresponding abstract data is translated, using:
SELECT (COUNT(DISTINCT ?abstract) AS ?languages)
WHERE {
:X dbpedia2:abstract ?abstract
}
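The loop over the nouns can be sketched as below, in Python for illustration.
The function name language_count_queries is hypothetical. The sketch uses the
standard SPARQL 1.1 aggregate form and skips the placeholders * and #, which do
not name a resource; capitalising the resource name is an assumption.

```python
def language_count_queries(nouns):
    """Return one abstract-counting query per real noun in the phrase."""
    queries = {}
    for noun in nouns:
        if noun in ("*", "#"):      # placeholders for a missing subject/object
            continue
        resource = noun[:1].upper() + noun[1:]
        queries[noun] = (
            "SELECT (COUNT(DISTINCT ?abstract) AS ?languages) "
            f"WHERE {{ :{resource} dbpedia2:abstract ?abstract }}"
        )
    return queries


for noun, query in language_count_queries(["*", "name", "animal"]).items():
    print(noun, "->", query)
```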
2.4 ARC Queries
The following example shows how SPARQL queries can be made from within
PHP using the ARC library, which we also use in our application.
include_once('./arc/ARC2.php');
$ssp = ARC2::getSPARQLScriptProcessor();
// define the script (single quotes keep PHP from interpolating $results)
$scr = '
ENDPOINT <http://dbpedia.org/sparql>
PREFIX dbpedia2: <http://dbpedia.org/property/>
PREFIX dbpedia: <http://dbpedia.org/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
$results = SELECT * WHERE {
?episode skos:subject
<http://dbpedia.org/resource/Category:The_Simpsons_episodes%2C_season_12>.
?episode dbpedia2:blackboard ?chalkboard_gag.
}
';
// run the script
$ssp->processScript($scr);
// display the results
echo "\n\nQuery results:\n\n";
print_r($ssp->env['vars']['results']['value']);
3 Conclusions and Future Developments
We presented a preliminary version of an application that allows users to query
DBPedia using basic natural language phrases. There are several features that
can be improved or new features that can be added. For instance, the basic
grammar that we use to create the parse tree can be made more complex. Also,
the three types of phrases that we allow as natural language queries can be made
more complex and closer to everyday speech – they sound rather artificial
at the moment.
Another feature that can be added is the ability to query several end-points,
not just DBPedia. The main problem is that each end-point may come with its
own set of vocabularies, apart from the well-known skos, foaf, rdfs, etc. Thus,
further knowledge of each end-point is necessary before implementing natural
language queries that can be run on it.
Finally, a larger lexicon can be used to improve the lexical analysis step, and
the graphical interface can be made more user-friendly as the previously
mentioned features are implemented.
References
1. The ARC open-source RDF system at http://arc.semsol.org.
2. Ian Barber’s part of speech lexical analyzer, freely available at
http://phpir.com/part-of-speech-tagging.
3. The DBPedia Wiki at http://dbpedia.org/About.
4. The SPARQL online query interface on DBPedia, at http://dbpedia.org/snorql.