The document discusses geographic information retrieval (GIR). It introduces GIR as a specialized branch of information retrieval that deals with georeferenced information. It describes some general problems in GIR, such as ambiguity of place names and fuzzy geographic boundaries. It also discusses how cognitive models of human understanding of geography can impact GIR. The document then covers techniques for geo-referencing documents using gazetteers and ontologies. It concludes by discussing related projects, evaluation of GIR systems, and a gazetteer server and service developed for UK academia.
4. Introduction
● Geographic Information Retrieval can be seen as a specialized branch of traditional
Information Retrieval.
● Information that relates to geographic space is called georeferenced
information, a term used frequently in Geographic Information Retrieval.
● Georeferenced information appears in all kinds of media, e.g. structured data such as
maps, land surveys, airborne and satellite images, and tabulated observations.
● It can also be used by researchers looking for a certain area, for example one
inhabited by certain animals or affected by an epidemic.
5. Properties of Georeferenced Information:
● Much of the information available in digital libraries and on the Internet is
georeferenced, although mostly not in terms of explicit geographic coordinates.
● The geographical location and extent of a place name is often called its geographic
footprint, and it is given by coordinates (longitude, latitude).
● Geographic Information Retrieval requires that place names and phrases that include
direct or indirect references to place names be resolved and translated into footprints
that can be indexed.
6. General Problems in GIR:
Ambiguity/Lack of precision in Place Names:
● Firstly, several places can share the same name, making the place names unique
only within a limited geographic area.
● Secondly, some place names occurring in texts are temporal or cultural conventions
rather than official names, requiring the user to understand the time, context or
cultural environment in which the place names are used in order to link them to
a geographic location.
● Thirdly, some place names change over time, e.g. Bangalore to Bengaluru, Calcutta to
Kolkata.
● Fourthly, the geographic extent that the place name denotes can be extended,
reduced or changed over time.
7. General Problems in GIR: (contd.)
○ Fifthly, the borders of a location can be fuzzy. (Kashmir?)
○ The same place name can be written differently in different texts, either because the
author has misspelled the name or because there are several valid spellings of the
same place name.
Information being fuzzy:
○ “About 200 kilometers south of the capital of Russia”: both the direction and the
distance may vary. South Africa has three capitals, so a similar phrase about its
capital would also be ambiguous.
○ Often, people are imprecise in giving geographic direction, using one of the four
general directions north, south, east or west, when the actual direction might be
somewhere in between.
8. Impact of cognitive model on Geographic
Information Retrieval
● Human understanding of geographic location is either procedural or survey based.
● Survey: involves looking at maps to find geographic locations.
● Procedural: involves exploring and navigating through the place so as to get the 'feel'
of it.
● Locating or gaining information from procedural knowledge is particularly difficult, as
its descriptions contain many ambiguous, human-centred phrases.
9. Cognitive model (continued)
● 'People link geographic distance with time': when talking about going from, say,
'A' to 'B', people tend to use time as a measure of distance, e.g. "It takes
two hours to get from 'A' to 'B' by car."
● 'Topology and metric distances': people are very good at describing topological
aspects of a place, such as inclusion (e.g. naming the places that lie within an area)
or coincidence (e.g. this place is at the same location as ...).
● 'People have biases towards east-west or north-south directions': people have a
biased view of geographic areas and only a vague sense of direction when giving
specifics. E.g. when asked where South America is with respect to North America, the
answer is generally "south", while it really lies to the south-east.
10. Geo-referencing using Gazetteers
Gazetteers: A form of index that relates place names to coordinates of locations and
extents.
Here we focus on automatic geo-referencing based on the contents of the document's
text alone.
Most automated approaches base geo-referencing on a combination of place name
identification and natural language processing. The latter identifies phrases that
modify the location pointed to by an occurrence of a place name (“200 km south of
Moscow”) or that indicate a location without actually mentioning a specific place
name (“Rosenborg's home field”).
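As a rough illustration of resolving a modifier phrase such as “200 km south of Moscow”, the sketch below combines a place name lookup with a simple pattern match. The one-entry gazetteer, the regular expression and the function name are assumptions made for this example, not part of any system described here, and the result is a single point even though the phrase really denotes a fuzzy region.

```python
import math
import re

# Hypothetical one-entry gazetteer: name -> (latitude, longitude) in degrees.
GAZETTEER = {"Moscow": (55.7558, 37.6173)}

# (north component, east component) for the four general directions.
DIRECTIONS = {"north": (1, 0), "south": (-1, 0), "east": (0, 1), "west": (0, -1)}

# Approximate length of one degree of latitude on a spherical Earth.
KM_PER_DEGREE_LAT = 111.195

# Matches phrases like "200 km south of Moscow" (illustrative pattern only).
PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*km\s+(north|south|east|west)\s+of\s+(?:the\s+)?(\w+)",
    re.IGNORECASE,
)


def resolve_offset_phrase(text):
    """Return an approximate (lat, lon) for the phrase, or None if unresolved."""
    match = PATTERN.search(text)
    if match is None:
        return None
    distance_km = float(match.group(1))
    d_north, d_east = DIRECTIONS[match.group(2).lower()]
    place = match.group(3)
    if place not in GAZETTEER:
        return None
    lat, lon = GAZETTEER[place]
    new_lat = lat + d_north * distance_km / KM_PER_DEGREE_LAT
    # A degree of longitude shrinks with the cosine of the latitude.
    km_per_degree_lon = KM_PER_DEGREE_LAT * math.cos(math.radians(lat))
    new_lon = lon + d_east * distance_km / km_per_degree_lon
    return (new_lat, new_lon)


result = resolve_offset_phrase("a village 200 km south of Moscow")
print(result)
```

A more faithful footprint would be a region around this point, since both the distance and the direction in such phrases are imprecise.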
11. Geo- referencing (continued)
Gazetteers have three basic components: name, location and feature type.
The name is the textual designator of a geographic location; the location is the
coordinates of a point, line or area on the earth’s surface pointed to by a name; and the
feature type is the type of location that a name points to
(forest, agricultural area, river, inhabited location, etc.). The location that a place
name refers to (the place name's footprint) can be given as a point, a bounding box or a
polygon, all represented by coordinates.
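The three components can be pictured as a small record type. The following is a minimal sketch; the class name, field names and the convention of telling footprint kinds apart by the number of coordinates are assumptions for illustration, and the coordinates are approximate.

```python
from dataclasses import dataclass
from typing import List, Tuple

Coordinate = Tuple[float, float]  # (longitude, latitude) in degrees


@dataclass
class GazetteerEntry:
    name: str                    # textual designator of the location
    feature_type: str            # e.g. "inhabited location", "river"
    footprint: List[Coordinate]  # 1 coordinate = point, 2 = box corners, 3+ = polygon

    def footprint_kind(self) -> str:
        """Classify the footprint by how many coordinates describe it."""
        if not self.footprint:
            raise ValueError("a footprint needs at least one coordinate")
        if len(self.footprint) == 1:
            return "point"
        if len(self.footprint) == 2:
            return "bounding box"
        return "polygon"


def lookup(entries, name):
    """Return every entry with this name; more than one means the name is ambiguous."""
    return [e for e in entries if e.name.lower() == name.lower()]


# Two different places that share one name (approximate coordinates).
entries = [
    GazetteerEntry("London", "inhabited location", [(-0.1276, 51.5072)]),
    GazetteerEntry("London", "inhabited location", [(-81.2453, 42.9849)]),
]

matches = lookup(entries, "london")
print(len(matches), [m.footprint_kind() for m in matches])
```

The lookup returning two entries for one name shows why a name alone is not a sufficient key: some further context is needed to pick the intended footprint.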
13. Geo- referencing (continued)
Bounding Box:
Gives a better idea of the entire referenced area.
Does not require a lot of data storage.
However, it overlaps other areas around it and is therefore imprecise.
14. Geo-referencing (continued)
Approximated Polygon approach:
Most accurate in terms of referencing.
However, it takes a lot of data storage space.
The best approach would be something in between the polygon and bounding box
approaches, such as a polygon with a fixed number of points.
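The trade-off between the two footprint types can be made concrete with a small sketch. The triangular region below is an invented example in arbitrary planar units; the helper names are assumptions for illustration.

```python
def bounding_box(polygon):
    """Smallest axis-aligned box (min_x, min_y, max_x, max_y) around a polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (min(xs), min(ys), max(xs), max(ys))


def box_area(box):
    min_x, min_y, max_x, max_y = box
    return (max_x - min_x) * (max_y - min_y)


def polygon_area(polygon):
    """Area of a simple polygon using the shoelace formula."""
    total = 0.0
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


# Invented triangular region in arbitrary planar units.
region = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
box = bounding_box(region)

# Fraction of the box that lies outside the region it is meant to describe.
wasted = 1.0 - polygon_area(region) / box_area(box)
print(box, wasted)
```

The box is always stored as four numbers, whereas the polygon grows with the number of vertices; in this example half of the box covers ground that does not belong to the region.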
15. Searching for Georeferenced Information
Letting the user specify one or more place names as keywords in a traditional
keyword-based query. When parsing the query, the GIR/IR system treats the place names it
finds as special keywords that indicate the geographical scope of the user's
information need.
e.g. Googling for restaurants around you
Letting users specify the geographic constraint of a query by drawing on one or more
maps.
e.g. Google Maps
And what about GPS apps like "Here and Now" or "Google Latitude"?
16. Searching for Georeferenced Information
Typical Queries:
○ Point in Polygon - asking for georeferenced information that contains,
surrounds or refers to a particular geographic point location
○ Region Queries - asking for anything contained in, adjacent to, or overlapping
the region.
○ Distance and Buffer Zone Queries - asking for information within some fixed
distance of a geographic object (point, line, polygon)
○ Path Queries - asking for the presence of a network structure that can be
queried for network traversal information
○ Multimedia Queries - combining multiple geo-referenced information sources
in resolving a query.
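Two of these query types can be sketched in a few lines. The example below uses the standard ray-casting test for point in polygon and the haversine great-circle distance for a distance query; the function names, the 60 km radius and the approximate airport coordinates are assumptions chosen for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def point_in_polygon(point, polygon):
    """Ray casting: count how many polygon edges a ray to the right crosses."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge spans the horizontal line through the point
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def within_distance(center, places, radius_km):
    """Distance query: names of all places within radius_km of center."""
    return sorted(name for name, pos in places.items()
                  if haversine_km(center, pos) <= radius_km)


triangle = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
print(point_in_polygon((1.0, 1.0), triangle), point_in_polygon((3.0, 3.0), triangle))

# Approximate (lat, lon) coordinates.
LONDON = (51.5072, -0.1276)
AIRPORTS = {
    "Gatwick": (51.1537, -0.1821),
    "Stansted": (51.8860, 0.2389),
    "Manchester": (53.3537, -2.2750),
}
print(within_distance(LONDON, AIRPORTS, 60.0))
```

Points that fall exactly on a polygon edge are not handled specially here; a production system would normally rely on a spatial index and a geometry library rather than a linear scan.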
17. Related Projects:
SPIRIT (Spatially-Aware Information Retrieval on the Internet): funded by the EC
Fifth Framework Programme. It aims to improve search capabilities on the internet by using
geographical and conceptual ontologies to model both the vocabulary and the spatial
structure of places for purposes of IR. This ontology is envisioned as an extension to
traditional gazetteers; it should help find related locations as well as help rank hits
based on geographic properties.
∙ ontologies that model geographical terminology;
∙ query expansion and relevance ranking procedures based on
the geographical ontologies;
∙ machine learning techniques for the extraction of
geographical context from web documents and for generating
metadata providing spatial context;
∙ a multi-modal user interface providing textual input and
interactive map feedback of the context of retrieved
documents;
∙ spatial indices for web collections
18. Geo-Ontologies
Ontologies relating Geographical Terminology and Spatial Relationships
● Reference to a geographic place: <PL-Name,PL-Type,{(x,y)}>
○ eg: <Charminar, Monument,{(x,y)}>
● Relative Place Reference : <Spatial Relationship,PL-Name, Type,PL-FP>
○ eg: <In, Hyderabad, City, {(x,y)}>
A Query to SPIRIT will contain one or more references to a PL-REF
Geographic content is a set of <Place reference> expressions and the Geometric Footprint
is a function of this set.
Geo-ontologies can be applied in:
1) User's query interpretation: (+ domain specific ontologies) for disambiguation of place
names
2) System query formulation: to generate alternate names and spatially associated names
3) Metadata extraction: to extract information from free-text documents to generate footprint(s)
4) Relevance ranking: potential for geographical relevance ranking (Dominos Pizza? :) )
19. Geo-Ontologies
Ontology: "formal, explicit specification of a shared conceptualisation"
20. Geo-Ontologies
● Types of Atomic Queries:
○ A place name
○ An aspatial entity with relation to a place name
○ An aspatial entity with a spatial relation to a place name
○ A place name with spatial relation to a place name
○ A place type with spatial relation to a place name
○ A place type with spatial relation to a place type
● Geo Ontology = Geographic Feature Ontology + Geographic Type Ontology + Spatial
Relation Ontology
21. User evaluation of the SPIRIT prototype gave results consistent with SPIRIT's
priority on innovative features. Yet users expressed a feeling of frustration, which
highlights that their requirements go beyond what SPIRIT achieved and that there is still
more work to be done in this area.
The last publication on the website dates back to 2005.
22. Relevance
In Information Retrieval, relevance denotes how well a retrieved document or set of
documents meets the information need of the user.
Geographic Information Retrieval is concerned with retrieving documents in response to a
spatially related query. Thus, the ranking of documents by both textual and spatial
relevance has to be considered.
The most common way to return a set of documents obtained from a Web query is as
a ranked list. The search engine attempts to determine which document seems to be the
most relevant to the user and puts it first in the list. In short, every document receives
a score, or distance to the query, and the returned documents are sorted by this score or
distance.
There are situations where sorting by a single score may not be the most useful approach.
When a more complex query is issued, composed of more than one query term or aspect,
documents can also be returned with two or more scores instead of one.
23. For example, the Web search could be for campsites in the neighborhood of
Neuschwanstein, and the documents returned ideally have a score for the query
term “camping” and a score for the proximity to Neuschwanstein. This implies that a Web
document resulting from this query can be mapped to a point in the 2-dimensional plane,
where both axes represent a score. The map indicates campsites near the castle
Neuschwanstein, which is situated close to Schwangau, with the distance to the castle
on the x-axis and the rating on the y-axis.
24.
Another weakness of our methods lies in the way we treat multiple-footprint documents.
While we assume that a query can have only one footprint (a user is interested in only one
location), documents may have multiple footprints (refer to more than one location).
The method we followed so far in order to calculate the spatial score considers only the
best-matching document footprint. For example, if a user is looking for “airports near
London”, a document that refers to both “Gatwick” and “Stansted” is scored as referring
only to “Gatwick” since it’s the nearest airport of the two. Such a document, however,
should be scored higher than another that refers only to “Gatwick” since it provides more
relevant information. Another factor is the number of footprints that occur: Gatwick’s
official web pages should be more important than a web list of all airports in the UK.
25.
For high-quality ranking two things are required. Firstly, we need a good spatial score
between query and document footprints. Secondly, we need a good combination of the
spatial and textual (BM25) scores.
For finding spatial scores, the spatial relationships (distance, containment, and direction)
were converted into numeric values that indicate how close, how much inside, or how
much North-of the relationship between two objects is. Those numeric values were first
attempts at obtaining a score to quantify spatial relationships.
However, certain issues do come up with this method. For example, let us assume three
cities, A, B, and C, where A lies at equal distance (in a Euclidean sense) from B and C. If
C is bigger than B, then the score of B being close to A should be lower than that of C
being close to A. In other words, the distance scores of cities around A may depend on the
context, i.e. which other cities are around A. Also, natural barriers can influence the
concept of proximity. It matters a lot whether a distance of 10 km (as the crow flies) can be
covered by a direct road, or requires a large detour around a mountain range (or a small
road over a mountain pass).
26.
In traditional information retrieval, the separate scores of each document would be
combined into a single score (e.g., by a weighted sum or product) which produces the
ranked list by sorting.
Now, we are going to incorporate two pieces of information into the way that a spatial
document score is calculated:
• The number n of unique footprints in a document.
• The frequencies f_1,…, f_n, of occurrence of the footprints in the document.
Moreover, the total spatial score of a document will be derived from fractional score
contributions of all occurring document footprints.
27.
A simple way of taking into account all document footprints is to define the total spatial
score as a linear combination (e.g. the simple average) of the individual scores of the
footprints:
S = 1/n * (s_1+…+s_n)
where s_i is the score of the ith document footprint with respect to the query
footprint. Incorporating also the frequencies of occurrence f_i, let us define the weight of
a footprint:
tf_i = 1 + log (f_i).
A footprint that occurs in the document only once will get a weight of one, where any extra
occurrences will increase the weight in a log fashion. The total score may be calculated as
S = 1/(tf_1+…+tf_n) * (tf_1*s_1+…+tf_n*s_n),
that is the weighted average of the individual scores.
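The two formulas can be tried out directly. The sketch below assumes the natural logarithm for log, since the base is not stated, and the per-footprint scores and frequencies are invented values for the “airports near London” example.

```python
import math


def average_score(scores):
    """S = 1/n * (s_1 + ... + s_n)."""
    return sum(scores) / len(scores)


def footprint_weight(frequency):
    """tf_i = 1 + log(f_i); natural log is an assumption, the base is not given."""
    return 1.0 + math.log(frequency)


def weighted_spatial_score(scores, frequencies):
    """S = (tf_1*s_1 + ... + tf_n*s_n) / (tf_1 + ... + tf_n)."""
    weights = [footprint_weight(f) for f in frequencies]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


# Invented example: an official airport page mentions one footprint many times.
official_page = weighted_spatial_score([0.9], [20])

# Invented example: a list of airports mentions many footprints once each,
# most of them far away from the query footprint.
airport_list = weighted_spatial_score([0.9, 0.8, 0.1, 0.1], [1, 1, 1, 1])

print(official_page, airport_list)
```

With these made-up numbers the single-footprint page keeps the full score of its footprint, while the list is pulled down by its many distant footprints.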
28.
Considering again the example about “airports near London”, a scoring function like
the last one would score Gatwick’s official web page higher than a web list of all UK
airports. Moreover, it takes into account more than the best-matched document footprint.
The last formula may serve as a starting point for improving the spatial scoring function.
29. Evaluation:
2 Indicators:
1) Recall = No. of relevant docs returned / Total no. of relevant docs
2) Precision = No. of relevant docs returned / Total no. of docs returned
TREC has been evaluated using the ISO 9241 standard: based on Effectiveness (can users
find relevant docs?), Efficiency (resources consumed per result) and Satisfaction (user
feedback)
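Both indicators are easy to compute for a single query. The sketch below uses the standard definition of precision (relevant documents returned divided by all documents returned); the document identifiers are invented.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query, given document id collections."""
    returned = set(returned)
    relevant = set(relevant)
    hits = returned & relevant  # relevant documents that were actually returned
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example: four documents returned, three are relevant overall.
precision, recall = precision_recall(
    returned=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5"],
)
print(precision, recall)
```

Here two of the four returned documents are relevant, and two of the three relevant documents were found.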
30. Gazetteer Server and Service for UK
Academia - James Reid
Gazetteer: a geographical dictionary or directory that serves as a reference for
information about places.
● Geographic searching is a powerful information retrieval tool, because the results
it returns are more specific.
● Geographic searching is restricted because geographic metadata creation is very
resource intensive, and the geographic metadata that resources do have often consists
only of place names.
● The geographic footprint is usually not given directly; there may be only a
direct or indirect reference to the place.
Constant change in geographic metadata:
● Names of places may vary.
● Names may have changed over time.
● Boundaries can be fuzzy.
● Names may be used only in a particular context.
31. GeoXwalk is a comprehensive gazetteer linking a vocabulary of
current and historical geographical names to a standard spatial
coding scheme (longitude, latitude).
Technically, GeoXwalk has three basic components:
● A gazetteer database to support spatial searches.
● Middleware components to issue spatial/aspatial queries.
● A geo-parser to parse documents that are not geographically indexed
for place names and references to them.
32. Gazetteer database
Each geographical feature must include:
● Feature name.
● Feature type.
● Geometry (spatial footprint).
Places can be marked out better by polygons than by points.
Explicit relationships can be defined, which is particularly useful when the gazetteer
holds a significant amount of historical data for which geometries don't exist.
Middleware components:
Protocols supported by GeoXwalk are:
● ADL Gazetteer protocol
● OGC Filter Encoding implementation.
These are used to translate XML queries into database-specific SQL queries.
33. GeoParser
Most existing data and metadata have some sort of geo-reference, but not in a format
that allows them to be easily searched spatially.
One associated task is working out how documents without spatial references can be
spatially indexed. This could be done using a gazetteer as a reference.
A prototype geo-parser has been implemented that semi-automatically identifies place
names in a document and extracts a suitable spatial footprint.
Its rule-based approach takes into account the structure and context in which words occur.
One issue faced by GeoXwalk is map conflation, i.e. detecting duplicate entries, such as
a place that has different names in different languages but the same geographic footprint.
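To give a feel for gazetteer-based geo-parsing, here is a deliberately naive sketch. It is not the GeoXwalk geo-parser: it does a plain dictionary lookup on single words and ignores structure and context entirely. The tiny gazetteer, the mapping of alternative names to one canonical entry and the function name are assumptions for illustration; coordinates are approximate (lat, lon).

```python
import re
from collections import Counter

# Alternative names map to one canonical entry, so duplicates are conflated.
NAME_TO_CANONICAL = {
    "bengaluru": "Bengaluru",
    "bangalore": "Bengaluru",
    "kolkata": "Kolkata",
    "calcutta": "Kolkata",
}

# Approximate point footprints as (latitude, longitude).
FOOTPRINTS = {
    "Bengaluru": (12.9716, 77.5946),
    "Kolkata": (22.5726, 88.3639),
}


def geoparse(text):
    """Map each recognised place to (footprint, number of mentions)."""
    counts = Counter()
    for token in re.findall(r"[A-Za-z]+", text):
        canonical = NAME_TO_CANONICAL.get(token.lower())
        if canonical is not None:
            counts[canonical] += 1
    return {name: (FOOTPRINTS[name], n) for name, n in counts.items()}


text = ("The train from Calcutta to Bangalore now runs daily; "
        "Bengaluru station was rebuilt.")
found = geoparse(text)
print(found)
```

A lookup like this cannot tell a place name from an identical ordinary word or person name, which is where rules about structure and context become necessary.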
34. Related Projects: GeoVSM
Geographic Vector Space Model: The project integrates coordinate-based
geographic indexing with the keyword-based vector space model in representing
information space. Relevance measures are based on both geographic measures and on
thematic measures, which can be combined into one single measure system.
Vector Space Model: One of the most popular models of document space developed
in text-based information retrieval research. It is an algebraic model for representing
text or graphical documents (and any objects, in general) as vectors of identifiers.
Using a vector space model, the content of each geographic document can be
approximately described by a vector of (content-bearing) terms, which are a combination of
thematic subjects and place names.
● Documents and queries are represented as vectors. Each dimension corresponds to a
separate term.
An information retrieval system stores a representation of a
document collection using a document-by-term matrix, where the element at position (i, j)
corresponds to the frequency of occurrence of term j in the ith document. In the vector
space model, all the objects (terms, documents, queries, concepts, etc) can be similarly
represented as vectors.
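A document-by-term matrix and a query vector can be built in a few lines. The sketch below uses raw term frequencies and cosine similarity, which is one common choice and an assumption here; the three short documents are invented.

```python
import math
from collections import Counter

documents = [
    "flood damage in Pittsburgh",
    "flood warning for Harrisburg",
    "Pittsburgh steel history",
]


def tokenize(text):
    return text.lower().split()


# One dimension per distinct term in the collection.
vocabulary = sorted({term for doc in documents for term in tokenize(doc)})


def to_vector(text):
    """Term-frequency vector over the shared vocabulary."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]


# Element (i, j) is the frequency of term j in document i.
matrix = [to_vector(doc) for doc in documents]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


query = to_vector("flood Pittsburgh")
scores = [cosine(row, query) for row in matrix]
ranking = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
print(ranking, scores)
```

Note that "pittsburgh" and "harrisburg" are simply two unrelated dimensions in this representation; nothing in the vectors records how near the two cities are to each other.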
● Vector space model is well accepted as an effective approach in modelling the thematic
subspace.
35. However, the vector space model has some serious problems when used for
modeling the geographic subspace.
The geographic space is inherently continuous and cannot be
adequately approximated using a set of place names (which are discrete in nature). For
example, if a document mentions four place names (Pittsburgh, Philadelphia, Harrisburg,
and Hagerstown), the four place names will be treated as four independent dimensions in a
vector space model, whereas in fact they are points (or regions) in a two-dimensional
geographic space.
Additional concerns with using locational terms as geographic indexes include ambiguity
in meaning, non-unique place names, place names that change over time, and spelling
variations.
36. Geographical Model
● Geographical model of document space is capable of processing arbitrarily complex
spatial queries.
● The most common spatial queries are believed to be of three types:
1. Point query: Return the geometric object that contains a given query point.
2. Region query: Given a region R, find all objects in the collection that intersect R.
3. Buffer zone query: A buffer query involves two spatial data sets and a distance d. The
answer to this query is the set of pairs of objects, one from each input set, that are
within distance d of each other, e.g. “find house-power line pairs that are within
50 meters of each other.”
● Spatial indexing based on coordinates generates persistent indexes for documents,
since it is well defined and is immune from any changes in place names, political
boundaries, and linguistic variations
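The buffer zone query can be sketched for the house and power line example. The code below assumes planar coordinates in meters, point footprints for houses and single straight segments for power lines; all names and coordinates are invented for illustration.

```python
import math


def point_segment_distance(p, a, b):
    """Shortest planar distance from point p to the segment from a to b."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:  # degenerate segment: a and b coincide
        return math.hypot(px - ax, py - ay)
    # Project p onto the line through a and b, clamped to the segment.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def buffer_query(points, segments, d):
    """All (point id, segment id) pairs that lie within distance d of each other."""
    return [
        (point_id, segment_id)
        for point_id, p in points.items()
        for segment_id, (a, b) in segments.items()
        if point_segment_distance(p, a, b) <= d
    ]


# Invented data, planar coordinates in meters.
houses = {"h1": (10.0, 30.0), "h2": (200.0, 200.0)}
power_lines = {"p1": ((0.0, 0.0), (100.0, 0.0))}

pairs = buffer_query(houses, power_lines, 50.0)
print(pairs)
```

This compares every house with every line; real systems would use a spatial index to avoid the full pairwise scan.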
37. VSM / Geographical model (contd..)
● Disadvantage of using the geographical model for retrieving geographical
information:
- A considerable amount of geographical information exists in textual form that is
not easily integrated into the geographical model for mapping and spatial
analysis, due to the difficulty of natural language understanding for geo-referencing
text.
39. GeoVSM
● Model obtained by combining the advantages of both the geographical model and
vector space model.
● Each document will be indexed both by footprint (in geographical coordinate space)
and by a term vector (in vector space).
● Geographical indexes will only represent the geographical scope of the document,
and term vectors will only represent thematic scope of documents
41.
Assume that any document has a limited geographic scope, GSd, and
a thematic scope, TSd. Similarly, a query on a document collection also has a geographic
scope, GSq and a thematic scope, TSq. The degree of relevance of a document
to a query can be determined by the following measure:
Rel(d, q) = ƒ(SimG(GSd, GSq), SimT(TSd, TSq) ) (1)
where SimG(•) measures the similarity (i.e., the degree of overlapping) between the
geographic scopes of the document and the query; SimT(•) measures the degree of
overlapping between the thematic scopes of the document and the query; and ƒ(•) is a
function for combining relevance measures of geographic dimensions and thematic
dimensions.
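Equation (1) leaves SimG, SimT and ƒ open, so the sketch below fills them in with simple stand-ins that are purely assumptions: geographic scopes are bounding boxes compared by intersection over union, thematic scopes are term sets compared by the Jaccard coefficient, and ƒ is a weighted sum with an arbitrary weight of 0.5.

```python
def box_area(box):
    """Area of (min_x, min_y, max_x, max_y); zero if the box is empty."""
    min_x, min_y, max_x, max_y = box
    return max(0.0, max_x - min_x) * max(0.0, max_y - min_y)


def sim_g(gs_d, gs_q):
    """Assumed SimG: overlap of two bounding boxes as intersection over union."""
    intersection = box_area((
        max(gs_d[0], gs_q[0]), max(gs_d[1], gs_q[1]),
        min(gs_d[2], gs_q[2]), min(gs_d[3], gs_q[3]),
    ))
    union = box_area(gs_d) + box_area(gs_q) - intersection
    return intersection / union if union > 0 else 0.0


def sim_t(ts_d, ts_q):
    """Assumed SimT: Jaccard overlap of two term sets."""
    ts_d, ts_q = set(ts_d), set(ts_q)
    union = ts_d | ts_q
    return len(ts_d & ts_q) / len(union) if union else 0.0


def rel(gs_d, ts_d, gs_q, ts_q, w_geo=0.5):
    """Assumed f: weighted sum of geographic and thematic similarity."""
    return w_geo * sim_g(gs_d, gs_q) + (1.0 - w_geo) * sim_t(ts_d, ts_q)


# Invented document and query scopes.
score = rel(
    gs_d=(0.0, 0.0, 2.0, 2.0), ts_d={"camping", "lake"},
    gs_q=(1.0, 1.0, 3.0, 3.0), ts_q={"camping"},
)
print(score)
```

Other choices of ƒ, such as a product of the two similarities, would penalise documents that match on only one dimension more strongly than this weighted sum does.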
42. References
* Guoray Cai. GeoVSM: An Integrated Retrieval Model for Geographic Information.
School of Information Sciences and Technology, The Pennsylvania State University,
002K Thomas Building, University Park, PA 16802
* http://www.geo-spirit.org/public_deliverables.html
* http://www.geo-spirit.org/publications/SPIRIT_WP5_D17_5201_final.pdf
* http://www.geo-spirit.org/publications/SPIRIT_DeliverableD18_5302_final.pdf
* http://www.geo-spirit.org/publications/GIR_distrib_ranking.pdf
* Marc van Kreveld, Iris Reinbacher, Avi Arampatzis, Roelof van Zwol.
Distributed Ranking Methods for Geographic Information Retrieval.