Image retrieval technology has been developed for more than twenty years. However, current image retrieval techniques still cannot achieve satisfactory recall and precision. To improve the effectiveness and efficiency of an image retrieval system, a novel content-based image retrieval method combining image segmentation and eye tracking data is proposed in this paper. In the method, eye tracking data are collected by a non-intrusive table-mounted eye tracker at a sampling rate of 120 Hz, and the corresponding fixation data are used to locate the human's Regions of Interest (hROIs) on the segmentation result from the JSEG algorithm. The hROIs are treated as important informative segments/objects and used in image matching. In addition, the relative gaze duration of each hROI is used to weigh the similarity measure for image retrieval. The similarity measure proposed in this paper is based on a retrieval strategy emphasizing the most important regions. Experiments on 7,346 manually annotated Hemera color images show that the retrieval results from our proposed approach compare favorably with conventional content-based image retrieval methods, especially when the important regions are difficult to locate based on visual features.
In this paper, a model using a combination of visual features and eye tracking data is proposed to reduce the semantic gap and improve retrieval performance. The flowchart of our proposed model is shown in Figure 1. After eye tracking data are collected by a non-intrusive table-mounted eye tracker and the image is segmented by the JSEG algorithm, the fixation data are extracted and used to locate the hROIs on the segmented image. The relative gaze duration in each hROI is also computed to weigh its importance. The selected hROIs are treated as important informative segments/objects and used in the image retrieval.

The rest of this paper is organized as follows. Eye tracking data acquisition is described in Section 2. In Section 3, the JSEG algorithm is explained. Then we discuss how to construct an image retrieval model with eye tracking data in terms of region selection, feature extraction, weight calculation as well as similarity measurement in Section 4. In Section 5, experimental results are reported with a comparison of our new approach with conventional image retrieval methods. Finally, a conclusion is drawn and future work is outlined in Section 6.

Figure 1. The flowchart of our proposed model.

2 Eye Tracking Data Acquisition

A non-intrusive table-mounted eye tracker, Tobii X120, was used to collect eye movement data in a user-friendly environment with a high accuracy (0.5 degree) at a 120 Hz sampling rate. The freedom of head movement is 30 x 22 x 30 cm. Before each data collection, a calibration is conducted on a grid of nine calibration points to minimize errors in eye tracking. Fixations (location and duration) were extracted from the raw eye tracking data using a criterion of fixation radius (35 pixels) and minimum fixation duration (100 ms). Each image in the 7,346-image Hemera color image database is viewed by a participant with normal vision within 5 seconds. The viewing distance is approximately 68 cm away from the display, with a display resolution of 1920 x 1200 pixels. The corresponding subtended visual angle is about 41.5° x 26.8°.
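The paper specifies only the two fixation thresholds (35-pixel radius, 100 ms minimum duration), not the detection algorithm itself, so the sketch below assumes a standard dispersion-style detector; the function name and the data layout (gaze samples plus timestamps in milliseconds) are our own illustration, not code from the paper.

```python
import numpy as np

def extract_fixations(gaze_xy, t_ms, radius=35.0, min_dur=100.0):
    """Group consecutive gaze samples into fixations.

    A sample stays in the current group while it lies within `radius`
    pixels of the group's running centroid; a group is kept as a
    fixation if it lasts at least `min_dur` milliseconds.
    Returns a list of (x, y, duration_ms) tuples.
    """
    fixations, group = [], []
    for p, t in zip(np.asarray(gaze_xy, dtype=float), t_ms):
        if group and np.linalg.norm(p - np.mean([g[0] for g in group], axis=0)) > radius:
            # Current sample left the dispersion window: close the group.
            dur = group[-1][1] - group[0][1]
            if dur >= min_dur:
                cx, cy = np.mean([g[0] for g in group], axis=0)
                fixations.append((cx, cy, dur))
            group = []
        group.append((p, t))
    if group and group[-1][1] - group[0][1] >= min_dur:
        cx, cy = np.mean([g[0] for g in group], axis=0)
        fixations.append((cx, cy, group[-1][1] - group[0][1]))
    return fixations
```

At the 120 Hz sampling rate reported above, the 100 ms minimum corresponds to roughly twelve consecutive samples per fixation.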
3 Image Segmentation

A state-of-the-art segmentation algorithm, JSEG [Deng and Manjunath 2001], is used in this paper to segment images into regions based on low-level features (color and texture). The image segmentation step is similar to human bottom-up processing, which can help locate objects and boundaries in an image retrieval system. In our experiment, images are downsized to a maximum width/length of 100 pixels with a fixed aspect ratio before segmentation, which reduces the computational complexity and increases the retrieval efficiency. Figure 2(b) gives some segmentation results.

Figure 2. Representative images (a) with the corresponding segmentation results (b) and eye tracking data acquisition (c).
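As a minimal illustration of the pre-segmentation downsizing step (assuming the maximum dimension is measured in pixels), the following Pillow snippet shrinks an image so its longer side is at most 100 pixels while preserving the aspect ratio; JSEG itself is a separate implementation not shown here, and the function name is our own.

```python
from PIL import Image

def downsize_for_segmentation(path, max_side=100):
    """Shrink an image so max(width, height) <= max_side, keeping aspect ratio."""
    img = Image.open(path).convert("RGB")
    img.thumbnail((max_side, max_side), Image.LANCZOS)  # resizes in place
    return img
```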
4 Image Retrieval Model

A novel image retrieval model using a combination of image segmentation results and eye tracking data is proposed in this section. The aim of the image segmentation step is to simulate the bottom-up processing that coarsely interprets and parses images into coherent segments based on low-level features [Spekreijse 2000]. [Rutishauser et al. 2004] demonstrated that bottom-up attention is partially useful for object recognition, but it is not sufficient. In the second stage of human visual attention, top-down processing, one or a few objects are selected from the whole image for a more thorough analysis [Spekreijse 2000]. The selection procedure is guided not only by elementary features but also by human understanding. Thus, if an image retrieval strategy could incorporate both bottom-up and top-down mechanisms, its accuracy and efficiency would be largely improved. The eye tracking technique provides us with an important signal for understanding which region(s) the user is concerned with, or which object(s) in the image is the target the user wants to search for. Fixation is one of the most important features in eye tracking data; it tells us where the subject's attention points are and how long one fixates on each attention point. Thus, in our proposed model, the fixation data are used to define the hROIs on the segmented image, and the relative gaze duration in each hROI is treated as the corresponding region significance.

4.1 Selection Process of Human's ROI

Here, we use eye tracking data to locate the observer's regions of interest in an image, and an importance value for each segmented region is defined as the relative gaze duration on the region. Some example images with their eye tracking data are shown in Figure 2(c). Suppose that an image $I$ is composed of $N$ segmented regions (Eq. (1)), and the relative gaze duration on the image $I$ is $D$:

$$I = \{S_1, \ldots, S_i, \ldots, S_N\}, \quad (1)$$

where $S_i$ is the $i$th segmented region.

A concept of importance value is introduced to show the degree of the observer's interest in a region. Assume that the relative gaze duration on the segmented region $S_i$ is $d_i$; a corresponding importance value $v_i$ can then be calculated by Eq. (2). The value will be 0 if there is no fixation on the region:

$$v_i = \frac{d_i}{D}, \quad \sum_{i=1}^{N} v_i = 1, \quad (2)$$

where $D = \sum_{i=1}^{N} d_i$.

As shown in Figure 3(a), the popping-out process of human ROIs consists of the following steps. Step 1: Scale the eye tracking data to the segmented map size. Step 2: Determine whether a segmented region is an hROI or not. Step 3: If all segmented regions in the image have been processed, terminate the procedure; otherwise, go to Step 2. Figure 3(b) gives some examples showing the selection results in terms of weighting maps: the higher the importance value, the redder the region in the map. In the next image retrieval step, the selected hROIs are treated as important informative segments/objects, and the corresponding importance values are used as the region significance to weigh the similarity measure.

Figure 3. Selection process of hROIs (a) and weighting maps (b) (the value in the weighting map is the corresponding importance value of the region).
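To make Eqs. (1)-(2) and the selection steps concrete, here is a minimal Python sketch under our own assumptions about the data layout: a label map produced by the segmentation step and fixation tuples already scaled to its size (Step 1). The function name is illustrative, not code from the paper.

```python
import numpy as np

def region_importance(label_map, fixations):
    """Eq. (2): each region's importance value is its share of total gaze time.

    label_map: HxW integer array of segmented-region labels.
    fixations: list of (x, y, duration_ms) tuples in label-map coordinates.
    Returns a dict mapping region label -> importance value v_i.
    """
    durations = {}
    for x, y, dur in fixations:
        r, c = int(round(y)), int(round(x))
        if 0 <= r < label_map.shape[0] and 0 <= c < label_map.shape[1]:
            label = int(label_map[r, c])
            durations[label] = durations.get(label, 0.0) + dur
    total = sum(durations.values())  # D in Eq. (2)
    # Regions with no fixation get an importance value of 0.
    return {int(l): (durations.get(int(l), 0.0) / total if total > 0 else 0.0)
            for l in np.unique(label_map)}
```

By construction the nonzero values sum to 1, matching the normalization in Eq. (2).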
4.2 Feature Extraction

Color and texture properties are extracted from the selected hROIs for the similarity measure. For the color property, the HSV color space is used because it approximates human perception [Paschos 2001]. For the texture property, the Sobel operator is used to produce an edge map from the gray-scale image. A feature set including an 11 x 11 x 11 HSV color histogram and a 1 x 41 texture histogram of the edge map is used to characterize each region.
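The paper fixes the histogram sizes but not the binning or normalization details, so the following OpenCV sketch makes plain assumptions: a uint8 region mask, L1 normalization, and Sobel edge magnitudes binned up to the image maximum. `region_features` is our illustrative name.

```python
import cv2
import numpy as np

def region_features(img_bgr, mask):
    """Describe one hROI by an 11x11x11 HSV histogram and a 41-bin
    histogram of Sobel edge magnitudes, both restricted to the uint8 mask."""
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)
    color_hist = cv2.calcHist([hsv], [0, 1, 2], mask, [11, 11, 11],
                              [0, 180, 0, 256, 0, 256]).flatten()
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    mag = cv2.magnitude(gx, gy)
    edge_hist, _ = np.histogram(mag[mask > 0], bins=41,
                                range=(0.0, float(mag.max()) + 1e-6))
    # L1-normalize so regions of different sizes remain comparable.
    color_hist /= color_hist.sum() + 1e-12
    edge_hist = edge_hist / (edge_hist.sum() + 1e-12)
    return np.concatenate([color_hist, edge_hist])
```

The concatenated vector is what the Euclidean distance in the next subsection operates on.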
4.3 Similarity Measure

An image is represented by several regions in the image retrieval system. Suppose that there are $m$ ROIs from the query image, $R_Q = \{R_{Q1}, \ldots, R_{Qm}\}$, and $n$ ROIs from a candidate image, $R_C = \{R_{C1}, \ldots, R_{Cn}\}$. As discussed in Section 4.1, the corresponding region weight vectors of the query image and the candidate image are $V_Q = \{v_{Q1}, \ldots, v_{Qm}\}$ and $V_C = \{v_{C1}, \ldots, v_{Cn}\}$, respectively. The similarity matrix among the regions of the two images is defined as

$$\Phi = \{\phi_{ij}\} = \{d(R_{Qi}, R_{Cj})\}, \quad i = 1, \ldots, m; \; j = 1, \ldots, n, \quad (3)$$

where $d(R_{Qi}, R_{Cj})$ is the Euclidean distance between the feature vectors of regions $R_{Qi}$ and $R_{Cj}$. The weight matrix, which indicates the importance of the corresponding region similarity measure in the similarity matrix, is defined as

$$W = \{w_{ij}\} = \{v_{Qi} v_{Cj}\}, \quad i = 1, \ldots, m; \; j = 1, \ldots, n. \quad (4)$$

To find the most similar region in the candidate image, a matching matrix is defined as

$$\Lambda = \{\lambda_{ij}\}, \quad i = 1, \ldots, m; \; j = 1, \ldots, n, \quad (5)$$

where

$$\lambda_{ij} = \begin{cases} 1 & \text{if } j = j^{*} \text{ and } j^{*} = \arg\min_{j} \phi_{ij}, \\ 0 & \text{otherwise}, \end{cases} \quad i = 1, \ldots, m. \quad (6)$$

In the matching matrix, exactly one element in each row is 1 and the others are 0. The value of 1 indicates that the corresponding $\phi_{ij}$ in the similarity matrix is the minimum in that row. Thus, the distance between two images in the proposed image retrieval model is defined as

$$s(I_Q, I_C) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} w_{ij} \lambda_{ij} \phi_{ij}}{\sum_{i=1}^{m} \sum_{j=1}^{n} w_{ij} \lambda_{ij}}. \quad (7)$$

When the query image and a candidate image are identical, the distance in Eq. (7) is zero. Thus, for a query image, a smaller distance indicates that there are more matched regions in the candidate image; in other words, the corresponding image is more relevant to the query.
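A compact NumPy sketch of the region-matching distance in Eqs. (3)-(7) follows; the array names and the function itself are illustrative, and it assumes the per-region feature vectors and importance values computed above.

```python
import numpy as np

def image_distance(query_feats, query_wts, cand_feats, cand_wts):
    """Eqs. (3)-(7): weighted best-match distance between two region sets.

    query_feats: (m, d) feature vectors of the query hROIs
    query_wts:   (m,)  importance values v_Qi from Eq. (2)
    cand_feats:  (n, d) feature vectors of the candidate hROIs
    cand_wts:    (n,)  importance values v_Cj
    """
    # Eq. (3): pairwise Euclidean distances phi_ij.
    diff = query_feats[:, None, :] - cand_feats[None, :, :]
    phi = np.linalg.norm(diff, axis=2)
    # Eq. (4): weight matrix w_ij = v_Qi * v_Cj.
    w = np.outer(query_wts, cand_wts)
    # Eqs. (5)-(6): exactly one best-matching candidate region per query region.
    lam = np.zeros_like(phi)
    lam[np.arange(phi.shape[0]), phi.argmin(axis=1)] = 1.0
    # Eq. (7): weighted average distance over the matched pairs.
    return (w * lam * phi).sum() / (w * lam).sum()
```

Ranking a database then amounts to sorting candidates by this distance in ascending order.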
5 Experimental Results and Discussion

5.1 Image Database and Evaluation Criterion

The retrieval experiments are conducted on the 7,346 Hemera color images manually annotated with keywords. Figure 4 shows example images from a few categories. The evaluation criterion for retrieval performance applied here does not simply label images as "relevant" or "irrelevant", but is based on the ratio of matched keywords between the query image and a returned database image. Suppose that the query image and the retrieved image have $M$ and $N$ keywords, respectively, with $P$ matched keywords; then the semantic similarity is defined as

$$S(\text{query image}, \text{retrieved image}) = \frac{P}{(M+N)/2}. \quad (8)$$
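Eq. (8) is straightforward; the one-liner below implements it under our assumption that the keyword annotations can be treated as sets.

```python
def semantic_similarity(query_keywords, retrieved_keywords):
    """Eq. (8): matched keywords P normalized by the mean keyword count."""
    q, r = set(query_keywords), set(retrieved_keywords)
    return len(q & r) / ((len(q) + len(r)) / 2)

# e.g. semantic_similarity({"cow", "grass"}, {"cow", "field", "sky"}) == 0.4
```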
Figure 4. Example images in the image database. (a) Asian architecture; (b) Close-up; (c) People; (d) Landmarks.

Figure 5. Average semantic similarity vs. the number of images returned for the four themes of images shown in Figure 4.

5.2 Performance Evaluation of Image Retrieval

The performance of our proposed image retrieval model on different types of query images (Figure 4) is shown in Figure 5, compared with the following three methods: 1) Global based: retrieval based on the Euclidean distance of the global color and texture histograms of two images; 2) Attention based: the attention-driven image retrieval strategy proposed in [Fu et al. 2006]; 3) Attention object1 based: retrieval using only the first popped-out object in the attention-driven image retrieval strategy [Fu et al. 2006]. The fixation-based image retrieval system is the one proposed in this paper. Figure 5 shows the retrieval results of the four methods mentioned above for different image classes. We can see that our fixation-based method is significantly better than the other methods in the "Asian Architecture" and "People" image classes in terms of average semantic similarity. For the other two image classes, "Close-up" and "Landmarks", our method is better when the number of returned images is not large (20 or smaller), suggesting that our method is more effective.
5.3 Discussion

Our proposed model achieves better retrieval performance than the other three image retrieval methods when the objects are hidden in the background (i.e., the low-level features of the objects are not conspicuous) or when there are multiple objects in the image. For example, for the image shown in Figure 6 (left), "man working out in the gym", Fu et al.'s model places a higher importance value on the white ground and treats the other parts as background, while the man and the woman in the corner are considered the two most important hROIs in our model. A comparison of the hROI selections on the example image for the fixation-based and attention-based approaches is shown in Figure 6 (right). On the other hand, global based image retrieval mixes all the information together and cannot distinguish objects of different significance in the image, especially when the objects are hidden in the background. In our method, the important information in the image can be extracted and well ranked based on the human visual attention process. For example, for the image with a cow on a grass background (Figure 2), the grass has a much larger area than the cow. As a result, global based image retrieval prefers to retrieve images that also have green objects and/or a green background. On the contrary, our method identifies the cow as the most important object in the image, and accordingly the retrieval performance is much improved.

Figure 6. Fixation-based vs. attention-based selection, where the value below the left image is the corresponding importance value.

6 Conclusion

In this paper, we report our study on imitating the human visual attention process for CBIR by combining image segmentation and eye tracking techniques. The JSEG algorithm is used to parse the image into homogeneous sub-regions, and eye tracking data are utilized to locate the hROIs on the segmented image. In the similarity measurement step, each hROI is weighed by its relative fixation duration as the importance value, to emphasize the most important regions. Retrieval results on 7,346 Hemera color images show that our proposed approach compares favorably with conventional CBIR methods, especially when the important regions are difficult to locate based on the visual features of an image. Future work includes collecting eye tracking data during the relevance feedback process and refining both the feature extraction and the weight computation.

Acknowledgements

This work is supported by the Research Grants Council of the Hong Kong Special Administrative Region, China (Project code: PolyU 5141/07E) and the PolyU Grant (Project code: 1-BBZ9).
References

CARSON, C., THOMAS, M., BELONGIE, S., HELLERSTEIN, J. M., AND MALIK, J. 1999. Blobworld: A System for Region-Based Image Indexing and Retrieval. In Proceedings of Visual Information Systems, 509-516.

DENG, Y., AND MANJUNATH, B. 2001. Unsupervised Segmentation of Color-Texture Regions in Images and Videos. IEEE Trans. Pattern Anal. Mach. Intell., Vol. 23(8), 800-810.

FU, H., CHI, Z., AND FENG, D. 2006. Attention-Driven Image Interpretation with Application to Image Retrieval. Pattern Recognition, Vol. 39(9), 1604-1621.

DE GRAEF, P., CHRISTIAENS, D., AND D'YDEWALLE, G. 1990. Perceptual Effects of Scene Context on Object Recognition. Psychological Research, Vol. 52, 317-329.

HENDERSON, J. M., AND HOLLINGWORTH, A. 1998. Eye Movements During Scene Viewing: An Overview. In: Eye Guidance While Reading and While Watching Dynamic Scenes, Underwood, G. (Ed.). Elsevier Science, Amsterdam, 269-293.

LV, Q., CHARIKAR, M., AND LI, K. 2004. Image Similarity Search with Compact Data Structures. In Proceedings of the ACM International Conference on Information and Knowledge Management, 208-217.

MARQUES, O., MAYRON, L., BORBA, G., AND GAMBA, H. 2006. Using Visual Attention to Extract Regions of Interest in the Context of Image Retrieval. In Proceedings of the ACM Annual Southeast Regional Conference, 638-643.

PARKHURST, D. J., AND NIEBUR, E. 2003. Scene Content Selected by Active Vision. Spatial Vision, Vol. 16(2), 125-154.

PASCHOS, G. 2001. Perceptually Uniform Color Spaces for Color Texture Analysis: An Empirical Evaluation. IEEE Trans. Image Process., Vol. 10(6), 932-937.

RUTISHAUSER, U., WALTHER, D., KOCH, C., AND PERONA, P. 2004. Is Bottom-Up Attention Useful for Object Recognition? In CVPR 2004, Vol. 2, 37-44.

SMEULDERS, A. W. M., WORRING, M., AND SANTINI, S. 2000. Content-Based Image Retrieval at the End of the Early Years. IEEE Trans. Pattern Anal. Mach. Intell., Vol. 22(12), 1349-1380.

SPEKREIJSE, H. 2000. Pre-attentive and Attentive Mechanisms in Vision. Perceptual Organization and Dysfunction. Vision Research, Vol. 40, 1179-1638.

TSAI, C. F., MCGARRY, K., AND TAIT, J. 2003. Image Classification Using Hybrid Neural Networks. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, 431-432.

WANG, X. Y., HU, F. L., AND YANG, H. Y. 2006. A Novel Regions-of-Interest Based Image Retrieval Using Multiple Features. In Proceedings of the Multi-Media Modeling International Conference, Vol. 1, 377-380.