Image retrieval technology has been developed for more than twenty years. However, current image retrieval techniques still cannot achieve satisfactory recall and precision. To improve the effectiveness and efficiency of an image retrieval system, a novel content-based image retrieval method combining image segmentation and eye tracking data is proposed in this paper. In the method, eye tracking data are collected by a non-intrusive table-mounted eye tracker at a sampling rate of 120 Hz, and the corresponding fixation data are used to locate the human's Regions of Interest (hROIs) on the segmentation result from the JSEG algorithm. The hROIs are treated as important informative segments/objects and used in image matching. In addition, the relative gaze duration of each hROI is used to weigh the similarity measure for image retrieval. The similarity measure proposed in this paper is based on a retrieval strategy emphasizing the most important regions. Experiments on 7,346 manually annotated Hemera color images show that the retrieval results from our proposed approach compare favorably with conventional content-based image retrieval methods, especially when the important regions are difficult to locate based on visual features.
In this paper, a model using a combination of visual features and eye tracking data is proposed to reduce the semantic gap and improve retrieval performance. The flowchart of our proposed model is shown in Figure 1. After eye tracking data are collected by a non-intrusive table-mounted eye tracker and the image is segmented by the JSEG algorithm, the fixation data are extracted and used to locate the hROIs on the segmented image. The relative gaze duration in each hROI is also computed to weigh its importance. The selected hROIs are treated as important informative segments/objects and used in the image retrieval.

The rest of this paper is organized as follows. Eye tracking data acquisition is described in Section 2. In Section 3, the JSEG algorithm is explained. Then we discuss how to construct an image retrieval model with eye tracking data in terms of region selection, feature extraction, weight calculation as well as similarity measurement in Section 4. In Section 5, experimental results are reported with a comparison of our new approach with conventional image retrieval methods. Finally, a conclusion is drawn and future work is outlined in Section 6.

Figure 1. The flowchart of our proposed model.

2 Eye Tracking Data Acquisition

A non-intrusive table-mounted eye tracker, Tobii X120, was used to collect eye movement data in a user-friendly environment with a high accuracy (0.5 degree) at a 120 Hz sampling rate. The freedom of head movement is 30 x 22 x 30 cm. Before each data collection, a calibration is conducted on a grid of nine calibration points to minimize errors in eye tracking. Fixations (location and duration) were extracted from the raw eye tracking data using a criterion of fixation radius (35 pixels) and minimum fixation duration (100 ms). Each image in the 7,346-image Hemera color image database is viewed by a participant with normal vision within 5 seconds. The viewing distance is approximately 68 cm away from the display, with a display resolution of 1920 x 1200 pixels. The corresponding subtended visual angle is about 41.5° x 26.8°.
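The paper specifies only the two fixation thresholds (35-pixel radius, 100 ms minimum duration), not the detection algorithm itself, so the sketch below assumes a standard dispersion-style detector; the function name and the data layout (gaze samples plus timestamps in milliseconds) are our own illustration, not code from the paper.

```python
import numpy as np

def extract_fixations(gaze_xy, t_ms, radius=35.0, min_dur=100.0):
    """Group consecutive gaze samples into fixations.

    A sample stays in the current group while it lies within `radius`
    pixels of the group's running centroid; a group is kept as a
    fixation if it lasts at least `min_dur` milliseconds.
    Returns a list of (x, y, duration_ms) tuples.
    """
    fixations, group = [], []
    for p, t in zip(np.asarray(gaze_xy, dtype=float), t_ms):
        if group and np.linalg.norm(p - np.mean([g[0] for g in group], axis=0)) > radius:
            # Current sample left the dispersion window: close the group.
            dur = group[-1][1] - group[0][1]
            if dur >= min_dur:
                cx, cy = np.mean([g[0] for g in group], axis=0)
                fixations.append((cx, cy, dur))
            group = []
        group.append((p, t))
    if group and group[-1][1] - group[0][1] >= min_dur:
        cx, cy = np.mean([g[0] for g in group], axis=0)
        fixations.append((cx, cy, group[-1][1] - group[0][1]))
    return fixations
```

At the 120 Hz sampling rate reported above, the 100 ms minimum corresponds to roughly twelve consecutive samples per fixation.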
3 Image Segmentation

A state-of-the-art segmentation algorithm, JSEG [Deng and Manjunath 2001], is used in this paper to segment images into regions based on low-level features (color and texture). The image segmentation step is similar to human bottom-up processing, which can help locate objects and boundaries in an image retrieval system. In our experiment, images are downsized to a maximum width/length of 100 pixels with a fixed aspect ratio before segmentation, which reduces the computational complexity and increases the retrieval efficiency. Figure 2(b) gives some segmentation results.

Figure 2. Representative images (a) with the corresponding segmentation results (b) and eye tracking data acquisition (c).
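As a minimal illustration of the pre-segmentation downsizing step (assuming the maximum dimension is measured in pixels), the following Pillow snippet shrinks an image so its longer side is at most 100 pixels while preserving the aspect ratio; JSEG itself is a separate implementation not shown here, and the function name is our own.

```python
from PIL import Image

def downsize_for_segmentation(path, max_side=100):
    """Shrink an image so max(width, height) <= max_side, keeping aspect ratio."""
    img = Image.open(path).convert("RGB")
    img.thumbnail((max_side, max_side), Image.LANCZOS)  # resizes in place
    return img
```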
4 Image Retrieval Model

A novel image retrieval model using a combination of image segmentation results and eye tracking data is proposed in this section. The aim of the image segmentation step is to simulate the bottom-up processing that coarsely interprets and parses images into coherent segments based on low-level features [Spekreijse 2000]. [Rutishauser et al. 2004] demonstrated that bottom-up attention is partially useful for object recognition, but it is not sufficient. In the second stage of human visual attention, top-down processing, one or a few objects are selected from the whole image for a more thorough analysis [Spekreijse 2000]. The selection procedure is guided not only by elementary features but also by human understanding. Thus, if an image retrieval strategy could incorporate both bottom-up and top-down mechanisms, its accuracy and efficiency would be largely improved. The eye tracking technique provides us with an important signal for understanding which region(s) the user is concerned with, or which object(s) in the image is the target the user wants to search for. Fixation is one of the most important features in eye tracking data; it tells us where the subject's attention points are and how long one fixates on each attention point. Thus, in our proposed model, the fixation data are used to define the hROIs on the segmented image, and the relative gaze duration in each hROI is treated as the corresponding region significance.

4.1 Selection Process of Human's ROI

Here, we use eye tracking data to locate the observer's regions of interest in an image, and an importance value for each segmented region is defined as the relative gaze duration on the region. Some example images with their eye tracking data are shown in Figure 2(c). Suppose that an image $I$ is composed of $N$ segmented regions (Eq. (1)), and the relative gaze duration on the image $I$ is $D$:

$$I = \{S_1, \ldots, S_i, \ldots, S_N\}, \quad (1)$$

where $S_i$ is the $i$th segmented region.

A concept of importance value is introduced to show the degree of the observer's interest in a region. Assume that the relative gaze duration on the segmented region $S_i$ is $d_i$; a corresponding importance value $v_i$ can then be calculated by Eq. (2). The value will be 0 if there is no fixation on the region:

$$v_i = \frac{d_i}{D}, \quad \sum_{i=1}^{N} v_i = 1, \quad (2)$$

where $D = \sum_{i=1}^{N} d_i$.

As shown in Figure 3(a), the popping-out process of human ROIs consists of the following steps. Step 1: Scale the eye tracking data to the segmented map size. Step 2: Determine whether a segmented region is an hROI or not. Step 3: If all segmented regions in the image have been processed, terminate the procedure; otherwise, go to Step 2. Figure 3(b) gives some examples showing the selection results in terms of weighting maps: the higher the importance value, the redder the region in the map. In the next image retrieval step, the selected hROIs are treated as important informative segments/objects, and the corresponding importance values are used as the region significance to weigh the similarity measure.

Figure 3. Selection process of hROIs (a) and weighting maps (b) (the value in the weighting map is the corresponding importance value of the region).
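To make Eqs. (1)-(2) and the selection steps concrete, here is a minimal Python sketch under our own assumptions about the data layout: a label map produced by the segmentation step and fixation tuples already scaled to its size (Step 1). The function name is illustrative, not code from the paper.

```python
import numpy as np

def region_importance(label_map, fixations):
    """Eq. (2): each region's importance value is its share of total gaze time.

    label_map: HxW integer array of segmented-region labels.
    fixations: list of (x, y, duration_ms) tuples in label-map coordinates.
    Returns a dict mapping region label -> importance value v_i.
    """
    durations = {}
    for x, y, dur in fixations:
        r, c = int(round(y)), int(round(x))
        if 0 <= r < label_map.shape[0] and 0 <= c < label_map.shape[1]:
            label = int(label_map[r, c])
            durations[label] = durations.get(label, 0.0) + dur
    total = sum(durations.values())  # D in Eq. (2)
    # Regions with no fixation get an importance value of 0.
    return {int(l): (durations.get(int(l), 0.0) / total if total > 0 else 0.0)
            for l in np.unique(label_map)}
```

By construction the nonzero values sum to 1, matching the normalization in Eq. (2).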
4.2 Feature Extraction

Color and texture properties are extracted from the selected hROIs for the similarity measure. For the color property, the HSV color space is used because it approximates human perception [Paschos 2001]. For the texture property, the Sobel operator is used to produce an edge map from the gray-scale image. A feature set including an 11 x 11 x 11 HSV color histogram and a 1 x 41 texture histogram of the edge map is used to characterize each region.
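The paper fixes the histogram sizes but not the binning or normalization details, so the following OpenCV sketch makes plain assumptions: a uint8 region mask, L1 normalization, and Sobel edge magnitudes binned up to the image maximum. `region_features` is our illustrative name.

```python
import cv2
import numpy as np

def region_features(img_bgr, mask):
    """Describe one hROI by an 11x11x11 HSV histogram and a 41-bin
    histogram of Sobel edge magnitudes, both restricted to the uint8 mask."""
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)
    color_hist = cv2.calcHist([hsv], [0, 1, 2], mask, [11, 11, 11],
                              [0, 180, 0, 256, 0, 256]).flatten()
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    mag = cv2.magnitude(gx, gy)
    edge_hist, _ = np.histogram(mag[mask > 0], bins=41,
                                range=(0.0, float(mag.max()) + 1e-6))
    # L1-normalize so regions of different sizes remain comparable.
    color_hist /= color_hist.sum() + 1e-12
    edge_hist = edge_hist / (edge_hist.sum() + 1e-12)
    return np.concatenate([color_hist, edge_hist])
```

The concatenated vector is what the Euclidean distance in the next subsection operates on.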
4.3 Similarity Measure

An image is represented by several regions in the image retrieval system. Suppose that there are $m$ ROIs from the query image, $R_Q = \{R_{Q1}, \ldots, R_{Qm}\}$, and $n$ ROIs from a candidate image, $R_C = \{R_{C1}, \ldots, R_{Cn}\}$. As discussed in Section 4.1, the corresponding region weight vectors of the query image and the candidate image are $V_Q = \{v_{Q1}, \ldots, v_{Qm}\}$ and $V_C = \{v_{C1}, \ldots, v_{Cn}\}$, respectively. The similarity matrix among the regions of the two images is defined as

$$\Phi = \{\phi_{ij}\} = \{d(R_{Qi}, R_{Cj})\}, \quad i = 1, \ldots, m; \; j = 1, \ldots, n, \quad (3)$$

where $d(R_{Qi}, R_{Cj})$ is the Euclidean distance between the feature vectors of regions $R_{Qi}$ and $R_{Cj}$. The weight matrix, which indicates the importance of the corresponding region similarity measure in the similarity matrix, is defined as

$$W = \{w_{ij}\} = \{v_{Qi} v_{Cj}\}, \quad i = 1, \ldots, m; \; j = 1, \ldots, n. \quad (4)$$

To find the most similar region in the candidate image, a matching matrix is defined as

$$\Lambda = \{\lambda_{ij}\}, \quad i = 1, \ldots, m; \; j = 1, \ldots, n, \quad (5)$$

where

$$\lambda_{ij} = \begin{cases} 1 & \text{if } j = j^{*} \text{ and } j^{*} = \arg\min_{j} \phi_{ij}, \\ 0 & \text{otherwise}, \end{cases} \quad i = 1, \ldots, m. \quad (6)$$

In the matching matrix, exactly one element in each row is 1 and the others are 0. The value of 1 indicates that the corresponding $\phi_{ij}$ in the similarity matrix is the minimum in that row. Thus, the distance between two images in the proposed image retrieval model is defined as

$$s(I_Q, I_C) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} w_{ij} \lambda_{ij} \phi_{ij}}{\sum_{i=1}^{m} \sum_{j=1}^{n} w_{ij} \lambda_{ij}}. \quad (7)$$

When the query image and a candidate image are identical, the distance in Eq. (7) is zero. Thus, for a query image, a smaller distance indicates that there are more matched regions in the candidate image; in other words, the corresponding image is more relevant to the query.
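A compact NumPy sketch of the region-matching distance in Eqs. (3)-(7) follows; the array names and the function itself are illustrative, and it assumes the per-region feature vectors and importance values computed above.

```python
import numpy as np

def image_distance(query_feats, query_wts, cand_feats, cand_wts):
    """Eqs. (3)-(7): weighted best-match distance between two region sets.

    query_feats: (m, d) feature vectors of the query hROIs
    query_wts:   (m,)  importance values v_Qi from Eq. (2)
    cand_feats:  (n, d) feature vectors of the candidate hROIs
    cand_wts:    (n,)  importance values v_Cj
    """
    # Eq. (3): pairwise Euclidean distances phi_ij.
    diff = query_feats[:, None, :] - cand_feats[None, :, :]
    phi = np.linalg.norm(diff, axis=2)
    # Eq. (4): weight matrix w_ij = v_Qi * v_Cj.
    w = np.outer(query_wts, cand_wts)
    # Eqs. (5)-(6): exactly one best-matching candidate region per query region.
    lam = np.zeros_like(phi)
    lam[np.arange(phi.shape[0]), phi.argmin(axis=1)] = 1.0
    # Eq. (7): weighted average distance over the matched pairs.
    return (w * lam * phi).sum() / (w * lam).sum()
```

Ranking a database then amounts to sorting candidates by this distance in ascending order.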
5 Experimental Results and Discussion

5.1 Image Database and Evaluation Criterion

The retrieval experiments are conducted on the 7,346 Hemera color images manually annotated with keywords. Figure 4 shows example images from a few categories. The evaluation criterion for retrieval performance applied here does not simply label images as "relevant" or "irrelevant", but is based on the ratio of matched keywords between the query image and a returned database image. Suppose that the query image and the retrieved image have $M$ and $N$ keywords, respectively, with $P$ matched keywords; then the semantic similarity is defined as

$$S(\text{query image}, \text{retrieved image}) = \frac{P}{(M+N)/2}. \quad (8)$$
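Eq. (8) is straightforward; the one-liner below implements it under our assumption that the keyword annotations can be treated as sets.

```python
def semantic_similarity(query_keywords, retrieved_keywords):
    """Eq. (8): matched keywords P normalized by the mean keyword count."""
    q, r = set(query_keywords), set(retrieved_keywords)
    return len(q & r) / ((len(q) + len(r)) / 2)

# e.g. semantic_similarity({"cow", "grass"}, {"cow", "field", "sky"}) == 0.4
```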
Figure 4. Example images in the image database. (a) Asian architecture; (b) Close-up; (c) People; (d) Landmarks.

Figure 5. Average semantic similarity vs. the number of images returned for the four themes of images shown in Figure 4.

5.2 Performance Evaluation of Image Retrieval

The performance of our proposed image retrieval model on different types of query images (Figure 4) is shown in Figure 5, compared with the following three methods: 1) Global based: retrieval based on the Euclidean distance of the global color and texture histograms of two images; 2) Attention based: the attention-driven image retrieval strategy proposed in [Fu et al. 2006]; 3) Attention object1 based: retrieval using only the first popped-out object in the attention-driven image retrieval strategy [Fu et al. 2006]. The fixation-based image retrieval system is the one proposed in this paper. Figure 5 shows the retrieval results of the four methods mentioned above for different image classes. We can see that our fixation-based method is significantly better than the other methods in the "Asian Architecture" and "People" image classes in terms of average semantic similarity. For the other two image classes, "Close-up" and "Landmarks", our method is better when the number of returned images is not large (20 or smaller), suggesting that our method is more effective.
5.3 Discussion

Our proposed model achieves better retrieval performance than the other three image retrieval methods when the objects are hidden in the background (i.e., the low-level features of the objects are not conspicuous) or when there are multiple objects in the image. For example, for the image shown in Figure 6 (left), "man working out in the gym", Fu et al.'s model places a higher importance value on the white ground and treats the other parts as background, while the man and the woman in the corner are considered the two most important hROIs in our model. A comparison of the hROI selections on the example image for the fixation-based and attention-based approaches is shown in Figure 6 (right). On the other hand, global based image retrieval mixes all the information together and cannot distinguish objects of different significance in the image, especially when the objects are hidden in the background. In our method, the important information in the image can be extracted and well ranked based on the human visual attention process. For example, for the image with a cow on a grass background (Figure 2), the grass has a much larger area than the cow. As a result, global based image retrieval prefers to retrieve images that also have green objects and/or a green background. On the contrary, our method identifies the cow as the most important object in the image, and accordingly the retrieval performance is much improved.

Figure 6. Fixation-based vs. attention-based selection, where the value below the left image is the corresponding importance value.

6 Conclusion

In this paper, we report our study on imitating the human visual attention process for CBIR by combining image segmentation and eye tracking techniques. The JSEG algorithm is used to parse the image into homogeneous sub-regions, and eye tracking data are utilized to locate the hROIs on the segmented image. In the similarity measurement step, each hROI is weighed by its relative fixation duration as the importance value, to emphasize the most important regions. Retrieval results on 7,346 Hemera color images show that our proposed approach compares favorably with conventional CBIR methods, especially when the important regions are difficult to locate based on the visual features of an image. Future work includes collecting eye tracking data during the relevance feedback process and refining both the feature extraction and the weight computation.

Acknowledgements

This work is supported by the Research Grants Council of the Hong Kong Special Administrative Region, China (Project code: PolyU 5141/07E) and the PolyU Grant (Project code: 1-BBZ9).
References

CARSON, C., THOMAS, M., BELONGIE, S., HELLERSTEIN, J. M., AND MALIK, J. 1999. Blobworld: A System for Region-Based Image Indexing and Retrieval. In Proceedings of Visual Information Systems, 509-516.

DENG, Y., AND MANJUNATH, B. 2001. Unsupervised Segmentation of Color-Texture Regions in Images and Videos. IEEE Trans. Pattern Anal. Mach. Intell., Vol. 23(8), 800-810.

FU, H., CHI, Z., AND FENG, D. 2006. Attention-Driven Image Interpretation with Application to Image Retrieval. Pattern Recognition, Vol. 39(9), 1604-1621.

DE GRAEF, P., CHRISTIAENS, D., AND D'YDEWALLE, G. 1990. Perceptual Effects of Scene Context on Object Recognition. Psychological Research, Vol. 52, 317-329.

HENDERSON, J. M., AND HOLLINGWORTH, A. 1998. Eye Movements During Scene Viewing: An Overview. In: Eye Guidance While Reading and While Watching Dynamic Scenes, Underwood, G. (Ed.). Elsevier Science, Amsterdam, 269-293.

LV, Q., CHARIKAR, M., AND LI, K. 2004. Image Similarity Search with Compact Data Structures. In Proceedings of the ACM International Conference on Information and Knowledge Management, 208-217.

MARQUES, O., MAYRON, L., BORBA, G., AND GAMBA, H. 2006. Using Visual Attention to Extract Regions of Interest in the Context of Image Retrieval. In Proceedings of the ACM Annual Southeast Regional Conference, 638-643.

PARKHURST, D. J., AND NIEBUR, E. 2003. Scene Content Selected by Active Vision. Spatial Vision, Vol. 16(2), 125-154.

PASCHOS, G. 2001. Perceptually Uniform Color Spaces for Color Texture Analysis: An Empirical Evaluation. IEEE Trans. Image Process., Vol. 10(6), 932-937.

RUTISHAUSER, U., WALTHER, D., KOCH, C., AND PERONA, P. 2004. Is Bottom-Up Attention Useful for Object Recognition? In CVPR 2004, Vol. 2, 37-44.

SMEULDERS, A. W. M., WORRING, M., AND SANTINI, S. 2000. Content-Based Image Retrieval at the End of the Early Years. IEEE Trans. Pattern Anal. Mach. Intell., Vol. 22(12), 1349-1380.

SPEKREIJSE, H. 2000. Pre-attentive and Attentive Mechanisms in Vision. Perceptual Organization and Dysfunction. Vision Research, Vol. 40, 1179-1638.

TSAI, C. F., MCGARRY, K., AND TAIT, J. 2003. Image Classification Using Hybrid Neural Networks. In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, 431-432.

WANG, X. Y., HU, F. L., AND YANG, H. Y. 2006. A Novel Regions-of-Interest Based Image Retrieval Using Multiple Features. In Proceedings of the Multi-Media Modeling International Conference, Vol. 1, 377-380.