We have developed a low-latency combined eye and head tracker suitable for teleoperating a remote robotic head in real-time. Eye and head movements of a human (wizard) are tracked and replicated by the robot with a latency of 16.5 ms. The tracking is achieved by three fully synchronized cameras attached to a head mount. One forward-looking, wide-angle camera is used to determine the wizard's head pose with respect to LEDs on the video monitor; the other two cameras are for binocular eye tracking. The whole system operates at a sample rate of 220 Hz, which allows the capture and reproduction of biological movements as precisely as possible while keeping the overall latency low. In future studies, this setup will be used as an experimental platform for Wizard-of-Oz evaluations of gaze-based human-robot interaction. In particular, we will address the question of the extent to which aspects of human eye movements need to be implemented in a robot in order to guarantee a smooth interaction.
… of the robot. Besides a constant low latency, a high sampling rate of around 200 Hz is required to correctly reproduce the human movements in dynamic situations. A system that is able to measure both head and eye movements simultaneously is preferred, since it does not require a subsequent complex synchronization between the different components. To summarize, the requirements are low latency, a high sampling rate, and combined head and eye tracking.
Numerous methods for combined measurements of eye and head movements can be found in the literature. [Huebner et al. 1992] placed an additional search coil on the subject's forehead to investigate the relationship between the vestibulo-ocular reflex and smooth pursuit. While the search coil offers superior spatial and temporal resolution, it is an invasive method requiring contact lenses with wires, which limits the examination time of a subject to half an hour. [Allison et al. 1996] combined a video-based eye tracker with a magnetic head tracker for dynamic testing of the vestibular system. While this is an interesting approach, we prefer to avoid integrating different systems for the above reasons. Commercially available remote eye trackers such as the Tobii T120 are not suitable, because they have a latency of up to 33 ms and cannot measure head pose. An increasingly popular method for measuring combined eye and head movements is the use of computer vision [Smith et al. 2000; La Cascia et al. 2000]. The face is observed by a camera, and a head model is matched against the image to determine the head pose. Then the eyes are detected in the image, and the gaze direction is calculated with respect to the head direction. While this is the least invasive method (no physical contact with the subject is required), it lacks the necessary spatial and, most notably, temporal resolution. Head-mounted eye trackers have the advantage of providing sufficient spatial resolution, but they do not track head pose per se. Consequently, head tracking must be added to such a system for a suitable solution. [Cornelissen et al. 2002] introduced such a system, in which head position is inferred from four LEDs on the edges of a computer screen tracked by a head-mounted scene camera.
Our system uses a head-mounted eye tracker reported on earlier [Schneider et al. 2009a], to which an infrared scene camera was added that tracks LEDs affixed to the corners of a computer screen (Fig. 1; the presence of a fifth LED is explained later). The eye tracker and the scene camera can be operated at up to 600 Hz. The system was previously designed to record each frame at exactly the same point in time, an attribute that was easily extended to the third camera. In contrast to Cornelissen's setup, our system additionally tracks both eyes simultaneously in order to determine the distance to the point of gaze from vergence eye movements. Normally, the point of gaze should lie on the computer screen; our binocular eye tracker can therefore discriminate between fixations on the screen plane and measurement artifacts, which are indicated by invalid depth information. Because of the low latency of the original eye tracker and its high sampling rate, the system meets all requirements and is fully customizable and synchronized for measuring dynamic eye and head movements.
1.2 Paper Outline

In the following, we present the algorithms used for eye and head tracking and provide information on the model used for calibrating the complete system. We also describe how the eye tracker fits into the Wizard-of-Oz setup. Finally, we evaluate the system resolution, latency, and accuracy.
2 Methods

This section describes the methods used for eye and head tracking and gives an overview of the robot setup.

2.1 Eye Tracking

The binocular eye tracker consists of two infrared cameras mounted laterally on a pair of swimming goggles. The eyes are illuminated by infrared light sources in the goggle frame. A translucent hot mirror in front of each eye reflects only the infrared light in the direction of the camera. Figure 1 shows a picture of the setup.

A calibration laser is positioned between and slightly above the user's eyes. The laser projects a calibration pattern onto a wall in front of the user by means of a diffraction grating [Pelz and Canosa 2001]. For sufficiently large projection distances (>5 m), the parallax error introduced by the translation between eyeball center and laser becomes irrelevant, which allows the laser pattern to be used for calibration. Once calibrated, the eye tracker provides two-dimensional gaze direction values for each eye, with the primary position being parallel to the central laser beam.

The additional, forward-looking scene camera used for head tracking is placed on the center of the forehead (see Fig. 1). All cameras run at a resolution of 188×120 pixels and are synchronized at 220 Hz, i.e., they take pictures at the same time and do not drift apart.

2.2 Head Tracking

Head tracking is accomplished by analyzing the images of infrared LEDs located at the corners of a computer monitor [Cornelissen et al. 2002]. The head tracking algorithm involves two steps. First, the marker LEDs are detected by image processing and assigned to their corresponding projections in the image. Then the position and orientation of the LED plane are computed with respect to the scene camera, which, in turn, also provides the head position and orientation with respect to the LED plane.

2.2.1 LED Detection and Assignment

To facilitate image processing, infrared LEDs are used and the scene camera is equipped with an infrared filter. The shutter and gain values can be adjusted so that only the LEDs appear in the image, as bright white spots. This allows robust and fast LED detection. The positions of the LED projections are determined with subpixel accuracy using a center-of-mass algorithm.
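As an illustration of this step, a center-of-mass detector over a thresholded infrared frame might look as follows. This is a minimal sketch, not the published implementation; the fixed threshold and the use of SciPy's connected-component labeling are assumptions.

```python
import numpy as np
from scipy import ndimage

def detect_leds(frame, threshold=200):
    """Return subpixel centers of bright spots in an 8-bit IR image.

    Each connected region above `threshold` is treated as one LED;
    its center is the intensity-weighted center of mass.
    """
    mask = frame > threshold
    labels, n = ndimage.label(mask)            # connected components
    # center_of_mass weights each pixel of a component by its intensity
    centers = ndimage.center_of_mass(frame, labels, range(1, n + 1))
    return [(x, y) for (y, x) in centers]      # (row, col) -> (x, y)
```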
A plane in 3D space is defined by three points. If only their two-dimensional projections onto a camera image plane are known, a fourth point and knowledge of the correspondences between the projections and their original points are needed to reconstruct the original pose in 3D space.

The ambiguity that arises with four LEDs can be resolved in two different ways. First, the head roll angle can be constrained to remain within ]−45°; 45°[. The relations between the points are then given by the quadrants in which they lie with respect to the center of mass of all points. While this might be acceptable when the subject sits in front of a computer monitor, it is not a suitable method if the plane of interest lies flat on the surface of a table and the subject is allowed to move around freely. With such an application in mind, a second solution was implemented, in which an additional LED is placed between the upper left and right corners (see Fig. 1); the resulting asymmetry makes the assignment unique for arbitrary head orientations.
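A sketch of the first, quadrant-based assignment under the stated roll constraint could look like this (names and conventions are illustrative; the actual implementation is not published):

```python
import numpy as np

def assign_corners(points):
    """Assign four detected LED projections to the monitor corners.

    Valid only while head roll stays within ]-45 deg; 45 deg[, so that
    each corner remains in its quadrant relative to the centroid.
    """
    pts = np.asarray(points, dtype=float)      # shape (4, 2), image coords
    cx, cy = pts.mean(axis=0)                  # center of mass of all points
    corners = {}
    for p in pts:
        left, top = p[0] < cx, p[1] < cy       # image y grows downward
        key = ('top' if top else 'bottom') + '_' + ('left' if left else 'right')
        corners[key] = p
    assert len(corners) == 4, "ambiguous configuration"
    return corners
```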
2.2.2 Head Pose Estimation

To maximize wizard mobility, the scene camera is equipped with a wide-angle lens (2.1 mm focal length, 1/3″), which results in a horizontal view angle of 145°. As this requires compensation for radial and tangential distortion, we used the Camera Calibration Toolbox for Matlab (http://www.vision.caltech.edu/bouguetj/calib_doc/) to determine the intrinsic camera parameters.
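For illustration, the same radial-plus-tangential distortion model is available in OpenCV, so the mapping from raw LED pixels to undistorted normalized coordinates could be sketched as follows (the intrinsic values below are placeholders, not our calibration results):

```python
import numpy as np
import cv2

# Placeholder intrinsics for a 188x120 wide-angle scene camera;
# real values come from the offline camera calibration.
K = np.array([[95.0, 0.0, 94.0],
              [0.0, 95.0, 60.0],
              [0.0, 0.0, 1.0]])
dist = np.array([-0.35, 0.12, 0.001, 0.001, 0.0])  # k1 k2 p1 p2 k3

def undistort_leds(led_pixels):
    """Map raw LED pixel positions to undistorted normalized image
    coordinates (x/z, y/z) using the calibrated distortion model."""
    pts = np.asarray(led_pixels, np.float32).reshape(-1, 1, 2)
    return cv2.undistortPoints(pts, K, dist).reshape(-1, 2)
```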
Pose estimation of the computer monitor is based on an algorithm described by [Shapiro and Haralick 1993]: three lines in a plane, one of them perpendicular to the other two, can be reconstructed from their projections. In our case, the upper, lower, and right borders of the screen are used. This allows the scene camera position and orientation to be calculated with respect to the monitor. Since the eye tracker uses the coordinate system defined by the calibration pattern, the rotation and translation between the scene camera and the calibration laser must be taken into account. The translation can be easily measured. To determine the rotation, the calibration laser is turned on and pointed toward the computer monitor, which displays the estimated calibration points; the rotation parameters are then adjusted until both sets of points match. These parameters are systematic and have to be measured only once. The point at which the laser calibration pattern intersects the monitor plane can now be predicted.
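The line-based reconstruction of [Shapiro and Haralick 1993] is not reproduced here, but the equivalent problem, recovering the pose of a plane with known LED geometry from its undistorted projections, can be sketched with a standard PnP solver. This is an alternative formulation, not the algorithm we use, and the LED coordinates are placeholder values:

```python
import numpy as np
import cv2

# Known LED positions on the monitor plane (meters; placeholder
# values for a 40 cm x 30 cm screen, z = 0 by construction).
LEDS_3D = np.array([[0.0, 0.0, 0.0], [0.4, 0.0, 0.0],
                    [0.4, 0.3, 0.0], [0.0, 0.3, 0.0]], np.float32)

def camera_pose_from_leds(leds_norm):
    """Estimate the scene-camera pose relative to the LED plane from
    undistorted normalized LED projections (same order as LEDS_3D)."""
    img = np.asarray(leds_norm, np.float32).reshape(-1, 1, 2)
    # Identity intrinsics: the input points are already normalized.
    ok, rvec, tvec = cv2.solvePnP(LEDS_3D, img, np.eye(3), None)
    R, _ = cv2.Rodrigues(rvec)                 # rotation matrix
    # Head (camera) pose w.r.t. the monitor is the inverse transform.
    return R.T, -R.T @ tvec
```

The fifth, top-edge LED would simply enter such a solver as an additional correspondence.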
2.3 Combined Tracking

To determine the intersection between the user's line of sight and the monitor, the centers of the eyeballs must be identified. They are found in a separate calibration step, in which the user has to fixate two known points on the monitor without moving his head. The center of each eyeball can then be calculated as the intersection of the two lines of sight, each of which passes through one of the fixated points. From then on, the system is fully calibrated, and the intersection between each eye's line of sight and the computer monitor can be determined.
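Both steps reduce to elementary geometry. A minimal sketch, assuming all quantities are expressed in a common head-fixed coordinate frame (function and variable names are illustrative):

```python
import numpy as np

def eyeball_center(points, directions):
    """Least-squares intersection of the lines of sight.

    Each line runs through a fixated monitor point `points[i]` along
    the measured gaze direction `directions[i]`. Two fixations
    suffice; additional ones are averaged in the same way.
    """
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for p, d in zip(points, directions):
        d = d / np.linalg.norm(d)
        P = np.eye(3) - np.outer(d, d)   # projector orthogonal to the line
        A += P
        b += P @ p
    return np.linalg.solve(A, b)         # point closest to all lines

def gaze_on_monitor(center, direction, plane_point, plane_normal):
    """Intersect one eye's line of sight with the monitor plane."""
    t = (plane_normal @ (plane_point - center)) / (plane_normal @ direction)
    return center + t * direction
```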
3 Robot Setup

The remote-controlled robot is based on a commercially available platform (MetraLabs, Ilmenau, Germany). It is equipped with a movable neck and pivotable eyes. The original eyes could not match human eye performance, so they were replaced with our own camera orientation devices [Schneider et al. 2009b]. These active robotic vision cameras (see Fig. 1) surpass human eye movements in terms of angular velocity, acceleration, and bandwidth by a factor of five. Thus, the robotic system is able to reproduce the eye movement dynamics of the wizard exactly.

Additionally, an extra wide-angle scene camera was mounted on the robot's neck, by means of which the wizard oversees the experimental scene. The camera is mounted near the pivot point of the head and eyes. After calibration, the wizard's head and eye movements can be directly mapped to head and eye movements of the robot. Since the robot's scene camera is fixed with respect to its body, the scene view remains stable and thus does not generate any visual feedback effects.
4 Results

This section explains the rationale behind the chosen geometrical setup and gives details on the resolution of the head tracker as well as the accuracy and latency of the whole system.

4.1 Geometrical Setup

Since the linear region of the robot's scene camera (ca. 90°) had to cover the human wizard's field of view, the wizard was placed 35 cm in front of a 20″ monitor. At this distance, the wide-angle head tracking camera still allowed for head movements of 40° in yaw and 20° in pitch.

4.2 Head Tracker Resolution

The resolution of the head tracker was determined by mounting the goggles 32 cm in front of the monitor and measuring head position and orientation noise for 1.5 s (see Fig. 2). Table 1 shows the root mean square (RMS) of each component.

[Figure 2: five noise traces over 1.5 s; the x, y, and z panels span ±0.2 mm, the Hor and Ver panels span ±0.02°]

Figure 2: The noise of the head tracking algorithm, with the goggles fixated at a distance of 32 cm from the computer monitor. Horizontal (x) and vertical (y) head positions as well as distance from the monitor (z) are shown in [mm]. The horizontal and vertical head angles are given in [°].

  Hor. Pos.   Vert. Pos.  Distance    Angle Hor.  Angle Ver.
  0.0395 mm   0.0319 mm   0.0157 mm   0.0051°     0.0045°

Table 1: Resolution of the head tracker measured at a distance of 32 cm from the monitor (RMS values).
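The RMS values in Table 1 correspond to the following computation over each recorded 1.5 s trace (a sketch; subtracting the mean as the zero reference is our assumption):

```python
import numpy as np

def rms(trace):
    """Root mean square of a noise trace about its mean value."""
    trace = np.asarray(trace, float)
    return np.sqrt(np.mean((trace - trace.mean()) ** 2))
```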
4.3 System Accuracy

A healthy subject was instructed to fixate 20 points on a computer monitor at a distance of 35 cm (see Fig. 3: o, +, x). The stimulus points (o) were arranged in a grid with an inter-point distance of 6.8 cm. The intersections between the monitor plane and the left eye's line of sight were plotted; the subject looked once straight at the monitor (+) and then again with the head turned 12.5° to the right (x). Horizontal eye movements were in the range [−25°; 15°] when looking straight at the monitor and in the range [−10°; 25°] when the head was turned to the right. Vertical eye movements stayed within [−10°; 20°]. The accuracy was lower near the bottom edge, because the eyelashes distorted the pupil image. For our application, the accuracy achieved was more than sufficient.

[Figure 3: fixation points plotted on the monitor plane; axes Monitor x [cm] from 0 to 40 and Monitor y [cm] from 0 to 30]

Figure 3: Subject fixating a validation grid (o, 6.8 cm) at a distance of 35 cm, when looking both straight at the monitor (+) and with the head turned 12.5° to the right (x).

4.4 System Latency

In order to determine the system latency, a gyroscope was attached to an artificial eye that was mounted on a servo motor. The eye tracking computer also output the pupil position immediately after calculation. A second gyroscope was mounted on one of the robot's eyes. Colored noise with a cutoff frequency of 10 Hz and a duration of 10 s was used to drive the artificial eye. Fig. 4 shows an example of the resulting velocity profiles. The eye tracker detects the movement of the artificial eye after 5 ms (which equals one frame period of 4.5 ms plus 0.5 ms for transmission and calculation). The overall latency between movement of the artificial eye and movement of the robot's eye is 16.5 ms.

[Figure 4: top panel, relative velocity of Art. Eye, Eye Tracking, and Robot Eye over 0 to 500 ms; bottom panel, correlation coefficients of Eye Tracking and Robot Eye against the artificial eye for lags of −100 to 100 ms]

Figure 4: Latency between movement of an artificial eye (Art. Eye), calculation of the pupil position (Eye Tracking), and movement of the robot's eyes (Robot Eye). The artificial eye was driven by colored noise with a cutoff frequency of 10 Hz. The correlation functions reveal a delay of 5 ms until the pupil is detected, and an overall delay of 16.5 ms between movement of the artificial eye and movement of the robot's eyes.
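The delays in Fig. 4 are the lags that maximize the cross-correlation between the velocity traces. A sketch of that estimate, assuming equal-length signals resampled to a common rate:

```python
import numpy as np

def latency_ms(reference, delayed, fs=1000.0):
    """Latency of `delayed` w.r.t. `reference` via cross-correlation.

    Both inputs are equal-length velocity traces sampled at `fs` Hz,
    e.g. the artificial-eye gyroscope signal and the robot-eye
    gyroscope signal; the returned value is the lag (in ms) that
    maximizes their correlation.
    """
    a = reference - np.mean(reference)
    b = delayed - np.mean(delayed)
    corr = np.correlate(b, a, mode='full')    # lags -(N-1)..(N-1)
    lag = np.argmax(corr) - (len(a) - 1)      # delay in samples
    return 1000.0 * lag / fs
```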
5 Conclusion and Future Work

This novel system allows real-time teleoperation of a robot's head and eyes through synchronized, combined tracking of a human wizard's head and eyes. The accuracy and workspace are well suited for the given application. The overall latency between the wizard's eye movement and the eye movement of the robot is roughly 5 ms for image acquisition and processing, plus about 11.5 ms for transmitting the commands to the robot and moving its eyes, i.e., 16.5 ms in total. This is already on the order of the fastest human reflex, the vestibulo-ocular reflex, which has a delay of about 10 ms. In one of the next steps, the motor latency will be further decreased by taking non-linearities at low speeds into account.

The research platform described above can now be used to conduct experiments that will help to define the critical aspects of gaze-based human-robot interaction. With its ability to detect fixations in 3D space relative to any given rectangular surface by intersecting the lines of sight of both eyes, the combined eye and head tracker will in a future step be used together with a 3D monitor. The robot will then be equipped with a stereo scene camera, the output of which will be presented to the wizard on a stereo monitor. Thus, the robot will also be able to correctly reproduce the wizard's vergence eye movements.

Acknowledgements

The authors would like to thank our project partner, Frank Wallhoff (MMK, TU München), for providing the robot ELIAS. We also thank Judy Benson for critically reading the manuscript. This work is supported by the Bundesministerium für Bildung und Forschung (IFB, LMU) and in part by the DFG excellence initiative research cluster "Cognition for Technical Systems – CoTeSys"; see also www.cotesys.org.
References

ALLISON, R., EIZENMAN, M., AND CHEUNG, B. 1996. Combined head and eye tracking system for dynamic testing of the vestibular system. IEEE Transactions on Biomedical Engineering 43, 11, 1073–1082.

CORNELISSEN, F., PETERS, E., AND PALMER, J. 2002. The Eyelink Toolbox: Eye tracking with MATLAB and the Psychophysics Toolbox. Behavior Research Methods, Instruments, and Computers 34, 4, 613–617.

HUEBNER, W., LEIGH, R., SEIDMAN, S., THOMAS, C., BILLIAN, C., DISCENNA, A., AND DELL'OSSO, L. 1992. Experimental tests of a superposition hypothesis to explain the relationship between the vestibuloocular reflex and smooth pursuit during horizontal combined eye-head tracking in humans. Journal of Neurophysiology 68, 5, 1775–1792.

LA CASCIA, M., SCLAROFF, S., AND ATHITSOS, V. 2000. Fast, reliable head tracking under varying illumination: An approach based on registration of texture-mapped 3D models. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 4, 322–336.

PELZ, J., AND CANOSA, R. 2001. Oculomotor behavior and perceptual strategies in complex tasks. Vision Research 41, 25-26, 3587–3596.

SCHNEIDER, E., VILLGRATTNER, T., VOCKEROTH, J., BARTL, K., KOHLBECHER, S., BARDINS, S., ULBRICH, H., AND BRANDT, T. 2009a. EyeSeeCam: An eye movement-driven head camera for the examination of natural visual exploration. Annals of the New York Academy of Sciences 1164, 461–467.

SCHNEIDER, E., KOHLBECHER, S., BARTL, K., AND WALLHOFF, F. 2009b. Experimental platform for Wizard-of-Oz evaluations of biomimetic active vision in robots. In 2009 IEEE International Conference on Robotics and Biomimetics (ROBIO).

SHAPIRO, L., AND HARALICK, R. 1993. Computer and Robot Vision, vol. 2. Addison-Wesley, ch. 13, 73f.

SMITH, P., SHAH, M., AND DA VITORIA LOBO, N. 2000. Monitoring head/eye motion for driver alertness with one camera. In Proceedings of the 15th International Conference on Pattern Recognition, vol. 4.