Kismet is an expressive robot designed to interact naturally with humans. To facilitate communication it is equipped with cameras and microphones, and it can move its eyes, head, and facial features and produce vocalizations. Its hardware and software are designed to process visual and auditory inputs in real time with minimal latency. The robot uses its perception systems to understand its environment, its attention system to focus on salient stimuli, and its motivation and behavior systems to determine how to respond through motor skills and expressions.
1. KISMET, THE ROBOT
HARDWARE DESIGN
Kismet is an expressive robotic creature with
perceptual and motor modalities tailored to
natural human communication channels. To
facilitate a natural infant-caretaker interaction,
the robot is equipped with visual, auditory, and
proprioceptive sensory inputs. The motor
outputs include vocalizations, facial
expressions, and motor capabilities to adjust
the gaze direction of the eyes and the
orientation of the head. Note that these motor
systems serve both to steer the visual and auditory
sensors toward salient stimuli and to convey communicative cues.
Our hardware and software control architectures
have been designed to meet the challenge of real-time processing of visual signals (approaching 30
Hz) and auditory signals (8 kHz sample rate and
frame windows of 10 ms) with minimal latencies
(less than 500 ms). The high-level perception
system, the motivation system, the behavior
system, the motor skill system, and the face motor
system execute on four Motorola 68332
microprocessors running L, a multi-threaded Lisp
developed in our lab. Vision processing, visual
attention and eye/neck control is performed by
nine networked 400 MHz PCs running QNX (a
real-time Unix operating system). Expressive
speech synthesis and vocal affective intent
recognition runs on a dual 450 MHz PC running
NT, and the speech recognition system runs on a
500 MHz PC running Linux.
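The timing figures above imply concrete per-frame processing budgets. As a rough back-of-the-envelope check (a sketch only; the constants are taken directly from the text, not from Kismet's actual scheduler):

```python
# Per-modality timing budgets implied by the figures in the text:
# ~30 Hz video, 8 kHz audio with 10 ms windows, < 500 ms end-to-end latency.

VIDEO_RATE_HZ = 30
AUDIO_RATE_HZ = 8_000
AUDIO_WINDOW_S = 0.010
LATENCY_BUDGET_S = 0.500

# Time available to process one video frame before the next arrives.
frame_budget_s = 1 / VIDEO_RATE_HZ            # ~0.033 s

# Samples accumulated in one audio analysis window.
samples_per_window = int(AUDIO_RATE_HZ * AUDIO_WINDOW_S)   # 80

# How many video frames fit inside the end-to-end latency budget.
frames_in_budget = int(LATENCY_BUDGET_S * VIDEO_RATE_HZ)   # 15

print(frame_budget_s, samples_per_window, frames_in_budget)
```

In other words, the vision pipeline has roughly 33 ms per frame, and at most about 15 frames can elapse between a stimulus and the robot's response if the 500 ms budget is to be met.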
5. Vision System
The robot's vision system consists of four color CCD cameras
mounted on a stereo active vision head. Two wide field of view
(fov) cameras are mounted centrally and move with respect to
the head. These are 0.25 inch CCD lipstick cameras with 2.2
mm lenses manufactured by Elmo Corporation. They are used
to decide what the robot should pay attention to, and to
compute a distance estimate. There is also a camera mounted
within the pupil of each eye. These are 0.5 inch CCD foveal
cameras with 8 mm focal-length lenses, and are used for
higher resolution post-attentional processing, such as eye
detection. Kismet has three degrees of freedom to control
gaze direction and three degrees of freedom to control its
neck. The degrees of freedom are driven by Maxon DC servo
motors with high resolution optical encoders for accurate
position control. This gives the robot the ability to move and
orient its eyes like a human, engaging in a variety of human
visual behaviors. This is not only advantageous from a visual
processing standpoint; these human-like gaze behaviors also carry
communicative value for the people interacting with the robot.
7. Auditory System
The caregiver can influence the robot's behavior through
speech by wearing a small unobtrusive wireless
microphone. This auditory signal is fed into a 500 MHz PC
running Linux. The real-time, low-level speech processing
and recognition software was developed at MIT by the
Spoken Language Systems Group. These auditory features
are sent to a dual 450 MHz PC running NT. The NT machine
processes these features in real-time to recognize the
spoken affective intent of the caregiver.
8. EXPRESSIVE MOTOR SYSTEM
Kismet has a 15 DoF face that displays a wide assortment of facial
expressions to mirror its ``emotional'' state as well as to serve other
communicative purposes. Each ear has two degrees of freedom that
allow Kismet to perk its ears in an interested fashion, or fold them
back in a manner reminiscent of an angry animal. Each eyebrow can
lower and furrow in frustration, elevate upwards for surprise, or slant
the inner corner of the brow upwards for sadness. Each eyelid can open
and close independently, allowing the robot to wink an eye or blink
both. The robot has four lip actuators, one at each corner of the mouth,
that can be curled upwards for a smile or downwards for a frown. There
is also a single degree of freedom jaw.
10. VOCALIZATION SYSTEM
The robot's vocalization capabilities are generated through an articulatory
synthesizer. The underlying software (DECtalk v4.5) is based on the
Klatt synthesizer, which models the physiological characteristics of the
human articulatory tract. By adjusting the parameters of the synthesizer
it is possible to convey speaker personality (Kismet sounds like a
young child) as well as adding emotional qualities to synthesized
speech (Cahn 1990).
11. THE FRAMEWORK
The system architecture consists of six subsystems: the low-level feature
extraction system, the high-level perception system, the attention
system, the motivation system, the behavior system, and the motor
system. The low-level feature extraction system extracts sensor-based
features from the world, and the high-level perceptual system
encapsulates these features into percepts that can influence behavior,
motivation, and motor processes. The robot has many behaviors in its
repertoire, and several motivations to satiate, so its goals vary over
time. The motor system carries out these goals by orchestrating the
output modalities (actuator or vocal) to achieve them. For Kismet, these
actions are realized as motor skills that accomplish the task physically,
or expressive motor acts that accomplish the task via social signals.
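The dataflow among these subsystems can be caricatured as a pipeline. The sketch below is purely illustrative: the function names, features, and thresholds are invented, and Kismet's real systems run concurrently rather than as one sequential call chain.

```python
# Illustrative caricature of the six-subsystem dataflow described above;
# feature names and thresholds are invented, and the real systems run
# concurrently rather than sequentially.

def low_level_features(raw):
    # Low-level feature extraction: cheap, sensor-based quantities.
    return {"motion": raw["pixels_changed"] > 100}

def perceive(features):
    # High-level perception: encapsulate features into percepts.
    return "moving_object" if features["motion"] else "static_scene"

def behave(percept, drives):
    # Behavior selection, influenced by percepts and motivations.
    return "orient" if percept == "moving_object" else "idle"

def motor(action):
    # Motor system: carry out the chosen course of action.
    return "executing:" + action

print(motor(behave(perceive(low_level_features({"pixels_changed": 500})),
                   {"social": 0.7})))  # executing:orient
```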
13. THE LOW-LEVEL FEATURE EXTRACTION SYSTEM
The low-level feature extraction system is responsible for processing the
raw sensory information into quantities that have behavioral
significance for the robot. The routines are designed to be cheap, fast,
and just adequate. Of particular interest are those perceptual cues that
infants seem to rely on. For instance, visual and auditory cues such as
detecting eyes and the recognition of vocal affect are important for
infants.
14. THE ATTENTION SYSTEM
The low-level visual percepts are sent to the attention system. The purpose
of the attention system is to pick out low-level perceptual stimuli that
are particularly salient or relevant at that time, and to direct the robot's
attention and gaze toward them. This provides the robot with a locus of
attention that it can use to organize its behavior. A perceptual stimulus
may be salient for several reasons. It may capture the robot's attention
because of its sudden appearance, or perhaps due to its sudden
change. It may stand out because of its inherent saliency, the way a red
ball stands out from the background. Or perhaps its quality has
special behavioral significance for the robot, such as being a typical
indication of danger.
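One common way to realize such an attention system is to combine the individual saliency cues into a single score per stimulus and direct gaze at the winner. The cue names and weights below are hypothetical illustrations, not Kismet's actual attention model:

```python
# Hypothetical combination of saliency cues into one score per stimulus;
# cue names and weights are illustrative, not Kismet's actual values.

def saliency(motion, color_pop_out, relevance,
             w_motion=0.4, w_color=0.3, w_relevance=0.3):
    """Each cue is normalized to [0, 1]; the result is a weighted sum."""
    return w_motion * motion + w_color * color_pop_out + w_relevance * relevance

# The attention system would direct gaze at the highest-scoring stimulus.
stimuli = {
    "red_ball":    saliency(motion=0.1, color_pop_out=0.9, relevance=0.5),
    "waving_hand": saliency(motion=0.9, color_pop_out=0.2, relevance=0.8),
}
target = max(stimuli, key=stimuli.get)
print(target)  # waving_hand
```

Sudden appearance or change raises the motion cue, inherent saliency raises the color cue, and behavioral significance raises the relevance cue, matching the three reasons for salience listed above.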
15. THE PERCEPTUAL SYSTEM
The low-level features corresponding to the target stimuli of the attention
system are fed into the perceptual system. Here they are encapsulated
into behaviorally relevant percepts. So that the environment can elicit
processes in these systems, each behavior and emotive response has an
associated releaser. As conceptualized by Tinbergen and Lorenz, a
releaser can be viewed as a collection of feature detectors that are
minimally necessary to identify a particular object or event of
behavioral significance. The function of the releasers is to ascertain if
all environmental (perceptual) conditions are right for the response to
become active.
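A releaser in this Tinbergen/Lorenz sense can be sketched as a simple conjunction of feature detectors: the response is gated on all of them reporting true. The detectors and thresholds below are hypothetical examples, not Kismet's actual releasers:

```python
# A releaser modeled as the conjunction of minimally necessary feature
# detectors; the detectors and thresholds here are illustrative.

def make_releaser(*detectors):
    """A releaser fires only when every required detector reports True."""
    return lambda percept: all(d(percept) for d in detectors)

# Hypothetical detectors for a face-to-face engagement releaser.
face_present = lambda p: p.get("face", False)
close_enough = lambda p: p.get("distance_m", 99.0) < 1.5

greet_releaser = make_releaser(face_present, close_enough)

print(greet_releaser({"face": True, "distance_m": 1.0}))  # True
print(greet_releaser({"face": True, "distance_m": 3.0}))  # False
```

This mirrors the definition above: the releaser ascertains that all perceptual conditions are right before the associated response can become active.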
16. THE MOTIVATION SYSTEM
The motivation system consists of the robot's basic ``drives'' and
``emotions''. The ``drives'' represent the basic ``needs'' of the robot and
are modeled as simple homeostatic regulation mechanisms. When the
needs of the robot are being adequately met, the intensity level of each
``drive'' is within a desired regime. However, as the intensity level
moves farther away from the homeostatic regime, the robot becomes
more strongly motivated to engage in behaviors that restore that
``drive''. Hence the ``drives'' largely establish the robot's own agenda,
and play a significant role in determining which behavior(s) the robot
activates at any one time. The ``emotions'' are modeled from a
functional perspective. Based on simple appraisals of the benefit or
detriment of a given stimulus, the robot evokes positive emotive
responses that serve to bring itself closer to it, or negative emotive
responses in order to withdraw from it. There is a distinct emotive
response for each class of eliciting conditions.
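A homeostatic ``drive'' of this kind can be sketched as a quantity that drifts away from its set point unless satiated, with motivation growing as the intensity leaves the desired regime. The constants below are made-up illustrations, not Kismet's parameters:

```python
# Minimal homeostatic "drive" sketch: intensity drifts away from its set
# point unless satiated, and motivation grows with distance from the
# desired regime. All constants are illustrative, not Kismet's.

class Drive:
    def __init__(self, set_point=0.0, regime_width=0.2, drift=0.05):
        self.intensity = set_point
        self.set_point = set_point
        self.regime_width = regime_width   # half-width of the desired regime
        self.drift = drift                 # drift per time step when neglected

    def step(self, satiation=0.0):
        """Without satiation the drive drifts; satiation pulls it back."""
        self.intensity += self.drift - satiation

    def motivation(self):
        """Zero inside the homeostatic regime, growing linearly outside it."""
        error = abs(self.intensity - self.set_point)
        return max(0.0, error - self.regime_width)

d = Drive()
for _ in range(10):   # the drive is neglected for 10 steps
    d.step()
print(round(d.motivation(), 2))  # 0.3
```

While the intensity stays inside the regime the motivation is zero; once neglect pushes it outside, the drive increasingly biases behavior selection toward actions that restore it.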
17. THE BEHAVIOR SYSTEM
The behavior system organizes the robot's task-based behaviors into a
coherent structure. Each behavior is viewed as a self-interested, goal-directed entity that competes with other behaviors to establish the
current task. An arbitration mechanism is required to determine which
behavior(s) to activate and for how long, given that the robot has
several motivations that it must tend to and different behaviors that it
can use to achieve them. The main responsibility of the behavior
system is to carry out this arbitration. In particular, it addresses the
issues of relevancy, coherency, persistence, and opportunism. By doing
so, the robot is able to behave in a sensible manner in a complex and
dynamic environment.
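One simple arbitration scheme that balances these concerns is winner-take-all selection with a persistence bonus: the currently active behavior gets a small boost so control does not thrash (coherency, persistence), while a sufficiently relevant newcomer can still take over (relevancy, opportunism). All behavior names and values below are hypothetical:

```python
# Winner-take-all behavior arbitration with a persistence bonus; behavior
# names, activation levels, and the bonus are illustrative values only.

def arbitrate(activations, current=None, persistence_bonus=0.1):
    """Pick the behavior with the highest activation, favoring the
    currently active one by a small bonus to avoid thrashing."""
    scores = dict(activations)
    if current in scores:
        scores[current] += persistence_bonus
    return max(scores, key=scores.get)

# Persistence keeps the current task despite a slightly stronger rival.
winner = arbitrate({"seek_person": 0.6, "play_with_toy": 0.55},
                   current="play_with_toy")
print(winner)   # play_with_toy

# A much more relevant behavior still takes over (opportunism).
winner2 = arbitrate({"seek_person": 0.8, "play_with_toy": 0.55},
                    current="play_with_toy")
print(winner2)  # seek_person
```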
18. THE MOTOR SYSTEM
The motor system arbitrates the robot's motor skills and expressions. It
consists of four subsystems: the motor skills system, the facial
animation system, the expressive vocalization system, and the oculomotor system. Given that a particular goal and behavioral strategy have
been selected, the motor system determines how to move the robot so
as to carry out that course of action. Overall, the motor skills system
coordinates body posture, gaze direction, vocalizations, and facial
expressions to address issues of blending and sequencing the action
primitives from the specialized motor systems.
19. SOCIALIZING WITH PEOPLE
Kismet is designed to make use of human social protocol for various
purposes. One such purpose is to make life easier for its vision system.
If a person is visible, but is too distant for their face to be imaged at
adequate resolution, Kismet engages in a calling behavior to summon
the person closer. People who come too close to the robot also cause
difficulties for the cameras with narrow fields of view, since only a small
part of a face may be visible. In this circumstance, a withdrawal
response is invoked, where Kismet draws back physically from the
person. This behavior, by itself, aids the cameras somewhat by
increasing the distance between Kismet and the human. But the
behavior can have a secondary and greater effect through social
amplification -- for a human close to Kismet, a withdrawal response is a
strong social cue to back away, since it is analogous to the human
response to invasions of ``personal space.''
21. ENVELOPE DISPLAYS
Such regulatory mechanisms play roles in more complex social
interactions, such as conversational turn-taking. Here control of gaze
direction is important for regulating conversation rate. In general,
people are likely to glance aside when they begin their turn, and make
eye contact when they are prepared to relinquish their turn and await a
response. People tend to raise their brows when listening or waiting for
the other to speak. Blinks occur most frequently at the end of an
utterance. These envelope displays and other cues allow Kismet to
influence the flow of conversation to the advantage of its auditory
processing. The visual-motor system can also be driven by the
requirements of a nominally unrelated sensory modality, just as
behaviors that seem completely orthogonal to vision (such as ear-wiggling during the call behavior to attract a person's attention) are
nevertheless recruited for the purposes of regulation.
22. OTHER REGULATORY DISPLAYS
Some regulatory displays also help protect the robot. Objects that suddenly
appear close to the robot trigger a looming reflex, causing the robot to
quickly withdraw and appear startled. If the event is repeated, the
response quickly habituates and the robot simply appears annoyed,
since its best strategy for ending these repetitions is to clearly signal
that they are undesirable. Similarly, rapidly moving objects close to the
robot are threatening and trigger an escape response. These
mechanisms are all designed to elicit natural and intuitive responses
from humans, without any special training. But even without these
carefully crafted mechanisms, it is often clear to a human when
Kismet's perception is failing, and what corrective action would help,
because the robot's perception is reflected in behavior in a familiar way.
Inferences made based on our human preconceptions are actually likely
to work.
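The habituating looming response described above can be sketched as a startle whose strength decays with each repetition of the stimulus, leaving only a mild ``annoyed'' display. The decay constant is a made-up illustration:

```python
# Illustrative habituation of the looming startle: the response strength
# decays with each repetition of the stimulus. The decay constant is
# invented, not Kismet's actual parameter.

def startle_strength(event_count, decay=0.5):
    """Response strength halves with each repeated looming event."""
    return decay ** event_count

responses = [startle_strength(n) for n in range(4)]
print(responses)  # [1.0, 0.5, 0.25, 0.125]
```

The first event produces a full withdrawal-and-startle; by the third or fourth repetition the response has decayed enough that only the annoyed display remains, signaling that the repetitions are undesirable.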