The smartphone presents a set of usability challenges that can be solved only with a combination of all input and output modalities available to the user. In this workshop, we review some basic principles for building highly usable, multimodal applications. The principles will be illustrated through concrete implementation examples.
3. Natural User Interface
Is based on natural elements
Not Natural: Type / Select from a drop-down / Click on a check box
– Using Mouse / Keyboard / Stylus
Natural: Point / Touch / Drag / Speech / Motion
– Using finger, voice, body
Invisible
• Focus is on the task at hand, not on the mediating interface
4. Natural vs Familiar Interfaces
Yet naturalness does not mean ease of use for everyone
Familiarity with a UI can render the UI invisible
Naturalness is crucial for new adoption
5. Natural User Interface: Smartphones
• Key is to enable users to interact with the device effortlessly
• Everywhere mobility
• All-the-time mobility
• Hence the need for multimodality: different ways to interact depending on context
6. Smartphone: Strengths
• Mobility: I can take it with me and use it virtually anywhere.
• Size: It fits in my pocket. I can have it with me anytime.
• Multi-purpose: phone, email, texting, photos, contacts, calendar, etc.
• Identity: It's tied to me personally. It is not tied to a location (as a landline is) or to a family (as a desktop is).
• Personalization: I load up my music, I take my photos, I link to my friends, etc.
• The iPhone is an extension of myself.
• Opt-in automation: When I fire up an application, I chose to fire it up: I chose to self-serve using the application.
7. Smartphone: Weaknesses
• Interactional real estate: forces multi-step interactions
• Informational real estate: I get only a small amount of information at a time before needing to touch the screen to get more (breaks reading/concentration flow)
• Typing is difficult: Typing on a flat surface is a challenge, especially if the surface gets dirty or ages
• Power: Need to charge the device periodically.
8. VUI Weaknesses
Time linearity: unlike graphical interfaces, voice interfaces are linearly coupled with time.
Uni-directionality: When you hear something, you can’t easily go back and listen to it again. Contrast that with reading a piece of text, where you can go back and forth at will.
Invisibility: In a voice interface, there are no easy markers the user can check when they feel lost.
Imposed automation: When people call into a toll-free number, they are usually not calling to use an automated system but rather to talk to a person.
Listening/Speaking: not always the best mode of communication.
9. VUI Strengths
In the cloud: all IVRs run in the cloud.
Easy to start: All users need to do is call a phone number.
Universally accessible: Users can call the IVR from any phone.
Easy to use: All users need to do is listen to instructions and provide input when asked for it.
Uniform deployment: Because the IVR is in the cloud, users are always running the same version of the application.
10. Available Modes in Smartphone
Input
• Touch/Swipe
• Shaking
• Biometrics
• Speech
• Typing
Output
• Images/Videos
• Text
• Audio
• Vibration
11. Smartphone Contexts = All Contexts
• Noisy environment: can’t hear/can’t be heard
• Quiet environment: can’t speak/can’t make noise
• Private information: don’t want to share information
• Hands busy: assembling a chair, can’t touch, can’t type
• Eyes busy: driving, can’t read
13. UI Actions
Input
• Type full text
• Touch/Pick/Swipe
• Speak full phrases/sentences
• Speak partially (give short answers: yes/no)
Output
• Read full text
• Read short text (pick list)
• See (but not read – e.g., colors/shapes)
• Hear language
• Hear beeps/sounds
14. Keys to Effective Smartphone NUI
Key is to enable the user to interact with the device the way the user chooses to
1. User has several modes of interaction at their disposal
2. User is never forced to use any one mode at any time: user chooses what mode to use
3. User can complete any task purely using a single mode
4. User can turn off any given mode at any time and switch it back on at any time
5. Flow progress is not penalized because the user switched modes (i.e., no redoing steps already done or starting over)
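A minimal sketch of principles 2, 3, 4 and 5, assuming a hypothetical `ModeManager` and mode names ("speech", "touch", "typing") chosen for illustration. Any enabled mode can supply any value (so one mode alone can complete the task), modes can be toggled at any time, and collected values are stored independently of the mode that supplied them, so switching modes never discards progress.

```python
from dataclasses import dataclass, field


@dataclass
class ModeManager:
    # Hypothetical mode names; the user decides which ones are on.
    enabled: set = field(default_factory=lambda: {"speech", "touch", "typing"})
    # Progress is keyed by slot, not by mode, so it survives mode switches.
    collected: dict = field(default_factory=dict)

    def set_mode(self, mode: str, on: bool) -> None:
        """Principle 4: turn a mode off or back on at any time."""
        if on:
            self.enabled.add(mode)
        else:
            self.enabled.discard(mode)

    def accept(self, mode: str, slot: str, value: str) -> bool:
        """Principles 2 and 3: accept input for any slot from any enabled mode."""
        if mode not in self.enabled:
            return False
        # Principle 5: storing the value is mode-independent; nothing is redone.
        self.collected[slot] = value
        return True
```

For example, a user can answer "Schwab" by voice, turn speech off, and tap "Market"; both answers remain in `collected`.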
15. Our Focus: Transactional Interactions
Multi-step interactions aimed at solving a problem/accomplishing something
User: What is Chipotle trading at?
App: Chipotle Mexican Grill is at $321.56. Up just a tad.
User: What’s the highest it has been in the last three months?
App: July 10 was highest in the last 3 months, trading at
$344.21.
User: Buy 100 shares.
App: You have Schwab and Fidelity. Which would you like?
User: Schwab.
App: Got it. I see you have an account ending in 2234. Use that account?
User: Yes.
App: OK. 100 shares at Market or at a Specific Price?
User: Market.
App: Got it. That trade has been placed for 100 shares at market. I will send you an email confirmation when the shares are purchased.
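The buy dialogue above is a multi-step transaction that collects several pieces of information. One common way to drive such a flow is slot filling: keep the values gathered so far and prompt for the first one still missing. This is a sketch under assumptions; the slot names, their order and the prompt wording are hypothetical, not the app's actual design.

```python
from typing import Optional

# Hypothetical slots for the "Buy 100 shares" transaction, in prompting order.
REQUIRED_SLOTS = ["quantity", "broker", "account", "order_type"]

PROMPTS = {
    "quantity": "How many shares?",
    "broker": "Which broker would you like?",
    "account": "Which account would you like to use?",
    "order_type": "At Market or at a Specific Price?",
}


def next_prompt(slots: dict) -> Optional[str]:
    """Return the prompt for the first missing slot, or None when the order is complete."""
    for name in REQUIRED_SLOTS:
        if name not in slots:
            return PROMPTS[name]
    return None
```

Because "Buy 100 shares." fills `quantity` in a single utterance, the next prompt is the broker question, which mirrors how the dialogue skips straight to "Which would you like?".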
16. Why Spoken Conversation?
Speech is Natural
Conversation is Natural
Speech is efficient: speaking requires less effort than typing
Use cases
• Dictation
• When searching is easier than selecting
• Several interactions that require simple responses
• Hands are busy
• Eyes are busy
• Short questions from device
• Short responses from user
• Sharing a spoken joke with friends
17. Example of Smartphone Conversations
- Book flight
- Order a Book
- Hotel reservations
- Order flowers
- Order Pizza
- Banking
- Movie tickets
- Restaurant reservations
18. Conversational NUI
- Transaction requires multiple pieces
of information
- Complex requests that can be
efficiently formulated in a sentence:
“What’s the highest it has been in the
last three months?”
- Short responses from user:
“Schwab,” “Yes,” “Market.”
- Short commands from user: “Buy 100
shares.”
19. Why would you want to use voice?
• We speak faster than we type
• We hear faster than we read
• Sound is public: its value lies in carrying across the distance between the source and the destination (though earphones can keep it private)
20. Why would you NOT want to use voice?
• Sound is public: it carries across the distance between the source and the destination, so others can hear it (though earphones can help)
Use cases
• Privacy: sharing personal information or credit card information
• Noisy environment: can’t hear or can’t be heard
21. When to Use Visual
• Privacy
• Accuracy
• Pictures
• Videos
• Long text
22. When Visual is not Optimal
• Input
• Small screen: typing, picking
• Can’t write: small child
• Hands busy
• Output
• Small screen, bad lighting
• Can’t read: small child
• Eyes busy
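The contexts on slide 11 and the cases on slides 20 to 22 can be read as rules that narrow the available modes. A heuristic sketch, assuming a hypothetical `Context` record and the mode names used here; it only suggests defaults, since the user still chooses which mode to use.

```python
from dataclasses import dataclass
from typing import Set, Tuple


@dataclass
class Context:
    noisy: bool = False       # can't hear / can't be heard
    quiet: bool = False       # can't speak / can't make noise
    private: bool = False     # personal or credit card information
    hands_busy: bool = False  # can't touch, can't type
    eyes_busy: bool = False   # can't read


def choose_modes(ctx: Context) -> Tuple[Set[str], Set[str]]:
    """Return (input_modes, output_modes) suggested for the context."""
    inputs = {"speech", "touch", "typing"}
    outputs = {"audio", "text", "images"}
    if ctx.noisy or ctx.quiet or ctx.private:
        # Sound is public: drop spoken input and audio output.
        inputs.discard("speech")
        outputs.discard("audio")
    if ctx.hands_busy:
        inputs -= {"touch", "typing"}
    if ctx.eyes_busy:
        outputs -= {"text", "images"}
    return inputs, outputs
```

With the eyes busy (driving), only audio output remains; with the hands busy (assembling a chair), only speech input remains. If a context rules out every input mode, the sketch returns an empty set and the app must leave the choice to the user.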
23. How Visual helps Audio
• Redundancy
• Visual confirmation
• No-match issues: present a menu to select an option, or a keyboard to type
• Help: visual help is more effective than spoken help
• Complementary info: show the bill/show the device
• When visual is needed: location in the bill
• Summary of information collected
• Enable the user to quickly correct information provided earlier
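The no-match case above can be sketched as a fallback policy: reprompt by voice a limited number of times, then show a menu so the user can pick or type. The function name, the exact-match rule and the limit of two misses are hypothetical choices for illustration.

```python
from typing import List, Optional, Tuple


def handle_speech(heard: Optional[str], options: List[str], misses: int,
                  max_misses: int = 2) -> Tuple[str, Optional[str], int]:
    """Return (action, chosen_option, updated_miss_count) for one spoken answer.

    action is "accept", "reprompt", or "show_menu" (visual fallback).
    """
    if heard is not None:
        for option in options:
            # Simple case-insensitive match; a real recognizer returns scored results.
            if heard.strip().lower() == option.lower():
                return "accept", option, 0
    misses += 1
    if misses >= max_misses:
        # Switch to the visual channel: present a menu or a keyboard.
        return "show_menu", None, misses
    return "reprompt", None, misses
```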
24. The Elements of Conversation
Actions
• Start/Initiate
• Take turn/give turn
• Interrupting
• Pausing
• Resuming
• Repeating
• Starting over
• Ending/Terminating
States
• Speaking
• Listening
• Paused
• Processing/Thinking
Context
• Point in conversation
• Information
26. Conversation Signaling
A crucial part of communication is signaling states and state transitions
• States
• Initial
• Paused
• Processing/Thinking
• Speaking
• Transition between states
27. State Signaling
State         Visual   Audio
Initial       Yes      No
Paused        Yes      No
Processing    Yes      Yes
Listening     Yes      No
Speaking      Yes      Yes
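The signaling table above can be kept as data, so the UI looks up which channels to use when it enters a state. A minimal sketch; the `State` enum and `signals_for` helper are hypothetical names, and the values are transcribed directly from the table.

```python
from enum import Enum
from typing import List


class State(Enum):
    INITIAL = "initial"
    PAUSED = "paused"
    PROCESSING = "processing"
    LISTENING = "listening"
    SPEAKING = "speaking"


# (visual signal, audio signal) per state, as in the table.
SIGNALS = {
    State.INITIAL: (True, False),
    State.PAUSED: (True, False),
    State.PROCESSING: (True, True),
    State.LISTENING: (True, False),
    State.SPEAKING: (True, True),
}


def signals_for(state: State) -> List[str]:
    """Return the channels used to signal the given state."""
    visual, audio = SIGNALS[state]
    channels = []
    if visual:
        channels.append("visual")
    if audio:
        channels.append("audio")
    return channels
```

Every state is signaled visually, while audio is reserved for processing and speaking.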
34. State Transition Signaling
User wishes to speak: user taps
User finished speaking:
• User stops talking, or
• User taps
Lexee is listening: Lexee plays a start-listening sound mark and changes the state visual
Lexee is finished listening: Lexee plays a finished-listening sound mark and changes the state visual
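The transitions on this slide can be sketched as a small state machine that emits a sound mark and a visual change on each transition. This is not Lexee's implementation; the class, state names and sound identifiers are hypothetical. As an additional assumption, a tap in any state other than listening or processing is treated as the user wishing to speak, which covers interrupting while the app is speaking.

```python
from typing import List


class Conversation:
    """Turn-taking sketch; sound names are hypothetical placeholders."""

    def __init__(self) -> None:
        self.state = "initial"
        self.events: List[str] = []  # signals emitted, in order

    def _enter(self, new_state: str, sound: str) -> None:
        self.state = new_state
        self.events.append(sound)                  # audio sound mark
        self.events.append(f"visual:{new_state}")  # change the state visual

    def user_tap(self) -> None:
        if self.state == "listening":
            # A tap while listening means the user has finished speaking.
            self._enter("processing", "finished_listening_sound")
        elif self.state != "processing":
            # Otherwise a tap means the user wishes to speak.
            self._enter("listening", "start_listening_sound")

    def user_stopped_talking(self) -> None:
        # End-of-speech detection is the other way a user turn ends.
        if self.state == "listening":
            self._enter("processing", "finished_listening_sound")
```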
35. Initiating Conversation
• When the user starts the conversation, should the application say something?
• Should it say nothing and only show something?
36. Pausing
Explicit Pausing
• User Says “Pause”
• User swipes
Implicit Pausing
• User doesn't respond
• User says the wrong thing several times in a row
• User minimizes the app
When should the user be allowed to pause?
• Anytime?
• How about when the app is processing a transaction?
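The slide leaves open when pausing should be allowed. One possible policy, sketched under assumptions: explicit and implicit triggers pause the conversation, except while a transaction is being processed, when the pause is deferred. The function name, the 10-second silence limit and the three-miss limit are hypothetical.

```python
from typing import Optional


def pause_reason(said: Optional[str], swiped: bool, silent_seconds: float,
                 consecutive_no_matches: int, app_minimized: bool,
                 processing_transaction: bool,
                 silence_limit: float = 10.0, no_match_limit: int = 3) -> Optional[str]:
    """Return "explicit" or "implicit" if the conversation should pause, else None."""
    if processing_transaction:
        # One answer to "anytime?": defer pausing until the transaction completes.
        return None
    if swiped or (said or "").strip().lower() == "pause":
        return "explicit"
    if (app_minimized or silent_seconds >= silence_limit
            or consecutive_no_matches >= no_match_limit):
        return "implicit"
    return None
```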
37. Resuming
When resuming
• Should it pick up where it left off in the prompt?
• Should it play the prompt again?
• Should it take the length of the pause into account to determine which one?
• What if the pause lasted a few seconds: retrieving an address
• What if the pause lasted a few days: given up on ordering
• If we want to pick up from where we left off:
• How long is the context to be remembered?
• Should the user be given a summary of what had happened so far?
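Taking the pause length into account can be sketched as a threshold policy: continue after a short pause, recap and replay the prompt while the context is still remembered, and start over after that. The function name, action names and both thresholds (30 seconds, 1 day) are hypothetical.

```python
from datetime import timedelta


def resume_action(pause: timedelta,
                  short: timedelta = timedelta(seconds=30),
                  context_ttl: timedelta = timedelta(days=1)) -> str:
    """Pick how to resume based on how long the conversation was paused."""
    if pause <= short:
        # e.g. the user stepped away to retrieve an address
        return "continue_prompt"
    if pause <= context_ttl:
        # context is still remembered: summarize so far, then replay the prompt
        return "summarize_and_replay"
    # e.g. days later: the user has likely given up on ordering
    return "start_over"
```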
38. Interactional Investment
Look ahead
• Don't waste the user's time (service down/account suspended)
Provide GUI for reviewing collected input
Provide GUI for changing collected input
• Enable the user to change collected values via the GUI
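Both ideas on this slide protect the user's investment. Looking ahead means running blocking checks before collecting any input. Changing collected input means editing one value without discarding the rest. A sketch with hypothetical helper names and check names:

```python
from typing import Callable, Dict, List


def preflight(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run blocking checks before any input is collected; return the names that failed."""
    return [name for name, ok in checks.items() if not ok()]


def edit_value(collected: Dict[str, str], slot: str, new_value: str) -> Dict[str, str]:
    """Change one collected value (e.g. from a GUI review screen), keeping all the others."""
    updated = dict(collected)
    updated[slot] = new_value
    return updated
```

If `preflight` reports a failure such as a suspended account, the app can say so before asking the first question rather than after the user has answered them all.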
39. Multi-Modality
Reinforcing
• Audio and Visual match
Clashing
• Audio input and visual input don’t match
• Should the interface pick the first that came in?
• Should it privilege one over the other: e.g., assume that audio is misrecognized and go with visual?
• Should it pick up both and signal ambiguity: ask the user to resolve?
Alternative
• Audio OR Visual
Complementary
• Audio AND Visual together: e.g., “Take me here.” spoken while pointing at a location
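The three options listed for clashing input can be written as interchangeable resolution strategies. A sketch with hypothetical names (`ModalInput`, `resolve`, and the strategy strings "first", "prefer_visual" and "ask"). Matching inputs are treated as reinforcing, and a return value of None means the interface should signal the ambiguity and ask the user.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModalInput:
    mode: str         # "audio" or "visual"
    value: str
    timestamp: float  # seconds since the prompt


def resolve(a: ModalInput, b: ModalInput, strategy: str = "ask") -> Optional[str]:
    """Resolve two inputs for the same slot; None means ask the user."""
    if a.value == b.value:
        return a.value  # reinforcing: audio and visual match
    if strategy == "first":
        return min(a, b, key=lambda i: i.timestamp).value
    if strategy == "prefer_visual":
        # Assume the audio was misrecognized and go with the visual input.
        for candidate in (a, b):
            if candidate.mode == "visual":
                return candidate.value
    # "ask" (or no visual input available): signal ambiguity.
    return None
```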