7. “Iterative design, with its repeating cycle of design and testing, is the only validated methodology in existence that will consistently produce successful results. If you don’t have user-testing as an integral part of your design process you are going to throw buckets of money down the drain.”
Bruce Tognazzini
9. What is Evaluation?
• Concerned with gathering data about the
quality of a design/product (UI)
• User performance, Usability
• User experience, User acceptance, …
• Why Evaluate?
• To validate/refine the prototype/solution
• To learn more about the user and the problem
• To move forward to the next iteration
10. When to evaluate?
• Once the product has been developed
• pros: rapid development, small evaluation cost
• cons: problems found this late are expensive to rectify
• During design and development
• pros: problems are found and rectified early
• cons: higher evaluation cost, longer development time
[Diagram: design → implementation → evaluation → redesign & reimplementation cycle, versus design → implementation only]
11. Types of Evaluation
• Formative testing
• early stage of development (Low-fidelity prototypes)
• focus on user perception of experience
• comparing multiple design options
• Summative testing
• later development (High-fidelity interactive prototype)
• evaluate the effectiveness of specific design choices
• focus on performance and usability
14. 1. Quick and Dirty Testing (Formative)
• Informal feedback from users to confirm that the designers’
ideas are in line with users’ needs and are liked.
• Quick and dirty evaluations are done any time.
• Emphasis is on fast input to the design process
rather than carefully documented findings.
16. 2. User Testing (Formative/Summative)
• Investigations of users and their use of
human-computer interfaces
• Observe and Describe
• Explain
• Predict
• Determination of the Causes
• Main research method in the HCI field
• Closely related to ...
• Behavioral Science in Psychology
• Ergonomics & Human factors in Industrial Engineering
17. User Testing Methods
• Interview
• Focus group
• Survey
• Questionnaire
• Usability Testing
• User Experiment
[Diagram: the methods span a spectrum from informal, qualitative feedback (formative testing; descriptive/relational investigation) to formal, quantitative feedback (summative testing; experimental investigation). Mix & match to your needs. Triangulation.]
18. Interview and Focus Group
• Ask the user
• Direct conversations as tools
for feedback collection
• Understand requirements,
needs, problems
• Interviews – one at a time
• Focus groups – many
19. Survey and Questionnaire
• Survey
• Using a questionnaire to which a user is asked to respond
• Questionnaire
• a well-defined and well-written set of questions
• Typically self-administered
• Surveys are good at:
• getting a large number of responses quickly from a
geographically dispersed population
• collecting sensitive/private information
• You can capture the “big picture” relatively quickly
20. Example: Net Promoter Score
• A standardized measure used across a large number of industries
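As a minimal sketch (the ratings below are invented for illustration), NPS is the percentage of promoters (ratings 9–10) minus the percentage of detractors (ratings 0–6) on the standard 0–10 "How likely are you to recommend this?" item:

# Minimal NPS calculation sketch; ratings are invented sample data.
def net_promoter_score(ratings):
    """NPS = % promoters (9-10) minus % detractors (0-6)."""
    n = len(ratings)
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100.0 * (promoters - detractors) / n

ratings = [10, 9, 8, 7, 6, 10, 9, 3, 8, 9]
print(net_promoter_score(ratings))  # 30.0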
22. Experimental Study (Summative)
• Experiments discover/verify new knowledge by investigating the
causal effects between two or more variables.
• Independent Variable (IV)
• Manipulated to create different experimental conditions
• e.g. type of tool used, design alternatives
• Dependent Variable (DV)
• Measured to find out the effects of changing the independent variables
• e.g. user performance, satisfaction, usability
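For illustration, a hypothetical logging sketch (the file name, field names, and values are assumptions, not from the slides) that keeps the IV (condition) and the DVs (completion time, errors) as separate columns, one row per trial, ready for later analysis:

# Illustrative sketch: one row per trial, IV and DVs as columns.
import csv

FIELDNAMES = ["participant", "condition", "trial", "time_s", "errors"]

with open("trials.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
    writer.writeheader()
    # Participant P01 used design alternative "A" (the IV); we measured
    # completion time in seconds and error count (the DVs).
    writer.writerow({"participant": "P01", "condition": "A",
                     "trial": 1, "time_s": 12.4, "errors": 0})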
23. Experimental Design
[Diagram: In a between-subject design, subjects are randomly assigned to Condition 1, 2, or 3; each group performs the experimental task under its single condition, and the groups' data are compared with statistical analysis. In a within-subject design, every subject performs the experimental tasks under all conditions, with the order of conditions varied across subjects, and the data are then analysed statistically.]
24. Between vs. Within-Subject Design
Between-subject:
+ Avoids interference effects (e.g. practice/learning effects)
+ Shorter time for each participant (less fatigue and frustration)
- Individual differences affect the comparison
- Harder to detect differences between conditions
- Requires a larger sample size
Important: randomise assignment to conditions.

Within-subject:
+ Individual differences can be isolated
+ Easier to detect differences between conditions
+ Requires a smaller sample size
- Learning effects
- Longer time for each participant (larger impact of fatigue and frustration)
Important: counterbalance/randomise the order of presenting conditions (see the sketch below).

Note: gender, age, and experience must be between-subject factors; change over time must be a within-subject factor.
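A minimal sketch of the counterbalancing step for a within-subject design: a balanced Latin square, in which every condition appears in every serial position equally often (and, for an even number of conditions, each condition precedes and follows every other equally often):

# Balanced Latin square for counterbalancing condition order.
def balanced_latin_square(conditions):
    """Return one condition ordering per row; balanced when n is even."""
    n = len(conditions)
    orders = []
    for r in range(n):
        order = []
        for i in range(n):
            # First-row pattern 0, 1, n-1, 2, n-2, ...; later rows shift by r.
            idx = ((i + 1) // 2 if i % 2 else -(i // 2)) % n
            order.append(conditions[(idx + r) % n])
        orders.append(order)
    return orders

for order in balanced_latin_square(["A", "B", "C", "D"]):
    print(order)   # first row: ['A', 'B', 'D', 'C']
# Participant p would receive orders[p % len(orders)].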
25. Objective vs. Subjective Measures
• Objective measures
• Not influenced by personal feeling/opinion
• Based on observation, compared against standardized scale.
• More consistent
• Subjective measures
• Based on user's opinions, interpretations, points of view,
emotions and judgment.
• More vulnerable to context and the user’s state
26. Data Types
• Subjective (Qualitative)
• Subjective survey
• Likert Scale, condition rankings
• Observations
• Think Aloud
• Interview responses
• Objective (Quantitative)
• Performance measures
• Time, accuracy, errors
• Process measures
• Video/audio analysis
Example Likert item: “How easy was the task?” rated 1 (Not very easy) to 5 (Very easy)
27. Standard Questionnaires in HCI
• Existing validated questionnaires in the HCI literature:
• System Usability Scale (SUS)
• Computer System Usability Questionnaire (CSUQ)
• Interface Consistency Testing Questionnaire (ICTQ)
• Questionnaire for User Interaction Satisfaction (QUIS)
• User Experience Questionnaire (UEQ)
• NASA Task Load Index (TLX)
• See book and web for more questionnaires
• http://oldwww.acm.org/perlman/question.html
• http://www.usabilitynet.org/tools/r_questionnaire.htm
• http://www.measuringu.com/blog/ux-questions.php
• Surveys on: Game Experience, Presence, Engagement, User Experience
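As an example of how these validated questionnaires are scored, here is the standard SUS calculation (odd-numbered items contribute score − 1, even-numbered items contribute 5 − score, and the sum is scaled by 2.5 to give 0–100); the sample responses are invented:

# Standard System Usability Scale (SUS) scoring sketch.
def sus_score(responses):
    """responses: ten answers on a 1-5 scale (1 = strongly disagree)."""
    assert len(responses) == 10
    total = 0
    for i, r in enumerate(responses):
        if i % 2 == 0:          # odd-numbered items (1, 3, 5, 7, 9)
            total += r - 1
        else:                   # even-numbered items (2, 4, 6, 8, 10)
            total += 5 - r
    return total * 2.5          # scale to 0-100

print(sus_score([4, 2, 5, 1, 4, 2, 5, 1, 4, 2]))  # 85.0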
28. 3. Field Studies (Formative/Summative)
• Field studies are done in natural settings
• The aim is to understand what users do naturally
and how technology impacts them.
• In product design, field studies can be used to:
- identify opportunities for new technology
- determine design requirements
- decide how to introduce new technology
- evaluate technology in use.
29. Example Field Study
• AR map application
• Digital Map (D) vs. AR Map (M)
• Experimenter followed pairs of people
• Made observations, interviewed afterwards
• Found interesting behaviours – e.g. Map as shared artifact in (M), separate in (D)
Morrison, A., Oulasvirta, A., Peltonen, P., Lemmela, S., Jacucci, G., Reitmayr, G., ... & Juustila, A. (2009, April). Like bees around
the hive: a comparative study of a mobile augmented reality map. In Proceedings of the SIGCHI conference on human factors in
computing systems (pp. 1889-1898).
[Images: digital map vs. AR-enhanced map]
30. 4. Predictive Evaluation (Formative)
• Experts apply their knowledge of typical users, often
guided by heuristics, to predict usability problems.
• Many heuristics available
• Nielsen’s 10 principles, Tognazzini’s 16 principles, etc..
• A key feature of predictive evaluation is that users
need not be present
• Relatively quick and inexpensive
31. How many Experts do you need?
• Nielsen
• 5 experts will find approximately
80% of problems
• However
• depends on how complex an interface is
• how many interface flaws exist
→ Get as many as you can within the given timeframe/resources
(see the sketch below)
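Nielsen's rule of thumb comes from the Nielsen–Landauer problem-discovery model Found(n) = 1 − (1 − λ)^n, where λ is the probability that a single evaluator finds a given problem (about 0.31 on average in their data, though it varies with interface complexity). A quick sketch:

# Problem-discovery curve behind "5 evaluators find ~80% of problems".
def proportion_found(n_evaluators, lambda_=0.31):
    """lambda_: chance one evaluator finds a given problem (assumed ~0.31)."""
    return 1 - (1 - lambda_) ** n_evaluators

for n in (1, 3, 5, 10):
    print(n, round(proportion_found(n), 2))
# 1 -> 0.31, 3 -> 0.67, 5 -> 0.84, 10 -> 0.98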
32. Example: Sharing Communication Cues (2019)
• Using AR/VR to share communication cues
• Gaze, gesture, head pose, body position
• Collaboration between AR/VR
• VR user appears in AR user’s space
• Sharing same environment
• Virtual copy of real world
• What is the effect of gaze cues?
Piumsomboon, T., Dey, A., Ens, B., Lee, G., & Billinghurst, M. (2019). The effects of sharing awareness cues
in collaborative mixed reality. Frontiers in Robotics and AI, 6, 5.
34. Conditions
• Baseline: In the Baseline condition, we showed only the head and hands of the
collaborator in the scene. The head and hands were presented in all conditions
• Field-of-view (FoV): We showed the FoV frustum of each collaborator to the
other. This enabled collaborators to understand roughly where their partner was
looking and how much area the other person could see at any point in time.
• Head-gaze (FoV + Head-gaze ray): FoV frustum plus a ray originating from the
user's head to identify the center of the FoV, which provided a more precise
indication where the other collaborator was looking
• Eye-gaze (FoV + Eye-gaze ray): In this cue, we showed a ray originating from
the user's eyes to show exactly where the user was looking.
35. Task
• Search task
• Find specific blocks together
• Two phases:
• Object identification
• Object placement
• Designed to force collaboration
• Each person seeing different information
• Within-subject Design
• Everyone experiences all conditions
36. Measures
• Performance (Objective)
• Rate of Mutual Gaze
• Task completion time
• Observed (Objective)
• Number of hand gestures
• Physical movement
• Distance between collaborators
• Subjective
• Usability Survey (SUS)
• Social Presence Survey
• Interview
37. Data Collected
• Participants
• 16 pairs = 32 people
• 9 women
• Aged 20 – 55, average 31 years
• Experience
• No experience with VR (6), no experience with AR (10), no HMD experience (7).
• Data collection
• Objective
• 4 (conditions) × 8 (trials per condition) × 16 pairs = 512 data points
• Subjective
• 4 (conditions) × 32 (participants) = 128 data points.
40. Results
• Predictions
• Eye/Head pointing better than no cues
• Eye/head pointing could reduce need for pointing
• Results
• No difference in task completion time
• Head-gaze/eye-gaze gave a greater mutual gaze rate
• Head-gaze gave greater ease of use than baseline
• All cues provided higher co-presence than baseline
• Pointing gestures were reduced in the cue conditions
• But
• No difference between head-gaze and eye-gaze
41. Lessons Learned
• Decide on type of experiment
• Within subject vs. between subject
• Have well designed task with measurable outcomes
• Use both qualitative and quantitative measures
• Performance + user preference
• Have enough subjects for significant results
• Use the appropriate statistics
• Compare conditions + perform post hoc analysis
• Provide subject training on task
• Observe user behavior and interview subjects
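To make "use the appropriate statistics" concrete, here is a minimal sketch with invented, between-subject data: a one-way ANOVA across three conditions, followed by Bonferroni-corrected pairwise t-tests as the post hoc analysis. A within-subject design like the one above would instead call for repeated-measures tests (e.g. scipy.stats.ttest_rel for pairs).

# ANOVA + Bonferroni-corrected post hoc comparisons (invented data).
from itertools import combinations
from scipy import stats

times = {  # hypothetical task-completion times (s) per condition
    "baseline": [12.1, 13.4, 11.8, 14.0, 12.9],
    "head-gaze": [10.2, 9.8, 11.0, 10.5, 9.9],
    "eye-gaze": [10.0, 10.4, 9.7, 10.9, 10.1],
}

f, p = stats.f_oneway(*times.values())
print(f"ANOVA: F={f:.2f}, p={p:.4f}")

pairs = list(combinations(times, 2))
alpha = 0.05 / len(pairs)            # Bonferroni correction
for a, b in pairs:
    t, p = stats.ttest_ind(times[a], times[b])
    print(f"{a} vs {b}: t={t:.2f}, p={p:.4f}, significant={p < alpha}")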
43. Moving Beyond Questionnaires
• Consider the entire user
• Cultural, social factors
• Move data capture from post experiment to during experiment
• Move from performance measures to process measures
• Richer types of data captured
• Physiological Cues - EEG, GSR, EMG, Heart rate, etc.
• Richer Behavioural Cues - Body motion, user positioning, etc.
• Higher level understanding
• Map data to Emotion recognition, Cognitive load, etc.
• Use better analysis tools
• Video analysis, conversation analysis, multi-modal analysis, etc.
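As one example of turning such physiological cues into analysable data, a sketch (synthetic signal, assumed sampling rate) of estimating EEG alpha-band (8–12 Hz) power with Welch's method:

# Alpha-band power from a (synthetic) EEG signal via Welch's PSD.
import numpy as np
from scipy.signal import welch
from scipy.integrate import trapezoid

fs = 256                                  # assumed sampling rate (Hz)
t = np.arange(0, 10, 1 / fs)
rng = np.random.default_rng(0)
# Fake 10 Hz alpha rhythm plus noise, standing in for a real recording.
eeg = np.sin(2 * np.pi * 10 * t) + 0.5 * rng.standard_normal(t.size)

freqs, psd = welch(eeg, fs=fs, nperseg=2 * fs)
band = (freqs >= 8) & (freqs <= 12)
alpha_power = trapezoid(psd[band], freqs[band])  # integrate PSD over band
print(f"alpha-band power: {alpha_power:.3f}")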
45. Social Acceptance
• People don’t want to look silly
• Only 12% of 4,600 adults would be willing to wear AR glasses
• 20% of mobile AR browser users experience social issues
• Acceptance is driven more by social than technical issues
• Needs further study (ethnographic, field tests, longitudinal)
49. Physical Ergonomics
• Evaluate the human motion range
• Consider human comfort and natural posture
• Example: Ergonomics for hand input
• Coarse and fine scale motions, gripping and grasping
• Avoid “Gorilla arm syndrome” from holding arm pose
50. Gorilla Arm in AR
• Design interface to reduce mid-air gestures
51. XRgonomics
• Uses physiological model to calculate ergonomic interaction cost
• Difficulty of reaching points around the user
• Customizable for different users
• Programmable API, Hololens demonstrator
• GitHub Repository
• https://github.com/joaobelo92/xrgonomics
Evangelista Belo, J. M., Feit, A. M., Feuchtner, T., & Grønbæk, K. (2021, May). XRgonomics: Facilitating the Creation of
Ergonomic 3D Interfaces. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (pp. 1-11).
53. New Tools
• New types of sensors
• EEG, ECG, GSR, etc
• Sensors integrated into AR/VR systems
• Integrated into HMDs
• Data processing and capture tools
• iMotions, etc
• AR/VR Analytics tools
• Cognitive3D, etc
54. HP Reverb G2 Omnicept
• Wide FOV, high-resolution, best-in-class VR display
• Eye tracking, heart rate, pupillometry, and face camera
• SDK for measuring cognitive load
60. Example: Measuring Trust
• How to reliably measure trust?
• Using physiological sensors (EEG, GSR, HRV)
• Subjective measures (STS, SMEQ, NASA-TLX)
• Relationship between cognitive load (CL) and trust?
• Novelty:
• Use EEG, GSR, HRV to evaluate trust at different CL
• Implemented custom VR environment with virtual agent
• Compare physiological, behavioral, subjective measures
Gupta, K., Hajika, R., Pai, Y. S., Duenser, A., Lochner, M., & Billinghurst, M. (2020, March).
Measuring human trust in a virtual assistant using physiological sensing in virtual reality.
In 2020 IEEE Conference on Virtual Reality and 3D User Interfaces (VR) (pp. 756-765). IEEE.
62. Experiment Design
• Two factors
• Cognitive Load (Low, High)
• Low = N-Back with N = 1
• High = N-Back with N = 2
• Agent Accuracy (No, Low, High)
• No = No agent
• Low = 50% accurate
• High = 100% accurate
• Within Subject Design
• 24 subjects (12 Male), 23-35 years old
• All experienced with virtual assistants
→ 2 × 3 factorial design
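For illustration, a small sketch of the N-back manipulation used here to vary cognitive load: the participant must report whenever the current stimulus matches the one shown N items earlier (N = 1 for low load, N = 2 for high). The stimulus alphabet and sequence length are assumptions:

# N-back stimulus sequence and target detection sketch.
import random

def nback_targets(sequence, n):
    """Indices where the item matches the one n steps back."""
    return [i for i in range(n, len(sequence)) if sequence[i] == sequence[i - n]]

letters = random.choices("ABCD", k=20)
print(letters)
print("1-back targets:", nback_targets(letters, 1))  # low cognitive load
print("2-back targets:", nback_targets(letters, 2))  # high cognitive load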
63. Results
• Physiological Measures
• EEG: significant difference in alpha-band power with CL
• GSR/HRV: significant differences in FFT mean/peak frequency
• Performance
• Better with more accurate agent, no effect of CL
• Subjective Measures
• Significant differences in STS scores with accuracy and CL
• SMEQ showed a significant effect of CL
• NASA-TLX showed significant effects of CL and accuracy
• Overall
• Trust in virtual agents can be measured using a combination
of physiological, performance, and subjective measures
65. Meta-Review
Review of 10 years of AR user studies
Dey, A., Billinghurst, M., Lindeman, R. W.,
& Swan, J. (2018). A systematic review of
10 years of augmented reality usability
studies: 2005 to 2014. Frontiers in
Robotics and AI, 5, 37.
70. Summary
• Few AR papers have a formal experiment (~10%)
• Most papers use within-subjects designs (73%)
• Most experiments are run in controlled environments (76%)
• Lack of experimentation in real-world conditions; few heuristic evaluations or pilot studies
• Half of papers collect both qualitative and quantitative measures (48%)
• Performance measures (76%), surveys (50%)
• Most papers focus on visual senses (96%)
• Young participants dominate (University students) (62%)
• Females in minority (36%)
• Most use HMD (35%) or handheld displays (34%)
• Handheld/mobile AR studies becoming more common
• Most studies are in interaction (23%), very few collaborative studies (4%)
71. Research Opportunities
• Need for increased user studies in collaboration
• More use of field studies and natural settings
• Need a wider range of evaluation methods
• Use a more diverse selection of participants
• Increase number of participants
• More user studies conducted outdoors are needed
• Fully report participant demographics, study design, and experimental tasks
73. Conclusions
•Evaluate throughout the design process
•Move from formative to summative evaluation
•Use multiple types of evaluation
•Collect multiple types of data
•Go beyond questionnaires
•Many directions for future research
74. To study more …
Jonathan Lazar, Jinjuan Heidi Feng, and Harry Hochheiser,
Research Methods in Human-Computer Interaction