This document discusses evaluating user engagement in information retrieval systems, from small scale to large scale. It begins with traditional evaluation methods in information retrieval, which focus on retrieval effectiveness and relevance through metrics like precision, recall and click-through rates. It then introduces the concept of user engagement, which looks beyond relevance to consider the emotional, cognitive and behavioural connections between users and systems. Key aspects of user engagement discussed include novelty, aesthetics and motivation. Methods for measuring engagement range from self-reports and physiological sensors for small-scale studies to analytics of user behavior, such as dwell times, abandonment rates and return visits, for large-scale evaluation. The talk explores moving evaluation from intra-session metrics like dwell time to inter-session metrics like absence time.
A Journey into Evaluation: from Retrieval Effectiveness to User Engagement
1. A Journey into Evaluation: from Retrieval Effectiveness to User Engagement
Mounia Lalmas
Yahoo Labs London
mounia@acm.org
SPIRE 2015 – King’s College London
2. This talk
§ Introduction to user engagement
§ Evaluation in information retrieval
(retrieval effectiveness)
§ From retrieval effectiveness to user engagement
(from intra-session to inter-session evaluation)
(from small- to large-scale evaluation)
5. What is user engagement?
“User engagement is a quality of the user experience that emphasizes the phenomena associated with wanting to use a technological resource longer and frequently” (Attfield et al, 2011)
The emotional, cognitive and behavioural connection that exists, at any point in time and over time, between a user and a technological resource.
§ self-report: happy, sad, enjoyment, …
§ physiology: gaze, body heat, mouse movement, …
§ analytics: click, upload, read, comment, share, …
6. Why is it important to engage users?
§ In today’s wired world, users have enhanced expectations about their interactions with technology … resulting in increased competition amongst the purveyors and designers of interactive systems.
§ In addition to utilitarian factors, such as usability, we must consider the hedonic and experiential factors of interacting with technology, such as fun, fulfillment, play, and user engagement.
(O’Brien, Lalmas & Yom-Tov, 2014)
7. Online sites differ with respect to their engagement pattern
§ Games: users spend much time per visit
§ Search: users come frequently and do not stay long
§ Social media: users come frequently and stay long
§ Niche: users come on average once a week, e.g. for a weekly post
§ News: users come periodically, e.g. morning and evening
§ Service: users visit the site when needed, e.g. to renew a subscription
(Lehmann et al, 2012)
8. Characteristics of user engagement
§ Novelty (Webster & Ho, 1997; O’Brien, 2008)
§ Richness and control (Jacques et al, 1995; Webster & Ho, 1997)
§ Aesthetics (Jacques et al, 1995; O’Brien, 2008)
§ Endurability (Read, MacFarlane & Casey, 2002; O’Brien, 2008)
§ Focused attention (Webster & Ho, 1997; O’Brien, 2008)
§ Reputation, trust and expectation (Attfield et al, 2011)
§ Positive affect (O’Brien & Toms, 2008)
§ Motivation, interests, incentives, and benefits (Jacques et al, 1995; O’Brien & Toms, 2008)
(O’Brien, Lalmas & Yom-Tov, 2014)
9. Measuring user engagement
§ Self-report (questionnaire, interview, think-aloud and think-after protocols): subjective; short- and long-term; lab and field; small scale
§ Physiology (EEG, SCL, fMRI, eye tracking, mouse tracking): objective; short-term; lab and field; small and large scale
§ Analytics (within- and across-session metrics, data science): objective; short- and long-term; field; large scale
10. Attributes of user engagement
§ Scale (small versus large)
§ Setting (laboratory versus field)
§ Objective versus subjective
§ Temporality (short- versus long-term)
We focus on
1. Temporality: from intra- to inter-session
2. Scalability: from small- to large-scale
12. How to evaluate a search engine
§ Coverage
§ Speed
§ Query language
§ User interface
§ User happiness
› Users find what they want and return to the search engine
› Users complete the search task, where search is a means, not an end
Sec. 8.6
(Manning, Raghavan & Schütze, 2008; Baeza-Yates & Ribeiro-Neto, 2011)
13. Within an online session
› July 2012
› 2.5M users
› 785M page views
› Categorization of the most frequently accessed sites
• 11 categories (e.g. news), 33 subcategories (e.g. news finance, news society)
• 760 sites from 70 countries/regions
Short sessions: on average 3.01 distinct sites visited, with a revisitation rate of 10%
Long sessions: on average 9.62 distinct sites visited, with a revisitation rate of 22%
(Lehmann et al, 2013)
14. Measuring user happiness
Most common proxy: relevance of search results
Sec. 8.1
[Diagram: retrieved vs relevant items among all items, illustrating precision and recall]
§ User information need translated into a query
§ Relevance assessed relative to the information need, not the query
§ Example:
› Information need: I am looking for a tennis holiday in a country with no rain
› Query: tennis academy good weather
Evaluation measures:
• precision, recall, R-precision, precision@n, mean average precision, F-measure, …
• bpref, cumulative gains, …
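For concreteness, the two ranked-retrieval measures most often reported above can be sketched in a few lines. This is a minimal illustration assuming binary relevance judgments, not the official TREC evaluation code; the document identifiers are made up:

```python
def precision_at_n(ranked, relevant, n):
    """Fraction of the top-n ranked items that are relevant."""
    return sum(1 for d in ranked[:n] if d in relevant) / n

def average_precision(ranked, relevant):
    """Mean of precision@k over the ranks k where a relevant item appears,
    normalized by the total number of relevant items."""
    hits, total = 0, 0.0
    for k, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

ranked = ["d3", "d1", "d7", "d2", "d9"]   # system output, best first
relevant = {"d1", "d2", "d5"}             # judged relevant documents
p3 = precision_at_n(ranked, relevant, 3)  # 1 relevant in top 3 -> 1/3
ap = average_precision(ranked, relevant)  # (1/2 + 2/4) / 3 -> 1/3
```

Mean average precision (MAP) is then simply the mean of `average_precision` over a set of queries.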
15. Measuring user happiness
Most common proxy: relevance of search results
Sec. 8.1
Explicit signals: test collection methodology (TREC, CLEF, …); human-labeled corpora
Implicit signals: user behavior in online settings (clicks, skips, …)
16. Examples of implicit signals in web search
§ Number of clicks
§ Click at a given position
§ Time to first click
§ Skipping
§ Abandonment rate
§ Number of query reformulations
§ Dwell time
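Several of these signals fall out of a time-stamped click log almost directly. The sketch below computes abandonment rate, clicks per query, time to first click, and mean dwell time from a toy log; the record layout and the data are hypothetical, chosen only to make the aggregation concrete:

```python
from collections import defaultdict
from statistics import mean

# Toy search log: (query_id, click_rank, seconds_to_click, dwell_seconds);
# a row with click_rank None records a result page shown with no click.
log = [
    ("q1", 1,    2.5,  40.0),
    ("q1", 3,    9.0,   5.0),
    ("q2", None, None, None),
    ("q3", 2,    4.0, 120.0),
]

by_query = defaultdict(list)
for qid, rank, ttc, dwell in log:
    by_query[qid].append((rank, ttc, dwell))

# Abandonment rate: fraction of queries with no click at all
abandoned = {q for q, rows in by_query.items()
             if all(r[0] is None for r in rows)}
abandonment_rate = len(abandoned) / len(by_query)

# Number of clicks per query
clicks_per_query = {q: sum(1 for r in rows if r[0] is not None)
                    for q, rows in by_query.items()}

# Time to first click, for queries that received a click
time_to_first_click = {q: min(r[1] for r in rows if r[0] is not None)
                       for q, rows in by_query.items() if q not in abandoned}

# Mean dwell time over all clicked results
mean_dwell = mean(r[2] for rows in by_query.values()
                  for r in rows if r[0] is not None)
```

On this toy log, one query in three is abandoned, `q1`'s first click comes after 2.5 seconds, and mean dwell time is 55 seconds.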
17. What is a happy user in web search?
1. The user information need is satisfied
2. The user has learned about a topic, and even about other topics
3. The system was inviting and even fun to use
In-the-moment engagement: users are active on a site or stay long
Long-term engagement: users come back frequently and over a long period
USER ENGAGEMENT
20. I just wanted the phone number … I am totally happy ☺
No clicks
21. Dwell time
Dwell time used as a proxy of user experience.
Setting: click on an ad on a mobile device (publisher site).
[Chart: dwell time distributions, non-mobile-optimized vs mobile-optimized landing pages]
Dwell time on non-optimized landing pages is comparable to, and even higher than, on mobile-optimized ones … when mobile-optimized, users realize quickly whether they “like” the ad or not?
(Lalmas et al, 2015)
24. Timelines: top most popular tweets vs top most popular tweets + geographically diverse
Being from a central or peripheral location makes a difference: peripheral users did not perceive the timeline as being diverse.
Objectivity versus subjectivity.
It should never be just about the algorithm, but also about how users respond to what the algorithm returns to them → USER ENGAGEMENT
(Eduardo Graells, 2015)
27. Beyond clicks and relevance towards user engagement
§ From intra- to inter-session evaluation
› Dwell time and absence time
› Linking strategy
› Mobile advertising
§ From small- to large-scale evaluation
› Eye-tracking and user engagement questionnaire
› Mouse tracking and user engagement questionnaire
Happy users come back; we need to properly identify the happy users.
29. From short- to long-term engagement: from intra- to inter-session engagement
Intra-session metric(s): how do users engage within a session? This is what we monitor.
Inter-session metric(s): how do users engage across sessions? This is what we know it will mean.
Intra-session metrics serve as a proxy of future engagement.
31. Intra-session metrics
• Dwell time
• Session duration
• Bounce rate
• Play time (video)
• Mouse movement
• Click-through rate (CTR)
• Number of pages viewed (click depth)
• Conversion rate
• Number of UGC items (e.g. comments)
• …
Dwell time as a proxy of user interest, of relevance, of conversion, of post-click ad quality, …
User engagement metrics: intra-session and inter-session
32. Dwell time
§ Definition: the contiguous time spent on a site or web page
§ Similar measures: play time (for video sites)
§ Cons: not clear that the user was actually looking at the site while there → blur/focus events
[Chart: distribution of dwell times on 50 websites]
(O’Brien, Lalmas & Yom-Tov, 2014)
33. Dwell time
Dwell time varies by site type: leisure sites tend to have longer dwell times than news, e-commerce, etc.
Dwell time has a relatively large variance even for the same site (tourists, VIPs, active … users).
[Chart: dwell time on 50 websites]
(O’Brien, Lalmas & Yom-Tov, 2014)
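Taking the definition above (contiguous time spent on a page), dwell time is typically estimated from time-stamped page views: the dwell on a page is the gap until the next page view in the session. A minimal sketch over hypothetical log data, which also shows the usual limitation that the last page of a session has no observable dwell:

```python
# One user session as (timestamp_seconds, url) page views; hypothetical data.
views = [
    (0.0,   "news.example/home"),
    (35.0,  "news.example/story-1"),
    (215.0, "news.example/story-2"),
]

# Dwell on each page = time until the next page view; the final page view
# is unbounded, so log-based estimates simply drop (or censor) it.
dwell = [(url, views[i + 1][0] - t)
         for i, (t, url) in enumerate(views[:-1])]
# dwell == [("news.example/home", 35.0), ("news.example/story-1", 180.0)]
```

Note this estimate cannot tell whether the tab was actually in focus, which is exactly the blur/focus caveat raised on slide 32.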
37. Absence time and survival analysis
[Survival curves over absence time (hours, 0–20) for users who read each of stories 1–9: the proportion of users who did come back, e.g. users (%) who read story 2 but did not come back after 10 hours]
SURVIVE vs DIE, where DIE = RETURN TO SITE → SHORT ABSENCE TIME
38. Absence time applied to search
§ Ranking functions on Yahoo Answers Japan
§ Two weeks of click data on Yahoo Answers Japan search
§ One million users
§ Six ranking functions
§ 30-minute session boundary
39. Survival analysis: a high hazard rate (dying quickly) = a short absence
[Survival curves: absence time by number of clicks on the search result page, e.g. 3 clicks, 5 clicks, control = no click]
40. Absence time – search experience
1. No click means a bad user experience
2. Clicking on 3–5 results leads to the same user experience
3. Clicking on more than 5 results reflects a poorer user experience; users cannot find what they are looking for
4. Clicking lower in the ranking (2nd, 3rd) suggests a more careful choice by the user (compared to 1st)
5. Clicking at the bottom is a sign of a low-quality overall ranking
6. Users finding their answers quickly (time to 1st click) return sooner to the search application
7. Returning to the same search result page is a worse user experience than reformulating the query
Search session metrics → absence time
(Dupret & Lalmas, 2013)
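The survival-analysis machinery behind these findings can be sketched with a plain Kaplan-Meier estimator, where the "event" is the user returning to the site (so dying quickly, i.e. a short absence, is the good outcome). The data below are hypothetical, and real studies would use a proper library; this is only to make the estimator concrete:

```python
def kaplan_meier(times, returned):
    """Kaplan-Meier survival estimate.

    times[i]    = hours of absence observed for user i
    returned[i] = True if the user was seen returning (the event),
                  False if the observation was censored (never seen back)
    Returns [(event_time, survival_probability), ...] step points.
    """
    surv, out = 1.0, []
    for t in sorted({t for t, r in zip(times, returned) if r}):
        at_risk = sum(1 for ti in times if ti >= t)          # still absent at t
        deaths = sum(1 for ti, ri in zip(times, returned)    # returned at t
                     if ri and ti == t)
        surv *= 1 - deaths / at_risk
        out.append((t, surv))
    return out

# Six hypothetical users: four observed returning after 2, 2, 5, 8 hours,
# two censored at 10 hours (never seen returning within the window).
times    = [2, 2, 5, 8, 10, 10]
returned = [True, True, True, True, False, False]
curve = kaplan_meier(times, returned)   # survival drops at t = 2, 5, 8
```

A lower curve (a higher hazard rate) means users come back sooner, i.e. shorter absence time and, on slide 40's reading, a better search experience.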
42. The context – Linking strategy in online news
A news provider offers links to related off-site content.
[Chart: p(absence12h) for no click vs off-site click]
Off-site link → absence time: providing links to related off-site content has a positive long-term effect.
(Lehmann et al, In Progress)
43. The context – Mobile advertising
[Chart: ad click difference (0%–600%) for short vs long ad clicks]
Dwell time → ad click: a positive post-click experience (“long” clicks) has an effect on users clicking on ads again.
(Lalmas et al, 2015)
44. Beyond clicks and relevance towards user engagement
§ From intra- to inter-session evaluation
› Dwell time and absence time
› Linking strategy
› Mobile advertising
Happy users come back.
46. Small-scale measurement – focused attention questionnaire
5-point scale (strongly disagree to strongly agree)
1. I lost myself in this news tasks experience
2. I was so involved in my news tasks that I lost track of time
3. I blocked things out around me when I was completing the news tasks
4. When I was performing these news tasks, I lost track of the world
around me
5. The time I spent performing these news tasks just slipped away
6. I was absorbed in my news tasks
7. During the news tasks experience I let myself go
(O'Brien & Toms, 2010)
47. Small-scale measurement – PANAS questionnaire
(10 positive items and 10 negative items)
§ You feel this way right now, that is, at the present moment
[1 = very slightly or not at all; 2 = a little; 3 = moderately;
4 = quite a bit; 5 = extremely]
[randomize items]
distressed, upset, guilty, scared, hostile,
irritable, ashamed, nervous, jittery, afraid
interested, excited, strong, enthusiastic, proud,
alert, inspired, determined, attentive, active
(Watson, Clark & Tellegen, 1988)
48. Small-scale measurement – gaze and self-reporting
News interest study: 57 users, reading tasks (114)
• questionnaire (qualitative data)
• eye-tracking recordings (quantitative data)
Three metrics: gaze, focused attention and positive affect.
All three metrics align: interesting content promotes all engagement metrics.
(Arapakis et al, 2014)
49. From small- to large-scale measurement – mouse tracking
§ Navigation and interaction with a digital environment usually involve the use of a mouse (selecting, positioning, clicking)
§ Several works show the mouse cursor to be a weak proxy of gaze (attention)
§ A low-cost, scalable alternative
§ Can be performed in a non-invasive manner, without removing users from their natural setting
50. Relevance, dwell time & cursor
“Reading” a relevant long document vs “scanning” a long non-relevant document
(Guo & Agichtein, 2012)
52. Mouse tracking and self-reporting
§ 324 users from Amazon Mechanical Turk (between-subjects design)
§ Two tasks (reading and search)
§ “Normal” vs “ugly” interface
§ Questionnaires (qualitative data)
› focused attention, positive affect
› interest, aesthetics
§ Mouse tracking (quantitative data)
› movement speed, movement rate, click rate, pause length, percentage of time still
(Warnock & Lalmas, 2015)
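Two of the quantitative features listed above, movement speed and percentage of time still, can be derived directly from a time-stamped cursor trace. A minimal sketch over a hypothetical trace; the stillness threshold is an assumed parameter, not a value from the study:

```python
import math

# Cursor trace as (timestamp_seconds, x, y) samples; hypothetical data.
trace = [(0.0, 10, 10), (0.5, 40, 50), (1.0, 40, 50), (2.0, 100, 130)]

STILL_EPS = 1.0  # pixels; a step below this counts as "still" (assumed threshold)

dists, still_time = [], 0.0
for (t0, x0, y0), (t1, x1, y1) in zip(trace, trace[1:]):
    d = math.hypot(x1 - x0, y1 - y0)  # Euclidean distance of this step
    dists.append(d)
    if d < STILL_EPS:
        still_time += t1 - t0

duration = trace[-1][0] - trace[0][0]
movement_speed = sum(dists) / duration          # pixels per second
percent_still = 100.0 * still_time / duration   # % of time with no movement
```

Here the cursor covers 150 pixels in 2 seconds (75 px/s) and is still for the half-second between the second and third samples (25% of the time). Movement rate, click rate and pause length follow the same pattern of per-step aggregation.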
53. Mouse tracking could not tell much about
• focused attention and positive affect
• user interest in the task/topic
• aesthetics
BUT
› the “ugly” variant did not result in lower user aesthetics scores
› although BBC > Wikipedia
BUT – the comments left …
› Wikipedia: “The website was simply awful. Ads flashing everywhere, poor text colors on a dark blue background.”; “The webpage was entirely blue. I don't know if it was supposed to be like that, but it definitely detracted from the browsing experience.”
› BBC News: “The website's layout and color scheme were a bitch to navigate and read.”; “Comic sans is a horrible font.”
54. Flawed methodology? Non-existing
signal? Wrong metric? Wrong measure?
§ Hawthorne Effect
§ Design
› Usability versus engagement
› Within- versus between-subject
§ Mouse movement was not sophisticated enough
56. Towards a taxonomy of mouse gestures for user engagement measurement
§ The top-ranked clustering configuration is spectral clustering on the original dataset, with a hyperbolic tangent kernel, for k = 38
• certain types of mouse gestures occur more or less often, depending on user interest in the article
• significant correlations exist between certain types of mouse gestures and self-report measures
• cursor behaviour goes beyond measuring frustration
• it informs about positive and negative interaction
57. Beyond clicks and relevance towards user engagement
§ From small- to large-scale evaluation
› Eye-tracking and user engagement questionnaire
› Mouse tracking and user engagement questionnaire
We need to properly identify the happy users.
60. § “If you cannot measure it, you cannot improve it” (William Thomson, Lord Kelvin)
§ “You cannot control what you cannot measure” (DeMarco)
§ “The way you measure is more important than what you measure” (Art Gust)
Thank you