This document discusses evaluating user engagement in information retrieval systems, from small scale to large scale. It begins with traditional evaluation methods in information retrieval, which focus on retrieval effectiveness and relevance through metrics like precision, recall and click-through rates. It then introduces the concept of user engagement, which looks beyond relevance to consider the emotional, cognitive and behavioural connections between users and systems. Key aspects of user engagement discussed include novelty, aesthetics and motivation. Methods for measuring engagement range from self-reports and physiological sensors for small-scale studies to analytics of user behavior, such as dwell times, abandonment rates and return visits, for large-scale evaluation. The talk explores moving evaluation from intra-session metrics like dwell time to inter-session metrics like absence time.
A Journey into Evaluation: from Retrieval Effectiveness to User Engagement
1. A Journey into Evaluation: from Retrieval Effectiveness to User Engagement
Mounia Lalmas
Yahoo Labs London
mounia@acm.org
SPIRE 2015 – King’s College London
2. This talk
§ Introduction to user engagement
§ Evaluation in information retrieval
(retrieval effectiveness)
§ From retrieval effectiveness to user engagement
(from intra-session to inter-session evaluation)
(from small- to large-scale evaluation)
5. What is user engagement?
“User engagement is a quality of the user experience that emphasizes the phenomena associated with wanting to use a technological resource longer and frequently” (Attfield et al, 2011)
The emotional, cognitive and behavioural connection that exists, at any point in time and over time, between a user and a technological resource.
§ self-report: happy, sad, enjoyment, …
§ physiology: gaze, body heat, mouse movement, …
§ analytics: click, upload, read, comment, share, …
6. Why is it important to engage users?
§ In today’s wired world, users have enhanced expectations about their interactions with technology … resulting in increased competition amongst the purveyors and designers of interactive systems.
§ In addition to utilitarian factors, such as usability, we must consider the hedonic and experiential factors of interacting with technology, such as fun, fulfillment, play, and user engagement.
(O’Brien, Lalmas & Yom-Tov, 2014)
7. Online sites differ with respect to their engagement pattern
§ Games: users spend much time per visit
§ Search: users come frequently and do not stay long
§ Social media: users come frequently and stay long
§ Niche: users come on average once a week, e.g. for a weekly post
§ News: users come periodically, e.g. morning and evening
§ Service: users visit the site when needed, e.g. to renew a subscription
(Lehmann et al, 2012)
8. Characteristics of user engagement
§ Novelty (Webster & Ho, 1997; O’Brien, 2008)
§ Richness and control (Jacques et al, 1995; Webster & Ho, 1997)
§ Aesthetics (Jacques et al, 1995; O’Brien, 2008)
§ Endurability (Read, MacFarlane & Casey, 2002; O’Brien, 2008)
§ Focused attention (Webster & Ho, 1997; O’Brien, 2008)
§ Reputation, trust and expectation (Attfield et al, 2011)
§ Positive affect (O’Brien & Toms, 2008)
§ Motivation, interests, incentives, and benefits (Jacques et al, 1995; O’Brien & Toms, 2008)
(O’Brien, Lalmas & Yom-Tov, 2014)
9. Measuring user engagement
§ Self-report (questionnaire, interview, think-aloud and think-after protocols): subjective; short- and long-term; lab and field; small scale
§ Physiology (EEG, SCL, fMRI, eye tracking, mouse tracking): objective; short-term; lab and field; small and large scale
§ Analytics (within- and across-session metrics, data science): objective; short- and long-term; field; large scale
10. Attributes of user engagement
§ Scale (small versus large)
§ Setting (laboratory versus field)
§ Objective versus subjective
§ Temporality (short- versus long-term)
We focus on
1. Temporality: from intra- to inter-session
2. Scalability: from small- to large-scale
12. How to evaluate a search engine
§ Coverage
§ Speed
§ Query language
§ User interface
§ User happiness
› Users find what they want and return to the search engine
› Users complete the search task, where search is a means, not an end
Sec. 8.6
(Manning, Raghavan & Schütze, 2008; Baeza-Yates & Ribeiro-Neto, 2011)
13. Within an online session
› July 2012
› 2.5M users
› 785M page views
› Categorization of the most frequently accessed sites
• 11 categories (e.g. news), 33 subcategories (e.g. news finance, news society)
• 760 sites from 70 countries/regions
Short sessions: on average 3.01 distinct sites visited, with a revisitation rate of 10%
Long sessions: on average 9.62 distinct sites visited, with a revisitation rate of 22%
(Lehmann et al, 2013)
14. Measuring user happiness
Most common proxy: relevance of search results
Sec. 8.1
[Diagram: retrieved vs relevant items among all items, illustrating precision and recall]
§ User information need translated into a query
§ Relevance assessed relative to the information need, not the query
§ Example:
› Information need: I am looking for a tennis holiday in a country with no rain
› Query: tennis academy good weather
Evaluation measures:
• precision, recall, R-precision, precision@n, mean average precision, F-measure, …
• bpref, cumulative gains, …
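For concreteness, the two ranked-retrieval measures most often reported above can be sketched in a few lines. This is a minimal illustration assuming binary relevance judgments, not the official TREC evaluation code; the document identifiers are made up:

```python
def precision_at_n(ranked, relevant, n):
    """Fraction of the top-n ranked items that are relevant."""
    return sum(1 for d in ranked[:n] if d in relevant) / n

def average_precision(ranked, relevant):
    """Mean of precision@k over the ranks k where a relevant item appears,
    normalized by the total number of relevant items."""
    hits, total = 0, 0.0
    for k, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

ranked = ["d3", "d1", "d7", "d2", "d9"]   # system output, best first
relevant = {"d1", "d2", "d5"}             # judged relevant documents
p3 = precision_at_n(ranked, relevant, 3)  # 1 relevant in top 3 -> 1/3
ap = average_precision(ranked, relevant)  # (1/2 + 2/4) / 3 -> 1/3
```

Mean average precision (MAP) is then simply the mean of `average_precision` over a set of queries.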
15. Measuring user happiness
Most common proxy: relevance of search results
Sec. 8.1
Explicit signals: test collection methodology (TREC, CLEF, …); human-labeled corpora
Implicit signals: user behavior in online settings (clicks, skips, …)
16. Examples of implicit signals in web search
§ Number of clicks
§ Click at a given position
§ Time to first click
§ Skipping
§ Abandonment rate
§ Number of query reformulations
§ Dwell time
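Several of these signals fall out of a time-stamped click log almost directly. The sketch below computes abandonment rate, clicks per query, time to first click, and mean dwell time from a toy log; the record layout and the data are hypothetical, chosen only to make the aggregation concrete:

```python
from collections import defaultdict
from statistics import mean

# Toy search log: (query_id, click_rank, seconds_to_click, dwell_seconds);
# a row with click_rank None records a result page shown with no click.
log = [
    ("q1", 1,    2.5,  40.0),
    ("q1", 3,    9.0,   5.0),
    ("q2", None, None, None),
    ("q3", 2,    4.0, 120.0),
]

by_query = defaultdict(list)
for qid, rank, ttc, dwell in log:
    by_query[qid].append((rank, ttc, dwell))

# Abandonment rate: fraction of queries with no click at all
abandoned = {q for q, rows in by_query.items()
             if all(r[0] is None for r in rows)}
abandonment_rate = len(abandoned) / len(by_query)

# Number of clicks per query
clicks_per_query = {q: sum(1 for r in rows if r[0] is not None)
                    for q, rows in by_query.items()}

# Time to first click, for queries that received a click
time_to_first_click = {q: min(r[1] for r in rows if r[0] is not None)
                       for q, rows in by_query.items() if q not in abandoned}

# Mean dwell time over all clicked results
mean_dwell = mean(r[2] for rows in by_query.values()
                  for r in rows if r[0] is not None)
```

On this toy log, one query in three is abandoned, `q1`'s first click comes after 2.5 seconds, and mean dwell time is 55 seconds.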
17. What is a happy user in web search?
1. The user information need is satisfied
2. The user has learned about a topic, and even about other topics
3. The system was inviting and even fun to use
In-the-moment engagement: users are active on a site or stay long
Long-term engagement: users come back frequently and over a long period
USER ENGAGEMENT
20. I just wanted the phone number … I am totally happy ☺
No clicks
21. Dwell time
Dwell time used as a proxy of user experience.
Setting: click on an ad on a mobile device (publisher site).
[Chart: dwell time distributions, non-mobile-optimized vs mobile-optimized landing pages]
Dwell time on non-optimized landing pages is comparable to, and even higher than, on mobile-optimized ones … when mobile-optimized, users realize quickly whether they “like” the ad or not?
(Lalmas et al, 2015)
24. Timelines: top most popular tweets vs top most popular tweets + geographically diverse
Being from a central or peripheral location makes a difference: peripheral users did not perceive the timeline as being diverse.
Objectivity versus subjectivity.
It should never be just about the algorithm, but also about how users respond to what the algorithm returns to them → USER ENGAGEMENT
(Eduardo Graells, 2015)
27. Beyond clicks and relevance towards user engagement
§ From intra- to inter-session evaluation
› Dwell time and absence time
› Linking strategy
› Mobile advertising
§ From small- to large-scale evaluation
› Eye-tracking and user engagement questionnaire
› Mouse tracking and user engagement questionnaire
Happy users come back; we need to properly identify the happy users.
29. From short- to long-term engagement: from intra- to inter-session engagement
Intra-session metric(s): how do users engage within a session? This is what we monitor.
Inter-session metric(s): how do users engage across sessions? This is what we know it will mean.
Intra-session metrics serve as a proxy of future engagement.
31. Intra-session metrics
• Dwell time
• Session duration
• Bounce rate
• Play time (video)
• Mouse movement
• Click-through rate (CTR)
• Number of pages viewed (click depth)
• Conversion rate
• Number of UGC items (e.g. comments)
• …
Dwell time as a proxy of user interest, of relevance, of conversion, of post-click ad quality, …
User engagement metrics: intra-session and inter-session
32. Dwell time
§ Definition: the contiguous time spent on a site or web page
§ Similar measures: play time (for video sites)
§ Cons: not clear that the user was actually looking at the site while there → blur/focus events
[Chart: distribution of dwell times on 50 websites]
(O’Brien, Lalmas & Yom-Tov, 2014)
33. Dwell time
Dwell time varies by site type: leisure sites tend to have longer dwell times than news, e-commerce, etc.
Dwell time has a relatively large variance even for the same site (tourists, VIPs, active … users).
[Chart: dwell time on 50 websites]
(O’Brien, Lalmas & Yom-Tov, 2014)
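Taking the definition above (contiguous time spent on a page), dwell time is typically estimated from time-stamped page views: the dwell on a page is the gap until the next page view in the session. A minimal sketch over hypothetical log data, which also shows the usual limitation that the last page of a session has no observable dwell:

```python
# One user session as (timestamp_seconds, url) page views; hypothetical data.
views = [
    (0.0,   "news.example/home"),
    (35.0,  "news.example/story-1"),
    (215.0, "news.example/story-2"),
]

# Dwell on each page = time until the next page view; the final page view
# is unbounded, so log-based estimates simply drop (or censor) it.
dwell = [(url, views[i + 1][0] - t)
         for i, (t, url) in enumerate(views[:-1])]
# dwell == [("news.example/home", 35.0), ("news.example/story-1", 180.0)]
```

Note this estimate cannot tell whether the tab was actually in focus, which is exactly the blur/focus caveat raised on slide 32.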
37. Absence time and survival analysis
[Survival curves over absence time (hours, 0–20) for users who read each of stories 1–9: the proportion of users who did come back, e.g. users (%) who read story 2 but did not come back after 10 hours]
SURVIVE vs DIE, where DIE = RETURN TO SITE → SHORT ABSENCE TIME
38. Absence time applied to search
§ Ranking functions on Yahoo Answers Japan
§ Two weeks of click data on Yahoo Answers Japan search
§ One million users
§ Six ranking functions
§ 30-minute session boundary
39. Survival analysis: a high hazard rate (dying quickly) = a short absence
[Survival curves: absence time by number of clicks on the search result page, e.g. 3 clicks, 5 clicks, control = no click]
40. Absence time – search experience
1. No click means a bad user experience
2. Clicking on 3–5 results leads to the same user experience
3. Clicking on more than 5 results reflects a poorer user experience; users cannot find what they are looking for
4. Clicking lower in the ranking (2nd, 3rd) suggests a more careful choice by the user (compared to 1st)
5. Clicking at the bottom is a sign of a low-quality overall ranking
6. Users finding their answers quickly (time to 1st click) return sooner to the search application
7. Returning to the same search result page is a worse user experience than reformulating the query
Search session metrics → absence time
(Dupret & Lalmas, 2013)
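The survival-analysis machinery behind these findings can be sketched with a plain Kaplan-Meier estimator, where the "event" is the user returning to the site (so dying quickly, i.e. a short absence, is the good outcome). The data below are hypothetical, and real studies would use a proper library; this is only to make the estimator concrete:

```python
def kaplan_meier(times, returned):
    """Kaplan-Meier survival estimate.

    times[i]    = hours of absence observed for user i
    returned[i] = True if the user was seen returning (the event),
                  False if the observation was censored (never seen back)
    Returns [(event_time, survival_probability), ...] step points.
    """
    surv, out = 1.0, []
    for t in sorted({t for t, r in zip(times, returned) if r}):
        at_risk = sum(1 for ti in times if ti >= t)          # still absent at t
        deaths = sum(1 for ti, ri in zip(times, returned)    # returned at t
                     if ri and ti == t)
        surv *= 1 - deaths / at_risk
        out.append((t, surv))
    return out

# Six hypothetical users: four observed returning after 2, 2, 5, 8 hours,
# two censored at 10 hours (never seen returning within the window).
times    = [2, 2, 5, 8, 10, 10]
returned = [True, True, True, True, False, False]
curve = kaplan_meier(times, returned)   # survival drops at t = 2, 5, 8
```

A lower curve (a higher hazard rate) means users come back sooner, i.e. shorter absence time and, on slide 40's reading, a better search experience.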
42. The context – Linking strategy in online news
A news provider offers links to related off-site content.
[Chart: p(absence12h) for no click vs off-site click]
Off-site link → absence time: providing links to related off-site content has a positive long-term effect.
(Lehmann et al, In Progress)
43. The context – Mobile advertising
[Chart: ad click difference (0%–600%) for short vs long ad clicks]
Dwell time → ad click: a positive post-click experience (“long” clicks) has an effect on users clicking on ads again.
(Lalmas et al, 2015)
44. Beyond clicks and relevance towards user engagement
§ From intra- to inter-session evaluation
› Dwell time and absence time
› Linking strategy
› Mobile advertising
Happy users come back.
46. Small-scale measurement – focused attention questionnaire
5-point scale (strongly disagree to strongly agree)
1. I lost myself in this news tasks experience
2. I was so involved in my news tasks that I lost track of time
3. I blocked things out around me when I was completing the news tasks
4. When I was performing these news tasks, I lost track of the world
around me
5. The time I spent performing these news tasks just slipped away
6. I was absorbed in my news tasks
7. During the news tasks experience I let myself go
(O'Brien & Toms, 2010)
47. Small-scale measurement – PANAS questionnaire
(10 positive items and 10 negative items)
§ You feel this way right now, that is, at the present moment
[1 = very slightly or not at all; 2 = a little; 3 = moderately;
4 = quite a bit; 5 = extremely]
[randomize items]
distressed, upset, guilty, scared, hostile,
irritable, ashamed, nervous, jittery, afraid
interested, excited, strong, enthusiastic, proud,
alert, inspired, determined, attentive, active
(Watson, Clark & Tellegen, 1988)
48. Small-scale measurement – gaze and self-reporting
News interest study: 57 users, reading tasks (114)
• questionnaire (qualitative data)
• eye-tracking recordings (quantitative data)
Three metrics: gaze, focused attention and positive affect.
All three metrics align: interesting content promotes all engagement metrics.
(Arapakis et al, 2014)
49. From small- to large-scale measurement – mouse tracking
§ Navigation and interaction with a digital environment usually involve the use of a mouse (selecting, positioning, clicking)
§ Several works show the mouse cursor to be a weak proxy of gaze (attention)
§ A low-cost, scalable alternative
§ Can be performed in a non-invasive manner, without removing users from their natural setting
50. Relevance, dwell time & cursor
“Reading” a relevant long document vs “scanning” a long non-relevant document
(Guo & Agichtein, 2012)
52. Mouse tracking and self-reporting
§ 324 users from Amazon Mechanical Turk (between-subjects design)
§ Two tasks (reading and search)
§ “Normal” vs “ugly” interface
§ Questionnaires (qualitative data)
› focused attention, positive affect
› interest, aesthetics
§ Mouse tracking (quantitative data)
› movement speed, movement rate, click rate, pause length, percentage of time still
(Warnock & Lalmas, 2015)
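Two of the quantitative features listed above, movement speed and percentage of time still, can be derived directly from a time-stamped cursor trace. A minimal sketch over a hypothetical trace; the stillness threshold is an assumed parameter, not a value from the study:

```python
import math

# Cursor trace as (timestamp_seconds, x, y) samples; hypothetical data.
trace = [(0.0, 10, 10), (0.5, 40, 50), (1.0, 40, 50), (2.0, 100, 130)]

STILL_EPS = 1.0  # pixels; a step below this counts as "still" (assumed threshold)

dists, still_time = [], 0.0
for (t0, x0, y0), (t1, x1, y1) in zip(trace, trace[1:]):
    d = math.hypot(x1 - x0, y1 - y0)  # Euclidean distance of this step
    dists.append(d)
    if d < STILL_EPS:
        still_time += t1 - t0

duration = trace[-1][0] - trace[0][0]
movement_speed = sum(dists) / duration          # pixels per second
percent_still = 100.0 * still_time / duration   # % of time with no movement
```

Here the cursor covers 150 pixels in 2 seconds (75 px/s) and is still for the half-second between the second and third samples (25% of the time). Movement rate, click rate and pause length follow the same pattern of per-step aggregation.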
53. Mouse tracking could not tell much about
• focused attention and positive affect
• user interest in the task/topic
• aesthetics
BUT
› the “ugly” variant did not result in lower user aesthetics scores
› although BBC > Wikipedia
BUT – the comments left …
› Wikipedia: “The website was simply awful. Ads flashing everywhere, poor text colors on a dark blue background.”; “The webpage was entirely blue. I don't know if it was supposed to be like that, but it definitely detracted from the browsing experience.”
› BBC News: “The website's layout and color scheme were a bitch to navigate and read.”; “Comic sans is a horrible font.”
54. Flawed methodology? Non-existing
signal? Wrong metric? Wrong measure?
§ Hawthorne Effect
§ Design
› Usability versus engagement
› Within- versus between-subject
§ Mouse movement was not sophisticated enough
56. Towards a taxonomy of mouse gestures for user engagement measurement
§ The top-ranked clustering configuration is spectral clustering on the original dataset, with a hyperbolic tangent kernel, for k = 38
• certain types of mouse gestures occur more or less often, depending on user interest in the article
• significant correlations exist between certain types of mouse gestures and self-report measures
• cursor behaviour goes beyond measuring frustration
• it informs about positive and negative interaction
57. Beyond clicks and relevance towards user engagement
§ From small- to large-scale evaluation
› Eye-tracking and user engagement questionnaire
› Mouse tracking and user engagement questionnaire
We need to properly identify the happy users.
60. § “If you cannot measure it, you cannot improve it” (William Thomson, Lord Kelvin)
§ “You cannot control what you cannot measure” (DeMarco)
§ “The way you measure is more important than what you measure” (Art Gust)
Thank you