2. This class
• The need for information filtering
• Filtering algorithms
• Human-machine filters
• Filter bubbles and other problems
• The filter design problem
12. Old Reddit comment ranking
“Hot” algorithm.
Upvotes minus downvotes, plus time decay
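A minimal sketch of a "hot"-style score, to make "net votes plus time" concrete. It is loosely modeled on publicly described versions of the Reddit formula; the logarithmic damping of the vote count and the 45000-second time scale are assumptions for illustration, not something stated on the slide. `posted_at` is a hypothetical timestamp in seconds since some fixed reference date.

```python
import math

def hot_score(ups, downs, posted_at, time_scale=45000.0):
    """Net votes (log-damped) plus a term that grows with posting time.

    Newer items get a larger time term, so older items sink relative
    to them unless they keep collecting votes.
    """
    net = ups - downs
    magnitude = math.log10(max(abs(net), 1))  # 10x more votes adds 1 point
    sign = (net > 0) - (net < 0)              # +1, 0, or -1
    return sign * magnitude + posted_at / time_scale
```

Under these assumed constants, a tenfold increase in net votes is worth the same as being posted 45000 seconds (12.5 hours) later.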
13. Reddit Comment Ranking (new)
Hypothetically, suppose all users voted on the comment, and v out of N
up-voted. Then we could sort by proportion p = v/N of upvotes.
N=16
v = 11
p = 11/16 = 0.6875
14. Reddit Comment Ranking
Actually, only n users out of N vote, giving an observed approximate
proportion p’ = v’/n
n=3
v’ = 1
p’ = 1/3 = 0.333
15. Reddit Comment Ranking
Limited sampling can rank comments in the wrong order when we don’t have enough
data.
First comment: observed p’ = 0.333, true p = 0.6875
Second comment: observed p’ = 0.75, true p = 0.1875
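How likely is a misleading sample? A rough sketch, assuming the n voters are a uniformly random subset of the N users (real voters are not), in which case the number of observed upvotes follows a hypergeometric distribution. The function name is hypothetical.

```python
from math import comb

def prob_observed(N, v, n, v_obs):
    """Probability of seeing v_obs upvotes among n voters drawn at random
    (without replacement) from N users, of whom v would upvote."""
    return comb(v, v_obs) * comb(N - v, n - v_obs) / comb(N, n)

# Comment with N=16, v=11 (true p = 0.6875): chance that 3 voters
# include exactly 1 upvote, giving the observed p' = 0.333.
p_misleading = prob_observed(16, 11, 3, 1)
```

Here `p_misleading` is 110/560, so roughly one sample in five would make this well-liked comment look unpopular.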
17. Rank comments by lower bound of confidence interval
p’ = observed proportion of upvotes
n = how many people voted
zα = how certain do we want to be before we assume that p’ is “close” to
the true p
Analytic solution for confidence interval, known as “Wilson score”
How not to sort by average rating, Evan Miller
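A sketch of the lower bound of the Wilson score interval, written in terms of the quantities defined on this slide (p’, n, zα). The default z = 1.96 corresponds to roughly 95% confidence and is an assumption, as is returning 0 for a comment with no votes.

```python
import math

def wilson_lower_bound(upvotes, n, z=1.96):
    """Lower bound of the Wilson score interval for the upvote proportion.

    (p' + z^2/(2n) - z*sqrt(p'(1-p')/n + z^2/(4n^2))) / (1 + z^2/n)
    """
    if n == 0:
        return 0.0  # assumption: unvoted comments rank at the bottom
    p_hat = upvotes / n
    z2 = z * z
    centre = p_hat + z2 / (2 * n)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n))
    return (centre - margin) / (1 + z2 / n)
```

With few votes the bound sits far below p’, so a comment with 3 of 4 upvotes ranks below one with 11 of 16, even though its observed proportion is higher.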
19. User-item matrix
Stores “rating” of each user for each item. Could also be
binary variable that says whether user clicked, liked,
starred, shared, purchased...
20. User-item matrix
• No content analysis. We know nothing about what is “in” each item.
• Typically very sparse – a user hasn’t watched even 1% of all
movies.
• Filtering problem is guessing “unknown” entry in matrix. High
guessed values are things user would want to see.
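Because the matrix is so sparse, one common way to hold it is to store only the known entries. A minimal sketch using nested dictionaries; the user and item names are made up for illustration.

```python
def build_matrix(triples):
    """Build a sparse user-item matrix from (user, item, rating) triples.

    Only known ratings are stored; a missing key is an "unknown" entry.
    """
    matrix = {}
    for user, item, rating in triples:
        matrix.setdefault(user, {})[item] = rating
    return matrix

def density(matrix, n_items):
    """Fraction of user-item cells that hold a known rating."""
    if not matrix or n_items == 0:
        return 0.0
    filled = sum(len(row) for row in matrix.values())
    return filled / (len(matrix) * n_items)

# Hypothetical data: 3 users, 4 movies, 5 known ratings.
ratings = build_matrix([
    ("ann", "m1", 5), ("ann", "m2", 3),
    ("bob", "m1", 4),
    ("cy", "m3", 1), ("cy", "m4", 2),
])
```

The filtering problem is then to estimate values such as `ratings["bob"].get("m2")`, which is currently `None`.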
22. How to guess unknown rating?
Basic idea: suggest “similar” items.
Similar items are rated in a similar way by many different users.
Remember, “rating” could be a click, a like, a purchase.
o “Users who bought A also bought B...”
o “Users who clicked A also clicked B...”
o “Users who shared A also shared B...”
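The "users who bought A also bought B" idea can be sketched by counting how often two items appear in the same user's history. This is raw co-occurrence counting, the simplest possible variant; the basket data and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def co_counts(baskets):
    """Count, for each unordered item pair, how many users have both."""
    counts = Counter()
    for items in baskets.values():
        for a, b in combinations(sorted(set(items)), 2):
            counts[(a, b)] += 1
    return counts

def also_bought(item, baskets, top_n=3):
    """Items most often found alongside `item`, most frequent first."""
    scores = Counter()
    for (a, b), count in co_counts(baskets).items():
        if a == item:
            scores[b] += count
        elif b == item:
            scores[a] += count
    return [other for other, _ in scores.most_common(top_n)]

baskets = {
    "u1": ["A", "B"],
    "u2": ["A", "B", "C"],
    "u3": ["A", "B"],
    "u4": ["C", "D"],
}
```

Raw counts favor items that are popular with everyone, which is one reason normalized measures such as cosine similarity are often used instead.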
25. Other distance measures
“adjusted cosine similarity”
Subtracts average rating for each user, to compensate for general
enthusiasm (“most movies suck” vs. “most movies are great”)
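A sketch of adjusted cosine similarity between two items, assuming ratings are stored as a dictionary of per-user dictionaries (user -> {item: rating}). Each user's mean rating is subtracted before the cosine is taken, and only users who rated both items contribute.

```python
import math

def adjusted_cosine(matrix, item_a, item_b):
    """Cosine similarity of two items after removing each user's mean."""
    num = den_a = den_b = 0.0
    for ratings in matrix.values():
        if item_a in ratings and item_b in ratings:
            mean = sum(ratings.values()) / len(ratings)
            da = ratings[item_a] - mean
            db = ratings[item_b] - mean
            num += da * db
            den_a += da * da
            den_b += db * db
    if den_a == 0 or den_b == 0:
        return 0.0  # assumption: no usable overlap means "no similarity"
    return num / math.sqrt(den_a * den_b)

# Hypothetical users: one rates low in general, one rates high.
matrix = {
    "grump": {"A": 1, "B": 1, "C": 3},
    "fan": {"A": 5, "B": 5, "C": 3},
}
```

After mean-centering, both users agree that A and B sit on the same side of their personal average, so the pair scores close to 1 despite the very different raw ratings.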
29. Matrix factorization plate model
r = user rating of item
u = topics for user (plate over i users)
v = topics for item (plate over j items)
λu = variation in user topics
λv = variation in item topics
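One way to fit such a model is stochastic gradient descent on the squared rating error, where each rating r is approximated by the dot product of a user vector u and an item vector v. In this sketch λu and λv are treated as L2 regularization weights, which is the usual reading of the plate model's variance parameters but is an assumption here; the learning rate, epoch count, and toy data are also illustrative.

```python
import random

def factorize(ratings, n_users, n_items, k=2, lam_u=0.05, lam_v=0.05,
              lr=0.02, epochs=500, seed=0):
    """Learn user vectors U and item vectors V so that U[i].V[j] ~ r."""
    rng = random.Random(seed)
    U = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    V = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for i, j, r in ratings:
            err = r - sum(a * b for a, b in zip(U[i], V[j]))
            for f in range(k):
                u_f, v_f = U[i][f], V[j][f]  # use old values for both updates
                U[i][f] += lr * (err * v_f - lam_u * u_f)
                V[j][f] += lr * (err * u_f - lam_v * v_f)
    return U, V

def predict(U, V, i, j):
    """Predicted rating of item j by user i."""
    return sum(a * b for a, b in zip(U[i], V[j]))

# Hypothetical (user, item, rating) triples.
data = [(0, 0, 5), (0, 1, 1), (1, 0, 4), (1, 1, 1), (2, 0, 1), (2, 1, 5)]
U, V = factorize(data, n_users=3, n_items=2)
```

Unknown entries are then guessed with `predict(U, V, i, j)` for any user-item pair, including pairs that never appeared in the training data.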
31. Different Filtering Systems
Content:
Newsblaster analyzes the topics in the documents.
No concept of users.
Social:
What I see on Twitter determined by who I follow.
Reddit comments filtered by votes as input.
Amazon "people who bought X also bought Y” - no content analysis.
Hybrid:
Recommend based both on content and user behavior.
33. Content modeling - LDA
[Plate diagram] Nodes: topic concentration parameter, topics in doc, topic for word, word in doc, words in topics, word concentration parameter.
Plates: K topics, D docs, N words in doc.
34. Collaborative Topic Modeling
[Plate diagram] Nodes: topic concentration, topics in doc (content), topic for word, word in doc, weight of user selections, topics in doc (collaborative), variation in per-user topics, topics for user, user rating of doc.
Plates: K topics.
53. Graph of political book sales during 2008 U.S. election, by orgnet.org
From Amazon "users who bought X also bought Y" data.
54. Retweet network of political tweets.
Political Polarization on Twitter, Conover et al.
55. Instagram co-tag graph, highlighting three distinct topical communities: 1) pro-Israeli
(Orange), 2) pro-Palestinian (Yellow), and 3) Religious / Muslim (Purple)
Gilad Lotan, Betaworks
56. The Filter Bubble
What people care about politically, and what they’re motivated to do something
about, is a function of what they know about and what they see in their media.
... People see something about the deficit on the news, and they say, ‘Oh, the
deficit is the big problem.’ If they see something about the environment, they
say the environment is a big problem.
This creates this kind of a feedback loop in which your media influences your
preferences and your choices; your choices influence your media; and you
really can go down a long and narrow path, rather than actually seeing the
whole set of issues in front of us.
- Eli Pariser,
How do we recreate a front-page ethos for a digital world?
57. Are filters causing our bubbles?
Increasing U.S. polarization predates Internet by decades.
58. Is the Internet Causing Political Polarization? Evidence from Demographics
Boxell, Gentzkow, Shapiro
Polarization increasing fastest
among those who are online the least
59. Exposure to Diverse Information on Facebook,
Eytan Bakshy, Lada Adamic, Solomon Messing
Will you see diverse content vs. will you click it?
61. Item Content: text analysis, topic modeling, clustering...
My Data: who I follow, what I’ve read/liked
Other Users’ Data: social network structure, other users’ likes
62. Filter design problem
Formally, given
U = user preferences, history, characteristics
S = current story
{P} = results of function on previous stories
{B} = background world knowledge (other users?)
Define
r(S,U,{P},{B}) in [0...1]
relevance of story S to user U
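To make the signature concrete, here is a deliberately toy relevance function with the same four inputs. Everything inside it is an assumption for illustration: the `Story` and `User` types, topic overlap as the personal signal, the fraction of other users who liked the story as the background signal, the equal weighting, and the rule that already-scored stories get 0.

```python
from dataclasses import dataclass, field

@dataclass
class User:
    interests: set = field(default_factory=set)

@dataclass
class Story:
    story_id: str
    topics: set = field(default_factory=set)

def relevance(story, user, previous, background):
    """Toy r(S, U, {P}, {B}) returning a value in [0, 1].

    previous:   dict of story_id -> earlier score (the {P} term)
    background: dict of story_id -> fraction of other users who liked it
    """
    if story.story_id in previous:
        return 0.0  # assumption: do not resurface a story already scored
    union = story.topics | user.interests
    personal = len(story.topics & user.interests) / len(union) if union else 0.0
    social = background.get(story.story_id, 0.0)
    score = 0.5 * personal + 0.5 * social
    return min(1.0, max(0.0, score))
```

Even this toy version shows where design choices hide: the weights, the similarity measure, and what counts as "background" all shape what the user ends up seeing.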
63. Filter design problem, restated
When should a user see a story?
Aspects to this question:
• normative
o personal: what I want
o societal: emergent group effects
• UI: how do I tell the computer what I want?
• technical: constrained by algorithmic possibility
• economic: cheap enough to deploy widely
67. How to evaluate/optimize?
• Netflix: try to predict the rating that the user gives a movie
after watching it.
• Amazon: sell more stuff.
• Google, Facebook: human raters A/B test every change (but
what do they optimize for?)
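For the rating-prediction objective in the Netflix bullet, a standard score is root mean squared error (RMSE) between predicted and actual ratings, which was the metric used in the Netflix Prize. A minimal sketch with made-up numbers:

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    total = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(total / len(actual))

# Hypothetical predictions vs. the ratings users actually gave.
error = rmse([4.0, 2.0], [5.0, 4.0])
```

A lower RMSE means better predictions of ratings, which is not necessarily the same thing as better recommendations; the other bullets optimize for different outcomes.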
68. How to evaluate/optimize?
• Does the user understand how the filter works?
• Can they configure it as desired?
• Controls for abuse and harassment
• Can it be gamed? Spam, "user-generated censorship," etc.
69. Information diet
The holy grail in this model, as far as I’m
concerned, would be a Firefox plugin that would
passively watch your websurfing behavior and
characterize your personal information
consumption. Over the course of a week, it might
let you know that you hadn’t encountered any
news about Latin America, or remind you that a full
40% of the pages you read had to do with Sarah
Palin. It wouldn’t necessarily prescribe changes in
your behavior, simply help you monitor your own
consumption in the hopes that you might make
changes.
- Ethan Zuckerman,
Playing the Internet with PMOG