Frontiers of
Computational Journalism
Columbia Journalism School
Week 3: Information Filter Design
September 26, 2016
This class
• The need for information filtering
• Filtering algorithms
• Human-machine filters
• Filter bubbles and other problems
• The filter design problem
The Need for Filtering
More video on YouTube than TV networks produced during the
entire 20th century.
10,000 legally required reports filed by U.S. public
companies every day
Each day, the Associated Press publishes:
~10,000 text stories
~3,000 photographs
~500 videos
+ radio, interactive…
Comment Ranking
Comment voting
Problem: putting comments with most votes at top doesn’t work.
Why?
Old reddit comment ranking
“Hot” algorithm.
Up – down votes plus time
decay
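A minimal sketch of "up minus down votes plus time decay", following the form of the hot ranking in reddit's formerly open-source code as it is commonly described. The constants (the epoch offset and the 45000-second divisor) come from that description and are assumptions here, not something stated on the slide.

```python
import math

REDDIT_EPOCH = 1134028003  # assumption: offset used in the commonly cited version
DECAY_SECONDS = 45000      # assumption: 12.5 hours of age offsets a 10x vote gap


def hot(ups, downs, posted_at):
    """Score from net votes (log scale) plus a bonus for newer submissions.

    posted_at is a Unix timestamp in seconds.
    """
    net = ups - downs
    order = math.log10(max(abs(net), 1))   # 10 votes -> 1, 100 votes -> 2, ...
    sign = (net > 0) - (net < 0)           # +1, 0 or -1
    return sign * order + (posted_at - REDDIT_EPOCH) / DECAY_SECONDS


if __name__ == "__main__":
    now = REDDIT_EPOCH + 90000
    print(hot(100, 0, now), hot(10, 0, now + DECAY_SECONDS))
```

Under these constants, a tenfold increase in net votes is worth the same as being 12.5 hours newer, so older items sink unless they keep collecting votes.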
Reddit Comment Ranking (new)
Hypothetically, suppose all users voted on the comment, and v out of N
up-voted. Then we could sort by proportion p = v/N of upvotes.
N=16
v = 11
p = 11/16 = 0.6875
Reddit Comment Ranking
Actually, only n users out of N vote, giving an observed approximate
proportion p’ = v’/n
n=3
v’ = 1
p’ = 1/3 = 0.333
Reddit Comment Ranking
Limited sampling can rank votes wrong when we don’t have enough
data.
Comment 1: p’ = 0.333, p = 0.6875
Comment 2: p’ = 0.75, p = 0.1875
Confidence interval
1-𝛼 probability that the true value p will lie within the central
region (when sampled assuming p=p’)
Rank comments by lower bound
of confidence interval
p’ = observed proportion of upvotes
n = how many people voted
zα = how certain we want to be before we assume that p’ is “close” to
the true p
Analytic solution for confidence interval, known as “Wilson score”
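The formula itself did not survive extraction. The lower bound of the Wilson score interval, written with the slide's symbols p’, n and zα, is usually given as:

```latex
\[
\text{lower bound} =
\frac{p' + \dfrac{z_\alpha^2}{2n}
      - z_\alpha \sqrt{\dfrac{p'(1-p')}{n} + \dfrac{z_\alpha^2}{4n^2}}}
     {1 + \dfrac{z_\alpha^2}{n}}
\]
```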
How not to sort by average rating, Evan Miller
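A minimal sketch of ranking by the lower bound of the Wilson score interval. The function name, the default z = 1.96 (roughly 95% confidence) and the vote counts in the usage lines are assumptions chosen for illustration.

```python
import math


def wilson_lower_bound(upvotes, n, z=1.96):
    """Lower bound of the Wilson score interval for the upvote proportion.

    upvotes: observed upvotes, n: total votes cast, z: normal quantile.
    """
    if n == 0:
        return 0.0  # no votes yet: nothing is known, rank at the bottom
    p = upvotes / n
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - spread) / (1 + z * z / n)


if __name__ == "__main__":
    # Hypothetical comments as (upvotes, total votes).
    comments = {"many votes": (11, 16), "few votes": (3, 4)}
    ranked = sorted(comments, key=lambda c: wilson_lower_bound(*comments[c]),
                    reverse=True)
    print(ranked)
```

With these hypothetical counts, the comment with 3 of 4 upvotes has the higher observed proportion, but the comment with 11 of 16 has the higher lower bound, because more votes give a narrower interval.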
User-item Recommendation
User-item matrix
Stores “rating” of each user for each item. Could also be
binary variable that says whether user clicked, liked,
starred, shared, purchased...
User-item matrix
• No content analysis. We know nothing about what is “in” each item.
• Typically very sparse – a user hasn’t watched even 1% of all
movies.
• Filtering problem is guessing “unknown” entry in matrix. High
guessed values are things user would want to see.
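A small sketch of how a sparse user-item matrix might be stored, assuming a dictionary per user in which missing keys are the "unknown" entries. The users, items and ratings are hypothetical.

```python
# Hypothetical ratings: user -> {item: rating}. A missing item means "unknown".
ratings = {
    "ann": {"A": 5, "B": 4},
    "bob": {"A": 4, "C": 2},
    "cat": {"B": 5, "C": 1, "D": 4},
}


def all_items(ratings):
    """Every item that at least one user has rated."""
    return sorted({item for row in ratings.values() for item in row})


def sparsity(ratings):
    """Fraction of the user-item matrix that has no rating."""
    known = sum(len(row) for row in ratings.values())
    return 1 - known / (len(ratings) * len(all_items(ratings)))


def unknown_entries(ratings):
    """The (user, item) pairs a recommender would have to guess."""
    items = all_items(ratings)
    return [(u, i) for u in sorted(ratings) for i in items if i not in ratings[u]]


if __name__ == "__main__":
    print(sparsity(ratings))
    print(unknown_entries(ratings))
```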
Filtering process
Item-Based Collaborative Filtering Recommendation Algorithms, Sarwar et al
How to guess unknown rating?
Basic idea: suggest “similar” items.
Similar items are rated in a similar way by many different users.
Remember, “rating” could be a click, a like, a purchase.
o “Users who bought A also bought B...”
o “Users who clicked A also clicked B...”
o “Users who shared A also shared B...”
Similar items
Item-Based Collaborative Filtering Recommendation Algorithms, Sarwar et al
Item similarity
Cosine similarity!
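A minimal sketch of cosine similarity between two items, treating each item as a vector of ratings indexed by user. Treating a missing rating as zero is an assumption made here for simplicity; the ratings are hypothetical.

```python
import math

# Hypothetical ratings: user -> {item: rating}.
ratings = {
    "ann": {"A": 5, "B": 4},
    "bob": {"A": 4, "C": 2},
    "cat": {"B": 5, "C": 1, "D": 4},
}


def item_vector(ratings, item):
    """Sparse vector for one item: user -> rating, for users who rated it."""
    return {user: row[item] for user, row in ratings.items() if item in row}


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (missing entries = 0)."""
    dot = sum(a[u] * b[u] for u in a.keys() & b.keys())
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    a, b = item_vector(ratings, "A"), item_vector(ratings, "B")
    print(cosine_similarity(a, b))
```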
Other distance measures
“adjusted cosine similarity”
Subtracts average rating for each user, to compensate for general
enthusiasm (“most movies suck” vs. “most movies are great”)
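A minimal sketch of adjusted cosine similarity, assuming the sum runs over users who rated both items and that each user's mean is taken over all of that user's ratings. The two users are hypothetical: one rates everything low, one rates everything high.

```python
import math

# Hypothetical ratings: a harsh rater and an enthusiastic rater.
ratings = {
    "harsh": {"A": 1, "B": 2, "C": 3},
    "keen": {"A": 3, "B": 5, "C": 4},
}


def adjusted_cosine(ratings, item_i, item_j):
    """Cosine similarity after subtracting each user's own average rating."""
    num = den_i = den_j = 0.0
    for row in ratings.values():
        if item_i in row and item_j in row:
            mean = sum(row.values()) / len(row)   # this user's general enthusiasm
            di, dj = row[item_i] - mean, row[item_j] - mean
            num += di * dj
            den_i += di * di
            den_j += dj * dj
    if den_i == 0 or den_j == 0:
        return 0.0
    return num / math.sqrt(den_i * den_j)


if __name__ == "__main__":
    print(adjusted_cosine(ratings, "A", "C"))
```

Both users rate A below their own average, so A looks dissimilar to C even though the raw scores for A differ a lot between the two users.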
Generating a recommendation
Weighted average of item ratings by their similarity.
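A minimal sketch of the weighted average, assuming the prediction for a target item is the similarity-weighted mean of the ratings the user has already given, normalized by the sum of absolute similarities. The function name and the numbers in the usage lines are hypothetical.

```python
def predict_rating(user_ratings, similarities):
    """Guess a user's rating for a target item.

    user_ratings: {item: rating} for items this user has rated.
    similarities: {item: similarity between that item and the target item}.
    Returns None when no rated item has a nonzero similarity to the target.
    """
    shared = [item for item in user_ratings if item in similarities]
    num = sum(similarities[item] * user_ratings[item] for item in shared)
    den = sum(abs(similarities[item]) for item in shared)
    return num / den if den else None


if __name__ == "__main__":
    # The user rated A highly and B poorly; the target is much more like A.
    print(predict_rating({"A": 5, "B": 3}, {"A": 0.9, "B": 0.1}))
```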
Matrix factorization recommender
Matrix factorization recommender
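The slide's formula did not survive extraction. A standard regularized objective for matrix factorization, assumed here rather than recovered from the slide, approximates each rating by the inner product of a user vector u_i and an item vector v_j:

```latex
\[
\min_{U,V} \sum_{(i,j)\ \text{observed}} \left( r_{ij} - u_i^{\top} v_j \right)^2
  + \lambda_u \sum_i \lVert u_i \rVert^2
  + \lambda_v \sum_j \lVert v_j \rVert^2
\]
```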
Note: sum only over observed ratings r_ij.
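A minimal sketch of fitting such a factorization by stochastic gradient descent on a regularized squared error, touching only the observed ratings. The optimizer, the hyperparameters (k, lam, lr, epochs) and the toy data are all assumptions for illustration, not the method used on the slide.

```python
import random


def predict(U, V, i, j):
    """Predicted rating: inner product of user vector i and item vector j."""
    return sum(a * b for a, b in zip(U[i], V[j]))


def squared_error(observed, U, V):
    """Sum of squared errors over the observed ratings only."""
    return sum((r - predict(U, V, i, j)) ** 2 for i, j, r in observed)


def factorize(observed, n_users, n_items, k=2, lam=0.05, lr=0.02,
              epochs=500, seed=0):
    """observed: list of (user index, item index, rating) triples."""
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for i, j, r in observed:          # unobserved entries are never visited
            err = r - predict(U, V, i, j)
            for f in range(k):
                u, v = U[i][f], V[j][f]
                U[i][f] += lr * (err * v - lam * u)
                V[j][f] += lr * (err * u - lam * v)
    return U, V


if __name__ == "__main__":
    # Hypothetical ratings for 3 users and 3 items; (0, 2), (1, 1), (2, 0) unknown.
    observed = [(0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 2, 2), (2, 1, 5), (2, 2, 1)]
    U, V = factorize(observed, n_users=3, n_items=3)
    print(squared_error(observed, U, V), predict(U, V, 0, 2))
```

The last line prints a guess for an entry that was never observed, which is the filtering problem described earlier.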
Matrix factorization plate model
[Figure: plate diagram. r = user rating of item; u = topics for user, with λu controlling variation in user topics; v = topics for item, with λv controlling variation in item topics. Plates: i users, j items.]
New York Times recommender
Different Filtering Systems
Content:
Newsblaster analyzes the topics in the documents.
No concept of users.
Social:
What I see on Twitter determined by who I follow.
Reddit comments filtered by votes as input.
Amazon “people who bought X also bought Y” - no content analysis.
Hybrid:
Recommend based both on content and user behavior.
Combining collaborative filtering
and topic modeling
Collaborative Topic Modeling for Recommending Scientific Articles, Wang and Blei
Content modeling - LDA
[Figure: LDA plate diagram. Nodes: topic concentration parameter, topics in doc, topic for word, word in doc, words in topics, word concentration parameter. Plates: K topics, D docs, N words in doc.]
Collaborative Topic Modeling
[Figure: plate diagram. Content side: topic concentration, topics in doc (content), topic for word, word in doc, K topics. Collaborative side: topics in doc (collaborative), weight of user selections, topics for user, variation in per-user topics, user rating of doc.]
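As a hedged summary of how the content and collaborative sides connect, the generative assumptions in the cited Wang and Blei paper are, as far as I recall them, that each document's item vector is its topic proportions plus a learned offset. The symbols below follow that paper rather than the slide labels:

```latex
\begin{align*}
u_i &\sim \mathcal{N}(0, \lambda_u^{-1} I_K) \\
\theta_j &\sim \mathrm{Dirichlet}(\alpha) \\
v_j &= \theta_j + \epsilon_j, \qquad \epsilon_j \sim \mathcal{N}(0, \lambda_v^{-1} I_K) \\
r_{ij} &\sim \mathcal{N}(u_i^{\top} v_j,\; c_{ij}^{-1})
\end{align*}
```

A document with no ratings keeps v_j close to its topic proportions, so it can still be recommended from content alone.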
[Figure panels: content only; content + social]
Filtering News on Twitter
Reuters News Tracer
[Figure: pipeline stages: Filter; Cluster into events; Searches and Alerts; Score veracity & newsworthiness]
Liu et al., Reuters Tracer: A Large Scale System of Detecting &
Verifying Real-Time News Events from Twitter
Human-Machine Filters
TechMeme / MediaGazer
Facebook trending (with editors)
Facebook trending (without editors)
Facebook “trending review tool” screenshot from leaked documents
Approve or Reject: Can You Moderate Five New York Times Comments?
Revealed: Facebook's internal rulebook on sex, terrorism and violence, The Guardian
Facebook’s “Community Standards” document
Filter bubbles and other problems
Graph of political book sales during 2008 U.S. election, by orgnet.org
From Amazon "users who bought X also bought Y" data.
Retweet network of political tweets.
Political Polarization on Twitter, Conover et al.
Instagram co-tag graph, highlighting three distinct topical communities: 1) pro-Israeli
(Orange), 2) pro-Palestinian (Yellow), and 3) Religious / Muslim (Purple)
Gilad Lotan, Betaworks
The Filter Bubble
What people care about politically, and what they’re motivated to do something
about, is a function of what they know about and what they see in their media.
... People see something about the deficit on the news, and they say, ‘Oh, the
deficit is the big problem.’ If they see something about the environment, they
say the environment is a big problem.
This creates this kind of a feedback loop in which your media influences your
preferences and your choices; your choices influence your media; and you
really can go down a long and narrow path, rather than actually seeing the
whole set of issues in front of us.
- Eli Pariser,
How do we recreate a front-page ethos for a digital world?
Are filters causing our bubbles?
Increasing U.S. polarization predates the Internet by decades.
Is the Internet Causing Political Polarization? Evidence from Demographics
Boxell, Gentzkow, Shapiro
Polarization increasing fastest
among those who are online the least
Exposure to Diverse Information on Facebook,
Eytan Bakshy, Lada Adamic, Solomon Messing
Will you see diverse content vs. will you click it?
Filter Design
Item Content: text analysis, topic modeling, clustering...
My Data: who I follow, what I’ve read/liked
Other Users’ Data: social network structure, other users’ likes
Filter design problem
Formally, given
U = user preferences, history, characteristics
S = current story
{P} = results of function on previous stories
{B} = background world knowledge (other users?)
Define
r(S, U, {P}, {B}) in [0, 1]
relevance of story S to user U
Filter design problem, restated
When should a user see a story?
Aspects to this question:
• normative
  o personal: what I want
  o societal: emergent group effects
• UI
  o how do I tell the computer what I want?
• technical
  o constrained by algorithmic possibility
• economic
  o cheap enough to deploy widely
“Conversational health”
Measuring the health of our public conversations, Cortico.ai
Exposure diversity as a design principle for recommender systems, Natali Helberger
How to evaluate/optimize?
• Netflix: try to predict the rating that the user gives a movie
after watching it.
• Amazon: sell more stuff.
• Google, Facebook: human raters A/B test every change (but
what do they optimize for?)
• Does the user understand how the filter works?
• Can they configure it as desired?
• Controls for abuse and harassment
• Can it be gamed? Spam, "user-generated censorship," etc.
Information diet
The holy grail in this model, as far as I’m
concerned, would be a Firefox plugin that would
passively watch your websurfing behavior and
characterize your personal information
consumption. Over the course of a week, it might
let you know that you hadn’t encountered any
news about Latin America, or remind you that a full
40% of the pages you read had to do with Sarah
Palin. It wouldn’t necessarily prescribe changes in
your behavior, simply help you monitor your own
consumption in the hopes that you might make
changes.
- Ethan Zuckerman,
Playing the Internet with PMOG

Frontiers of Computational Journalism week 3 - Information Filter Design

  • 1. Frontiers of Computational Journalism Columbia Journalism School Week 3: Information Filter Design September 26, 2016
  • 2. This class • The need for information filtering • Filtering algorithms • Human-machine filters • Filter bubbles and other problems • The filter design problem
  • 3. The Need for Filtering
  • 4.
  • 5.
  • 6.
  • 7. More video on YouTube than produced by TV networks during entire 20th century.
  • 8. 10,000 legally-required reports filed by U.S. public companies every day
  • 9. Each day, the Associated Press publishes: ~10,000 text stories ~3,000 photographs ~500 videos + radio, interactive…
  • 11. Comment voting Problem: putting comments with most votes at top doesn’t work. Why?
  • 12. Old reddit comment ranking “Hot” algorithm. Up – down votes plus time decay
  • 13. Reddit Comment Ranking (new) Hypothetically, suppose all users voted on the comment, and v out of N up-voted. Then we could sort by proportion p = v/N of upvotes. N=16 v = 11 p = 11/16 = 0.6875
  • 14. Reddit Comment Ranking Actually, only n users out of N vote, giving an observed approximate proportion p’ = v’/n n=3 v’ = 1 p’ = 1/3 = 0.333
  • 15. Reddit Comment Ranking Limited sampling can rank votes wrong when we don’t have enough data. p’ = 0.333 p = 0.6875 p’ = 0.75 p = 0.1875
  • 16. Confidence interval 1-𝛼 probability that the true value p will lie within the central region (when sampled assuming p=p’)
  • 17. Rank comments by lower bound of confidence interval p’ = observed proportion of upvotes n = how many people voted zα= how certain do we want to be before we assume that p’ is “close” to true p Analytic solution for confidence interval, known as “Wilson score” How not to sort by average rating, Evan Miller
  • 19. User-item matrix Stores “rating” of each user for each item. Could also be binary variable that says whether user clicked, liked, starred, shared, purchased...
  • 20. User-item matrix • No content analysis. We know nothing about what is “in” each item. • Typically very sparse – a user hasn’t watched even 1% of all movies. • Filtering problem is guessing “unknown” entry in matrix. High guessed values are things user would want to see.
  • 21. Filtering process Item-Based Collaborative Filtering Recommendation Algorithms, Sarwar et al
  • 22. How to guess unknown rating? Basic idea: suggest “similar” items. Similar items are rated in a similar way by many different users. Remember, “rating” could be a click, a like, a purchase. o “Users who bought A also bought B...” o “Users who clicked A also clicked B...” o “Users who shared A also shared B...”
  • 23. Similar items Item-Based Collaborative Filtering Recommendation Algorithms, Sarwar et al
  • 25. Other distance measures “adjusted cosine similarity” Subtracts average rating for each user, to compensate for general enthusiasm (“most movies suck” vs. “most movies are great”)
  • 26. Generating a recommendation Weighted average of item ratings by their similarity.
  • 28. Matrix factorization recommender Note: only sum over observed ratings rij.
  • 29. Matrix factorization plate model r v u user rating of item variation in user topics λu λv variation in item topics topics for user topics for item i users j items
  • 30. New York Times recommender
  • 31. Different Filtering Systems Content: Newsblaster analyzes the topics in the documents. No concept of users. Social: What I see on Twitter determined by who I follow. Reddit comments filtered by votes as input. Amazon "people who bought X also bought Y” - no content analysis. Hybrid: Recommend based both on content and user behavior.
  • 32. Combining collaborative filtering and topic modeling Collaborative Topic Modeling for Recommending Scientific Articles, Wang and Blei
  • 33. K topics topic for word word in doc topics in doc topic concentration parameter word concentration parameter Content modeling - LDA D docs words in topics N words in doc
  • 34. K topicstopic for word word in doctopics in doc (content) topic concentration weight of user selections variation in per-user topics topics for user user rating of doctopics in doc (collaborative) Collaborative Topic Modeling
  • 36. Filtering News on Twitter
  • 37.
  • 38.
  • 39.
  • 40. Reuters News Tracer Filter Cluster into events Searches and Alerts Score veracity & newsworthy
  • 41. Liu et. al, Reuters Tracer: A Large Scale System of Detecting & Verifying Real-Time News Events from Twitter
  • 42. Liu et. al, Reuters Tracer: A Large Scale System of Detecting & Verifying Real-Time News Events from Twitter
  • 43. Liu et. al, Reuters Tracer: A Large Scale System of Detecting & Verifying Real-Time News Events from Twitter
  • 48. Facebook “trending review tool” screenshot from leaked documents
  • 49. Approve or Reject: Can You Moderate Five New York Times Comments?
  • 50. Revealed: Facebook's internal rulebook on sex, terrorism and violence, The Guardian
  • 52. Filter bubbles and other problems
  • 53. Graph of political book sales during 2008 U.S. election, by orgnet.org From Amazon "users who bought X also bought Y" data.
  • 54. Retweet network of political tweets. Political Polarization on Twitter, Conover, et. al.,
  • 55. Instagram co-tag graph, highlighting three distinct topical communities: 1) pro-Israeli (Orange), 2) pro-Palestinian (Yellow), and 3) Religious / muslim (Purple) Gilad Lotan, Betaworks
  • 56. The Filter Bubble What people care about politically, and what they’re motivated to do something about, is a function of what they know about and what they see in their media. ... People see something about the deficit on the news, and they say, ‘Oh, the deficit is the big problem.’ If they see something about the environment, they say the environment is a big problem. This creates this kind of a feedback loop in which your media influences your preferences and your choices; your choices influence your media; and you really can go down a long and narrow path, rather than actually seeing the whole set of issues in front of us. - Eli Pariser, How do we recreate a front-page ethos for a digital world?
  • 57. Are filters causing our bubbles? Increasing U.S. polarization predates the Internet by decades.
  • 58. Is the Internet Causing Political Polarization? Evidence from Demographics, Boxell, Gentzkow, Shapiro. Polarization is increasing fastest among those who are online the least.
  • 59. Exposure to Diverse Information on Facebook, Eytan Bakshy, Lada Adamic, Solomon Messing. Will you see diverse content vs. will you click it?
  • 61. Item Content: text analysis, topic modeling, clustering... My Data: who I follow, what I’ve read/liked. Other Users’ Data: social network structure, other users’ likes.
  • 62. Filter design problem: Formally, given U = user preferences, history, characteristics; S = current story; {P} = results of function on previous stories; {B} = background world knowledge (other users?). Define r(S,U,{P},{B}) in [0...1], the relevance of story S to user U.
  • 63. Filter design problem, restated: When should a user see a story? Aspects to this question: normative (personal: what I want; societal: emergent group effects); UI (how do I tell the computer what I want?); technical (constrained by algorithmic possibility); economic (cheap enough to deploy widely).
  • 64. “Conversational health” Measuring the health of our public conversations, Cortico.ai
  • 65. Exposure diversity as a design principle for recommender systems, Natali Helberger
  • 67. How to evaluate/optimize? • Netflix: try to predict the rating that the user gives a movie after watching it. • Amazon: sell more stuff. • Google, Facebook: human raters A/B test every change (but what do they optimize for?)
  • 68. How to evaluate/optimize? • Does the user understand how the filter works? • Can they configure it as desired? • Controls for abuse and harassment • Can it be gamed? Spam, "user-generated censorship," etc.
  • 69. Information diet The holy grail in this model, as far as I’m concerned, would be a Firefox plugin that would passively watch your websurfing behavior and characterize your personal information consumption. Over the course of a week, it might let you know that you hadn’t encountered any news about Latin America, or remind you that a full 40% of the pages you read had to do with Sarah Palin. It wouldn’t necessarily prescribe changes in your behavior, simply help you monitor your own consumption in the hopes that you might make changes. - Ethan Zuckerman, Playing the Internet with PMOG
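
The formal statement on slide 62 defines a relevance function r(S,U,{P},{B}) with values in [0...1] but does not say how to compute it. The following is a minimal Python sketch of one possible shape for such a function, not a method given in the slides. Everything specific here is an assumption made for illustration: stories are reduced to sets of topic tags, user preferences to per-topic weights, {P} to the topic sets of stories already shown, {B} to the fraction of other users who engaged with each topic, and the three mixing weights are arbitrary.

```python
from dataclasses import dataclass, field


@dataclass
class Story:
    """S: the current story, reduced to a set of topic tags (an assumption)."""
    story_id: str
    topics: frozenset


@dataclass
class User:
    """U: user preferences as hypothetical per-topic weights in [0, 1]."""
    interests: dict = field(default_factory=dict)


def relevance(story, user, previous, background,
              w_personal=0.6, w_social=0.3, w_novelty=0.1):
    """Return an illustrative r(S, U, {P}, {B}) in [0, 1].

    previous   -- {P}: topic sets of stories the user was already shown
    background -- {B}: fraction of other users who engaged with each topic
    The weights are illustrative and do not come from the slides.
    """
    if not story.topics:
        return 0.0
    n = len(story.topics)
    # Personal term: average of the user's interest in the story's topics.
    personal = sum(user.interests.get(t, 0.0) for t in story.topics) / n
    # Social term: average engagement of other users with those topics.
    social = sum(background.get(t, 0.0) for t in story.topics) / n
    # Novelty term: 1 minus the highest Jaccard overlap with an earlier story.
    overlap = max(
        (len(story.topics & p) / len(story.topics | p) for p in previous),
        default=0.0,
    )
    novelty = 1.0 - overlap
    score = w_personal * personal + w_social * social + w_novelty * novelty
    # Clamp in case the inputs fall outside [0, 1].
    return min(1.0, max(0.0, score))


if __name__ == "__main__":
    story = Story("s1", frozenset({"politics", "economy"}))
    user = User({"politics": 1.0, "economy": 0.5})
    shown_before = [frozenset({"sports"})]
    others = {"politics": 0.4, "economy": 0.2}
    print(relevance(story, user, shown_before, others))
```

With these made-up inputs the score comes out at about 0.64. The split into personal, social, and novelty terms loosely mirrors the "My Data", "Other Users’ Data", and "Item Content" columns on slide 61, but the choice of terms and weights is exactly the normative design question that slide 63 raises.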

Editor's Notes

  1. To open: https://code.fb.com/core-data/recommending-items-to-more-than-a-billion-people/ and the NYT comment quiz (in incognito): http://www.nytimes.com/interactive/2016/09/20/insider/approve-or-reject-moderation-quiz.html?_r=0
  2. Editors are filters
  3. Editors are filters
  4. https://www.businessinsider.com/facebook-news-feed-is-flawed-2016-5
  5. http://www.tubefilter.com/2014/12/01/youtube-300-hours-video-per-minute/
  6. https://medium.com/hacking-and-gonzo/how-reddit-ranking-algorithms-work-ef111e33d0d9
  7. See also http://jakevdp.github.io/blog/2014/06/12/frequentism-and-bayesianism-3-confidence-credibility/
  8. http://www.evanmiller.org/how-not-to-sort-by-average-rating.html
  9. http://www.cs.columbia.edu/~blei/papers/WangBlei2011.pdf
  10. http://www.cs.columbia.edu/~blei/papers/WangBlei2011.pdf
  11. http://www.cs.columbia.edu/~blei/papers/WangBlei2011.pdf
  12. http://www.cs.columbia.edu/~blei/papers/WangBlei2011.pdf
  13. http://www.cs.columbia.edu/~blei/papers/WangBlei2011.pdf
  14. http://open.blogs.nytimes.com/2015/08/11/building-the-next-new-york-times-recommendation-engine/?_r=0
  15. Reuters News Tracer
  16. https://www.researchgate.net/publication/309471330_Reuters_Tracer_A_Large_Scale_System_of_Detecting_Verifying_Real-Time_News_Events_from_Twitter
  17. https://www.researchgate.net/publication/309471330_Reuters_Tracer_A_Large_Scale_System_of_Detecting_Verifying_Real-Time_News_Events_from_Twitter
  18. https://www.researchgate.net/publication/309471330_Reuters_Tracer_A_Large_Scale_System_of_Detecting_Verifying_Real-Time_News_Events_from_Twitter
  19. https://www.researchgate.net/publication/309471330_Reuters_Tracer_A_Large_Scale_System_of_Detecting_Verifying_Real-Time_News_Events_from_Twitter
  20. http://news.techmeme.com/081203/automated
  21. https://www.theguardian.com/technology/2016/may/12/facebook-trending-news-leaked-documents-editor-guidelines
  22. https://assets.documentcloud.org/documents/2830513/Facebook-Trending-Review-Guidelines.pdf
  23. http://www.nytimes.com/interactive/2016/09/20/insider/approve-or-reject-moderation-quiz.html?_r=0
  24. https://www.theguardian.com/news/2017/may/21/revealed-facebook-internal-rulebook-sex-terrorism-violence
  25. https://www.facebook.com/communitystandards/introduction/
  26. https://www.vox.com/cards/congressional-dysfunction/what-is-political-polarization
  27. http://www.nber.org/papers/w23258
  28. https://research.fb.com/exposure-to-diverse-information-on-facebook-2/
  29. https://www.cortico.ai/blog/2018/2/29/public-sphere-health-indicators
  30. Photo from Munich algorithmic news conference