The document discusses criteria for selecting comments for the "NYT Picks" section of the New York Times website. It examines literature on positive criteria for inclusion such as thoughtfulness, brevity, relevance, and diversity. It poses research questions on whether NYT Picks comments reflect these criteria and whether algorithms could be developed to assess them and augment human moderation. While automation may scale moderation and improve the user experience, it also raises issues regarding over-generalization and the need for transparency.
1. Picking the NYT Picks:
Editorial Criteria and Automation
in the Curation of Online News Comments
Nicholas Diakopoulos
University of Maryland, College Park – College of Journalism
@ndiakopoulos | nickdiakopoulos.com | nad@umd.edu
4. “NYT Picks is the most popular comment queue. We
spend a lot of time tweaking that and getting that
right.”
What are criteria for selection?
How can we augment moderator capability to consider more comments?
5. Criteria from Literature
Negative / Exclusion
Personal attacks, profanity, abusive behavior
Positive / Inclusion
Internal Coherence
Thoughtfulness
Brevity / Length
Relevance
Fairness / Diversity
Novelty
Argument Quality
Criticality
Emotionality
Entertainment Value
Readability
Personal Experience
11. But automation also raises questions about
over-generalization across contexts, and
algorithmic transparency
12. Questions?
Contact
Nick Diakopoulos
University of Maryland, College of Journalism
Twitter: @ndiakopoulos
Email: nad@umd.edu
Web: http://www.nickdiakopoulos.com
More Info
N. Diakopoulos. The Editor’s Eye: Curation and Comment
Relevance on the New York Times. Proc. CSCW. March,
2015.
Editor's notes
On September 11, 2013, Vladimir Putin published an op-ed in the NYT. Among other things, he questioned American exceptionalism – and if there’s one thing you shouldn’t do in ’merica, it’s that. He was prodding the American public.
In response, comments flooded in – 6,367 of them, in fact. Of those, 4,447 were published along with the piece.
How could you possibly organize thousands of comments and find the interesting or insightful ones?
Like other commenting systems, users can vote up a comment by recommending it. Comments are sorted oldest first, or they can be filtered by their recommendation scores.
The published comments included 85 that were deemed NYT Picks, which garner a little badge and reflect the “most interesting and thoughtful” comments.
What makes this most impressive, though, is that each of those comments was read by a human moderator, a trained journalist at the NYT, before being published. That is, the NYT practices pre-moderation, in contrast to many other publications, which only look at comments after they’re published.
In fact, they’re read by a team led by Bassey Etim, the community manager at the NYT. Together with his team of 13 moderators, they read almost every comment before it’s posted to the site.
Part of that job is choosing the NYT Picks comments. “NYT Picks is the most popular comment queue. We spend a lot of time tweaking that and getting that right.”
As a baseline they’re looking for about 5 picks per 100 comments. Outside of blogs they do about 22 queues a day, but they’d like to open comments on more articles. So how could we help them scale up?
Talk about the potential benefits of selecting comments: signals norms and expectations for behavior, creating a beneficial feedback loop.
Positive criteria considered in the literature come from studies of letters to the editor, online comments for print publications, and on-air radio comments.
Readability: style, clarity, adherence to standard grammar, and the degree to which a comment is well-articulated.
Stress that operationalizing these is hard and there are many challenges for future work.
The focus of this work is initially on crowdsourcing ratings for 9 of these dimensions, excluding relevance, fairness, and novelty since they are much more difficult to measure using crowdsourcing; I also have a previous paper that looked at relevance explicitly.
The crowdsourcing approach collected human ratings of 8 of the 9 criteria here (because length is trivial to measure by counting words). 500 comments, 250 each of NYT Picks and non-Picks, were rated on a scale from 1 to 5 on Amazon Mechanical Turk, with 3 independent ratings of each comment. Restricted to workers with a reliable and substantial history, from the US or Canada. Collected 1,500 ratings from 89 different workers.
We measured Krippendorff’s alpha, a measure of interrater reliability, and got slight to moderate agreement among the 3 raters, except for entertainment value (so people couldn’t agree on what was funny).
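For interval-scaled ratings like these 1-to-5 scales, Krippendorff’s alpha can be computed directly from the per-comment rating lists. A minimal sketch (not the toolkit used in the study):

```python
from itertools import permutations

def krippendorff_alpha_interval(units):
    """Krippendorff's alpha for interval data.

    `units` is a list of per-comment rating lists, e.g. [[4, 5, 4], [2, 2, 3]].
    Units with fewer than two ratings are dropped, per the standard procedure.
    Returns 1 - D_o / D_e (observed over expected disagreement).
    """
    units = [u for u in units if len(u) >= 2]
    n = sum(len(u) for u in units)  # total number of pairable values
    # Observed disagreement: squared differences within each unit,
    # over ordered pairs, normalized by (m_u - 1).
    d_o = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n
    # Expected disagreement: squared differences over all pooled values.
    pooled = [v for u in units for v in u]
    d_e = sum((a - b) ** 2 for a, b in permutations(pooled, 2)) / (n * (n - 1))
    return 1.0 - d_o / d_e
```

Perfect agreement (identical ratings within every comment) yields alpha = 1.0; systematic disagreement drives it toward or below zero.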
Eventually would like to compute scores for all of these criteria automatically, but for now we do three of them.
Readability is the reading level according to the SMOG index, an index that measures the usage of more complex words. There was a high correlation between the SMOG index and the crowdsourced ratings of readability.
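A rough sketch of how a SMOG-style grade could be computed; the syllable counter below is a naive vowel-group heuristic, not the exact tokenization used in the study:

```python
import math
import re

def count_syllables(word):
    """Naive heuristic: count groups of consecutive vowels (min 1)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def smog_grade(text):
    """SMOG index: 1.0430 * sqrt(polysyllables * 30 / sentences) + 3.1291.

    Polysyllabic words are those with three or more syllables.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    return 1.0430 * math.sqrt(polysyllables * 30 / len(sentences)) + 3.1291
```

Higher grades indicate denser use of complex words per sentence, which is how the index proxies reading level.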
Personal experience is based on detecting the proportion of words from LIWC dictionaries that reflect 1st person personal pronouns as well as family and friends relationships. Comment tokens are stemmed to match the dictionary entries.
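A minimal sketch of such a dictionary-proportion score. The word list here is a tiny hypothetical stand-in for the actual LIWC categories, and `*` prefix wildcards stand in for the stemming step:

```python
import re

# Tiny stand-in for the LIWC categories used (1st-person pronouns,
# family, friends); the real dictionaries are much larger.
# Entries ending in "*" match any word with that prefix.
PERSONAL_WORDS = {"i", "me", "my", "mine", "we", "our",
                  "mother", "father", "son", "daughter",
                  "famil*", "friend*", "neighbor*"}

def matches(token, entries):
    """True if token equals an entry or matches a '*' prefix entry."""
    for entry in entries:
        if entry.endswith("*"):
            if token.startswith(entry[:-1]):
                return True
        elif token == entry:
            return True
    return False

def personal_experience_score(comment):
    """Proportion of tokens drawn from the personal-experience word lists."""
    tokens = re.findall(r"[a-z']+", comment.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if matches(t, PERSONAL_WORDS))
    return hits / len(tokens)
```

For example, a comment like “My father and I moved here” scores 0.5, since three of its six tokens hit the lists.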
So I found a statistically significant difference for all criteria except entertainment value and emotionality, and emotionality was marginally significant at p=0.08.
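The notes don’t spell out the exact test used; one simple, assumption-light way to check a group difference like this is a two-sided permutation test on the difference in mean ratings, sketched here with made-up scores:

```python
import random

def permutation_test(picks, non_picks, n_perm=5000, seed=0):
    """Two-sided permutation test on the difference in mean ratings."""
    rng = random.Random(seed)
    observed = abs(sum(picks) / len(picks) - sum(non_picks) / len(non_picks))
    pooled = list(picks) + list(non_picks)
    k = len(picks)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabeling of Picks vs non-Picks
        diff = abs(sum(pooled[:k]) / k - sum(pooled[k:]) / (len(pooled) - k))
        if diff >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one smoothed p-value

# Hypothetical crowdsourced "thoughtfulness" ratings (1-5 scale):
picks = [4, 5, 4, 4, 5, 4, 5, 4, 4, 5]
non_picks = [2, 3, 2, 3, 2, 2, 3, 3, 2, 2]
```

With clearly separated groups like these, almost no random relabeling reproduces the observed gap, so the p-value comes out small.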
Several of these criteria also correlated fairly well, such as thoughtfulness with readability, and argument quality with thoughtfulness. Future work might scale up the data collection and look at dimensionality reduction techniques.
All statistically significant at p=0.05 or lower.
Editorial selections (NYT Picks) do reflect many of the editorial criteria articulated in the literature: a continuity of professional criteria into the online space (except for brevity).
Online spaces don’t have the same space constraints, and we found NYT editors preferred longer comments for Picks. This raises the question of how well that serves users from their perspective.
The scores we computed, in particular the personal experience score, could have some really nice applications for amplifying the value of comments for moderators as well as reporters. In some follow-up work we’ve shown this to comment moderators, and they’re excited about the possibilities.
Automation could also enable new end-user experiences, where users adapt their own view of the comments based on automatically computed scores along journalistically interesting lines.
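One way such a user-adaptable view could work, sketched with hypothetical comments, score fields, and weights: let each reader weight the computed criterion scores and re-sort the comment list accordingly.

```python
def rerank(comments, weights):
    """Sort comments by a user-weighted combination of criterion scores.

    `comments` is a list of dicts with a 'text' field plus per-criterion
    scores; `weights` maps criterion name -> user-chosen weight.
    (Field and criterion names here are illustrative, not from the study.)
    """
    def score(c):
        return sum(w * c.get(name, 0.0) for name, w in weights.items())
    return sorted(comments, key=score, reverse=True)

comments = [
    {"text": "A", "readability": 0.9, "personal_experience": 0.1},
    {"text": "B", "readability": 0.4, "personal_experience": 0.8},
]
# A reader who values personal stories over polish:
view = rerank(comments, {"readability": 0.2, "personal_experience": 1.0})
```

Here comment B rises to the top because the reader’s weights favor personal experience; a different reader could weight readability instead.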
Over-generalization … different communities or topics (e.g. sports) require different treatment, so algorithmic solutions can’t be one-size-fits-all. Is it always better to highlight a highly readable comment, and when does that come into tension with diversity or fairness of perspectives?
Do Picks affect community or individual behavior?
Mention CommentIQ project at UMD, funded by the Knight Foundation
We’re going to be hiring a fellow or fellows, so if you’re interested in joining the lab, please come speak to me. We work on everything from data visualization to algorithmic accountability and transparency, as well as data mining of things like online comments. If you want to combine data and computing with design, in the context of journalism, please come talk.