Building Real Time, Open-Source Tools for Wikipedia
with Rob Kenedi
Presented at FITC Toronto 2015
More info at www.fitc.ca
OVERVIEW
Wikipedia is one of the most frequently visited websites in the world. The vast online encyclopedia, editable by anyone, has become the go-to source for general information on any subject. Building user-friendly apps on top of Wikipedia’s massive dataset involves overcoming a number of challenges, but it can also be a lot of fun. Join Rob Kenedi, Entrepreneur In Residence at The Working Group (TWG), as he shares lessons learned in TWG’s Lab building WikiWash, a free tool for journalists that helps them uncover spin and bias in Wikipedia.
OBJECTIVE
Learn how to build useful products using Wikipedia’s dataset.
TARGET AUDIENCE
Software developers, data journalists, product managers
ASSUMED AUDIENCE KNOWLEDGE
What Wikipedia is, and how web applications work
FIVE THINGS AUDIENCE MEMBERS WILL LEARN
How we built an open-source tool for journalists using Wikipedia
How to manage the massive amounts of data in Wikipedia
How to turn a non-technical pitch presentation into a working product that the client loves
How TWG Labs treats its projects, products, and prototypes, and what happens to them once they launch
How WikiWash can be used to expose bias and spin on Wikipedia
7. “Today, some of the most relevant stories can only be told by poring over datasets and crunching numbers in Excel. It’s imperative reporters have tools to find the stories hidden in the data.”
- Luke Simcoe, Data Journalist, Metro News Canada
8. Currently, English Wikipedia includes 4,852,854 articles. More than 800 new articles are added every single day.*
*Source: Wikipedia
11. “There is no political power without control of the archive, if not of memory.”
- Jacques Derrida (1998)
13. Problem
• Spin is introduced into Wiki pages by biased edits
• Can’t connect edits with users, or uncover agendas / story angles
• Can’t get the data out of the system
• Hard to visualize data to find patterns
• Can’t track changes to pages (relating to branded entities)
• Can’t find all brand references on Wikipedia
• Wikipedia perceived to be susceptible to biased revisions. Very hard to track revisions on Wikipedia, either historically or as they occur.
Solution
• Associate page edits with users, and download the data
• Ability to compare multiple pages to uncover patterns in edits, and download the data
• Ability to track activity and alert to edit activity / trends (that may indicate bias intent)
Key Metrics
• Number of pages ‘un-washed’
• Number of connections / biased edits uncovered
• Number of edits to Wikipedia pages caused by uncovered biases
• Number of stories published citing data from the site
Channels
• Viral, word of mouth
• Partnerships with print / online media organizations (cross-promotion)
• Social media referrals
• PPC, Display, Email, SEO
Unique Value Propositions
• Clearly demonstrate connections between Wikipedia page edits and the users making those edits.
• Ability to track and uncover spin and malicious edits.
• Track page edits in near-real-time, and offer alerts that uncover trends and emerging stories.
Unfair Advantages
• Developed by and for working reporters
Customer Segments
• Reporters
• Activists
• Academics & Students
• Citizen Journalists
• High Use $: PR & Media, Brand Stakeholders
Competitors / Comparables
• Wikipedia
• Existing Wikipedia revision history page
• Wikistats
• Wikiwatchdog
• Article Revision Stats, Wiki Blame, etc.
Cost Structure
• IT Infrastructure
• Continuous reporting / scraping (unless partner up with Wikipedia)
• Marketing & Promotion
Revenue Streams
• Free for single use on historic data/edits
• Subscription model for activity alerts and real-time tracking (uncover breaking stories / bias)
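The "associate page edits with users, and download the data" idea from the canvas above can be illustrated with a small sketch. This is not WikiWash's actual code: the function names and the record fields (user, title, revid, timestamp) are assumptions made up for the example.

```javascript
// Hypothetical revision records: { user, title, revid, timestamp }.
// Group edits by the user who made them.
function groupEditsByUser(revisions) {
  const byUser = new Map();
  for (const rev of revisions) {
    if (!byUser.has(rev.user)) byUser.set(rev.user, []);
    byUser.get(rev.user).push(rev);
  }
  return byUser;
}

// Quote a CSV field when it contains a comma, double quote, or newline.
function csvField(value) {
  const text = String(value);
  return /[",\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Serialize the edits as CSV so they can be downloaded and analyzed elsewhere.
function editsToCsv(revisions) {
  const header = ["user", "title", "revid", "timestamp"];
  const rows = revisions.map((rev) =>
    header.map((key) => csvField(rev[key])).join(",")
  );
  return [header.join(","), ...rows].join("\n");
}
```

Grouping by user makes it easy to spot one account touching many related pages, and a plain CSV export keeps the data usable in spreadsheet tools.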
18. How does it do that?
• Real-time
• Open source
• Export your data
• Free!
• Works with Wikipedia’s API
• Built in JavaScript
• Uses Node.js, Express.js, Angular.js, and Socket.IO to facilitate involvement from others
http://blog.twg.ca/2014/11/building-wikiwash/
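As a rough illustration of what "works with Wikipedia’s API" can look like, the sketch below builds a MediaWiki Action API request for a page's recent revisions and flattens the JSON response. The query parameters (action=query, prop=revisions, rvprop, rvlimit, formatversion=2) are standard MediaWiki parameters; the helper names and the choice of returned fields are assumptions for this example, not WikiWash's implementation.

```javascript
const API_ENDPOINT = "https://en.wikipedia.org/w/api.php";

// Build a revision-history query URL for one article title.
function buildRevisionsUrl(title, limit = 10) {
  const params = new URLSearchParams({
    action: "query",
    prop: "revisions",
    titles: title,
    rvprop: "ids|timestamp|user|comment",
    rvlimit: String(limit),
    format: "json",
    formatversion: "2", // pages come back as an array rather than an id-keyed object
  });
  return `${API_ENDPOINT}?${params.toString()}`;
}

// Flatten an API response into one record per revision.
function extractRevisions(apiResponse) {
  const pages = (apiResponse.query && apiResponse.query.pages) || [];
  return pages.flatMap((page) =>
    (page.revisions || []).map((rev) => ({
      title: page.title,
      revid: rev.revid,
      user: rev.user,
      timestamp: rev.timestamp,
      comment: rev.comment || "",
    }))
  );
}

// Usage (needs network access and a runtime with a global fetch, e.g. Node 18+):
// fetch(buildRevisionsUrl("Toronto", 5))
//   .then((res) => res.json())
//   .then((json) => console.log(extractRevisions(json)));
```

Keeping URL construction and response parsing as pure functions means both can be checked without touching the network.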
23. Lessons Learned
• Limited by API; caching data
• Focus on real-time changes
• Trending articles aid understanding
• Focus on product first, aesthetics second
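The slides do not say how WikiWash caches data. One generic way to reduce repeated calls to a rate-limited API is a small in-memory cache with a time-to-live (TTL), sketched below. All names here (createTtlCache, cached, the load callback) are hypothetical.

```javascript
// A small in-memory cache whose entries expire after ttlMs milliseconds.
// `now` is injectable so the expiry logic can be exercised without waiting.
function createTtlCache(ttlMs, now = () => Date.now()) {
  const entries = new Map(); // key -> { value, expiresAt }
  return {
    get(key) {
      const entry = entries.get(key);
      if (!entry) return undefined;
      if (now() >= entry.expiresAt) {
        entries.delete(key); // stale: drop it so the caller refetches
        return undefined;
      }
      return entry.value;
    },
    set(key, value) {
      entries.set(key, { value, expiresAt: now() + ttlMs });
    },
  };
}

// Return a cached value when it is still fresh; otherwise call `load`
// (for example, an API request) and cache the result.
async function cached(cache, key, load) {
  const hit = cache.get(key);
  if (hit !== undefined) return hit;
  const value = await load();
  cache.set(key, value);
  return value;
}
```

A short TTL is a trade-off: it cuts API traffic for popular pages while still letting recent edits show up reasonably quickly.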
32. WHAT’S NEXT
Feature Roadmap
• Notifications via email
• Website embed capability
• Access to Wikipedia’s firehose
• UX improvements
• Language support
• Next / previous navigation
33. WHAT’S NEXT
How TWG Decides Next Steps
• Clear product ownership
• Product / market fit
• Pirate Metrics as a guide
• Qualitative & quantitative feedback
• Incrementally invest until inflection point