1. Turning Data Into Narrative:
Strategies for finding and sharing
stories embedded
within sets of data.
Daniel X. O’Neil
@juggernautco
2. Strategies for finding…
• Search is your friend
• Advanced search is your best friend
• Don’t default to FOIA
• Don’t deal with Public Information Officers
• The hidden web still exists
• Data is often more structured than you think
• It takes an abundance of data types to tell a
story
3. FOIA is not your friend.
• The Internet is your friend.
• Example: Dallas crime reports
• Here’s their statement about getting data from them
on their public Web site:
– Open Records requests must be made in writing. They may
be:
– 1. Hand-carried to the Records Section, Dallas Police
Headquarters, 1400 S. Lamar Street, Dallas, TX
– 2. Faxed to 214-671-4636
– 3. E-mailed to openrecordunit@dpd.ci.dallas.tx.us
– 4. Mailed by US Postage to - Dallas Police Open
Records, 1400 S. Lamar Street, Dallas, TX. 75215
8. …and sharing stories…
• Knowing more than anyone else is still the
only way to do this
• Surfacing from the hidden Web is doing
everybody a favor
• Information is not knowledge. Publishing data
without context is not super-useful
• Most data is boring. Why? Because data is
made by people, and most people are boring
most of the time
10. Ten Databases
• Building permits
• Business licenses
• Historic preservation list
• Sanborn maps (1929 and 1950)
• County assessor
• County recorder of deeds
• Original photography
• Google search for news coverage
• New York Times archive
• Walgreens surplus property
12. …embedded
within sets of data
• It’s got to be the other way around
• We’ve got to embed our data into our stories
rather than find stories embedded in our data
• I don’t want to search for anything
• I’d rather know everything
• Every object should have a page on the
Internet (so let’s get to work)
13. We need a machine.
• A generic context engine
• To evenly distribute information
• And tell me what the information
means
• I know: that sounds like a
“reporter”
• But people used to think that
“search engine” sounded a lot like
“librarian”, too
• We need humans and machines
14. It’s easy.
• Find dataset
• Review dataset
• Describe what the data means
• Find another dataset
• Describe what the other dataset
means
• Describe what the first dataset means
in the context of the second dataset
• Repeat
• Let’s do this thing.
Editor's Notes
I’m Dan O’Neil, and I run the Smart Chicago Collaborative, an organization devoted to improving lives in Chicago through technology. Among other things, I work with Chicago city government, developers, and community groups to use civic data in new and useful ways. As a co-founder of EveryBlock, I’m also a previous Knight News Challenge grantee. I certainly wouldn’t be doing any of this today if it weren’t for the vision of the Knight Foundation.
The main charge to the panelists is to talk about “Strategies for finding and sharing stories embedded within sets of data.” Let’s take that piece by piece. I’ve been responsible for data acquisition for quite some time, and I’ve found a goodly amount of data in my day. These are the main upshots I’ve got to share that are not already widely propagated.
One way that I think I differ from the main reporter/journalism mode of finding data is that I prefer Searching to Asking. Search is your friend, and advanced search is your best friend. I think that the instinct is to make Freedom of Information Act requests and go through traditional routes like calling Public Information Officers. That can waste a lot of time. Here’s an example in Dallas: if you use their default process, you’re in for a pretty traditional experience. Requests in writing, wait for an answer.
And if you use the default search for crime records, you get this screen. It has records going back to 2005. You fill out the form and you get your answers back. Pretty typical experience.
What you wouldn’t be able to tell, unless you searched the Dallas Police Web site more deeply, is this. The Dallas Police publishes an amazing cache of crime data in flat files. All of it, with no search, no letters or emails, going back 12 years. Why anyone would make any FOIA request (or why the Dallas Police would want anyone to do that) is beyond me. And this data has some of the most amazing crime details (the police narrative) that you can find in crime data anywhere. This is hidden in plain sight.
Data is often more structured than you think. Over the weekend I participated in the Knight-Mozilla-MIT "Story & Algorithm" Hack Day run by Dan Sinker. I met a couple of Boston developers and we executed on a project I’ve had for about 7 years. Like many of you here, I’m not smart enough to actually make things, so I have to rely on the kindness of developers. What we made was “Condition of Anonymity”, a Web site that automatically pulls the reason that anonymity was granted to an anonymous source by a reporter for the New York Times. We often think about data as the stuff inside spreadsheets and published in flat files to FTP servers, but there is a whole world of semi-structured data like this hidden in plain sight, inside plain text. We used the NYT Search API to review every article in the NYT back to January 1, 2000 for the phrase “condition of anonymity”, then used a natural language processing toolkit to find what I call the “because clauses”. There’s some gold in there. It takes an abundance of data types to tell a story. This story feels like a Walt Whitman poem to me.
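A minimal sketch of the “because clause” idea, in Python. This is not the hack-day code: it swaps the natural language processing toolkit for a plain regular expression, it makes no call to the NYT Search API, and the sample sentences and the name `extract_because_clause` are invented for illustration.

```python
import re

# Matches "condition of anonymity because/since/to <reason>" and captures
# the reason up to the next comma, semicolon, or period. This regex is a
# simple stand-in for real NLP parsing (an assumption, not the original).
REASON_PATTERN = re.compile(
    r"condition of anonymity\s+(?:because|since|in order to|to)\s+"
    r"(?P<reason>[^.;,]+)",
    re.IGNORECASE,
)


def extract_because_clause(sentence):
    """Return the stated reason for anonymity, or None if none is given."""
    match = REASON_PATTERN.search(sentence)
    if match is None:
        return None
    return match.group("reason").strip()


# Hypothetical sentences written for this example (not real NYT text).
SAMPLES = [
    "The official spoke on the condition of anonymity because he was "
    "not authorized to discuss the matter.",
    "An aide, who spoke on condition of anonymity to avoid angering "
    "her boss, confirmed the plan.",
    "Two people requested anonymity.",
]

reasons = [extract_because_clause(s) for s in SAMPLES]
for reason in reasons:
    print(reason)
```

A regular expression like this will miss reasons phrased in other ways and will cut off reasons that contain commas, which is presumably why a real toolkit was worth using on twelve years of articles.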
Lastly, I highly recommend the Data Journalism Handbook, which was created, in part, by many people in this room. It’s a really excellent resource.
I am not a journalist. But in my own time, I have published a pretty extensive set of stories based on data, and I have some insights, maybe. The first one is that there aren’t any shortcuts. You still have to know more than anybody else about a subject in order to tell good stories. I’ve got an example to share. Next is kind of a gimme, which is that you shouldn’t mix up information and knowledge. The analysis is where it’s at. The most amazing insight I can share is that data is boring. I’ve had a long time to consider why that is true, and I think I have the answer. The reason is that people are boring. We forget that data is made by people. And most people are boring most of the time. Every object should have a page on the Internet (so let’s get to work).
Here’s kind of a master example. I live near this building. It had been empty for a very long time. Then construction started. The construction was heralded by a building permit. But, of course, the building permit was boring. So I looked further.
I searched ten different databases and, lo and behold, more data made it less boring. Why? Because almost all people are interesting some of the time. So if you look hard enough, you’ll find those stories. I found a business license for a 3-day pop-up store. So this place has been empty for decades, but was open for three days. And I missed it. It used to be a bank, and I found out from the NYT archive (in PDF format, the hidden Web) that there was a bank run at this location in 1937. Again, not boring.
Here’s an example of two things: finding data in unstructured text and finding interesting data. This is an Advanced Search in Google for the word “jimmied” in the Dallas crime data published by EveryBlock. So that site becomes a public, searchable instance of a previously hidden data set. Apparently police have used the word “jimmied” to describe an action taken by suspected criminals 2,430 times. All sorts of things are jimmied, apparently. It’s not boring.
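The same kind of word search can be sketched directly against a flat file, without going through Google. The Python below is only an illustration: the column names, the three sample rows, and the function name `find_narratives` are invented and do not reflect the actual layout of the Dallas Police files.

```python
import csv
import io
import re

# Hypothetical flat-file extract. Column names and rows are invented
# for illustration and do not come from the Dallas Police files.
SAMPLE_CSV = """report_id,offense,narrative
1001,BURGLARY,Suspect jimmied the rear door and took a television.
1002,THEFT,Complainant left the vehicle unlocked overnight.
1003,BURGLARY,Unknown suspect JIMMIED a window lock to enter the garage.
"""


def find_narratives(csv_text, word):
    """Return rows whose narrative contains `word` as a whole word."""
    # Case-insensitive, whole-word match so "JIMMIED" and "jimmied" both hit.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if pattern.search(row["narrative"])]


matches = find_narratives(SAMPLE_CSV, "jimmied")
print(len(matches), "reports mention 'jimmied'")
for row in matches:
    print(row["report_id"], row["narrative"])
```

For a real file, `io.StringIO(csv_text)` would be replaced by an open file handle, and the narrative column would need to be identified from the published file itself.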
Lastly, I want to encourage a different way to think about data. We’ve got to embed our data into our stories rather than find stories embedded in our data. I don’t want to search for anything. I’d rather know everything. Every object should have a page on the Internet, just like 1601 N. Milwaukee.
This machine can be described as a generic context engine, to evenly distribute information and tell me what the information means. I know: that sounds like a “reporter”. But people used to think that “search engine” sounded a lot like “librarian”, too. We need humans and machines.
Find dataset. Review dataset. Describe what the data means. Find another dataset. Describe what the other dataset means. Describe what the first dataset means in the context of the second dataset. Repeat. Let’s do this thing.
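One way to picture the step “describe what the first dataset means in the context of the second dataset” is a join on a shared key, such as a street address. The Python below is a rough sketch under that assumption. Every record, field name, and function name in it is hypothetical, and real permit and license files would need far more careful address matching.

```python
from collections import defaultdict

# Hypothetical records, invented for illustration only.
PERMITS = [
    {"address": "123 N. Example Ave", "year": 2012, "work": "interior renovation"},
    {"address": "456 W. Sample St", "year": 2011, "work": "new porch"},
]
LICENSES = [
    {"address": "123 N EXAMPLE AVE", "year": 2010, "kind": "pop-up store"},
]


def normalize(address):
    """Crude join key: uppercase, drop periods, collapse whitespace."""
    return " ".join(address.upper().replace(".", "").split())


def describe_in_context(permits, licenses):
    """Describe each permit in the context of licenses at the same address."""
    by_address = defaultdict(list)
    for lic in licenses:
        by_address[normalize(lic["address"])].append(lic)

    sentences = []
    for permit in permits:
        related = by_address.get(normalize(permit["address"]), [])
        line = f"{permit['address']}: {permit['year']} permit for {permit['work']}"
        if related:
            kinds = ", ".join(f"{lic['kind']} ({lic['year']})" for lic in related)
            line += f"; business licenses at this address: {kinds}"
        else:
            line += "; no business licenses found"
        sentences.append(line + ".")
    return sentences


results = describe_in_context(PERMITS, LICENSES)
for sentence in results:
    print(sentence)
```

The output is still only a machine's first draft of context. Deciding whether a pop-up store at a long-empty building is a story remains the human part of the loop.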