Central Pennsylvania Open Source Conference, October 17, 2015
Data is a hot topic in the tech sector: big data, data processing, data science, linked open data, and data visualization are only a few examples. Before data can be processed or analyzed, it often has to be cleaned. OpenRefine is an open source, interactive data transformation tool for working with messy data. This presentation begins with a short overview of OpenRefine's features. Pennsylvania Heritage magazine subject index data then serves as a case study to demonstrate the basic concepts of cleaning, manipulating, faceting, and filtering data with OpenRefine.
1. INTRODUCTION TO OPENREFINE
CENTRAL PA OPEN SOURCE CONFERENCE
OCTOBER 17, 2015
Heather Myers / @privatestorm
https://www.linkedin.com/in/heathercmyers
2. ABOUT ME
Web administrator in the government and cultural heritage sectors.
Currently working at the Pennsylvania Historical and Museum Commission.
3. OPENREFINE
"a powerful tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data."
4. GETTING STARTED
Choose dataset
Decide what you want to accomplish with data
Install OpenRefine: http://openrefine.org/download.html
Run OpenRefine: http://127.0.0.1:3333
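Once OpenRefine is running, its interface is served from the local address on this slide. As a minimal sketch (not part of the original talk), the Python below checks whether anything answers at that address. The function name `is_openrefine_running` and the timeout value are assumptions made for illustration, and the check only confirms that an HTTP server responds, not that the server is OpenRefine.

```python
import http.client
import urllib.error
import urllib.request

# Local address from the slide, where the OpenRefine interface is served.
OPENREFINE_URL = "http://127.0.0.1:3333"


def is_openrefine_running(url=OPENREFINE_URL, timeout=2.0):
    """Return True if an HTTP server answers successfully at `url`.

    Hypothetical helper: it does not verify that the server is OpenRefine,
    only that something responds with a non-error status.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, HTTP error status, or malformed reply.
        return False
```

Calling `is_openrefine_running()` before opening the browser gives a quick yes/no answer; a `False` result usually means OpenRefine has not been started yet.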
5. ABOUT THE DATASET
Pennsylvania Heritage magazine subject index.
Index of 12,000+ magazine terms for issues dated 1975–2002.
http://bit.ly/1Udha8D
6. DATA TO DO LIST
Create lists of terms for specific issues
Extract list of terms
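To make the two to-do items concrete, here is a rough Python sketch of the same goals outside OpenRefine. The talk itself does this work in OpenRefine with facets and filters. The column names `term` and `issue` and the sample rows are hypothetical placeholders; the real subject index may be structured differently.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows, invented for illustration only.
# The real index's column names and values are not shown in the slides.
SAMPLE_CSV = """term,issue
Coal mining,Winter 1975
Railroads,Winter 1975
Coal mining,Spring 1980
Quakers,Spring 1980
"""


def terms_by_issue(rows):
    """To-do item 1: build a list of terms for each issue."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["issue"].strip()].append(row["term"].strip())
    return dict(grouped)


def unique_terms(rows):
    """To-do item 2: extract the sorted list of distinct terms."""
    return sorted({row["term"].strip() for row in rows})


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
by_issue = terms_by_issue(rows)
all_terms = unique_terms(rows)
```

With the sample rows, `by_issue["Winter 1975"]` holds the two terms listed for that issue, and `all_terms` holds each term once even though "Coal mining" appears in two issues.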