Normalizing Data for Migration
Kyle Banerjee
banerjek@ohsu.edu
Migrations are a fact of life
Acquisitions data
Item data
ERM
Bibliographic data
Patron data
Statistics
Holdings Information
Content Management Systems
Link resolver
Circulation data
Archival management software
Institutional Repository
You can do a lot without programming skills
Absolutely!
✓ Carriage returns in data
✓ Retain preferred value
of multivalued fields
✓ Missing or invalid data
✓ Find problems following
complex patterns
Maybe...
? Conditional logic
? Changes based on
multifield logic
? Convert free text fields
to discrete values
Excel
● Mangles your data
○ Barcodes, identifiers, and numeric data
at risk
● Cannot fix carriage returns in data
● Crashes with large files
● OpenRefine is a better tool for
situations where you think you need
Excel http://openrefine.org
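As a rough illustration of the two Excel problems above, the sketch below reads a delimited export with Python's standard `csv` module, which keeps every value as text (so leading zeros in barcodes survive) and lets carriage returns inside quoted fields be replaced. The sample data, the column names, and the `clean_rows` helper are made up for this example and do not come from any particular system.

```python
import csv
import io
import re

# Hypothetical export: a barcode with leading zeros and a note that
# contains a carriage return/line feed inside a quoted field.
raw = 'barcode,note\r\n00123456789012,"line one\r\nline two"\r\n'

def clean_rows(handle):
    """Yield rows with line breaks inside field values replaced by one space."""
    for row in csv.reader(handle):
        # Values stay strings, so identifiers are never reinterpreted as numbers.
        yield [re.sub(r"[\r\n]+", " ", value) for value in row]

# newline="" leaves line endings untouched so the csv module can handle them.
rows = list(clean_rows(io.StringIO(raw, newline="")))
print(rows)
```

For a real file, the same function could be given `open(path, newline="", encoding="utf-8")` instead of the in-memory sample; the encoding is an assumption that depends on the export.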
Keys to success
● Understand differences between the old
and new systems
● Manually examine thousands of records
● Learn regular expressions
● Ask for help!
Watch out for
✓ Creative use of fields
○ Inconsistencies and changing policies
○ Embedded code
○ Data that exploits buggy behavior
✓ Different data structures
○ Acq, licensing, electronic, items, etc.
✓ Different types of data within fields
(e.g. codes vs. text)
CONTENTdm migration example
● XML metadata export contained errors on
every field that contained an HTML entity
(&amp; &lt; &gt; &quot; &apos; etc.)
<dc:subject>Oregon Health &amp</dc:subject>
<dc:subject> Science University</dc:subject>
● Error occurs in many fields scattered across
thousands of records
● But this can be fixed in seconds!
Regular expressions to the rescue!
● “Whenever a field ends in an HTML entity
minus the semicolon and is followed by an
identical field, join those into a single field and
fix the entity. Any line can begin with an
unknown number of tabs or spaces”
/^\s*<([^>]+>)(.*)(&[a-z]+)<\/\1\n\s*<\1/<\1\2\3;/
Regular expressions can...
● Use logic, capitalization, edges of
words/lines, express ranges, use bits (or
all) of what you matched in replacements
● Convert free text to XML, delimited
text, or codes, and vice versa
● Find complex patterns using proximity
indicators and/or involving multiple lines
● Select preferred versions of fields
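Two of these capabilities can be sketched in a few lines of Python. The first function reuses pieces of a match in the replacement, and the second uses word edges (`\b`) to turn free text into a code. The month/day/year date order, the status codes, and the function names are assumptions chosen for illustration only.

```python
import re

# Dates such as 3/7/2014, assumed here to be month/day/year.
US_DATE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")

def to_iso_dates(text):
    """Rewrite m/d/yyyy dates as yyyy-mm-dd, reusing the captured parts."""
    return US_DATE.sub(
        lambda m: f"{m.group(3)}-{int(m.group(1)):02d}-{int(m.group(2)):02d}",
        text,
    )

# Hypothetical mapping from free-text notes to discrete status codes.
STATUS_RULES = [
    (re.compile(r"\b(lost|missing)\b", re.IGNORECASE), "MISSING"),
    (re.compile(r"\bwithdrawn\b", re.IGNORECASE), "WITHDRAWN"),
]

def status_code(note):
    """Return the code of the first rule that matches, or None."""
    for pattern, code in STATUS_RULES:
        if pattern.search(note):
            return code
    return None

print(to_iso_dates("Due 3/7/2014, renewed 12/15/2014"))
print(status_code("Item LOST 2012"))
print(status_code("Lostock Hall donation"))  # word edge prevents a false match
```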
Confusing at first, but easier than you think!
● Works on all platforms and is built into a
lot of software
● Ask for help! Programmers can help you
with syntax
● Let’s walk through our example which
involves matching and joining unknown
fields across multiple lines...
Regular Expression Analysis
/^\s*<([^>]+>)(.*)(&[a-z]+)<\/\1\n\s*<\1/<\1\2\3;/
^ Beginning of line
\s*< Zero or more whitespace characters followed by “<”
([^>]+>) One or more characters that are not “>” followed by “>” (i.e.
a tag). Store in \1
(.*) Any characters to next part of pattern. Store in \2
(&[a-z]+) Ampersand followed by letters (HTML entities). Store in \3
<\/\1\n “</” followed by \1 (i.e. the closing tag) followed by a newline
\s*<\1 Any number of whitespace characters followed by tag \1
/<\1\2\3;/ Replace everything up to this point with “<” followed by \1
(opening tag), \2 (field contents), \3, and “;” (fix HTML
entity). This effectively joins the fields
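The expression analyzed above can be tried out in Python as a minimal sketch. The pattern is the one from the slide with its backslashes written out; the `re.MULTILINE` flag, the function name, and the use of Python itself are choices made for this example, and the sample record is the one from the CONTENTdm slide. Note that the replacement also drops any indentation in front of the joined field.

```python
import re

# re.MULTILINE lets ^ match at the start of every line, not just the first.
BROKEN_ENTITY = re.compile(
    r"^\s*<([^>]+>)(.*)(&[a-z]+)</\1\n\s*<\1",
    re.MULTILINE,
)

def join_split_fields(xml_text):
    """Join a field split at an HTML entity and restore the missing semicolon."""
    # \1 = tag name plus ">", \2 = field contents, \3 = entity without ";"
    return BROKEN_ENTITY.sub(r"<\1\2\3;", xml_text)

sample = (
    "<dc:subject>Oregon Health &amp</dc:subject>\n"
    "<dc:subject> Science University</dc:subject>\n"
)
fixed = join_split_fields(sample)
print(fixed)
# <dc:subject>Oregon Health &amp; Science University</dc:subject>
```

Fields whose entities are already complete are left alone, because the pattern only matches when the entity is immediately followed by the closing tag.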
A simpler example
● Find a line that contains 1 to 5 fields in a
tab delimited file (because you expect 6)
^([^\t]*\t){0,4}[^\t]*$
● To automatically join it with the next line with a
space
/^(([^\t]*\t){0,4}[^\t]*)\n/\1 /
However, it would be much safer and easier to use
syntax that detects the first or last field
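A hedged Python sketch of the same idea follows. It uses the short-line pattern with tabs written as `\t`, but applies it one line at a time, and it keeps joining until a record reaches six fields rather than doing a single global substitution. That is a variation on the slide's replacement, chosen because the second half of a broken record usually looks short as well. The function names and sample data are hypothetical.

```python
import re

# At most four tabs means at most five fields, so the line is short.
SHORT_LINE = re.compile(r"([^\t]*\t){0,4}[^\t]*")

def find_short_lines(text):
    """Return 1-based numbers of non-empty lines with fewer than six fields."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if line and SHORT_LINE.fullmatch(line)
    ]

def join_short_lines(text):
    """Join short lines to the following line(s) with a space. Drops blank lines."""
    complete = []
    buffer = ""
    for line in text.splitlines():
        buffer = f"{buffer} {line}" if buffer else line
        if not SHORT_LINE.fullmatch(buffer):
            complete.append(buffer)  # six or more fields: record is whole
            buffer = ""
    if buffer:
        complete.append(buffer)  # trailing fragment with nothing left to join
    return "\n".join(complete)

sample = (
    "1\tb\tc\td\te\tf\n"
    "2\tb\tbroken\n"
    "note\td\te\tf\n"
    "3\tb\tc\td\te\tf\n"
)
short = find_short_lines(sample)
joined = join_short_lines(sample)
print(short)
print(joined)
```

This still cannot tell a genuinely short record from a broken one, which is why a check on the first or last field is the safer approach.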
If you want a GUI, use OpenRefine
http://openrefine.org
● Sophisticated, including regular
expression support and ability to create
columns from external data sources
● Convert between different formats
● Up to a couple hundred thousand rows
Normalization is more conceptual than technical
● Every situation is unique and depends on the
data you have and the config of the new
system
● Don’t fob off data analysis on technical
people who don’t understand library data
● It’s not possible to fix everything because the
systems work differently (if they didn’t,
migrating would be pointless)
Questions?
Kyle Banerjee
banerjek@ohsu.edu
