I Can Convert!
by Sven Aas and Jason Proctor
I Can Convert!
•   Sven Aas: @svenaas / saas@mtholyoke.edu

•   Jason Proctor: @jmpmhc / jproctor@mtholyoke.edu

•   #TPR2




                                              ©2012 Sven Aas and Jason Proctor, Mount Holyoke College
We’re going to talk about
•   Stories

•   Patterns

•   Tools




Use Your Tools!



Use Your Tools
•   Spreadsheet

•   Programmer’s Editor

•   Programming Language




Spreadsheet




Programmer’s Editor




Programming Language




Use Your Tools!
  You’ve GOT this stuff.




Getting Deported



Portal News




Unusual Data Representation
 +--------------+
 | Data         |
 +--------------+
 | node         |
 | name         |
 | type         |
 | mode         |
 | owner        |
 | group        |
 | url          |
 | desc         |
 | parent       |
 | linkto       |
 | ctime        |
 | mtime        |
 | mod_by       |
 | visible      |
 | userdata     |
 | datasize     |
 | datafilename |
 +--------------+

 | 4692909 | G1158673129-8322 |   16 | rwlrwlr-l |
 21139 | 71 1000009 1000010 1000011 1000012 1000013
 1000014 1000015 1000016 1000017 1000018 1000019
 1000020 |      |      | 2100709 |   NULL | 1158673129
 | 1170344089 | 21139  |       1 |
 01|Second*Saturday: MHC Students Hit the Road|As part
 of new student orientation, members of the class of
 2010 worked on community service projects across the
 Pioneer Valley on September 16. View the photo
 gallery.||http://www.mtholyoke.edu/offices/comm/news/
 sec_sat_06/page1.html|1158638400|1170305999|||||
 11.41|:^:^:^:^:^JPG:^75:^75:^2813:^Second
 Saturday:^:^:^:^0:^
 |     2813 | V1158673129-9689 |


Ruby to the Rescue
[Diagram: Portal System with LegacyUser and LegacyItem; Importer; News System with User, Item, Story, Link, and Channel]




ActiveRecord
•   A Ruby library which implements the Active Record software
    architecture pattern.

•   The original Model and ORM component of Ruby on Rails.

•   We used it to provide a convenient object layer on top of two
    underlying relational databases.




Conversion Patterns



Object Extraction
Context: Ingesting source data.

Problem: Source data objects contain multiple target objects.

Solution: Process or parse target data just enough to extract
objects.

Tools: String methods, RegEx, DOM/XML selection.
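As a minimal sketch of this pattern using only String methods: assume one pipe-delimited source record carries both a story and its link. The field positions, the example.edu URL, and the Story and Link structs are hypothetical, chosen for illustration rather than taken from the real portal schema.

```ruby
# Hypothetical target objects; the field choices are assumptions.
Story = Struct.new(:title, :summary)
Link  = Struct.new(:url)

# Parse the source record just enough to pull out the target objects.
def extract_objects(payload)
  # A limit of -1 keeps trailing empty fields instead of dropping them.
  fields = payload.split('|', -1)
  title, summary, url = fields.values_at(1, 2, 4)
  [Story.new(title, summary), Link.new(url)]
end

payload = '01|Second Saturday|Students hit the road.||http://www.example.edu/news/page1.html|1158638400'
story, link = extract_objects(payload)
puts story.title
puts link.url
```

The parser deliberately ignores every field it does not need; deeper interpretation can wait for a later stage.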



Encoding Change
Context: Mapping source data to target.

Problem: Source text encoding differs from target.

Solution: Perform intermediate translation.

Tools: String methods, RegEx, programming libraries.
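One way this translation might look in Ruby, assuming the legacy bytes are Windows-1252 and the target expects UTF-8. The source encoding is an assumption for illustration; the slides do not say which encodings were involved. The helper name to_utf8 is hypothetical.

```ruby
# Assumption: legacy text is Windows-1252; the target wants UTF-8.
def to_utf8(raw_bytes, source_encoding = Encoding::Windows_1252)
  raw_bytes
    .dup
    .force_encoding(source_encoding) # label the bytes; nothing is converted yet
    .encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: '?')
end

# Bytes as they might sit in an old database: e-acute and curly quotes.
legacy    = "caf\xE9 \x93quoted\x94".b
converted = to_utf8(legacy)
puts converted
```

Bytes that have no meaning in the source encoding are replaced with `?` rather than raising, so a single bad record does not stop a long conversion run.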




URL/Path Translation
Context: Preparing target environment and data.

Problem: Assets in target system will be available at different
paths or URLs from their locations in source system.

Solution: Map source locations to target locations. Replace
references in data before saving to target.

Tools: String methods, RegEx, DOM/XML selection.
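A rough sketch of the mapping step with a regular expression. The PATH_MAP entries, the example.edu host, and the translate_urls helper are all hypothetical; a DOM-based rewrite would be sturdier for messy HTML.

```ruby
# Hypothetical map from source locations to target locations.
PATH_MAP = {
  'http://www.example.edu/offices/comm/news/' => '/news/archive/',
  'http://www.example.edu/offices/comm/'      => '/communications/'
}.freeze

def translate_urls(html, map = PATH_MAP)
  # Try longer prefixes first so a specific rule beats a general one.
  prefixes = map.keys.sort_by { |prefix| -prefix.length }
  pattern  = Regexp.union(prefixes)
  # Rewrite only href/src attribute values, not URLs in running text.
  html.gsub(/\b(href|src)=(["'])(#{pattern})/) do
    m = Regexp.last_match
    "#{m[1]}=#{m[2]}#{map[m[3]]}"
  end
end

html = %(<a href="http://www.example.edu/offices/comm/news/sec_sat_06/page1.html">Gallery</a>)
puts translate_urls(html)
```

Running the replacement before saving to the target keeps the stored content free of references to the old system.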


Getting the News Out



Easy Come, Easy Go
1. Export Athletics news items to hosted service.

2. Export all news items to digital archives.




Exporting Athletics Items
•   10 years of Athletics news in 14 channels.

•   Export each item in a minimal, predictable HTML wrapper.

•   Include metadata for each item in <meta> tags in the <head>.

•   Group items by sport and by academic year.

•   Generally accommodate the target system.
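The export described above could be sketched in plain Ruby as follows. This is not the HAML-based code the talk refers to: the item hash layout, the helper names, and the July start of the academic year are assumptions made for illustration.

```ruby
require 'date'

# Minimal HTML escaping for text and attribute values.
ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' }.freeze

def h(text)
  text.to_s.gsub(/[&<>"]/, ESCAPES)
end

# Assumption: an academic year runs from July through June.
def academic_year(date)
  start = date.month >= 7 ? date.year : date.year - 1
  "#{start}-#{start + 1}"
end

# Wrap one item in a minimal, predictable HTML page with <meta> tags.
def export_item(item)
  meta = item[:meta].map do |name, value|
    %(<meta name="#{h(name)}" content="#{h(value)}">)
  end
  <<~HTML
    <html>
    <head>
    <title>#{h(item[:title])}</title>
    #{meta.join("\n")}
    </head>
    <body>
    #{item[:body]}
    </body>
    </html>
  HTML
end

item = { title: 'Team Wins & Advances', date: Date.new(2011, 10, 15),
         meta: { 'sport' => 'soccer', 'published' => '2011-10-15' },
         body: '<p>Story text.</p>' }
puts File.join(item[:meta]['sport'], academic_year(item[:date]))
puts export_item(item)
```

The body is inserted unescaped because it is assumed to already be HTML; titles and metadata values are escaped.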


HAML
•   A lightweight markup language used to generate HTML.

•   A meta-markup language.

•   We used it to succinctly express the HTML we wanted from
    within our Ruby code.




Archiving Web News
•   14 years of news: 6,000 items, 5,000 images, 34 channels.

•   Export each news item in an archival form preserving the
    original markup and character entities (but not the design).

    •   PDF generated from HTML generated from HAML

•   Export Dublin Core metadata for each news item:

    •   XML generated via Builder

Builder
•   A Ruby library for generating XML.

•   We used it to dynamically generate simple XML from within a
    Ruby application.




wkhtmltopdf
•   A shell utility for generating PDF files by rendering HTML
    documents using the WebKit rendering engine.

•   A Ruby library providing programmatic access to the
    wkhtmltopdf shell utility.

•   We used it so that we could use familiar web development
    techniques to generate PDFs without having to implement our
    own rendering and layout routines.

Familiar Patterns
•   Object Extraction

•   Encoding Change

•   URL/Path Translation




Direct Translation
Context: Simple conversion.

Problem: Data conversion.

Solution: Read source objects and write targets in single pass.

Tools: Varies.
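A small sketch of a single-pass conversion, assuming a pipe-delimited source and a target of one JSON object per line. Both formats and the translate helper are hypothetical stand-ins.

```ruby
require 'json'
require 'stringio'

# Single pass: each source line becomes one target line; nothing is kept in memory.
def translate(source_io, target_io)
  source_io.each_line do |line|
    id, title, url = line.chomp.split('|', 3)
    target_io.puts JSON.generate('id' => Integer(id, 10), 'title' => title, 'url' => url)
  end
end

source = StringIO.new("1|First story|/news/1.html\n2|Second story|/news/2.html\n")
target = StringIO.new
translate(source, target)
puts target.string
```

Because the method only needs objects that respond to each_line and puts, the same code works with real files opened via File.open.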




Markup Change
Context: Mapping source data to target.

Problem: Source text markup differs from target.

Solution: Perform intermediate translation.

Tools: String methods, RegEx, DOM/XML selection,
programming libraries.
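For simple, regular markup, the translation can be sketched with regular expressions. The tag choices below (presentational b, i, and font tags mapped to structural markup) are assumptions for illustration, and modernize_markup is a hypothetical helper; irregular HTML is usually better handled through DOM selection.

```ruby
# Hypothetical mapping from source tags to target tags.
TAG_MAP = { 'b' => 'strong', 'i' => 'em' }.freeze

def modernize_markup(html)
  # Rename attribute-free <b>/<i> opening and closing tags.
  renamed = html.gsub(%r{<(/?)(b|i)>}i) do
    m = Regexp.last_match
    "<#{m[1]}#{TAG_MAP[m[2].downcase]}>"
  end
  # Drop <font ...> wrappers but keep the text inside them.
  renamed.gsub(%r{</?font\b[^>]*>}i, '')
end

puts modernize_markup('<B>Bold</B> and <font color="red"><i>red</i></font>')
```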



Data Cleanup
Context: Ingesting source data.

Problem: Source data is ... imperfect.

Solution: Fix what you can confidently fix.

Tools: Varies.
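"Fix what you can confidently fix" might look like the sketch below: whitespace is always normalized, a date is rewritten only when it matches one known format, and anything else is left untouched and reported for a human to review. The record layout, the date format, and the clean_record helper are assumptions.

```ruby
require 'date'

def clean_record(record)
  cleaned = record.dup
  issues  = []
  # Safe fix: collapse runs of whitespace and trim both ends.
  cleaned[:title] = record[:title].to_s.gsub(/\s+/, ' ').strip
  # Confident fix only: convert dates in one known format, flag the rest.
  begin
    cleaned[:date] = Date.strptime(record[:date].to_s, '%B %d, %Y').iso8601
  rescue ArgumentError
    issues << "unparseable date: #{record[:date].inspect}"
  end
  [cleaned, issues]
end

cleaned, issues = clean_record(title: "  Second   Saturday\n", date: 'September 16, 2006')
puts cleaned.inspect
puts issues.inspect
```

Returning the list of issues alongside the cleaned record makes it easy to hand the uncertain cases to a person instead of guessing.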




Convert All the Things!



Finally Done with News?
•   HTML files scraped via Nokogiri scripts.

•   Quite a bit of cleanup: garbage in, garbage out.

•   Unscrapable news items.

•   “September 12, 2001”.




Nokogiri
•   A Ruby library for parsing XML and HTML.

•   Supports DOM or SAX parsing.

•   Implements both XPath and CSS3 selectors.

•   We used it to parse and extract content from the set of HTML
    files containing existing news stories.



Familiar Patterns
•   Direct Translation

•   Encoding Change

•   Markup Change

•   URL/Path Translation

•   Data Cleanup


The Big One



CMS Conversion
•   Old CMS pages all published with several different
    presentational styles, but all with the same DOM. That means
    we can scrape ’em!

•   We agreed not to change anything else during the import. That
    means we can treat it as a clean switchover.




Three-Pronged Conversion
•   Build the necessary structures and themes to accommodate
    and represent our old content.

•   Build a library of code for scraping the pages generated by the
    old site, cataloging data and metadata, and storing them in an
    intermediate representation.

•   Build a library of code for importing this intermediate
    representation into the new CMS structures.

Migrate
•   A Drupal module providing a framework for data import into
    the Drupal content management system.

•   Supports a variety of sources and targets out of the box.

•   Extensible to support additional migration sources and targets.

•   We used it to import the XML representation of our site into
    our Drupal system.


Familiar Patterns
•   Object Extraction

•   Encoding Change

•   Markup Change

•   URL/Path Translation

•   Data Cleanup


Intermediate Representation
Context: Complex conversion.

Problem: Data conversion.

Solution: Convert source data to intermediate representation in
one pass. Then convert intermediate representation to target.

Tools: Representation: Database, XML, CSV. Conversion: Varies.
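A two-pass conversion can be sketched as two functions that meet only at the intermediate representation. JSON stands in here for the database, XML, or CSV options listed above, purely to keep the sketch free of extra dependencies; the source layout, the target markup, and both helper names are hypothetical.

```ruby
require 'json'

# Pass 1: source -> intermediate. Knows only the source format.
def to_intermediate(source_lines)
  records = source_lines.map do |line|
    id, title, body = line.chomp.split('|', 3)
    { 'source_id' => id, 'title' => title.strip, 'body' => body }
  end
  JSON.pretty_generate(records)
end

# Pass 2: intermediate -> target. Knows only the target format.
def to_target(intermediate_json)
  JSON.parse(intermediate_json).map do |rec|
    %(<article id="story-#{rec['source_id']}"><h1>#{rec['title']}</h1>#{rec['body']}</article>)
  end
end

intermediate = to_intermediate(["17| Road Trip |<p>Text</p>\n"])
puts intermediate
puts to_target(intermediate)
```

Because the intermediate form can be saved to disk, the second pass can be rerun and debugged without scraping the source again.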



Object Identity
Context: Ingesting source data.

Problem: Data objects are repeated in source data.

Solution: Uniquely identify source objects.

Tools: String methods, RegEx, DOM/XML selection.
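One possible way to identify repeated objects is to derive a normalized key from each reference. The rule used below (same lowercased host and same path, ignoring scheme, query, fragment, and trailing slash) is an assumption, as are the helper names and the example.edu URLs.

```ruby
require 'uri'

# Assumption: two references name the same object when host and path match,
# ignoring case in the host plus any scheme, query, fragment, or trailing slash.
def identity_key(url)
  uri  = URI.parse(url.strip)
  path = uri.path.to_s.sub(%r{/+\z}, '')
  "#{uri.host.to_s.downcase}#{path}"
end

# Keep the first occurrence of each distinct object.
def unique_objects(references)
  references.uniq { |ref| identity_key(ref[:url]) }
end

refs = [
  { url: 'http://www.example.edu/news/' },
  { url: 'http://WWW.example.edu/news#top' },
  { url: 'http://www.example.edu/events' }
]
puts unique_objects(refs).inspect
```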




Object Aggregation
Context: Ingesting source data.

Problem: Target data objects contain multiple source objects.

Solution: Aggregate objects at intermediate or output stage.

Tools: Varies.
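As an illustration of aggregation, assume each source row is one page of a multi-page story and the target wants a single story object. The row layout and the aggregate_pages helper are hypothetical.

```ruby
# Hypothetical source rows: one row per page of a story.
def aggregate_pages(rows)
  rows.group_by { |row| row[:story_id] }.map do |story_id, pages|
    ordered = pages.sort_by { |page| page[:page] }
    { story_id: story_id,
      title: ordered.first[:title],
      body: ordered.map { |page| page[:body] }.join("\n") }
  end
end

rows = [
  { story_id: 1, page: 2, title: 'A', body: 'second' },
  { story_id: 1, page: 1, title: 'A', body: 'first' },
  { story_id: 2, page: 1, title: 'B', body: 'only' }
]
puts aggregate_pages(rows).inspect
```

Grouping at the intermediate stage keeps the scraper simple: it can emit rows in whatever order it finds them.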




Lessons
•   You already have a good toolbox. Keep your tools sharp.

•   Understand your source and target models.

•   Watch for familiar patterns.

•   Conversion is an opportunity for cleanup and improvement.

•   Human labor can sometimes be cheaper than automation.


YOU Can Convert



Questions?



Thank you, & keep in touch!
•   Sven Aas: @svenaas / saas@mtholyoke.edu

•   Jason Proctor: @jmpmhc / jproctor@mtholyoke.edu

•   #TPR2




Colophon
•   This presentation is set in Exo Extra Bold from Natanael
    Gama’s ndiscovered, with headings in ChunkFive from The
    League of Movable Type.

•   Background images were adapted from
    FreeSeamlessTextures.com’s Red Watercolor and The Grid, by
    Willem Pirquin.



Colophon (continued)
•   Card-size survival tool photo via acreativeedge.info

•   Leatherman photo via SonnyandSandy

•   Studley Tool Chest photo via FineWoodworking.com




Colophon (continued)
•       Audio from Wikipedia:Sound/List:
    •    Edvard Grieg - Piano Concerto in A Minor, Op. 16 - iii. Allegro moderato molto, recorded by
         the Skidmore College Orchestra.

    •    W.A. Mozart - 5th Piano Concerto, i. Allegro aperto, recorded by Ben Goldstein and Bendik
         Eide.

    •    Anton Reicha - Variations for Bassoon, recorded by Arthur Grossman

    •    J.S. Bach - Cello Suite 1 in G - Minuets, recorded by John Michel

    •    Mississippi John Hurt - “Nobody’s Dirty Business”



Colophon (continued)
•       Other Audio

    •    Jack Beaver - “Workaday World”

    •    Danny Elfman - “Breakfast Machine”





Más contenido relacionado

Destacado

Respective scopes of european and national laws concerning crowdfunding opera...
Respective scopes of european and national laws concerning crowdfunding opera...Respective scopes of european and national laws concerning crowdfunding opera...
Respective scopes of european and national laws concerning crowdfunding opera...FinPart
 
Benefits usa senior deck
Benefits usa senior deckBenefits usa senior deck
Benefits usa senior deckleeg69
 
Sammousa - The story in pictures
Sammousa - The story in picturesSammousa - The story in pictures
Sammousa - The story in picturessubravedula
 
The Power of Attendance
The Power of AttendanceThe Power of Attendance
The Power of AttendanceBIE Resources
 
กลุ่มอาการดาวน์
กลุ่มอาการดาวน์กลุ่มอาการดาวน์
กลุ่มอาการดาวน์Atirak Pakdepin
 
HHS Ignite: Year One Results
HHS Ignite: Year One Results HHS Ignite: Year One Results
HHS Ignite: Year One Results Steven Randazzo
 
Criolla music day
Criolla music dayCriolla music day
Criolla music dayalvarorv14
 
Assumptions in problem framing
Assumptions in problem framingAssumptions in problem framing
Assumptions in problem framingBhanu Pratap Singh
 
Installprocedure bp publ_sector_en_be
Installprocedure bp publ_sector_en_beInstallprocedure bp publ_sector_en_be
Installprocedure bp publ_sector_en_bejl_merino
 
Jadwal pelajaran dan daftar piket kelas 48
Jadwal pelajaran dan daftar piket kelas 48Jadwal pelajaran dan daftar piket kelas 48
Jadwal pelajaran dan daftar piket kelas 48agus ZM
 
61557874 volume-i-ericsson-umts-rf-optimization-12 dec2003
61557874 volume-i-ericsson-umts-rf-optimization-12 dec200361557874 volume-i-ericsson-umts-rf-optimization-12 dec2003
61557874 volume-i-ericsson-umts-rf-optimization-12 dec2003Mohammad Khamiseh
 
Portafolio electronico
Portafolio electronicoPortafolio electronico
Portafolio electronicopaco-andrea
 

Destacado (15)

Report polsci
Report polsciReport polsci
Report polsci
 
Respective scopes of european and national laws concerning crowdfunding opera...
Respective scopes of european and national laws concerning crowdfunding opera...Respective scopes of european and national laws concerning crowdfunding opera...
Respective scopes of european and national laws concerning crowdfunding opera...
 
Benefits usa senior deck
Benefits usa senior deckBenefits usa senior deck
Benefits usa senior deck
 
Sammousa - The story in pictures
Sammousa - The story in picturesSammousa - The story in pictures
Sammousa - The story in pictures
 
The Power of Attendance
The Power of AttendanceThe Power of Attendance
The Power of Attendance
 
กลุ่มอาการดาวน์
กลุ่มอาการดาวน์กลุ่มอาการดาวน์
กลุ่มอาการดาวน์
 
HHS Ignite: Year One Results
HHS Ignite: Year One Results HHS Ignite: Year One Results
HHS Ignite: Year One Results
 
Criolla music day
Criolla music dayCriolla music day
Criolla music day
 
測試用簡報
測試用簡報測試用簡報
測試用簡報
 
Assumptions in problem framing
Assumptions in problem framingAssumptions in problem framing
Assumptions in problem framing
 
Installprocedure bp publ_sector_en_be
Installprocedure bp publ_sector_en_beInstallprocedure bp publ_sector_en_be
Installprocedure bp publ_sector_en_be
 
Jadwal pelajaran dan daftar piket kelas 48
Jadwal pelajaran dan daftar piket kelas 48Jadwal pelajaran dan daftar piket kelas 48
Jadwal pelajaran dan daftar piket kelas 48
 
61557874 volume-i-ericsson-umts-rf-optimization-12 dec2003
61557874 volume-i-ericsson-umts-rf-optimization-12 dec200361557874 volume-i-ericsson-umts-rf-optimization-12 dec2003
61557874 volume-i-ericsson-umts-rf-optimization-12 dec2003
 
Portafolio electronico
Portafolio electronicoPortafolio electronico
Portafolio electronico
 
Hizb 37
Hizb 37Hizb 37
Hizb 37
 

Similar a I Can Convert

Archiving Web News (captioned)
Archiving Web News (captioned)Archiving Web News (captioned)
Archiving Web News (captioned)SvenAas
 
Slides: NoSQL Data Modeling Using JSON Documents – A Practical Approach
Slides: NoSQL Data Modeling Using JSON Documents – A Practical ApproachSlides: NoSQL Data Modeling Using JSON Documents – A Practical Approach
Slides: NoSQL Data Modeling Using JSON Documents – A Practical ApproachDATAVERSITY
 
NoSQL on ACID - Meet Unstructured Postgres
NoSQL on ACID - Meet Unstructured PostgresNoSQL on ACID - Meet Unstructured Postgres
NoSQL on ACID - Meet Unstructured PostgresEDB
 
Meandre Architecture
Meandre ArchitectureMeandre Architecture
Meandre ArchitectureLoretta Auvil
 
Meandre Architecture Ws Apr 2009
Meandre Architecture Ws Apr 2009Meandre Architecture Ws Apr 2009
Meandre Architecture Ws Apr 2009Loretta Auvil
 
SEASR-Meandre Architecture Ws Jan 2009
SEASR-Meandre Architecture Ws Jan 2009SEASR-Meandre Architecture Ws Jan 2009
SEASR-Meandre Architecture Ws Jan 2009Loretta Auvil
 
Embedding Metadata In Word Processing Documents
Embedding Metadata In Word Processing DocumentsEmbedding Metadata In Word Processing Documents
Embedding Metadata In Word Processing DocumentsJim Downing
 
University of Liverpool: TERMINALFOUR & App Development- Making the Most of y...
University of Liverpool: TERMINALFOUR & App Development- Making the Most of y...University of Liverpool: TERMINALFOUR & App Development- Making the Most of y...
University of Liverpool: TERMINALFOUR & App Development- Making the Most of y...Terminalfour
 
Data Persistence as a Language Feature
Data Persistence as a Language FeatureData Persistence as a Language Feature
Data Persistence as a Language FeatureRob Tweed
 
Advanced Site Studio Class, June 18, 2012
Advanced Site Studio Class, June 18, 2012Advanced Site Studio Class, June 18, 2012
Advanced Site Studio Class, June 18, 2012Lee Klement
 
Accelerating Delivery of Data Products - The EBSCO Way
Accelerating Delivery of Data Products - The EBSCO WayAccelerating Delivery of Data Products - The EBSCO Way
Accelerating Delivery of Data Products - The EBSCO WayMongoDB
 
MongoDB using PHP: Using a New Framework Called Ox
MongoDB using PHP: Using a New Framework Called OxMongoDB using PHP: Using a New Framework Called Ox
MongoDB using PHP: Using a New Framework Called OxMongoDB
 
DDS tutorial with connector
DDS tutorial with connectorDDS tutorial with connector
DDS tutorial with connectorJavier Povedano
 
Building an ETL pipeline for Elasticsearch using Spark
Building an ETL pipeline for Elasticsearch using SparkBuilding an ETL pipeline for Elasticsearch using Spark
Building an ETL pipeline for Elasticsearch using SparkItai Yaffe
 

Similar a I Can Convert (20)

Archiving Web News (captioned)
Archiving Web News (captioned)Archiving Web News (captioned)
Archiving Web News (captioned)
 
SEASR eScience 2008
SEASR eScience 2008SEASR eScience 2008
SEASR eScience 2008
 
Slides: NoSQL Data Modeling Using JSON Documents – A Practical Approach
Slides: NoSQL Data Modeling Using JSON Documents – A Practical ApproachSlides: NoSQL Data Modeling Using JSON Documents – A Practical Approach
Slides: NoSQL Data Modeling Using JSON Documents – A Practical Approach
 
NoSQL on ACID - Meet Unstructured Postgres
NoSQL on ACID - Meet Unstructured PostgresNoSQL on ACID - Meet Unstructured Postgres
NoSQL on ACID - Meet Unstructured Postgres
 
Meandre Architecture
Meandre ArchitectureMeandre Architecture
Meandre Architecture
 
Meandre Architecture Ws Apr 2009
Meandre Architecture Ws Apr 2009Meandre Architecture Ws Apr 2009
Meandre Architecture Ws Apr 2009
 
SEASR-Meandre Architecture Ws Jan 2009
SEASR-Meandre Architecture Ws Jan 2009SEASR-Meandre Architecture Ws Jan 2009
SEASR-Meandre Architecture Ws Jan 2009
 
Embedding Metadata In Word Processing Documents
Embedding Metadata In Word Processing DocumentsEmbedding Metadata In Word Processing Documents
Embedding Metadata In Word Processing Documents
 
MichaelLutherResume60
MichaelLutherResume60MichaelLutherResume60
MichaelLutherResume60
 
University of Liverpool: TERMINALFOUR & App Development- Making the Most of y...
University of Liverpool: TERMINALFOUR & App Development- Making the Most of y...University of Liverpool: TERMINALFOUR & App Development- Making the Most of y...
University of Liverpool: TERMINALFOUR & App Development- Making the Most of y...
 
Data Persistence as a Language Feature
Data Persistence as a Language FeatureData Persistence as a Language Feature
Data Persistence as a Language Feature
 
Json
JsonJson
Json
 
Advanced Site Studio Class, June 18, 2012
Advanced Site Studio Class, June 18, 2012Advanced Site Studio Class, June 18, 2012
Advanced Site Studio Class, June 18, 2012
 
Accelerating Delivery of Data Products - The EBSCO Way
Accelerating Delivery of Data Products - The EBSCO WayAccelerating Delivery of Data Products - The EBSCO Way
Accelerating Delivery of Data Products - The EBSCO Way
 
Semantic Web For Energy [Malcolm Murray]
Semantic Web For Energy [Malcolm Murray]Semantic Web For Energy [Malcolm Murray]
Semantic Web For Energy [Malcolm Murray]
 
394 wade word2007-ssp2008
394 wade word2007-ssp2008394 wade word2007-ssp2008
394 wade word2007-ssp2008
 
MongoDB using PHP: Using a New Framework Called Ox
MongoDB using PHP: Using a New Framework Called OxMongoDB using PHP: Using a New Framework Called Ox
MongoDB using PHP: Using a New Framework Called Ox
 
DDS tutorial with connector
DDS tutorial with connectorDDS tutorial with connector
DDS tutorial with connector
 
Building an ETL pipeline for Elasticsearch using Spark
Building an ETL pipeline for Elasticsearch using SparkBuilding an ETL pipeline for Elasticsearch using Spark
Building an ETL pipeline for Elasticsearch using Spark
 
Compiler project
Compiler  projectCompiler  project
Compiler project
 

Último

Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 

Último (20)

Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 

I Can Convert

  • 1. I Can Convert! by Sven Aas and Jason Proctor
  • 2. I Can Convert! • Sven Aas: @svenaas / saas@mtholyoke.edu • Jason Proctor: @jmpmhc / jproctor@mtholyoke.edu • #TPR2 ©2012 Sven Aas and Jason Proctor, Mount Holyoke College
  • 3. We’re going to talk about • Stories • Patterns • Tools
  • 4. Use Your Tools!
  • 5. Use Your Tools • Spreadsheet • Programmer’s Editor • Programming Language
  • 6. Spreadsheet
  • 7. Spreadsheet
  • 8. Programmer’s Editor
  • 9. Programmer’s Editor
  • 10. Programming Language
  • 11. Programming Language
  • 12. Use Your Tools! You’ve GOT this stuff.
  • 13. Getting Deported
  • 14. Portal News
  • 15. Unusual Data Representation
      Data: node, name, type, mode, owner, group, url, desc, parent, linkto, ctime, mtime, mod_by, visible, userdata, datasize, datafilename
      | 4692909 | G1158673129-8322 |   16 | rwlrwlr-l |
      21139 | 71 1000009 1000010 1000011 1000012 1000013
      1000014 1000015 1000016 1000017 1000018 1000019
      1000020 |      |      | 2100709 |   NULL | 1158673129
      | 1170344089 | 21139  |       1 |
      01|Second*Saturday: MHC Students Hit the Road|As part
      of new student orientation, members of the class of
      2010 worked on community service projects across the
      Pioneer Valley on September 16. View the photo
      gallery.||http://www.mtholyoke.edu/offices/comm/news/
      sec_sat_06/page1.html|1158638400|1170305999|||||
      11.41|:^:^:^:^:^JPG:^75:^75:^2813:^Second
      Saturday:^:^:^:^0:^
      |     2813 | V1158673129-9689 |
  • 16. Ruby to the Rescue [Diagram labels: Portal System, Importer, News System, LegacyUser, LegacyItem, User, Item, Story, Link, Channel]
  • 17. ActiveRecord • A Ruby library which implements the ActiveRecord software architecture pattern. • The original Model and ORM component of Ruby on Rails. • We used it to provide a convenient object layer on top of two underlying relational databases.
  • 18. Conversion Patterns
  • 19. Object Extraction Context: Ingesting source data. Problem: Source data objects contain multiple target objects. Solution: Process or parse target data just enough to extract objects. Tools: String methods, RegEx, DOM/XML selection.
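The Object Extraction pattern can be sketched in a few lines of plain Ruby. This is only an illustration, not the authors' importer: the pipe-delimited field order, the hash keys, and the `extract_objects` name are all assumptions, loosely modelled on the `userdata` field shown on slide 15.

```ruby
# Hypothetical legacy record: one pipe-delimited string that holds both a
# story (title, summary) and a link (url). The field order is an assumption.
def extract_objects(raw)
  # Parse just enough to pull the objects out; -1 keeps trailing empty fields.
  fields = raw.split("|", -1)
  story = { title: fields[1], summary: fields[2] }
  url = fields.find { |f| f.match?(%r{\Ahttps?://}) }
  link = url ? { url: url } : nil
  [story, link]
end

raw = "01|Students Hit the Road|Members of the class worked on projects.||http://www.example.edu/news/page1.html|1158638400"
story, link = extract_objects(raw)
puts story[:title]
puts link[:url]
```

One source record yields two target objects here; a record without a URL field yields a story and `nil` for the link.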
  • 20. Encoding Change Context: Mapping source data to target. Problem: Source text encoding differs from target. Solution: Perform intermediate translation. Tools: String methods, RegEx, programming libraries.
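A minimal sketch of the Encoding Change pattern using Ruby's built-in `String#encode`. The slides do not say which encodings were involved; Windows-1252 as the source and UTF-8 as the target are assumptions, and `to_utf8` is a hypothetical helper name.

```ruby
# Re-label raw legacy bytes with their (assumed) source encoding, then
# transcode to UTF-8. Bytes with no mapping are replaced with "?".
def to_utf8(bytes, source_encoding = "Windows-1252")
  bytes.dup.force_encoding(source_encoding)
       .encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
end

# \xE9 is e-acute and \x93 / \x94 are curly quotes in Windows-1252.
legacy = "Caf\xE9 \x93Road trip\x94".b
puts to_utf8(legacy)
```

Replacing unmappable bytes keeps a long import running; raising instead (by dropping the `invalid:` and `undef:` options) may be preferable when every bad record should be reviewed.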
  • 21. URL/Path Translation Context: Preparing target environment and data. Problem: Assets in target system will be available at different paths or URLs from their locations in source system. Solution: Map source locations to target locations. Replace references in data before saving to target. Tools: String methods, RegEx, DOM/XML selection.
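URL/Path Translation can be approximated with a lookup table and a single `gsub`. The URLs below are placeholders and `translate_urls` is a hypothetical helper; a real conversion might instead walk the DOM, as the slide's tool list suggests.

```ruby
# Replace every known source prefix with its target prefix.
def translate_urls(text, url_map)
  # Longest source prefixes first, so a specific mapping wins over a general one.
  pattern = Regexp.union(url_map.keys.sort_by { |key| -key.length })
  text.gsub(pattern) { |match| url_map[match] }
end

url_map = {
  "http://old.example.edu/offices/comm/news/" => "https://www.example.edu/news/",
  "http://old.example.edu/" => "https://www.example.edu/"
}
html = '<a href="http://old.example.edu/offices/comm/news/story.html">Story</a> ' \
       '<img src="http://old.example.edu/img/a.jpg">'
puts translate_urls(html, url_map)
```

Applying the map to the data before saving it to the target means the target system never stores a stale reference.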
  • 22. Getting the News Out
  • 23. Easy Come, Easy Go 1. Export Athletics news items to hosted service. 2. Export all news items to digital archives.
  • 24. Exporting Athletics Items • 10 years of Athletics news in 14 channels. • Export each item in a minimal, predictable HTML wrapper. • Include metadata for each item in <meta> tags in the <head>. • Group items by sport and by academic year. • Generally accommodate the target system.
  • 25. HAML • A lightweight markup language used to generate HTML. • A meta-markup language. • We used it to succinctly express the HTML we wanted from within our Ruby code.
  • 26. Archiving Web News • 14 years of news: 6,000 items, 5,000 images, 34 channels. • Export each news item in an archival form preserving the original markup and character entities (but not the design) • PDF generated from HTML generated from HAML • Export Dublin Core metadata for each news item: • XML generated via Builder
  • 27. Builder • A Ruby library for generating XML. • We used it to dynamically generate simple XML from within a Ruby application.
  • 28. wkhtmltopdf • A shell utility for generating PDF files by rendering HTML documents using the WebKit rendering engine. • A Ruby library providing programmatic access to the wkhtmltopdf shell utility. • We used it so that we could use familiar web development techniques to generate PDFs without having to implement our own rendering and layout routines.
  • 29. Familiar Patterns • Object Extraction • Encoding Change • URL/Path Translation
  • 30. Direct Translation Context: Simple conversion. Problem: Data conversion. Solution: Read source objects and write targets in single pass. Tools: Varies.
  • 31. Markup Change Context: Mapping source data to target. Problem: Source text markup differs from target. Solution: Perform intermediate translation. Tools: String methods, RegEx, DOM/XML selection, programming libraries.
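As one possible reading of the Markup Change pattern, the sketch below rewrites presentational tags into semantic ones with a regular expression. The tag mapping and the `modernize_markup` name are assumptions, and a regex is only adequate for simple, predictable markup; a DOM parser is safer for anything irregular.

```ruby
# Rename opening and closing tags according to tag_map, keeping attributes.
def modernize_markup(html, tag_map = { "b" => "strong", "i" => "em" })
  names = tag_map.keys.map { |name| Regexp.escape(name) }.join("|")
  # Group 1: optional "/", group 2: tag name, group 3: optional attributes.
  html.gsub(%r{<(/?)(#{names})(\s[^>]*)?>}i) do
    "<#{$1}#{tag_map[$2.downcase]}#{$3}>"
  end
end

puts modernize_markup('<b>Bold</b> and <I class="note">italic</I><br>')
```

Requiring whitespace or `>` right after the tag name keeps `<br>` and `<img>` from being mistaken for `<b>` and `<i>`.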
  • 32. Data Cleanup Context: Ingesting source data. Problem: Source data is ... imperfect. Solution: Fix what you can confidently fix. Tools: Varies.
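"Fix what you can confidently fix" might look like the following sketch: normalize whitespace unconditionally, parse a date only when it matches one known format, and leave anything else empty for a person to review. The item fields, the date format, and `clean_item` are assumptions made for illustration.

```ruby
require "date"

# Return a cleaned copy of the item; never guess at values that do not parse.
def clean_item(item)
  cleaned = item.dup
  # Collapse runs of whitespace and trim the ends.
  cleaned[:title] = item[:title].to_s.gsub(/[[:space:]]+/, " ").strip
  begin
    cleaned[:date] = Date.strptime(item[:date].to_s, "%B %d, %Y")
  rescue ArgumentError
    cleaned[:date] = nil # not confidently fixable; leave for human review
  end
  cleaned
end

item = clean_item({ title: "  Second \t Saturday\n", date: "September 16, 2006" })
puts item[:title]
puts item[:date]
```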
  • 33. Convert All the Things!
  • 34. Finally Done with News? • HTML files scraped via Nokogiri scripts. • Quite a bit of cleanup: garbage in, garbage out. • Unscrapable news items. • “September 12, 2001”.
  • 35. Nokogiri • A Ruby library for parsing XML and HTML. • Supports DOM or SAX parsing. • Implements both XPath and CSS3 selectors. • We used it to parse and extract content from the set of HTML files containing existing news stories.
  • 36. Familiar Patterns • Direct Translation • Encoding Change • Markup Change • URL/Path Translation • Data Cleanup
  • 37. The Big One
  • 38. CMS Conversion • Old CMS pages all published with several different presentational styles, but all with the same DOM. That means we can scrape ’em! • We agreed not to change anything else during the import. That means we can treat it as a clean switchover.
  • 39. Three-Pronged Conversion
  • 40. Three-Pronged Conversion • Build the necessary structures and themes to accommodate and represent our old content. • Build a library of code for scraping the pages generated by the old site, cataloging data and metadata, and storing them in an intermediate representation. • Build a library of code for importing this intermediate representation into the new CMS structures.
  • 41. Migrate • A Drupal module providing a framework for data import into the Drupal content management system. • Supports a variety of sources and targets out of the box. • Extensible to support additional migration sources and targets. • We used it to import the XML representation of our site into our Drupal system.
  • 42. Familiar Patterns • Object Extraction • Encoding Change • Markup Change • URL/Path Translation • Data Cleanup
  • 43. Intermediate Representation Context: Complex conversion. Problem: Data conversion. Solution: Convert source data to intermediate representation in one pass. Then convert intermediate representation to target. Tools: Representation: Database, XML, CSV. Conversion: Varies.
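The two-pass shape of the Intermediate Representation pattern can be sketched as below. JSON stands in for the intermediate store purely to keep the example dependency-free; the slide names a database, XML, or CSV, and the talk used XML for the CMS import. The "Key: value" source format and both method names are assumptions.

```ruby
require "json"

# Pass 1: source -> intermediate. The source is a hypothetical block of
# "Key: value" lines per record, with blank lines between records.
def to_intermediate(source_text)
  source_text.split(/\n{2,}/).map do |block|
    block.lines.each_with_object({}) do |line, record|
      key, value = line.chomp.split(/:\s*/, 2)
      record[key.downcase] = value if key && value
    end
  end
end

# Pass 2: intermediate -> target (here, small HTML fragments).
def to_target(records)
  records.map { |r| "<h1>#{r['title']}</h1>\n<p>#{r['body']}</p>" }
end

source = "Title: First story\nBody: Hello\n\nTitle: Second story\nBody: World\n"
intermediate = JSON.generate(to_intermediate(source)) # could be saved and inspected
puts to_target(JSON.parse(intermediate))
```

Because the intermediate form can be written to disk, each pass can be rerun and checked on its own, which is the main attraction for a complex conversion.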
  • 44. Object Identity Context: Ingesting source data. Problem: Data objects are repeated in source data. Solution: Uniquely identify source objects. Tools: String methods, RegEx, DOM/XML selection.
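One way to give repeated source objects a stable identity is to hash their normalized content. This is a sketch under assumptions: the slides do not say how identity was established, and the choice of fields, the normalization rules, and the method names here are hypothetical.

```ruby
require "digest"

# Build a key from normalized content so cosmetic differences
# (case, stray whitespace) do not produce distinct identities.
def identity_key(item)
  normalized = [item[:title], item[:body]].map do |text|
    text.to_s.strip.downcase.gsub(/\s+/, " ")
  end
  Digest::SHA1.hexdigest(normalized.join("\n"))
end

# Keep the first occurrence of each distinct object.
def unique_items(items)
  items.uniq { |item| identity_key(item) }
end

items = [
  { title: "Second Saturday", body: "Students hit  the road." },
  { title: " second saturday ", body: "Students hit the road." },
  { title: "Convocation", body: "The year begins." }
]
puts unique_items(items).length
```

Where the source exposes a natural key, such as a node number or a canonical URL, that is usually a better identifier than a content hash.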
  • 45. Object Aggregation Context: Ingesting source data. Problem: Target data objects contain multiple source objects. Solution: Aggregate objects at intermediate or output stage. Tools: Varies.
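Object Aggregation is the reverse of extraction: several source objects fold into one target object. A minimal sketch with `Enumerable#group_by` follows; the `:channel` and `:title` fields and the `aggregate_by_channel` name are assumptions.

```ruby
# Fold many source items into one target object per channel.
def aggregate_by_channel(items)
  items.group_by { |item| item[:channel] }.map do |channel, group|
    { channel: channel, stories: group.map { |item| item[:title] } }
  end
end

items = [
  { channel: "Soccer", title: "Season opener" },
  { channel: "Rowing", title: "Regatta results" },
  { channel: "Soccer", title: "Home win" }
]
aggregate_by_channel(items).each do |target|
  puts "#{target[:channel]}: #{target[:stories].join(', ')}"
end
```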
  • 46. Lessons • You already have a good toolbox. Keep your tools sharp. • Understand your source and target models. • Watch for familiar patterns. • Conversion is an opportunity for cleanup and improvement. • Human labor can sometimes be cheaper than automation.
  • 47. YOU Can Convert
  • 48. Questions?
  • 49. Thank you, & keep in touch! • Sven Aas: @svenaas / saas@mtholyoke.edu • Jason Proctor: @jmpmhc / jproctor@mtholyoke.edu • #TPR2
  • 50. Colophon • This presentation is set in Exo Extra Bold from Natanael Gama’s ndiscovered, with headings in ChunkFive from The League of Movable Type. • Background images were adapted from FreeSeamlessTextures.com’s Red Watercolor and The Grid, by Willem Pirquin.
  • 51. Colophon (continued) • Card-size survival tool photo via acreativeedge.info • Leatherman photo via SonnyandSandy • Studley Tool Chest photo via FineWoodworking.com
  • 52. Colophon (continued) • Audio from Wikipedia:Sound/List: • Edvard Grieg - Piano Concerto in A Minor, Op. 16 - iii. Allegro moderato molto, recorded by the Skidmore College Orchestra. • W.A. Mozart - 5th Piano Concerto, i. Allegro aperto, recorded by Ben Goldstein and Bendik Eide. • Anton Reicha - Variations for Bassoon, recorded by Arthur Grossman • J.S. Bach - Cello Suite 1 in G - Minuets, recorded by John Michel • Mississippi John Hurt - “Nobody’s Dirty Business”
  • 53. Colophon (continued) • Other Audio • Jack Beaver - “Workaday World” • Danny Elfman - “Breakfast Machine”