Data Wrangling: MSCS
View from the trenches
What we've learned
Where we failed
How we succeeded
You do what?

Liaising between tech services, the project team,
and vendors on data manipulation and display

Skills:
− MARC and ILS data migration/manipulation
− Nitty-gritty details: hows and whys
− Knowledge sharing between partners
− Investigations and implementations
− Project management
− Meeting management
Data driven? Start at the end!

What do you really want to know?

Do you have the data to answer that?

What are you going to do with the data?

What is interesting vs. what is actionable

Test out your theories!!
We Needed Data
Data driven? Start at the end!

Comparisons across institutions – match points
Started with an OCLC reclamation project
Institution    Records Sent    Returned Unresolved    Updated OCLC #
Ursus             2,100,299                 13,232           171,474
Colby               474,438                    373            26,334
Bowdoin             624,164                                   37,848
Bates               656,926                                   25,101
TOTALS            3,855,827                 13,605           260,757
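
The totals row can be checked against the per-institution rows. A minimal sketch using only the figures on the slide; the dictionary layout and variable names are illustrative, and treating the blank "unresolved" cells as 0 is an assumption.

```python
# Figures copied from the reclamation table above.
# Blank "unresolved" cells (Bowdoin, Bates) are assumed to mean 0.
rows = {
    "Ursus":   {"sent": 2_100_299, "unresolved": 13_232, "updated": 171_474},
    "Colby":   {"sent": 474_438,   "unresolved": 373,    "updated": 26_334},
    "Bowdoin": {"sent": 624_164,   "unresolved": 0,      "updated": 37_848},
    "Bates":   {"sent": 656_926,   "unresolved": 0,      "updated": 25_101},
}
reported_totals = {"sent": 3_855_827, "unresolved": 13_605, "updated": 260_757}

# Sum each column and compare it with the reported TOTALS row.
computed_totals = {
    column: sum(row[column] for row in rows.values())
    for column in reported_totals
}
mismatches = {
    column: (computed_totals[column], reported_totals[column])
    for column in reported_totals
    if computed_totals[column] != reported_totals[column]
}
print(mismatches or "totals reconcile")
```

Running the same check on every file that goes out or comes back is a cheap way to catch a dropped batch early.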
Start at the end... if you're ordering out

Think about what you want to get back, and make
sure it goes out.

HOW will you deal with returned data?

Can all the partners do the same things in terms
of processing?
Lists, lists, lists!
What will you in/exclude if you are extracting:
types: gov docs, serials, media, e-resources
locations: ref, off-site, reserve, special collections
status: billed, missing, suppressed, withdrawn (!)
use: circ, internal use, reserves
What constitutes a circulating copy?
How are the above encoded?
Can you get what you want?
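
The in/exclude decisions above can be written down as explicit lists, which makes them easy to share and compare between partners. A minimal sketch: the record layout (`type`, `location`, `status` keys), the code values, and the function name `should_extract` are all hypothetical, and every category is excluded here purely for illustration. Whether each one is in or out is a project decision.

```python
# Hypothetical exclusion lists; the codes are placeholders, not real ILS values.
EXCLUDED_TYPES = {"gov_doc", "serial", "media", "e_resource"}
EXCLUDED_LOCATIONS = {"ref", "off_site", "reserve", "special_collections"}
EXCLUDED_STATUSES = {"billed", "missing", "suppressed", "withdrawn"}


def should_extract(item):
    """Return True if the item passes every exclusion list."""
    return (
        item["type"] not in EXCLUDED_TYPES
        and item["location"] not in EXCLUDED_LOCATIONS
        and item["status"] not in EXCLUDED_STATUSES
    )


# Made-up sample items, for illustration only.
items = [
    {"id": "i1", "type": "book", "location": "stacks", "status": "available"},
    {"id": "i2", "type": "book", "location": "stacks", "status": "withdrawn"},
    {"id": "i3", "type": "serial", "location": "stacks", "status": "available"},
]
extracted = [item["id"] for item in items if should_extract(item)]
print(extracted)
```

Keeping the lists as data rather than burying them in a query also gives you something concrete to document.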
Circ Data

How long has it been retained?

Any tech processing that included circing?

Has it ever been cleared?

(… and what does it really tell you ...)
Know your vendor / programmer

What exactly is going to happen to the data,
and what will be in(ex)cluded?

Leader bib level m, s

Gov doc? (008/28)?

Printed material? Media?
So, you think you know your data...
Can you get it out?
Export Tables

What exactly is exported

What do they do with weird data? (b b, b 930)

Do they add any data? (v.v.29, OCLC prefix)

Formats of dates
Your data may vary
35109002285482 3510900228549
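
The two barcodes on the slide differ by one digit in length, which is easy to miss by eye. A sketch of a length check, assuming 14 digits is the expected form (an assumption based on the first value shown; adjust for local barcode schemes). The function name `is_well_formed` is illustrative.

```python
EXPECTED_LENGTH = 14  # assumed expected barcode length

barcodes = ["35109002285482", "3510900228549"]  # the two values from the slide


def is_well_formed(barcode, length=EXPECTED_LENGTH):
    """True when the barcode is all digits and has the expected length."""
    return barcode.isdigit() and len(barcode) == length


suspect = [b for b in barcodes if not is_well_formed(b)]
print(suspect)
```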
Document!!! REALLY!!!

Export tables and field mappings

Locations

List creation criteria

Record ranges exported and dates

Files
… a few of the ugly things we saw...

Multiple fields used for internal use (INTL
USE, COPY USE, and IUSE3)

Records with multiple 001s

Records with multiple barcodes, duplicate
barcodes, bound with items

Barcodes in 949 not 'b'

Records with no 260

3 0000003 ocm3 3_
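
Several of the problems listed above can be flagged mechanically before data goes out. A minimal sketch, assuming each record has been reduced to a list of (tag, value) pairs; that layout, the `barcode` pseudo-tag, and all sample values are made up for illustration and do not reflect any particular ILS export.

```python
from collections import Counter

# Hypothetical simplified records: a list of (tag, value) pairs per record.
# Barcodes are assumed to sit under a "barcode" pseudo-tag for this sketch.
records = {
    "r1": [("001", "ocm00000001"), ("260", "Portland"), ("barcode", "A1")],
    "r2": [("001", "ocm00000002"), ("001", "ocm00000003"), ("260", "Bangor")],
    "r3": [("001", "ocm00000004"), ("barcode", "A1"), ("barcode", "A2")],
}


def tag_values(fields, tag):
    """Collect every value stored under the given tag."""
    return [value for field_tag, value in fields if field_tag == tag]


multiple_001 = [rid for rid, f in records.items() if len(tag_values(f, "001")) > 1]
missing_260 = [rid for rid, f in records.items() if not tag_values(f, "260")]
multiple_barcodes = [
    rid for rid, f in records.items() if len(tag_values(f, "barcode")) > 1
]

# A barcode counts as a duplicate when it appears on more than one record.
barcode_counts = Counter(
    barcode for f in records.values() for barcode in set(tag_values(f, "barcode"))
)
duplicate_barcodes = sorted(b for b, n in barcode_counts.items() if n > 1)

print(multiple_001, missing_260, multiple_barcodes, duplicate_barcodes)
```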
Your data through different lenses
Points of departure:
-Merged 001s
-FRBR
-Volume vs Title counts
-Unique vs Holdings counts
-Date of data used
-Definition of public domain
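
Volume versus title counts and unique versus holdings counts diverge quickly, so it helps to state which one a number represents. A sketch with made-up holdings rows; the tuple layout (library, title identifier, volume) is an assumption for illustration, and the figures are not project data.

```python
# Made-up holdings: (library, title_id, volume). Not real project data.
holdings = [
    ("Colby", "t1", "v.1"),
    ("Colby", "t1", "v.2"),
    ("Bates", "t1", "v.1"),
    ("Bowdoin", "t2", "v.1"),
]

# Every physical volume counts once.
volume_count = len(holdings)

# One holding per library per title, regardless of how many volumes.
title_holdings = {(library, title) for library, title, _ in holdings}
holdings_count = len(title_holdings)

# Distinct titles across all libraries.
unique_titles = len({title for _, title in title_holdings})

print(volume_count, holdings_count, unique_titles)
```

The same four rows yield three different "counts", which is why two analyses of the same collection can disagree without either being wrong.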
When things go wrong
MarcEdit is your friend!
One more reason to thank Terry Reese
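-- Control numbers (001) of records that have a 945 subfield f value greater than 0
SELECT T0xx.field_data
FROM T0xx
JOIN T9xx ON T0xx.cid = T9xx.cid
WHERE T0xx.field = '001'
  AND T9xx.field = '945'
  AND T9xx.subfield = 'f'   -- single quotes: double quotes denote identifiers in standard SQL
  AND T9xx.field_data > 0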
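
The query above, written with an explicit JOIN and single-quoted literals, can be tried against a small in-memory SQLite database. This is a sketch under assumptions: the column names `cid`, `field`, `subfield`, and `field_data` come from the query itself, but the table definitions, column types, and sample rows are invented for illustration and may not match the tables MarcEdit actually generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal schema: only the columns the query references.
conn.executescript("""
    CREATE TABLE T0xx (cid INTEGER, field TEXT, subfield TEXT, field_data TEXT);
    CREATE TABLE T9xx (cid INTEGER, field TEXT, subfield TEXT, field_data INTEGER);
""")
# Made-up rows: record 1 has a 945 $f of 3, record 2 has a 945 $f of 0.
conn.executemany(
    "INSERT INTO T0xx VALUES (?, ?, ?, ?)",
    [(1, "001", "", "ocm00000001"), (2, "001", "", "ocm00000002")],
)
conn.executemany(
    "INSERT INTO T9xx VALUES (?, ?, ?, ?)",
    [(1, "945", "f", 3), (2, "945", "f", 0)],
)

query = """
    SELECT T0xx.field_data
    FROM T0xx
    JOIN T9xx ON T0xx.cid = T9xx.cid
    WHERE T0xx.field = '001'
      AND T9xx.field = '945'
      AND T9xx.subfield = 'f'
      AND T9xx.field_data > 0
"""
matches = [row[0] for row in conn.execute(query)]
print(matches)
conn.close()
```

If the real `field_data` column is stored as text, the `> 0` comparison may behave differently from the integer column assumed here, so it is worth checking the column type before trusting the result.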
Data Wrangling: MSCS Side
Closing Haiku:
Data is messy
While it can be normalized
Nothing is perfect

Editor's notes

  1. It is easy to say “we want detailed subject analysis and title lists,” but if you don't have the staff time to review them, does it really matter? Try to have a clear picture BEFORE starting the project. (Data can go stale; interesting vs. actionable data.)
  2. Can you get what you want in a way that is meaningful to the vendor / programmer?
  3. Do you have enough for it to have value? (Some question whether it has value at all.) Did it get checked out to processing? Another example is getting lists of barcodes into a review file: we ran into odd internal use data in different fields. Do you really want to rely on it? That 1980s WordPerfect manual vs. Portuguese poetry.
  4. You've decided what you want and you've pulled all your data... and? Do you know how it's going to be processed?
  5. Variations in cataloging practices over time and space. Lots of oddities: no 260, no 001, multiple 001s...
  6. Internal use circ in a different field; different catalog.
  7. Sent data to three different places (again, document what went where!)
  8. Data is messy / Nothing is ever perfect / Please do not despair