1. Data
in databases
“It’s not what you think”
Clare Somerville
Trish O’Kane
2. Our point
Long term preservation of data requires
understanding how data is created and
managed
We have to work out:
◦ What data the business needs to keep
◦ What records the business needs
to create and keep
And….. how
◦ What data must be unchanged
◦ What we mean by usable and retrievable
3. The problem, as we see it
We will cover:
◦ What is a record, and what are its attributes?
◦ What is a database, and how are databases built and maintained?
◦ How can we use data sets to create records?
◦ What is a data warehouse, and how are data warehouses built and maintained?
◦ How can we ensure that useful data sets are available over time?
4. Agenda
The problem
Definitions
Delivering data &
records from data
◦ Data warehousing
◦ Data “lifecycle”
management
Conclusion
5. The problem
Databases have replaced many semi-structured
records
◦ Register of Births, Deaths and Marriages (and Divorces!)
◦ EQC claims data
But - we want some of that information available
long term in a usable format
Records managers are unfamiliar with the world
of structured data
◦ Disposal outcome in a draft disposal authority:
“When database decommissioned,
transfer to Archives NZ”
◦ Transfer what?
6. Who wants what?
What have we got?
◦ Data in databases
What do we want?
◦ Records
When do we want them?
◦ Now, and for the long term
But….what is a record
in the context of data?
◦ The individual data item?
◦ A whole dataset?
7. What have we got
1. Customers
◦ Customers for data
◦ Customers for records
2. Information assets
◦ Records
◦ Transactional data in databases
◦ Datasets
◦ Data marts and data warehouses
3. What do we have to do?
◦ Principles from data warehousing
◦ Data life cycle management
9. Records
Recordkeeping definition (Public Records Act 2005)
◦ A record or class of records in any form, in whole or part, created or received by a public office in the conduct of its affairs
In the structured world
◦ A record is a line of data in a table in a database
10. Attributes of a record
Recordkeeping perspective
◦ Documents the carrying out of the organisation’s business objectives, core business functions, services and deliverables, and/or
◦ Provides evidence of compliance with any current jurisdictional standards, and/or
◦ Documents the value of the resources of the organisation and how risks to the business are managed, and/or
◦ Supports the long-term viability of the organisation
Data management perspective
◦ Field types: numeric, character, date/time
◦ Composite, derived
◦ Values
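The data management view of a record (one row of typed fields in a table) can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module; the table and column names are assumptions made for the example, not taken from any system described in the slides.

```python
import sqlite3

# Hypothetical table for illustration; names are not from the slides.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customer (
        customer_id   INTEGER PRIMARY KEY,  -- numeric field
        customer_name TEXT NOT NULL,        -- character field
        date_created  TEXT NOT NULL         -- date/time field (ISO 8601 text)
    )
    """
)
conn.execute(
    "INSERT INTO customer VALUES (?, ?, ?)",
    (123, "Bloggs, Joe", "2008-10-20"),
)

# In the data management sense, this single row is one "record".
row = conn.execute(
    "SELECT customer_id, customer_name, date_created FROM customer"
).fetchone()

# A derived field is computed from stored values rather than stored itself.
year_created = int(row[2][:4])
print(row, year_created)
```

Nothing in this row says why it was created or what business activity it evidences, which is the gap between the two perspectives.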
11. Data and metadata
Documents and metadata
“Essentially there is a different relationship
between
data and its metadata
than
documents and their metadata”
12. Is it data or is it metadata?
It depends, doesn’t it?
It’s about the level at which it is used/applied
E.g. Date created
Customer ID | Date created | Customer name | Customer type
123 | 2008-10-20 | Bloggs, Joe | Retailer
124 | 2008-10-23 | Mouse, Minnie | Distributor
125 | 2008-10-26 | Max, | Direct
Metadata
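One way to picture the "level" point is to hold the same attribute at two levels. The sketch below is illustrative only: the Dataset class and the dataset-level date are assumptions, while the two customer rows come from the slide.

```python
from dataclasses import dataclass, field

@dataclass
class Dataset:
    """A table plus the metadata that describes the table as a whole."""
    name: str
    date_created: str                 # metadata: when the dataset was created
    columns: list[str]
    rows: list[tuple] = field(default_factory=list)

customers = Dataset(
    name="customers",
    date_created="2008-11-01",        # hypothetical value, not from the slide
    columns=["customer_id", "date_created", "customer_name", "customer_type"],
    rows=[
        (123, "2008-10-20", "Bloggs, Joe", "Retailer"),
        (124, "2008-10-23", "Mouse, Minnie", "Distributor"),
    ],
)

# Row level: "date created" is data about each customer.
idx = customers.columns.index("date_created")
row_dates = [row[idx] for row in customers.rows]

# Dataset level: "date created" is metadata about the whole table.
dataset_date = customers.date_created
print(row_dates, dataset_date)
```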
13. Metadata in the data warehouse
Business metadata
◦ Link between database and users – road map for access
◦ Business users
◦ Analysts
◦ Less technical
Technical metadata
◦ What data, from where, how, when etc
◦ Developers
◦ Technical users
◦ Maintenance and growth
◦ On-going development
14. Metadata in the data warehouse
Business metadata
◦ Structure of data
◦ Table names
◦ Attribute names
◦ Location
◦ Access
◦ Reliability
◦ Summarisations
◦ Business rules
Technical metadata
◦ Table names
◦ Keys
◦ Indexes
◦ Program names
◦ Job dependencies
◦ Transformation
◦ Execution time
◦ Audit, security controls
20. 3 layers
•User interface
Database
•Rules and algorithms
•Data
21. Application layer
◦ Provides views, creates reports
◦ Turns data into information
◦ Adds, overwrites, deletes data
◦ Runs rules and processes
Data layer
◦ Data in tables
◦ Acted on by application layer
22. Can data fit the PRA definition?
• We are “format neutral” in the
management of records, so….
• Data can be records!
– Births Deaths and Marriages Register
– EQC claims data
• Test questions
– If we exclude data what have we lost?
– What is the impact of losing data?
• On the business
• For the future
23. Source solution is not a recordkeeping system
[Diagram: application layer, data layer]
The Solution System is not a recordkeeping system because it…
• Holds transactional data, not evidence of transactions in context (records)
• Isn’t tamper proof
– Difficult to know exactly what the application layer is doing
– Different tables and rows may be managed differently
– Hard to roll back to a point in time
• Must overwrite ‘redundant’ data to run efficiently
– Compromise of history vs speed
– Business use is the priority
• The data layer is not usable without the application layer
24. Inside a database
Here today - gone tomorrow
Transaction metadata
◦ Example: An activity about a customer is a record
Is there a Unique ID
For the transaction?
For the customer?
Where and when are/were components
located?
Multiple data tables in one database
Multiple data tables across multiple databases
Table names and column names
Standard names for elements across tables
25. Source / business databases
Data stored in tables
Normalised structure
Lots of data
Large number of users
Lots of very quick transactions
Varying history retained
Mostly data is overwritten
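The difference between overwriting data and retaining history can be sketched with two tables. This is an assumed, simplified design for illustration (table names, columns and the set_type helper are hypothetical), not a description of any particular source system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Source-system style: one row per customer, replaced in place.
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, customer_type TEXT)")
# History-keeping style: one row per version, never updated.
conn.execute(
    "CREATE TABLE customer_history (id INTEGER, customer_type TEXT, valid_from TEXT)"
)

def set_type(customer_id, customer_type, as_of):
    """Record a customer's type in both styles (hypothetical helper)."""
    # Overwrite the current value; the previous value is lost in this table.
    conn.execute(
        "INSERT OR REPLACE INTO customer VALUES (?, ?)",
        (customer_id, customer_type),
    )
    # Append a new version so earlier values stay retrievable.
    conn.execute(
        "INSERT INTO customer_history VALUES (?, ?, ?)",
        (customer_id, customer_type, as_of),
    )

set_type(123, "Retailer", "2008-10-20")
set_type(123, "Distributor", "2009-03-01")

current = conn.execute(
    "SELECT customer_type FROM customer WHERE id = 123"
).fetchall()
history = conn.execute(
    "SELECT customer_type, valid_from FROM customer_history "
    "WHERE id = 123 ORDER BY valid_from"
).fetchall()
print(current, history)
```

Only the history table can answer "what type was this customer in 2008?", at the cost of storing more rows.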
27. Data warehouse
Storing and accessing large amounts of
data
Central repository for all or significant
parts of the data that an enterprise’s
various business systems collect
28. Data warehouse
Historical data
Multiple source systems
Designed for reporting and analysis
Lots of data!
Transaction level data
Large queries
Corporate effort
Multiple table joins
Centrally owned
Unpredictable use
Corporate needs
Pressure on resources
31. 1 Create policy to document:
What authoritative records must be retained
and what metadata must be retained
What formats are acceptable
Which (if any) records and metadata are
considered transient artefacts, and why (e.g.
format shifting duplicates, quality checking
etc),
Get approval for destruction of transient
artefacts as part of the normal functioning of
the systems that dispose of them
32. Approach: create and export
records from solution system
1. Identify what data tables/records are needed
and that can be produced
2. Map identified records to disposal authorities
◦ Which records must be kept beyond system
decommission
◦ Identify the business need for retention
3. Use the application layer to create and export
those records in a suitable format
4. Store in recordkeeping system e.g. data
warehouse or EDRMS
5. Retain records needed for the business post-
decommission
33. 2 Persistently associate metadata
Appropriate metadata associated and
retained with authoritative records
◦ Identify data linkages between systems
◦ Retain those linkages
or
◦ Consolidate metadata and associated record
objects into one system, and ensure they are
persistently associated
Ensure migrated data/metadata/objects
retain their context (e.g. date
created, author etc)
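One possible way to keep metadata persistently associated with a record on export is to bundle both into a single document and store a digest of the content. The sketch below is an assumption-laden illustration: the function names, field names and sample values are hypothetical, and JSON with SHA-256 is just one reasonable choice.

```python
import hashlib
import json

def package_record(content: dict, metadata: dict) -> str:
    """Bundle a record and its metadata into one JSON document.

    A SHA-256 digest of the content is stored alongside so that later
    alteration of the content can be detected.
    """
    canonical = json.dumps(content, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return json.dumps(
        {"metadata": metadata, "content": content, "sha256": digest},
        sort_keys=True,
    )

def verify_package(package: str) -> bool:
    """Return True if the content still matches its stored digest."""
    doc = json.loads(package)
    canonical = json.dumps(doc["content"], sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest() == doc["sha256"]

# Hypothetical record exported from a source system.
package = package_record(
    content={"customer_id": 123, "activity": "address change"},
    metadata={
        "date_created": "2008-10-20",
        "author": "jbloggs",
        "source_system": "case-mgmt",
    },
)
print(verify_package(package))
```

Because the context travels with the content, the package stays interpretable after the source system is decommissioned.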
34. Future state
BAU transfers to recordkeeping systems
Structured data to data warehouse
[Diagram: Customer mgmt system, Case mgmt system, EDRMS]
Create key records and send to EDRMS
37. Data feeds - principles
Direct data feeds from source systems
Not changed in any way
No intervening processes
All changes to the data
Fully auditable
Reconcile to source system
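Reconciling a feed to its source system could look something like the sketch below, which compares row counts and an order-independent digest. The function names are hypothetical and the approach is an assumption; the slides do not say how reconciliation was actually done.

```python
import hashlib

def fingerprint(rows):
    """Return (row count, order-independent digest) for an iterable of rows."""
    row_hashes = sorted(
        hashlib.sha256(repr(tuple(row)).encode("utf-8")).hexdigest()
        for row in rows
    )
    combined = hashlib.sha256("".join(row_hashes).encode("ascii")).hexdigest()
    return len(row_hashes), combined

def reconcile(source_rows, loaded_rows):
    """True when the loaded data matches the source exactly, ignoring row order."""
    return fingerprint(source_rows) == fingerprint(loaded_rows)

source = [(1, "a"), (2, "b")]
print(reconcile(source, list(reversed(source))))
```

A mismatch shows that the feed was changed or truncated somewhere between the source and the warehouse, which is the condition the principles above are meant to rule out.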
38. For example: one table…
Before:
◦ 29 months data
◦ 162 tapes
◦ 400 million records
◦ 88 GB
After:
◦ 29 months data
◦ 4 physical files
◦ 27 million records
◦ 6 GB
[Diagram: Month 1, Month 2, Month 3 … Month n; Compare; Differences 1, Differences 2 …; Consolidated file]
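The compare-and-consolidate idea can be sketched as storing the first monthly snapshot in full and only the differences for each later month. This is a minimal sketch under stated assumptions: each snapshot is a non-empty mapping from a record key to a tuple of values, the list of snapshots is not empty, and the function names are hypothetical. The method actually used for the table above is not described in the slides.

```python
def diff_snapshots(previous: dict, current: dict) -> dict:
    """Return only the changes needed to turn `previous` into `current`.

    Snapshots map a record key to its row values. A value of None in the
    result marks a key that was deleted.
    """
    changes = {
        key: row for key, row in current.items() if previous.get(key) != row
    }
    changes.update({key: None for key in previous if key not in current})
    return changes

def consolidate(snapshots: list[dict]) -> list[dict]:
    """Keep the first snapshot in full, then one differences set per later month."""
    consolidated = [dict(snapshots[0])]
    for previous, current in zip(snapshots, snapshots[1:]):
        consolidated.append(diff_snapshots(previous, current))
    return consolidated

def rebuild(consolidated: list[dict], month_index: int) -> dict:
    """Replay differences to recover the full snapshot for a given month."""
    state = dict(consolidated[0])
    for changes in consolidated[1:month_index + 1]:
        for key, row in changes.items():
            if row is None:
                state.pop(key, None)
            else:
                state[key] = row
    return state

months = [
    {1: ("a",), 2: ("b",)},
    {1: ("a",), 2: ("c",), 3: ("d",)},
    {2: ("c",), 3: ("d",)},
]
consolidated = consolidate(months)
print(consolidated)
```

Unchanged rows are stored once rather than every month, while any month can still be rebuilt in full, so no history is lost.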
39. Subsets
Frequently used data
At a point in time
Smaller, quicker
Easier to use
Daily, weekly, monthly
40. Summary data
Summary layer
Analysts access the summary layer
Smaller, easier
Data Marts
41. Benefits of data warehouse
Accessible
Stored online
Quick and easy to access
Multiple sources of data
Updated daily
Full history – track everything
Can do more – freedom to explore
Tuned environment
One version of the truth
42. Data management
Data does not manage itself!
Difficult, unruly
Standards, processes
Roles and responsibilities
Data warehouse team
Skills
◦ Data warehousing, Data management, Software, Hardware, Metadata, Architecture, Analysis, Performance, tuning
Coordination, communication, marketing
43. Best practice
Data warehousing around for years
Proven architectures, technologies,
methodologies
Good infrastructure
… but will it last?
45. Challenges – big data
33% - data growth
contributes to performance
issues “most of the time”
Managing storage may cost
3-10 times cost of
procurement
Average company keeps 20-
40 duplicates of its data
46. Helping IT and the business to
collaborate in managing data
It’s not just about BI
Business and IT must
work together
48. Decommission = risk
[Diagram: Old case mgmt system; Old EDRMS; New case mgmt system; New EDRMS; Data warehouse]
Partial exports
49. Data lifecycle management
Data lifecycle management (DLM)
Managing the flow of data, information
and associated metadata through
information systems and
repositories, from creation and storage
through to when it can be discarded.
Recognises that the importance and
business value of data does not rely on its
age, or how often it is used.
50. Why DLM
Data and information has value for
◦ strategic and operational business needs
◦ managing risk
◦ meeting legislative obligations
Value of information decays over time
Some information can be archived, some
discarded
Occasionally, sometimes
unexpectedly, older data may need to be
accessed again, quickly, completely and
accurately
51. DLM Components
[Diagram: data lifecycle around core data (Property, Customer, Tenancy)]
Create or Modify
◦ Standards
◦ Formats
◦ Includes data validation
◦ Retrieval
◦ Requires: core process artefacts, connected systems, automated capture
Maintain
◦ Organise
◦ Describe
◦ Manage
◦ Requires: risk identification, lifecycle policies, metadata schema, business classification linked to business process
Use
◦ Access
◦ Share
◦ Find
◦ Requires: single source of truth
Retain or Dispose
◦ Archive
◦ Transfer
◦ Destroy
◦ Requires: disposal authorities, business requirements, disposal planning, tiered storage
53. Create and maintain
Principle 1: Recordkeeping Must be
Planned and Implemented
1. Responsibility assigned CEO down
2. Policy
3. Procedures
4. Responsibilities defined, resourced
5. Recordkeeping programme & monitoring
54. Principle 2: Full & accurate records of business activity must be made
Requirement | Database | Data warehouse
1. Functions and business activities identified and documented
2. Records of business decisions and transactions must be created
3. All records of business activity captured routinely into an organisation-wide recordkeeping framework
4. Training provided
55. Principle 3: Records must provide authoritative and reliable evidence of business activity
Requirement | Database | Data warehouse
10. Authentic: accurately documented creation, receipt, & transmission
11. Reliability & integrity, maintained unaltered
12. Useable, retrievable, accessible
13. Complete, with content & contextual information
14. Comprehensive, provide authoritative evidence of all business activities
56. Principle 4: Records must be managed systematically
Requirement | Database | Data warehouse
15. Identified & captured in recordkeeping framework
16. Organised according to a business classification scheme
17. Reliably maintained over time in recordkeeping framework
18. Useable, accessible & retrievable for the entire period of their retention
19. Contextual and structural integrity maintained over time
20. Retention & disposal actions systematic
57. RK capability of system(s)
A system that holds authoritative records
◦ Must be capable of recordkeeping, or
◦ Made capable, or
◦ Must transfer records to a recordkeeping
system
Who makes that decision?
◦ Should be business owner
◦ (with advice from IT)
Data warehouses show us
◦ what can be done
◦ how to do it
58. Developing an Enterprise Information Management Framework
[Diagram: Enterprise Information Management (EIM) framework]
Governance: authority, management, monitoring and performance of information management functions
Information asset architecture: a blueprint for the semantic and physical integration of enterprise information assets, technology and the business
Business intelligence and data warehousing
Reference and master data management
Structured and unstructured information
Metadata management: the connecting foundation for EIM, used to describe, organise, integrate, share, and govern enterprise information assets
Security and control: policies, rules and tools that ensure the proper control, protection and privacy of information
Information culture: the behaviours, values and norms of the enterprise within the context of information use
Information stewardship: oversight of the content, description, quality, and accuracy of enterprise information throughout its lifecycle
Information types: social, emails, audio, mobile, IT/OT, transactional data, documents, images, text, movies, search
59. Future state of data
Accurate, relevant, timely delivery of data and
information
◦ Trustworthy information
◦ Where it is needed
◦ Formats most appropriate to business need and future
Information found quickly, whether it’s old or new
Clear guidelines for systems and processes
◦ Keep what’s needed for only as long as it’s needed
◦ In the right format
Data has recognisable value and appropriate levels of
management
◦ Business need: we know what’s important, and when it’s
important
◦ Risk: we’re clear about what to manage, and how
◦ Regulatory framework: we meet legislative obligations
60. Our point
Long term preservation of data requires
understanding how data is created and
managed
We have to work out:
◦ What data the business needs to keep
◦ What records the business needs
to create and keep
And….. how
◦ What data must be unchanged
◦ What we mean by usable and retrievable
Editor's notes
Adam Brown, Statistics NZ
Generally my feedback on ISO 15489 would be: why can't/shouldn't it be applied to data? At the end of the day data is just another record so it really shouldn't be an issue. Having said that I'm not sure that it would particularly add any value to it either. One of the main issues is how to define a record, in terms of data. Is it the individual data item or is it a whole dataset? This is certainly the most tricky issue because you generally maintain metadata at the dataset level but potentially slice and dice at lower levels.
The other key addition to it for data would have to be a greater focus on usability (7.2.5). As we know, with data this isn't a given to the same extent as it can be for a document. Significantly more information is required to be able to do anything with it - essentially there is a different relationship between data and its metadata than documents and their metadata.
In summary, the principles fit for applying it to data (and should be applied!) but as it is it wouldn't add much value.
Adam Brown – Stats NZ
CS: We used to believe that there was 1 byte of metadata for every 10 bytes of data.
CS: But those numbers are changing with metadata now exceeding data.
Data is a set of values, in this case in a comma delimited file. I need more information in order to know how to read this.
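The point about needing more information to read a comma delimited file can be illustrated as follows. The sample line reuses a customer from the slides; the schema (column names and types) is an assumption, standing in for the metadata a reader would need.

```python
import csv
import io
from datetime import date

raw = '123,2008-10-20,"Bloggs, Joe",Retailer\n'

# Without metadata: just a list of strings with no names or types.
values = next(csv.reader(io.StringIO(raw)))

# Hypothetical schema: the metadata needed to interpret each position.
schema = [
    ("customer_id", int),
    ("date_created", date.fromisoformat),
    ("customer_name", str),
    ("customer_type", str),
]
record = {name: parse(value) for (name, parse), value in zip(schema, values)}
print(values)
print(record)
```

The same bytes become a usable record only once the schema is supplied, which is why the metadata has to be preserved along with the data.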
From FMG
Volume, velocity, variety, complexity! Things like smart grids causing significant data volume rises.
Velocity – speed produced, received, processed
Variety – structured databases, emails, metering, video, image, financial transactions etc. Much unstructured – content analytics, taxonomy, ontology. Non-traditional BI tools
We’ve built skills in DI, DQ in BI and analytics. Have core data management skills to support BI. Other data requires the same DM practices
This slide provides some background on what data lifecycle management is, and generic reasons as to why it is important.
This slide provides a brief high-level overview of the future state of DLM at HNZC.
Accurate, relevant, timely delivery implies: data will be managed through its lifecycle in a way that ensures there is a single source of truth that meets the needs of users and can be accessed in a timely manner. “Everyone who needs it” will include mobile workers, which is also why format is an important consideration. This is just a take on “the right information to the right person at the right time (and in the right format)”.
Finding old or new information quickly implies: information is described by metadata so that it has context and meaning, and systems use that metadata to locate relevant content and deliver information to users.
Clear guidelines implies: principles will be agreed on, which will decompose into business rules for systems. For example: that X information should be kept for Y years, then disposed of by following Z procedures.
Appropriate levels of management implies: the value of data is understood within the context of business need, risk and legislation, and that processes are in place to manage how business need is determined, and how risk and legislative obligations are managed.
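A business rule of the form "X information should be kept for Y years, then disposed of by following Z procedures" might be expressed in a system roughly as below. This is a sketch only: the RetentionRule class, its field names and the choice to count the period from a closure date are all assumptions.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class RetentionRule:
    record_class: str      # the "X" in the rule
    retain_years: int      # the "Y"
    disposal_action: str   # the "Z", e.g. "destroy" or "transfer"

def disposal_due(rule: RetentionRule, closed_on: date, today: date) -> bool:
    """True once the retention period, counted from `closed_on`, has elapsed."""
    target_year = closed_on.year + rule.retain_years
    try:
        due = closed_on.replace(year=target_year)
    except ValueError:
        # 29 February with a non-leap target year: fall back to 1 March.
        due = closed_on.replace(year=target_year, month=3, day=1)
    return today >= due

# Hypothetical rule and dates for illustration.
rule = RetentionRule(record_class="claims", retain_years=7, disposal_action="transfer")
print(disposal_due(rule, closed_on=date(2010, 6, 30), today=date(2017, 6, 30)))
```

Keeping the rule as data rather than burying it in application logic makes it easier to map each record class back to its disposal authority.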