In cloud and virtualized environments, information variety and volume are growing geometrically. As such, organizations are struggling to take advantage of cloud, mobile, and consumer technologies while connecting them to their back-end systems of record. No one person has the time or energy to sort through thousands of files to a) identify records or b) pull metadata for those records. That's where analytics enters the equation. Learn how we used analytics tools and methods to resolve our information migration challenges.
3. Topics
• Background – Problems I’m trying to solve
• Technology – Insight into how the technology works
• Approach – Approach to how I used the tool
• Revelations – Aha moments during the process
• Analysis Results – Effectiveness of the tool
• Recommendations – Lessons to improve results
• Takeaways – Summary of steps to follow when using the tool
• Extensions – Other potential uses of the tool
• Contact Info – How to contact me
6. How the Tool Works
Level 1 Term: Statement of Work (SOW)

Clue                     Clue Type   Score Mod
Statement of Work (SOW)  Standard      0
Doc Name*=*statement*    Metadata     35
Doc Name*=*work*         Metadata     20
Doc Name*=*sow*          Metadata     50
Doc Name*=*closure*      Metadata    -10
Doc Name*=*supplier*     Metadata    -15
Doc Name*=*ssow*         Metadata    -75
Doc Name*=*jsow*         Metadata    -75

• The Level 1 term is the item I'm trying to identify.
• The clues are used to identify the item; each clue hit accumulates its score modifier.
• An accumulated score of 50 gets the item classified.
• Negative scores can be used to suppress look-alike names.
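The clue-scoring mechanism above can be sketched in a few lines of Python. This is a minimal sketch, assuming simple wildcard matching on the document name; the function names, the `fnmatch`-based matching, and the data layout are my own, not the tool's actual API. The clue patterns and the 50-point threshold come from the slide.

```python
import fnmatch

# Clue table for the Level 1 term "Statement of Work (SOW)".
# Each metadata clue is a wildcard pattern matched against the document
# name; every hit adds its score modifier to the running total.
SOW_CLUES = [
    ("*statement*", 35),
    ("*work*", 20),
    ("*sow*", 50),
    ("*closure*", -10),
    ("*supplier*", -15),
    ("*ssow*", -75),
    ("*jsow*", -75),
]

THRESHOLD = 50  # an accumulated score of 50 classifies the item


def score(doc_name, clues=SOW_CLUES):
    """Accumulate the score modifiers of every clue that hits."""
    name = doc_name.lower()  # patterns are lowercase; match case-insensitively
    return sum(mod for pattern, mod in clues if fnmatch.fnmatch(name, pattern))


def classify(doc_name):
    """An item is classified as the term once its score reaches the threshold."""
    return score(doc_name) >= THRESHOLD
```

With these clues, a file named "SOW_Acme.docx" scores exactly 50 and is classified, while "Supplier SOW Closure.doc" scores 50 - 10 - 15 = 25 and is not; this is how the negative modifiers push down look-alike names.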
8. Realizations
• The repository I started on for this application held almost a million documents (855,000).
• My first realization was that the documents were all in a proprietary repository, and the tool could not directly access items in it without a custom connector; I had neither the time nor the budget to develop one.
• To work around this, I had a report generated (in Excel) from the repository that provided three pieces of information: Document Number, Document Title, and Document Family (a set of 88 types). I was curious to see whether I could use the tool to identify Records from this limited information.
• From this report I established a training set of 50,000 line items. I used a large set because of the limited information available. Note: when using entire documents (vs. titles only), a much smaller training set suffices.
• Training took seven passes:
– Set an initial set of “clues” for a set of items. After each run, the results were analyzed to determine how many items were classified and the overall accuracy of the classification.
– The item “non-record” was added after realizing that identifying non-records assists in identifying records.
– The goal of the first four passes was mainly to increase the number of types of Records identified, with some emphasis on accuracy.
– The next three passes focused on the overall accuracy of the clues. Accuracy is more time-consuming to measure because it is a manual process: every item needs to be assessed.
– Another realization was that many of the items were not classifiable: the data set did not contain enough information to render a classification (e.g., sometimes the document number was simply repeated in the document title field).
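The pass-by-pass cycle above (classify with the current clue set, then manually assess results) can be summarized with a small helper. This is an illustrative sketch, not part of the tool; the function name and the data shapes are assumptions.

```python
def evaluate_pass(predictions, truth_sample):
    """Summarize one training pass.

    predictions:  item id -> assigned label, or None if no clue set
                  accumulated enough score to classify the item
    truth_sample: item id -> correct label, for the subset of items
                  that was manually assessed (accuracy is a manual process)
    Returns (coverage, accuracy).
    """
    classified = {i: lbl for i, lbl in predictions.items() if lbl is not None}
    coverage = len(classified) / len(predictions)  # the "percentage classified"
    reviewed = [i for i in truth_sample if i in classified]
    correct = sum(1 for i in reviewed if classified[i] == truth_sample[i])
    accuracy = correct / len(reviewed) if reviewed else None
    return coverage, accuracy
```

After each pass you would adjust the clues, rerun, and compare the returned pair against the previous pass to decide whether to keep tuning for coverage or for accuracy.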
9. Training Results
• Seven runs were made on the training set of 50,505 items.
• The classification percentage was monitored because not all of the items were considered classifiable.
• Targets were achieved after the sixth run, but one additional run was made to slightly enhance accuracy.
CLASSIFICATION PERCENTAGE        1/11/2017  1/13/2017  1/16/2017  1/18/2017  2/15/2017  2/28/2017  3/16/2017
# record work products                  54         57         64         66         72         70         70
# records classified                13,227     17,755     27,490     32,404     34,291     35,086     34,834
# non-records classified                 0      6,091      6,623      8,621      9,732     10,390     10,791
percentage classified                26.2%      35.2%      54.4%      64.2%      67.9%      69.5%      69.0%

CLASSIFICATION ACCURACY          1/11/2017  1/13/2017  1/16/2017  1/18/2017  2/15/2017  2/28/2017  3/16/2017
non-record accuracy                  -----      -----      -----      -----      -----      94.8%      98.3%
overall classification accuracy      -----      -----      -----      -----      -----      91.0%      95.1%
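As a sanity check, the "percentage classified" row reproduces exactly when only the record counts are divided by the 50,505-item training set, which suggests the percentage tracks records only (the non-record counts are monitored separately). The counts and percentages below are copied from the table; the variable names are mine.

```python
TRAINING_SET = 50_505
records_classified = [13_227, 17_755, 27_490, 32_404, 34_291, 35_086, 34_834]

# "percentage classified" = records classified / training set size, per run
pct = [round(100 * r / TRAINING_SET, 1) for r in records_classified]
print(pct)  # [26.2, 35.2, 54.4, 64.2, 67.9, 69.5, 69.0]
```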