1. How to Prepare Unstructured
Data for BI and Data Analytics
George ROTH – CEO Recognos Inc.
Neil MITCHELL – Recognos Inc.
3. Housekeeping
• All attendees are placed on Mute throughout the presentation
• We will make available all the Webinar materials
– The slides will be emailed and the recording posted
• Questions
– Please use the GoToWebinar “chat box” in the control panel to ask any questions
– These will be addressed at the end, as time allows, or written responses provided
• Polling
– To improve these webinars we will ask for your feedback in the form of polling
questions
– They are completely confidential
– Multiple choice
4. AGENDA
A. Structured, Semi-Structured and Unstructured Content
B. What is Data Preparation in Data Science
C. The Swiss Army Knife of Data Extraction
D. Processing Unstructured Non-Classifiable Content and Integrating All Data (SDP – The Smart Data Platform)
E. Onboarding ETI or SDP
F. About Recognos and Next Steps
G. Q&A
7. The Problem – 3 data types
• 80% of the data in the enterprise is unstructured
• Structured: in tables of some sort, object DBs, etc.
• Semi-structured – XML based
• Unstructured
– Known content, classifiable by keywords: contracts, SEC documents, insurance quote documents
– Unknown content with known domain: board meetings
– Unknown content with unknown domain: Panama Files, emails (discovery suites)
8. Data Growth – 42.5% per year – New Data Analytics – N=ALL
11. What is Data Preparation in Data Science
• Most presentations describe it as a tedious task
• There is no system that does it for you
• We do not always know what to prepare for the data science applications
• Example:
– NGO grants – needed to know the start dates, end dates, amounts of money and project names
– Needed to build the graph of the recipients to determine connections between recipients
– Goal: prevent fraud with EU funds, or money laundering
• Need to combine different data types (structured, semi-structured and unstructured) and to provide the result for the next steps
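The recipient graph from the grant example can be sketched in a few lines. This is only an illustration under stated assumptions: the records, the field names (`recipient`, `director`) and the rule "two recipients are connected when they share a director" are all hypothetical and are not taken from the actual project.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical extracted grant records; all names and values are made up.
grants = [
    {"project": "Clean Water", "recipient": "Org A", "director": "J. Pop", "amount": 120_000},
    {"project": "Rural Roads", "recipient": "Org B", "director": "J. Pop", "amount": 80_000},
    {"project": "Youth Training", "recipient": "Org C", "director": "M. Ionescu", "amount": 50_000},
    {"project": "School Repair", "recipient": "Org D", "director": "M. Ionescu", "amount": 30_000},
    {"project": "Library", "recipient": "Org E", "director": "A. Kovacs", "amount": 10_000},
]

def build_recipient_graph(records, link_key):
    """Connect two recipients when they share the same value for link_key."""
    by_value = defaultdict(set)
    for rec in records:
        by_value[rec[link_key]].add(rec["recipient"])
    graph = defaultdict(set)
    for recipients in by_value.values():
        for a, b in combinations(sorted(recipients), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def connected_groups(graph):
    """Return groups of recipients that are linked directly or indirectly."""
    seen, groups = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(graph[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups

graph = build_recipient_graph(grants, "director")
groups = connected_groups(graph)
print(groups)
```

Recipients with no shared attribute never enter the graph, so the groups that come out are the candidates a fraud analyst would look at first.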
12. C. The Swiss Army Knife for
Unstructured Classifiable content
15. Content that is classifiable by Keywords
• Generally legal content
• The keywords can be determined
• Examples:
– Contracts
– SEC Documents
– Different Legal Documents
– Forms (IRS, INS, etc.)
– Hospital Patient Info
– Insurance Info
– Etc.
16. Field Types with their Extraction Methods
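Each field type is listed with its definition, its extraction method, whether it can be set up by business people, its estimated percentage in documents and its expected accuracy.
• Type 1 – Explicit Trainable
– Definition: These fields appear in approximately the same context, consistently across documents of the same type.
– Extraction method: Human-assisted machine learning.
– Can be set up by business people: Yes
– Estimated percentage in docs: 50%
– Expected accuracy: >75%
• Type 2 – Explicit Form Fields
– Definition: These fields are always preceded by the same labels, same contexts, etc. Examples are any IRS form and the 10-K header.
– Extraction method: Predefined templates, which need to be set up. There is no UI for this yet; one is planned. This method was used for the six 10-K fields.
– Can be set up by business people: Yes
– Estimated percentage in docs: 10%
– Expected accuracy: >95%
• Type 3 – Explicit List Fields
– Definition: These fields have the same values in all documents (with small variations), known from the beginning.
– Extraction method: The user can define a library of "lists" and select a list at the document setup phase.
– Can be set up by business people: Yes
– Estimated percentage in docs: 10%
– Expected accuracy: >90%
• Type 4 – Implicit List Fields
– Definition: The expected values are predefined but are not present in the document. They need to be inferred from the text.
– Extraction method: Semantic scripts; needs a semantic infrastructure.
– Can be set up by business people: No
– Estimated percentage in docs: 5%
– Expected accuracy: >90%
• Type 5 – Semantic Fields
– Definition: These fields have values that are not consistent across documents and need semantic analysis.
– Extraction method: Semantic scripts; needs a semantic infrastructure.
– Can be set up by business people: No
– Estimated percentage in docs: 20%
– Expected accuracy: >90%
• Type 6 – Graphical Fields (Presence)
– Definition: We encountered two such fields: Signature Present and Seal Present.
– Extraction method: Artificial vision neural networks are used to detect these. The algorithms exist but need to be integrated.
– Can be set up by business people: Yes
– Estimated percentage in docs: 1%
– Expected accuracy: >95%
• Type 7 – Tables
– Definition: These are tables in a document. There are two table types: Manhattan tables (no lines) and others.
– Extraction method: A special artificial vision method detects the table; regular expressions extract the fields once the table is found.
– Can be set up by business people: Yes
– Estimated percentage in docs: 3%
– Expected accuracy: >95%
• Type 8 – Enhanced
– Definition: These fields are not in the document but can be found in auxiliary data stores, based on what is in the document.
– Extraction method: These fields are actually populated in the post-extraction validation / augmentation process.
– Can be set up by business people: No
– Estimated percentage in docs: 1%
– Expected accuracy: >95%
• Total estimated percentage in docs: 100%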
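Two of the simpler field types can be illustrated with the Python standard library. This is a minimal sketch, not the ETI implementation: the function names, the sample document and the similarity cutoff are assumptions made for the example. A form field is read from the text after a fixed label, and a list field is matched against a predefined list while tolerating small spelling variations.

```python
import difflib
import re

def extract_form_field(text, label):
    """Return the value that follows 'label:' on the same line, or None."""
    match = re.search(rf"^{re.escape(label)}\s*:\s*(.+)$", text, flags=re.MULTILINE)
    return match.group(1).strip() if match else None

def match_list_field(text, allowed_values, cutoff=0.8):
    """Return the first allowed value that occurs in the text, tolerating small variations."""
    words = text.split()
    for value in allowed_values:
        n = len(value.split())
        # Compare each n-word window of the text with the list value.
        windows = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
        if difflib.get_close_matches(value, windows, n=1, cutoff=cutoff):
            return value
    return None

# Made-up sample document with a deliberate misspelling ("Delawar.").
document = """Form Type: 10-K
Company Name: Example Holdings Inc.
The fund is governed by the laws of the State of Delawar."""

form_type = extract_form_field(document, "Form Type")
company = extract_form_field(document, "Company Name")
state = match_list_field(document, ["New York", "Delaware", "California"])
print(form_type, company, state)
```

The list lookup returns the canonical list value rather than the text as written, which is what makes the extracted data consistent across documents.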
18. ETI- Extract Transform Integrate Platform –
Human in the loop Machine Learning
Initial Setup
• Document load
– PDF files, containing text or images
– Popular image file formats
• Document digitization
– OCR
– Tokenization – identification of words, sentences, paragraphs within the document
• Taxonomy definition
– What are the target documents?
– What data do you want to extract?
Machine Learning
• Manual data extraction
• Example-based machine learning
• Manual data corrections if necessary – improves extraction
• Automatic data extraction
• Data publishing
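The human-in-the-loop cycle can be illustrated with a toy learner. This is a deliberately simple stand-in, not the ETI algorithm: the class `ContextExtractor`, its two-word context window and the sample sentences are all assumptions. It remembers the words that precede a value marked by the analyst and reuses that context on new documents; every manual correction becomes another training example.

```python
class ContextExtractor:
    """Learns the words that precede a field value in labelled examples."""

    def __init__(self, window=2):
        self.window = window
        self.contexts = set()  # tuples of lowercased words seen right before a value

    def learn(self, text, value):
        """Record the words immediately before `value` in `text`."""
        words = text.split()
        target = value.split()
        for i in range(self.window, len(words) - len(target) + 1):
            if words[i:i + len(target)] == target:
                self.contexts.add(tuple(w.lower() for w in words[i - self.window:i]))

    def extract(self, text):
        """Return the word following the first known context, or None."""
        words = text.split()
        for i in range(self.window, len(words)):
            if tuple(w.lower() for w in words[i - self.window:i]) in self.contexts:
                return words[i]
        return None

extractor = ContextExtractor()
# Manual extraction: the analyst marks the value by hand in one document.
extractor.learn("The grant start date is 2016-01-15 for the project", "2016-01-15")
# Automatic extraction: the learned context is applied to a new document.
first = extractor.extract("The grant start date is 2017-03-01 and it ends later")
# A document with different wording is missed ...
missed = extractor.extract("The project commences on 2018-05-20 in Cluj")
# ... so the analyst corrects it, and the correction becomes a new example.
extractor.learn("The project commences on 2018-05-20 in Cluj", "2018-05-20")
second = extractor.extract("Work commences on 2019-09-09 as planned")
print(first, missed, second)
```

Each correction widens the set of known contexts, which is the sense in which manual intervention improves later extraction.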
25. Field Types
• Trainable: the field is always present in the document (explicit) and in the same context.
• Not explicit: for example Has an Audit (Y/N) or Has a Signature (Y/N).
26. Derived Fields – not trainable – need to write a script
• Need to read the text and determine a Boolean Value
27. Need to interpret text and assign code – code field
The system cannot be trained for derived fields !!!
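A derived field is computed by a script rather than located in the text. The sketch below shows both kinds mentioned above, a Boolean field and a code field. The rules, the cue phrases and the code table are hypothetical examples; a real script would be written per taxonomy and would need far more careful language handling.

```python
import re

def has_audit(text):
    """Derived Boolean: True when the text mentions an audit and does not negate it."""
    lowered = text.lower()
    if re.search(r"\b(not|un)\s*-?\s*audited\b|\bno audit\b", lowered):
        return False
    return bool(re.search(r"\baudit(ed|or|ors)?\b", lowered))

# Hypothetical code table: cue phrase -> code.
FREQUENCY_CODES = {
    "monthly": "M",
    "quarterly": "Q",
    "semi-annually": "S",
    "annually": "A",
}

def dividend_frequency_code(text):
    """Code field: map free text to a predefined code, or None if no cue is found."""
    lowered = text.lower()
    for cue, code in FREQUENCY_CODES.items():
        # The lookbehind keeps "annually" from matching inside "semi-annually".
        if re.search(rf"(?<![\w-]){re.escape(cue)}\b", lowered):
            return code
    return None

audited = has_audit("The financial statements were audited by an independent auditor.")
unaudited = has_audit("The interim statements are unaudited.")
code = dividend_frequency_code("Dividends are declared and paid quarterly.")
semi = dividend_frequency_code("Distributions are made semi-annually.")
print(audited, unaudited, code, semi)
```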
31. Table Processing
• One of the most difficult tasks
• There are two table types: Manhattan Tables and Lined Tables
• Need to detect where the table is and where its “lines” (vertical and horizontal) are
• Extract the info
• Use filters derived from visual perception research (the so-called Gabor filters)
• The table line detection method was developed for Recognos by Dr. Raul C. Mureşan and Dr. Vasile Vlad Moca, founders of S.C. Neurodynamics S.R.L. Both have active neuroscience research careers, are affiliated with the Romanian Institute for Science and Technology (RIST), and studied at the Max Planck Institute in Germany.
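To give a feel for why Gabor filters suit line detection, here is a generic textbook Gabor kernel in pure Python. It is not the Neurodynamics method, whose details are not given here; the kernel size, wavelength, sigma and the synthetic 15x15 image are assumptions chosen for the example. A kernel tuned to horizontal structure responds strongly on a horizontal ruling line, while the same kernel rotated by 90 degrees barely responds.

```python
import math

def gabor_kernel(size, theta, wavelength=4.0, sigma=2.0, gamma=0.5):
    """Real part of a Gabor filter (odd size), mean removed so flat regions give 0."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            row.append(envelope * math.cos(2 * math.pi * xr / wavelength))
        kernel.append(row)
    mean = sum(map(sum, kernel)) / (size * size)
    return [[v - mean for v in row] for row in kernel]

def response_at(image, kernel, cy, cx):
    """Filter response centred on pixel (cy, cx); pixels outside the image count as 0."""
    half = len(kernel) // 2
    total = 0.0
    for ky, row in enumerate(kernel):
        for kx, weight in enumerate(row):
            y, x = cy + ky - half, cx + kx - half
            if 0 <= y < len(image) and 0 <= x < len(image[0]):
                total += weight * image[y][x]
    return total

size = 15
image = [[0.0] * size for _ in range(size)]
image[7] = [1.0] * size  # one synthetic horizontal ruling line

horizontal = gabor_kernel(7, theta=math.pi / 2)  # wave varies along y: horizontal lines
vertical = gabor_kernel(7, theta=0.0)            # wave varies along x: vertical lines

row_scores = [response_at(image, horizontal, r, 7) for r in range(size)]
detected_row = row_scores.index(max(row_scores))
on_line_horizontal = response_at(image, horizontal, 7, 7)
on_line_vertical = response_at(image, vertical, 7, 7)
print(detected_row, on_line_horizontal, on_line_vertical)
```

Scanning the response row by row and column by column with the two orientations gives candidate positions for the horizontal and vertical table lines.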
32. What is a Perceptron ? (Wikipedia)
• In machine learning, the perceptron is an algorithm
for supervised learning of binary classifiers: functions that can decide
whether an input (represented by a vector of numbers) belongs to one
class or another. It is a type of linear classifier, i.e. a classification
algorithm that makes its predictions based on a linear predictor
function combining a set of weights with the feature vector. The algorithm
allows for online learning, in that it processes elements in the training set
one at a time.
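The definition above maps onto very little code. The following is a minimal sketch of the classic perceptron learning rule, trained online (one sample at a time) on the logical AND function; the learning rate, the epoch limit and the toy data set are choices made for the example.

```python
def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Online perceptron training; samples are (feature_vector, label) with label 0 or 1."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for features, label in samples:
            activation = bias + sum(w * x for w, x in zip(weights, features))
            prediction = 1 if activation > 0 else 0
            update = learning_rate * (label - prediction)
            if update:
                # Misclassified sample: move the weights toward the correct side.
                errors += 1
                weights = [w + update * x for w, x in zip(weights, features)]
                bias += update
        if errors == 0:
            break  # every sample is classified correctly
    return weights, bias

def predict(weights, bias, features):
    """Linear predictor: weighted sum of the features plus bias, thresholded at 0."""
    return 1 if bias + sum(w * x for w, x in zip(weights, features)) > 0 else 0

and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_gate)
predictions = [predict(weights, bias, x) for x, _ in and_gate]
print(weights, bias, predictions)
```

Because AND is linearly separable, the training loop stops after a few passes with all four inputs classified correctly.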
35. How to measure the performance of the extraction process
• Not a simple problem
• Multiple error types
• Language
• OCR quality – language dependent
• OCR engine – open source (Tesseract) or paid (OmniPage)
36. What will be reported
• True Positives
A true positive is a value that was extracted by ETI and confirmed as correct by the data analyst (DA).
• False Positives
A false positive is a value identified by ETI but corrected by the DA.
• True Negatives
A true negative is a value that ETI did not find and for which the DA confirms that the value for that specific field in the taxonomy is not present in the document. The field can either be left empty by the analyst or be filled in manually without a reference in the document.
• False Negatives
A false negative is a value that ETI did not find in the document but that the DA enters, adding a reference in the document.
37. The system EPI – Extraction Performance Indicators
– Precision
The precision of the data extraction will tell us how many of the identified values are correct from the total number of
values extracted.
$$\text{Precision} = \frac{\text{relevant values}}{\text{total number of retrieved values}} = \frac{TP}{TP + FP}$$
The correct values are the TP, while the total values are TP + FP (correct and incorrect).
– Sensitivity
The sensitivity will tell us how many correct values we retrieved from the total values that could have been extracted.
$$\text{Sensitivity} = \frac{\text{relevant values}}{\text{total number of values existing in the document}} = \frac{TP}{TP + FN}$$
The correct values are the TP, while the total values in the document are TP + FN. As defined above, FN are the values that the system reported as missing but the DA found in the document.
– Accuracy
Precision and Sensitivity deal only with the extracted values and do not take into account the values that are really missing and that the system correctly reports as missing. Accuracy is the EPI that tells us how correctly the system identifies ALL values, both existing and missing.
$$\text{Accuracy} = \frac{\text{correctly identified values}}{\text{total number of values}} = \frac{TP + TN}{TP + TN + FP + FN}$$
The correctly extracted values are both TP and TN while the total number is the sum of all four measurements.
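The three indicators follow directly from the four reported counts. A minimal sketch, with a made-up batch of 100 fields; the function name and the choice to return 0.0 when a denominator is zero are assumptions for the example.

```python
def extraction_performance(tp, fp, tn, fn):
    """Return precision, sensitivity and accuracy from the four reported counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "sensitivity": sensitivity, "accuracy": accuracy}

# Hypothetical review of 100 fields: 80 confirmed, 10 corrected,
# 5 correctly reported as missing, 5 missed by the system.
epi = extraction_performance(tp=80, fp=10, tn=5, fn=5)
print(epi)
```

In this example precision is 80/90, sensitivity is 80/85 and accuracy is 85/100.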
39. US Mutual Fund Data–from documents to analytics (www.rdcmf.com)
40. Data Teams
• Need to create data teams
• Data Analysts – responsible for the taxonomies and mapping
• Validation rules
• Manual intervention decreases over time
44. Content is Not Classifiable by keywords – not consistent
• Ontology based classification, extraction
• What is an ontology?
• RDF
• SPARQL
• Used in data integration (“same as” links between identifiers)
• We can query unstructured, semi-structured and structured data with the same query language
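The integration idea can be sketched without a triple store. In the example below every identifier with the `ex:` prefix, every property and every value is made up; only the `owl:sameAs` URI is the standard one. Facts from a database, an XML feed and an extracted document are expressed as triples, the "same as" links map the aliases to one canonical identifier, and a single lookup then returns facts from all three sources. The sketch handles only direct links, not chains of them.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Hypothetical triples from three kinds of sources.
from_database = [("ex:fund/123", "ex:netAssets", "1500000")]        # structured
from_xml_feed = [("ex:feed/ABC", "ex:ticker", "ABCDX")]             # semi-structured
from_document = [("ex:doc9/fund", "ex:managementFee", "0.75%")]     # extracted from text

# "Same as" links: the alias on the left denotes the same fund as the identifier on the right.
links = [
    ("ex:feed/ABC", OWL_SAME_AS, "ex:fund/123"),
    ("ex:doc9/fund", OWL_SAME_AS, "ex:fund/123"),
]

def merge_same_as(triples, same_as_links):
    """Rewrite every subject to its canonical identifier using the direct sameAs links."""
    canonical = {alias: target for alias, _, target in same_as_links}
    return [(canonical.get(s, s), p, o) for s, p, o in triples]

merged = merge_same_as(from_database + from_xml_feed + from_document, links)
facts = {p: o for s, p, o in merged if s == "ex:fund/123"}
print(facts)
```

In a real deployment the same effect would come from loading the triples into an RDF store and querying it with SPARQL.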
45. A few semantic terms….
• RDF
• Ontology - OWL
• Linked Data
• Schema.org - Google
• Data.gov
• Data.gov.uk
47. Building Block RDF (6/30/2016)
“There is a Person identified by http://www.w3.org/People/EM/contact#me, whose name
is Eric Miller, whose email address is em@w3.org, and whose title is Dr.".
Triples:
(i) http://www.w3.org/People/EM/contact#me, http://www.w3.org/2000/10/swap/pim/contact#fullName, "Eric Miller"
(ii) http://www.w3.org/People/EM/contact#me, http://www.w3.org/2000/10/swap/pim/contact#personalTitle, "Dr."
(iii) http://www.w3.org/People/EM/contact#me, http://www.w3.org/1999/02/22-rdf-syntax-ns#type, http://www.w3.org/2000/10/swap/pim/contact#Person
(iv) http://www.w3.org/People/EM/contact#me, http://www.w3.org/2000/10/swap/pim/contact#mailbox, em@w3.org
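The four triples are easy to hold and query as plain data. The sketch below stores them as Python tuples and matches them against a pattern in which `None` acts as a wildcard, which is the basic idea behind a SPARQL triple pattern. The helper `match` is a hypothetical name for this example, not part of any RDF library.

```python
CONTACT = "http://www.w3.org/2000/10/swap/pim/contact#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
ME = "http://www.w3.org/People/EM/contact#me"

# The four statements as (subject, predicate, object) tuples.
triples = [
    (ME, CONTACT + "fullName", "Eric Miller"),
    (ME, CONTACT + "personalTitle", "Dr."),
    (ME, RDF_TYPE, CONTACT + "Person"),
    (ME, CONTACT + "mailbox", "em@w3.org"),
]

def match(pattern, store):
    """Return the triples matching a (subject, predicate, object) pattern; None is a wildcard."""
    return [
        triple for triple in store
        if all(want is None or want == got for want, got in zip(pattern, triple))
    ]

# "What is the full name of anything?" and "Which resources are Persons?"
names = [obj for _, _, obj in match((None, CONTACT + "fullName", None), triples)]
people = [subj for subj, _, _ in match((None, RDF_TYPE, CONTACT + "Person"), triples)]
print(names, people)
```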
63. Onboarding ETI or SDP
• Need to designate a “data shepherd”
• The data sources need to be analyzed by a business expert (who knows what data is where) – bad practice example
• Metadata governance is very important (taxonomies, ontologies)
• Develop the ontology gradually, not all at once
• Needs a champion in the enterprise; the beginning is hard
• Work hand in hand with the data analytics people
• Start small and measure the ROI
• Will have to find the “we don’t know what we don’t know” facts
65. What does Recognos have
• ETI – Human-in-the-loop machine learning extraction platform
• Deployment
– The data – subscription
– Licensing – on premises – onboarding – training – support
– On the cloud – delivery in Q2
• Smart Data Platform – depends on each environment – analysis is needed – onboarding requires consulting
66. About Recognos
• Recognos Inc. – California-based company – established in 1999
• Has a partner company in New York – Recognos Financial
• Recognos has a development company in Cluj, Romania – 80 developers – established in 2000
• Since 2008 – involved in semantics
• Main customers – Fisher Investments, DTCC (NY), Clarient (NY), DST, Bank of Transylvania, OSF Budapest
• About 50% of the revenue comes from licensing and recurring data contracts
68. Next Steps
• Proof of Concept (PoC)
– We will sign an NDA as needed
– We will import your documents
– We will show you the power and ease of use of the Recognos solution
• Pilot project
– We will work with you on an ROI-centric project