How to Prepare Unstructured
Data for BI and Data Analytics
George ROTH – CEO Recognos Inc.
Neil MITCHELL – Recognos Inc.
Webinar Starting Soon – Everybody is Placed on Mute
Housekeeping
• All attendees are placed on Mute throughout the presentation
• We will make available all the Webinar materials
– The slides will be emailed and the recording posted
• Questions
– Please use the GoToWebinar “chat box” in the control panel to ask any questions
– These will be addressed at the end, as time allows, or written responses provided
• Polling
– To improve these webinars we will ask for your feedback in the form of polling
questions
– They are completely confidential
– Multiple choice
3
AGENDA
A. Structured, Semi-Structured and Un-Structured Content
B. What is Data Preparation in Data Science
C. The Swiss Army Knife of Data Extraction
D. Processing Unstructured Non-Classifiable Content and
Integrating All Data (SDP - The Smart Data Platform)
E. Onboarding ETI or SDP
F. About Recognos and Next Steps
G. Q&A
4
A. Structured, Semi-Structured and
Un-Structured Content
5
Data Assets Classification
6
The Problem – 3 data types
• 80% of the data in the enterprise is unstructured
• Structured: in tables of some kind, object databases, etc.
• Semi-structured: XML-based
• Unstructured
– Known content, classifiable by keywords: contracts, SEC documents,
insurance quote documents
– Unknown content with a known domain: board meetings
– Unknown content with an unknown domain: Panama Files, emails (discovery
suites)
7
Data Growth – 42.5% per year – New Data Analytics – N=ALL
8
B. What is Data Preparation in Data
Science
9
Data Preparation (Gartner)
10
What is Data Preparation in Data Science
• Most presentations describe it as a tedious task
• There is no system that does it for you
• We do not always know what to prepare for the Data Science applications
• Example:
– NGO grants: needed the start dates, end dates, amounts of money and
project names
– Needed to build the graph of the recipients to determine connections between
recipients
– Goal: prevent fraud with EU funds, or money laundering
• Need to combine different data types (structured, semi-structured and
unstructured) and hand the result over to the next steps
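The recipient graph from the NGO example could be sketched roughly as follows. This is only an illustration: the record layout, the field names and the idea of linking recipients through a shared `director` value are assumptions, not something stated in the webinar.

```python
from collections import defaultdict, deque

# Hypothetical extracted grant records; field names are illustrative only.
grants = [
    {"project": "P1", "recipient": "Org A", "director": "J. Doe", "amount": 100_000},
    {"project": "P2", "recipient": "Org B", "director": "J. Doe", "amount": 250_000},
    {"project": "P3", "recipient": "Org C", "director": "M. Roe", "amount": 80_000},
]

def connected_recipients(records, link_field="director"):
    """Group recipients linked, directly or indirectly, by a shared field value."""
    by_value = defaultdict(set)
    for rec in records:
        by_value[rec[link_field]].add(rec["recipient"])

    # Undirected graph: recipients sharing a value are neighbours.
    graph = defaultdict(set)
    for members in by_value.values():
        for member in members:
            graph[member] |= members - {member}

    # Breadth-first search to collect connected components.
    seen, groups = set(), []
    for start in graph:
        if start in seen:
            continue
        queue, group = deque([start]), set()
        while queue:
            node = queue.popleft()
            if node in seen:
                continue
            seen.add(node)
            group.add(node)
            queue.extend(graph[node] - seen)
        groups.append(group)
    return groups

groups = connected_recipients(grants)
```

Groups with more than one member are the candidates an analyst would look at first when checking for hidden connections.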
11
C. The Swiss Army Knife for
Unstructured Classifiable content
12
The Swiss Army Knife
13
Classifiable Unstructured Content
14
Content that is classifiable by Keywords
• Generally legal content
• The keywords can be determined
• Examples:
– Contracts
– SEC Documents
– Different Legal Documents
– Forms (IRS, INS, etc.)
– Hospital Patient Info
– Insurance Info
– Etc.
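A minimal sketch of keyword-based classification, assuming a small hand-made keyword list per document class. The class names, keywords and threshold are hypothetical; in practice they would come from the taxonomy defined for the target documents.

```python
import re
from collections import Counter

# Hypothetical keyword lists per document class.
KEYWORDS = {
    "contract": {"agreement", "party", "parties", "termination", "hereinafter"},
    "sec_filing": {"10-k", "registrant", "commission", "fiscal", "securities"},
    "insurance": {"policy", "premium", "insured", "coverage", "deductible"},
}

def classify(text, min_hits=2):
    """Return the class whose keywords occur most often, or None if too few hits."""
    tokens = Counter(re.findall(r"[a-z0-9][a-z0-9\-]*", text.lower()))
    scores = {
        label: sum(tokens[word] for word in words)
        for label, words in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_hits else None
```

A document that matches no class well enough returns `None`, which corresponds to the non-classifiable content handled later by the ontology-based approach.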
15
Field Types with their Extraction Methods

| Type | Field Type | Definition | Extraction Method | Can be set up by business people? | Estimated percentage in docs | Expected accuracy |
|---|---|---|---|---|---|---|
| 1 | Explicit Trainable | These fields appear in approximately the same context, consistent across documents of the same type. | Human-assisted machine learning. | Yes | 50% | >75% |
| 2 | Explicit Form Fields | These fields are always preceded by the same labels, same contexts, etc. Examples are any IRS form and the 10K header. | Predefined templates that need to be set up. We are planning to create the UI for this; we don't have one. This was the method used for the 6 fields of the 10Ks. | Yes | 10% | >95% |
| 3 | Explicit List Fields | These fields have the same values in all documents (with small variations), and the values are known from the beginning. | The user can define a library of "lists" and can select a list at the document setup phase. | Yes | 10% | >90% |
| 4 | Implicit List Fields | The expected values are predefined but are not present in the document. They need to be inferred from the text. | Semantic scripts; needs a semantic infrastructure. | No | 5% | >90% |
| 5 | Semantic Fields | These fields have values that are not consistent across documents and need semantic analysis. | Semantic scripts; needs a semantic infrastructure. | No | 20% | >90% |
| 6 | Graphical Fields Presence | We encountered two fields: Signature Present and Seal Present. | Artificial vision neural networks are used to detect those. The algorithms exist but need to be integrated. | Yes | 1% | >95% |
| 7 | Tables | These are tables in a document. There are two table types: Manhattan tables (no lines) and others. | Special artificial vision method to detect the table, then regular expressions to extract the fields once the table is found. | Yes | 3% | >95% |
| 8 | Enhanced | These fields are not in the document but can be found in auxiliary data stores based on what is in the document. | These fields are actually populated in the post-extraction validation / augmentation process. | No | 1% | >95% |
| | Total | | | | 100% | |
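To make the two simplest field types above concrete, here is a rough sketch of explicit form fields (a value that always follows the same label) and explicit list fields (a value taken from a known list). The labels, patterns and currency list are hypothetical and are not the templates used in ETI.

```python
import re

# Explicit form fields: the value always follows the same label.
# Labels and patterns are hypothetical examples.
FORM_PATTERNS = {
    "fiscal_year_end": re.compile(r"Fiscal Year End:\s*(\d{4})"),
    "state_of_incorporation": re.compile(r"State of Incorporation:\s*([A-Z]{2})"),
}

# Explicit list fields: the value is one of a list known in advance.
CURRENCIES = ["USD", "EUR", "GBP"]

def extract_form_fields(text):
    """Apply each label pattern; a missing field is reported as None."""
    out = {}
    for name, pattern in FORM_PATTERNS.items():
        m = pattern.search(text)
        out[name] = m.group(1) if m else None
    return out

def extract_list_field(text, allowed):
    """Return the first allowed value that appears as a whole word, else None."""
    for value in allowed:
        if re.search(rf"\b{re.escape(value)}\b", text):
            return value
    return None
```

Both functions return `None` for a missing value, which keeps "not found" distinguishable from an empty string during later validation.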
16
Swiss Army Knife for Data Extraction
17
ETI – Extract Transform Integrate Platform –
Human in the Loop Machine Learning
Document load
• PDF files, containing text or images
• Popular image file formats
Document digitization
• OCR
• Tokenization – identification of words, sentences, paragraphs within the document
Taxonomy definition
• What are the target documents?
• What data do you want to extract?
Manual data extraction
Example based machine learning
Manual data corrections if necessary – improves extraction
Automatic data extraction
Data publishing
Initial Setup
Machine Learning
18
Demo for 10K
http://playground.datafactory.recognos.ro/DevUI/#/demo
19
Examples: A Certificate of Incorporation – Insurance Contract
20
Need to define the taxonomy – list of fields
21
Data type classification
22
Key Words
23
Type of fields
24
Field Types
• Trainable: the field is always in the document (explicit), in the same
context.
• Not explicit: for example, Has an Audit (Y/N), Has a Signature (Y/N)
25
Derived Fields – not trainable – need to write a script
• Need to read the text and determine a Boolean Value
26
Need to interpret text and assign code – code field
27
The system cannot be trained for derived fields !!!
A semantic script for derived fields
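A derived Boolean field such as "Has an Audit" has to be computed from the text rather than located in it. The sketch below is a deliberately simple rule-based stand-in, not the semantic script language used by Recognos; the patterns and the negation handling are assumptions for illustration.

```python
import re

# Hypothetical cues; a real semantic script would rely on richer linguistic analysis.
AUDIT_MENTION = re.compile(r"\baudit(?:ed|or|ors|s)?\b", re.IGNORECASE)
NEGATION = re.compile(r"\b(?:no|not|without|unaudited)\b", re.IGNORECASE)

def has_audit(text):
    """Return True if any sentence mentions an audit without a negation cue."""
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if AUDIT_MENTION.search(sentence) and not NEGATION.search(sentence):
            return True
    return False
```

Rules like these are written by hand for each derived field, which is why such fields cannot be set up by training on examples alone.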
28
Table Extraction – VERY DIFFICULT
29
Table Extraction
30
Table Processing
• One of the most difficult tasks
• There are two table types: Manhattan Tables and Lined Tables
• Need to detect where the table is and where its “lines” (vertical and horizontal) are
• Extract the info
• Use filters derived from visual perception research (the so-called Gabor
filters)
• The table line detection method was developed for Recognos by Dr. Raul C.
Mureşan and Dr. Vasile Vlad Moca, founders of S.C. Neurodynamics S.R.L.
Both Dr. Mureşan and Dr. Moca have active neuroscience research careers,
are affiliated with the Romanian Institute for Science and Technology (RIST),
and studied at the Max Planck Institute in Germany.
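As a rough illustration of why Gabor filters suit line detection, the sketch below builds the real part of a Gabor kernel in plain Python and compares the response of two orientations on a tiny synthetic image containing one horizontal line. It is not the Neurodynamics method; the kernel size and the parameters (`wavelength`, `sigma`, `gamma`) are arbitrary assumptions.

```python
import math

def gabor_kernel(size, theta, wavelength=4.0, sigma=2.0, gamma=0.5):
    """Real part of a Gabor kernel as a size x size list of lists (size must be odd)."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates by theta; the sinusoid varies along xr.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            row.append(envelope * math.cos(2 * math.pi * xr / wavelength))
        kernel.append(row)
    return kernel

def response_at(image, kernel, cy, cx):
    """Filter response at pixel (cy, cx); the kernel must fit inside the image there."""
    half = len(kernel) // 2
    total = 0.0
    for ky, row in enumerate(kernel):
        for kx, weight in enumerate(row):
            total += weight * image[cy + ky - half][cx + kx - half]
    return total

# 9 x 9 synthetic image with a single horizontal line in row 4.
image = [[1.0 if r == 4 else 0.0 for _ in range(9)] for r in range(9)]

horizontal = response_at(image, gabor_kernel(7, math.pi / 2), 4, 4)  # stripes run horizontally
vertical = response_at(image, gabor_kernel(7, 0.0), 4, 4)            # stripes run vertically
```

The kernel whose stripes run parallel to the line responds much more strongly than the perpendicular one, which is the property a line detector can exploit for both lined and Manhattan tables.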
31
What is a Perceptron ? (Wikipedia)
• In machine learning, the perceptron is an algorithm
for supervised learning of binary classifiers: functions that can decide
whether an input (represented by a vector of numbers) belongs to one
class or another. It is a type of linear classifier, i.e. a classification
algorithm that makes its predictions based on a linear predictor
function combining a set of weights with the feature vector. The algorithm
allows for online learning, in that it processes elements in the training set
one at a time.
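The definition above can be made concrete with a minimal perceptron trained on the logical AND function. This is a textbook sketch of the online learning rule, not code from ETI; the learning rate and number of epochs are arbitrary.

```python
def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Online perceptron rule: weights are updated one training sample at a time."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(samples, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            predicted = 1 if activation > 0 else 0
            error = target - predicted
            if error:
                # Move the decision boundary towards the misclassified sample.
                weights = [w + lr * error * xi for w, xi in zip(weights, x)]
                bias += lr * error
    return weights, bias

def predict(weights, bias, x):
    """Linear predictor: weighted sum of the features plus bias, thresholded at 0."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0

# Logical AND is linearly separable, so the perceptron can learn it.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 0, 1]
w, b = train_perceptron(X, y)
```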
32
33
Samples of the tables processing
34
How to measure the performance of the extraction process
• Not a simple problem
• Multiple error types
• Language
• OCR quality – language dependent
• OCR engine – open source (Tesseract) or paid (OmniPage)
35
What will be reported
• True Positives
A true positive is a value that was extracted by ETI and was confirmed as correct
by the data analyst (DA).
• False Positives
False positives are values identified by ETI but corrected by the DA.
• True Negatives
True negatives are values that were not found by ETI and the DA confirms that the
value for that specific field in the taxonomy is not present in the document. The
field can either be left empty by the analyst or be filled in manually without a
reference in the document.
• False Negatives
False negatives are values that ETI did not find in the document but the DA inputs
the values and adds a reference in the document.
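The four outcomes can be expressed as a small decision function. This is a simplified sketch: it assumes each field has at most one value and that `None` means "no value referenced in the document"; the function name and signature are illustrative.

```python
def classify_outcome(extracted, validated):
    """Label one field from ETI's value and the DA's validated value (None = absent)."""
    if extracted is not None:
        # ETI produced a value: correct if the DA confirmed it, otherwise corrected.
        return "TP" if extracted == validated else "FP"
    # ETI produced nothing: fine if nothing exists, a miss if the DA found a value.
    return "TN" if validated is None else "FN"
```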
The system EPI – Extraction Performance Indicators
– Precision
The precision of the data extraction will tell us how many of the identified values are correct from the total number of
values extracted.
$$\text{Precision} = \frac{\text{relevant values}}{\text{total number of retrieved values}} = \frac{TP}{TP + FP}$$
The correct values are the TP, while the total values are TP + FP (correct and incorrect).
– Sensitivity
The sensitivity will tell us how many correct values we retrieved from the total values that could have been extracted.
$$\text{Sensitivity} = \frac{\text{relevant values}}{\text{total number of values existing in the document}} = \frac{TP}{TP + FN}$$
The correct values are the TP, while the total values in the document are TP + FN. As defined above, FN are the values
that the system reported as missing but the DA found in the document.
– Accuracy
Precision and Sensitivity deal only with the extracted values and do not take into account the values that are really
missing and that the system correctly reports as missing. Accuracy is the EPI that tells us how correctly the system
identifies ALL values, both existing and missing.
$$\text{Accuracy} = \frac{\text{correctly identified values}}{\text{total number of values}} = \frac{TP + TN}{TP + TN + FP + FN}$$
The correctly extracted values are both TP and TN while the total number is the sum of all four measurements.
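The three indicators follow directly from the four counts. A minimal sketch is shown below; the function name and the zero-denominator convention (returning 0.0) are assumptions, and the counts in the usage line are made up for illustration.

```python
def epi(tp, fp, tn, fn):
    """Precision, sensitivity and accuracy from the TP, FP, TN and FN counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "sensitivity": sensitivity, "accuracy": accuracy}

# Hypothetical counts for 100 validated fields.
scores = epi(tp=80, fp=10, tn=5, fn=5)
```

With these counts, precision is 80/90, sensitivity is 80/85 and accuracy is 85/100.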
Compliance Applications
• Provenance
• Always keep link between the data points and the source
• Can be deployed on the cloud
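Keeping the link between a data point and its source can be as simple as storing the location next to the value. The record layout below is a hypothetical sketch; the field names are not taken from ETI.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataPoint:
    """An extracted value together with where it came from (hypothetical layout)."""
    field_name: str
    value: str
    source_document: str  # identifier of the source file
    page: int
    char_start: int
    char_end: int

    def citation(self):
        """Human-readable pointer back to the source location."""
        return f"{self.source_document}, p. {self.page}, chars {self.char_start}-{self.char_end}"

point = DataPoint("fund_name", "Example Fund", "prospectus.pdf", 3, 120, 132)
```

Making the record immutable (`frozen=True`) means a published value cannot be silently detached from its source reference.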
38
US Mutual Fund Data – from documents to analytics (www.rdcmf.com)
39
Data Teams
40
• Need to create data teams
• Data Analysts – responsible for the taxonomies and the mapping
• Validation rules
• Manual intervention decreases over time
Poll
• Neil Poll
41
D. Processing of Unstructured Non-Classifiable Content
(SDP – Smart Data Platform)
42
Non-Classifiable Content
43
Content is Not Classifiable by keywords – not consistent
• Ontology based classification, extraction
• What is an ontology ?
• RDF
• SPARQL
• Used in Data Integration (Same As)
• We can query Unstructured, Semi Structured and Structured with the
same query language
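The "Same As" idea can be sketched with plain tuples: when two identifiers from different sources are declared to be the same entity, their facts can be merged under one identifier. This is an illustrative union-find sketch, not the SDP implementation; the `db:`, `doc:` and `ex:` identifiers are invented, and only the `owl:sameAs` URI is a real W3C term.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

def canonicalize(triples):
    """Rewrite subjects and objects so sameAs-linked identifiers share one representative."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for s, p, o in triples:
        if p == OWL_SAME_AS:
            parent[find(o)] = find(s)

    return {(find(s), p, find(o)) for s, p, o in triples if p != OWL_SAME_AS}

# Hypothetical facts: one from a structured database, one extracted from a document.
facts = [
    ("db:customer/42", "ex:name", "Acme Corp"),
    ("doc:contract-7#party1", "ex:signedOn", "2016-06-27"),
    ("db:customer/42", OWL_SAME_AS, "doc:contract-7#party1"),
]
merged = canonicalize(facts)
```

After merging, both facts hang off the same identifier, so a single query can reach data that originated in structured and unstructured sources.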
44
A few semantic terms….
• RDF
• Ontology - OWL
• Linked Data
• Schema.org - Google
• Data.gov
• Data.gov.uk
45
RDF
46
6/30/2016 47
Building Block RDF
“There is a Person identified by http://www.w3.org/People/EM/contact#me, whose name
is Eric Miller, whose email address is em@w3.org, and whose title is Dr.".
Triplets:
(i) http://www.w3.org/People/EM/contact#me, http://www.w3.org/2000/10/swap/pim/contact#fullName, "Eric Miller"
(ii) http://www.w3.org/People/EM/contact#me, http://www.w3.org/2000/10/swap/pim/contact#personalTitle, "Dr."
(iii) http://www.w3.org/People/EM/contact#me, http://www.w3.org/1999/02/22-rdf-syntax-ns#type, http://www.w3.org/2000/10/swap/pim/contact#Person
(iv) http://www.w3.org/People/EM/contact#me, http://www.w3.org/2000/10/swap/pim/contact#mailbox, em@w3.org
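The four triplets above can be held as (subject, predicate, object) tuples and queried by pattern, which is the basic idea behind a triple store. This is a plain-Python sketch; a real deployment would use a triple store queried with SPARQL, and the `match` helper is a hypothetical name.

```python
CONTACT = "http://www.w3.org/2000/10/swap/pim/contact#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
ME = "http://www.w3.org/People/EM/contact#me"

triples = [
    (ME, CONTACT + "fullName", "Eric Miller"),
    (ME, CONTACT + "personalTitle", "Dr."),
    (ME, RDF_TYPE, CONTACT + "Person"),
    (ME, CONTACT + "mailbox", "em@w3.org"),
]

def match(store, s=None, p=None, o=None):
    """Return triples matching a pattern; None is a wildcard, like a SPARQL variable."""
    return [
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]

# "What is the full name of this resource?"
names = [o for _, _, o in match(triples, s=ME, p=CONTACT + "fullName")]
```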
48
Ontologies – OWL (The Panama Files)
From: https://www.linkedin.com/pulse/linked-leaks-powerful-hybrid-semantic-queries-panama-papers-kiryakov?trk=hp-feed-article-title-like
Ontology - http://protege.stanford.edu/
50
Online Course
51
Data.gov
52
Linked Data – www.linkeddata.org
53
www.schema.org – alternative to ontologies
54
Ontology Sample (OWL) – A Box – T Box
55
SPARQL – The Semantic Query Language (22 Million RDF triplets)
56
Sample analytics: occupation, countries mostly mentioned in Panama Files
57
Smart Data Platform – unifies all the data
58
The Smart Data Extraction and Integration Platform
59
Query Samples from MarkLogic (SPARQL – XQuery)
60
Document Adviser
61
E. Onboarding ETI or the SDP
62
Onboarding ETI or SDP
• Need to designate a “data shepherd”
• The data sources need to be analyzed by a business expert who knows what
data is where – bad practice example
• Metadata governance is very important (taxonomies, ontologies)
• Develop the ontology gradually, not all at once
• Needs a champion in the enterprise; the beginning is hard
• Work hand in hand with the Data Analytics people
• Start small and measure the ROI
• Will have to find the “we don’t know what we don’t know” facts
63
F. About Recognos and Next Steps
64
What does Recognos have?
• ETI – Human in the Loop Machine Learning Extraction Platform
• Deployment
– The Data – Subscription
– Licensing – on premises – onboarding – training – support
– On the Cloud – delivery in Q2
• Smart Data Platform – depends on each environment – analysis is
needed – onboarding requires consulting
65
About Recognos
• Recognos Inc. – California-based company – established in 1999
• Has a partner company in New York – Recognos Financial
• Recognos has a development company in Cluj, Romania – 80 developers
– established in 2000
• Since 2008 – involved in semantics
• Main customers – Fisher Investments, DTCC - NY, Clarient - NY, DST,
Bank of Transylvania, OSF Budapest
• About 50% of the revenue comes from licensing and recurring data contracts
66
In the press
• http://www.mondovisione.com/media-and-resources/news/recognos-eti-creates-smarter-data-new-platform-extracts-transforms-and-integr/
• http://www.dataversity.net/data-extraction-system-unstructured-documents/
• http://www.information-management.com/news/big-data-analytics/recognos-financial-announces-release-of-ai-based-recognos-eti-10028249-1.html?utm_medium=email&ET=informationmgmt:e6092429:2042611a:&utm_source=newsletter&utm_campaign=daily-feb%2012%202016&st=email
• http://www.informationweek.com/big-data/big-data-analytics/7-ways-semantic-technologies-make-data-make-sense/d/d-id/1323580?image_number=8
• http://raconteur.net/technology/top-5-sectors-using-artificial-intelligence
• http://www.fiercefinanceit.com/story/brain-over-brawn-semantic-technology-and-machine-learning-take-new-role-man/2015-12-03
• http://www.dataversity.net/semantic-technology-a-new-approach-to-financial-data/
• http://www.recognos.ro/news-and-events/trends-in-ai-technology/#more-1211
• http://www.paymentssource.com/news/paythink/artificial-intelligence-can-nab-money-launderers-3023456-1.html
• http://tabbforum.com/videos/artificial-intelligence-in-financial-services-2016-trends
67
Next Steps
• Proof of Concept (PoC)
– We will sign an NDA as needed
– We will import your documents
– We will show you the power and ease of use of the Recognos solution
• Pilot project
– We will work with you on an ROI centric project
68
Contact
Neil Mitchell
nmitchell@recognos.com
408-838-9381
George Roth
groth@recognos.com
69

Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Mater
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Hr365.us smith
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commercemanigoyal112
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...confluent
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 

Último (20)

Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdfExploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
Exploring Selenium_Appium Frameworks for Seamless Integration with HeadSpin.pdf
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commerce
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Advantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your BusinessAdvantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your Business
 

Unstructured data processing webinar 06272016

  • 1. How to Prepare Unstructured Data for BI and Data Analytics George ROTH – CEO Recognos Inc. Neil MITCHELL – Recognos Inc. Webinar Starting Soon – Everybody is Placed on Mute
  • 2. How to Prepare Unstructured Data for BI and Data Analytics George ROTH – CEO Recognos Inc. Neil MITCHELL – Recognos Inc.
  • 3. Housekeeping • All attendees are placed on Mute throughout the presentation • We will make available all the Webinar materials – The slides will be emailed and the recording posted • Questions – Please use the GoToWebinar “chat box” in the control panel to ask any questions – These will be addressed at the end, as time allows, or written responses provided • Polling – To improve these webinars we will ask for your feedback in the form of polling questions – They are completely confidential – Multiple choice 3
  • 4. AGENDA A. Structured, Semi-Structured and Un-Structured Content B. What is Data Preparation in Data Science C. The Swiss Army Knife of Data Extraction D. Processing Unstructured Non-Classifiable Content and Integrating All Data (SDP - The Smart Data Platform) E. Onboarding ETI or SDP F. About Recognos and Next Steps G. Q&A
  • 5. A. Structured, Semi-Structured and Un-Structured Content 5
  • 7. The Problem – 3 data types • 80% of the data in the enterprise is unstructured • Structured: in tables of a certain sort, object DBs, etc. • Semi-Structured: XML based • Unstructured – Known content, classifiable by keywords: Contracts, SEC Documents, Insurance Quote Documents – Unknown content with known domain: Board Meetings – Unknown content with unknown domain: Panama Files, emails (discovery suites)
  • 8. Data Growth – 42.5% per year – New Data Analytics – N=ALL 8
  • 9. B. What is Data Preparation in Data Science 9
  • 11. What is Data Preparation in Data Science • Most presentations describe data preparation as a tedious task • There is no system that will do it • We do not always know what to prepare for the Data Science applications • Example: – An NGO grant: needed to know the start dates, end dates, amount of money, name of project – Needed to build the graph of the recipients to determine connections between recipients – Goal: prevent fraud with EU funds, or money laundering • Need to combine different data types (structured, semi-structured and unstructured) and to provide the result for the next steps
  • 12. C. The Swiss Army Knife for Unstructured Classifiable content 12
  • 13. The Swiss Army Knife 13
  • 15. Content that is classifiable by Keywords • In general legal content • Can determine the keywords • Examples: – Contracts – SEC Documents – Different Legal Documents – Forms (IRS, INS, etc.) – Hospital Patient Info – Insurance Info – Etc. 15
  • 16. Field Types with their Extraction Methods
      Type | Field Type | Definition | Extraction Method | Can be set up by business people? | Estimated percentage in docs | Expected accuracy
      1 | Explicit Trainable | These fields appear in approximately the same context, consistent across documents of the same type. | Human Assisted Machine Learning. | Y | 50% | >75%
      2 | Explicit Form Fields | These fields are always preceded by the same labels, same contexts, etc. Examples are any IRS form, the 10K Header. | Predefined templates. Need to be set up. We are planning to create the UI for this, we don't have one. This was the method that was used for the 10Ks 6 fields. | Y | 10% | >95%
      3 | Explicit List Fields | These fields have the same values in all documents (with small variations) that are known from the beginning. | The user can define a library of "lists", and can select a list at the document setup phase. | Y | 10% | >90%
      4 | Implicit List Fields | The expected values are predefined but are not present in the document. Need to be inferred from the text. | Semantic Scripts, needs a Semantic Infrastructure. | NO | 5% | >90%
      5 | Semantic Fields | These fields have values that are not consistent across documents and need semantic analysis. | Semantic Scripts, needs a Semantic Infrastructure. | NO | 20% | >90%
      6 | Graphical Fields Presence | We encountered two fields: Signature Present, Seal Present. | Artificial Vision Neural Networks are used to detect those. The algorithms exist, need to be integrated. | YES | 1% | >95%
      7 | Tables | These are tables in a document. There are two table types, Manhattan Tables (no lines) and others. | Special Artificial Vision method to detect the table, regular expression to extract the fields after the table is found. | YES | 3% | >95%
      8 | Enhanced | These fields are not in the document but can be found in some auxiliary data stores based on what is in the document. | These fields are actually populated in the post extraction validation / augmentation process. | NO | 1% | >95%
      Total | | | | | 100% |
  • 17. Swiss Army Knife for Data Extraction 17
  • 18. ETI – Extract Transform Integrate Platform – Human in the loop Machine Learning
      – Document load: PDF files, containing text or images; popular image file formats
      – Document digitization: OCR; tokenization (identification of words, sentences, paragraphs within the document)
      – Taxonomy definition: What are the target documents? What data do you want to extract?
      – Manual data extraction
      – Example based machine learning
      – Manual data corrections if necessary (improves extraction)
      – Automatic data extraction
      – Data publishing
      Phases labelled on the slide: Initial Setup, Machine Learning
  • 20. Examples: A Certificate of Incorporation – Insurance Contract 20
  • 21. Need to define the taxonomy – list of fields 21
  • 25. Field Types • Trainable: the field is always in the document (explicit), in the same context. • Not Explicit – for example Has an Audit (Y/N) – Has a Signature (Y/N)
  • 26. Derived Fields – not trainable – need to write a script • Need to read the text and determine a Boolean Value 26
  • 27. Need to interpret text and assign a code – code field. The system cannot be trained for derived fields!
  • 28. A semantic script for derived fields 28
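The script shown on the slide is not preserved in this transcript. As a rough illustration only, a derived Boolean field such as "Has an Audit (Y/N)" could be computed by a small rule-based function. The function name `has_audit`, the two regular expressions, and the negation rule are assumptions made for this sketch; they are not the Recognos scripting language.

```python
import re

# Hypothetical patterns for this sketch: a mention of an audit, and a
# negation word that appears earlier in the same sentence.
AUDIT = re.compile(r"\baudit(?:ed|or|ors|s)?\b", re.IGNORECASE)
NEGATED = re.compile(r"\b(?:no|not|without|never)\b[^.]*\baudit", re.IGNORECASE)


def has_audit(text):
    """Return True if any sentence mentions an audit without negating it."""
    # Split on sentence-ending punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if AUDIT.search(sentence) and not NEGATED.search(sentence):
            return True
    return False


positive = "The financial statements were audited by an independent auditor."
negative = "The company is not subject to an annual audit."
unrelated = "This agreement covers delivery terms only."

print(has_audit(positive), has_audit(negative), has_audit(unrelated))
```

A rule like this has to be written and maintained by hand, which is the point of the slide: the value is inferred from the text rather than copied from it, so it cannot be learned from highlighted examples.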
  • 29. Table Extraction – VERY DIFFICULT 29
  • 31. Table Processing • One of the most difficult tasks • There are two table types: Manhattan Tables and Lined Tables • Need to detect where the table is and where its “lines” (vertical and horizontal) are • Extract the info • Use filters derived from visual perception research (the so-called Gabor filters) • The table line detection method was developed for Recognos by Dr. Raul C. Mureşan and Dr. Vasile Vlad Moca, founders of S.C. Neurodynamics S.R.L. Both Dr. Mureşan and Dr. Moca have active neuroscience research careers, are affiliated with the Romanian Institute for Science and Technology (RIST), and studied at the Max Planck Institute in Germany.
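To give a feel for why Gabor filters suit line detection, here is a minimal sketch of the standard real-valued Gabor kernel applied to two tiny synthetic patches. It is a textbook formulation, not the Neurodynamics method, and the kernel size, wavelength, and sigma values are arbitrary choices for the illustration.

```python
import math


def gabor_kernel(size, wavelength, theta, sigma, gamma=0.5, psi=0.0):
    """Build a size x size real Gabor kernel (size should be odd)."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate coordinates so the cosine wave runs along direction theta.
            x_rot = x * math.cos(theta) + y * math.sin(theta)
            y_rot = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(x_rot ** 2 + (gamma * y_rot) ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * x_rot / wavelength + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


def filter_response(patch, kernel):
    """Correlate a patch with a kernel of the same shape (response at the centre)."""
    return sum(k * p for k_row, p_row in zip(kernel, patch) for k, p in zip(k_row, p_row))


SIZE = 9
# Synthetic 9x9 patches: one with a vertical stroke, one with a horizontal stroke.
vertical_line = [[1.0 if x == SIZE // 2 else 0.0 for x in range(SIZE)] for _ in range(SIZE)]
horizontal_line = [[1.0 if y == SIZE // 2 else 0.0 for _ in range(SIZE)] for y in range(SIZE)]

# theta = 0 responds to vertical strokes, theta = pi/2 to horizontal ones.
vertical_detector = gabor_kernel(SIZE, wavelength=4.0, theta=0.0, sigma=2.0)
horizontal_detector = gabor_kernel(SIZE, wavelength=4.0, theta=math.pi / 2, sigma=2.0)

print(filter_response(vertical_line, vertical_detector),
      filter_response(horizontal_line, vertical_detector))
```

Each orientation responds strongly to strokes aligned with it and weakly to perpendicular ones, so running a pair of such filters over a page image yields separate maps of candidate vertical and horizontal rulings.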
  • 32. What is a Perceptron ? (Wikipedia) • In machine learning, the perceptron is an algorithm for supervised learning of binary classifiers: functions that can decide whether an input (represented by a vector of numbers) belongs to one class or another. It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector. The algorithm allows for online learning, in that it processes elements in the training set one at a time. 32
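The definition above can be made concrete with a short sketch of the classic perceptron learning rule, which updates the weights one training example at a time. The toy data set (the logical AND of two inputs), the learning rate, and the function names are assumptions chosen for the illustration.

```python
def train_perceptron(samples, labels, epochs=20, learning_rate=1.0):
    """Learn weights and a bias for a binary linear classifier, online."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for features, target in zip(samples, labels):
            activation = sum(w * f for w, f in zip(weights, features)) + bias
            predicted = 1 if activation > 0 else 0
            error = target - predicted
            if error != 0:
                # Move the decision boundary towards the misclassified example.
                weights = [w + learning_rate * error * f for w, f in zip(weights, features)]
                bias += learning_rate * error
                mistakes += 1
        if mistakes == 0:
            break  # every training example is classified correctly
    return weights, bias


def predict(weights, bias, features):
    """Apply the linear predictor: class 1 if the weighted sum is positive."""
    return 1 if sum(w * f for w, f in zip(weights, features)) + bias > 0 else 0


# Toy, linearly separable data: output is 1 only when both inputs are 1.
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]

weights, bias = train_perceptron(samples, labels)
predictions = [predict(weights, bias, s) for s in samples]
print(predictions)
```

Because the rule only guarantees convergence on linearly separable data, a single perceptron is a building block rather than a complete detector.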
  • 34. Samples of the tables processing 34
  • 35. How to measure the performance of the extraction process • Not a simple problem • Multiple error types • Language • OCR quality – language dependent • OCR engine – open source (Tesseract) or paid (OmniPage)
  • 36. What will be reported
      – True Positives (TP): values that were extracted by ETI and confirmed as correct by the data analyst (DA).
      – False Positives (FP): values identified by ETI but corrected by the DA.
      – True Negatives (TN): values that were not found by ETI, where the DA confirms that the value for that specific field in the taxonomy is not present in the document. The field can be either left empty by the analyst or manually input without a reference in the document.
      – False Negatives (FN): values that ETI did not find in the document, but for which the DA inputs the value and adds a reference in the document.
  • 37. The system EPI – Extraction Performance Indicators
      – Precision: The precision of the data extraction tells us how many of the identified values are correct out of the total number of values extracted.
        $$\text{Precision} = \frac{\text{relevant values}}{\text{total number of retrieved values}} = \frac{TP}{TP + FP}$$
        The correct values are the TP, while the total values are TP + FP (correct and incorrect).
      – Sensitivity: The sensitivity tells us how many correct values we retrieved out of the total values that could have been extracted.
        $$\text{Sensitivity} = \frac{\text{relevant values}}{\text{total number of values existing in the document}} = \frac{TP}{TP + FN}$$
        The correct values are the TP, while the total values in the document are TP + FN. As defined above, FN are the values that the system identified as missing but the DA found them in the document.
      – Accuracy: Precision and Sensitivity deal only with the extracted values, and do not take into account the values that are really missing and that the system correctly reports as missing. Accuracy is the EPI that tells us how correctly the system identifies ALL values, both existing and missing.
        $$\text{Accuracy} = \frac{\text{correctly identified values}}{\text{total number of values}} = \frac{TP + TN}{TP + TN + FP + FN}$$
        The correctly identified values are both TP and TN, while the total number is the sum of all four measurements.
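The three indicators follow directly from the four counts, as the short sketch below shows. The function name `extraction_metrics` and the sample counts are made up for the illustration; they are not figures reported in the webinar.

```python
def extraction_metrics(tp, fp, tn, fn):
    """Compute precision, sensitivity and accuracy from the four counts."""
    retrieved = tp + fp          # everything the system extracted
    existing = tp + fn           # everything actually present in the documents
    total = tp + tn + fp + fn    # every field that was reviewed
    return {
        # Guard against empty denominators so the function never divides by zero.
        "precision": tp / retrieved if retrieved else 0.0,
        "sensitivity": tp / existing if existing else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
    }


# Hypothetical review of 100 fields: 80 correct extractions, 10 corrected,
# 5 correctly reported as missing, 5 missed by the system.
metrics = extraction_metrics(tp=80, fp=10, tn=5, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these sample counts accuracy comes out at 0.85, lower than both precision and sensitivity, because it also counts every false positive and false negative against the full set of reviewed fields.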
  • 38. Compliance Applications • Provenance • Always keep link between the data points and the source • Can be deployed on the cloud 38
  • 39. US Mutual Fund Data–from documents to analytics (www.rdcmf.com) 39
  • 40. Data Teams • Need to create data teams • Data Analysts – responsible for the taxonomies and mapping • Validation rules • Manual intervention decreases over time
  • 42. D. Processing of Unstructured Non-Classifiable Content (SDP – Smart Data Platform)
  • 44. Content is Not Classifiable by keywords – not consistent • Ontology based classification, extraction • What is an ontology ? • RDF • SPARQL • Used in Data Integration (Same As) • We can query Unstructured, Semi Structured and Structured with the same query language 44
  • 45. A few semantic terms… • RDF • Ontology - OWL • Linked Data • Schema.org - Google • Data.gov • Data.gov.uk
  • 47. Building Block RDF (slide dated 6/30/2016)
      "There is a Person identified by http://www.w3.org/People/EM/contact#me, whose name is Eric Miller, whose email address is em@w3.org, and whose title is Dr."
      Triplets:
      (i) http://www.w3.org/People/EM/contact#me, http://www.w3.org/2000/10/swap/pim/contact#fullName, "Eric Miller"
      (ii) http://www.w3.org/People/EM/contact#me, http://www.w3.org/2000/10/swap/pim/contact#personalTitle, "Dr."
      (iii) http://www.w3.org/People/EM/contact#me, http://www.w3.org/1999/02/22-rdf-syntax-ns#type, http://www.w3.org/2000/10/swap/pim/contact#Person
      (iv) http://www.w3.org/People/EM/contact#me, http://www.w3.org/2000/10/swap/pim/contact#mailbox, em@w3.org
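The four statements about Eric Miller can be held as plain (subject, predicate, object) tuples and queried by leaving some positions open, which is the basic idea behind a SPARQL triple pattern. This is a minimal sketch in plain Python with no RDF library; the helper name `match` and the constant names are assumptions for the illustration.

```python
CONTACT = "http://www.w3.org/2000/10/swap/pim/contact#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
ME = "http://www.w3.org/People/EM/contact#me"

# The four statements from the slide, one tuple per statement.
triples = [
    (ME, CONTACT + "fullName", "Eric Miller"),
    (ME, CONTACT + "personalTitle", "Dr."),
    (ME, RDF_TYPE, CONTACT + "Person"),
    (ME, CONTACT + "mailbox", "em@w3.org"),
]


def match(store, subject=None, predicate=None, obj=None):
    """Return every triple that agrees with the given positions (None = wildcard)."""
    return [
        (s, p, o)
        for (s, p, o) in store
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


# "What is the full name of this resource?"
names = [o for (_, _, o) in match(triples, subject=ME, predicate=CONTACT + "fullName")]
# "Which resources are of type Person?"
people = [s for (s, _, _) in match(triples, predicate=RDF_TYPE, obj=CONTACT + "Person")]
print(names, people)
```

A real triple store adds indexing, joins across several patterns, and inference on top of this lookup, but the data model stays the same three-part statement.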
  • 49. Ontologies – OWL (The Panama Files) From: https://www.linkedin.com/pulse/linked-leaks-powerful-hybrid-semantic-queries-panama-papers-kiryakov?trk=hp-feed-article-title-like
  • 53. Linked Data – www.linkeddata.org 53
  • 54. www.schema.org – alternative to ontologies 54
  • 55. Ontology Sample (OWL) – A Box – T Box 55
  • 56. SPARQL – The Semantic Query Language (22 Million RDF triplets) 56
  • 57. Sample analytics: occupation, countries mostly mentioned in Panama Files 57
  • 58. Smart Data Platform – unifies all the data 58
  • 59. The Smart Data Extraction and Integration Platform 59
  • 60. Query Samples from Mark Logic (SPARQL – XQUERY) 60
  • 62. E. Onboarding ETI or the SDP
  • 63. Onboarding ETI or SDP • Need to designate a “data Shepherd” • The data sources need to be analyzed by a business expert (know what data is where) – bad practice example • Meta data governance is very important (taxonomies, ontologies) • Gradually develop the ontology – not at once • Needs a champion in the enterprise, the beginning is hard • Work hand in hand with Data Analytics people • Start small and measure the ROI • Will have to find the “we don’t know what we don’t know” facts…. 63
  • 64. F. About Recognos and Next Steps 64
  • 65. What does Recognos have • ETI – Human in the Loop Machine learning Extraction Platform • Deployment – The Data - Subscription – Licensing – on premises – on boarding – training – support – On the Cloud – delivery on Q2 • Smart Data Platform – depends on every environment – analysis is needed – on boarding requires consulting 65
  • 66. About Recognos • Recognos Inc. - California based company – established in 1999 • Has a partner company in New York – Recognos Financial • Recognos has a development company in Cluj Romania – 80 developers – established in 2000 • From 2008 – Involved in Semantics • Main customers – Fisher Investments, DTCC - NY, Clarient - NY, DST, Bank of Transylvania, OSF Budapest • About 50% of the revenue through licensing and recurring data contracts 66
  • 67. In the press • http://www.mondovisione.com/media-and-resources/news/recognos-eti-creates-smarter-data-new-platform-extracts-transforms-and-integr/ • http://www.dataversity.net/data-extraction-system-unstructured-documents/ • http://www.information-management.com/news/big-data-analytics/recognos-financial-announces-release-of-ai-based-recognos-eti-10028249-1.html?utm_medium=email&ET=informationmgmt:e6092429:2042611a:&utm_source=newsletter&utm_campaign=daily-feb%2012%202016&st=email • http://www.informationweek.com/big-data/big-data-analytics/7-ways-semantic-technologies-make-data-make-sense/d/d-id/1323580?image_number=8 • http://raconteur.net/technology/top-5-sectors-using-artificial-intelligence • http://www.fiercefinanceit.com/story/brain-over-brawn-semantic-technology-and-machine-learning-take-new-role-man/2015-12-03 • http://www.dataversity.net/semantic-technology-a-new-approach-to-financial-data/ • http://www.recognos.ro/news-and-events/trends-in-ai-technology/#more-1211 • http://www.paymentssource.com/news/paythink/artificial-intelligence-can-nab-money-launderers-3023456-1.html • http://tabbforum.com/videos/artificial-intelligence-in-financial-services-2016-trends
  • 68. Next Steps • Proof of Concept (PoC) – We will sign an NDA as needed – We will import your documents – We will show you the power and ease of use of the Recognos solution • Pilot project – We will work with you on an ROI-centric project