Using Regular Expressions for Data Mining and Automated Data Capture and Indexing 
Copyright © 2010 - 2013 DocuFi. All Rights Reserved
In a Document Management Environment 
First: What is automated data capture or data mining?
Data Capture: Just identifying and extracting information or data (sometimes called metadata) from scanned documents.
Automated Data Capture: Applying the principles of automation to data capture, silly! This can also be called text data mining.
Why automate data capture?
Manual data capture is expensive and time consuming.
Problems with manual data entry:
1. Security may be compromised if documents are taken off premises
2. A delay is introduced if documents are taken off premises
3. Compared to automated extraction, manual indexing is slow
4. Manual indexing doesn’t scale well with large projects
5. Manual indexing has the potential to introduce errors into the data
There’s a Mountain of It!
Let’s take a look at just invoices for example…
According to an Aberdeen Group August 2010 report, 72 percent of received invoices are paper-based.
Companies responding to PayStream Advisors’ 2010 Invoice Automation Benchmarking survey indicated that they receive 77 percent of their invoices via paper. 
and it’s expensive 
An Aberdeen Group March 2012 publication estimates the cost of processing a single invoice at between $4.84 and $20.13.
e-invoicing: sending and receiving invoices electronically
So if e-invoicing is not an option (as it’s not for many), what?
“it is the front-end capture options…that introduce true performance gains. For example, respondents who have implemented front-end document capture (creating a scanned digital copy of a physical invoice to be used in the approval process) report invoice processing 34% faster than those who process invoices manually. Moving to the pure data end of the spectrum, companies that convert scanned documents into usable data (through optical character recognition or similar technologies), report a 26% faster processing time than those that work only with document images.” 
- Aberdeen’s 2010 report
And, We All Know, Time is Money
Don’t forget we are using invoices only as an example. But, this could apply to patient records, legal documents, purchase orders…any document.
Now that you know this is all about money, let’s go back to the focus of this slideshow.
Using Regular Expressions for Data Mining and Automated Data Capture and Indexing
What are Regular Expressions or regex? 
Regular expressions (regex) provide a fast and powerful method to search, extract and replace specific data found within scanned documents. 
Regular expressions are essentially a special text string for describing a search pattern. You could think of regular expressions as extremely powerful wildcards.
What’s it look like?
A simple regular expression might look something like this: ^\s{1,3}[A-Z0-9]*XYZ
^
Start at the beginning of a string or line
\s{1,3}
Find between 1 and 3 whitespace characters (such as spaces)
[A-Z0-9]*
Find any character in the range A-Z or 0-9; the “*” is the instruction to find as many occurrences as possible (including none)
XYZ
Find the literal characters “XYZ”
If we had the value “ AZR8987XYZ” at the start of a line in our document we would get a match, whereas if we had “ AZR898XY” we would not, because the final “Z” is missing.
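As a quick check of the breakdown above, here is a minimal sketch in Python. Python is our choice for illustration only (the slides do not name a language), and the helper name matches is hypothetical.

```python
import re

# Pattern from the slide: start of line, 1 to 3 whitespace characters,
# any run of A-Z / 0-9 characters, then the literal "XYZ".
PATTERN = re.compile(r"^\s{1,3}[A-Z0-9]*XYZ")

def matches(value: str) -> bool:
    """Return True if the value matches the pattern at its start."""
    return PATTERN.search(value) is not None

print(matches(" AZR8987XYZ"))  # True: leading space, run of characters, then XYZ
print(matches(" AZR898XY"))    # False: the literal "XYZ" never appears
```

Note that the leading whitespace is required, so the same value without the leading space would not match.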
Huh? 
Don’t worry, this is not a tutorial on writing regex. We just want to look at some examples and understand how regex can apply to data capture and indexing in a document management environment.
Regular expressions are extremely flexible and patterns can be constructed to match almost anything. For text commonly found in documents such as dates, SSNs, ZIP codes etc., patterns are freely available on the Internet.
Here are some examples:
ZIP Codes
^(?!00000)(?<zip>(?<zip5>\d{5})(?:[ -](?=\d))?(?<zip4>\d{4})?)$
US Phone Number
^([0-9]( |-)?)?(\(?[0-9]{3}\)?|[0-9]{3})( |-)?([0-9]{3}( |-)?[0-9]{4}|[a-zA-Z0-9]{7})$
Credit Card
(^(4|5)\d{3}-?\d{4}-?\d{4}-?\d{4}|(4|5)\d{15})|(^(6011)-?\d{4}-?\d{4}-?\d{4}|(6011)-?\d{12})|(^((3\d{3}))-\d{6}-\d{5}|^((3\d{14})))
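To show how one of these ready-made patterns might be used, here is a hedged sketch that applies the ZIP code pattern in Python. One adaptation is needed: Python's re module writes named groups as (?P<name>...), whereas the slide uses the (?<name>...) form found in .NET and some other engines. The function name parse_zip is hypothetical.

```python
import re

# ZIP pattern from the slide, with named groups respelled for Python.
ZIP_RE = re.compile(
    r"^(?!00000)(?P<zip>(?P<zip5>\d{5})(?:[ -](?=\d))?(?P<zip4>\d{4})?)$"
)

def parse_zip(text):
    """Return (zip5, zip4) for a valid ZIP code, or None if it does not match."""
    m = ZIP_RE.match(text)
    if m is None:
        return None
    return m.group("zip5"), m.group("zip4")

print(parse_zip("12345"))       # ('12345', None)
print(parse_zip("12345-6789"))  # ('12345', '6789')
print(parse_zip("00000"))       # None, rejected by the (?!00000) lookahead
```

The named groups are what make the pattern convenient for indexing: each part of the match can be stored in its own field.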
Real World Example
Here is a partial invoice where you might need to capture the “Catalogue Number”.
In order to start constructing a regular expression we have to use what we know from the data in front of us as well as making some assumptions. During testing we can refine the regular expression. 
In this example we can assume from the document that the catalogue number has the format of a single uppercase letter, followed by 2 digits then a hyphen followed by a single uppercase letter and 6 digits or just 6 digits.
We could use the regex [A-Z]\d{2}-[A-Z]{0,1}\d{6} to extract the data. Let’s again break this down:
[A-Z]
Find a character from A-Z; the absence of a quantifier specification, “{}”, means we are looking for exactly 1 character
\d{2}
Find exactly 2 digits
-
Find the literal character “-”
[A-Z]{0,1}
Find a character A-Z between 0 and 1 times
\d{6}
Find exactly 6 digits
This is just one of various ways a regular expression for this example could be written. If we should subsequently find that the last portion of the catalogue number might contain 4 to 6 digits, we could simply amend it as follows: [A-Z]\d{2}-[A-Z]{0,1}\d{4,6}
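The two catalogue number patterns can be tried out with a short Python sketch. The sample text and the catalogue numbers in it are invented for illustration; they are not taken from the invoice in the slides.

```python
import re

# Strict pattern: letter, 2 digits, hyphen, optional letter, exactly 6 digits.
CATALOGUE_RE = re.compile(r"[A-Z]\d{2}-[A-Z]{0,1}\d{6}")
# Amended pattern: the last portion may contain 4 to 6 digits.
CATALOGUE_RELAXED_RE = re.compile(r"[A-Z]\d{2}-[A-Z]{0,1}\d{4,6}")

# Hypothetical OCR text from a scanned invoice.
ocr_text = (
    "Catalogue Number: B12-K123456  Qty: 4\n"
    "Catalogue Number: C07-654321  Qty: 1"
)

print(CATALOGUE_RE.findall(ocr_text))  # ['B12-K123456', 'C07-654321']

# A shorter number is only found by the amended pattern.
print(CATALOGUE_RE.search("D45-9876"))                   # None
print(CATALOGUE_RELAXED_RE.search("D45-9876").group(0))  # D45-9876
```

This is the kind of test-and-refine loop described above: run the pattern against real samples, then loosen or tighten it as exceptions turn up.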
Now how would this work in a data capture solution?
We’ll take a look at how regex is used in ImageRamp Batch. It’s a simple-to-use folder processing tool that accelerates getting data and files into various EMR, Document Management or other secure storage environments. It can be used to capture and extract data in both structured and unstructured documents.
As an example, we might want to extract data from a scanned file with the following 4 fields:
•Company Name
•Company Number
•Date
•SIC Code
Here is the ImageRamp screen showing the scanned file pages and the data extracted using regex for the four fields we listed.
So where is the regex?
Hang on, we’ll show it. We’ll use it to split individual companies’ invoices from a multipage scan based on the Company Name and extract index data.
A company might use this to scan a large stack of invoices and split the file every time a new invoicing company name is located using the regex scripts.
Let’s break it down: splitting the scan stack.
First we are going to define the regex to perform document splitting when a new Company Name is located, in ImageRamp’s Splitting and Extraction’s Data Mining submenu as shown below.
(?<=\bCompany\s*Name\s+\b)[a-z0-9\(\) ]*
… and check the “Split if Matched” option.
Let’s break it down: capturing the index data.
Remember in our example we identified CompanyName, CompanyNo, Date and SICcode as the index or metadata information we want to capture. So here we are extracting the date field using the regex in the Index Fields section of the Data Mining submenu.
(?<Date>(?<=\bDate of this return\s+\b)\d{2}/\d{2}/\d{4})
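The splitting and indexing idea can be sketched outside ImageRamp as well. The Python below is an illustration under stated assumptions, not ImageRamp's implementation: the function split_and_index, the page texts and the dictionary layout are all hypothetical. Python's re module only accepts fixed-width lookbehind, so the label prefix from the slides is rewritten as an ordinary prefix followed by a capturing group, and case-insensitive matching is assumed for the company name.

```python
import re

# Label followed by a captured value, replacing the variable-width lookbehind.
# IGNORECASE is an assumption, since the slide's class only lists a-z.
COMPANY_RE = re.compile(r"\bCompany\s*Name\s+\b([a-z0-9() ]*)", re.IGNORECASE)
DATE_RE = re.compile(r"\bDate of this return\s+\b(\d{2}/\d{2}/\d{4})")

def split_and_index(pages):
    """Group page texts into documents, starting a new document whenever a
    page contains a Company Name match (mimicking "Split if Matched")."""
    documents = []
    for text in pages:
        company = COMPANY_RE.search(text)
        if company or not documents:
            name = company.group(1).strip() if company else None
            documents.append({"CompanyName": name, "Date": None, "pages": []})
        doc = documents[-1]
        doc["pages"].append(text)
        date = DATE_RE.search(text)
        if date and doc["Date"] is None:
            doc["Date"] = date.group(1)  # keep the first date found
    return documents

# Hypothetical OCR output, one string per scanned page.
pages = [
    "Company Name Acme Widgets (UK) Ltd\nDate of this return 14/03/2012",
    "continuation page with line items",
    "Company Name Bolt Supplies\nDate of this return 02/11/2011",
]
for doc in split_and_index(pages):
    print(doc["CompanyName"], doc["Date"], len(doc["pages"]))
```

With the sample pages, the second page carries no company name and so stays with the first document, giving two documents in total.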
Information extracted through the text data mining with regex can also be used to name the file and create folders. 
Here %regex1 corresponds to the first regex field definition (CompanyName) 
and %regex2 corresponds to the second field definition (CompanyNo). 
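One way to picture the %regex1 and %regex2 placeholders is as slots in a naming template that are filled with the extracted values. The sketch below is a hypothetical illustration in Python: the template string, the function expand_template and the character clean-up rule are our assumptions, and ImageRamp's actual naming syntax may differ.

```python
import re

def expand_template(template, fields):
    """Replace each %regexN placeholder with the N-th extracted field (1-based)."""
    def substitute(match):
        value = fields[int(match.group(1)) - 1]
        # Assumed rule: swap characters that are unsafe in file names for "_".
        return re.sub(r'[\\/:*?"<>|]', "_", value).strip()
    return re.sub(r"%regex(\d+)", substitute, template)

# Hypothetical template: a folder per company, a file per company number.
fields = ["Acme Widgets Ltd", "01234567"]
print(expand_template("%regex1/%regex2.pdf", fields))
# Acme Widgets Ltd/01234567.pdf
```

Cleaning the extracted values matters in practice, because OCR text can contain characters such as "/" that would otherwise create unintended folders.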
But wait, there’s more.
We hope we have demonstrated the immense power of using regular expressions to extract data from both structured and unstructured documents.
Data in the palm of your hand…not locked in your documents! 
and…
For more on: 
•Data Mining PDF 
•Data mining Scans 
•Invoice Mining 
•Patient Record Mining 
•OCR mining 
•TIF mining 
•Extracting meta data
•Data extraction from unstructured data 
•Intelligent data capture 
•Data extraction 
•Using regex to extract data 
•Document scanning 
•Extracting data 
•Extract meta data
•Scanner software
•Barcode recognition
•OCR software
•Capture tutorial
•PDF scanning
•Scanning software 
•Indexing 
•Document indexing 
•Automated capture 
•Meta data 
•Scan to index 
•Batch Processing 
•Bulk scanning 
•Docufi 
•Imageramp 
•Data capture 
•Migration to document management 
Learn more about the power of ImageRamp and its other features including:
•Full text OCR to PDF
•PDF rights management and encryption
•Document naming, splitting, and routing based on barcodes
•Image processing for clean up and adaptive thresholding
•OCR (Optical Character Recognition)
•Barcode reading (1D and 2D)
More?
Further reading on Regular Expressions:
http://en.wikipedia.org/wiki/Regular_expression
http://regexlib.com/
http://www.regular-expressions.info/
docufi.com 
@imageramp 
@docufinews

Más contenido relacionado

La actualidad más candente

Painless Document Scanning and Indexing with Alfresco
Painless Document Scanning and Indexing with AlfrescoPainless Document Scanning and Indexing with Alfresco
Painless Document Scanning and Indexing with AlfrescoBlueFishTX
 
Data mining concepts and work
Data mining concepts and workData mining concepts and work
Data mining concepts and workAmr Abd El Latief
 
Key aspects of big data storage and its architecture
Key aspects of big data storage and its architectureKey aspects of big data storage and its architecture
Key aspects of big data storage and its architectureRahul Chaturvedi
 
DocuSolve Scanning Solutions
DocuSolve Scanning SolutionsDocuSolve Scanning Solutions
DocuSolve Scanning SolutionsGordon Bishop
 
Data Mining Concepts
Data Mining ConceptsData Mining Concepts
Data Mining ConceptsDung Nguyen
 
IRJET- Classification of Pattern Storage System and Analysis of Online Shoppi...
IRJET- Classification of Pattern Storage System and Analysis of Online Shoppi...IRJET- Classification of Pattern Storage System and Analysis of Online Shoppi...
IRJET- Classification of Pattern Storage System and Analysis of Online Shoppi...IRJET Journal
 
Classification and prediction in data mining
Classification and prediction in data miningClassification and prediction in data mining
Classification and prediction in data miningEr. Nawaraj Bhandari
 
Odam: Open Data, Access and Mining
Odam: Open Data, Access and MiningOdam: Open Data, Access and Mining
Odam: Open Data, Access and MiningDaniel JACOB
 
Data Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture NotesData Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture NotesFellowBuddy.com
 

La actualidad más candente (20)

Folder Watching For Automated Document Capture, Batch Scanning
Folder Watching For Automated Document Capture, Batch ScanningFolder Watching For Automated Document Capture, Batch Scanning
Folder Watching For Automated Document Capture, Batch Scanning
 
Batch Document Processing with ImageRamp Batch
Batch Document Processing with ImageRamp BatchBatch Document Processing with ImageRamp Batch
Batch Document Processing with ImageRamp Batch
 
An Introduction to Document Scanning, Understanding Your Requirements
An Introduction to Document Scanning, Understanding Your RequirementsAn Introduction to Document Scanning, Understanding Your Requirements
An Introduction to Document Scanning, Understanding Your Requirements
 
Fujitsu ScanSnap Scanner, an overview of document data capture with barcodes,...
Fujitsu ScanSnap Scanner, an overview of document data capture with barcodes,...Fujitsu ScanSnap Scanner, an overview of document data capture with barcodes,...
Fujitsu ScanSnap Scanner, an overview of document data capture with barcodes,...
 
8 Document Capture Must Haves, a Document Management Tutorial
8 Document Capture Must Haves, a Document Management Tutorial8 Document Capture Must Haves, a Document Management Tutorial
8 Document Capture Must Haves, a Document Management Tutorial
 
Painless Document Scanning and Indexing with Alfresco
Painless Document Scanning and Indexing with AlfrescoPainless Document Scanning and Indexing with Alfresco
Painless Document Scanning and Indexing with Alfresco
 
Mobile Cloud Capture: Customize your Data Capture on Mobile Devices with Proc...
Mobile Cloud Capture: Customize your Data Capture on Mobile Devices with Proc...Mobile Cloud Capture: Customize your Data Capture on Mobile Devices with Proc...
Mobile Cloud Capture: Customize your Data Capture on Mobile Devices with Proc...
 
Improve OCR Accuracy, Clean Up and Enhance Scanned Images
Improve OCR Accuracy, Clean Up and Enhance Scanned ImagesImprove OCR Accuracy, Clean Up and Enhance Scanned Images
Improve OCR Accuracy, Clean Up and Enhance Scanned Images
 
ChronoScan Document Scanning and Capture for Unparralleled Data Extraction an...
ChronoScan Document Scanning and Capture for Unparralleled Data Extraction an...ChronoScan Document Scanning and Capture for Unparralleled Data Extraction an...
ChronoScan Document Scanning and Capture for Unparralleled Data Extraction an...
 
Custom Capture Tool Development
Custom Capture Tool DevelopmentCustom Capture Tool Development
Custom Capture Tool Development
 
PDF vs. TIFF, An Evaluation of Document Scanning File Formats
PDF vs. TIFF, An Evaluation of Document Scanning File FormatsPDF vs. TIFF, An Evaluation of Document Scanning File Formats
PDF vs. TIFF, An Evaluation of Document Scanning File Formats
 
Tips to Solve Common Problems Reading Barcodes
Tips to Solve Common Problems Reading BarcodesTips to Solve Common Problems Reading Barcodes
Tips to Solve Common Problems Reading Barcodes
 
Data mining concepts and work
Data mining concepts and workData mining concepts and work
Data mining concepts and work
 
Key aspects of big data storage and its architecture
Key aspects of big data storage and its architectureKey aspects of big data storage and its architecture
Key aspects of big data storage and its architecture
 
DocuSolve Scanning Solutions
DocuSolve Scanning SolutionsDocuSolve Scanning Solutions
DocuSolve Scanning Solutions
 
Data Mining Concepts
Data Mining ConceptsData Mining Concepts
Data Mining Concepts
 
IRJET- Classification of Pattern Storage System and Analysis of Online Shoppi...
IRJET- Classification of Pattern Storage System and Analysis of Online Shoppi...IRJET- Classification of Pattern Storage System and Analysis of Online Shoppi...
IRJET- Classification of Pattern Storage System and Analysis of Online Shoppi...
 
Classification and prediction in data mining
Classification and prediction in data miningClassification and prediction in data mining
Classification and prediction in data mining
 
Odam: Open Data, Access and Mining
Odam: Open Data, Access and MiningOdam: Open Data, Access and Mining
Odam: Open Data, Access and Mining
 
Data Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture NotesData Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture Notes
 

Similar a Automated Data Capture Using Regular Expressions

Unit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptxUnit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptxMalla Reddy University
 
IRJET- Empower Syntactic Exploration Based on Conceptual Graph using Searchab...
IRJET- Empower Syntactic Exploration Based on Conceptual Graph using Searchab...IRJET- Empower Syntactic Exploration Based on Conceptual Graph using Searchab...
IRJET- Empower Syntactic Exploration Based on Conceptual Graph using Searchab...IRJET Journal
 
Data Warehousing AWS 12345
Data Warehousing AWS 12345Data Warehousing AWS 12345
Data Warehousing AWS 12345AkhilSinghal21
 
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...IRJET Journal
 
Data Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsData Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsDerek Kane
 
Multikeyword Hunt on Progressive Graphs
Multikeyword Hunt on Progressive GraphsMultikeyword Hunt on Progressive Graphs
Multikeyword Hunt on Progressive GraphsIRJET Journal
 
Structured Document Search and Retrieval
Structured Document Search and RetrievalStructured Document Search and Retrieval
Structured Document Search and RetrievalOptum
 
What Your Database Query is Really Doing
What Your Database Query is Really DoingWhat Your Database Query is Really Doing
What Your Database Query is Really DoingDave Stokes
 
Introduction to Data Science With R Notes
Introduction to Data Science With R NotesIntroduction to Data Science With R Notes
Introduction to Data Science With R NotesLakshmiSarvani6
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkDatabricks
 
RDBMS to Graph Webinar
RDBMS to Graph WebinarRDBMS to Graph Webinar
RDBMS to Graph WebinarNeo4j
 
Understanding Graph Databases: AWS Developer Workshop at Web Summit
Understanding Graph Databases: AWS Developer Workshop at Web SummitUnderstanding Graph Databases: AWS Developer Workshop at Web Summit
Understanding Graph Databases: AWS Developer Workshop at Web SummitAmazon Web Services
 
Vision Based Deep Web data Extraction on Nested Query Result Records
Vision Based Deep Web data Extraction on Nested Query Result RecordsVision Based Deep Web data Extraction on Nested Query Result Records
Vision Based Deep Web data Extraction on Nested Query Result RecordsIJMER
 
What is parsing, and how to make a parsable CV
What is parsing, and how to make a parsable CVWhat is parsing, and how to make a parsable CV
What is parsing, and how to make a parsable CVJobTatkal
 
Search Your DynamoDB Data with Amazon Elasticsearch Service (ANT302) - AWS re...
Search Your DynamoDB Data with Amazon Elasticsearch Service (ANT302) - AWS re...Search Your DynamoDB Data with Amazon Elasticsearch Service (ANT302) - AWS re...
Search Your DynamoDB Data with Amazon Elasticsearch Service (ANT302) - AWS re...Amazon Web Services
 
A Data Warehouse And Business Intelligence Application
A Data Warehouse And Business Intelligence ApplicationA Data Warehouse And Business Intelligence Application
A Data Warehouse And Business Intelligence ApplicationKate Subramanian
 
data wrangling (1).pptx kjhiukjhknjbnkjh
data wrangling (1).pptx kjhiukjhknjbnkjhdata wrangling (1).pptx kjhiukjhknjbnkjh
data wrangling (1).pptx kjhiukjhknjbnkjhVISHALMARWADE1
 

Similar a Automated Data Capture Using Regular Expressions (20)

Unit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptxUnit 2 - Data Manipulation with R.pptx
Unit 2 - Data Manipulation with R.pptx
 
IRJET- Empower Syntactic Exploration Based on Conceptual Graph using Searchab...
IRJET- Empower Syntactic Exploration Based on Conceptual Graph using Searchab...IRJET- Empower Syntactic Exploration Based on Conceptual Graph using Searchab...
IRJET- Empower Syntactic Exploration Based on Conceptual Graph using Searchab...
 
Data Mining _ Weka
Data Mining _ WekaData Mining _ Weka
Data Mining _ Weka
 
Data Warehousing AWS 12345
Data Warehousing AWS 12345Data Warehousing AWS 12345
Data Warehousing AWS 12345
 
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S...
 
Data Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsData Science - Part XI - Text Analytics
Data Science - Part XI - Text Analytics
 
Multikeyword Hunt on Progressive Graphs
Multikeyword Hunt on Progressive GraphsMultikeyword Hunt on Progressive Graphs
Multikeyword Hunt on Progressive Graphs
 
Structured Document Search and Retrieval
Structured Document Search and RetrievalStructured Document Search and Retrieval
Structured Document Search and Retrieval
 
50120130406017
5012013040601750120130406017
50120130406017
 
What Your Database Query is Really Doing
What Your Database Query is Really DoingWhat Your Database Query is Really Doing
What Your Database Query is Really Doing
 
Introduction to Data Science With R Notes
Introduction to Data Science With R NotesIntroduction to Data Science With R Notes
Introduction to Data Science With R Notes
 
Database Project
Database ProjectDatabase Project
Database Project
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
 
RDBMS to Graph Webinar
RDBMS to Graph WebinarRDBMS to Graph Webinar
RDBMS to Graph Webinar
 
Understanding Graph Databases: AWS Developer Workshop at Web Summit
Understanding Graph Databases: AWS Developer Workshop at Web SummitUnderstanding Graph Databases: AWS Developer Workshop at Web Summit
Understanding Graph Databases: AWS Developer Workshop at Web Summit
 
Vision Based Deep Web data Extraction on Nested Query Result Records
Vision Based Deep Web data Extraction on Nested Query Result RecordsVision Based Deep Web data Extraction on Nested Query Result Records
Vision Based Deep Web data Extraction on Nested Query Result Records
 
What is parsing, and how to make a parsable CV
What is parsing, and how to make a parsable CVWhat is parsing, and how to make a parsable CV
What is parsing, and how to make a parsable CV
 
Search Your DynamoDB Data with Amazon Elasticsearch Service (ANT302) - AWS re...
Search Your DynamoDB Data with Amazon Elasticsearch Service (ANT302) - AWS re...Search Your DynamoDB Data with Amazon Elasticsearch Service (ANT302) - AWS re...
Search Your DynamoDB Data with Amazon Elasticsearch Service (ANT302) - AWS re...
 
A Data Warehouse And Business Intelligence Application
A Data Warehouse And Business Intelligence ApplicationA Data Warehouse And Business Intelligence Application
A Data Warehouse And Business Intelligence Application
 
data wrangling (1).pptx kjhiukjhknjbnkjh
data wrangling (1).pptx kjhiukjhknjbnkjhdata wrangling (1).pptx kjhiukjhknjbnkjh
data wrangling (1).pptx kjhiukjhknjbnkjh
 

Último

Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 

Último (20)

Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Dev Dives: Streamline document processing with UiPath Studio Web

Automated Data Capture Using Regular Expressions

  • 1. Using Regular Expressions for Data Mining and Automated Data Capture and Indexing Copyright © 2010 - 2013 DocuFi. All Rights Reserved
  • 2. Using Regular Expressions for Data Mining and Automated Data Capture and Indexing in a Document Management Environment
  • 3. First: What is automated data capture? Data Capture: Just identifying and extracting information or data (sometimes called metadata) from scanned documents.
  • 4. First: What is automated data capture or data mining? Automated Data Capture: Applying the principles of automation to data capture, silly! This can also be called text data mining.
  • 5. Why automate data capture? Manual Data Capture is Expensive and Time Consuming
  • 6. Why automate data capture? Problems with manual data entry: 1. Security may be compromised if documents are taken off premises. 2. A delay is introduced if documents are taken off premises. 3. Compared to automated extraction, manual indexing is slow. 4. Manual indexing doesn’t scale well with large projects. 5. Manual indexing has the potential to introduce errors into the data.
  • 7. and…
  • 9. There’s a Mountain of It! Let’s take a look at just invoices for example…
  • 10. According to an Aberdeen Group August 2010 report, 72 percent of received invoices are paper-based.
  • 11. Companies responding to PayStream Advisors’ 2010 Invoice Automation Benchmarking survey indicated that they receive 77 percent of their invoices via paper.
  • 12. And it’s expensive: an Aberdeen Group March 2012 publication estimates the cost of processing a single invoice at $4.84 to $20.13.
  • 13. So if e-invoicing (sending and receiving invoices electronically) is not an option (as it’s not for many), what? “it is the front-end capture options…that introduce true performance gains. For example, respondents who have implemented front-end document capture (creating a scanned digital copy of a physical invoice to be used in the approval process) report invoice processing 34% faster than those who process invoices manually. Moving to the pure data end of the spectrum, companies that convert scanned documents into usable data (through optical character recognition or similar technologies), report a 26% faster processing time than those that work only with document images.” (Aberdeen’s 2010 report)
  • 14. And, We All Know, Time is Money
  • 15. Don’t forget we are using invoices only as an example. But, this could apply to patient records, legal documents, purchase orders…any document.
  • 16. Now that you know this is all about money, let’s go back to the focus of this slideshow.
  • 17. Using Regular Expressions for Data Mining and Automated Data Capture and Indexing
  • 18. What are Regular Expressions or regex? Regular expressions (regex) provide a fast and powerful method to search, extract and replace specific data found within scanned documents. A regular expression is essentially a special text string that describes a search pattern. You could think of regular expressions as extremely powerful wildcards.
  • 19. What’s it look like? A simple regular expression might look something like this: ^\s{1,3}[A-Z0-9]*XYZ
  • 20. Breaking it down: ^ Start at the beginning of a string or line. \s{1,3} Find a whitespace character (such as a space) that occurs between 1 and 3 times. [A-Z0-9]* Find any character in the range A-Z and 0-9; the “*” is the instruction to find as many occurrences as possible (including none). XYZ Find the literal characters “XYZ”.
  • 21. If we had the value “ AZR8987XYZ” in our document at the start of a line we would get a match, whereas if we had “ AZR898XY” we would not.
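
The two sample values above can be checked with a few lines of Python. This is only a sketch: it assumes the pattern includes the “*” quantifier described in the breakdown, and the function name is made up for illustration.

```python
import re

# Pattern from the slide, written with the "*" quantifier that the
# breakdown describes: 1-3 whitespace characters at the start of the
# line, any run of A-Z/0-9, then the literal "XYZ".
PATTERN = re.compile(r"^\s{1,3}[A-Z0-9]*XYZ")


def matches_at_line_start(line: str) -> bool:
    """Return True if the line begins with text matching PATTERN."""
    return PATTERN.search(line) is not None


if __name__ == "__main__":
    print(matches_at_line_start(" AZR8987XYZ"))  # True
    print(matches_at_line_start(" AZR898XY"))    # False
```

The second value fails only because the literal “XYZ” never appears; everything before it would otherwise be accepted.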
  • 22. Huh? Don’t worry, this is not a tutorial on writing regex. We just want to look at some examples and understand how regex can apply to data capture and indexing in a document management environment.
  • 23. Regular expressions are extremely flexible and patterns can be constructed to match almost anything. For text commonly found in documents such as dates, SSNs, ZIP codes etc., patterns are freely available on the Internet. Here are some examples: ZIP codes: ^(?!00000)(?<zip>(?<zip5>\d{5})(?:[ -](?=\d))?(?<zip4>\d{4})?)$ US phone number: ^([0-9]( |-)?)?(\(?[0-9]{3}\)?|[0-9]{3})( |-)?([0-9]{3}( |-)?[0-9]{4}|[a-zA-Z0-9]{7})$ Credit card: (^(4|5)\d{3}-?\d{4}-?\d{4}-?\d{4}|(4|5)\d{15})|(^(6011)-?\d{4}-?\d{4}-?\d{4}|(6011)-?\d{12})|(^((3\d{3}))-\d{6}-\d{5}|^((3\d{14})))
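
As a rough illustration, the ZIP code pattern can be tried in Python. Two assumptions apply: Python's `re` module writes named groups as `(?P<name>…)` rather than the `(?<name>…)` form on the slide, and the separator class is taken to be a space or a hyphen. The helper name `parse_zip` is hypothetical.

```python
import re

# ZIP code pattern from the slide, with named groups translated to
# Python's (?P<name>...) syntax. Assumes the separator is " " or "-".
ZIP = re.compile(
    r"^(?!00000)(?P<zip>(?P<zip5>\d{5})(?:[ -](?=\d))?(?P<zip4>\d{4})?)$"
)


def parse_zip(value: str):
    """Return the 5-digit and optional 4-digit parts, or None if invalid."""
    match = ZIP.match(value.strip())
    if match is None:
        return None
    return {"zip5": match.group("zip5"), "zip4": match.group("zip4")}


if __name__ == "__main__":
    for candidate in ("12345", "12345-6789", "00000", "1234"):
        print(candidate, parse_zip(candidate))
```

The named groups are what make such a pattern useful for indexing: each part of the match can be stored as its own field.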
  • 24. Real World Example: Here is a partial invoice where you might need to capture the “Catalogue Number”.
  • 25. To start constructing a regular expression we have to use what we know from the data in front of us and make some assumptions. During testing we can refine the regular expression. In this example we can assume from the document that the catalogue number has the format of a single uppercase letter followed by 2 digits, then a hyphen, followed by either a single uppercase letter and 6 digits or just 6 digits.
  • 26. We could use the regex [A-Z]\d{2}-[A-Z]{0,1}\d{6} to extract the data. Let's again break this down: [A-Z] Find a character from A-Z; the absence of a quantifier, “{}”, means we are only looking for 1 character. \d{2} Find exactly 2 digits. - Find the literal character “-”. [A-Z]{0,1} Find a character A-Z between 0 and 1 repetitions. \d{6} Find exactly 6 digits. This is just one way of writing a regular expression for this example; there are various other ways it could be written. If we should subsequently find that the last portion of the catalogue number might contain 4 to 6 digits, we could simply amend it as follows: [A-Z]\d{2}-[A-Z]{0,1}\d{4,6}
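
A minimal Python sketch of the catalogue number pattern and its amended form follows. It assumes the spaces inside the pattern on the slide are line-wrap artifacts; the sample invoice text and the function name are invented for illustration and do not come from the invoice on the slide.

```python
import re

# Catalogue number: uppercase letter, 2 digits, hyphen, optional
# uppercase letter, then exactly 6 digits.
CATALOGUE = re.compile(r"[A-Z]\d{2}-[A-Z]{0,1}\d{6}")
# Amended form that accepts 4 to 6 trailing digits.
CATALOGUE_RELAXED = re.compile(r"[A-Z]\d{2}-[A-Z]{0,1}\d{4,6}")

# Hypothetical OCR text, not taken from the slide's invoice.
SAMPLE = (
    "Catalogue Number: B12-A123456  Qty 2\n"
    "Catalogue Number: C07-654321  Qty 1\n"
    "Catalogue Number: D33-4321  Qty 5\n"
)


def find_catalogue_numbers(text: str, pattern=CATALOGUE):
    """Return every substring of text that matches the given pattern."""
    return pattern.findall(text)


if __name__ == "__main__":
    print(find_catalogue_numbers(SAMPLE))
    print(find_catalogue_numbers(SAMPLE, CATALOGUE_RELAXED))
```

Running both patterns against the same text during testing is one way to see what a refinement gains: here only the relaxed pattern picks up the shorter third number.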
  • 27. Now how would this work in a data capture solution? We’ll take a look at how regex is used in ImageRamp Batch. It’s a simple-to-use folder processing tool that accelerates getting data and files into various EMR, Document Management or other secure storage environments. It can be used to capture and extract data in both structured and unstructured documents. As an example, we might want to extract data from a scanned file with the following 4 fields: Company Name, Company Number, Date, SIC Code.
  • 28. Here is the ImageRamp screen showing the scanned file pages and the data extracted using regex for the four fields we listed.
  • 29. So where is the regex? Hang on, we’ll show it. We’ll use it to split individual companies’ invoices from a multipage scan based on the Company Name and extract index data. A company might use this to scan a large stack of invoices and split the file every time a new invoicing company name is located using the regex scripts.
  • 30. Let’s break it down: splitting the scan stack. First we define the regex that performs document splitting when a new Company Name is located, in the Data Mining submenu of ImageRamp’s Splitting and Extraction: (?<=\bCompany\s*Name\s+\b)[a-z0-9\(\) ]* … and check the “Split if Matched” option.
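
The splitting idea can be sketched outside ImageRamp in Python. This is not ImageRamp's implementation. It assumes each page is already available as OCR text, that a new document starts on every page where the pattern matches, and that matching is case-insensitive. Python's `re` module does not accept the variable-length lookbehind used above, so the sketch uses a capturing group instead. All names are hypothetical.

```python
import re

# Same idea as the slide's pattern, rewritten with a capturing group
# because Python's re module rejects variable-length lookbehind.
COMPANY_NAME = re.compile(r"\bCompany\s*Name\s+\b([a-z0-9() ]*)", re.IGNORECASE)


def split_on_company_name(pages):
    """Group page texts into documents, starting a new one on each match."""
    documents = []
    for text in pages:
        match = COMPANY_NAME.search(text)
        if match or not documents:
            # A match (or the very first page) opens a new document.
            name = match.group(1).strip() if match else None
            documents.append({"company": name, "pages": []})
        documents[-1]["pages"].append(text)
    return documents


if __name__ == "__main__":
    scan = [
        "Company Name Acme Ltd\nInvoice 1",
        "continued page",
        "Company Name Bolt (UK) Co\nInvoice 7",
    ]
    for doc in split_on_company_name(scan):
        print(doc["company"], len(doc["pages"]))
```

Pages without a match are appended to the current document, which is how a multipage invoice stays together.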
  • 31. Capturing the index data. Remember that in our example we identified CompanyName, CompanyNo, Date and SICcode as the index or metadata information we want to capture. Here we are extracting the date field using this regex in the Index Fields section of the Data Mining submenu: (?<Date>(?<=\bDate of this return\s+\b)\d{2}/\d{2}/\d{4})
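
For comparison, here is a hedged Python version of the date field. It assumes the label text “Date of this return” appears in the OCR output exactly as in the pattern. Python needs `(?P<Date>…)` for the named group and does not allow the variable-length lookbehind, so the label is matched outside the group. The sample text and function name are invented.

```python
import re

# The label is matched first; only the date itself is captured in the
# named group "Date" (dd/dd/dddd, as in the slide's pattern).
DATE_FIELD = re.compile(r"\bDate of this return\s+(?P<Date>\d{2}/\d{2}/\d{4})")


def extract_date(page_text: str):
    """Return the date string following the label, or None if absent."""
    match = DATE_FIELD.search(page_text)
    return match.group("Date") if match else None


if __name__ == "__main__":
    sample = "Company Number 01234567\nDate of this return 14/03/2012\nSIC Code 6201"
    print(extract_date(sample))  # 14/03/2012
```

Anchoring on the label rather than on the date format alone keeps other dates on the page from being picked up by mistake.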
  • 32. But wait, there’s more. Information extracted through text data mining with regex can also be used to name the file and create folders. Here %regex1 corresponds to the first regex field definition (CompanyName) and %regex2 corresponds to the second field definition (CompanyNo).
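
The placeholder substitution can be pictured with a short Python sketch. Only the `%regex1` and `%regex2` placeholder names come from the slide; the template layout, the character clean-up and the function name are assumptions for illustration, not ImageRamp behaviour.

```python
import re


def expand_template(template: str, fields) -> str:
    """Replace %regexN in template with the Nth extracted field (1-based)."""

    def substitute(match):
        index = int(match.group(1)) - 1
        if not 0 <= index < len(fields):
            # Unknown placeholder: leave it untouched.
            return match.group(0)
        # Assumption: replace characters that are unsafe in file names.
        return re.sub(r'[\\/:*?"<>|]', "_", fields[index]).strip()

    return re.sub(r"%regex(\d+)", substitute, template)


if __name__ == "__main__":
    # Hypothetical template: folder named after the company, file after its number.
    print(expand_template("%regex1/%regex2.pdf", ["Acme Ltd", "01234567"]))
    # Acme Ltd/01234567.pdf
```

Cleaning each value before it reaches the path matters because OCR output can contain characters, such as slashes, that would otherwise create unintended folders.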
  • 33. We hope we have demonstrated the immense power of using regular expressions to extract data from both structured and unstructured documents. Data in the palm of your hand…not locked in your documents! and…
  • 34. Learn more about… the power of ImageRamp and its other features including: Full text OCR to PDF; PDF rights management and encryption; Document naming, splitting, and routing based on barcodes and…; Image processing for clean up and adaptive thresholding; OCR (Optical Character Recognition); Barcode reading (1D and 2D). For more on: Data Mining PDF, Data mining Scans, Invoice Mining, Patient Record Mining, OCR mining, TIF mining, Extracting meta data, Data extraction from unstructured data, Intelligent data capture, Data extraction, Using regex to extract data, Document scanning, Extracting data, Extract meta data, Scanner software, Barcode recognition, OCR software, Capture tutorial, Pdf scanning, Scanning software, Indexing, Document indexing, Automated capture, Meta data, Scan to index, Batch Processing, Bulk scanning, Docufi, Imageramp, Data capture, Migration to document management.
  • 35. More?
  • 36. More? Further reading on Regular Expressions: http://en.wikipedia.org/wiki/Regular_expression http://regexlib.com/ http://www.regular-expressions.info/