Automated Data Capture Using Regular Expressions

Using Regular Expressions for Data Mining and Automated Data Capture and Indexing
Copyright © 2010 - 2013 DocuFi. All Rights Reserved

In a Document Management Environment

First: What is automated data capture or data mining?
Data Capture:
Just identifying and extracting information or data (sometimes called metadata) from scanned documents
Automated Data Capture:
Applying the principles of automation to data capture, silly!
This can also be called text data mining.

Why automate data capture?
Manual Data Capture is Expensive
and Time Consuming

Problems with manual data entry:
1. Security may be compromised if documents are taken off premises
2. A delay is introduced if documents are taken off premises
3. Compared to automated extraction, manual indexing is slow
4. Manual indexing doesn’t scale well with large projects
5. Manual indexing has the potential to introduce errors into the data

and…
There’s a Mountain of It!
Let’s take a look at just invoices for example…

According to an Aberdeen Group August 2010 report, 72 percent of received invoices are paper-based.

Companies responding to PayStream Advisors’ 2010 Invoice Automation Benchmarking survey indicated that they receive 77 percent of their invoices via paper.

and it’s expensive
An Aberdeen Group March 2012 publication estimates the cost of processing a single invoice at between $4.84 and $20.13.

So if e-invoicing (sending and receiving invoices electronically) is not an option (as it’s not for many), what?
“it is the front-end capture options…that introduce true performance gains. For example, respondents who have implemented front-end document capture (creating a scanned digital copy of a physical invoice to be used in the approval process) report invoice processing 34% faster than those who process invoices manually. Moving to the pure data end of the spectrum, companies that convert scanned documents into usable data (through optical character recognition or similar technologies), report a 26% faster processing time than those that work only with document images.”
---Aberdeen’s 2010 report

And, We All Know, Time is Money

Don’t forget we are using invoices only as an example. But, this could apply to patient records, legal documents, purchase orders…any document.

Now that you know this is all about money, let’s go back to the focus of this slideshow.

What are Regular Expressions or regex?
Regular expressions (regex) provide a fast and powerful method to search, extract and replace specific data found within scanned documents.
Regular expressions are essentially a special text string for describing a search pattern. You could think of regular expressions as extremely powerful wildcards.

What’s it look like?
A simple regular expression might look something like this: ^\s{1,3}[A-Z0-9]*XYZ

^
Start at the beginning of a string or line
\s{1,3}
Find a whitespace character (such as a space) that occurs between 1 and 3 times
[A-Z0-9]*
Find any character in the range A-Z and 0-9; the “*” is the instruction to find as many occurrences as possible.
XYZ
Find the literal characters “XYZ”

If we had the value “ AZR8987XYZ” in our document at the start of a line we would get a match whereas if we had “ AZR898XY” we would not.
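To try the pattern above for yourself, here is a minimal Python sketch. It assumes Python's built-in `re` module as the regex engine (the slides do not say which engine is used), and the helper name `line_matches` is made up for illustration.

```python
import re

# The slide's pattern: start of line, 1 to 3 whitespace characters,
# any run of A-Z / 0-9 characters, then the literal "XYZ".
PATTERN = re.compile(r"^\s{1,3}[A-Z0-9]*XYZ")


def line_matches(line):
    """Return True if the line begins with a value matching the pattern."""
    return PATTERN.search(line) is not None


print(line_matches(" AZR8987XYZ"))  # True: ends in the literal XYZ
print(line_matches(" AZR898XY"))    # False: the final Z is missing
```

Note that the same text without the leading space would not match either, because the pattern insists on at least one whitespace character at the start of the line.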

Huh?
Don’t worry, this is not a tutorial on writing regex. We just want to look at some examples and understand how regex can apply to data capture and indexing in a document management environment.

Regular expressions are extremely flexible and patterns can be constructed to match almost anything. For text commonly found in documents such as dates, SSNs, ZIP codes etc., patterns are freely available on the Internet.
Here are some examples:
Zip Codes
^(?!00000)(?<zip>(?<zip5>\d{5})(?:[ -](?=\d))?(?<zip4>\d{4})?)$
US Phone Number
^([0-9]( |-)?)?(\(?[0-9]{3}\)?|[0-9]{3})( |-)?([0-9]{3}( |-)?[0-9]{4}|[a-zA-Z0-9]{7})$
Credit Card
(^(4|5)\d{3}-?\d{4}-?\d{4}-?\d{4}|(4|5)\d{15})|(^(6011)-?\d{4}-?\d{4}-?\d{4}|(6011)-?\d{12})|(^((3\d{3}))-\d{6}-\d{5}|^((3\d{14})))
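As an illustration, the ZIP code pattern could be used from Python roughly as follows. This is a sketch under one assumption: Python's `re` module spells named groups as `(?P<name>...)` rather than `(?<name>...)`, so the pattern is adapted accordingly. The function name `parse_zip` is hypothetical.

```python
import re

# ZIP code pattern from the slide, with named groups rewritten in
# Python's (?P<name>...) syntax (an adaptation, not the slide's exact text).
ZIP = re.compile(
    r"^(?!00000)(?P<zip>(?P<zip5>\d{5})(?:[ -](?=\d))?(?P<zip4>\d{4})?)$"
)


def parse_zip(value):
    """Return (zip5, zip4) for a valid ZIP code, or None if it is invalid."""
    match = ZIP.match(value)
    if match is None:
        return None
    return match.group("zip5"), match.group("zip4")


print(parse_zip("12345-6789"))  # ('12345', '6789')
print(parse_zip("12345"))       # ('12345', None)
print(parse_zip("00000"))       # None
```

The named groups are what make the pattern handy for indexing: the 5-digit and 4-digit parts come back as separate fields.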

Real World Example
Here is a partial invoice where you might need to capture the “Catalogue Number”.

In order to start constructing a regular expression we have to use what we know from the data in front of us as well as making some assumptions. During testing we can refine the regular expression.
In this example we can assume from the document that the catalogue number has the format of a single uppercase letter, followed by 2 digits then a hyphen followed by a single uppercase letter and 6 digits or just 6 digits.

We could use the regex [A-Z]\d{2}-[A-Z]{0,1}\d{6} to extract the data. Let's again break this down:
[A-Z]
Find a character from A-Z; with no quantifier specification, “{}”, we are only looking for 1 character
\d{2}
Find exactly 2 digits
-
Find the literal character “-”
[A-Z]{0,1}
Find a character A-Z between 0 and 1 repetitions
\d{6}
Find exactly 6 digits
This is just one way of writing a regular expression for this example; there are various other ways it could be written. If we should subsequently find that the last portion of the catalogue number might contain 4 to 6 digits, we could simply amend it as follows: [A-Z]\d{2}-[A-Z]{0,1}\d{4,6}.
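Here is a small Python sketch of both versions of the catalogue number pattern. The sample catalogue numbers are invented for illustration (the invoice image is not reproduced here), and `find_catalogue_numbers` is a hypothetical helper name.

```python
import re

# Original pattern: letter, 2 digits, hyphen, optional letter, exactly 6 digits.
CATALOGUE = re.compile(r"[A-Z]\d{2}-[A-Z]{0,1}\d{6}")
# Amended pattern: the last portion may contain 4 to 6 digits.
CATALOGUE_RELAXED = re.compile(r"[A-Z]\d{2}-[A-Z]{0,1}\d{4,6}")


def find_catalogue_numbers(text, pattern=CATALOGUE):
    """Return every catalogue number found in the text."""
    return pattern.findall(text)


# Made-up sample text standing in for the OCR output of an invoice.
sample = "Catalogue Number: A12-B123456   Catalogue Number: C34-654321"
print(find_catalogue_numbers(sample))
print(find_catalogue_numbers("D56-X1234"))
print(find_catalogue_numbers("D56-X1234", CATALOGUE_RELAXED))
```

The second call finds nothing because the strict pattern needs 6 trailing digits; the amended pattern picks the value up. This is the kind of refinement that happens during testing.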

Now how would this work in a data capture solution?
We’ll take a look at how regex is used in ImageRamp Batch. It’s a simple-to-use folder processing tool that accelerates getting data and files into various EMR, Document Management or other secure storage environments. It can be used to capture and extract data in both structured and unstructured documents.
As an example, we might want to extract data from a scanned file with the following 4 fields:
•Company Name
•Company Number
•Date
•SIC Code

Here is the ImageRamp screen showing the scanned file pages and the data extracted using regex for the four fields we listed.

So where is the regex?
Hang on, we’ll show it. We’ll use it to split individual companies’ invoices from a multipage scan based on the Company Name and extract index data.
A company might use this to scan a large stack of invoices and split the file every time a new invoicing company name is located using the regex scripts.

Let’s break it down: splitting the scan stack.
First we are going to define the regex to perform document splitting when a new Company Name is located. This is done in ImageRamp’s Splitting and Extraction’s Data Mining submenu:
(?<=\bCompany\s*Name\s+\b)[a-z0-9\(\) ]*
… and check the “Split if Matched” option.
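To give a feel for what “Split if Matched” does, here is a rough Python sketch. It is not ImageRamp's implementation, just an illustration under stated assumptions: each page has already been OCR'd to text, matching is case-insensitive, and a new document starts on every page where the pattern matches. Python's `re` module does not accept a variable-width lookbehind like the one in the slide, so this version matches the “Company Name” label directly and captures the name in a group. The function name and sample pages are made up.

```python
import re

# Capture-group adaptation of the slide's lookbehind pattern (assumption:
# case-insensitive matching, since the character class is lowercase only).
COMPANY_NAME = re.compile(r"\bCompany\s*Name\s+([a-z0-9() ]*)", re.IGNORECASE)


def split_on_company(pages):
    """Group page texts into documents, starting a new one at each match."""
    documents = []
    for text in pages:
        match = COMPANY_NAME.search(text)
        if match or not documents:
            name = match.group(1).strip() if match else None
            documents.append({"company": name, "pages": []})
        documents[-1]["pages"].append(text)
    return documents


# Made-up OCR text for a three-page scan stack.
pages = [
    "Company Name ACME LTD\nInvoice 1",
    "continued page",
    "Company Name BETA (UK) 2\nInvoice 7",
]
for doc in split_on_company(pages):
    print(doc["company"], len(doc["pages"]))
```

Pages with no company name simply stay attached to the document that was opened by the previous match.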

Let’s break it down: capturing the index data.
Remember, in our example we identified CompanyName, CompanyNo, Date and SICcode as the index or metadata information we want to capture. So here we are extracting the date field using the regex in the Index Fields section of the Data Mining submenu.
(?<Date>(?<=\bDate of this return\s+\b)\d{2}/\d{2}/\d{4})
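The date extraction can be sketched in Python along these lines. Again this is an adaptation rather than the tool's own code: Python's `re` module needs `(?P<Date>...)` for the named group and does not allow the variable-width lookbehind, so the label is matched as ordinary text in front of the group. The function name and sample text are hypothetical.

```python
import re

# Adapted from the slide: the label is matched directly and the date is
# captured in a named group called "Date".
DATE_FIELD = re.compile(
    r"\bDate of this return\s+(?P<Date>\d{2}/\d{2}/\d{4})"
)


def extract_index_fields(text):
    """Return the named index fields found in the text (empty if none)."""
    match = DATE_FIELD.search(text)
    return match.groupdict() if match else {}


print(extract_index_fields("Date of this return 14/02/2012"))
# {'Date': '14/02/2012'}
```

Because the group is named, the result comes back already labelled with the index field it belongs to.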

But wait, there’s more.
Information extracted through text data mining with regex can also be used to name the file and create folders.
Here %regex1 corresponds to the first regex field definition (CompanyName) and %regex2 corresponds to the second field definition (CompanyNo).
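The naming idea can be illustrated with a short Python sketch. The slides only mention the %regex1 and %regex2 placeholders, so everything else here is an assumption: the template string, the sample values and the helper name `expand_template` are invented, and ImageRamp's real naming rules may differ.

```python
import re


def expand_template(template, values):
    """Replace %regexN with the Nth extracted value (1-based).

    Placeholders with no corresponding value are left unchanged.
    """
    def substitute(match):
        index = int(match.group(1)) - 1
        if 0 <= index < len(values):
            return values[index]
        return match.group(0)

    return re.sub(r"%regex(\d+)", substitute, template)


# Hypothetical extracted values: CompanyName first, CompanyNo second.
fields = ["ACME LTD", "01234567"]
print(expand_template("%regex1/%regex2.pdf", fields))
# ACME LTD/01234567.pdf
```

Here the first field becomes the folder name and the second becomes the file name.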

We hope we have demonstrated the immense power of using regular expressions to extract data from both structured and unstructured documents.
Data in the palm of your hand…not locked in your documents!
and…

For more on:
•Data mining PDF
•Data mining scans
•Invoice mining
•Patient record mining
•OCR mining
•TIF mining
•Extracting meta data
•Data extraction from unstructured data
•Intelligent data capture
•Data extraction
•Using regex to extract data
•Document scanning
•Extracting data
•Extract meta data
•Scanner software
•Barcode recognition
•OCR software
•Capture tutorial
•PDF scanning
•Scanning software
•Indexing
•Document indexing
•Automated capture
•Meta data
•Scan to index
•Batch processing
•Bulk scanning
•DocuFi
•ImageRamp
•Data capture
•Migration to document management
Learn more about…
the power of ImageRamp and its other features including:
•Full text OCR to PDF
•PDF rights management and encryption
•Document naming, splitting, and routing based on barcodes
•Image processing for clean up and adaptive thresholding
•OCR (Optical Character Recognition)
•Barcode reading (1D and 2D)

More?
Further reading on Regular Expressions:
http://en.wikipedia.org/wiki/Regular_expression
http://regexlib.com/
http://www.regular-expressions.info/

docufi.com
@imageramp
@docufinews
