AGI - Information Management Consultants shifted its focus in 2002 from corporate business libraries to academic libraries. The company builds intelligentCAPTURE, a turnkey application that fits the major library management systems. It combines knowledge about workflows, library standards and data capture technologies in a simple user interface. Its goal is to increase the indexing depth of library catalogues by mining the tables of contents of books and the abstracts of articles. Both are already highly condensed content provided by the author, written in natural language and often not in the language of the user. The application has a dialog for scanning parts of a printed publication. Handling the paper and selecting the relevant parts to be captured and processed is the only human work, 1 to 2 minutes on average. Everything else runs in parallel and automatically in the background. The printed text in the images is recognized with Abbyy FineReader Engine, a powerful AI application. The library systems provide the languages of the scanned documents. Based on that information, one or more of the 200 supported languages are selected automatically. The text analytics engine used is built for German, so machine translation via the Google API is built in between OCR output and text analytics. The linguistic text analytics engine extracts about 10 % of the table of contents or abstract as normalized descriptors, free descriptors, noun phrases, names and countries. Title, OCR text and all indexing results can finally be translated into one or more target languages and forwarded to the library management systems or other facilities.
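The processing order described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not AGI's implementation: every name (`Record`, `run_ocr`, `translate`, `extract_descriptors`, `process`) is hypothetical, the OCR, translation and analytics stages are placeholders, and the rule "translate only when the catalogue languages do not include German" is a simplification.

```python
from dataclasses import dataclass, field

# The source states that the text analytics engine is built for German.
ANALYTICS_LANGUAGE = "de"


@dataclass
class Record:
    """One scanned table of contents or abstract moving through the pipeline."""
    catalogue_languages: list          # language codes supplied by the library system
    ocr_text: str = ""
    analytics_input: str = ""
    descriptors: list = field(default_factory=list)
    steps: list = field(default_factory=list)  # trace of the stages that ran


def run_ocr(record, page_text):
    # Placeholder: a real system would hand page images to the OCR engine,
    # configured with the languages from the catalogue record.
    record.ocr_text = page_text
    record.steps.append("ocr:" + "+".join(record.catalogue_languages))


def translate(text, source, target):
    # Placeholder for machine translation; it only tags the text.
    return f"[{source}->{target}] {text}"


def prepare_for_analytics(record):
    # Assumption: translate only if German is not among the document languages.
    if ANALYTICS_LANGUAGE in record.catalogue_languages:
        record.analytics_input = record.ocr_text
    else:
        source = record.catalogue_languages[0]
        record.analytics_input = translate(record.ocr_text, source, ANALYTICS_LANGUAGE)
        record.steps.append(f"translate:{source}->{ANALYTICS_LANGUAGE}")


def extract_descriptors(record):
    # Placeholder for the linguistic engine: keep capitalised tokens.
    record.descriptors = [w for w in record.analytics_input.split() if w[:1].isupper()]
    record.steps.append("analytics")


def process(page_text, catalogue_languages):
    record = Record(catalogue_languages=list(catalogue_languages))
    run_ocr(record, page_text)
    prepare_for_analytics(record)
    extract_descriptors(record)
    return record
```

Calling `process("Introduction to libraries", ["en"])` records an extra translation step in `steps`, while a German document goes straight from OCR to analytics.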
AGI additionally provides a Lucene-based search application with built-in thesaurus searching. dandelon.com is an implementation of this search application, holding about 1.5 million thesaurus entries in up to 25 languages from multiple subject areas, such as MeSH for medicine, EUROVOC for political science, or thesauri developed by AGI clients. dandelon.com is public and grows by 6,000 books per month, including 1,000 ebooks. For ebooks, scanning and OCR are replaced by a PDF text extraction facility. Beyond this, complete articles, books or other kinds of papers can be processed and forwarded to data containers and search facilities.
8. Source documents: 1 x German + 1 x English
https://serveragi11.agi-imc.de/dandelon-download/attachments/dandelon/ids/AT001FBE1DC95177B89C0C1258250002DE121.PDF?contentDisposition=inline#search=AT001FBE1DC95177B89C0C1258250002DE121&pagemode=thumbs
https://serveragi11.agi-imc.de/dandelon-download/attachments/dandelon/ids/DEAGID984CFF9842FD48DC12581F10031617F.PDF?contentDisposition=inline#search=vision%20digitalisierte%20automobilindustrie%202030&pagemode=thumbs
18. 3. T1 Optical Character Recognition (OCR) - with Abbyy FineReader Engine
• 1. Document structure: where the text is and how it is arranged (text, tables, images)
• 2. Narrowing down to the character level and comparing patterns
• 3. Building probabilistic hypotheses about words and lines
• 4. Comparison with word patterns and dictionaries in 48 languages
• 5. Final decisions and export of text, fonts, graphics and layout in multiple formats
• Principles: Integrity, Purposefulness and Adaptability
• The principle of integrity says that the observed object must always be considered as a “whole” consisting of many interrelated parts. The principle of purposefulness supposes that any interpretation of data must always serve some purpose. And the principle of adaptability means that the program must be capable of self-learning.
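Steps 3 and 4 of the OCR slide (probabilistic hypotheses about words, then comparison with dictionaries) can be illustrated with a toy example. This is not the FineReader algorithm, whose internals are not described here; the function `best_word`, the probabilities and the `dictionary_bonus` factor are all assumptions made for illustration.

```python
from itertools import product


def best_word(char_hypotheses, dictionary, dictionary_bonus=2.0):
    """Pick the most plausible word from per-character candidates.

    char_hypotheses: one dict per character position, mapping a candidate
    character to its (assumed) recognition probability.
    dictionary: set of known lowercase words.
    """
    best, best_score = None, 0.0
    # Enumerate every combination of candidates; fine for short words only.
    for combo in product(*(h.items() for h in char_hypotheses)):
        word = "".join(ch for ch, _ in combo)
        score = 1.0
        for _, probability in combo:
            score *= probability
        # A dictionary hit makes the hypothesis more plausible.
        if word.lower() in dictionary:
            score *= dictionary_bonus
        if score > best_score:
            best, best_score = word, score
    return best, best_score


# The first glyph looks slightly more like "8" than "B".
hypotheses = [
    {"8": 0.55, "B": 0.45},
    {"u": 0.9, "n": 0.1},
    {"c": 0.7, "e": 0.3},
    {"h": 0.8, "b": 0.2},
]
word_with_dictionary, _ = best_word(hypotheses, {"buch"})
word_without_dictionary, _ = best_word(hypotheses, set())
```

With the dictionary, the word-level evidence overrides the slightly stronger character-level reading of the first glyph; without it, the purely character-based reading wins.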
19. 3. T3 Text Analytics (IAI Autoindex / CAI-Engine)
Linguistic and statistical (German)
• 1. Document heuristic: the first and last parts are more important than the long part in between.
• 2. Finding words, sentences, paragraphs
• 3. Analyzing each word, word variants, compound words, word types (60,000 German words with grammar rules)
• 4. Term weight according to word type, position in the sentence, sentence structure, number of occurrences, knowledge domain
• 5. Extraction of nouns, compound words, noun phrases, names (people, places), unknown words
• 6. Increase the term weight if the term is known in the built-in 220,000-word thesaurus
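Three of the heuristics above (position in the document, number of occurrences, thesaurus lookup) can be combined in a small weighting sketch. It is a simplified illustration, not the IAI Autoindex / CAI-Engine logic: the function name `weight_terms` and the parameters `edge_fraction`, `edge_boost` and `thesaurus_boost` are invented, and word types and sentence structure are ignored.

```python
from collections import Counter


def weight_terms(tokens, thesaurus, edge_fraction=0.2, edge_boost=1.5, thesaurus_boost=2.0):
    """Return (term, weight) pairs, highest weight first.

    Every occurrence adds 1.0 to a term's weight, or edge_boost if it lies
    in the first or last edge_fraction of the document. Terms found in the
    thesaurus are multiplied by thesaurus_boost afterwards.
    """
    n = len(tokens)
    edge = max(1, int(n * edge_fraction))
    weights = Counter()
    for i, token in enumerate(tokens):
        term = token.lower()
        in_edge = i < edge or i >= n - edge
        weights[term] += edge_boost if in_edge else 1.0
    for term in weights:
        if term in thesaurus:
            weights[term] *= thesaurus_boost
    return weights.most_common()


tokens = ["Bibliothek", "Katalog", "Index", "Katalog", "Scanner",
          "Index", "Index", "Katalog", "OCR", "Bibliothek"]
ranked = weight_terms(tokens, thesaurus={"bibliothek"})
```

In this example "Katalog" occurs three times, but "Bibliothek" ranks first because both of its occurrences sit at the edges of the text and it is known to the thesaurus.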
20. 3. T3. Translation - with Google Translation API
• Statistical model plus, since 2016, a neural network model, learning from bilingual text corpora
• Final translation: source language -> English -> target language
• https://en.wikipedia.org/wiki/Google_Translate
• https://en.wikipedia.org/wiki/Google_Neural_Machine_Translation
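The pivot step named on this slide (source language -> English -> target language) can be shown with a toy word-for-word example. Nothing here calls the Google API: the lexicons, `translate_direct` and `translate_via_pivot` are hypothetical stand-ins, and real machine translation works on whole sentences rather than single words.

```python
PIVOT = "en"

# Toy word-level lexicons standing in for a real translation service.
LEXICONS = {
    ("de", "en"): {"bibliothek": "library", "buch": "book"},
    ("en", "es"): {"library": "biblioteca", "book": "libro"},
}


def translate_direct(words, source, target):
    """Translate word by word; unknown words are passed through unchanged."""
    if source == target:
        return list(words)
    lexicon = LEXICONS[(source, target)]
    return [lexicon.get(word.lower(), word) for word in words]


def translate_via_pivot(words, source, target):
    """Translate source -> English -> target unless English is one endpoint."""
    if source == PIVOT or target == PIVOT:
        return translate_direct(words, source, target)
    in_english = translate_direct(words, source, PIVOT)
    return translate_direct(in_english, PIVOT, target)


spanish = translate_via_pivot(["Bibliothek", "Buch"], "de", "es")
```

One consequence of this design is that only language pairs involving English are needed, at the cost of errors from the first hop carrying over into the second.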
21. 3. T4 Semantic query expansion (millions of word-word relations)
• Apache Lucene document index with built-in terminology index from different domain thesauri (Lucene is embedded in IBM Domino)
• Personal collection of thesauri from 30 years of development
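Thesaurus-based query expansion can be sketched as below. The relations shown are invented examples, `expand_query` is a hypothetical helper, and the output merely follows the style of Lucene's classic query syntax (OR, AND, parentheses, quoted phrases); how dandelon.com actually builds its queries is not described in the source.

```python
def expand_query(query, thesaurus):
    """Expand each query term with its related thesaurus terms.

    thesaurus: dict mapping a lowercase term to a set of related terms.
    Returns a boolean query string in which each original term is OR-ed
    with its relations and the resulting groups are AND-ed together.
    """
    clauses = []
    for term in query.lower().split():
        variants = [term] + sorted(thesaurus.get(term, ()))
        # Multi-word relations must be quoted to be searched as phrases.
        quoted = ['"%s"' % v if " " in v else v for v in variants]
        if len(quoted) > 1:
            clauses.append("(" + " OR ".join(quoted) + ")")
        else:
            clauses.append(quoted[0])
    return " AND ".join(clauses)


# Invented sample relations; a real thesaurus holds far more entries.
sample_thesaurus = {
    "cancer": {"neoplasms", "tumor"},
    "infarction": {"heart attack"},
}
expanded = expand_query("cancer therapy", sample_thesaurus)
```

Here `expanded` is `(cancer OR neoplasms OR tumor) AND therapy`, so documents indexed only under a related term are still found.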
23. AGI - Information Management Consultants
Dipl.-Inf.wiss. Manfred Hauer M.A.
Mandelring 238 b
67433 Neustadt an der Weinstrasse
Germany
Tel +49 (0) 6321 / 9635-10
https://www.agi-imc.de
https://www.agi-imc.de/dandelonXR.nsf
Manfred.Hauer@agi-imc.de