Language Computer Corporation:
      Knowledge Supremacy through
   Customizable Text Extraction Products

          Andrew Hickl, CEO / President
         Language Computer Corporation
                December 2008
Language Computer Corporation (LCC)

•   “Boutique” provider of next‐generation natural language processing
    software solutions for Government and commercial customers
•   Founded 1995
•   Based in Richardson, Texas
•   25 developers and researchers
•   Strong track record:  top marks at more than 20 different Government
    evaluations since 1999
     – Question Answering (TREC, 1999‐2008)
     – Summarization (DUC, 2003‐2008)
     – Information Extraction (ACE, 2005‐2006)
     – Textual Inference (RTE, 2006‐2008)
A Brief History of LCC

•   1996‐2004:  Closed‐Domain Information Extraction
     – MUC / Tipster (precursors to ACE)
     – Grammar‐ or rule‐based systems
     – Entity Extraction (100+ types, English)
     – Relationship Extraction (50+ types, English)
     – Event Extraction (5‐8 types, English)

•   1999 – : First Automatic Question‐Answering Systems
     – TREC Question Answering evaluations
     – Factoid:  What is Britney Spears’s middle name?
     – Complex:  What impact did Hurricane Gustav have on the Dallas economy?
     – Yes/No:  Did Lindsay Lohan’s album reach #1?
     – How‐To:  How do I file an extension on my 2008 Federal Income Taxes?
     – Why:  Why did John McCain name Sarah Palin as his running mate?
     –
A Brief History of LCC

•   2002 – : Wide‐Coverage Entity Extraction System
     – Used a maximum entropy‐based framework to classify names in text into
       more than 350 different categories
         • English:  368 types
         • French, Spanish, German, Dutch, Russian, Japanese:  ~100 types
         • Arabic, Chinese, Farsi, Korean:  ~50 types
     – Dependent on sources of training data

•   2004 – :  First Open‐Domain, Customizable Event Extraction System
     – Uses active learning to leverage feedback gathered from a user
     – Allows users to define event extractors for any event of interest
     – Deployed for multiple languages:  English, Arabic, Chinese, Korean, Farsi
     – Completely ontology‐independent
A Brief History of LCC

•   2007 – :  First Customizable Information Extraction Systems
     – Allows users to define extractors for any entity, attribute, relationship, or
       sentiment / attitude expressed in text
     – Uses active learning to leverage feedback gathered from a user
     – Leverages automatic candidate generation techniques to find new instances
       for extractor training
     – Deployed for multiple languages:  English, Arabic, Chinese
     – Completely ontology‐independent

•   2007 – :  Truly Domain‐Independent Extraction
     – Allows extractors to maintain high levels of performance, regardless of
       training or testing domain
     – Reduces “overfitting” to a particular domain
     – Reduces “tag spam”:  overtagging of certain (frequent) categories in
       out‐of‐domain documents
A Brief History of LCC

•   2008 – :  First Automatic Dossier / Infobox Generation System
     – Learns what attributes and relationships are inherently relevant for an entity
       from information stored in unstructured text
     – Generates either Wikipedia‐style infoboxes or prose descriptions
       (a.k.a. “dossiers”) for each entity
     – Capable of analogizing from existing structured data resources or learning
       from feedback provided by users

•   2008 – :  Robust Textual Inference for NLP Applications
     – Deployed state‐of‐the‐art system for recognizing textual entailment to
       validate content stored in large databases
     – Developed temporal inference systems capable of accurately timestamping
       events mentioned in text / message traffic
Our Mission

     Provide customers with knowledge supremacy necessary to
             support analytic operations in any domain.


Make it easy (and cost‐effective) to unlock knowledge from collections 
           of unstructured text in any language or domain.


     Develop “game changing” search and discovery tools which 
                   turn knowledge into value.



           Build the premier information extraction brand.
Key Delineators

•   Scalable.
     – LCC’s entity, relationship, attribute, and event extraction tools provide access to
       more types of information than any other provider.
•   Customizable.
     – LCC’s customization framework allows content providers to add value to existing
       repositories quickly and cheaply.
•   Flexible.
     – LCC’s learning‐based extraction tools won’t degrade when run on “new” types of
       documents.
•   Deployable.
     – LCC offers distributable and parallelizable components which can be run
       in any environment, big or small.
•   Integrate‐able.
     – LCC’s products are designed to interoperate with a customer’s existing text and
       knowledge management tools.
•   Reliable.
     – 10+ years of excellence in providing USG customers with high‐tech
       NLP solutions that just work.
How do you achieve knowledge supremacy?

•   Wide Coverage (enough for most applications)
•   Customizable (in minutes, or less)
•   Trainable (by application builders or end‐users)
•   Domain Portable (with next to no human intervention)
•   Fast (enough to index TBs of text)
•   Manageable (demonstrated value‐add)

     Challenge: Is it possible to build an extraction system 
             which can learn hundreds of types?
Solving (part of) the Coverage Problem: CiceroLite

•   LCC’s wide‐coverage named entity recognizer, CiceroLite, categorizes 8
    high‐frequency NE classes with over 90% precision and recall.
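
For readers less familiar with the two metrics, here is a minimal sketch of how precision and recall are typically computed for named entity output. The function name `precision_recall`, the `(text, label)` mention format, and the sample data are assumptions made for illustration; they are not CiceroLite's API or results.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two collections of (text, label) mentions."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)  # mentions with the right span and label
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical gold annotations and system output (illustrative only).
gold = {
    ("Richardson", "LOCATION"),
    ("Andrew Hickl", "PERSON"),
    ("LCC", "ORGANIZATION"),
}
predicted = {
    ("Richardson", "LOCATION"),
    ("Andrew Hickl", "PERSON"),
    ("Texas", "PERSON"),  # wrong label, counts against precision
}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision is the share of the system's mentions that are correct; recall is the share of gold mentions that the system found.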




    But it’s capable of much more: the English language version of CiceroLite 
    can also categorize 368 different NE classes, including:
How do you achieve knowledge supremacy?

•   Wide Coverage (enough for most applications)
•   Customizable (in minutes, or less)
•   Trainable (by application builders or end‐users)
•   Domain Portable (with next to no human intervention)
•   Fast (enough to index TBs of text)
•   Manageable (demonstrated value‐add)


    Challenge: Is it possible to build an extraction system 
       which can allow users to create new extractors?
Introducing… CiceroCustom

•   CiceroCustom can be used to extract nearly any type of entity, attribute,
    relationship, or event information from text without the need for
    hand‐crafted rules or pre‐specified extraction templates.

•   Three steps to customized information extraction:
     – Step 1. Use CiceroCustom to define a customized extractor which specifies
       the type of information that a user is most interested in.
     – Step 2. Use the CiceroCustom GUI to “train” each extractor:
         • Mark instances as “relevant” or “irrelevant”
         • Correct annotations supplied by CiceroCustom
         • Accurate results seen after < 15 minutes of training
     – Step 3. Use extractors to extract information from new texts
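
The training step above can be pictured as an active learning loop. The sketch below is a generic uncertainty-sampling loop under stated assumptions, not CiceroCustom's actual implementation: the `TinyExtractor` class, the bag-of-words features, and the `oracle` callback (standing in for a user marking instances "relevant" or "irrelevant") are all hypothetical.

```python
import math


def features(text):
    """Bag-of-words features: the set of lowercase tokens in a candidate."""
    return set(text.lower().split())


class TinyExtractor:
    """Hypothetical stand-in for a trainable extractor: one weight per token."""

    def __init__(self):
        self.weights = {}

    def score(self, text):
        """Relevance score in (0, 1), a logistic squash of summed token weights."""
        z = sum(self.weights.get(tok, 0.0) for tok in features(text))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, text, relevant, rate=1.0):
        """Move token weights toward the user's judgment (one gradient step)."""
        error = (1.0 if relevant else 0.0) - self.score(text)
        for tok in features(text):
            self.weights[tok] = self.weights.get(tok, 0.0) + rate * error


def most_uncertain(extractor, pool):
    """Uncertainty sampling: pick the candidate whose score is closest to 0.5."""
    return min(pool, key=lambda text: abs(extractor.score(text) - 0.5))


def train(extractor, pool, oracle, rounds):
    """Query the user (the `oracle` callback) about the most uncertain candidates."""
    pool = list(pool)
    for _ in range(min(rounds, len(pool))):
        candidate = most_uncertain(extractor, pool)
        pool.remove(candidate)
        extractor.update(candidate, oracle(candidate))
    return extractor


# Illustrative candidates; the "user" marks acquisition sentences as relevant.
pool = [
    "Acme acquired Widgets Inc",
    "Acme hosted a picnic",
    "Globex acquired Initech",
    "Globex hosted a gala",
]
extractor = train(TinyExtractor(), pool, lambda text: "acquired" in text, rounds=4)
print(extractor.score("Initrode acquired Hooli") > 0.5)
```

The point of asking about the most uncertain candidate first is that each user judgment is spent where the model learns the most, which is one plausible reason a short training session can be enough.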
Traditional Text Extraction vs. CiceroCustom
                                          Traditional Extraction                              CiceroCustom
Ontology required?                        Fixed set of templates                              User‐defined templates
Techniques used?                          Heuristics / Classifiers                            Active Learning
Information considered?                   Limited to information found in a single sentence   Inter‐ and intra‐sentential extraction
Access to discourse information?          N/A                                                 Automatic Discourse Parsing
Domain portability?                       Domain‐Dependent                                    Domain‐Independent
Applicable to new genres?                 Performance degrades when applied to new genres     Robust performance across document genres
Representation of information?            Fixed, Immutable                                    Dynamically created
Discovery of new, essential information?  User                                                Automatic
Coreference?                              User                                                Automatic
Level of expertise required?              Extraction Experts                                  Any End User
Time to create extractors?                Days, Weeks                                         Minutes, Hours
CiceroCustom: Innovations

•   First open‐domain extraction system that can be customized in minutes
     – Active learning‐based framework makes it possible for novices to train high‐performance extractors
       in under an hour
     – Extractors can be refined / split / fused as needs change
•   State‐of‐the‐art inference‐based instance fusion
     – State‐of‐the‐art temporal, spatial, and textual inference components make it possible to fuse partial
       representations into coherent instances that can be used operationally
•   Automatic Discovery of Essential Information Related to Candidates
     – Rich semantic substrate helps extraction models identify all of the information needed for extraction
•   First Extraction System to Leverage Multiple Semantic Parsers
     – Combines dependency information from PropBank, NomBank, and FrameNet to automatically
       create semantic representations for entities, attributes, relationships, or events of interest
     – The first work leveraging semantic parsing for extraction was done at LCC (Surdeanu et al. 2003)
•   State‐of‐the‐Art Discourse Parsing
     – Identification of relations between sentences or events provides for greater recall of extractors
     – Extraction can go beyond a single sentence
What does it mean to be “domain portable”?

•   Performance of most learning‐based extraction systems (entity, event,
    etc.) suffers when trained and tested on different types of documents
•   Most IE systems suffer degradation of more than 30% when ported to new
    domains (e.g. newswire to message traffic)

•   LCC is pioneering new unsupervised and lightly‐supervised approaches to
    reduce the amount of degradation observed when testing on
    out‐of‐domain documents



             With ~15 minutes of input from a user, 
          LCC reduces extractor error by an average of 25%.
Performance Profile: 2 GHz, single core, 2 GB RAM
LCC Text Processing Cycle
[Diagram: processing cycle with the stages Info Need, Data Collection, Processing, Analysis,
 Situational Awareness, and Analytic Output. Capabilities labeled around the cycle:]
     – Question Answering, Semantic Search, Keyword Expansion
     – Data Ingestion & Indexing
     – Named Entity Recognition, Information Extraction, Coreference Resolution
     – Predictive Analysis, Socio-Cultural Analysis, Dossier Generation
     – Geocoding, Spatial Inference, Timestamping, Temporal Inference
     – Open APIs, Web Services, Java RMI
Dossier Generation (2009)

•   Need for tools which can automatically assemble
    high‐quality knowledge resources from information
    extracted from text

•   LCC is developing an integrated, unsupervised
    Dossier Generation capability which can assemble
    relevant entity profiles (either as unstructured text or
    Intellipedia‐style structured infoboxes)
     – Hundreds of Entity, Relation, Attribute, and Event types
     – Implicit Relations from Data Mining Systems
     – Normalized Dates / Times / Locations
     – Learning‐based relevance detection algorithms capable
       of learning what’s relevant for each individual or
       category of individuals
Database Validation (2009)
[Diagram with the components Information Retrieval, Commitment Extraction, and Content Validation.
 Example statements shown:]
     – The attack took place in the morning.
     – The attack killed 2 caretakers.
     – The attack damaged 50 cars.
     – The attack damaged 20 buildings.
     – The mosque was in Mariengasse.
     – Anas Shakfeh said the attack was a protest by rightist circles against the Islam conference.
Knowledge Acquisition for Link Analysis (2009)
[Diagram of a feedback loop with the following labels:]
     – Inputs:  Entity Extraction, Relationship Extraction, Event Extraction, Untyped Dependency Extraction
     – Semantic Triples
     – Graph Population
     – Graph Edges
     – Inference Enrichment
     – Candidate Relations
     – Entailment Validation
     – Weights, Pruning
     – Model Feedback
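
As a rough illustration of the "Semantic Triples" to "Graph Population" step named in this diagram, the sketch below loads (subject, relation, object) triples into an adjacency-list graph. The function names `populate_graph` and `neighbors`, the relation labels, and the sample triples are hypothetical and are not taken from LCC's system.

```python
from collections import defaultdict


def populate_graph(triples):
    """Build an adjacency list: subject -> list of (relation, object) edges."""
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        graph[subject].append((relation, obj))
    return graph


def neighbors(graph, entity, relation=None):
    """Return objects linked from `entity`, optionally filtered by relation."""
    return [
        obj
        for rel, obj in graph.get(entity, [])
        if relation is None or rel == relation
    ]


# Illustrative triples such as an extraction pipeline might emit.
triples = [
    ("LCC", "based_in", "Richardson"),
    ("LCC", "led_by", "Andrew Hickl"),
    ("Richardson", "located_in", "Texas"),
]
graph = populate_graph(triples)
print(neighbors(graph, "LCC", relation="based_in"))
```

Each triple becomes one labeled edge, so later steps such as pruning or inference can operate on the graph rather than on raw text.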
LCC Services

•   Custom End‐to‐End Application Development
•   Custom Component Development
•   Corporate R&D
•   Production Services
•   Data Verification Services
•   Support and Maintenance
Who is LCC’s customer base?

•   Target Markets
     – Government, Intelligence, and Defense
     – Commodity Search Providers
     – Company, Credit, and Financial Information
     – News and Trade Publishers
     – General Aggregators and Distributors
     – Pharma

•   Emerging Markets
     – Legal
     – CRM
     – Supply Chain Management
     – Business Intelligence Providers
     – Healthcare
Who are LCC’s partners?

•   Strategic Partners
     – Application Developers (with complementary S&D interests)
     – Visualization Developers
     – Commodity Search Providers
     – Mobile App Developers

•   Technology Partners
     – Extraction Providers
     – Data Mining Providers
     – Database Providers
     – Inference Providers

•   Integration Partners
     – Large Government integrators with access to customers,
       systems of record
     – Large software vendors with interest in extraction technology

•   Channel Partners
     – Content Providers
         • News
         • Education
         • Financial
         • Business Intelligence
CiceroLite

•   High‐performance named entity recognition for multiple languages

•   Languages supported:
     – English (3/2009: > 1000 types)
     – Spanish, French, Dutch, German, Russian, Japanese (~100 types)
     – Arabic, Chinese, Farsi, Korean (~50 types)

•   Available as a server or standalone application
PinPoint

•   Geocoding of more than 10M place names
     – Absolute Expressions
     – Relative Expressions
     – Street Addresses
     – Latitude / Longitude or MGRS

•   Timestamping for events and event‐denoting nominals
     – Absolute Expressions
     – Relative Expressions
     – Duration Estimation

•   Available as a server app only
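
To make "relative expressions" concrete, here is a minimal sketch of resolving a few relative time expressions against a document date. It is a toy rule set written for illustration, not PinPoint's method; the function name `resolve_relative` and the small set of supported phrasings are assumptions.

```python
from datetime import date, timedelta

# Single-word expressions and their offsets in days from the anchor date.
OFFSETS = {"today": 0, "yesterday": -1, "tomorrow": 1}

# Length in days of the units accepted in "N <unit> ago".
UNIT_DAYS = {"day": 1, "days": 1, "week": 7, "weeks": 7}


def resolve_relative(expression, anchor):
    """Map a relative expression to a date, or None if it is not recognized."""
    text = expression.lower().strip()
    if text in OFFSETS:
        return anchor + timedelta(days=OFFSETS[text])
    words = text.split()
    # Handle the pattern "N days ago" / "N weeks ago".
    if len(words) == 3 and words[2] == "ago" and words[0].isdigit():
        unit = UNIT_DAYS.get(words[1])
        if unit is not None:
            return anchor - timedelta(days=int(words[0]) * unit)
    return None


# The anchor would normally be the document's publication or message date.
anchor = date(2008, 12, 1)
print(resolve_relative("yesterday", anchor))
print(resolve_relative("2 weeks ago", anchor))
```

An absolute expression such as "December 1, 2008" needs no anchor, whereas a relative one is only meaningful once the document date is known, which is why the anchor is an explicit parameter here.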
CiceroCustom

•   Open‐domain, customizable:
     – Entity Extraction
     – Attribute Extraction
     – Relationship Extraction
     – Event Extraction

•   Foreign language support:
     – Arabic, Chinese

•   Available as a server or standalone application
IndexManager

•   Distributable annotation and indexing that’s compatible with
    all of LCC’s products
•   Can index annotations from multiple providers into a single
    open‐standard index format

•   Document formats supported:
    .xml, .html, .pdf, .doc, .ppt, .txt, e‐mail, etc.

•   Available as a server or a desktop application
Sentiment Tracking

•   Identifies sentiment, opinions, and other subjective attitudes
    held by individuals towards any of a set of “target” products or
    issues.

•   Only available for English

•   Only available as a server app

•   Can be run with LCC’s indexes or any standard Apache Lucene index.
Ferret

•   State‐of‐the‐art question answering for factoid, list, and
    complex questions

•   Language Support:
     – English, Arabic, Chinese, Farsi, Korean, Turkish, Spanish, French,
       Dutch, German, Japanese

•   Available as a server or standalone application
•   Can be run with LCC’s indexes or any standard Apache Lucene index.
GistTexter

•   Summarization for document clusters or search results

•   Language Support:
     – English, Arabic, Chinese, Farsi, and Korean

•   Available as a server or standalone application
•   Can be run with LCC’s indexes or any standard Apache Lucene index.
For More Information

•   For more information, contact us:
     – Andrew Hickl, CEO/President
       andy@languagecomputer.com
       tel:  (972) 231‐0052, Extension 114
       cell:  (858) 366‐8424

•   Websites:
     – Corporate:  http://www.languagecomputer.com
     – Labs:  http://labs.languagecomputer.com
     – Online Demos:  http://www.getferret.com

More Related Content

Viewers also liked

Flax Awareness Society
Flax Awareness Society Flax Awareness Society
Flax Awareness Society Om Verma
 
ARKNAV Company profile 2014
ARKNAV Company profile 2014ARKNAV Company profile 2014
ARKNAV Company profile 2014Aileen Marshall
 
Curso De Coaching
Curso De CoachingCurso De Coaching
Curso De Coachingguest30412
 
La Lettre du Maire - Juin 2012 - Nuémro spécial Eglise Saint Martin
La Lettre du Maire - Juin 2012 - Nuémro spécial Eglise Saint MartinLa Lettre du Maire - Juin 2012 - Nuémro spécial Eglise Saint Martin
La Lettre du Maire - Juin 2012 - Nuémro spécial Eglise Saint Martinmairiebreuillevert
 
Drop blækket - brug computeren. Dansk IT gå-hjem-møde 3. marts 2016
Drop blækket - brug computeren. Dansk IT gå-hjem-møde 3. marts 2016Drop blækket - brug computeren. Dansk IT gå-hjem-møde 3. marts 2016
Drop blækket - brug computeren. Dansk IT gå-hjem-møde 3. marts 2016Bjarke Alling
 
E health ses-eng
E health ses-engE health ses-eng
E health ses-engLuis Lozano
 
Memes | Locos del Social Media
Memes | Locos del Social MediaMemes | Locos del Social Media
Memes | Locos del Social MediaPepe Romera
 
El templo del santo grial
El templo del santo grialEl templo del santo grial
El templo del santo grialobatala39
 
Experiencias sobre la puesta en marcha de un proyecto de construcción de una ...
Experiencias sobre la puesta en marcha de un proyecto de construcción de una ...Experiencias sobre la puesta en marcha de un proyecto de construcción de una ...
Experiencias sobre la puesta en marcha de un proyecto de construcción de una ...ExternalEvents
 
9.Mothers day slide share By Mr Allah Dad Khan Visiting Professor he Univer...
9.Mothers day slide share   By Mr Allah Dad Khan Visiting Professor he Univer...9.Mothers day slide share   By Mr Allah Dad Khan Visiting Professor he Univer...
9.Mothers day slide share By Mr Allah Dad Khan Visiting Professor he Univer...Mr.Allah Dad Khan
 
Fo Ing025 A Martinrea Estampados Die Spec
Fo Ing025 A Martinrea Estampados Die SpecFo Ing025 A Martinrea Estampados Die Spec
Fo Ing025 A Martinrea Estampados Die Specguest2f0309
 
Mk experiencial laura galvez navarro
Mk experiencial laura galvez navarroMk experiencial laura galvez navarro
Mk experiencial laura galvez navarrolauragalna
 

Viewers also liked (20)

Valdivia
ValdiviaValdivia
Valdivia
 
Flax Awareness Society
Flax Awareness Society Flax Awareness Society
Flax Awareness Society
 
ARKNAV Company profile 2014
ARKNAV Company profile 2014ARKNAV Company profile 2014
ARKNAV Company profile 2014
 
Webquest mejorada 5
Webquest mejorada 5Webquest mejorada 5
Webquest mejorada 5
 
Curso De Coaching
Curso De CoachingCurso De Coaching
Curso De Coaching
 
La Lettre du Maire - Juin 2012 - Nuémro spécial Eglise Saint Martin
La Lettre du Maire - Juin 2012 - Nuémro spécial Eglise Saint MartinLa Lettre du Maire - Juin 2012 - Nuémro spécial Eglise Saint Martin
La Lettre du Maire - Juin 2012 - Nuémro spécial Eglise Saint Martin
 
Examen entel carlos cornejo benjamin reed
Examen entel  carlos cornejo   benjamin reedExamen entel  carlos cornejo   benjamin reed
Examen entel carlos cornejo benjamin reed
 
Drop blækket - brug computeren. Dansk IT gå-hjem-møde 3. marts 2016
Drop blækket - brug computeren. Dansk IT gå-hjem-møde 3. marts 2016Drop blækket - brug computeren. Dansk IT gå-hjem-møde 3. marts 2016
Drop blækket - brug computeren. Dansk IT gå-hjem-møde 3. marts 2016
 
Jornada de metrologia
Jornada de metrologiaJornada de metrologia
Jornada de metrologia
 
Resume TT Deutsch.pdf.docx
Resume TT Deutsch.pdf.docxResume TT Deutsch.pdf.docx
Resume TT Deutsch.pdf.docx
 
Comandos redes
Comandos  redesComandos  redes
Comandos redes
 
E health ses-eng
E health ses-engE health ses-eng
E health ses-eng
 
Memes | Locos del Social Media
Memes | Locos del Social MediaMemes | Locos del Social Media
Memes | Locos del Social Media
 
Consejosgatos pdf
Consejosgatos pdfConsejosgatos pdf
Consejosgatos pdf
 
El templo del santo grial
El templo del santo grialEl templo del santo grial
El templo del santo grial
 
Experiencias sobre la puesta en marcha de un proyecto de construcción de una ...
Experiencias sobre la puesta en marcha de un proyecto de construcción de una ...Experiencias sobre la puesta en marcha de un proyecto de construcción de una ...
Experiencias sobre la puesta en marcha de un proyecto de construcción de una ...
 
9.Mothers day slide share By Mr Allah Dad Khan Visiting Professor he Univer...
9.Mothers day slide share   By Mr Allah Dad Khan Visiting Professor he Univer...9.Mothers day slide share   By Mr Allah Dad Khan Visiting Professor he Univer...
9.Mothers day slide share By Mr Allah Dad Khan Visiting Professor he Univer...
 
Fo Ing025 A Martinrea Estampados Die Spec
Fo Ing025 A Martinrea Estampados Die SpecFo Ing025 A Martinrea Estampados Die Spec
Fo Ing025 A Martinrea Estampados Die Spec
 
Mk experiencial laura galvez navarro
Mk experiencial laura galvez navarroMk experiencial laura galvez navarro
Mk experiencial laura galvez navarro
 
Desayuno con diamantes
Desayuno con diamantesDesayuno con diamantes
Desayuno con diamantes
 

Similar to Language Computer Corporation: Text Extraction Profile

Cascading concurrent yahoo lunch_nlearn
Cascading concurrent   yahoo lunch_nlearnCascading concurrent   yahoo lunch_nlearn
Cascading concurrent yahoo lunch_nlearnCascading
 
Engineering Effectiveness
Engineering EffectivenessEngineering Effectiveness
Engineering EffectivenessMarcio Sete
 
IC-SDV 2019: Down-to-earth machine learning: What you always wanted your data...
IC-SDV 2019: Down-to-earth machine learning: What you always wanted your data...IC-SDV 2019: Down-to-earth machine learning: What you always wanted your data...
IC-SDV 2019: Down-to-earth machine learning: What you always wanted your data...Dr. Haxel Consult
 
Logitech - LOGITECH ACCELERATES CLOUD ANALYTICS USING DATA VIRTUALIZATION
Logitech - LOGITECH ACCELERATES CLOUD ANALYTICS USING DATA VIRTUALIZATIONLogitech - LOGITECH ACCELERATES CLOUD ANALYTICS USING DATA VIRTUALIZATION
Logitech - LOGITECH ACCELERATES CLOUD ANALYTICS USING DATA VIRTUALIZATIONAvinash Deshpande
 
Democratizing Apache Spark for the Enterprise with Jonathan Gole
Democratizing Apache Spark for the Enterprise with Jonathan GoleDemocratizing Apache Spark for the Enterprise with Jonathan Gole
Democratizing Apache Spark for the Enterprise with Jonathan GoleDatabricks
 
"Updates on Semantic Fingerprinting", Francisco Webber, Inventor and Co-Found...
"Updates on Semantic Fingerprinting", Francisco Webber, Inventor and Co-Found..."Updates on Semantic Fingerprinting", Francisco Webber, Inventor and Co-Found...
"Updates on Semantic Fingerprinting", Francisco Webber, Inventor and Co-Found...Dataconomy Media
 
Applying Web 2.0 Concepts to Your Business
Applying Web 2.0 Concepts to Your BusinessApplying Web 2.0 Concepts to Your Business
Applying Web 2.0 Concepts to Your Businessdigitalev
 
The information supernova
The information supernovaThe information supernova
The information supernovaAlaa Al-Agamawi
 
Oracle analytics cloud overview feb 2017
Oracle analytics cloud overview   feb 2017Oracle analytics cloud overview   feb 2017
Oracle analytics cloud overview feb 2017aioughydchapter
 
Transforming to OpenStack: a sample roadmap to DevOps
Transforming to OpenStack: a sample roadmap to DevOpsTransforming to OpenStack: a sample roadmap to DevOps
Transforming to OpenStack: a sample roadmap to DevOpsNicolas (Nick) Barcet
 
Accelerate Big Data Application Development with Cascading
Accelerate Big Data Application Development with CascadingAccelerate Big Data Application Development with Cascading
Accelerate Big Data Application Development with CascadingCascading
 
A7 storytelling with_oracle_analytics_cloud
A7 storytelling with_oracle_analytics_cloudA7 storytelling with_oracle_analytics_cloud
A7 storytelling with_oracle_analytics_cloudDr. Wilfred Lin (Ph.D.)
 
Programming Language Selection
Programming Language SelectionProgramming Language Selection
Programming Language SelectionDhananjay Nene
 
Radio Engage Presentation
Radio Engage PresentationRadio Engage Presentation
Radio Engage PresentationAmie Forest
 
Netex learningMaker | Authoring tool for HTML5 e-learning content [EN]
Netex learningMaker | Authoring tool for HTML5 e-learning content [EN]Netex learningMaker | Authoring tool for HTML5 e-learning content [EN]
Netex learningMaker | Authoring tool for HTML5 e-learning content [EN]Netex Learning
 
Requirementv4
Requirementv4Requirementv4
Requirementv4stat
 
What “Model” DITA Specializations Can Teach About Information Modelinc
What “Model” DITA Specializations Can Teach About Information ModelincWhat “Model” DITA Specializations Can Teach About Information Modelinc
What “Model” DITA Specializations Can Teach About Information ModelincDon Day
 
Webinar: Comparing DataStax Enterprise with Open Source Apache Cassandra
Webinar: Comparing DataStax Enterprise with Open Source Apache CassandraWebinar: Comparing DataStax Enterprise with Open Source Apache Cassandra
Webinar: Comparing DataStax Enterprise with Open Source Apache CassandraDataStax
 

Similar to Language Computer Corporation: Text Extraction Profile (20)

Cascading concurrent yahoo lunch_nlearn
Cascading concurrent   yahoo lunch_nlearnCascading concurrent   yahoo lunch_nlearn
Cascading concurrent yahoo lunch_nlearn
 
Engineering Effectiveness
Engineering EffectivenessEngineering Effectiveness
Engineering Effectiveness
 
IC-SDV 2019: Down-to-earth machine learning: What you always wanted your data...

Language Computer Corporation: Text Extraction Profile

  • 1. Language Computer Corporation: Knowledge Supremacy through Customizable Text Extraction Products Andrew Hickl, CEO / President Language Computer Corporation December 2008
  • 2. Language Computer Corporation (LCC)
      • “Boutique” provider of next-generation natural language processing software solutions for Government and commercial customers
      • Founded 1995
      • Based in Richardson, Texas
      • 25 developers and researchers
      • Strong track record: top marks at more than 20 different Government evaluations since 1999
          – Question Answering (TREC, 1999-2008)
          – Summarization (DUC, 2003-2008)
          – Information Extraction (ACE, 2005-2006)
          – Textual Inference (RTE, 2006-2008)
  • 3. A Brief History of LCC
      • 1996-2004: Closed-Domain Information Extraction
          – MUC / Tipster (precursors to ACE)
          – Grammar- or rule-based systems
          – Entity Extraction (100+ types, English)
          – Relationship Extraction (50+ types, English)
          – Event Extraction (5-8 types, English)
      • 1999 – : First Automatic Question-Answering Systems
          – TREC Question Answering evaluations
          – Factoid: What is Britney Spears’s middle name?
          – Complex: What impact did Hurricane Gustav have on the Dallas economy?
          – Yes/No: Did Lindsay Lohan’s album reach #1?
          – How-To: How do I file an extension on my 2008 Federal Income Taxes?
          – Why: Why did John McCain name Sarah Palin as his running mate?
  • 4. A Brief History of LCC
      • 2002 – : Wide-Coverage Entity Extraction System
          – Used a maximum entropy-based framework to categorize more than 350 different name categories in text
              • English: 368 types
              • French, Spanish, German, Dutch, Russian, Japanese: ~100 types
              • Arabic, Chinese, Farsi, Korean: ~50 types
          – Dependent on sources of training data
      • 2004 – : First Open-Domain, Customizable Event Extraction System
          – Used active learning to leverage feedback gathered from a user
          – Allows users to define event extractors for any event of interest
          – Deployed for other languages: English, Arabic, Chinese, Korean, Farsi
          – Completely ontology-independent
  • 5. A Brief History of LCC
      • 2007 – : First Customizable Information Extraction Systems
          – Allows users to define extractors for any entity, attribute, relationship, or sentiment / attitude expressed in text
          – Used active learning to leverage feedback gathered from a user
          – Leverages automatic candidate generation techniques to find new instances for extractor training
          – Deployed for other languages: English, Arabic, Chinese
          – Completely ontology-independent
      • 2007 – : Truly Domain-Independent Extraction
          – Allows extractors to maintain high levels of performance, regardless of training or testing domain
          – Reduces “overfitting” to particular domain
          – Reduces “tag spam”: overtagging of certain (frequent) categories in out-of-domain documents
  • 6. A Brief History of LCC
      • 2008 – : First Automatic Dossier / Infobox Generation System
          – Learns what attributes and relationships are inherently relevant for an entity from information stored in unstructured text
          – Generates either Wikipedia-style infoboxes or prose descriptions (a.k.a. “dossiers”) for each entity
          – Capable of analogizing from existing structured data resources or learning from feedback provided by users
      • 2008 – : Robust Textual Inference for NLP Applications
          – Deployed state-of-the-art system for recognizing textual entailment to validate content stored in large databases
          – Developed temporal inference systems capable of accurately timestamping events mentioned in text / message traffic
  • 7. Our Mission
      • Provide customers with knowledge supremacy necessary to support analytic operations in any domain.
      • Make it easy (and cost-effective) to unlock knowledge from collections of unstructured text in any language or domain.
      • Develop “game changing” search and discovery tools which turn knowledge into value.
      • Build the premier information extraction brand.
  • 8. Key Delineators
      • Scalable.
          – LCC’s entity, relationship, attribute, and event extraction tools provide access to more types of information than any other provider.
      • Customizable.
          – LCC’s customization framework allows content providers to add value to existing repositories quickly and cheaply.
      • Flexible.
          – LCC’s learning-based extraction tools won’t degrade when run on “new” types of documents.
      • Deployable.
          – LCC offers distributable and parallelizable components which can be run in any environment, big or small.
      • Integrate-able.
          – LCC’s products are designed to interoperate with a customer’s existing text and knowledge management tools.
      • Reliable.
          – 10+ years of excellence in providing USG customers with high-tech NLP solutions that just work.
  • 9. How do you achieve knowledge supremacy?
      • Wide Coverage (enough for most applications)
      • Customizable (in minutes, or less)
      • Trainable (by application builders or end-users)
      • Domain Portable (with next to no human intervention)
      • Fast (enough to index TBs of text)
      • Manageable (demonstrated value-add)
      Challenge: Is it possible to build an extraction system which can learn hundreds of types?
  • 10. Solving (part of) the Coverage Problem: CiceroLite
      • LCC’s wide-coverage named entity recognizer, CiceroLite, categorizes 8 high-frequency NE classes with over 90% precision and recall.
      • But it’s capable of much more: the English language version of CiceroLite can also categorize 368 different NE classes, including:
  • 11. How do you achieve knowledge supremacy?
      • Wide Coverage (enough for most applications)
      • Customizable (in minutes, or less)
      • Trainable (by application builders or end-users)
      • Domain Portable (with next to no human intervention)
      • Fast (enough to index TBs of text)
      • Manageable (demonstrated value-add)
      Challenge: Is it possible to build an extraction system which can allow users to create new extractors?
  • 12. Introducing… CiceroCustom
      • CiceroCustom can be used to extract nearly any type of entity, attribute, relationship, or event information from text without the need for hand-crafted rules or pre-specified extraction templates.
      • Three steps to customized information extraction:
          – Step 1. Use CiceroCustom to define a customized extractor which specifies the type of information that a user is most interested in.
          – Step 2. Use the CiceroCustom GUI to “train” each extractor:
              • Mark instances as “relevant” or “irrelevant”
              • Correct annotations supplied by CiceroCustom
              • Accurate results seen after < 15 minutes of training
          – Step 3. Use extractors to extract information from new texts
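The "relevant" / "irrelevant" feedback loop in Step 2 can be illustrated with a minimal sketch of uncertainty-sampling active learning. This is not CiceroCustom's implementation, which the slides do not describe in detail; the bag-of-words features, the small logistic model, and every name below (`features`, `score`, `train`, `active_learning`, `ask_user`) are assumptions made only for illustration.

```python
import math
from collections import defaultdict


def features(text):
    """Bag-of-words features: the lowercase tokens present in a candidate."""
    return set(text.lower().split())


def score(weights, text):
    """Probability that a candidate is relevant under a logistic model."""
    z = sum(weights[f] for f in features(text))
    return 1.0 / (1.0 + math.exp(-z))


def train(labeled, epochs=50, lr=0.5):
    """Fit logistic-regression weights with plain stochastic gradient steps."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for text, label in labeled:
            error = label - score(weights, text)
            for f in features(text):
                weights[f] += lr * error
    return weights


def active_learning(seed, pool, ask_user, rounds=3):
    """Each round, ask the user about the candidate the model is least sure of."""
    labeled = list(seed)
    pool = list(pool)
    for _ in range(min(rounds, len(pool))):
        weights = train(labeled)
        # Most uncertain candidate: predicted probability closest to 0.5.
        query = min(pool, key=lambda t: abs(score(weights, t) - 0.5))
        pool.remove(query)
        labeled.append((query, ask_user(query)))  # 1 = relevant, 0 = irrelevant
    return train(labeled), labeled


# Hypothetical data: a simulated user who marks anything mentioning an attack.
seed = [("bomb attack killed two", 1), ("stock prices rose today", 0)]
pool = ["attack damaged fifty cars", "prices fell sharply",
        "mosque attack in morning", "weather was sunny"]
weights, labeled = active_learning(seed, pool, lambda t: int("attack" in t))
```

Here `ask_user` stands in for the GUI step where a person marks an instance; querying the least certain candidate is one common way to get useful feedback from a short training session.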
  • 13. Traditional Text Extraction vs. CiceroCustom
      Question | Traditional Extraction | CiceroCustom
      Ontology required? | Fixed set of templates | User-defined templates
      Techniques used? | Heuristics / Classifiers | Active Learning
      Information considered? | Limited to information found in a single sentence | Inter- and intra-sentential extraction
      Access to discourse information? | N/A | Automatic Discourse Parsing
      Domain portability? | Domain-Dependent | Domain-Independent
      Applicable to new genres? | Performance degrades when applied to new genres | Robust performance across document genres
      Representation of information? | Fixed, Immutable | Dynamically created
      Discovery of new, essential information? | User | Automatic
      Coreference? | User | Automatic
      Level of expertise required? | Extraction Experts | Any End User
      Time to create extractors? | Days, Weeks | Minutes, Hours
  • 14. CiceroCustom: Innovations
      • First open-domain extraction system that can be customized in minutes
          – Active learning-based framework makes it possible for novices to train high-performance extractors in under an hour
          – Extractors can be refined / split / fused as needs change
      • State-of-the-art inference-based instance fusion
          – State-of-the-art temporal, spatial, and textual inference components make it possible to fuse partial representations into coherent instances that can be used operationally
      • Automatic Discovery of Essential Information Related to Candidates
          – Rich semantic substrate helps extraction models identify all of the information needed for extraction
      • First Extraction System to Leverage Multiple Semantic Parsers
          – Combines dependency information from PropBank, NomBank, and FrameNet to automatically create semantic representations for entities, attributes, relationships, or events of interest
          – First work done leveraging semantic parsing for extraction done at LCC: (Surdeanu et al. 2003)
      • State-of-the-Art Discourse Parsing
          – Identification of relations between sentences or events provides for greater recall of extractors
          – Extraction can go beyond a single sentence
  • 15. How do you achieve knowledge supremacy?
      • Wide Coverage (enough for most applications)
      • Customizable (in minutes, or less)
      • Trainable (by application builders or end-users)
      • Domain Portable (with next to no human intervention)
      • Fast (enough to index TBs of text)
      • Manageable (demonstrated value-add)
  • 16. What does it mean to be “domain portable”?
      • Performance of most learning-based extraction systems (entity, event, etc.) suffers when trained and tested on different types of documents
          – Most IE systems suffer degradation of more than 30% when ported to new domains (e.g. newswire to message traffic)
      • LCC is pioneering new unsupervised and lightly-supervised approaches to reduce the amount of degradation observed when testing on out-of-domain documents
      With ~15 minutes of input from a user, LCC reduces extractor error by an average of 25%.
  • 17. How do you achieve knowledge supremacy?
      • Wide Coverage (enough for most applications)
      • Customizable (in minutes, or less)
      • Trainable (by application builders or end-users)
      • Domain Portable (with next to no human intervention)
      • Fast (enough to index TBs of text)
      • Manageable (demonstrated value-add)
  • 18. Performance Profile: 2 GHz, single core, 2 GB RAM
  • 19. How do you achieve knowledge supremacy?
      • Wide Coverage (enough for most applications)
      • Customizable (in minutes, or less)
      • Trainable (by application builders or end-users)
      • Domain Portable (with next to no human intervention)
      • Fast (enough to index TBs of text)
      • Manageable (demonstrated value-add)
  • 20. LCC Text Processing Cycle
      [Cycle diagram; the layout is not recoverable from the transcript. Stage labels appear interleaved as: "Analytic Info Output Need", "Awareness", "Data Processing Collection".]
      • Component labels: Question Answering; Semantic Search; Keyword Expansion; Open APIs; Web Services; Java RMI; Geocoding; Spatial Inference; Timestamping; Temporal Inference; Predictive Analysis; Situational Analysis; Socio-Cultural Analysis; Dossier Generation; Data Ingestion & Indexing; Named Entity Recognition; Information Extraction; Coreference Resolution
  • 21. Dossier Generation (2009)
      • Need for tools which can automatically assemble high-quality knowledge resources from information extracted from text
      • LCC is developing an integrated, unsupervised Dossier Generation capability which can assemble relevant entity profiles (either as unstructured text or Intellipedia-style structured infoboxes)
          – Hundreds of Entity, Relation, Attribute, Events
          – Implicit Relations from Data Mining Systems
          – Normalized Dates / Times / Locations
          – Learning-based relevance detection algorithms capable of learning what’s relevant for each individual or category of individuals
  • 22. Database Validation (2009)
      [Diagram labels: Content Validation; Information Retrieval; Commitment Extraction]
      • Example statements:
          – The attack took place in the morning.
          – The attack killed 2 caretakers.
          – The attack damaged 50 cars.
          – The attack damaged 20 buildings.
          – The mosque was in Mariengasse.
          – Anas Shakfeh said the attack was a protest by rightist circles against the Islam conference.
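  • 23. Knowledge Acquisition for Link Analysis (2009)
      [Pipeline diagram; the layout is not recoverable from the transcript.]
      • Extraction inputs: Entity Extraction; Relationship Extraction; Event Extraction; Untyped Dependency Extraction
      • Remaining labels, as they appear in the transcript: "Model Semantic Feedback Triples Weights, Entailment Graph Pruning Validation Population Candidate Graph Relations Edges Inference Enrichment"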
  • 24. LCC Services
      • Custom End-to-End Application Development
      • Custom Component Development
      • Corporate R&D
      • Production Services
      • Data Verification Services
      • Support and Maintenance
  • 25. Who is LCC’s customer base?
      • Target Markets
          – Government, Intelligence, and Defense
          – Commodity Search Providers
          – Company, Credit, and Financial Information
          – News and Trade Publishers
          – General Aggregators and Distributors
          – Pharma
      • Emerging Markets
          – Legal
          – CRM
          – Supply Chain Management
          – Business Intelligence Providers
          – Healthcare
  • 26. Who are LCC’s partners?
      • Strategic Partners
          – Application Developers (with complementary S&D interests)
          – Visualization Developers
          – Commodity Search Providers
          – Mobile App Developers
      • Technology Partners
          – Extraction Providers
          – Data Mining Providers
          – Database Providers
          – Inference Providers
      • Integration Partners
          – Large Government integrators with access to customers, systems of record
          – Large software vendors with interest in extraction technology
      • Channel Partners
          – Content Providers
              • News
              • Education
              • Financial
              • Business Intelligence
  • 27. CiceroLite
      • High-performance named entity recognition for multiple languages
      • Foreign Languages:
          – English (3/2009: > 1000 types)
          – Spanish, French, Dutch, German, Russian, Japanese (~100 types)
          – Arabic, Chinese, Farsi, Korean (~50 types)
      • Available as server or standalone application
  • 28. PinPoint
      • Geocoding of more than 10M place names
          – Absolute Expressions
          – Relative Expressions
          – Street Addresses
          – Latitude / Longitude or MGRS
      • Timestamping for events and event-denoting nominals
          – Absolute Expressions
          – Relative Expressions
          – Duration Estimation
      • Available as a server app only
  • 29. CiceroCustom
      • Open-domain, customizable:
          – Entity
          – Attribute
          – Relationship and
          – Event Extraction
      • Foreign language support:
          – Arabic, Chinese
      • Available as a server or standalone application
  • 30. IndexManager
      • Distributable annotation and indexing that’s compatible with all of LCC’s products
      • Can index annotations from multiple providers into single open-standard index format
      • Document formats supported: .xml, .html, .pdf, .doc, .ppt, .txt, e-mail, etc.
      • Available as a server or a desktop application
  • 31. Sentiment Tracking
      • Identifies sentiment, opinions, and other subjective attitudes held by individuals towards any of a set of “target” products or issues.
      • Only available for English
      • Only available as a server app
      • Can be run with LCC’s indexes or any standard Apache Lucene index.
  • 32. Ferret
      • State-of-the-art question answering for factoid, list, and complex questions
      • Foreign Language Support:
          – English, Arabic, Chinese, Farsi, Korean, Turkish, Spanish, French, Dutch, German, Japanese
      • Available as a server or standalone application
      • Can be run with LCC’s indexes or any standard Apache Lucene index.
  • 33. GistTexter
      • Summarization for document clusters or search results
      • Foreign Language Support:
          – English, Arabic, Chinese, Farsi, and Korean
      • Available as a server or standalone application
      • Can be run with LCC’s indexes or any standard Apache Lucene index.
  • 34. For More Information
      • For more information, contact us:
          – Andrew Hickl, CEO/President
            andy@languagecomputer.com
            tel: (972) 231-0052, Extension 114
            cel: (858) 366-8424
      • Websites:
          – Corporate: http://www.languagecomputer.com
          – Labs: http://labs.languagecomputer.com
          – Online Demos: http://www.getferret.com