Getting Started with Unstructured
    Data
    Christine Connors & Kevin Lynch
    TriviumRLG LLC

    Semantic Tech & Business, Washington D.C.
    November 29, 2011

Tuesday, November 29, 2011
Meta


    ✤   Presenter: Christine Connors

         ✤    @cjmconnors

    ✤   Presenter: Kevin Lynch

         ✤    @kevinjohnlynch

    ✤   Principals at www.triviumrlg.com



Agenda

    ✤   What is unstructured data?

    ✤   Where do we find it?

    ✤   How important is it?

    ✤   How do we visualize it?

    ✤   Machine processing for actionable data

    ✤   Tools


What is unstructured data?


    ✤   Data which

         ✤    Is not in a database

         ✤    Does not adhere to a formal data model

    ✤   Content




Isn’t that a misnomer?

    ✤   Problematic term

    ✤   The presence of object metadata or aesthetic markup does not alone
        give ‘structure’ in this sense of the word

         ✤    Object metadata = machine or applied properties

         ✤    Aesthetic markup = stylesheets; rendering information

    ✤   Semi-structured data is typically treated as unstructured for the
        purposes of machine processing and analysis


Types of ‘un’structured data



    ✤   Text-based documents

         ✤    Word processing, presentations, email, blogs, wikis, tweets, web
              pages, web components (read/write web)

    ✤   Audio/video files




Where do we find it?

    ✤   Office productivity suites

    ✤   Content management systems

    ✤   Digital asset management systems

    ✤   Web content management systems

         ✤    Wikis, blogs, comment & discussion threads

    ✤   Social networking tools

         ✤    Twitter, Yammer, instant messengers

Is it really that important?

    ✤   Structured: 15%

    ✤   Unstructured: 85%




What’s in that 80-85%?

    ✤   Progress reports - created in a word processor

    ✤   Dashboards - created in presentation software

    ✤   Progress reports - color coded text in a spreadsheet

    ✤   Brainstorming - in messaging systems

    ✤   Decision making - in email

    ✤   Business intelligence - on the web and more




How can we make the data more
    actionable?

    ✤   Identify it

    ✤   Convert to a format you can work with

    ✤   Add structure, meaning:

         ✤    information extraction

         ✤    annotation

         ✤    content analytics


What about enterprise search?


    ✤   First line of defense

    ✤   Points you at the highest relevancy ranked data via pattern matching
        and statistical analysis

    ✤   Does not assist in other visualizations or transformations without
        further machine processing
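The relevance ranking described above can be sketched as follows. This is a minimal illustration that assumes a simple TF-IDF score; the mini-corpus, the `rank` function, and the scoring formula are hypothetical and do not represent any particular enterprise search product.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query, docs):
    """Return (doc_id, score) pairs sorted by a simple TF-IDF score."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if tf[term]:
                # Smoothed inverse document frequency: rarer terms weigh more.
                idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
                score += (tf[term] / len(tokens)) * idf
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical mini-corpus, for illustration only.
DOCS = {
    "report": "Quarterly progress report for the rocket project",
    "email": "Decision on the rocket launch date was made by email",
    "wiki": "Wiki page about office productivity suites",
}

if __name__ == "__main__":
    for doc_id, score in rank("rocket launch", DOCS):
        print(doc_id, round(score, 3))
```

The ranking points at the most relevant documents, but the output is still a list of documents; it carries no entities, relations, or categories, which is why further machine processing is needed.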




Machine Processing


    Diagram components:

    ✤   Unstructured Data

    ✤   Natural Language Processing, Statistical Analysis, Rules-based
        Classification, Semantic Analysis

    ✤   Machine Processing Platform

    ✤   Federated Search, API, Index

    ✤   Visualizations, Data Stores

Let’s go a little deeper...




Good News, Bad News

    ✤   Good: Basic text analysis tools are widely available; cheap or free

    ✤   Good: The range of information you can now consider has broadened;
        the intelligence you can bring to bear on that information has
        increased

    ✤   Bad: Skillsets not widely available (but they are available!)

    ✤   Good: You can get started right here, understanding, identifying the
        sources, and possible approaches



What Data Doesn’t Do

    ✤   From Coco Krumme in “Beautiful Data”

         ✤    Data doesn’t drive everything.

         ✤    Note: “narrative fallacy,” “confirmation bias,” “paradox of choice”

    ✤   Data doesn’t: scale (cognitively), alone explain, predict

    ✤   The real world doesn’t create random variables

    ✤   Data doesn’t stand alone


Integrating Unstructured Data

         Images

         From Oracle 11g presentation at www.nmoug.org/papers/11g_High_Level_April08.ppt

The Goal: Usable Knowledge


    ✤   Information extraction is NOT the goal

         ✤    Information extraction is a means to an end

    ✤   Knowledge discovery is the goal

    ✤   To this end, we will perform lots of processing to move from bits to
        usable meaning




So many <near> synonyms

    ✤   Text analytics

    ✤   Content analytics

    ✤   Text mining

    ✤   Data mining

    ✤   Information extraction

    ✤   And then there’s Natural Language Processing


What’s the same?



    ✤   Moving from bits to meaning requires processing, and a lot of that
        processing is the same, no matter what you call it

    ✤   We will focus primarily on textual information today




Natural Language

    ✤   From Peter Norvig’s “Natural Language Corpus Data” chapter in
        “Beautiful Data”

    ✤   Google’s 1 trillion-word corpus investigating probabilistic language
        models

    ✤   13 million types (unique words, punctuation)

    ✤   100k types cover 98% of the corpus

    ✤   For: word segmentation, spelling correction, language identification,
        spam detection, author identification

    ✤   Segmentation ambiguity: “choose Spain” or “chooses pain”?
        “insufficient numbers” or “in sufficient numbers”?
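Word segmentation with a probabilistic language model can be sketched as below. This is a toy unigram model: the counts in `COUNTS` are invented for illustration (a real model would take them from a large corpus), and the function names are hypothetical.

```python
from functools import lru_cache

# Toy unigram counts, invented for illustration only.
COUNTS = {
    "choose": 50, "chooses": 10, "spain": 40, "pain": 30,
    "in": 500, "sufficient": 20, "insufficient": 15, "numbers": 60,
}
TOTAL = sum(COUNTS.values())


def word_prob(word):
    """Unigram probability, with a length-based penalty for unknown words."""
    if word in COUNTS:
        return COUNTS[word] / TOTAL
    return 1.0 / (TOTAL * 10 ** len(word))


def sequence_prob(words):
    """Probability of a word sequence under the unigram independence assumption."""
    prob = 1.0
    for word in words:
        prob *= word_prob(word)
    return prob


@lru_cache(maxsize=None)
def segment(text):
    """Return the most probable split of an unspaced string, as a tuple of words."""
    if not text:
        return ()
    # Try every possible first word, then segment the remainder recursively.
    candidates = [(text[:i],) + segment(text[i:]) for i in range(1, len(text) + 1)]
    return max(candidates, key=sequence_prob)


if __name__ == "__main__":
    print(segment("choosespain"))
    print(segment("insufficientnumbers"))
```

With these toy counts the model prefers "choose spain" and "insufficient numbers"; different counts would change the answer, which is the point of using corpus statistics.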

Information Extraction

    ✤   Token identification - “tokenization”

    ✤   Word segmentation

    ✤   Sentence splitting

    ✤   Part-of-speech tagging - “POS” tagging (noun, verb, adverb, adjective,
        etc.)

    ✤   Phrase identification - noun phrase

    ✤   Entity extraction - people, places, events, dates, organizations
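The first two steps above, tokenization and sentence splitting, can be sketched with regular expressions. This is a simplified illustration, not how GATE or any other toolkit implements them; the abbreviation list and function names are assumptions.

```python
import re

# Abbreviations that end in a period without ending a sentence (illustrative list).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "inc.", "e.g.", "i.e."}


def tokenize(text):
    """Return (token, start, end) triples for words, numbers, and punctuation."""
    pattern = r"\w+(?:[.'-]\w+)*|[^\w\s]"
    return [(m.group(), m.start(), m.end()) for m in re.finditer(pattern, text)]


def split_sentences(text):
    """Split after ., ! or ? followed by whitespace, skipping known abbreviations."""
    sentences, start = [], 0
    for m in re.finditer(r"[.!?]\s+", text):
        candidate = text[start:m.start() + 1]
        words = candidate.split()
        if words and words[-1].lower() in ABBREVIATIONS:
            continue  # e.g. "Dr." is not a sentence boundary
        sentences.append(candidate.strip())
        start = m.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    sample = "Joe is CEO at IBM. He is an IEEE member."
    print(split_sentences(sample))
    print(tokenize("Joe is CEO."))
```

Keeping character offsets with each token matters because later steps (part-of-speech tagging, entity extraction) attach their results to spans of the original text.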

Information Extraction

    ✤   Cluster analysis - group related information, where relationship may not
        be known

    ✤   Classification - mapping to specific categories

    ✤   Dependency identification / Rule generation

    ✤   Relationship detection - e.g. “Joe” “is CEO” at “IBM”

    ✤   Coreference resolution (anaphoric reference resolution)

         ✤    e.g., “Joe is CEO at IBM. He is an IEEE member.”

    ✤   Summarization - key concepts or key sentences
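Relationship detection and pronoun resolution for the “Joe is CEO at IBM” example can be sketched as follows. The single hand-written pattern and the nearest-preceding-person heuristic are deliberate simplifications chosen for illustration; real systems use far richer rules or trained models.

```python
import re

TEXT = "Joe is CEO at IBM. He is an IEEE member."

# Hand-written rule for "<Person> is <Role> at <Org>" (illustrative only).
RELATION = re.compile(r"\b([A-Z][a-z]+) is ([A-Za-z ]+?) at ([A-Z][A-Za-z]*)\b")
PRONOUN = re.compile(r"\b(He|She)\b")


def extract_relations(text):
    """Return one dict per match, including the character offset of the person."""
    return [
        {"person": m.group(1), "role": m.group(2), "org": m.group(3), "start": m.start(1)}
        for m in RELATION.finditer(text)
    ]


def resolve_pronouns(text, relations):
    """Map each pronoun offset to the nearest preceding person (naive heuristic)."""
    resolved = {}
    for m in PRONOUN.finditer(text):
        earlier = [r for r in relations if r["start"] < m.start()]
        if earlier:
            resolved[m.start()] = earlier[-1]["person"]
    return resolved


if __name__ == "__main__":
    relations = extract_relations(TEXT)
    print(relations)
    print(resolve_pronouns(TEXT, relations))
```

Once “He” is linked to “Joe”, the second sentence yields a fact about Joe rather than about an unknown person.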

IR and IE

    ✤   IR (Information Retrieval) versus IE (Information Extraction)

    ✤   IR retrieves documents from collections; IE retrieves facts and structured
        information from collections

    ✤   In IR, the objects of analysis are documents; in IE, the objects of analysis
        are facts

    ✤   IE returns knowledge at a deeper level than traditional IR

         ✤    Results may be imperfect, and linking them back to documents adds
              value

         ✤    Sound familiar? (semantic web, linked data)

Information Extraction
    Two primary system types

    ✤   Knowledge Engineering

         ✤    Rule based

         ✤    Developed by experienced language engineers

         ✤    Make use of human intuition

         ✤    Require only small amount of training data

         ✤    Development can be very time consuming

         ✤    Some changes may be hard to accommodate

    ✤   Learning Systems

         ✤    Use statistics or other machine learning

         ✤    Developers do not need language engineering expertise

         ✤    Require large amounts of annotated training data

         ✤    Some changes may require re-annotation of the entire training corpus

    From http://gate.ac.uk/sale/talks/gate-course-may11/track-1/module-2-ie/module-2-ie.pdf

Two views of the semantic web

    ✤   Diagram: Subject - Predicate - Object

    ✤   Machine learning, natural language processing, artificial intelligence and linked data

    Images from Wikipedia


Named Entities

    ✤   What is NER?

    ✤   Named Entity Recognition

         ✤    identifying proper names in texts, and classification into a set of
              predefined categories of interest

    ✤   Named entity recognition is the cornerstone of Information
        Extraction, providing a foundation from which to build complex
        information extraction systems



Named Entities

    ✤   Person names

    ✤   Organizations (companies, government organizations, committees)

    ✤   Locations (cities, countries, rivers)

    ✤   Date and time expressions

    ✤   Measures (percent, money, weight)

    ✤   Email addresses, web addresses, street addresses

    ✤   Some domain-specific entities: names of drugs, medical conditions,
        names of ships, bibliographic references, etc.

NOT Named Entities
    ✤   Artifacts - Wall Street Journal

    ✤   Common nouns, referring to named entities

         ✤    e.g. the company, the committee

    ✤   Names of groups of people and things named after people

         ✤    e.g. the Tories, the Nobel Prize

    ✤   Adjectives derived from names

         ✤    e.g. Bulgarian, Chinese

    ✤   Numbers which are not times, dates, percentages or money amounts

        From http://gate.ac.uk/sale/talks/ne-tutorial.ppt

Break Time!




Open Tools
   ✤    GATE – General Architecture for
        Text Engineering, from the
        University of Sheffield, with many
        users and excellent documentation.

   ✤    GATE has customizable document
        and corpus processing pipelines.
        GATE is an architecture, a
        framework, and a development
        environment, with a clean separation
        of algorithms, data, and
        visualization.


GATE


    ✤   “The Volkswagen Beetle of language processing”

    ✤   “...more than a decade of collecting reusable code and building a
        community has led [to] a mature ecosystem for solving language
        processing problems quickly.”

    ✤   Hamish Cunningham 2010




GATE – Key Features

    ✤   Component-based development

    ✤   Automatic performance measurement

    ✤   Clean separation between data structures and algorithms

    ✤   Consistent use of standard mechanisms for components to
        communicate data

    ✤   Insulation from data formats

    ✤   Provision of a baseline set of language components

GATE – More...

    ✤   Free – open source, LGPL, Java

    ✤   Mature, at version 6, actively supported, 15 FTEs

    ✤   Comprehensive, standards-based, popular

    ✤   Used by thousands of companies, universities, and research
        laboratories

    ✤   Well-known, tested, researched, and very well-documented


GATE Overview

    ✤   Architectural principles

         ✤    Non-prescriptive, theory neutral (strength and weakness)

         ✤    Re-use, interoperation, not reimplementation (diverse support, lots of
              plugins)

         ✤    (Almost) everything is a component, and component sets are user-extendable

    ✤   Component-based development

         ✤    CREOLE = modified Java Beans (Collection of REusable Objects for
              Language Engineering)

         ✤    The minimal component = 10 lines of Java, 10 lines of XML, 1 URL

GATE – Family

    ✤   GATE Developer – an integrated development environment for
        language processing components bundled with the most widely used
        Information Extraction system and a comprehensive set of plugins

    ✤   GATE Embedded – an object library optimized for inclusion in
        diverse apps

    ✤   GATE Teamware – web app, a collaborative annotation environment

    ✤   GATE Cloud – parallel distributed processing



GATE – Embedded




                             From http://gate.ac.uk/g8/page/print/2/sale/talks/gate-apis.png

GATE – Teamware

    ✤   GATE Teamware – web app, a collaborative annotation environment
        for high volume factory-style semantic annotation built with workflow

    ✤   Running in 5 minutes with Teamware virtual server from
        GATECloud.net (itself open source):

         ✤    Reusable project templates

         ✤    Project-specific roles, users

         ✤    Applying GATE-based processing routines

         ✤    Project status, annotator activity, statistics

GATE – First Cousins


    ✤   Ontotext KIM: UIs demonstrating the multi-paradigm approach to
        information management, navigation and search

    ✤   Ontotext Mimir: a massively scalable multi-paradigm index built on
        Ontotext’s semantic repository family, GATE’s annotation structures
        database, plus full-text indexing from MG4J

    ✤   Ontotext FactForge: ~4B Linked Data statements, query-able




GATE – Ontotext KIM

    ✤   Ontotext KIM: UIs, tools, GATE Gazetteers, including a Linked Data
        gazetteer (experimental)

    ✤   Pre-loaded knowledge base for entities

    ✤   Tools to upload, query, tailor the knowledge base, algorithms, UI

    ✤   Can crawl web, including Linked Data, creating semantic index: your
        servers, theirs, or cloud

    ✤   Based on GATE and OWLIM


GATE – Ontotext KIM




                             From: http://www.ontotext.com/sites/default/files/pictures/diagram.png

GATE – Ontotext KIM
     Structure, Patterns, Ontology, Facets




GATE – Ontotext MIMIR

    ✤   Ontotext Mimir: large scale indexing infrastructure supporting hybrid
        search (text, annotation, meaning); massively scalable multi-paradigm
        capability, combines MG4J full-text index and BigOWLIM semantic
        repository; query with text, structural info, and SPARQL

    ✤   Integrated with GATE, customizable, scalable

    ✤   Open source components

    ✤   Can federate multiple MIMIRs

    ✤   Low acquisition, management cost to scale

GATE – Multi-paradigm

    ✤   Why “multi-paradigm?” Proliferation of retrieval technology options

    ✤   Full text, boolean, proximity, ranking; behavior mining, tag clouds;
        concept indexing: taxonomic, ontological; annotation-based

    ✤   Choice depends principally on content volume + value:

         ✤    High volume, low (average) value: web search

         ✤    Medium volume, higher (personal) value: social networks, photo
              sharing, tagging

         ✤    Low volume, high value: controlled vocabularies, taxonomies,
              ontologies

GATE “Resources”

    ✤   Applications – groups of processes (that run on one or more
        documents)

    ✤   Language Resources – documents or document collections (corpus,
        corpora)

    ✤   Processing Resources – annotation tools that operate on text in
        documents

    ✤   Applications, made up of Processing Resources, operate on Language
        Resources
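The relationship between the three resource types can be sketched in plain Python. This is a conceptual stand-in only, not GATE's Java API: documents are modeled as dicts, processing resources as functions, and `run_application` is a hypothetical name.

```python
def tokenizer(doc):
    """Processing resource: add whitespace-separated tokens to the document."""
    doc["tokens"] = doc["text"].split()
    return doc


def token_counter(doc):
    """Processing resource: relies on the tokenizer having run first."""
    doc["token_count"] = len(doc["tokens"])
    return doc


def run_application(processing_resources, corpus):
    """Application: run each processing resource, in order, over every document."""
    for doc in corpus:
        for resource in processing_resources:
            resource(doc)
    return corpus


# Language resources: a corpus of two documents.
corpus = [{"text": "Joe is CEO at IBM."}, {"text": "He is an IEEE member."}]
run_application([tokenizer, token_counter], corpus)

if __name__ == "__main__":
    for doc in corpus:
        print(doc["token_count"], doc["tokens"])
```

Reversing the order of the two resources would fail, because `token_counter` reads what `tokenizer` writes; order in the pipeline matters.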


Plugins


    ✤   Applications – an application consists of any number of Processing
        Resources, run sequentially over documents

    ✤   Plugins – a plugin is a collection of one or more Processing Resources,
        bundled together.

    ✤   A plugin, then, must be loaded before its Processing Resources can be
        used in an application.




GATE – Plugins




GATE Annotations

    ✤   Annotations are central to understanding GATE
    ✤   Annotations are associated with each document
    ✤   Each annotation has:
         ✤    start and end offsets
         ✤    an optional set of features
         ✤    each feature has a name and a value
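The annotation model above can be sketched as a small data structure. This is an illustrative Python stand-in, not GATE's Java classes; the `type` field, the feature names, and the sample values are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A typed span over a document, with an optional map of named features."""
    type: str
    start: int  # offset of the first character of the span
    end: int    # offset just past the last character of the span
    features: dict = field(default_factory=dict)

    def text(self, document):
        """Return the document text covered by this annotation."""
        return document[self.start:self.end]


DOCUMENT = "Dr. Head is a staff scientist at We Build Rockets Inc."

# Hypothetical annotations; feature names and values are made up.
person = Annotation("Person", 0, 8, {"title": "Dr."})
org = Annotation("Organization", DOCUMENT.index("We Build"), len(DOCUMENT),
                 {"orgType": "company"})

if __name__ == "__main__":
    for annotation in (person, org):
        print(annotation.type, repr(annotation.text(DOCUMENT)), annotation.features)
```

Because annotations point at offsets rather than changing the text, many layers of annotation (tokens, sentences, entities) can coexist on the same document.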




Information Extraction
    ✤   NE: Named Entity recognition and typing

    ✤   CO: CO-reference resolution

    ✤   TE: Template Elements

    ✤   TR: Template Relations

    ✤   ST: Scenario Templates

    ✤   Example:

         The shiny red rocket was fired on Tuesday. It is the brainchild of Dr. Big Head.
         Dr. Head is a staff scientist at We Build Rockets Inc.

         ✤    NE: Entities are “rocket,” “Tuesday,” “Dr. Head” and “We Build Rockets”

         ✤    CO: “it” refers to the rocket; “Dr. Head” and “Dr. Big Head” are the same

         ✤    TE: the rocket is “shiny red” and Head’s “brainchild”

         ✤    TR: Dr. Head works for “We Build Rockets Inc.”

         ✤    ST: a rocket launching event occurred with the various participants

    From http://gate.ac.uk/sale/talks/ne-tutorial.ppt

ANNIE

    ✤   A Nearly-New Information Extraction System, packaged with GATE,
        used throughout examples, and a great place to start

    ✤   A collection of GATE Processing Resources to perform Information
        Extraction on unstructured text

    ✤   “Nearly new” was its name 10 years ago, and it stuck

    ✤   Other information extraction systems include LingPipe and
        OpenNLP. GATE includes wrappers for LingPipe and OpenNLP,
        independently developed NLP pipelines. All three systems are
        provided as pre-built applications through the GATE File menu

ANNIE

    ✤   “Processing Resources” inside ANNIE:

    ✤   Tokenizer, sentence splitter, part-of-speech tagger, gazetteers, named
        entity tagger, and an orthomatcher

    ✤   Also included are noun phrase and verb phrase chunkers

    ✤   Each “Processing Resource” inside ANNIE can be used as part of a
        pipeline you create to add annotations or modify existing ones

    ✤   ANNIE is a highly customizable, rule-based system, with very useful
        defaults

ANNIE

    ✤   “Processing Resources” inside ANNIE:

    ✤   Gazetteer – lookup annotations (lists)

    ✤   JAPE transducer – date, person, location, organization, money,
        percent annotations

    ✤   Orthomatcher – adds match features to named entity annotations
        (coreference matching)

    ✤   Document Reset – removes annotations


IE Steps in ANNIE


    ✤   “Tokenizer” performs Token identification and word segmentation

    ✤   “Sentence splitter” identifies sentences

    ✤   “POS” tagger performs Part-of-speech tagging – (noun, verb, adverb,
        adjective)

    ✤   Must run Tokenizer and Sentence Splitter before POS tagger




IE Steps in ANNIE

    ✤   “Gazetteers” – lists of names (people, cities, groups); you can modify
        or add lists

    ✤   Each list has features (majorType, minorType, language)

    ✤   Gazetteers generate “Lookup” annotations with features
        corresponding to the matched list. When the text matches a gazetteer
        entry, a Lookup annotation is created.

    ✤   Lookup annotations are used by ANNIE’s Named Entity transducer
        for entity identification.
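Gazetteer lookup can be sketched as a dictionary match that emits Lookup-style annotations. This is a simplified Python illustration, not ANNIE's implementation: the list entries are invented, only single-word entries are handled, and matching is case-insensitive by assumption.

```python
import re

# Hypothetical gazetteer lists: entry -> (majorType, minorType).
GAZETTEER = {
    "washington": ("location", "city"),
    "ibm": ("organization", "company"),
    "tuesday": ("date", "day"),
}


def lookup(text):
    """Return a Lookup-style annotation for every word found in the gazetteer."""
    annotations = []
    for m in re.finditer(r"\w+", text):
        entry = GAZETTEER.get(m.group().lower())
        if entry:
            major, minor = entry
            annotations.append({
                "type": "Lookup",
                "start": m.start(),
                "end": m.end(),
                "features": {"majorType": major, "minorType": minor},
            })
    return annotations


if __name__ == "__main__":
    for annotation in lookup("Joe joined IBM on Tuesday."):
        print(annotation)
```

A later rule-based step can then promote a Lookup with `majorType` "organization" to an Organization entity, or ignore it when the surrounding context says otherwise.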


ANNIE in GATE




ANNIE Sequence




                              Pipeline sequence matters: tokenizer,
                             sentence splitter, POS tagger, gazetteer

IE Steps in ANNIE


    ✤   “NE Transducer” – Named Entity Transducer performs named entity
        recognition (NER)

    ✤   Once we have built up the processing resource pipeline with the
        previous steps (tokenizer, sentence splitter, POS tagger, gazetteer), we
        are ready to add the transducer for named entity recognition

    ✤   More specific information can be added to the features now, including
        the “kind” of entity, and the rules that were fired



IE Steps in ANNIE


    ✤   “OrthoMatcher” – orthographic co-reference matches proper names
        and their variants.

    ✤   Will match previously unclassified names, based on relations with
        classified entities

    ✤   Matches “Kevin Lynch” with “Dr. Lynch”

    ✤   Matches acronyms with expansions
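Orthographic matching of name variants can be sketched with two small heuristics. This is not the OrthoMatcher's actual rule set; the title list, the surname-based comparison, and the function names are assumptions chosen to reproduce the examples above.

```python
# Honorifics to ignore when comparing names (illustrative list).
TITLES = {"dr", "mr", "mrs", "ms", "prof"}


def name_tokens(name):
    """Lowercased name parts with punctuation and honorific titles removed."""
    parts = [part.strip(".,").lower() for part in name.split()]
    return [part for part in parts if part and part not in TITLES]


def same_entity(a, b):
    """Heuristic: surnames must match and given names must not conflict."""
    ta, tb = name_tokens(a), name_tokens(b)
    if not ta or not tb or ta[-1] != tb[-1]:
        return False
    # If either mention is surname-only, accept; otherwise earlier parts must agree.
    return len(ta) == 1 or len(tb) == 1 or ta[:-1] == tb[:-1]


def is_acronym_of(acronym, expansion):
    """True when the acronym equals the initials of the capitalized words."""
    initials = "".join(word[0] for word in expansion.split() if word[0].isupper())
    return acronym.replace(".", "").upper() == initials


if __name__ == "__main__":
    print(same_entity("Kevin Lynch", "Dr. Lynch"))
    print(is_acronym_of("IBM", "International Business Machines"))
```

Heuristics like these over-match in large documents (two different people named Lynch), which is one reason the results are recorded as match features rather than merged outright.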



IE Steps in ANNIE

    ✤   Tokenizer, sentence splitter, and OrthoMatcher are language, domain,
        and application-independent

    ✤   Part-of-speech tagger is language-dependent and application-independent

    ✤   Gazetteer lists are starting points (60K entries)

    ✤   ANNIE is a way to get started, with a framework for identifying the
        kinds of elements that matter to your work, and for quickly testing
        your ideas against existing data


Annotations In Context




Rules-based Classification

    ✤   Once a stand-alone project, now often part of annotation services

    ✤   Regex, Boolean and naive Bayesian algorithms executed on tokens

         ✤    And, Or, Not, Near (x), Multi, Stem, Exact, Phrase, et al. (vendor or
              source dependent)

    ✤   Assigns documents to a taxonomic category

         ✤    Allows for greater control over depth and breadth of categories

    ✤   Human aided, machine processed
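Boolean classification rules of the kind listed above can be sketched as composable functions over tokens. The operator names mirror And, Or, Not, and Near (x); the category names and rules are hypothetical, and real products differ in syntax and in how they handle stemming and phrases.

```python
import re


def tokens(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def has(term):
    return lambda toks: term in toks


def all_of(*rules):  # And
    return lambda toks: all(rule(toks) for rule in rules)


def any_of(*rules):  # Or
    return lambda toks: any(rule(toks) for rule in rules)


def not_(rule):  # Not
    return lambda toks: not rule(toks)


def near(a, b, distance):  # Near (x)
    """True when a and b occur within `distance` tokens of each other."""
    def rule(toks):
        pos_a = [i for i, t in enumerate(toks) if t == a]
        pos_b = [i for i, t in enumerate(toks) if t == b]
        return any(abs(i - j) <= distance for i in pos_a for j in pos_b)
    return rule


# Hypothetical taxonomy categories and their rules.
RULES = {
    "Aerospace/Launches": all_of(near("rocket", "fired", 3), not_(has("toy"))),
    "Finance/Earnings": any_of(has("earnings"), all_of(has("quarterly"), has("revenue"))),
}


def classify(text):
    """Return every category whose rule matches the document."""
    toks = tokens(text)
    return [category for category, rule in RULES.items() if rule(toks)]


if __name__ == "__main__":
    print(classify("The shiny red rocket was fired on Tuesday."))
```

The rules are written by people and executed by the machine, which is what gives fine control over how deep and how broad each category is.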





Break Time!




Visualization - Prefuse

Visualization - Gephi

Visualization - Cytoscape




Quick!
    1.  Take one large pile of text (documents, emails, tweets, patents, papers, transcripts, blogs, comments, acts of
        parliament, and so on and so forth) -- call this your corpus

    2.  Pick a structured description of interesting things in the text (a telephone directory, or chemical taxonomy,
        or something from the Linked Data cloud) -- call this your ontology

    3.  Use GATE Teamware to mark up a gold standard example set of annotations of the corpus (1.) relative to
        the ontology (2.)

    4.  Use GATE Developer to build a semantic annotation pipeline to do the annotation job automatically and
        measure performance against the gold standard

    5.  Take the pipeline from 4. and apply it to your text pile using GATE Cloud (or embed it in your own systems
        using GATE Embedded)

    6.  Use GATE Mimir to store the annotations relative to the ontology in a multiparadigm index server. (For
        techies: this sits in the backroom as a RESTful web service.)

    7.  Use Ontotext KIM to add semantic search, knowledge facet search, ontology browsing, entity popularity
        graphing, time series graphing, annotation structure search and (last but not least) boolean full text search.
        (More techy stuff: mash up these types of search with your existing UIs.)

Data Warehousing /
    Business Intelligence


    ✤   Perspective

    ✤   Process

    ✤   Use cases

    ✤   Implications with unstructured data




DW/BI Perspective


    ✤   Structured data is an incomplete version of the “truth”

    ✤   Until information is quantified, it is not very useful

    ✤   Discover facts, and give them structure

    ✤   Complement structured data with unstructured data; try to complete
        the picture (of the business, the customer, performance)




DW/BI Process



    ✤   Extract, then formalize

    ✤   Give information structure, then associations

    ✤   Map to existing structures in the data warehouse
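The extract, formalize, and map steps can be sketched with an in-memory SQLite database standing in for the data warehouse. The table names, columns, and extracted records are all hypothetical; the point is only that extracted facts are joined to an existing dimension and unmatched items are set aside for review.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Existing warehouse structure: a customer dimension (hypothetical schema).
conn.execute("CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT)")
# New structure for facts extracted from unstructured sources.
conn.execute("CREATE TABLE complaint_mention (customer_id INTEGER, source TEXT, snippet TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?)",
                 [(1, "We Build Rockets Inc."), (2, "IBM")])

# Hypothetical output of an upstream information extraction step.
extracted = [
    {"org": "IBM", "source": "email", "snippet": "IBM reported a late delivery"},
    {"org": "Unknown Corp", "source": "blog", "snippet": "Unknown Corp praised support"},
]


def load_mentions(connection, items):
    """Insert mentions whose organization matches a known customer; return the rest."""
    unmatched = []
    for item in items:
        row = connection.execute(
            "SELECT customer_id FROM customer WHERE name = ?", (item["org"],)
        ).fetchone()
        if row is None:
            unmatched.append(item)
        else:
            connection.execute(
                "INSERT INTO complaint_mention VALUES (?, ?, ?)",
                (row[0], item["source"], item["snippet"]),
            )
    return unmatched


unmatched = load_mentions(conn, extracted)

if __name__ == "__main__":
    print(conn.execute("SELECT * FROM complaint_mention").fetchall())
    print("needs review:", [item["org"] for item in unmatched])
```

Exact name matching is the weakest part of this sketch; in practice the mapping step is where coreference and gazetteer results earn their keep.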




DW/BI Use Cases

    ✤   Report indexing (of metadata, of instances)

         ✤    Report sections become possible

    ✤   Self-service for consumers

    ✤   “BI Search” (of those reports)

    ✤   Include in portal


    ✤   As range of reports and users increases, unstructured data approaches
        have more value

DW/BI Use Case Ideas


    ✤   For customers, products, complaints, locations:

         ✤    Voice recognition indexing

         ✤    RSS feeds

         ✤    Wikis, blogs (internal and external)

         ✤    Instant messages



DW/BI Implications

    ✤   Have to store these results

    ✤   Have to model these results

    ✤   Have to map these results to something meaningful

    ✤   Have to include the results in a useful way (Where? Use taxonomies?
        Which ones?)

    ✤   Quality, cost, and complexity matter; extracted entities don’t relate
        directly to performance

    ✤   Not a replacement for existing technology, but an addition to it

Some Technical Issues


    ✤   Quality

    ✤   Integration

    ✤   Concurrency

    ✤   Security

    ✤   Skills



Additional Open Tools

   ✤    UIMA – Unstructured Information
        Management Architecture (IBM’s
        Watson uses this), originated at
        IBM, now an Apache project.

   ✤    Component software architecture
        with a document processing
        pipeline similar to GATE. Focus on
        performance and scalability, with
        distributed processing (web
        services).


UIMA
    ✤   UIMA’s basic building blocks are Annotators. They iterate over an artifact to discover new
        types based on existing ones and update the Common Analysis Structure (CAS) for
        upstream processing.

    ✤   UIMA CAS representation now aligned with XMI standard

    ✤   Common Analysis Structure (CAS): analysis results (i.e., artifact metadata) layered over the artifact

         ✤    Artifact (e.g., document): “Fred Center is the CEO of Center Micros”

         ✤    Parser: NP, VP, PP

         ✤    Named Entity: Person, Organization

         ✤    Relationship: CeoOf (Arg1: Person, Arg2: Org)

    Chart by IBM

UIMA




                             Image by IBM

Commercial Tools

    ✤   Oracle Data Mining (Text Mining)

    ✤   IBM SPSS

    ✤   SAS Text Miner

    ✤   Smartlogic

    ✤   Lots of acquisitions going on in the “big data” space

         ✤    HP acquired Autonomy

         ✤    Oracle acquired Endeca

A Note on Tools

    ✤    UIMA and GATE – comprehensive suites of capabilities, with learning
         curves.

    ✤    Commercial tools range from unstructured capabilities inside DBMSs
         like Oracle, to business intelligence tools from Business Objects (which
         acquired Inxight, a Xerox PARC spin-off).

    ✤    Your mileage will vary. The biggest differentiator is your knowledge
         of your data.




Questions?




Thank you
     Christine Connors
     Kevin Lynch
     www.triviumrlg.com




What can unstructured data look
    like post-processing?




Tuesday, November 29, 2011

Getting Started with Unstructured Data

  • 1. Getting Started with Unstructured Data Christine Connors & Kevin Lynch TriviumRLG LLC Semantic Tech & Business, Washington D.C. November 29, 2011 Tuesday, November 29, 2011
  • 2. Meta ✤ Presenter: Christine Connors ✤ @cjmconnors ✤ Presenter: Kevin Lynch ✤ @kevinjohnlynch ✤ Principals at www.triviumrlg.com
  • 3. Agenda ✤ What is unstructured data? ✤ Where do we find it? ✤ How important is it? ✤ How do we visualize it? ✤ Machine processing for actionable data ✤ Tools
  • 4. What is unstructured data? ✤ Data which is ✤ Not in a database ✤ Does not adhere to a formal data model ✤ Content
  • 5. Isn’t that a misnomer? ✤ Problematic term ✤ The presence of object metadata or aesthetic markup does not alone give ‘structure’ in this sense of the word ✤ Object metadata = machine or applied properties ✤ Aesthetic markup = stylesheets; rendering information ✤ Semi-structured data is typically treated as unstructured for the purposes of machine processing and analysis
  • 6. Types of ‘un’structured data ✤ Text-based documents ✤ Word processing, presentations, email, blogs, wikis, tweets, web pages, web components (read/write web) ✤ Audio/video files
  • 7. Where do we find it? ✤ Office productivity suites ✤ Content management systems ✤ Digital asset management systems ✤ Web content management systems ✤ Wikis, blogs, comment & discussion threads ✤ Social networking tools ✤ Twitter, Yammer, instant messengers
  • 8. Is it really that important? Structured Unstructured 15% 85%
  • 9. What’s in that 80-85%? ✤ Progress reports - created in a word processor
  • 10. What’s in that 80-85%? ✤ Dashboards - created in presentation software
  • 11. What’s in that 80-85%? ✤ Progress reports - color coded text in a spreadsheet
  • 12. What’s in that 80-85%? ✤ Brainstorming - in messaging systems ✤ Decision making - in email
  • 13. What’s in that 80-85%? ✤ Business intelligence - on the web and more
  • 14. How can we make the data more actionable? ✤ Identify it ✤ Convert to a format you can work with ✤ Add structure, meaning: ✤ information extraction ✤ annotation ✤ content analytics
  • 15. What about enterprise search? ✤ First line of defense ✤ Points you at the highest relevancy ranked data via pattern matching and statistical analysis ✤ Does not assist in other visualizations or transformations without further machine processing
  • 16. Machine Processing [Diagram labels: Unstructured Data; Natural Language Processing; Rules-based Classification; Statistical Analysis; Semantic Analysis; Machine Processing Platform; Federated Search; API; Index; Visualizations; Data Stores]
  • 17. Let’s go a little deeper...
  • 18. Good News, Bad News ✤ Good: Basic text analysis tools are widely available; cheap or free ✤ Good: The range of information you can now consider has broadened; the intelligence you can bring to bear on that information has increased ✤ Bad: Skillsets not widely available (but they are available!) ✤ Good: You can get started right here, understanding, identifying the sources, and possible approaches
  • 19. What Data Doesn’t Do ✤ From Coco Krumme in “Beautiful Data” ✤ Data doesn’t drive everything. ✤ Note: “narrative fallacy,” “confirmation bias,” “paradox of choice” ✤ Data doesn’t: scale (cognitively), alone explain, predict ✤ The real world doesn’t create random variables ✤ Data doesn’t stand alone
  • 20. Integrating Unstructured Data Images From Oracle 11g presentation at www.nmoug.org/papers/11g_High_Level_April08.ppt
  • 21. The Goal: Usable Knowledge ✤ Information extraction is NOT the goal ✤ Information extraction is a means to an end ✤ Knowledge discovery is the goal ✤ To this end, we will perform lots of processing to move from bits to usable meaning
  • 22. So many <near> synonyms ✤ Text analytics ✤ Content analytics ✤ Text mining ✤ Data mining ✤ Information extraction ✤ And then there’s Natural Language Processing
  • 23. What’s the same? ✤ Moving from bits to meaning requires processing, and a lot of that processing is the same, no matter what you call it ✤ We will focus primarily on textual information today
  • 24. Natural Language ✤ From Peter Norvig’s “Natural Language Corpus Data” chapter in “Beautiful Data” ✤ Google’s 1 trillion-word corpus investigating probabilistic language models ✤ 13 million types (unique words, punctuation) ✤ 100k types cover 98% of the corpus ✤ For: word segmentation, spelling correction, language identification, spam detection, author identification ✤ %? = “chooses pain” ; “in sufficient numbers”
  • 25. Information Extraction ✤ Token identification - “tokenization” ✤ Word segmentation ✤ Sentence splitting ✤ Part-of-speech tagging - “POS” tagging (noun, verb, adverb, adjective, etc.) ✤ Phrase identification - noun phrase ✤ Entity extraction - people, places, events, dates, organizations
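The first two steps on this slide, sentence splitting and tokenization, can be sketched in a few lines of Python. This is a minimal illustration using only regular expressions, not how GATE or any other toolkit does it; the function names `split_sentences` and `tokenize` are made up for this example. The splitter is deliberately naive and would wrongly break after abbreviations such as "Dr.".

```python
import re

def split_sentences(text):
    # Naive rule: a sentence ends at ., ! or ? followed by whitespace.
    # (Assumption for illustration; abbreviations like "Dr." would be split wrongly.)
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    # A token is a word (optionally with internal hyphens or apostrophes)
    # or a single punctuation character.
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sentence)

text = "Joe is CEO at IBM. He is an IEEE member."
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]
```

Later steps such as part-of-speech tagging and entity extraction consume these tokens and sentences, which is why they come first.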
  • 26. Information Extraction ✤ Cluster analysis - group related information, where relationship may not be known ✤ Classification - mapping to specific categories ✤ Dependency identification / Rule generation ✤ Relationship detection - e.g. “Joe” “is CEO” at “IBM” ✤ Coreference resolution (anaphoric reference resolution) ✤ e.g., “Joe is CEO at IBM. He is an IEEE member.” ✤ Summarization - key concepts or key sentences
  • 27. IR and IE ✤ IR (Information Retrieval) versus IE (Information Extraction) ✤ IR retrieves documents from collections; IE retrieves facts and structured information from collections ✤ In IR, the objects of analysis are documents; in IE, the objects of analysis are facts ✤ IE returns knowledge at a deeper level than traditional IR ✤ Results may be imperfect, and linking them back to documents adds value ✤ Sound familiar? (semantic web, linked data)
  • 28. Information Extraction: two primary system types ✤ Knowledge Engineering: rule based; developed by experienced language engineers; make use of human intuition; require only small amount of training data; development can be very time consuming; some changes may be hard to accommodate ✤ Learning Systems: use statistics or other machine learning; developers do not need language engineering expertise; require large amounts of annotated training data; some changes may require re-annotation of the entire training corpus ✤ From http://gate.ac.uk/sale/talks/gate-course-may11/track-1/module-2-ie/module-2-ie.pdf
  • 29. Two views of the semantic web: machine learning, natural language processing, artificial intelligence and linked data [Figure labels: Text; Subject, Predicate, Object] Images from Wikipedia
  • 30. Named Entities ✤ What is NER? ✤ Named Entity Recognition ✤ identifying proper names in texts, and classification into a set of predefined categories of interest ✤ Named entity recognition is the cornerstone of Information Extraction, providing a foundation from which to build complex information extraction systems
  • 31. Named Entities ✤ Person names ✤ Organizations (companies, government organizations, committees) ✤ Locations (cities, countries, rivers) ✤ Date and time expressions ✤ Measures (percent, money, weight) ✤ Email addresses, web addresses, street addresses ✤ Some domain-specific entities: names of drugs, medical conditions, names of ships, bibliographic references, etc.
  • 32. NOT Named Entities ✤ Artifacts - Wall Street Journal ✤ Common nouns, referring to named entities ✤ e.g. the company, the committee ✤ Names of groups of people and things named after people ✤ e.g. the Tories, the Nobel Prize ✤ Adjectives derived from names ✤ e.g. Bulgarian, Chinese ✤ Numbers which are not times, dates, percentages or money amounts ✤ From http://gate.ac.uk/sale/talks/ne-tutorial.ppt
  • 34. Open Tools ✤ GATE – General Architecture for Text Engineering, from the University of Sheffield, with many users and excellent documentation. ✤ GATE has customizable document and corpus processing pipelines. GATE is an architecture, a framework, and a development environment, with a clean separation of algorithms, data, and visualization.
  • 35. GATE ✤ “The Volkswagen Beetle of language processing” ✤ “...more than a decade of collecting reusable code and building a community has lead [to] a mature ecosystem for solving language processing problems quickly.” ✤ Hamish Cunningham 2010
  • 36. GATE – Key Features ✤ Component-based development ✤ Automatic performance measurement ✤ Clean separation between data structures and algorithms ✤ Consistent use of standard mechanisms for components to communicate data ✤ Insulation from data formats ✤ Provision of a baseline set of language components
  • 37. GATE – More... ✤ Free – open source, LGPL, Java ✤ Mature, at version 6, actively supported, 15 FTEs ✤ Comprehensive, standards-based, popular ✤ Used by thousands of companies, universities, and research laboratories ✤ Well-known, tested, researched, and very well-documented
  • 38. GATE Overview ✤ Architectural principles ✤ Non-prescriptive, theory neutral (strength and weakness) ✤ Re-use, interoperation, not reimplementation (diverse support, lots of plugins) ✤ (Almost) everything is a component, and component sets are user-extendable ✤ Component-based development ✤ CREOLE = modified Java Beans (Collection of REusable Objects for Language Engineering) ✤ The minimal component = 10 lines of Java, 10 lines of XML, 1 URL
  • 39. GATE – Family ✤ GATE Developer – an integrated development environment for language processing components bundled with the most widely used Information Extraction system and a comprehensive set of plugins ✤ GATE Embedded – an object library optimized for inclusion in diverse apps ✤ GATE Teamware – web app, a collaborative annotative environment ✤ GATE Cloud – parallel distributed processing
  • 40. GATE – Embedded From http://gate.ac.uk/g8/page/print/2/sale/talks/gate-apis.png
  • 41. GATE – Teamware ✤ GATE Teamware – web app, a collaborative annotative environment for high volume factory-style semantic annotation built with workflow ✤ Running in 5 minutes with Teamware virtual server from GATECloud.net (itself open source): ✤ Reusable project templates ✤ Project-specific roles, users ✤ Applying GATE-based processing routines ✤ Project status, annotator activity, statistics
  • 42. GATE – First Cousins ✤ Ontotext KIM: UIs demonstrating the multi-paradigm approach to information management, navigation and search ✤ Ontotext Mimir: a massively scalable multi-paradigm index built on Ontotext’s semantic repository family, GATE’s annotation structures database, plus full-text indexing from MG4J ✤ Ontotext FactForge: ~4B Linked Data statements, query-able
  • 43. GATE – Ontotext KIM ✤ Ontotext KIM: UIs, tools, GATE Gazetteers, including a Linked Data gazetteer (experimental) ✤ Pre-loaded knowledge base for entities ✤ Tools to upload, query, tailor the knowledge base, algorithms, UI ✤ Can crawl web, including Linked Data, creating semantic index: your servers, theirs, or cloud ✤ Based on GATE and OWLIM
  • 44. GATE – Ontotext KIM From: http://www.ontotext.com/sites/default/files/pictures/diagram.png
  • 45. GATE – Ontotext KIM Structure
  • 46. GATE – Ontotext KIM Patterns
  • 47. GATE – Ontotext KIM Ontology
  • 48. GATE – Ontotext KIM Facets
  • 49. GATE – Ontotext MIMIR ✤ Ontotext Mimir: large scale indexing infrastructure supporting hybrid search (text, annotation, meaning); massively scalable multi-paradigm capability, combines MG4J full-text index and BigOWLIM semantic repository; query with text, structural info, and SPARQL ✤ Integrated with GATE, customizable, scalable ✤ Open source components ✤ Can federate multiple MIMIRs ✤ Low acquisition, management cost to scale
  • 50. GATE – Multi-paradigm ✤ Why “multi-paradigm?” Proliferation of retrieval technology options ✤ Full text, boolean, proximity, ranking; behavior mining, tag clouds; concept indexing: taxonomic, ontological; annotation-based ✤ Choice depends principally on content volume + value: ✤ High volume, low (average) value: web search ✤ Medium volume, higher (personal) value: social networks, photo sharing, tagging ✤ Low volume, high value: controlled vocabularies, taxonomies, ontologies
  • 51. GATE “Resources” ✤ Applications – groups of processes (that run on one or more documents) ✤ Language Resources – documents or document collections (corpus, corpora) ✤ Processing Resources – annotation tools that operate on text in documents ✤ Applications, made up of Processing Resources, operate on Language Resources
  • 52. Plugins ✤ Applications – an application consists of any number of Processing Resources, run sequentially over documents ✤ Plugins – a plugin is a collection of one or more Processing Resources, bundled together. ✤ Plugins, then, are applications, that need to be loaded in order to access their Processing Resources.
  • 53. GATE – Plugins (I)
  • 54. GATE – Plugins (II)
  • 56. GATE Annotations ✤ Annotations are central to understanding GATE ✤ Annotations are associated with each document ✤ Each annotation has: ✤ start and end offsets ✤ an optional set of features ✤ each feature has a name and a value
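The annotation model described on this slide (offsets plus a set of named features) can be pictured as a small data structure. The sketch below is a plain Python stand-in, not GATE's Java API; the `Annotation` class, the `type` field and the feature names `title` and `rule` are assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    # Hypothetical stand-in for an annotation: a typed span over the document text.
    type: str                 # assumed label such as "Person" or "Token"
    start: int                # offset of the first character covered
    end: int                  # offset just past the last character covered
    features: dict = field(default_factory=dict)  # optional name -> value pairs

text = "Dr. Head is a staff scientist at We Build Rockets Inc."

# The feature names and values here are invented for the example.
ann = Annotation("Person", 0, 8, {"title": "Dr.", "rule": "PersonTitle"})

# Annotations point into the text by offset rather than copying it.
covered = text[ann.start:ann.end]
```

Because the annotation stores offsets instead of a copy of the text, several processing steps can layer annotations over the same document without modifying it.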
  • 59. Information Extraction ✤ TE: Template Elements ✤ NE: Named Entity recognition and typing ✤ TR: Template Relations ✤ CO: CO-reference resolution ✤ ST: Scenario Templates ✤ Example: The shiny red rocket was fired on Tuesday. It is the brainchild of Dr. Big Head. Dr. Head is a staff scientist at We Build Rockets Inc. ✤ NE: Entities are “rocket,” “Tuesday,” “Dr. Head” and “We Build Rockets” ✤ CO: “it” refers to the rocket; “Dr. Head” and “Dr. Big Head” are the same ✤ TE: the rocket is “shiny red” and Head’s “brainchild” ✤ TR: Dr. Head works for “We Build Rockets Inc.” ✤ ST: a rocket launching event occurred with the various participants ✤ From http://gate.ac.uk/sale/talks/ne-tutorial.ppt
  • 60. ANNIE ✤ A Nearly-New Information Extraction System, packaged with GATE, used throughout examples, and a great place to start ✤ A collection of GATE Processing Resources to perform Information Extraction on unstructured text ✤ “Nearly new” – its name 10 years ago, that stuck ✤ Other information extraction systems include LingPipe and OpenNLP. GATE includes wrappers for LingPipe and OpenNLP, independently developed NLP pipelines. All three systems are provided as pre-built applications through the GATE File menu
  • 61. ANNIE ✤ “Processing Resources” inside ANNIE: ✤ Tokenizer, sentence splitter, part-of-speech tagger, gazetteers, named entity tagger, and an orthomatcher ✤ Also included are noun phrase and verb phrase chunkers ✤ Each “Processing Resource” inside ANNIE can be used as part of a pipeline you create to add annotations or modify existing ones ✤ ANNIE is a highly customizable, rule-based system, with very useful defaults
  • 62. ANNIE ✤ “Processing Resources” inside ANNIE: ✤ Gazetteer – lookup annotations (lists) ✤ JAPE transducer – date, person, location, organization, money, percent annotations ✤ Orthomatcher – adds match features to named entity annotations (coreference matching) ✤ Document Reset – removes annotations
  • 63. IE Steps in ANNIE ✤ “Tokenizer” performs Token identification and word segmentation ✤ “Sentence splitter” identifies sentences ✤ “POS” tagger performs Part-of-speech tagging – (noun, verb, adverb, adjective) ✤ Must run Tokenizer and Sentence Splitter before POS tagger
  • 64. IE Steps in ANNIE ✤ “Gazetteers” – lists of names (people, cities, groups); you can modify or add lists ✤ Each list has features (majorType, minorType, language) ✤ Gazetteers generate “Lookup” annotations with features corresponding to the matched list. When the text matches a gazetteer entry, a Lookup annotation is created. ✤ Lookup annotations are used by ANNIE’s Named Entity transducer for entity identification.
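A gazetteer lookup as described on this slide can be approximated with a dictionary of list entries and a scan over the text. The sketch below is illustrative Python, not ANNIE's gazetteer implementation; the list contents, the `lookup` function and the dictionary layout are assumptions, while the `majorType` and `minorType` feature names come from the slide.

```python
import re

# Hypothetical gazetteer lists: entry -> (majorType, minorType).
GAZETTEER = {
    "Washington": ("location", "city"),
    "IBM": ("organization", "company"),
    "Kevin": ("person_first", "male"),
}

def lookup(text, gazetteer=GAZETTEER):
    """Create one Lookup annotation per whole-word match of a gazetteer entry."""
    annotations = []
    for entry, (major, minor) in gazetteer.items():
        for match in re.finditer(r"\b" + re.escape(entry) + r"\b", text):
            annotations.append({
                "type": "Lookup",
                "start": match.start(),
                "end": match.end(),
                "features": {"majorType": major, "minorType": minor},
            })
    # Return annotations in document order.
    return sorted(annotations, key=lambda a: a["start"])

text = "Kevin flew from Washington to visit IBM."
lookups = lookup(text)
```

A later rule-based step can then read the `majorType` feature of each Lookup to decide, for example, that a span is a candidate Location or Organization.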
  • 65. ANNIE in GATE
  • 66. ANNIE in GATE
  • 67. ANNIE in GATE
  • 68. ANNIE Sequence Pipeline sequence matters: tokenizer, sentence splitter, POS tagger, gazetteer
  • 69. IE Steps in ANNIE ✤ “NE Transducer” – Named Entity Transducer performs named entity recognition (NER) ✤ Once we have built up the processing resource pipeline with the previous steps (tokeniser, sentence splitter, POS tagger, gazetteer), we are ready to add the transducer for named entity recognition ✤ More specific information can be added to the features now, including the “kind” of entity, and the rules that were fired
  • 70. IE Steps in ANNIE ✤ “OrthoMatcher” – orthographic co-reference matches proper names and their variants. ✤ Will match previously unclassified names, based on relations with classified entities ✤ Matches “Kevin Lynch” with “Dr. Lynch” ✤ Matches acronyms with expansions
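Why the sequence matters can be shown with a toy pipeline in which each step reads what earlier steps wrote. This is a hedged Python sketch, not GATE code: the step functions, the `run_pipeline` helper and the capitalization-based tagging rule are all invented for illustration.

```python
def tokenizer(doc):
    # Toy tokenizer: whitespace split.
    doc["Token"] = doc["text"].split()

def sentence_splitter(doc):
    # Toy splitter: break on full stops.
    doc["Sentence"] = [s.strip() for s in doc["text"].split(".") if s.strip()]

def pos_tagger(doc):
    # The tagger depends on annotations produced by the two earlier steps.
    if "Token" not in doc or "Sentence" not in doc:
        raise RuntimeError("POS tagger needs Token and Sentence annotations")
    # Toy rule (assumption): capitalized tokens are proper nouns, the rest nouns.
    doc["POS"] = [(t, "NNP" if t[:1].isupper() else "NN") for t in doc["Token"]]

def run_pipeline(text, resources):
    """Run each processing step in order over a shared document dictionary."""
    doc = {"text": text}
    for resource in resources:
        resource(doc)
    return doc

doc = run_pipeline("Joe is CEO at IBM. He is an IEEE member.",
                   [tokenizer, sentence_splitter, pos_tagger])
```

Putting `pos_tagger` first in the list raises the `RuntimeError`, which mirrors the rule that the tokenizer and sentence splitter must run before the POS tagger.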
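The two matches named on this slide, a full name against a titled surname and an acronym against its expansion, can be sketched with simple string rules. This is an illustrative Python approximation and not the OrthoMatcher's actual rule set; the `TITLES` set and the function names are assumptions.

```python
# Hypothetical list of titles to ignore when comparing names.
TITLES = {"dr", "mr", "mrs", "ms", "prof"}

def name_tokens(name):
    # Lowercase the name, drop trailing periods, and remove titles.
    tokens = [t.strip(".").lower() for t in name.split()]
    return [t for t in tokens if t and t not in TITLES]

def same_person(a, b):
    # Simplifying assumption: two mentions co-refer if their last name tokens agree.
    ta, tb = name_tokens(a), name_tokens(b)
    return bool(ta) and bool(tb) and ta[-1] == tb[-1]

def is_acronym_of(acronym, expansion):
    # Compare the acronym with the initials of the capitalized words in the expansion.
    initials = "".join(w[0] for w in expansion.split() if w[0].isupper())
    return acronym.replace(".", "") == initials
```

For example, `same_person("Kevin Lynch", "Dr. Lynch")` and `is_acronym_of("IBM", "International Business Machines")` both return `True`. A surname-only rule would also merge unrelated people who share a surname, which is one reason real matchers use more context.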
  • 71. IE Steps in ANNIE ✤ Tokenizer, sentence splitter, and OrthoMatcher are language, domain, and application-independent ✤ Part-of-speech tagger is language dependent and application-independent ✤ Gazetteer lists are starting points (60K entries) ✤ ANNIE is a way to get started, with a framework for identifying the kinds of elements that matter to your work, and for quickly testing your ideas against existing data
  • 72. Annotations In Context
  • 73. Rules-based Classification ✤ Once a stand-alone project, now often part of annotation services ✤ Regex, Boolean and naive Bayesian algorithms executed on tokens ✤ And, Or, Not, Near (x), Multi, Stem, Exact, Phrase, et al. (vendor or source dependent) ✤ Assigns documents to a taxonomic category ✤ Allows for greater control over depth and breadth of categories ✤ Human aided, machine processed
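The Boolean operators on this slide can be sketched as small composable functions evaluated over a document's tokens. This is a minimal Python illustration, not any vendor's rule syntax; the category names, the rules in `RULES` and the helper names are made up, and only And, Or and Not are covered.

```python
import re

def tokens(text):
    # Lowercased set of word tokens in the document.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

# Rule combinators: each returns a function from a token set to True/False.
def term(word):
    return lambda toks: word in toks

def AND(*rules):
    return lambda toks: all(rule(toks) for rule in rules)

def OR(*rules):
    return lambda toks: any(rule(toks) for rule in rules)

def NOT(rule):
    return lambda toks: not rule(toks)

# Hypothetical taxonomy categories, each defined by a human-written rule.
RULES = {
    "Rocketry": AND(OR(term("rocket"), term("launch")), NOT(term("recipe"))),
    "Finance": OR(term("revenue"), term("earnings")),
}

def classify(text):
    """Return the sorted list of categories whose rule matches the text."""
    toks = tokens(text)
    return sorted(category for category, rule in RULES.items() if rule(toks))
```

Here a person writes and tunes the rules and the machine applies them to every document, which is the "human aided, machine processed" split the slide describes.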
  • 76. Visualization - Prefuse
  • 77. Visualization - Prefuse
  • 78. Visualization - Prefuse
  • 79. Visualization - Prefuse
  • 80. Visualization - Prefuse
  • 81. Visualization - Prefuse
  • 82. Visualization - Gephi
  • 83. Visualization - Gephi
  • 85. Quick! ✤ 1. Take one large pile of text (documents, emails, tweets, patents, papers, transcripts, blogs, comments, acts of parliament, and so on and so forth) -- call this your corpus ✤ 2. Pick a structured description of interesting things in the text (a telephone directory, or chemical taxonomy, or something from the Linked Data cloud) -- call this your ontology ✤ 3. Use GATE Teamware to mark up a gold standard example set of annotations of the corpus (1.) relative to the ontology (2.) ✤ 4. Use GATE Developer to build a semantic annotation pipeline to do the annotation job automatically and measure performance against the gold standard ✤ 5. Take the pipeline from 4. and apply it to your text pile using GATE Cloud (or embed it in your own systems using GATE Embedded) ✤ 6. Use GATE Mimir to store the annotations relative to the ontology in a multiparadigm index server. (For techies: this sits in the backroom as a RESTful web service.) ✤ 7. Use Ontotext KIM to add semantic search, knowledge facet search, ontology browsing, entity popularity graphing, time series graphing, annotation structure search and (last but not least) boolean full text search. (More techy stuff: mash up these types of search with your existing UIs.)
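Measuring a pipeline against a gold standard usually comes down to precision, recall and F1 over the two sets of annotations. The sketch below is a generic Python calculation using strict matching, not GATE's own evaluation tooling; the `score` function and the example spans and offsets are invented for illustration.

```python
def score(gold, predicted):
    """Strict-match precision, recall and F1 between two sets of annotations."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical annotations as (start, end, type) triples.
gold = {(0, 8, "Person"), (33, 54, "Organization"), (60, 67, "Date")}
predicted = {(0, 8, "Person"), (33, 54, "Organization"), (70, 75, "Location")}

precision, recall, f1 = score(gold, predicted)
```

With these invented sets, two of the three predicted annotations are correct and two of the three gold annotations are found, so precision, recall and F1 all come out at two thirds. Strict matching counts a span with slightly different offsets as a miss; lenient schemes that credit overlaps are a common alternative.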
  • 86. Data Warehousing / Business Intelligence ✤ Perspective ✤ Process ✤ Use cases ✤ Implications with unstructured data
  • 87. DW/BI Perspective ✤ Structured data is an incomplete version of the “truth” ✤ Until information is quantified, it is not very useful ✤ Discover facts, and give them structure ✤ Complement structured data with unstructured data; try to complete the picture (of the business, the customer, performance)
  • 88. DW/BI Process ✤ Extract, then formalize ✤ Give information structure, then associations ✤ Map to existing structures in the data warehouse
  • 89. DW/BI Use Cases ✤ Report indexing (of metadata, of instances) ✤ Report sections become possible ✤ Self-service for consumers ✤ “BI Search” (of those reports) ✤ Include in portal ✤ As range of reports and users increases, unstructured data approaches have more value
  • 90. DW/BI Use Case Ideas ✤ For customers, products, complaints, locations: ✤ Voice recognition indexing ✤ RSS feeds ✤ Wikis, blogs (internal and external) ✤ Instant messages
  • 91. DW/BI Implications ✤ Have to store these results ✤ Have to model these results ✤ Have to map these results to something meaningful ✤ Have to include the results in a useful way (Where? Use taxonomies? Which ones?) ✤ Quality, cost, and complexity matter; extracted entities don’t relate directly to performance ✤ Not a replacement, an addition to the technology
  • 92. Some Technical Issues ✤ Quality ✤ Integration ✤ Concurrency ✤ Security ✤ Skills
  • 93. Additional Open Tools ✤ UIMA – Unstructured Information Management Architecture (IBM’s Watson uses this), originated at IBM, now an Apache project. ✤ Component software architecture with a document processing pipeline similar to GATE. Focus on performance and scalability, with distributed processing (web services).
  • 94. UIMA ✤ UIMA’s Basic Building Blocks are Annotators. They iterate over an artifact to discover new types based on existing ones and update the Common Analysis Structure (CAS) for upstream processing. ✤ UIMA CAS representation now aligned with XMI standard ✤ [Chart labels: Artifact (e.g., Document): “Fred Center is the CEO of Center Micros”; Analysis Results (i.e., Artifact Metadata): Parser (NP, VP, PP); Named Entity (Person, Organization); Relationship CeoOf (Arg1: Person, Arg2: Org)] Chart by IBM
  • 95. UIMA Image by IBM
  • 96. Commercial Tools ✤ Oracle Data Mining (Text Mining) ✤ IBM SPSS ✤ SAS Text Miner ✤ Smartlogic ✤ Lots of acquisitions going on in the “big data” space ✤ HP acquired Autonomy ✤ Oracle acquired Endeca
  • 97. A Note on Tools ✤ UIMA and GATE – comprehensive suites of capabilities, with learning curves. ✤ Commercial tools range from unstructured capabilities inside DBMSs like Oracle, to Business Objects business intelligence tools (Business Objects acquired Inxight, which came out of Xerox PARC). ✤ Your mileage will vary. The biggest differentiator is your knowledge of your data.
  • 99. Thank you Christine Connors Kevin Lynch www.triviumrlg.com
  • 100. What can unstructured data look like post-processing?