Getting Started with Unstructured Data
    Christine Connors & Kevin Lynch
    TriviumRLG LLC

    November 17, 2011


Thursday, November 17, 2011
Meta

    ✤   Presenter: Christine Connors

         ✤    @cjmconnors

    ✤   Presenter: Kevin Lynch

         ✤    @kevinjohnlynch

    ✤   Principals at www.triviumrlg.com

    ✤   Partnering with Dataversity


Agenda

    ✤   What is unstructured data?

    ✤   Where do we find it?

    ✤   How important is it?

    ✤   How do we visualize it?

    ✤   Machine processing for actionable data

    ✤   Tools


What is unstructured data?


    ✤   Data which

         ✤    Is not in a database

         ✤    Does not adhere to a formal data model

    ✤   Content




Isn’t that a misnomer?

    ✤   Problematic term

    ✤   The presence of object metadata or aesthetic markup does not alone
        give ‘structure’ in this sense of the word

         ✤    Object metadata = machine or applied properties

         ✤    Aesthetic markup = stylesheets; rendering information

    ✤   Semi-structured data is typically treated as unstructured for the
        purposes of machine processing and analysis


Types of ‘un’structured data



    ✤   Text-based documents

         ✤    Word processing, presentations, email, blogs, wikis, tweets, web
              pages, web components (read/write web)

    ✤   Audio/video files




Where do we find it?

    ✤   Office productivity suites

    ✤   Content management systems

    ✤   Digital asset management systems

    ✤   Web content management systems

         ✤    Wikis, blogs, comment & discussion threads

    ✤   Social networking tools

         ✤    Twitter, Yammer, instant messengers

Is it really that important?
    [Pie chart: Structured 15%, Unstructured 85%]




What’s in that 80-85%?




    ✤   Progress reports -
        created in a word processor




What’s in that 80-85%?




    ✤   Dashboards -
        created in presentation software




What’s in that 80-85%?



    ✤   Progress reports -
        color coded text in a
        spreadsheet




What’s in that 80-85%?



    ✤   Brainstorming -
        in messaging systems

    ✤   Decision making - in email




What’s in that 80-85%?




    ✤   Business intelligence - on the
        web and more




How can we make the data more actionable?

    ✤   Identify it

    ✤   Convert to a format you can work with

    ✤   Add structure, meaning:

         ✤    information extraction

         ✤    annotation

         ✤    content analytics
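As one illustration of the "convert to a format you can work with" step, here is a minimal sketch that strips rendering markup from an HTML page using Python's standard-library `html.parser`. The class name `TextExtractor`, the helper `html_to_text`, and the sample HTML are hypothetical and not from the talk; real documents would need more careful handling (encodings, entities, layout).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips <script> and <style> content."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside skipped elements
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Made-up sample page: a stylesheet (aesthetic markup) plus a short report
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Status</h1><p>Project is <b>on track</b></p></body></html>"
)
print(html_to_text(sample))
```

Once the markup is gone, the plain text can be passed to the extraction and analytics steps listed above.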


What about enterprise search?


    ✤   First line of defense

    ✤   Points you at the data ranked most relevant, via pattern matching
        and statistical analysis

    ✤   Does not assist in other visualizations or transformations without
        further machine processing
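To make "relevancy ranking via statistical analysis" concrete, here is a rough sketch of TF-IDF scoring, one common statistical weighting; the talk does not say which method any particular search engine uses, so treat this as an assumption. The function names and the three sample documents are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query, documents):
    """Return (score, doc_index) pairs sorted from most to least relevant."""
    doc_tokens = [tokenize(d) for d in documents]
    n_docs = len(documents)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in doc_tokens for term in set(tokens))
    results = []
    for i, tokens in enumerate(doc_tokens):
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in tf:
                # Rare terms get a higher inverse document frequency
                idf = math.log(n_docs / df[term]) + 1.0
                score += (tf[term] / len(tokens)) * idf
        results.append((score, i))
    # Highest score first; ties broken by document order
    return sorted(results, key=lambda pair: (-pair[0], pair[1]))


# Made-up documents for illustration
docs = [
    "Quarterly progress report for the search project",
    "Lunch menu for Thursday",
    "Search relevance tuning: progress on search ranking",
]
for score, index in rank("search progress", docs):
    print(round(score, 3), docs[index])
```

The ranking only orders documents; as the slide notes, turning the hits into other visualizations or transformations still requires further processing.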




Information Extraction


    ✤   Token identification - “tokenization”

    ✤   Part-of-speech tagging - “POS” tagging (noun, verb, adverb, adjective,
        etc.)

    ✤   Phrase identification - noun phrase

    ✤   Entity extraction - people, places, events, dates, organizations
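The first and last steps above can be sketched with nothing but regular expressions. This is a deliberately crude illustration, not how GATE, UIMA, or any commercial tool does it: the patterns, function names, and sample sentence are all assumptions made for this example, and part-of-speech tagging is left out because it normally relies on a trained model.

```python
import re

# Words, or single punctuation characters, as separate tokens
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

# Dates written like "November 17, 2011"
DATE_RE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|September|"
    r"October|November|December) \d{1,2}, \d{4}\b"
)

# Two or more consecutive capitalized words, a crude stand-in for names
NAME_RE = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


def tokenize(text):
    """Token identification: split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def extract_entities(text):
    """Entity extraction: pull out dates and candidate person names."""
    dates = DATE_RE.findall(text)
    # Blank out the dates first so month names are not taken for names
    remaining = DATE_RE.sub(" ", text)
    names = NAME_RE.findall(remaining)
    return {"dates": dates, "name_candidates": names}


# Made-up sentence for illustration
sentence = "Fred Center met Joe Smith on November 17, 2011."
print(tokenize(sentence))
print(extract_entities(sentence))
```

Pattern-based extraction like this breaks quickly on real text (single-word names, lowercase brands, other date formats), which is one reason the dedicated tools discussed later exist.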




Information Extraction

    ✤   Cluster analysis - group related information, where relationship may
        not be known

    ✤   Classification - mapping to specific categories

    ✤   Dependency identification / Rule generation

    ✤   Relationship detection - e.g. “Joe” “is CEO” at “IBM”

    ✤   Summarization - key concepts or key sentences
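Relationship detection, as in the "Joe" "is CEO" at "IBM" example above, can be approximated with a hand-written rule. The sketch below assumes a single hypothetical sentence pattern ("<Person> is [the] <Role> of|at <Organization>") and a fixed list of roles; production systems typically build on the earlier extraction steps rather than on one regular expression.

```python
import re

# Hypothetical rule: "<Person> is [the] <Role> (of|at) <Organization>"
RELATION_RE = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)*) is (?:the )?"
    r"(?P<role>CEO|CTO|CFO) (?:of|at) "
    r"(?P<org>[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)*)"
)


def detect_relationships(text):
    """Return (person, relation, organization) triples found in the text."""
    return [
        # "CEO" becomes the relation name "CeoOf"
        (m.group("person"), m.group("role").capitalize() + "Of", m.group("org"))
        for m in RELATION_RE.finditer(text)
    ]


# Made-up text for illustration
text = "Joe is CEO at IBM. Fred Center is the CEO of Center Micros."
for triple in detect_relationships(text):
    print(triple)
```

Each triple is structured data: it could be stored in a table or a graph, which is the sense in which extraction "adds structure" to content.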


Open Tools
   ✤    GATE – General Architecture for
        Text Engineering, from the
        University of Sheffield, with many
        users and excellent documentation.

   ✤    GATE has customizable document
        and corpus processing pipelines.
        GATE is an architecture, a
        framework, and a development
        environment, with a clean separation
        of algorithms, data, and
        visualization.


Open Tools

   ✤    UIMA – Unstructured Information
        Management Architecture (IBM’s
        Watson uses this), originated at
        IBM, now an Apache project.

   ✤    Component software architecture
        with a document processing
        pipeline similar to GATE. Focus on
        performance and scalability, with
        distributed processing (web
        services).


UIMA
    UIMA’s basic building blocks are annotators. They iterate over an artifact to discover new
    types based on existing ones and update the Common Analysis Structure (CAS) for
    downstream processing.

    [Diagram: Common Analysis Structure (CAS). Chart by IBM.]

    Analysis results (i.e., artifact metadata), top layer to bottom:

         Relationship:   CeoOf (Arg1: Person, Arg2: Org)
         Named Entity:   Person, Organization
         Parser:         NP, VP, PP

    Artifact (e.g., document):   Fred Center is the CEO of Center Micros

    Note: the UIMA CAS representation is now aligned with the XMI standard.
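The annotator idea can be sketched in a few lines of Python. This is a toy model only: it does not use the Apache UIMA API, and the names `Cas`, `Annotation`, `run_pipeline`, and the three annotator functions, along with their matching rules, are hypothetical. It reuses the sentence from the IBM chart to show how a later annotator builds a new type (CeoOf) from types added by earlier ones (Person, Organization).

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    type: str
    begin: int
    end: int
    features: dict = field(default_factory=dict)


@dataclass
class Cas:
    """Toy stand-in for a CAS: the artifact text plus accumulated annotations."""
    text: str
    annotations: list = field(default_factory=list)

    def select(self, type_name):
        return [a for a in self.annotations if a.type == type_name]

    def covered_text(self, annotation):
        return self.text[annotation.begin:annotation.end]


def person_annotator(cas):
    # Hypothetical rule: a capitalized two-word span followed by " is the CEO"
    for m in re.finditer(r"[A-Z][a-z]+ [A-Z][a-z]+(?= is the CEO)", cas.text):
        cas.annotations.append(Annotation("Person", m.start(), m.end()))


def organization_annotator(cas):
    # Hypothetical rule: a capitalized two-word span preceded by "CEO of "
    for m in re.finditer(r"(?<=CEO of )[A-Z][a-z]+ [A-Z][a-z]+", cas.text):
        cas.annotations.append(Annotation("Organization", m.start(), m.end()))


def ceo_of_annotator(cas):
    # Builds a new type from existing ones: pairs each Person with the
    # first Organization that starts after it.
    for person in cas.select("Person"):
        for org in cas.select("Organization"):
            if org.begin > person.end:
                cas.annotations.append(
                    Annotation("CeoOf", person.begin, org.end,
                               {"arg1": person, "arg2": org})
                )
                break


def run_pipeline(text, annotators):
    """Run each annotator in order over one shared Cas object."""
    cas = Cas(text)
    for annotate in annotators:
        annotate(cas)
    return cas


cas = run_pipeline(
    "Fred Center is the CEO of Center Micros",
    [person_annotator, organization_annotator, ceo_of_annotator],
)
for relation in cas.select("CeoOf"):
    print(cas.covered_text(relation.features["arg1"]), "->",
          cas.covered_text(relation.features["arg2"]))
```

Order matters in this sketch: `ceo_of_annotator` finds nothing unless the Person and Organization annotators have already updated the shared structure.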
UIMA




    [Image by IBM]
Commercial Tools

    ✤   Oracle Data Mining (Text Mining)

    ✤   IBM SPSS

    ✤   SAS Text Miner

    ✤   Smartlogic

    ✤   Lots of acquisitions going on in the “big data” space

         ✤    HP acquired Autonomy

         ✤    Oracle acquired Endeca

A Note on Tools

    ✤    UIMA and GATE – comprehensive suite of capabilities, with learning
         curves.

    ✤    Commercial tools range from unstructured capabilities inside DBMSs
         like Oracle, to business intelligence tools from Business Objects
         (which acquired Inxight, a Xerox PARC spin-off).

    ✤    Your mileage will vary. The biggest differentiator is your knowledge
         of your data.




What can unstructured data look like post-processing?




Machine Processing


    [Diagram: Unstructured Data feeds a Machine Processing Platform made up of Natural
    Language Processing, Statistical Analysis, Rules-based Classification, and Semantic
    Analysis. Other labeled components: Federated Search, API, Index, Data Stores, and
    Visualizations.]
Questions?




Thank you
     Christine Connors
     Kevin Lynch
     www.triviumrlg.com



