Getting Started with Unstructured Data
1. Getting Started with Unstructured Data
Christine Connors & Kevin Lynch
TriviumRLG LLC
November 17, 2011
Thursday, November 17, 2011
2. Meta
✤ Presenter: Christine Connors
✤ @cjmconnors
✤ Presenter: Kevin Lynch
✤ @kevinjohnlynch
✤ Principals at www.triviumrlg.com
✤ Partnering with Dataversity
3. Agenda
✤ What is unstructured data?
✤ Where do we find it?
✤ How important is it?
✤ How do we visualize it?
✤ Machine processing for actionable data
✤ Tools
4. What is unstructured data?
✤ Data which is
✤ Not in a database
✤ Does not adhere to a formal data model
✤ Content
5. Isn’t that a misnomer?
✤ Problematic term
✤ The presence of object metadata or aesthetic markup does not alone
give ‘structure’ in this sense of the word
✤ Object metadata = machine or applied properties
✤ Aesthetic markup = stylesheets; rendering information
✤ Semi-structured data is typically treated as unstructured for the
purposes of machine processing and analysis
6. Types of ‘un’structured data
✤ Text-based documents
✤ Word processing, presentations, email, blogs, wikis, tweets, web
pages, web components (read/write web)
✤ Audio/video files
7. Where do we find it?
✤ Office productivity suites
✤ Content management systems
✤ Digital asset management systems
✤ Web content management systems
✤ Wikis, blogs, comment & discussion threads
✤ Social networking tools
✤ Twitter, Yammer, instant messengers
8. Is it really that important?
[Pie chart: Structured 15%, Unstructured 85%]
9. What’s in that 80-85%?
✤ Progress reports - created in a word processor
10. What’s in that 80-85%?
✤ Dashboards - created in presentation software
11. What’s in that 80-85%?
✤ Progress reports - color-coded text in a spreadsheet
12. What’s in that 80-85%?
✤ Brainstorming - in messaging systems
✤ Decision making - in email
13. What’s in that 80-85%?
✤ Business intelligence - on the web and more
14. How can we make the data more actionable?
✤ Identify it
✤ Convert to a format you can work with
✤ Add structure, meaning:
✤ information extraction
✤ annotation
✤ content analytics
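The steps above (identify, convert, add structure) can be sketched in a few lines of Python. This is a minimal illustration, not a recommended toolchain: it assumes the unstructured source is an HTML page, uses only the standard library, and the names `TextExtractor`, `html_to_text`, `annotate` and the sample page are all hypothetical.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Convert step: keep text nodes, drop markup plus <script>/<style> bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string as one space-joined line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Add-structure step: a deliberately simple pattern for e-mail addresses.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def annotate(text):
    """Return stand-off annotations (type, offsets, covered text)."""
    return [
        {"type": "Email", "begin": m.start(), "end": m.end(), "text": m.group()}
        for m in EMAIL.finditer(text)
    ]


# Hypothetical sample page, for illustration only.
SAMPLE_HTML = (
    "<html><head><style>p {color: red}</style></head><body>"
    "<h1>Status</h1><p>Send updates to <b>pm@example.com</b> by Friday.</p>"
    "</body></html>"
)

if __name__ == "__main__":
    text = html_to_text(SAMPLE_HTML)
    print(text)
    print(annotate(text))
```

Keeping annotations as offsets into the converted text, rather than editing the text itself, leaves the original content untouched for later processing stages.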
15. What about enterprise search?
✤ First line of defense
✤ Points you to the highest relevance-ranked data via pattern matching
and statistical analysis
✤ Does not assist in other visualizations or transformations without
further machine processing
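To make "pattern matching and statistical analysis" concrete, here is a minimal sketch of relevance ranking with a TF-IDF score. It is only one common weighting scheme, chosen for illustration; real enterprise search engines use more elaborate ranking. The function names and the three sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Pattern matching: lowercase the text and pull out alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query, documents):
    """Return (doc_id, score) pairs sorted by descending TF-IDF score."""
    tokenized = {doc_id: tokenize(body) for doc_id, body in documents.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if tf[term]:
                # Rare terms weigh more; the +1 keeps common terms above zero.
                idf = math.log(n_docs / df[term]) + 1.0
                score += (tf[term] / len(tokens)) * idf
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical mini-corpus, for illustration only.
DOCS = {
    "report": "Quarterly progress report: budget on track, schedule at risk.",
    "email": "Decision: we approve the budget increase for the search project.",
    "wiki": "Team lunch is on Friday.",
}

if __name__ == "__main__":
    for doc_id, score in rank("budget search", DOCS):
        print(f"{doc_id}: {score:.3f}")
```

The ranked list tells you where to look, but it adds no structure to the documents themselves, which is why further machine processing is still needed.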
17. Information Extraction
✤ Cluster analysis - group related information, where relationship may
not be known
✤ Classification - mapping to specific categories
✤ Dependency identification / Rule generation
✤ Relationship detection - e.g. “Joe” “is CEO” at “IBM”
✤ Summarization - key concepts or key sentences
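As an illustration of relationship detection, the sketch below pulls (person, role, organization) triples out of text with a single hand-written pattern. It is a toy under stated assumptions: names are capitalized words, the role comes from a small fixed list, and the pattern and function names are hypothetical. Production systems typically combine named-entity recognition with parsing or learned models.

```python
import re

# Hypothetical pattern: "<Person> is [the] <ROLE> of|at <Organization>".
RELATION = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)*)"   # one or more capitalized words
    r" is (?:the )?"
    r"(?P<role>CEO|CTO|CFO)"                       # small, fixed role list
    r" (?:of|at) "
    r"(?P<org>[A-Z][\w&]*(?: [A-Z][\w&]*)*)"       # capitalized organization name
)


def detect_relations(text):
    """Return a list of (person, role, organization) triples found in text."""
    return [
        (m.group("person"), m.group("role"), m.group("org"))
        for m in RELATION.finditer(text)
    ]


if __name__ == "__main__":
    sample = "Joe is CEO at IBM. Fred Center is the CEO of Center Micros."
    for triple in detect_relations(sample):
        print(triple)
```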
18. Open Tools
✤ GATE – General Architecture for
Text Engineering, from the
University of Sheffield, with many
users and excellent documentation.
✤ GATE has customizable document
and corpus processing pipelines.
GATE is an architecture, a
framework, and a development
environment, with a clean separation
of algorithms, data, and
visualization.
19. Open Tools
✤ UIMA – Unstructured Information
Management Architecture (IBM’s
Watson uses this), originated at
IBM, now an Apache project.
✤ Component software architecture
with a document processing
pipeline similar to GATE. Focus on
performance and scalability, with
distributed processing (web
services).
20. UIMA
UIMA’s basic building blocks are annotators. They iterate over an artifact to discover new
types based on existing ones and update the Common Analysis Structure (CAS) for
downstream processing.
[Figure: UIMA CAS representation, now aligned with the XMI standard. Chart by IBM.]
✤ Artifact (e.g., document): “Fred Center is the CEO of Center Micros”
✤ Analysis results (i.e., artifact metadata) held in the Common Analysis Structure (CAS):
✤ Parser: NP, VP, PP
✤ Named Entity: Person, Organization
✤ Relationship: CeoOf (Arg1: Person, Arg2: Org)
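The annotator idea can be sketched in plain Python. This is not the UIMA API (UIMA itself is a Java framework with its own type system); `Cas`, `Annotation`, both annotator functions and the gazetteer are hypothetical stand-ins that only mirror the concept: each annotator reads the shared analysis structure, finds new types based on existing ones, and writes its results back for later stages.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    type: str
    begin: int
    end: int
    features: dict = field(default_factory=dict)


@dataclass
class Cas:
    """Toy stand-in for a Common Analysis Structure: artifact plus metadata."""
    text: str
    annotations: list = field(default_factory=list)

    def select(self, type_name):
        return [a for a in self.annotations if a.type == type_name]

    def covered_text(self, ann):
        return self.text[ann.begin:ann.end]


def named_entity_annotator(cas):
    # Hypothetical gazetteer lookup; a real annotator would use a model or rules.
    gazetteer = {"Fred Center": "Person", "Center Micros": "Organization"}
    for name, type_name in gazetteer.items():
        start = cas.text.find(name)
        while start != -1:
            cas.annotations.append(Annotation(type_name, start, start + len(name)))
            start = cas.text.find(name, start + 1)


def ceo_of_annotator(cas):
    # Builds on existing Person/Organization annotations to add a new type.
    for person in cas.select("Person"):
        for org in cas.select("Organization"):
            between = cas.text[person.end:org.begin]
            if person.end <= org.begin and "CEO of" in between:
                cas.annotations.append(
                    Annotation("CeoOf", person.begin, org.end,
                               {"Arg1": person, "Arg2": org})
                )


def run_pipeline(text, annotators):
    """Run annotators in order; each one sees the results of those before it."""
    cas = Cas(text)
    for annotate in annotators:
        annotate(cas)
    return cas


if __name__ == "__main__":
    cas = run_pipeline("Fred Center is the CEO of Center Micros",
                       [named_entity_annotator, ceo_of_annotator])
    for rel in cas.select("CeoOf"):
        print(cas.covered_text(rel.features["Arg1"]), "-> CeoOf ->",
              cas.covered_text(rel.features["Arg2"]))
```

Order matters in this sketch: the relationship annotator finds nothing unless the named-entity annotator has already run.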
22. Commercial Tools
✤ Oracle Data Mining (Text Mining)
✤ IBM SPSS
✤ SAS Text Miner
✤ Smartlogic
✤ Lots of acquisitions going on in the “big data” space
✤ HP acquired Autonomy
✤ Oracle acquired Endeca
23. A Note on Tools
✤ UIMA and GATE – comprehensive suites of capabilities, with learning
curves.
✤ Commercial tools range from unstructured capabilities inside DBMSs
like Oracle, to business intelligence tools from Business Objects (which
acquired Inxight, a Xerox PARC spin-off).
✤ Your mileage will vary. The biggest differentiator is your knowledge
of your data.
25. Machine Processing
[Diagram: Machine Processing Platform]
✤ Top row: Unstructured Data, Natural Language Processing, Rules-based Classification, Statistical Analysis, Semantic Analysis
✤ Middle: Machine Processing Platform, with Federated Search, API, Index
✤ Bottom: Visualizations, Data Stores
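Of the processing stages named in this diagram, rules-based classification is the easiest to sketch. The example below is a minimal illustration only: the categories, keyword rules and function name are hypothetical, and a real rule set would be driven by knowledge of your own data.

```python
import re

# Hypothetical rules: category -> regular expressions that signal it.
RULES = {
    "progress_report": [r"\bstatus\b", r"\bmilestone\b", r"\bon track\b"],
    "decision": [r"\bapproved?\b", r"\bdecided\b", r"\bsign[- ]off\b"],
    "brainstorm": [r"\bwhat if\b", r"\bidea\b"],
}


def classify(text, rules=RULES):
    """Return matching categories, strongest first (ties broken by name)."""
    hits = {}
    for category, patterns in rules.items():
        count = sum(
            len(re.findall(pattern, text, flags=re.IGNORECASE))
            for pattern in patterns
        )
        if count:
            hits[category] = count
    return sorted(hits, key=lambda c: (-hits[c], c))


if __name__ == "__main__":
    print(classify("Status: milestone 2 is on track. Budget approved."))
    print(classify("Team lunch is on Friday."))
```

The resulting labels can be stored next to each document, which is what lets an index, data store or visualization treat the content as structured.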