Presented by Jeremy Bentley | Smartlogic. See conference video: http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012
As Big Data becomes more pervasive, the need for increased metadata management becomes critical to the understanding and mining of that content. Metadata is what unlocks the value of information assets. When metadata is well managed, the information assets are more useful and valuable. Badly managed metadata can make information assets less useful and less valuable — creating increased costs and risks related to those assets. During this presentation, we'll discuss the different types of metadata, the role of search and analytics in Big Data and the integration of Apache Solr with Content Intelligence to enable better metadata management of Big Data.
Big Data Meets Metadata – Analyzing Large Data Sets
1. Smartlogic™
Lucene Revolution 2012
Jeremy Bentley, CEO
2. 1st degree of order
Filing management
• 80% of enterprise information is unstructured
• Doubling every 19 months and accelerating [Gartner]
• Increasing burden of compliance
• Enterprise 2.0 additions
• Big Data connotations
3. 2nd degree of order
Index management
• File plans and metadata schema
• Manually applied classification
• Low level of consistency and quality
4. 3rd degree of order
Automation of 1st & 2nd Degrees
• Enterprise Search
• Content Management
• Portal Infrastructure
• Document Management
• SharePoint
• Records Management
• Publishing Systems
• Process Management & Workflow
• Digital Asset Management
• eDiscovery
5. A 10 year Flatline
User Search Satisfaction: 50% (2001), 48% (2011)
• 2001, IDC, “Quantifying Enterprise Search”: Searchers are successful in finding what they seek 50% of the time or less
• 2011, MindMetre/SmartLogic: More than half (52%) cannot find the information they need using their Enterprise search system
6. The explosion of information
Terabytes of data:
• 1993-2001: 4Tb
• 2001-2009: 80Tb
• After 2009: ?
20 times increase in information volume
Source: the National Archives
7. Volume + other disruptive factors
• Velocity
• Variety
• Complexity
• Cross-organizational and cross platform information needs
• Changing requirements for information over time
Copyright © 2011 Smartlogic Semaphore Limited
8. New 4th degree of order
Content Intelligence
• Enterprise Search
• Content Management
• Portal Infrastructure
• Document Management
• SharePoint
• Records Management
• Publishing Systems
• Process Management & Workflow
• Digital Asset Management
• eDiscovery
9. Content Intelligence
• Information Manufacturing
• Monetisation
• Knowledge Recovery
• Metadata
• Data Loss Prevention
• Risk & Compliance
• Content Analytics
11. Metadata
Categories: Information, Process, Structural
• Subject
• Location
• Project
• Function (IT, HR, Finance)
• Creation Date
• Modified Date
• Author
• Format (PDF, DOC, XLS)
• Protective Marker
• Expiry
• Retention
• Publisher
• Expert
• Site
12. 4th degree of order
Content Intelligence
• Content Intelligence Platform
• FAST
• SharePoint
13. What is Content Intelligence
Content Intelligence is the process of IDENTIFYING, CLASSIFYING, EXTRACTING, ANALYZING and SURFACING information based on its meaning and context to make timely and informed business decisions.
14. Content Intelligence Solutions
• KNOWLEDGE ACQUISITION & REUSE
• MICROTARGETING & DISTRIBUTION
• GOVERNANCE, COMPLIANCE & RISK
• WEB-BASED SELF SERVICE
15. Big Data + Content Intelligence
From Gartner, 2011
16. Semaphore – Three Core Capabilities
SEMAPHORE
• Ontology Manager (Semantic Model): Build, Manage and Deploy Vocabularies/Libraries
• Classification Server (Apply, Content): Automate the Metadata Enrichment
• Semantic Enhancement Server (Expose, Users): Inform; Explore data to find insights
17. Enterprise Classification
Important requirements for Velocity/Volume:
• Scalability for large volumes of content, users, metadata and systems
• Easy integration with processing systems - search, content, records and document management systems as well as file shares and content migration tools
• Support for all the organization's languages and data formats
19. Metadata Generation
Categories: Information, Process, Structural
• Brand
• Service
• Geography
• Products
• Creation Date
• Modified Date
• Author
• Format (PDF, DOC, XLS)
• Expert
• Protective Marker
• Retention
• Publisher
• Expiry
• Site
21. Without Accurate Metadata
Big Data has its perils. With huge data sets and fine-grained measurement, there is increased risk of “false discoveries.” The trouble with seeking a meaningful needle in massive haystacks of data is that “many bits of straw look like needles.”
- Trevor Hastie, Statistics Professor at Stanford University
22. What Classification Must Handle
Capability | Included
• Look for all the vocabulary associated with topic/entity
• Determine aboutness / avoid passing mentions
• Address term ambiguity
• Handle stemming errors
• Determine if topics in the same context
• Split documents into components
• Generate scores (so most relevant content bubbles to top)
• Show dynamic summaries to users
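A few of the capabilities above (matching the full vocabulary for a topic, ignoring passing mentions, and generating scores) can be illustrated with a minimal sketch. This is not Semaphore's rule language or API; the taxonomy, the `classify` function and the `min_hits` threshold are hypothetical and exist only to show the idea.

```python
import re

# Hypothetical mini-taxonomy: topic -> vocabulary (preferred term plus synonyms).
TAXONOMY = {
    "Records Management": ["records management", "retention schedule", "file plan"],
    "eDiscovery": ["ediscovery", "e-discovery", "legal hold"],
}


def classify(text, taxonomy=TAXONOMY, min_hits=2):
    """Return (topic, score) pairs sorted by score, highest first.

    The score is the number of vocabulary hits for the topic. Topics with
    fewer than min_hits hits are treated as passing mentions and dropped.
    """
    lowered = text.lower()
    scores = {}
    for topic, terms in taxonomy.items():
        # Count whole-phrase matches for every term in the topic's vocabulary.
        hits = sum(
            len(re.findall(r"\b" + re.escape(term) + r"\b", lowered))
            for term in terms
        )
        if hits >= min_hits:
            scores[topic] = hits
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


doc = (
    "The file plan defines each retention schedule. "
    "Records management staff own the file plan. "
    "A legal hold was mentioned once."
)
print(classify(doc))
```

Here the single "legal hold" mention falls below the threshold, so only the records topic is returned. A real classifier would also need the ambiguity, stemming and context handling listed on the slide, which this sketch leaves out.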
23. Enhancing Metadata
• Accurately classify content into subject areas defined in a taxonomy/ontology
• Entity extraction (Text Mining)
• Sentiment Analysis
• Fact Extraction
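One way to picture how enriched metadata could reach Apache Solr is to attach the classification topics and extracted entities to each document as extra fields before indexing. The sketch below only builds the JSON body that Solr's update handler accepts as an array of documents; it does not call Solr or Semaphore. The field names (`topic_ss`, `entity_ss`, `top_topic_s`) are assumptions that rely on dynamic-field suffixes being defined in the schema, and the function names are hypothetical.

```python
import json


def to_solr_doc(doc_id, title, topics, entities):
    """Build one Solr document carrying enriched metadata.

    topics maps topic name -> classification score; entities is a list of
    extracted entity strings. All field names are hypothetical.
    """
    ranked = sorted(topics.items(), key=lambda kv: kv[1], reverse=True)
    doc = {
        "id": doc_id,
        "title": title,
        "topic_ss": [name for name, _ in ranked],  # topics, best score first
        "entity_ss": sorted(set(entities)),        # de-duplicated entities
    }
    if ranked:
        doc["top_topic_s"] = ranked[0][0]
    return doc


def update_payload(docs):
    """Serialize documents as a JSON array for an update request."""
    return json.dumps(docs)


enriched = to_solr_doc(
    "doc-1",
    "Retention policy",
    {"Records Management": 4, "eDiscovery": 2},
    ["Smartlogic", "Gartner", "Smartlogic"],
)
print(update_payload([enriched]))
```

Keeping topics and entities in separate multi-valued fields is a design choice that lets each one be used as its own facet at query time.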
24. Physical Architecture
[Architecture diagram. Recoverable labels:]
• Ontology Management Services: Ontology Manager Desktop and Ontology Manager Standalone Desktop (Win 7, Vista; 2Gb RAM; 2GHz Dual CPU); Ontology Manager Server with Ontology Instance 1 and Ontology Instance 2 (Port 8001, Port 8002; Win 7, Vista, 2003, 2008 + R2; Linux; 2Gb RAM; 2GHz CPU); Optional RDBMS data store (Oracle, MySQL, SQL Server 2005 + 2008 + 2008 R2)
• Content Classification Server: Classification Server with Classification Instance (Port 5058); Classification Test Interface (Internet Explorer, Firefox); Rule and Template Editor (Win 7, Vista). CPU and RAM intensive. Scale to volume of content
• Semantic Enhancement Server: Search Enhancement Server with Search Enhancement Instance; IIS/Apache HTTP Server. RAM and disk access intensive. Scale to expected peak search throughput and number of publishing users
• Server platforms shown: Windows Server 2003, 2008 (32bit/64bit) + R2; Linux; 2Gb RAM; 2GHz Dual CPU
• Integration Components: GSA Extensions (Google Classification Handler, Dispatcher Proxy; scale for throughput of GSA Indexing Crawler); FAST Extensions; SharePoint Extensions (Document Library Components, Search Web Parts); Semaphore Document Processor; Search Application Framework
• Search platforms: Microsoft FAST ESP (Server Farm); Microsoft Office SharePoint Server 2007 / 2010 (Server Farm); Google Search Appliance; SOLR
29. How Else Does Semaphore Help
Disambiguate queries
Perfectly formed filters
organised by facet
Graphical drill down
Explore relationships
Supporting documents
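Filters organised by facet map naturally onto Solr's standard faceting parameters. The sketch below builds a query string using the common Solr parameters `q`, `fq`, `facet`, `facet.field` and `rows`; it is an illustration under assumptions, not how Semaphore's Search Application Framework issues queries. The field names and the `facet_query` helper are hypothetical.

```python
from urllib.parse import urlencode, parse_qs


def facet_query(user_query, facet_fields, filters=None):
    """Build a Solr select query string with facets and drill-down filters.

    filters maps field name -> selected value. Values are wrapped in double
    quotes as phrase filters, so they must not contain double quotes.
    """
    params = [("q", user_query), ("facet", "true"), ("rows", "10")]
    # One facet.field parameter per facet shown to the user.
    params += [("facet.field", field) for field in facet_fields]
    # One fq parameter per facet value the user has drilled into.
    for field, value in (filters or {}).items():
        params.append(("fq", f'{field}:"{value}"'))
    return urlencode(params)


query_string = facet_query(
    "retention",
    ["topic_ss", "entity_ss"],
    {"topic_ss": "Records Management"},
)
print(query_string)
```

Each drill-down adds another `fq` entry, which narrows the result set without changing the user's original query text.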