Search, Signals & Sensemaking: An Analytics Vision

Search, Signals & Sense:
An Analytics Fueled Vision

Seth Grimes
@sethgrimes

A Sense Making Story

New York Times,
September 30, 2012

Valium: Starting a Chain of Connections

New York Times,
September 8, 1957

H.P. Luhn

By H.P. Luhn, in
IBM Journal,
April, 1958

http://altaplana.com/ibm-luhn58-LiteratureAbstracts.pdf

Modelling Text

Luhn’s analysis of
Messengers of the Nervous
System, a Scientific American
article
http://wordle.net, applied
to the NY Times article

“Statistical information derived from word frequency and distribution is
used by the machine to compute a relative measure of significance, first
for individual words and then for sentences. Sentences scoring highest in
significance are extracted and printed out to become the auto-abstract.”
-- H.P. Luhn, The Automatic Creation of Literature Abstracts, IBM Journal, 1958.
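
The idea in the quotation can be sketched in a few lines. This is a simplified illustration, not Luhn's published procedure: it scores each sentence by the average corpus frequency of its non-stopwords, whereas the 1958 paper scores clusters of significant words within a sentence. The stopword list, the sentence splitter, and the function names are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are",
             "it", "that", "by", "for", "on"}

def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def significant_words(sentence):
    # Lowercase word tokens with stopwords removed.
    return [w for w in re.findall(r"[a-z']+", sentence.lower())
            if w not in STOPWORDS]

def auto_abstract(text, n_sentences=1):
    sentences = split_sentences(text)
    # Word significance = frequency across the whole text.
    freq = Counter(w for s in sentences for w in significant_words(s))

    def score(sentence):
        ws = significant_words(sentence)
        return sum(freq[w] for w in ws) / len(ws) if ws else 0.0

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # Emit the selected sentences in their original order.
    return [s for s in sentences if s in top]

text = ("Nerve cells send chemical messengers. "
        "The messengers cross the gap between nerve cells. "
        "Lunch was late.")
abstract = auto_abstract(text, n_sentences=1)
```

Here `abstract` holds the single highest-scoring sentence. Averaging rather than summing keeps long sentences from winning on length alone.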

Luhn’s Example

New York Times,
September 8, 1957

Can Software Make the Connection?

Mark Lombardi, George W. Bush, Harken Energy
and Jackson Stephens, c. 1979-90, Detail

There and Back Again: Modelling Text, 2

The text content of a document can be considered an
unordered “bag of words.”
Particular documents are points in a high-dimensional
vector space.
Salton, Wong & Yang, “A Vector Space Model for
Automatic Indexing,” November 1975.

Modelling Text, 3

We might construct a document-term matrix...
• D1 = “I like databases”
• D2 = “I hate hate databases”

      I    like    hate    databases
D1    1    1       0       1
D2    1    0       2       1
http://en.wikipedia.org/wiki/Term-document_matrix

and use a weighting such as TF-IDF (term frequency–
inverse document frequency)…
in computing the cosine of the angle between
weighted doc-vectors to determine similarity.
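
A minimal sketch of the computation for D1 and D2, using only the standard library. The IDF formula here, log(N / df), is one common variant and is an assumption; the slide does not specify one. The helper names are illustrative.

```python
import math
from collections import Counter

docs = {"D1": "I like databases", "D2": "I hate hate databases"}

# Document-term counts (the bag-of-words representation).
counts = {name: Counter(text.lower().split()) for name, text in docs.items()}
vocab = sorted({term for c in counts.values() for term in c})

def idf(term):
    # Plain inverse document frequency: log(N / df).
    df = sum(1 for c in counts.values() if term in c)
    return math.log(len(counts) / df)

def raw_vector(name):
    return [counts[name][t] for t in vocab]

def tfidf_vector(name):
    return [counts[name][t] * idf(t) for t in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

raw_sim = cosine(raw_vector("D1"), raw_vector("D2"))        # about 0.47
tfidf_sim = cosine(tfidf_vector("D1"), tfidf_vector("D2"))  # 0.0
```

With only two documents, the terms they share ("I", "databases") get an IDF of zero under this formula, so the weighted vectors overlap on nothing and the similarity drops from about 0.47 to 0. Smoothed IDF variants avoid that collapse on tiny collections.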

Modelling Text, 4

In the form of query-document similarity, this is
Information Retrieval 101.
• See, for instance, Salton & Buckley, “Term-Weighting
Approaches in Automatic Text Retrieval,” 1988.
• A useful basic tech paper: Russ Albright, SAS, “Taming Text
with the SVD,” 2004.
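
Query-document similarity can be sketched by treating the query as one more short document and ranking by cosine. This is an illustration under assumptions: the smoothed IDF, the whitespace tokenizer, the `rank` function, and document D3 are all made up for the example and do not come from the cited papers.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def rank(query, docs):
    """Return (doc_id, score) pairs sorted by cosine similarity to the query."""
    n = len(docs)
    counts = {d: Counter(tokenize(t)) for d, t in docs.items()}
    # Document frequency: number of documents containing each term.
    df = Counter(term for c in counts.values() for term in c)

    def weighted(term_counts):
        # Smoothed IDF (an assumption) keeps ubiquitous terms above zero.
        return {t: tf * (math.log((1 + n) / (1 + df[t])) + 1)
                for t, tf in term_counts.items()}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    q = weighted(Counter(tokenize(query)))
    scores = [(d, cosine(q, weighted(c))) for d, c in counts.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

docs = {
    "D1": "I like databases",
    "D2": "I hate hate databases",
    "D3": "search is the interface",  # hypothetical third document
}
results = rank("hate databases", docs)
```

The result is a hit list: D2 first, D1 second, and D3 last with a score of zero because it shares no terms with the query.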
Given the complexity of human language, statistical
models may fall short.
“Reading from text in general is a hard problem, because it
involves all of common sense knowledge.”
-- Expert systems pioneer Edward A. Feigenbaum

From Text to Data: Features

Analytical methods make text tractable.
Latent semantic indexing utilizing singular value
decomposition for term reduction / feature selection.
Classification technologies / methods:
• Naive Bayes.
• Support Vector Machine.
• K-nearest neighbor.
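
As one concrete instance of the methods listed, here is a minimal multinomial Naive Bayes text classifier with add-one (Laplace) smoothing. The class name, the tiny training set, and the labels are assumptions for illustration; a production system would use a tested library.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesText:
    def __init__(self):
        self.class_docs = Counter()              # documents per class
        self.term_counts = defaultdict(Counter)  # term counts per class
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_docs[label] += 1
            for term in text.lower().split():
                self.term_counts[label][term] += 1
                self.vocab.add(term)
        return self

    def predict(self, text):
        total_docs = sum(self.class_docs.values())
        best_label, best_logp = None, -math.inf
        for label, n_docs in self.class_docs.items():
            # Log prior plus sum of smoothed log likelihoods.
            logp = math.log(n_docs / total_docs)
            total_terms = sum(self.term_counts[label].values())
            for term in text.lower().split():
                if term not in self.vocab:
                    continue  # ignore terms never seen in training
                count = self.term_counts[label][term]
                logp += math.log((count + 1) / (total_terms + len(self.vocab)))
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label

# Hypothetical training data.
clf = NaiveBayesText().fit(
    ["I like databases", "I like search", "I hate hate databases", "I hate spam"],
    ["positive", "positive", "negative", "negative"],
)
```

Calling `clf.predict("I like text")` picks the class whose training documents make the observed words most probable. Working in log space avoids underflow when documents are long.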

“Reading from Text is a Hard Problem”

Eugène Delacroix, St. Michael Defeats the Devil

Thus the Orb he roam'd
With narrow search; and with inspection deep
Consider'd every Creature, which of all
Most opportune might serve his Wiles.
-- John Milton, Paradise Lost

Data, Search, Analysis, and Discovery

Eugène Delacroix, St. Michael Defeats the Devil

Slide callouts: Data, For, Space, features, Analysis, Intent, Goals
Thus the Orb he roam'd
With narrow search; and with inspection deep
Consider'd every Creature, which of all
Most opportune might serve his Wiles.
-- John Milton, Paradise Lost

The User Interface

“Search is the UI for data today.”
-- Grant Ingersoll, Chief Scientist, LucidWorks
Quoted by Gil Press in Forbes,
“LucidWorks: Bringing Search to Big Data”
http://www.forbes.com/sites/gilpress/2012/09/24/lucidworks-bringing-search-to-big-data/

What’s beyond?

Search and Sensemaking

“It is convenient to divide the entire
information access process into two
main components: information retrieval
through searching and browsing, and
analysis and synthesis of results. This
broader process is often referred to in
the literature as sensemaking.
Sensemaking refers to an iterative
process of formulating a conceptual
representation from a large volume
of information. Search plays only one
part in this process.”
-- Marti Hearst, 2009
http://searchuserinterfaces.com/

Senseless Search

New but old: Dumb and siloed

Searcher Supplied Sense

Better?

Clustered Clarity

Carrot2.
(open source)

Semanticized (Web) Search

Google
Knowledge
Graph

Search Fronted Analysis & Discovery

Fusions,
Signals

Toward Semantic Search Sensemaking

             Old Search                   Sensemaking
Search on:   keywords                     + identity, history & context
Sources:     content/type silos           Unified
Indexed:     terms                        + metadata (properties)
Returned:    hit lists                    Categories / clusters / answers first
Relevance:   PageRank                     (Inferred) intent
Prevalence:  plenty of new platforms      Plenty of established search with
             with old(ish) search         new(ish) capabilities, also wanna-bes.

The Back End

Platforms and ecosystems.
APIs and services.
Text and content analytics --
Discerns and extracts features including relationships from
source materials.
Features = entities, key-value pairs, concepts, topics,
events, sentiment, etc.
Provide (for) BI on content-sourced data.
Data integration, record linkage, data fusion.

Text+ Technology Mashups

Text/content analytics generates semantics to bridge
search, BI, and applications, enabling next-
generation information systems.
Diagram labels: Search, BI, Applications, with Text analytics (inner circle).
• Semantic search (search + text)
• Information access (search + text + BI)
• Search based applications (search + text + apps)
• Integrated analytics (text + BI)
• NextGen CRM, EFM, MR, marketing, …

Analytical Assets (Open Source)

>>> import nltk
>>> sentence = """At eight o'clock on Thursday
morning... Arthur didn't feel very good."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
>>> tagged = nltk.pos_tag(tokens)
>>> tagged[0:6]
[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'),
('Thursday', 'NNP'), ('morning', 'NN')]

http://nltk.org/
tm: Text Mining Package
A framework for text mining
applications within R.

A Big Data Analytics Architecture

http://hpccsystems.com/ (GNU Affero GPL)

http://www.geeklawblog.com/2011/12/lexis-advance-platform-launch-two.html

Commercial (Non-OS) Solutions Plug In

Drivers and Trends

Social media!
… and personal-social-enterprise integration.
Via-API cloud services.
Big Data (even if you don’t like the term).
Volume and velocity mean new analytical approaches.
Variety: new types and a new fusion imperative.
Sentiment: Mood, opinions, emotions, intent.
Question answering.

Text Tech Initiatives

Now and near future.
• Broader & deeper international language support.
• Sentiment analysis, beyond polarity.
Emotions, intent signals, etc.
• Identity resolution & profile extraction.
Online-social-enterprise data integration.
• Semantic data integration, Complex Data.
• Speech analytics.
• Discourse analysis.
Because isolated messages are not conversations.
• Rich-media content analytics.
• Augmented reality; new human-computer interfaces.

Personal. Mobile. Intelligent?

http://timoelliott.com/blog/2010/10/sap-businessobjects-augmented-explorer-now-available-resources-to-test-it.html

A Focus on Information & Applications

Now and near future.
• Signal detection.
Sentiment, emotion, identity, intent.
• Semanticized applications.
Linkable, mashable, enrichable.
• Rich information.
Context sensitive, situational.
Σ = Sensemaking.
