3. Valium: Starting a Chain of Connections
New York Times, September 8, 1957
4. H.P. Luhn
By H.P. Luhn, in IBM Journal, April 1958
http://altaplana.com/ibm-luhn58-LiteratureAbstracts.pdf
6. Modelling Text
Luhn’s analysis of “Messengers of the Nervous System,” a Scientific American article
http://wordle.net, applied to the NY Times article
“Statistical information derived from word frequency and distribution is
used by the machine to compute a relative measure of significance, first
for individual words and then for sentences. Sentences scoring highest in
significance are extracted and printed out to become the auto-abstract.”
-- H.P. Luhn, The Automatic Creation of Literature Abstracts, IBM Journal, 1958.
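The idea in the quotation can be sketched roughly as follows. This is a minimal illustration, not Luhn's actual procedure: the stop-word list, the function names, and the scoring rule (a plain sum of word frequencies per sentence, which favors longer sentences) are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word list, chosen only for this illustration.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are", "to", "in",
              "it", "on", "for", "by", "that", "this", "with"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Lower-case the sentence and return its alphabetic tokens."""
    return re.findall(r"[a-z']+", sentence.lower())

def auto_abstract(text, num_sentences=1):
    """Return the highest-scoring sentences, kept in document order."""
    sentences = split_sentences(text)
    # Word significance: frequency across the document, stop words excluded.
    freq = Counter(w for s in sentences for w in tokenize(s)
                   if w not in STOP_WORDS)

    # Sentence significance: sum of the significance of its words
    # (an assumed simplification of the measure described in the paper).
    def score(sentence):
        return sum(freq[w] for w in tokenize(sentence) if w not in STOP_WORDS)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:num_sentences])]
```

Calling auto_abstract(text, 2) on a short article returns the two sentences whose words recur most often elsewhere in the text.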
10. Can Software Make the Connection?
Mark Lombardi, George W. Bush, Harken Energy
and Jackson Stephens, c. 1979-90, Detail
11. There and Back Again: Modelling Text, 2
The text content of a document can be considered an
unordered “bag of words.”
Particular documents are points in a high-dimensional
vector space.
Salton, Wong & Yang, “A Vector Space Model for Automatic Indexing,” November 1975.
12. Modelling Text, 3
We might construct a document-term matrix...
• D1 = “I like databases”
• D2 = “I hate hate databases”
      I   like   hate   databases
D1    1   1      0      1
D2    1   0      2      1
http://en.wikipedia.org/wiki/Term-document_matrix
and use a weighting such as TF-IDF (term frequency–
inverse document frequency)…
in computing the cosine of the angle between
weighted doc-vectors to determine similarity.
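A small sketch of these steps for D1 and D2, using only the Python standard library. The IDF formula here is an assumption: a smoothed variant, log((1 + N) / (1 + df)) + 1, chosen because the unsmoothed log(N / df) gives zero weight to any term that appears in every document, which in a two-document example would leave D1 and D2 with no shared weighted terms. All names are illustrative.

```python
import math
from collections import Counter

# Column order follows the slide's matrix.
VOCAB = ["I", "like", "hate", "databases"]
DOCS = {"D1": "I like databases", "D2": "I hate hate databases"}

def count_vector(text, vocab):
    """One row of the document-term matrix: raw term counts."""
    counts = Counter(text.split())
    return [counts[term] for term in vocab]

def idf_weights(vectors):
    """Smoothed inverse document frequency per term (assumed variant)."""
    n_docs = len(vectors)
    weights = []
    for j in range(len(vectors[0])):
        df = sum(1 for v in vectors if v[j] > 0)  # documents containing term j
        weights.append(math.log((1 + n_docs) / (1 + df)) + 1)
    return weights

def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

counts = {name: count_vector(text, VOCAB) for name, text in DOCS.items()}
idf = idf_weights(list(counts.values()))
weighted = {name: [tf * w for tf, w in zip(vec, idf)]
            for name, vec in counts.items()}

raw_sim = cosine(counts["D1"], counts["D2"])        # similarity on raw counts
tfidf_sim = cosine(weighted["D1"], weighted["D2"])  # similarity on TF-IDF weights
print(raw_sim, tfidf_sim)
```

The TF-IDF similarity comes out lower than the raw-count similarity because the terms the two documents do not share ("like", "hate") are the ones that receive the extra weight.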
13. Modelling Text, 4
In the form of query-document similarity, this is
Information Retrieval 101.
• See, for instance, Salton & Buckley, “Term-Weighting
Approaches in Automatic Text Retrieval,” 1988.
• A useful basic tech paper: Russ Albright, SAS, “Taming Text
with the SVD,” 2004.
Given the complexity of human language, statistical
models may fall short.
“Reading from text in general is a hard problem, because it
involves all of common sense knowledge.”
-- Expert systems pioneer Edward A. Feigenbaum
14. From Text to Data: Features
Analytical methods make text tractable.
Latent semantic indexing utilizing singular value
decomposition for term reduction / feature selection.
Classification technologies / methods:
• Naive Bayes.
• Support Vector Machine.
• K-nearest neighbor.
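As one illustration of the methods listed, here is a minimal multinomial Naive Bayes text classifier with add-one (Laplace) smoothing. It is a sketch under simplifying assumptions: whitespace tokenization, words unseen in training are ignored at prediction time, and the class name and training sentences are invented for the example.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over whitespace-separated, lower-cased words."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # word counts per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.class_counts.items():
            # Log prior plus log likelihood of each known word.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in text.lower().split():
                if word not in self.vocab:
                    continue  # skip words never seen in training
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented training data, for illustration only.
model = NaiveBayes().fit(
    ["I like databases", "I like search",
     "I hate hate databases", "I hate slow search"],
    ["pos", "pos", "neg", "neg"],
)
print(model.predict("like databases"), model.predict("hate slow databases"))
```

Working in log space avoids numeric underflow when many word probabilities are multiplied together.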
15. “Reading from Text is a Hard Problem”
Eugène Delacroix, St. Michael Defeats the Devil
Thus the Orb he roam'd
With narrow search; and with inspection deep
Consider'd every Creature, which of all
Most opportune might serve his Wiles.
-- John Milton, Paradise Lost
16. Data, Search, Analysis, and Discovery
Eugène Delacroix, St. Michael Defeats the Devil
Thus the Orb he roam'd
With narrow search; and with inspection deep
Consider'd every Creature, which of all
Most opportune might serve his Wiles.
-- John Milton, Paradise Lost
Labels overlaid on the quotation: Data / For Space / features / Analysis / Intent, Goals
17. The User Interface
“Search is the UI for data today.”
-- Grant Ingersoll, Chief Scientist, LucidWorks
Quoted by Gil Press in Forbes,
“LucidWorks: Bringing Search to Big Data”
http://www.forbes.com/sites/gilpress/2012/09/24/lucidworks-bringing-search-to-big-data/
What’s beyond?
18. Search and Sensemaking
“It is convenient to divide the entire
information access process into two
main components: information retrieval
through searching and browsing, and
analysis and synthesis of results. This
broader process is often referred to in
the literature as sensemaking.
Sensemaking refers to an iterative
process of formulating a conceptual
representation from a large volume
of information. Search plays only one
part in this process.”
-- Marti Hearst, 2009
http://searchuserinterfaces.com/
26. Toward Semantic Search Sensemaking
              Old Search                 Sensemaking
Search on:    keywords                   + identity, history & context
Sources:      content/type silos         Unified
Indexed:      terms                      + metadata (properties)
Returned:     hit lists                  Categories / clusters / answers first
Relevance:    PageRank                   (Inferred) intent
Prevalence:   plenty of new platforms    Plenty of established search with
              with old(ish) search       new(ish) capabilities, also wanna-bes.
27. The Back End
Platforms and ecosystems.
APIs and services.
Text and content analytics --
Discerns and extracts features including relationships from
source materials.
Features = entities, key-value pairs, concepts, topics,
events, sentiment, etc.
Provide (for) BI on content-sourced data.
Data integration, record linkage, data fusion.
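Record linkage, in its simplest form, means deciding whether two differently written records refer to the same entity. The sketch below is one assumed approach, not a method from the deck: normalize the strings, score pairs with difflib's SequenceMatcher, and link each record to its best match above a threshold. The function names, the 0.8 threshold, and the sample company names are all hypothetical.

```python
import re
from difflib import SequenceMatcher

def normalize(name):
    """Lower-case, drop punctuation, and collapse whitespace."""
    name = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return " ".join(name.split())

def similarity(a, b):
    """String similarity in [0, 1] on the normalized forms."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def link_records(left, right, threshold=0.8):
    """Pair each left record with its best right match, if similar enough."""
    if not right:
        return []
    links = []
    for a in left:
        best = max(right, key=lambda b: similarity(a, b))
        if similarity(a, best) >= threshold:
            links.append((a, best))
    return links

# Invented records, for illustration only.
crm_names = ["Acme Corp.", "Globex Inc"]
extracted_names = ["ACME CORP", "Initech"]
print(link_records(crm_names, extracted_names))
```

Production record linkage usually compares several fields and blocks candidate pairs first; a single string score is only a starting point.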
28. Text+ Technology Mashups
Text/content analytics generates semantics to bridge
search, BI, and applications, enabling next-
generation information systems.
[Diagram: overlapping Search, BI, and Applications areas, with Text analytics as the inner circle]
• Semantic search (search + text)
• Information access (search + text + BI)
• Search-based applications (search + text + apps)
• Integrated analytics (text + BI)
• NextGen CRM, EFM, MR, marketing, …
29. Analytical Assets (Open Source)
>>> import nltk
>>> sentence = """At eight o'clock on Thursday
morning... Arthur didn't feel very good."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
>>> tagged = nltk.pos_tag(tokens)
>>> tagged[0:6]
[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'),
('Thursday', 'NNP'), ('morning', 'NN')]
http://nltk.org/
tm: Text Mining Package
A framework for text mining
applications within R.
30. A Big Data Analytics Architecture
http://hpccsystems.com/ (GNU Affero GPL)
http://www.geeklawblog.com/2011/12/lexis-advance-platform-launch-two.html
32. Drivers and Trends
Social media!
… and personal-social-enterprise integration.
Via-API cloud services.
Big Data (even if you don’t like the term).
Volume and velocity mean new analytical approaches.
Variety: new types and a new fusion imperative.
Sentiment: Mood, opinions, emotions, intent.
Question answering.
33. Text Tech Initiatives
Now and near future.
• Broader & deeper international language support.
• Sentiment analysis, beyond polarity.
Emotions, intent signals, etc.
• Identity resolution & profile extraction.
Online-social-enterprise data integration.
• Semantic data integration, Complex Data.
• Speech analytics.
• Discourse analysis.
Because isolated messages are not conversations.
• Rich-media content analytics.
• Augmented reality; new human-computer interfaces.
35. A Focus on Information & Applications
Now and near future.
• Signal detection.
Sentiment, emotion, identity, intent.
• Semanticized applications.
Linkable, mashable, enrichable.
• Rich information.
Context sensitive, situational.
Σ = Sensemaking.