2. Agenda
• Define Big Data and Document Analysis
• The Infinit.e Solution
• Questions
3. What is Big Data?
“Big data is a term applied to data sets
whose size is beyond the ability of commonly
used software tools to capture, manage, and
process the data within a tolerable elapsed
time.”
Source: http://en.wikipedia.org/wiki/Big_data
4. This is what Big Data Feels Like
Shamelessly stolen from:
http://techbuddha.wordpress.com/2011/09/04/big-data-are-you-creating-a-garbage-dump-or-mountains-of-gold/
5. What is Document Analysis?
"Document Analysis refers to
computer-assisted analysis of large numbers
of documents in order to answer questions
about the content of a document set."
Source: http://www.text-tech.com/docanalysis/definition.html
6. Document Analysis
• The goal is to:
– Extract Entities (people, places, things)
– Create Associations between entities (in the
form of noun-verb-noun), e.g.:
• John Doe lives in Washington, D.C.
• John Doe is married to Jane Doe
• John Doe is a Virgo
• John Doe traveled to Mexico on July 6th, 2011
• And…
7. Document Analysis
• Turn Who, What, When and
Where into a unified data structure that
supports data analytics and visualization.
– Who: people, organizations, facilities, company
– When: past, present, future dates
– What: events, summaries, facts, themes
– Where: city, state, country, coordinate
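A minimal Python sketch of what such a unified structure could look like. The class and field names (Entity, Association, AnalyzedDocument, by_dimension) are illustrative assumptions, not Infinit.e's actual schema; the example data comes from the John Doe associations on the previous slide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Entity:
    name: str
    type: str       # e.g. "Person", "Country", "Date"
    dimension: str  # one of "Who", "What", "When", "Where"


@dataclass(frozen=True)
class Association:
    # noun-verb-noun triple linking two entities
    subject: Entity
    verb: str
    obj: Entity


@dataclass
class AnalyzedDocument:
    title: str
    entities: List[Entity] = field(default_factory=list)
    associations: List[Association] = field(default_factory=list)

    def by_dimension(self, dimension: str) -> List[str]:
        """Return the names of all entities in one dimension."""
        return [e.name for e in self.entities if e.dimension == dimension]


john = Entity("John Doe", "Person", "Who")
mexico = Entity("Mexico", "Country", "Where")
trip_date = Entity("July 6th, 2011", "Date", "When")

doc = AnalyzedDocument(
    title="Example document",
    entities=[john, mexico, trip_date],
    associations=[Association(john, "traveled to", mexico)],
)

print(doc.by_dimension("Who"))
print(doc.by_dimension("Where"))
```

Tagging every entity with a dimension is what lets one structure answer "who", "when" and "where" questions with the same kind of lookup.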
8. The Infinit.e Solution
• Infinit.e is an Open Source
document discovery and
analysis platform that has
these very cool open source
tools lurking under the hood.
github.com/ikanow/Infinit.e
9. The Infinit.e Solution
• Infinit.e is a scalable framework for Collecting, Storing, Enriching, Retrieving, Analyzing, and Visualizing Structured and Unstructured Documents.
10. Harvesting
• Infinit.e’s harvester:
– Collects documents from specified data sources
(URLs, RDBMSs via JDBC, file shares)
– Marshals each document through the
enrichment process
– Saves each metadata document, entity, and
association created to MongoDB
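The three harvester steps can be sketched in Python as a collect, enrich, save loop. This is only an illustration under stated assumptions: harvest, sample_source, toy_enrich and KNOWN_ENTITIES are hypothetical names, the enricher is a toy string match rather than real NLP, and a dict of lists stands in for the MongoDB collections.

```python
def harvest(sources, enrich, store):
    """Collect documents from each source, enrich them, and save the results."""
    saved = 0
    for fetch in sources:
        for raw_doc in fetch():
            doc = enrich(dict(raw_doc))        # enrichment adds entities/associations
            store["metadata"].append(doc)      # the metadata document itself
            store["entity"].extend(doc["entities"])
            store["association"].extend(doc["associations"])
            saved += 1
    return saved


def sample_source():
    # Hypothetical source; a real one would read a URL, a JDBC query, or a file share.
    yield {"title": "John Doe traveled to Mexico", "url": "http://example.com/1"}


KNOWN_ENTITIES = {"John Doe": "Who", "Mexico": "Where"}


def toy_enrich(doc):
    # Toy enrichment: tag any known name that appears in the title.
    doc["entities"] = [
        {"name": name, "dimension": dimension}
        for name, dimension in KNOWN_ENTITIES.items()
        if name in doc["title"]
    ]
    doc["associations"] = []
    return doc


# In-memory stand-in for the MongoDB collections (assumption for this sketch).
store = {"metadata": [], "entity": [], "association": []}
count = harvest([sample_source], toy_enrich, store)
print(count, len(store["entity"]))
```

Passing the enricher in as a function mirrors the idea that different enrichment libraries or NLP APIs can be plugged into the same harvest loop.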
12. Sample RSS Document
<rss version="2.0">
<channel>
…
<item>
<title>Mediterranean conference seeks to flourish tourism in Egypt, Tunisia… </title>
<link>http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish-tourism-in-egypt-tunisia-report-by-egyptlastminute-com-13613.html</link>
<description>Report by egyptlastminute.com CAIRO: On Monday, the countries of the
Mediterranean opened a conference seeking to enhance the future of tourism in the region. The
conference focuses on the countries of Egypt and Tunisia the most …
</description>
<dc:publisher>Latest Press Releases | Press Release Bureau</dc:publisher>
<dc:creator>unknown</dc:creator>
<dc:date>Sat, 21 Apr 2012 00:00:00 GMT</dc:date>
</item>
…
</channel>
</rss>
14. Document Metadata
• doc_metadata.metadata
{
"_id" : ObjectId("4f93638e0cf212156d0559d2"),
"title" : "Mediterranean conference seeks to flourish tourism in Egypt, Tunisia ...",
"url" : "http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish-tourism-in-egypt-tunisia-report-by-egyptlastminute-com-13613.html",
"description" : "Report by egyptlastminute.com CAIRO: On Monday, the countries of the
Mediterranean opened a conference seeking to enhance the future of tourism in the region. The
conference focuses on the countries of Egypt and Tunisia; the most ...",
"created" : ISODate("2012-04-22T01:49:02Z"),
"metadata" : {…},
"associations" : […],
"entities" : […],
...
}
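To make the step from the sample RSS item to a metadata document concrete, here is a hedged Python sketch using only the standard library. It is not Infinit.e's harvester code: rss_to_metadata_docs and the publishedDate field are assumptions for illustration, the sample feed is abbreviated, and the xmlns:dc declaration is added because the slide elides the feed header. MongoDB would assign "_id" on insert, so it is omitted here.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Dublin Core namespace, needed to resolve the dc:* elements.
NS = {"dc": "http://purl.org/dc/elements/1.1/"}

SAMPLE_RSS = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
<item>
<title>Mediterranean conference seeks to flourish tourism in Egypt, Tunisia…</title>
<link>http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish-tourism-in-egypt-tunisia-report-by-egyptlastminute-com-13613.html</link>
<description>Report by egyptlastminute.com CAIRO: On Monday, the countries of the Mediterranean opened a conference …</description>
<dc:date>Sat, 21 Apr 2012 00:00:00 GMT</dc:date>
</item>
</channel>
</rss>"""


def rss_to_metadata_docs(xml_text, harvested_at=None):
    """Map each RSS <item> to a dict shaped like the metadata document above."""
    created = harvested_at or datetime.now(timezone.utc)
    docs = []
    for item in ET.fromstring(xml_text).iter("item"):
        published = item.findtext("dc:date", default=None, namespaces=NS)
        docs.append({
            "title": item.findtext("title", default="").strip(),
            "url": item.findtext("link", default="").strip(),
            "description": item.findtext("description", default="").strip(),
            # Hypothetical field name for the feed's own date.
            "publishedDate": parsedate_to_datetime(published) if published else None,
            "created": created,   # time of harvest, not of publication
            "metadata": {},
            "associations": [],   # filled in later by enrichment
            "entities": [],
        })
    return docs


docs = rss_to_metadata_docs(SAMPLE_RSS)
print(docs[0]["url"])
```

In the sample above, "created" (2012-04-22) is later than the feed's dc:date (21 Apr 2012), which is why this sketch treats "created" as the harvest time; that reading is an inference from the slides.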
16. Document Enrichment
• Infinit.e supports the extraction of entities
and creation of associations using a
combination of built-in enrichment libraries
and third-party NLP APIs including:
17. Harvested Entities
• feature.entity
{
"_id" : ObjectId("4f9189d48baf188282a1c9ef"),
"alias" : [
"Zine el Abidine Ben Ali",
"Zine El Abidine Ben Ali",
"Zine el Abidine ben Ali"
],
"batch_resync" : true,
"communityId" : ObjectId("4f8f138103644ee8003bf518"),
"db_sync_doccount" : NumberLong(143),
"db_sync_time" : "1338751174988",
"dimension" : "Who",
"disambiguated_name" : "Zine El Abidine Ben Ali",
"doccount" : 152,
"index" : "zine el abidine ben ali/person",
"totalfreq" : 353,
"type" : "Person"
}
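The sample suggests that "index" is the lowercased disambiguated name joined to the lowercased type with a slash, and that "alias" collects the surface forms seen in documents. The Python sketch below rolls individual mentions up into such a record. It is an assumption-laden illustration: entity_index, build_entity_features and the mention fields are hypothetical, and reading "doccount" as distinct documents and "totalfreq" as total mentions is inferred from the field names only.

```python
from collections import defaultdict


def entity_index(disambiguated_name, entity_type):
    """Build a key in the form seen in the sample: lowercased "name/type"."""
    return f"{disambiguated_name.lower()}/{entity_type.lower()}"


def build_entity_features(mentions):
    """Aggregate per-document mentions into one feature record per entity."""
    features = {}
    docs_seen = defaultdict(set)
    for m in mentions:
        key = entity_index(m["disambiguated_name"], m["type"])
        feature = features.setdefault(key, {
            "index": key,
            "disambiguated_name": m["disambiguated_name"],
            "type": m["type"],
            "alias": [],
            "doccount": 0,
            "totalfreq": 0,
        })
        if m["text"] not in feature["alias"]:
            feature["alias"].append(m["text"])   # each surface form once
        feature["totalfreq"] += 1                # every mention counts
        docs_seen[key].add(m["doc_id"])
        feature["doccount"] = len(docs_seen[key])  # distinct documents only
    return features


mentions = [
    {"doc_id": 1, "text": "Zine el Abidine Ben Ali",
     "disambiguated_name": "Zine El Abidine Ben Ali", "type": "Person"},
    {"doc_id": 1, "text": "Zine El Abidine Ben Ali",
     "disambiguated_name": "Zine El Abidine Ben Ali", "type": "Person"},
    {"doc_id": 2, "text": "Zine el Abidine ben Ali",
     "disambiguated_name": "Zine El Abidine Ben Ali", "type": "Person"},
]

features = build_entity_features(mentions)
ben_ali = features["zine el abidine ben ali/person"]
print(ben_ali["doccount"], ben_ali["totalfreq"], len(ben_ali["alias"]))
```

Keying on the normalized name plus type is what lets differently capitalized spellings collapse into a single entity record.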