How IKANOW uses MongoDB to help organizations solve really big problems
1. The Open Source document analysis platform
Or, how IKANOW uses MongoDB
to help organizations solve really big problems
2. Agenda
• What is Document Analysis?
• The Infinit.e Solution
– Infinit.e’s Architecture
– Why and How we use MongoDB
• Analyzing #MongoDC
• Questions
3. This is what Big Data Looks Like
Shamelessly stolen from:
http://techbuddha.wordpress.com/2011/09/04/big-data-are-you-creating-a-garbage-dump-or-mountains-of-gold/
4. What is Document Analysis?
"Document Analysis refers to
computer-assisted analysis of large numbers
of documents in order to answer questions
about the content of a document set."
Source: http://www.text-tech.com/docanalysis/definition.html
5. Document Analysis
• Common document source formats:
– RSS, JSON, XML
– HTML, PDF, TXT
– RTF, Word, PPT
– Multimedia files, RDBMS records, etc.
6. Document Analysis
• The goal is to:
– Extract Entities (people, places, things)
– Create Associations between entities (in the
form of noun-verb-noun), e.g.:
• John Doe lives in Washington, D.C.
• John Doe is married to Jane Doe
• John Doe is a Virgo
• John Doe traveled to Mexico on July 6th, 2011
• And…
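The noun-verb-noun associations above can be sketched as a small data structure. This is only an illustration, not Infinit.e's actual schema: the class names `Entity` and `Association`, the field names, and the type labels are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str
    type: str  # illustrative labels such as "Person" or "Place" (assumed)


@dataclass(frozen=True)
class Association:
    subject: Entity  # first noun
    verb: str        # the relation between the two nouns
    obj: Entity      # second noun


john = Entity("John Doe", "Person")
associations = [
    Association(john, "lives in", Entity("Washington, D.C.", "Place")),
    Association(john, "is married to", Entity("Jane Doe", "Person")),
    Association(john, "traveled to", Entity("Mexico", "Place")),
]


def related_to(entity, assocs):
    """Return (verb, object name) pairs for associations whose subject is `entity`."""
    return [(a.verb, a.obj.name) for a in assocs if a.subject == entity]


for verb, name in related_to(john, associations):
    print(f"{john.name} {verb} {name}")
```

Storing each statement as a triple makes questions like "what do we know about John Doe?" a simple filter over the list.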
7. Document Analysis
• Turn Who, What, When and
Where into a unified data structure that
supports data analytics and visualization.
– Who: people, organizations, facilities, company
– When: past, present, future dates
– What: events, summaries, facts, themes
– Where: city, state, country, coordinate
8. The Infinit.e Solution
• Infinit.e is an Open Source
document discovery and
analysis platform that has
these very cool Open Source
tools lurking under the hood.
github.com/ikanow/Infinit.e
9. The Infinit.e Solution
Infinit.e is a scalable framework for
Collecting, Storing, Enriching, Retrieving, Analyzing and Visualizing
Structured and Unstructured Documents
12. Sample RSS Document
<rss version="2.0">
<channel>
…
<item>
<title>Mediterranean conference seeks to flourish tourism in Egypt, Tunisia… </title>
<link>http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish-tourism-in-egypt-tunisia-report-by-egyptlastminute-com-13613.html</link>
<description>Report by egyptlastminute.com CAIRO: On Monday, the countries of the
Mediterranean opened a conference seeking to enhance the future of tourism in the region. The
conference focuses on the countries of Egypt and Tunisia the most …
</description>
<dc:publisher>Latest Press Releases | Press Release Bureau</dc:publisher>
<dc:creator>unknown</dc:creator>
<dc:date>Sat, 21 Apr 2012 00:00:00 GMT</dc:date>
</item>
…
</channel>
</rss>
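As a rough sketch of how an RSS item like the one above could be turned into a flat record, the following uses only the Python standard library. It is not Infinit.e's harvester. The function name `rss_items_to_docs`, the output keys `publisher` and `published`, and the sample feed are assumptions, and the sketch assumes the feed declares the Dublin Core namespace for its `dc:` elements.

```python
import xml.etree.ElementTree as ET

# Dublin Core namespace, assumed to be declared on the <rss> element.
DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical stand-in feed with the same shape as the sample item.
SAMPLE = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <item>
      <title>Example title</title>
      <link>http://example.com/story.html</link>
      <description>Example description</description>
      <dc:publisher>Example Publisher</dc:publisher>
      <dc:date>Sat, 21 Apr 2012 00:00:00 GMT</dc:date>
    </item>
  </channel>
</rss>"""


def rss_items_to_docs(xml_text):
    """Convert every <item> in an RSS feed into a plain dict."""
    root = ET.fromstring(xml_text)
    docs = []
    for item in root.iter("item"):
        docs.append({
            "title": item.findtext("title", default="").strip(),
            "url": item.findtext("link", default="").strip(),
            "description": item.findtext("description", default="").strip(),
            # Namespaced tags are addressed as "{namespace-uri}localname".
            "publisher": item.findtext(f"{{{DC}}}publisher", default=""),
            "published": item.findtext(f"{{{DC}}}date", default=""),
        })
    return docs


docs = rss_items_to_docs(SAMPLE)
print(docs[0]["title"], "->", docs[0]["url"])
```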
16. Document Metadata
• doc_metadata.metadata
{
"_id" : ObjectId("4f93638e0cf212156d0559d2"),
"title" : "Mediterranean conference seeks to flourish tourism in Egypt, Tunisia ...",
"url" : "http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish-tourism-in-egypt-tunisia-report-by-egyptlastminute-com-13613.html",
"description" : "Report by egyptlastminute.com CAIRO: On Monday, the countries of the
Mediterranean opened a conference seeking to enhance the future of tourism in the region. The
conference focuses on the countries of Egypt and Tunisia; the most ...",
"created" : ISODate("2012-04-22T01:49:02Z"),
"metadata" : {…},
"associations" : […],
"entities" : […],
...
}
17. Harvested Document Metadata
• doc_metadata.metadata.metadata
Note: It is okay to laugh at this
"metadata" : {
"location" : [
{
"region" : "South Asia",
"citystateprovince" : {
"stateprovince" : "Rolpa", "city" : "Newang"
},
"country" : "Nepal"
}
],
"icn" : [ "200573487" ],
"incidentdate" : [ "07/25/2005" ],
"organization" : [
"Communist Party of Nepal (Maoist)/United People's Front"
],
...
},
18. Document Enrichment
• Infinit.e supports the extraction of entities
and creation of associations using a
combination of built-in enrichment libraries
and third-party NLP APIs.
19. Harvested Entities
• feature.entity
{
"_id" : ObjectId("4f9189d48baf188282a1c9ef"),
"alias" : [
"Zine el Abidine Ben Ali",
"Zine El Abidine Ben Ali",
"Zine el Abidine ben Ali"
],
"batch_resync" : true,
"communityId" : ObjectId("4f8f138103644ee8003bf518"),
"db_sync_doccount" : NumberLong(143),
"db_sync_time" : "1338751174988",
"dimension" : "Who",
"disambiguated_name" : "Zine El Abidine Ben Ali",
"doccount" : 152,
"index" : "zine el abidine ben ali/person",
"totalfreq" : 353,
"type" : "Person"
}
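The entity record above suggests that several surface forms (aliases) roll up under one lowercased `name/type` index key. The sketch below shows one way that aggregation could work. The key format is inferred from the sample only, and the helper names `entity_index` and `merge_mentions` and the input tuple layout are assumptions rather than Infinit.e code.

```python
def entity_index(disambiguated_name, entity_type):
    """Build a key like 'zine el abidine ben ali/person' (format inferred from the sample)."""
    return f"{disambiguated_name.lower()}/{entity_type.lower()}"


def merge_mentions(mentions):
    """Aggregate (surface_form, disambiguated_name, type) mentions into one record per entity."""
    features = {}
    for surface, name, etype in mentions:
        key = entity_index(name, etype)
        feature = features.setdefault(key, {
            "index": key,
            "disambiguated_name": name,
            "type": etype,
            "alias": [],
            "totalfreq": 0,
        })
        if surface not in feature["alias"]:
            feature["alias"].append(surface)  # keep each spelling once
        feature["totalfreq"] += 1             # count every mention
    return features


mentions = [
    ("Zine el Abidine Ben Ali", "Zine El Abidine Ben Ali", "Person"),
    ("Zine El Abidine Ben Ali", "Zine El Abidine Ben Ali", "Person"),
    ("Zine el Abidine ben Ali", "Zine El Abidine Ben Ali", "Person"),
    ("Zine el Abidine Ben Ali", "Zine El Abidine Ben Ali", "Person"),
]
features = merge_mentions(mentions)
print(features["zine el abidine ben ali/person"]["alias"])
```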
26. Why MongoDB? – Reason #1
Document-Oriented Storage
• MongoDB’s document-oriented storage
(i.e. schema-less) is perfectly suited to the
data design requirements of a system that
needs to ingest a wide variety of
structured and unstructured document
formats and normalize them into one
unified, semi-structured format
27. Why MongoDB? – Reason #2
JSON
• The standard language of open document
analysis
– JSON is a common interchange format supported
by tools like elasticsearch and SaaS NLP engines
– BSON (Binary JSON) is MongoDB’s native data
format
– Infinit.e ingests and exports JSON
natively via the REST based API
Note: Infinit.e uses Google’s Gson Java library to convert
JSON to POJOs and back
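The JSON-to-object round trip mentioned in the note can be illustrated with a Python analogue. This is not Gson and not Infinit.e code. It only shows the same idea with the standard library, and the `Doc` class and its fields are hypothetical.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class Doc:
    # Hypothetical typed object, playing the role a POJO plays in Java.
    title: str
    url: str
    description: str = ""
    entities: list = field(default_factory=list)


def doc_from_json(text):
    """Parse a JSON string into a Doc (raises TypeError on unknown keys)."""
    return Doc(**json.loads(text))


def doc_to_json(doc):
    """Serialize a Doc back to a JSON string."""
    return json.dumps(asdict(doc), sort_keys=True)


raw = '{"title": "Example", "url": "http://example.com/a", "entities": ["Egypt"]}'
doc = doc_from_json(raw)
print(doc.title, doc.entities)
print(doc_to_json(doc))
```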
28. Why MongoDB? – Reason #3
MongoDB Is Web Scale*
*Shards are the secret ingredients in the web scale sauce. They just work.
29. Why MongoDB? – Reason #3
Scalability
• Seriously, MongoDB Scales
– Harvesting and enriching documents requires
a lot of disk space
– MongoDB scales to arbitrary sizes in both
read/write dimensions
– Sophisticated sharding keys provide
powerful/flexible balancing
BUT building an initial cluster can be complex
and managing cluster changes is “fiddly”
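To give a feel for why the choice of shard key matters, here is a toy sketch of hash-based distribution. It is deliberately simplified and is not how MongoDB routes documents internally (MongoDB splits shard key ranges into chunks and balances those across shards). The function `shard_for`, the use of MD5, and the example URLs are all assumptions made for illustration.

```python
import hashlib
from collections import Counter


def shard_for(key, num_shards):
    """Map a key to a shard number by hashing it (toy model, not MongoDB's algorithm)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Hypothetical document URLs used as the shard key.
urls = [f"http://example.com/doc/{i}" for i in range(1000)]
counts = Counter(shard_for(u, 4) for u in urls)

for shard in sorted(counts):
    print(f"shard {shard}: {counts[shard]} documents")
```

Hashing spreads sequential keys across shards instead of sending every new document to the same one, which is the balancing property a good shard key is meant to provide.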
30. Why MongoDB? – Reason #4
Integration with Apache Hadoop
• Hadoop is rapidly becoming the de facto standard for
data analytics
– Open Source, very customizable
– Proven scalability
– Java libraries
• The MongoDB Hadoop Adaptor allows Hadoop to read
from and write to MongoDB instead of HDFS
31. Tweeting about MongoDC
• Source:
http://search.twitter.com/search.rss?q=mongodc
– Who’s Tweeting?
– What are they Tweeting?
– What does basic document analysis of these
Tweets tell us?
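The first two questions above come down to simple counting. Here is a minimal sketch, assuming the tweets have already been harvested into dicts with `author` and `text` fields. The sample tweets, the field names, and the stopword list are made up for illustration and are not real #MongoDC data.

```python
import re
from collections import Counter

# Hypothetical harvested tweets.
tweets = [
    {"author": "alice", "text": "Great sharding talk at #MongoDC"},
    {"author": "bob", "text": "Learning about sharding and Hadoop at #mongodc"},
    {"author": "alice", "text": "Hadoop plus MongoDB demo at #MongoDC"},
]

STOPWORDS = {"at", "and", "the", "about", "plus", "a", "of"}


def who_is_tweeting(tweets):
    """Count tweets per author."""
    return Counter(t["author"] for t in tweets)


def what_are_they_tweeting(tweets):
    """Count lowercased words and hashtags, skipping stopwords."""
    words = Counter()
    for t in tweets:
        for w in re.findall(r"[#\w]+", t["text"].lower()):
            if w not in STOPWORDS:
                words[w] += 1
    return words


print(who_is_tweeting(tweets).most_common(3))
print(what_are_they_tweeting(tweets).most_common(3))
```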