The Open Source document analysis platform



  Or, how IKANOW uses MongoDB to help organizations solve really big problems
Agenda
• What is Document Analysis?
• The Infinit.e Solution
  – Infinit.e’s Architecture
  – Why and How we use MongoDB
• Analyzing #MongoDC
• Questions
This is what Big Data Looks Like




                                          Shamelessly stolen from:
   http://techbuddha.wordpress.com/2011/09/04/big-data-are-you-creating-a-garbage-dump-or-mountains-of-gold/
What is Document Analysis?
 "Document Analysis refers to
 computer-assisted analysis of large numbers
 of documents in order to answer questions
 about the content of a document set."
 Source: http://www.text-tech.com/docanalysis/definition.html
Document Analysis
• Common document source formats:
RSS                JSON            XML

HTML               PDF             TXT

RTF                Word            PPT

Multimedia Files   RDBMS Records   ETC.
Document Analysis
• The goal is to:
  – Extract Entities (people, places, things)
  – Create Associations between entities (in the
    form of noun-verb-noun), e.g.:
     •   John Doe lives in Washington, D.C.
     •   John Doe is married to Jane Doe
     •   John Doe is a Virgo
     •   John Doe traveled to Mexico on July 6th, 2011
• And…
Document Analysis
• Turn Who, What, When and
  Where into a unified data structure that
  supports data analytics and visualization.
  – Who: people, organizations, facilities, company
  – What: events, summaries, facts, themes
  – When: past, present, future dates
  – Where: city, state, country, coordinate
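
As a minimal sketch of such a unified structure, the noun-verb-noun examples from the slides could be modeled in Python as below. The class names `Entity` and `Association` and their fields are assumptions made for illustration; they are not Infinit.e's API.

```python
from dataclasses import dataclass

# The four dimensions named on the slide.
DIMENSIONS = ("Who", "What", "When", "Where")


@dataclass(frozen=True)
class Entity:
    """A person, place, or thing extracted from a document (hypothetical type)."""
    name: str
    type: str        # e.g. "Person", "Place"
    dimension: str   # one of DIMENSIONS

    def __post_init__(self):
        # Reject anything outside the Who/What/When/Where scheme.
        if self.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {self.dimension}")


@dataclass(frozen=True)
class Association:
    """A noun-verb-noun link between two entities (hypothetical type)."""
    subject: Entity
    verb: str
    obj: Entity

    def as_sentence(self) -> str:
        return f"{self.subject.name} {self.verb} {self.obj.name}"


john = Entity("John Doe", "Person", "Who")
dc = Entity("Washington, D.C.", "Place", "Where")
lives = Association(john, "lives in", dc)
print(lives.as_sentence())
```

Keeping the dimension on the entity rather than on the association means the same entity can take part in many associations without repeating that classification.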
The Infinit.e Solution
• Infinit.e is an Open Source
  document discovery and
  analysis platform that has
  these very cool Open Source
  tools lurking under the hood.


      github.com/ikanow/Infinit.e
The Infinit.e Solution

• Infinit.e is a scalable framework for:
  – Collecting
  – Storing
  – Enriching
  – Retrieving
  – Analyzing
  – Visualizing
• Structured and Unstructured Documents
IkanMeow
Document Collection
• Infinit.e harvests documents from:

  – URLs

  – File Shares

  – Databases
Sample RSS Document
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
…
<item>
    <title>Mediterranean conference seeks to flourish tourism in Egypt, Tunisia… </title>
    <link>http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish-tourism-in-egypt-tunisia-report-by-egyptlastminute-com-13613.html</link>
    <description>Report by egyptlastminute.com CAIRO: On Monday, the countries of the
    Mediterranean opened a conference seeking to enhance the future of tourism in the region. The
    conference focuses on the countries of Egypt and Tunisia the most …
    </description>
    <dc:publisher>Latest Press Releases | Press Release Bureau</dc:publisher>
    <dc:creator>unknown</dc:creator>
    <dc:date>Sat, 21 Apr 2012 00:00:00 GMT</dc:date>
</item>
…
</channel>
</rss>
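
A rough sketch of how a feed like this could be turned into flat records, one per item, using only the Python standard library. The function name `harvest_items`, the output field names, and the shortened sample feed are assumptions for illustration; this is not Infinit.e's harvester.

```python
import xml.etree.ElementTree as ET

# Dublin Core namespace used by the dc:* elements in the sample feed.
NS = {"dc": "http://purl.org/dc/elements/1.1/"}

# Shortened, well-formed stand-in for the sample feed (illustrative data).
SAMPLE = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
<item>
    <title>Mediterranean conference seeks to flourish tourism</title>
    <link>http://example.com/story.html</link>
    <description>Report by egyptlastminute.com</description>
    <dc:publisher>Press Release Bureau</dc:publisher>
    <dc:date>Sat, 21 Apr 2012 00:00:00 GMT</dc:date>
</item>
</channel>
</rss>"""


def harvest_items(rss_text):
    """Turn each <item> into a flat dict, one per harvested document."""
    root = ET.fromstring(rss_text)
    docs = []
    for item in root.iter("item"):
        docs.append({
            "title": (item.findtext("title") or "").strip(),
            "url": (item.findtext("link") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
            # Namespaced elements need the prefix map to be resolved.
            "publisher": item.findtext("dc:publisher", default="", namespaces=NS),
            "published": item.findtext("dc:date", default="", namespaces=NS),
        })
    return docs


docs = harvest_items(SAMPLE)
print(docs[0]["title"])
```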
Full Text Source
Source Ingestion Data Flow
Document DBs and Collections
Document Metadata
• doc_metadata.metadata
{
    "_id" : ObjectId("4f93638e0cf212156d0559d2"),
    "title" : "Mediterranean conference seeks to flourish tourism in Egypt, Tunisia ...",
    "url" : "http://www.pressreleasebureau.com/mediterranean-conference-seeks-to-flourish-tourism-in-egypt-tunisia-report-by-egyptlastminute-com-13613.html",
    "description" : "Report by egyptlastminute.com CAIRO: On Monday, the countries of the Mediterranean opened a conference seeking to enhance the future of tourism in the region. The conference focuses on the countries of Egypt and Tunisia; the most ...",
    "created" : ISODate("2012-04-22T01:49:02Z"),
    "metadata" : {…},
    "associations" : […],
    "entities" : […],
    ...
}
Harvested Document Metadata
• doc_metadata.metadata.metadata
  Note: It is okay to laugh at this
"metadata" : {
     "location" : [
            {
                   "region" : "South Asia",
                   "citystateprovince" : {
                          "stateprovince" : "Rolpa", "city" : "Newang"
                   },
                   "country" : "Nepal"
            }
     ],
     "icn" : [ "200573487" ],
     "incidentdate" : [ "07/25/2005" ],
     "organization" : [
            "Communist Party of Nepal (Maoist)/United People's Front"
     ],
     ...
},
Document Enrichment
• Infinit.e supports the extraction of entities
  and creation of associations using a
  combination of built-in enrichment libraries
  and third-party NLP APIs including:
Harvested Entities
• feature.entity
{
    "_id" : ObjectId("4f9189d48baf188282a1c9ef"),
    "alias" : [
           "Zine el Abidine Ben Ali",
           "Zine El Abidine Ben Ali",
           "Zine el Abidine ben Ali"
    ],
    "batch_resync" : true,
    "communityId" : ObjectId("4f8f138103644ee8003bf518"),
    "db_sync_doccount" : NumberLong(143),
    "db_sync_time" : "1338751174988",
    "dimension" : "Who",
    "disambiguated_name" : "Zine El Abidine Ben Ali",
    "doccount" : 152,
    "index" : "zine el abidine ben ali/person",
    "totalfreq" : 353,
    "type" : "Person"
}
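
In the sample above, the `index` value looks like the lower-cased disambiguated name joined to the lower-cased type with a slash. Treating that as an assumption drawn from this one record (the slides do not state the rule), a sketch of grouping differently spelled mentions under one key might look like this. The function names are hypothetical.

```python
from collections import defaultdict


def entity_index(disambiguated_name, entity_type):
    """Build a key shaped like the sample's "index" field (assumed rule)."""
    return f"{disambiguated_name.lower()}/{entity_type.lower()}"


def merge_mentions(mentions):
    """Group (surface form, disambiguated name, type) mentions by index key."""
    merged = defaultdict(lambda: {"alias": set(), "totalfreq": 0})
    for surface, name, entity_type in mentions:
        record = merged[entity_index(name, entity_type)]
        record["alias"].add(surface)   # remember every spelling seen
        record["totalfreq"] += 1       # one more mention of this entity
    return dict(merged)


mentions = [
    ("Zine el Abidine Ben Ali", "Zine El Abidine Ben Ali", "Person"),
    ("Zine El Abidine Ben Ali", "Zine El Abidine Ben Ali", "Person"),
    ("Zine el Abidine ben Ali", "Zine El Abidine Ben Ali", "Person"),
]
entities = merge_mentions(mentions)
print(sorted(entities))
```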
Harvested Entities
Harvested Associations
• feature.association
{
    "_id" : ObjectId("4f9189d48baf188282a1ca24"),
    "assoc_type" : "Fact",
    "communityId" : ObjectId("4f8f138103644ee8003bf518"),
    "db_sync_doccount" : NumberLong(70),
    "db_sync_time" : "1338491609281",
    "doccount" : NumberLong(73),
    "entity1" : [
           "zine el abidine ben ali",
           "zine el abidine ben ali/person"
    ],
    "entity1_index" : "zine el abidine ben ali/person",
    "entity2" : ["president","president/position"],
    "entity2_index" : "president/position",
    "index" : "5e3fff27ddb78d6873ccfc77cf05c52f",
    "verb" : ["career","current","past"],
    "verb_category" : "career"
}
Harvested Associations
Geolocation of Entities/Events
• feature.geo
  Note: geoindex uses a MongoDB 2d Index
{
    "_id" : ObjectId("4d8bb5efbe07bb4f7036c82e"),
    "search_field" : "cairo",
    "country" : "Egypt",
    "country_code" : "EG",
    "city" : "cairo",
    "region" : "Al Qahirah",
    "region_code" : "EG11",
    "population" : 7734602,
    "latitude" : "30.05",
    "longitude" : "31.25",
    "geoindex" : {
           "lat" : 30.05,
           "lon" : 31.25
    }
}
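
The record stores its coordinates twice: as strings in `latitude` and `longitude`, and as numbers in `geoindex`. A 2d index works on numeric coordinate pairs, so the numeric copy is the one that can be indexed. How Infinit.e actually derives it is not shown in the slides; the helper below, with the hypothetical name `with_geoindex`, is only a sketch of that conversion.

```python
def with_geoindex(record):
    """Return a copy of a geo record with a numeric geoindex sub-document."""
    enriched = dict(record)
    enriched["geoindex"] = {
        # Field order matches the sample: lat first, then lon.
        "lat": float(record["latitude"]),
        "lon": float(record["longitude"]),
    }
    return enriched


cairo = with_geoindex({
    "search_field": "cairo",
    "country_code": "EG",
    "latitude": "30.05",
    "longitude": "31.25",
})
print(cairo["geoindex"])
```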
Geolocation of Entities/Events
Who, What, Where and When
Why MongoDB? – Reason #1
Document-Oriented Storage
• MongoDB’s document-oriented storage
  (i.e. schema-less) is perfectly suited to the
  data design requirements of a system that
  needs to ingest a wide variety of
  structured and unstructured document
  formats and normalize them into one
  unified, semi-structured format
Why MongoDB? – Reason #2
JSON
• The standard language of open document
  analysis
  – JSON is a common interchange format supported
    by tools like elasticsearch and SaaS NLP engines
  – BSON (Binary JSON) is MongoDB’s native data
    format
  – Infinit.e ingests and exports JSON
    natively via the REST-based API
    Note: Infinit.e uses Google’s GSON Java library to convert
    JSON to POJOs and back




                                               This is the JSON logo
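
The JSON-to-object round trip mentioned in the note can be illustrated with a Python analogue. Infinit.e itself does this in Java with GSON; the sketch below uses Python's `json` and `dataclasses` modules instead, and the class `DocumentSummary` and its fields are hypothetical.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class DocumentSummary:
    """Hypothetical plain object holding a few fields from the metadata sample."""
    title: str
    url: str
    description: str = ""


def from_json(text):
    """Parse JSON text into a DocumentSummary, ignoring unknown fields."""
    raw = json.loads(text)
    known = {k: raw[k] for k in ("title", "url", "description") if k in raw}
    return DocumentSummary(**known)


def to_json(doc):
    """Serialize a DocumentSummary back to JSON text."""
    return json.dumps(asdict(doc), sort_keys=True)


doc = from_json('{"title": "Sample", "url": "http://example.com/a", "extra": 1}')
print(to_json(doc))
```

Dropping unknown fields on the way in keeps the typed object small, at the cost of losing them on the way back out; a schema-less store makes the opposite trade.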
Why MongoDB? – Reason #3
MongoDB Is Web Scale*




  *Shards are the secret ingredients in the web scale sauce. They just work.
Why MongoDB? – Reason #3
Scalability
• Seriously, MongoDB Scales
  – Harvesting and enriching documents requires
    a lot of disk space
  – MongoDB scales to arbitrary sizes in both
    read/write dimensions
  – Sophisticated sharding keys provide
    powerful/flexible balancing
• BUT building an initial cluster can be complex
  and managing cluster changes is “fiddly”
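
To make the role of the shard key concrete: MongoDB divides a sharded collection into chunks by shard key range and assigns each chunk to a shard. The toy router below illustrates that idea only. The split points, shard names, and the `route` function are made up, and this is not MongoDB's routing code.

```python
import bisect

# Split points between chunks and the shard owning each chunk.
# Each chunk includes its lower bound and excludes its upper bound.
# Both lists are made-up illustrative values.
SPLIT_POINTS = ["g", "n", "t"]
CHUNK_OWNERS = ["shard0", "shard1", "shard2", "shard0"]


def route(shard_key_value):
    """Pick the shard whose chunk range contains the shard key value."""
    chunk = bisect.bisect_right(SPLIT_POINTS, shard_key_value)
    return CHUNK_OWNERS[chunk]


for key in ("cairo", "newang", "washington"):
    print(key, "->", route(key))
```

A key whose values cluster in one range sends most traffic to one chunk, which is why the choice of shard key matters for balancing.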
Why MongoDB? – Reason #4
Integration with Apache Hadoop
•   Hadoop is rapidly becoming the de facto standard for
    data analytics
     – Open Source, very customizable
     – Proven scalability
     – Java libraries
•   The MongoDB Hadoop Adaptor allows Hadoop to read
    from and write to MongoDB instead of HDFS

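
The kind of job this enables can be sketched in plain Python as a map step and a reduce step that recompute per-entity counts such as `doccount` and `totalfreq`. This is not the adaptor's API and not Hadoop code. The per-document `frequency` field is an assumption (the slides elide the contents of `entities`), and each entity is assumed to appear at most once per document.

```python
from collections import defaultdict


def map_doc(doc):
    """Emit (entity index, frequency) pairs for one harvested document."""
    for entity in doc.get("entities", []):
        yield entity["index"], entity["frequency"]


def reduce_counts(pairs):
    """Sum frequencies and count contributing documents per entity index."""
    totals = defaultdict(lambda: {"doccount": 0, "totalfreq": 0})
    for index, frequency in pairs:
        totals[index]["doccount"] += 1        # one pair per document
        totals[index]["totalfreq"] += frequency
    return dict(totals)


# Made-up documents standing in for records read from the document store.
docs = [
    {"entities": [{"index": "cairo/city", "frequency": 2},
                  {"index": "egypt/country", "frequency": 1}]},
    {"entities": [{"index": "cairo/city", "frequency": 3}]},
]
summary = reduce_counts(pair for doc in docs for pair in map_doc(doc))
print(summary["cairo/city"])
```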
Tweeting about MongoDC
• Source:
  http://search.twitter.com/search.rss?q=mongodc
   – Who’s Tweeting?
   – What are they Tweeting?
   – What does basic document analysis of these
     Tweets tell us?
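
A minimal sketch of how those three questions could be answered once the tweets are harvested, using only counting and regular expressions. The tweets, author names, and the `summarize` function are made up for illustration; the real analysis ran through Infinit.e's enrichment pipeline.

```python
import re
from collections import Counter

# Made-up tweets standing in for items harvested from the search feed.
TWEETS = [
    {"author": "alice", "text": "Great sharding talk at #MongoDC with @bob"},
    {"author": "bob", "text": "Slides from my #MongoDC talk are up #mongodb"},
    {"author": "alice", "text": "Heading home from #MongoDC"},
]

HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")


def summarize(tweets):
    """Count who tweets, which hashtags they use, and who mentions whom."""
    authors = Counter(t["author"] for t in tweets)
    hashtags = Counter(
        tag.lower() for t in tweets for tag in HASHTAG.findall(t["text"])
    )
    mentions = Counter(
        (t["author"], m.lower()) for t in tweets for m in MENTION.findall(t["text"])
    )
    return authors, hashtags, mentions


authors, hashtags, mentions = summarize(TWEETS)
print(authors.most_common(1))
```

The author-to-mention pairs are the raw material for a connection graph between the people tweeting.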
Who’s Tweeting about MongoDC?
How are Tweeters Connected?
What are they Tweeting About?
Sentiment?
Twitter has its Limits…
Thank You!

             Craig Vitter



         www.ikanow.com
        cvitter@ikanow.com

How IKANOW uses MongoDB to help organizations solve really big problems
