1. Rich Data Graphs for MapReduce
Consuming and producing complex HBase structures in MapReduce™ with an online NLP-enhanced 1.5 Billion Word Wikipedia Text Corpus example
Scott Cinnamond – TerraMeta Software Inc.
http://cloudgraph.org
2. Contents
• CloudGraph Overview
• CloudGraph MapReduce Overview
• Example Wikipedia NLP Corpus
• Conclusions / Lessons
• Online Resources
• Status / Legal
3. CloudGraph Overview
• Vendor Agnostic Big Data Services
• Standards Based (XPath, SDO, UML)
• Domain Driven (your domain model is the API, insulating business logic from vendor specifics)
[Diagram: CloudGraph HBase Service, CloudGraph Cassandra Service*, CloudGraph Accumulo Service*, and CloudGraph RDB Service (Oracle, MySQL) layered over Hadoop / DFS / MapReduce, HBase, Cassandra, and Accumulo]
4. CloudGraph MapReduce Overview
• InputFormat Extensions
– GraphInputFormat – HBase scan(s), graph assembly, and “recognition” from the input query (i.e., detects properties deep within table rows)
– GraphXmlInputFormat – heterogeneous, arbitrary graphs unmarshalled from SDO XML
• OutputFormat Extensions
– GraphXmlOutputFormat – heterogeneous data graphs marshalled to SDO XML (see the driver sketch below)
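To show how these extensions would be wired together, here is a minimal Hadoop job-driver sketch. Only the stock Hadoop classes (Job, FileOutputFormat) are used as their real API; GraphInputFormat and GraphXmlOutputFormat are the class names from this slide, and their packages, generics, and any query-setup calls are assumptions about the CloudGraph API.

// Hypothetical driver sketch: stock Hadoop API plus the CloudGraph
// extension class names from this slide. Imports for GraphInputFormat /
// GraphXmlOutputFormat and the query configuration are assumed, not real.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GraphJobDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "cloudgraph-example");
    job.setJarByClass(GraphJobDriver.class);

    // GraphInputFormat: HBase scan(s), graph assembly and recognition
    // driven by an input query (query configuration omitted).
    job.setInputFormatClass(GraphInputFormat.class);

    // GraphXmlOutputFormat: marshals result graphs to SDO XML.
    job.setOutputFormatClass(GraphXmlOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}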
5. CloudGraph MapReduce Overview
• Mapper Extensions
– GraphMapper – consumes fully assembled data graphs (as GraphWritable); traverse to produce aggregates and/or output/persist new XML or data graphs (a mapper sketch follows below)
• Reducer Extensions
– GraphReducer – consumes aggregates; outputs/persists new XML or data graphs
• Counters
– Graph Node Counts, Assembly Times, and more
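To make the GraphMapper contract concrete, below is a minimal mapper sketch. GraphMapper and GraphWritable are named on this slide; the generic parameters and the getDataGraph() accessor are assumptions about the actual CloudGraph API, while DataGraph, DataObject, and Property are the standard SDO (commonj.sdo) types.

// Hypothetical GraphMapper sketch: the generics and getDataGraph()
// are assumed; the traversal below uses only standard SDO calls.
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import commonj.sdo.DataObject;
import commonj.sdo.Property;

public class NodeCountMapper extends GraphMapper<Text, LongWritable> {

  @Override
  protected void map(GraphWritable graph, Context context)
      throws java.io.IOException, InterruptedException {
    // Traverse the fully assembled data graph and emit an aggregate,
    // here the number of data objects it contains.
    DataObject root = graph.getDataGraph().getRootObject();
    context.write(new Text("nodes"), new LongWritable(count(root)));
  }

  private long count(DataObject node) {
    long n = 1;
    for (Object o : node.getInstanceProperties()) {
      Property prop = (Property) o;
      if (!prop.isContainment())
        continue; // follow containment references only
      if (prop.isMany())
        for (Object child : node.getList(prop))
          n += count((DataObject) child);
      else if (node.isSet(prop))
        n += count(node.getDataObject(prop));
    }
    return n;
  }
}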
6. Wikipedia Demo (Wikicorpus)
• See http://wikicorpus.cloudgraph.org
• 5 Million Wiki Pages
• 1.8 Million Wiki Categories
• 1.5 Billion Words, 100 Million NLP-Parsed Sentences (parsing in progress)
• No Third-Party Search Product
• No Specialized Architecture – only HBase, Hadoop, and MapReduce
7. Wikicorpus MapReduce Jobs
• Parse Wiki XML into plain text, generating word dependency trees with Stanford NLP (avg. 10K NLP nodes per wiki page*)
• Reduce dependency trees to typed governor/dependent aggregates with POS and other data
• Reduce word frequencies and other counts to indexes
*Stanford NLP is extremely CPU intensive. We are incrementally parsing and re-indexing on our demo cluster and exploring other hardware options.
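The NLP step of the first job can be pictured with the public Stanford CoreNLP API. The pipeline and dependency-graph calls below are real CoreNLP usage; the surrounding class and the printed output (which would instead feed the governor/dependent aggregate reducer) are illustrative.

// Sentence-at-a-time dependency parsing with Stanford CoreNLP.
import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.semgraph.SemanticGraphEdge;
import edu.stanford.nlp.util.CoreMap;

public class WikiNlpParser {
  private final StanfordCoreNLP pipeline;

  public WikiNlpParser() {
    Properties props = new Properties();
    // Annotators needed for typed dependency trees.
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma, parse");
    this.pipeline = new StanfordCoreNLP(props);
  }

  public void parse(String pageText) {
    Annotation doc = new Annotation(pageText);
    pipeline.annotate(doc);
    // Parse sentence by sentence; a whole page at once is too slow
    // and a single sentence is enough context (see Conclusions).
    for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
      SemanticGraph deps = sentence.get(
          SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation.class);
      // Typed governor/dependent pairs with POS feed the aggregates.
      for (SemanticGraphEdge edge : deps.edgeIterable())
        System.out.printf("%s(%s/%s, %s/%s)%n", edge.getRelation(),
            edge.getGovernor().word(), edge.getGovernor().tag(),
            edge.getDependent().word(), edge.getDependent().tag());
    }
  }
}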
8. Wikicorpus MapReduce Jobs
[Diagram: pipeline from Wiki XML dumps through Hadoop DFS / HBase into CloudGraph HBase / MapReduce; components: Wiki Page Mapper, NLP Parser, Page NLP Graphs, Sentence Mapper, Dependency Aggregate Mapper/Reducer, Dependency Aggregate Graphs, Word Frequency Mapper/Reducer]
9. Conclusions
• Stanford NLP is very accurate yet extremely CPU and memory intensive
• Parse one sentence at a time (a whole page takes too long; a sentence is enough context)
• Don’t NLP parse from the HBase reader
– an HBase scanner cannot stay open that long between next() calls for each record/Wiki page
– read the Wiki XML from the Hadoop FS and NLP parse from there (see the sketch below)
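A sketch of the last point, using the standard Hadoop FileSystem API; the dump path and the splitter/parser hand-off are hypothetical placeholders.

// Read the raw Wiki XML from HDFS and parse from there, instead of
// parsing inside an HBase scanner loop. The path is hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DumpReader {
  public static void parseFromHdfs(Configuration conf) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    Path dump = new Path("/wikicorpus/dumps/enwiki.xml"); // hypothetical
    try (FSDataInputStream in = fs.open(dump)) {
      // Split pages and sentences from the stream here and hand each
      // sentence to the NLP parser; no HBase scanner lease is held
      // open while the (slow) parse runs.
    }
  }
}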
10. Online Resources
• Download the complete CloudGraph Wiki
example:
https://github.com/cloudgraph/wikicorpus
• Run the example online:
http://wikicorpus.cloudgraph.org
• Product details, contact information:
http://cloudgraph.org
• Beta Source Repo:
https://github.com/terrameta/cloudgraph
• Production Source Repo (under construction):
https://github.com/cloudgraph
11. Status / Legal
• Project Status
– CloudGraph® is currently under private beta testing
• Licensing
– CloudGraph® 0.5.9 Community Edition (CE) is open source licensed
under version 2 of the GNU General Public License
• Trademarks
– Apache Hadoop™ is a trademark of Apache Software Foundation
– Apache HBase™ is a trademark of Apache Software Foundation
– CloudGraph® is a trademark of TerraMeta Software LLC, TerraMeta
Software Inc.