1. Rich Data Graphs for MapReduce
Consuming and producing complex HBase structures in MapReduce™ with an online NLP-enhanced 1.5 Billion Word Wikipedia Text Corpus example
Scott Cinnamond – TerraMeta Software Inc.
http://cloudgraph.org
2. Contents
• CloudGraph Overview
• CloudGraph MapReduce Overview
• Example Wikipedia NLP Corpus
• Conclusions / Lessons
• Online Resources
• Status / Legal
3. CloudGraph Overview
• Vendor Agnostic Big Data Services
• Standards Based (XPath, SDO, UML)
• Domain Driven (your domain model is the API, insulating business logic from vendor specifics)
[Diagram: CloudGraph HBase Service, CloudGraph Cassandra Service*, CloudGraph Accumulo Service*, and CloudGraph RDB Service (Oracle, MySQL) layered over Hadoop / DFS / MapReduce, HBase, Cassandra, and Accumulo]
4. CloudGraph MapReduce Overview
• InputFormat Extensions
– GraphInputFormat – HBase scan(s), graph assembly, and “recognition” from the input query (i.e., detects properties deep within table rows)
– GraphXmlInputFormat – heterogeneous, arbitrary graphs unmarshalled from SDO XML
• OutputFormat Extensions
– GraphXmlOutputFormat – heterogeneous data graphs marshalled to SDO XML (see the driver sketch below)
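To show how these extensions would be wired together, here is a minimal Hadoop job-driver sketch. Only the stock Hadoop classes (Job, FileOutputFormat) are used as their real API; GraphInputFormat and GraphXmlOutputFormat are the class names from this slide, and their packages, generics, and any query-setup calls are assumptions about the CloudGraph API.

// Hypothetical driver sketch: stock Hadoop API plus the CloudGraph
// extension class names from this slide. Imports for GraphInputFormat /
// GraphXmlOutputFormat and the query configuration are assumed, not real.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GraphJobDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "cloudgraph-example");
    job.setJarByClass(GraphJobDriver.class);

    // GraphInputFormat: HBase scan(s), graph assembly and recognition
    // driven by an input query (query configuration omitted).
    job.setInputFormatClass(GraphInputFormat.class);

    // GraphXmlOutputFormat: marshals result graphs to SDO XML.
    job.setOutputFormatClass(GraphXmlOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}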
5. CloudGraph MapReduce Overview
• Mapper Extensions
– GraphMapper – consumes fully assembled data graphs (as GraphWritable); traverse to produce aggregates and/or output/persist new XML or data graphs (a mapper sketch follows below)
• Reducer Extensions
– GraphReducer – consumes aggregates; outputs/persists new XML or data graphs
• Counters
– Graph Node Counts, Assembly Times, and more
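To make the GraphMapper contract concrete, below is a minimal mapper sketch. GraphMapper and GraphWritable are named on this slide; the generic parameters and the getDataGraph() accessor are assumptions about the actual CloudGraph API, while DataGraph, DataObject, and Property are the standard SDO (commonj.sdo) types.

// Hypothetical GraphMapper sketch: the generics and getDataGraph()
// are assumed; the traversal below uses only standard SDO calls.
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import commonj.sdo.DataObject;
import commonj.sdo.Property;

public class NodeCountMapper extends GraphMapper<Text, LongWritable> {

  @Override
  protected void map(GraphWritable graph, Context context)
      throws java.io.IOException, InterruptedException {
    // Traverse the fully assembled data graph and emit an aggregate,
    // here the number of data objects it contains.
    DataObject root = graph.getDataGraph().getRootObject();
    context.write(new Text("nodes"), new LongWritable(count(root)));
  }

  private long count(DataObject node) {
    long n = 1;
    for (Object o : node.getInstanceProperties()) {
      Property prop = (Property) o;
      if (!prop.isContainment())
        continue; // follow containment references only
      if (prop.isMany())
        for (Object child : node.getList(prop))
          n += count((DataObject) child);
      else if (node.isSet(prop))
        n += count(node.getDataObject(prop));
    }
    return n;
  }
}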
6. Wikipedia Demo (Wikicorpus)
• See http://wikicorpus.cloudgraph.org
• 5 Million Wiki Pages
• 1.8 Million Wiki Categories
• 1.5 Billion Words, 100 Million NLP-Parsed Sentences (parsing in progress)
• No Third-Party Search Product
• No Specialized Architecture – only HBase, Hadoop, and MapReduce
7. Wikicorpus MapReduce Jobs
• Parse Wiki XML into plain text, generating word dependency trees with Stanford NLP (avg. 10K NLP nodes per wiki page*)
• Reduce dependency trees to typed governor/dependent aggregates with POS and other data
• Reduce word frequencies and other counts to indexes
*Stanford NLP is extremely CPU intensive. We are incrementally parsing and re-indexing on our demo cluster and exploring other hardware options.
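The NLP step of the first job can be pictured with the public Stanford CoreNLP API. The pipeline and dependency-graph calls below are real CoreNLP usage; the surrounding class and the printed output (which would instead feed the governor/dependent aggregate reducer) are illustrative.

// Sentence-at-a-time dependency parsing with Stanford CoreNLP.
import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.SemanticGraphCoreAnnotations;
import edu.stanford.nlp.semgraph.SemanticGraphEdge;
import edu.stanford.nlp.util.CoreMap;

public class WikiNlpParser {
  private final StanfordCoreNLP pipeline;

  public WikiNlpParser() {
    Properties props = new Properties();
    // Annotators needed for typed dependency trees.
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma, parse");
    this.pipeline = new StanfordCoreNLP(props);
  }

  public void parse(String pageText) {
    Annotation doc = new Annotation(pageText);
    pipeline.annotate(doc);
    // Parse sentence by sentence; a whole page at once is too slow
    // and a single sentence is enough context (see Conclusions).
    for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
      SemanticGraph deps = sentence.get(
          SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation.class);
      // Typed governor/dependent pairs with POS feed the aggregates.
      for (SemanticGraphEdge edge : deps.edgeIterable())
        System.out.printf("%s(%s/%s, %s/%s)%n", edge.getRelation(),
            edge.getGovernor().word(), edge.getGovernor().tag(),
            edge.getDependent().word(), edge.getDependent().tag());
    }
  }
}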
8. Wikicorpus MapReduce Jobs
[Diagram: pipeline from Wiki XML dumps through Hadoop DFS / HBase into CloudGraph HBase / MapReduce; components: Wiki Page Mapper, NLP Parser, Page NLP Graphs, Sentence Mapper, Dependency Aggregate Mapper/Reducer, Dependency Aggregate Graphs, Word Frequency Mapper/Reducer]
9. Conclusions
• Stanford NLP is very accurate yet extremely CPU and memory intensive
• Parse one sentence at a time (a whole page takes too long; a sentence is enough context)
• Don’t NLP parse from the HBase reader
– an HBase scanner cannot stay open that long between next() calls for each record/Wiki page
– read the Wiki XML from the Hadoop FS and NLP parse from there (see the sketch below)
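A sketch of the last point, using the standard Hadoop FileSystem API; the dump path and the splitter/parser hand-off are hypothetical placeholders.

// Read the raw Wiki XML from HDFS and parse from there, instead of
// parsing inside an HBase scanner loop. The path is hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DumpReader {
  public static void parseFromHdfs(Configuration conf) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    Path dump = new Path("/wikicorpus/dumps/enwiki.xml"); // hypothetical
    try (FSDataInputStream in = fs.open(dump)) {
      // Split pages and sentences from the stream here and hand each
      // sentence to the NLP parser; no HBase scanner lease is held
      // open while the (slow) parse runs.
    }
  }
}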
10. Online Resources
• Download the complete CloudGraph Wiki
example:
https://github.com/cloudgraph/wikicorpus
• Run the example online:
http://wikicorpus.cloudgraph.org
• Product details, contact information:
http://cloudgraph.org
• Beta Source Repo:
https://github.com/terrameta/cloudgraph
• Production Source Repo (under construction):
https://github.com/cloudgraph
11. Status / Legal
• Project Status
– CloudGraph® is currently under private beta testing
• Licensing
– CloudGraph® 0.5.9 Community Edition (CE) is open source licensed
under version 2 of the GNU General Public License
• Trademarks
– Apache Hadoop™ is a trademark of Apache Software Foundation
– Apache HBase™ is a trademark of Apache Software Foundation
– CloudGraph® is a trademark of TerraMeta Software LLC, TerraMeta
Software Inc.