Turning NoSQL data into Graph

Playing with Apache Giraph and Apache Gora
Team
Renato Marroquín
• PhD student:
• Interested in:
Information retrieval.
Distributed and scalable data management.
• Apache Gora:
PPMC Member
Committer.
• rmarroquin [at] apache [dot] org
Claudio Martella
• PhD student: LSDS @VU University Amsterdam.
• Interested in
Complex Networks.
Distributed and scalable infrastructures.
• Apache Giraph:
PPMC Member
Committer.
• claudio [at] apache [dot] org
Lewis McGibbney
• Scottish expat fae Glasgow
• Post Doc @Stanford University: Engineering Informatics
• Quantity Surveyor/Cost Consultant by profession
• Cycling mad
• Keen OSS enthusiast @TheASF and beyond
• lewismc [at] apache [dot] org
Apache Gora
What is Apache Gora?
● Data Persistence: Persisting objects to column stores, key-value
stores, SQL databases, and flat files in the local file system or
Hadoop HDFS.
● Data Access: An easy-to-use, Java-friendly common API for accessing
the data regardless of its location.
● Indexing: Persisting objects to Lucene and Solr indexes, and
accessing/querying the data with the Gora API.
● Analysis: Accessing the data and analyzing it through adapters for
Apache Pig, Apache Hive, and Cascading.
● MapReduce support: Out-of-the-box and extensive MapReduce
(Apache Hadoop) support for data in the data store.
What is Apache Gora?
● Provides an in-memory data model and persistence for big data.
● Gora supports:
How does Gora work?
1. Define your schema using Apache Avro.
2. Compile your schemas using Gora's compiler.
3. Create a mapping between the logical and physical layout.
4. Update the gora.properties file to set back-end properties.

Rock the NoSQL world!
How does Gora work?
2. Compile your schemas using Gora's compiler.
	java -jar gora-core-XYZ.jar org.apache.gora.compiler.GoraCompiler.class employee.avsc gora-app/src/main/java/
Apache Giraph
MapReduce and Graphs
• Plain MapReduce is not well suited for graph
algorithms because:
• Graph algorithms are iterative.
• Not intuitive in MapReduce.
• Unnecessarily slow
• Each iteration is a single MapReduce job with too much
overhead
• Separately scheduled
• The graph structure is read from disk
• The intermediate results are read from disks
• Hard to implement
Google's Pregel
• Introduced in 2010
• Based on Valiant's BSP
• “Think like a vertex” that can send messages to any vertex in the
graph using the bulk synchronous parallel programming model.
• Computation completes when all components complete.
• Batch-oriented processing
• Computation happens in-memory
• Master/slave architecture
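The "think like a vertex" idea can be illustrated with a small single-process simulation of single-source shortest paths. This is only a sketch under stated assumptions: the class `PregelSketch`, the method `shortestPaths`, and the three-vertex sample graph are made up for illustration and are not Pregel or Giraph code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory illustration of vertex-centric computation;
// none of these names come from Pregel or Giraph.
public class PregelSketch {

    // edges[v] lists {target, weight} pairs for vertex v (weights >= 0).
    static double[] shortestPaths(int[][][] edges, int source) {
        double[] dist = new double[edges.length];
        Arrays.fill(dist, Double.POSITIVE_INFINITY);

        // Messages that are visible at the start of the current superstep.
        Map<Integer, List<Double>> inbox = new HashMap<>();
        inbox.computeIfAbsent(source, k -> new ArrayList<>()).add(0.0);

        // Keep running supersteps until no messages are in flight,
        // i.e. every vertex has effectively voted to halt.
        while (!inbox.isEmpty()) {
            Map<Integer, List<Double>> outbox = new HashMap<>();
            for (Map.Entry<Integer, List<Double>> entry : inbox.entrySet()) {
                int v = entry.getKey();
                // compute(): pick the smallest distance offered by any message.
                double best = Double.POSITIVE_INFINITY;
                for (double m : entry.getValue()) {
                    best = Math.min(best, m);
                }
                if (best < dist[v]) {
                    dist[v] = best;
                    // Tell each neighbor about the improved distance.
                    for (int[] edge : edges[v]) {
                        outbox.computeIfAbsent(edge[0], k -> new ArrayList<>())
                              .add(best + edge[1]);
                    }
                }
                // Otherwise the vertex sends nothing and stays halted.
            }
            // Barrier: new messages only become visible in the next superstep.
            inbox = outbox;
        }
        return dist;
    }

    public static void main(String[] args) {
        int[][][] edges = {
            {{1, 1}, {2, 4}}, // vertex 0 -> 1 (weight 1), 0 -> 2 (weight 4)
            {{2, 2}},         // vertex 1 -> 2 (weight 2)
            {}                // vertex 2 has no outgoing edges
        };
        System.out.println(Arrays.toString(shortestPaths(edges, 0)));
    }
}
```

A real system spreads the vertices across workers and keeps them in memory between supersteps; the loop above only mimics that schedule on one machine.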
Bulk synchronous parallel
[Figure: processors plotted against time. Each superstep consists of computation plus communication and ends at a barrier.]
Open source implementations
• Several exist, including:
• Apache Giraph
• Apache Hama
• GoldenOrb
• Signal/Collect
Apache Giraph
• Incubated since summer 2011
• Written in Java
• Implements Pregel's API
• Runs on existing MapReduce infrastructure
• Active community from Yahoo!, Facebook, LinkedIn, Twitter, and
more.
• It's a single Map-only job
• It runs on Hadoop in-memory.
• Fault tolerant
• Zookeeper for state, No SPOF
During execution time
Setup
● Load graph
● Assign vertices to workers
● Validate workers' health
Compute
● Assign messages to workers
● Iterate on active vertices
● Call each vertex's compute()
Synchronize
● Send messages to workers
● Compute aggregators
● Checkpoint
Teardown
● Write results back
● Write aggregators back
Giraph's components
• Master
• Application coordinator
• One active master at a time
• Assigns partition owners to workers prior to each superstep
• Synchronizes supersteps
• Worker – Computation & messaging
• Loads the graph from input splits
• Performs computation/messaging of its assigned partitions
• Zookeeper
• Maintains global application state
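How vertices could end up on workers can be sketched as follows. The hashing and round-robin policy here is an assumption chosen for simplicity, and `PartitionSketch` with its methods is hypothetical; Giraph's actual partitioning code may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of partition ownership; the policy shown
// here is an assumption, not Giraph's implementation.
public class PartitionSketch {

    // Map a vertex id to a partition by hashing the id.
    static int partitionOf(long vertexId, int partitions) {
        return Math.floorMod(Long.hashCode(vertexId), partitions);
    }

    // Master side: hand out partitions to workers round-robin.
    static int ownerOf(int partition, int workers) {
        return partition % workers;
    }

    // Group vertex ids by the worker that would load and compute them.
    static Map<Integer, List<Long>> assign(long[] vertexIds, int partitions, int workers) {
        Map<Integer, List<Long>> byWorker = new TreeMap<>();
        for (long id : vertexIds) {
            int worker = ownerOf(partitionOf(id, partitions), workers);
            byWorker.computeIfAbsent(worker, k -> new ArrayList<>()).add(id);
        }
        return byWorker;
    }

    public static void main(String[] args) {
        long[] ids = {0, 1, 2, 3, 4, 5, 6, 7};
        // 4 partitions spread over 2 workers.
        System.out.println(assign(ids, 4, 2));
    }
}
```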
What is needed then?
• Your algorithm in the Pregel model.
• A VertexInputFormat to read your graph,
e.g. <vertex><neighbor1><neighbor2>
• A VertexOutputFormat to write back the results,
e.g. <vertex> <pageRank>
• You could define:
• A Combiner (for reducing number of messages sent/received)
• An Aggregator (for enabling global computation)
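The two optional pieces can be sketched in plain Java. These are stand-ins for the ideas only: `CombinerSketch`, `combineMin`, and `aggregateSum` are hypothetical names and do not reproduce Giraph's own combiner or aggregator interfaces.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for the combiner and aggregator ideas;
// these are not Giraph's interfaces.
public class CombinerSketch {

    // Combiner: keep only the minimum message per destination vertex,
    // so a single message per vertex is sent instead of many.
    static Map<Long, Double> combineMin(long[] targets, double[] values) {
        Map<Long, Double> combined = new HashMap<>();
        for (int i = 0; i < targets.length; i++) {
            combined.merge(targets[i], values[i], Double::min);
        }
        return combined;
    }

    // Aggregator: fold one contribution per vertex into a single global
    // value that could be made available to all vertices afterwards.
    static double aggregateSum(double[] perVertexValues) {
        double sum = 0.0;
        for (double v : perVertexValues) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] targets = {7L, 7L, 9L, 7L};
        double[] values = {5.0, 2.0, 4.0, 3.0};
        // Three messages for vertex 7 collapse into one.
        System.out.println(combineMin(targets, values));
        System.out.println(aggregateSum(new double[] {0.25, 0.5, 0.25}));
    }
}
```

A minimum combiner fits shortest paths because a vertex only cares about the smallest distance it is offered; other algorithms would need a different combining function.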
Running a Giraph job
• It is just like running a Hadoop job:
	$HADOOP_HOME/bin/hadoop jar \
	  giraph-examples-1.1.0-XXX-jar-with-dependencies.jar \
	  org.apache.giraph.GiraphRunner \
	  org.apache.giraph.examples.SimpleShortestPathsComputation \
	  -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
	  -vip /user/hduser/input/tiny_graph.txt \
	  -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
	  -op /user/hduser/output/shortestpaths \
	  -w 1
Apache Giraph + Apache Gora
The project idea
• Integrate Apache Gora with other cool projects.
• Provide access to different data stores out-of-the-box for Apache Giraph.
• Give users more flexibility when deciding how to run graph
algorithms.
• Make the Hadoop ecosystem bigger.
• Apply for the Google Summer of Code program.
The big picture
Integration hooks
• Vertices
• Edges
• Key factory
Parameters offered
Label                                 Description
giraph.gora.datastore.class           Gora DataStore class to read data from (required).
giraph.gora.key.class                 Gora key class used to query the data store (required).
giraph.gora.persistent.class          Gora Persistent class to read objects from Gora (required).
giraph.gora.keys.factory.class        Keys factory to convert strings into desired keys (required).
giraph.gora.output.datastore.class    Gora DataStore class to write data to (required).
giraph.gora.output.key.class          Gora key class used to write to the data store (required).
giraph.gora.output.persistent.class   Gora Persistent class to write to Gora (required).
giraph.gora.start.key                 Gora start key to query the data store.
giraph.gora.end.key                   Gora end key to query the data store.
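A quick way to sanity-check these settings before launching a job is sketched below. The helper is hypothetical: `GoraParamsSketch`, `missing`, and `buildKey` are names invented here, and the `Function` passed to `buildKey` only stands in for whatever the configured keys factory does with the start and end key strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper; it only mirrors the property names listed above
// and is not part of Giraph or Gora.
public class GoraParamsSketch {

    static final String[] REQUIRED = {
        "giraph.gora.datastore.class",
        "giraph.gora.key.class",
        "giraph.gora.persistent.class",
        "giraph.gora.keys.factory.class",
        "giraph.gora.output.datastore.class",
        "giraph.gora.output.key.class",
        "giraph.gora.output.persistent.class"
    };

    // Returns the required properties that are absent or blank.
    static List<String> missing(Map<String, String> conf) {
        List<String> result = new ArrayList<>();
        for (String name : REQUIRED) {
            String value = conf.get(name);
            if (value == null || value.trim().isEmpty()) {
                result.add(name);
            }
        }
        return result;
    }

    // Stand-in for a keys factory: turns the string form of a key
    // (as passed on the command line) into a typed key.
    static <K> K buildKey(Map<String, String> conf, String name, Function<String, K> factory) {
        String raw = conf.get(name);
        return raw == null ? null : factory.apply(raw);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("giraph.gora.datastore.class", "org.apache.gora.hbase.store.HBaseStore");
        conf.put("giraph.gora.key.class", "java.lang.String");
        conf.put("giraph.gora.start.key", "0");
        conf.put("giraph.gora.end.key", "10");

        System.out.println("Missing: " + missing(conf));
        Long start = buildKey(conf, "giraph.gora.start.key", raw -> Long.valueOf(raw.trim()));
        Long end = buildKey(conf, "giraph.gora.end.key", raw -> Long.valueOf(raw.trim()));
        System.out.println("Key range: " + start + " to " + end);
    }
}
```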
Rocks in the way
• Dependency issues.
• Supported versions by each project.
• Maven war for handling cyclic dependencies.
• Hadoop issues.
• Not all data stores support MapReduce out of the box.
• Finding what needs to be on the classpath.
• Providing an API between both projects that is:
• Flexible.
• Simple.
• Pluggable.
So now what?
1. Create your data beans with Gora.
2. Compile them.
	java -jar gora-core-XYZ.jar org.apache.gora.compiler.GoraCompiler.class vertex.avsc gora-app/src/main/java/
3. Get your Gora files set up for passing them to Giraph.
	gora.properties
	gora-{datastore}-mapping.xml
4. Get your hooks in place.
	GVertexInputFormat
	GVertexOutputFormat
	KeyFactory
5. Run Giraph!
	hadoop jar $GIRAPH_EXAMPLES_JAR org.apache.giraph.GiraphRunner \
	  -files ../conf/gora.properties,../conf/gora-hbase-mapping.xml,../conf/hbase-site.xml \
	  -Dio.serializations=org.apache.hadoop.io.serializer.WritableSerialization,org.apache.hadoop.io.serializer.JavaSerialization \
	  -Dgiraph.gora.datastore.class=org.apache.gora.hbase.store.HBaseStore \
	  -Dgiraph.gora.key.class=java.lang.String \
	  -Dgiraph.gora.persistent.class=org.apache.giraph.io.gora.generated.GEdge \
	  -Dgiraph.gora.start.key=0 -Dgiraph.gora.end.key=10 \
	  -Dgiraph.gora.keys.factory.class=org.apache.giraph.io.gora.utils.KeyFactory \
	  -Dgiraph.gora.output.datastore.class=org.apache.gora.hbase.store.HBaseStore \
	  -Dgiraph.gora.output.key.class=java.lang.String \
	  -Dgiraph.gora.output.persistent.class=org.apache.giraph.io.gora.generated.GEdgeResult \
	  -libjars $GIRAPH_GORA_JAR,$GORA_HBASE_JAR,$HBASE_JAR \
	  org.apache.giraph.examples.SimpleShortestPathsComputation \
	  -eif org.apache.giraph.io.gora.GoraGEdgeEdgeInputFormat \
	  -eof org.apache.giraph.io.gora.GoraGEdgeEdgeOutputFormat \
	  -w 1
Future work
More complex schemas
Adding more data stores
Send us an email on the mailing lists
New serialization formats
• Different serialization formats besides Apache Avro.
• And others that could be interesting for handling different use
cases.
Thanks!
Q&A
Bulk synchronous parallel model
• Computation consists of a series of “supersteps”.
• A superstep is an atomic unit of computation in which operations can
happen in parallel.
• During a superstep, components are assigned to tasks and receive
unordered messages from the previous superstep.
• Point-to-point messages
• Sent during a superstep from one component to another and then
delivered in the following superstep.
• Computation completes when all components complete.
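The barrier between supersteps can be demonstrated with plain Java threads. This is a minimal sketch, assuming a ring of workers that each pass a number to their right-hand neighbor; `BspBarrierSketch` and `run` are hypothetical names, and `java.util.concurrent.CyclicBarrier` merely stands in for the synchronization a real BSP system provides.

```java
import java.util.Arrays;
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

// Hypothetical demo of superstep synchronization with threads.
public class BspBarrierSketch {

    static int[] run(int workers, int supersteps) {
        // buf[0] holds messages readable in the current superstep,
        // buf[1] collects messages for the next one.
        int[][] buf = {new int[workers], new int[workers]};
        for (int i = 0; i < workers; i++) {
            buf[0][i] = i;
        }
        // Barrier action: runs once all workers have arrived and before
        // any of them is released, so it can safely deliver the messages.
        CyclicBarrier barrier = new CyclicBarrier(workers, () -> {
            int[] delivered = buf[1];
            buf[1] = buf[0];
            buf[0] = delivered;
        });
        Thread[] threads = new Thread[workers];
        for (int w = 0; w < workers; w++) {
            final int id = w;
            threads[w] = new Thread(() -> {
                try {
                    for (int s = 0; s < supersteps; s++) {
                        // Compute on the received value, then send the
                        // result to the right-hand neighbor.
                        buf[1][(id + 1) % workers] = buf[0][id] + 1;
                        barrier.await(); // end of superstep
                    }
                } catch (InterruptedException | BrokenBarrierException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[w].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return buf[0];
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(run(4, 3)));
    }
}
```

Because a worker only reads values delivered at the previous barrier, no worker can observe a message sent during the superstep that is still in progress.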

Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 

Giraph+Gora in ApacheCon14

  • 1. Turning NoSQL data into Graph
 Playing with Apache Giraph and Apache Gora
  • 3. Renato Marroquín
    • PhD student:
    • Interested in:
      • Information retrieval.
      • Distributed and scalable data management.
    • Apache Gora: PPMC Member, Committer.
    • rmarroquin [at] apache [dot] org
  • 4. Claudio Martella
    • PhD student: LSDS @VU University Amsterdam.
    • Interested in:
      • Complex networks.
      • Distributed and scalable infrastructures.
    • Apache Giraph: PPMC Member, Committer.
    • claudio [at] apache [dot] org
  • 5. Lewis McGibbney
    • Scottish expat fae Glasgow
    • Post Doc @Stanford University: Engineering Informatics
    • Quantity Surveyor/Cost Consultant by profession
    • Cycling mad
    • Keen OSS enthusiast @TheASF and beyond
    • lewismc [at] apache [dot] org
  • 7. What is Apache Gora?
    ● Data Persistence: Persisting objects to column stores, key-value stores, SQL databases, and flat files in the local file system or Hadoop HDFS.
    ● Data Access: An easy-to-use, Java-friendly common API for accessing the data regardless of its location.
    ● Indexing: Persisting objects to Lucene and Solr indexes, and accessing/querying the data with the Gora API.
    ● Analysis: Accessing the data and analyzing it through adapters for Apache Pig, Apache Hive, and Cascading.
    ● MapReduce support: Out-of-the-box and extensive MapReduce (Apache Hadoop) support for data in the data store.
  • 8. What is Apache Gora? ● Provides an in-memory data model and persistence for big data. ● Gora supports:
  • 9. How does Gora work?
    1. Define your schema using Apache Avro.
    2. Compile your schemas using Gora's Compiler.
    3. Create a mapping between logical and physical layout.
    4. Update the gora.properties file to set back-end properties.
    Rock the NoSQL world!!!
  • 10. How does Gora work? 1. Define your schema using Apache Avro.
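As a rough illustration of this step: an Avro schema is a JSON document that describes a record and its fields. The sketch below builds one in Python and serializes it as it might appear in `employee.avsc` (the file name used in the compile step). The record name, namespace, and fields are hypothetical and are not taken from the talk.

```python
import json

# Hypothetical Employee record; the name, namespace, and fields are
# illustrative assumptions, not the schema shown in the talk.
employee_schema = {
    "type": "record",
    "name": "Employee",
    "namespace": "org.example.generated",
    "fields": [
        # A nullable string field expressed as an Avro union.
        {"name": "name", "type": ["null", "string"], "default": None},
        {"name": "salary", "type": "int", "default": 0},
    ],
}


def write_schema(schema, path):
    """Serialize the schema dictionary to a JSON .avsc file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(schema, handle, indent=2)


# The JSON text that would go into employee.avsc.
schema_text = json.dumps(employee_schema, indent=2)
print(schema_text)
```

Calling `write_schema(employee_schema, "employee.avsc")` would produce the file that the compile step takes as input.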
  • 11. How does Gora work? 2. Compile your schemas using Gora's Compiler.
    java -jar gora-core-XYZ.jar o.a.gora.compiler.GoraCompiler.class employee.avsc gora-app/src/main/java/
  • 12. How does Gora work? 2. Compile your schemas using Gora's Compiler.
  • 13. How does Gora work? 3. Create a mapping between logical and physical layout.
  • 14. How does Gora work? 4. Update the gora.properties file to set back-end properties.
  • 15. How does Gora work? Rock the NoSQL world!
  • 17. MapReduce and Graphs
    • Plain MapReduce is not well suited for graph algorithms because:
      • Graph algorithms are iterative.
      • Not intuitive in MapReduce.
      • Unnecessarily slow:
        • Each iteration is a single MapReduce job with too much overhead.
        • Separately scheduled.
        • The graph structure is read from disk.
        • The intermediate results are read from disk.
      • Hard to implement.
  • 18. Google's Pregel
    • Introduced in 2010.
    • Based on Valiant's BSP.
    • “Think like a vertex” that can send messages to any vertex in the graph using the bulk synchronous parallel programming model.
    • Computation is complete when all components complete.
    • Batch-oriented processing.
    • Computation happens in memory.
    • Master/slave architecture.
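To make "think like a vertex" concrete, here is a minimal single-process sketch of a vertex-centric shortest-paths computation. It only imitates the model: supersteps are loop iterations, the barrier is the swap of message boxes, and a vertex with no incoming messages is treated as having voted to halt. The function name and the sample graph are assumptions made for illustration; this is not Pregel's or Giraph's API.

```python
import math


def pregel_shortest_paths(edges, source):
    """Simulate vertex-centric single-source shortest paths.

    edges maps every vertex to a dict {neighbor: edge_weight}; every
    neighbor must also appear as a key. Returns (distances, supersteps).
    """
    distance = {vertex: math.inf for vertex in edges}
    # Messages that are visible at the start of the current superstep.
    inbox = {vertex: [] for vertex in edges}
    inbox[source].append(0.0)
    supersteps = 0

    # The computation ends when no vertex has pending messages.
    while any(inbox.values()):
        outbox = {vertex: [] for vertex in edges}
        for vertex, messages in inbox.items():
            if not messages:
                continue  # no messages: this vertex stays halted
            candidate = min(messages)
            if candidate < distance[vertex]:
                distance[vertex] = candidate
                # Tell each neighbor the cost of reaching it through us.
                for neighbor, weight in edges[vertex].items():
                    outbox[neighbor].append(candidate + weight)
        # Barrier: messages sent now are delivered in the next superstep.
        inbox = outbox
        supersteps += 1
    return distance, supersteps


# Hypothetical sample graph.
graph = {
    "a": {"b": 1.0, "c": 4.0},
    "b": {"c": 2.0, "d": 6.0},
    "c": {"d": 1.0},
    "d": {},
}
distances, steps = pregel_shortest_paths(graph, "a")
print(distances)  # {'a': 0.0, 'b': 1.0, 'c': 3.0, 'd': 4.0}
print(steps)      # 4
```

Each vertex only looks at its own state and its incoming messages, which is what lets a real system run the inner loop in parallel across workers.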
  • 20. Open source implementations
    • There are some, such as:
      • Apache Giraph
      • Apache Hama
      • GoldenOrb
      • Signal/Collect
  • 21. Apache Giraph
    • Incubated since summer 2011.
    • Written in Java.
    • Implements Pregel's API.
    • Runs on existing MapReduce infrastructure.
    • Active community from Yahoo!, Facebook, LinkedIn, Twitter, and more.
    • It's a single Map-only job.
    • It runs on Hadoop in memory.
    • Fault tolerant.
    • ZooKeeper for state, no SPOF.
  • 22. During execution time
    • Setup
      ● Load graph
      ● Assign vertices to workers
      ● Validate workers' health
    • Compute
      ● Assign messages to workers
      ● Iterate on active vertices
      ● Call each vertex's compute()
    • Synchronize
      ● Send messages to workers
      ● Compute aggregators
      ● Checkpoint
    • Teardown
      ● Write results back
      ● Write aggregators back
  • 23. Giraph's components
    • Master: application coordinator
      • One active master at a time.
      • Assigns partition owners to workers prior to each superstep.
      • Synchronizes supersteps.
    • Worker: computation and messaging
      • Loads the graph from input splits.
      • Performs computation/messaging of its assigned partitions.
    • ZooKeeper
      • Maintains global application state.
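The split between vertices, partitions, and workers can be pictured with a small sketch. It assumes integer vertex IDs, places a vertex in a partition by taking its ID modulo the partition count, and hands partitions to workers round-robin. This is one simple scheme chosen for illustration; the function name is hypothetical and the sketch does not claim to reproduce Giraph's own partitioner.

```python
def assign_partitions(vertex_ids, num_partitions, workers):
    """Map each vertex ID to (partition, owning worker).

    Assumption: IDs are non-negative integers, partitions are chosen by
    modulo, and partition owners are assigned round-robin.
    """
    # The master's job in this sketch: decide which worker owns each partition.
    owners = {p: workers[p % len(workers)] for p in range(num_partitions)}
    placement = {}
    for vertex_id in vertex_ids:
        partition = vertex_id % num_partitions
        placement[vertex_id] = (partition, owners[partition])
    return placement


placement = assign_partitions(range(6), num_partitions=4, workers=["w0", "w1"])
for vertex_id, (partition, worker) in placement.items():
    print(vertex_id, partition, worker)
```

Because ownership is decided per partition rather than per vertex, the master can reassign a whole partition to another worker between supersteps without touching the vertex-to-partition mapping.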
  • 24. What is needed then?
    • Your algorithm in the Pregel model.
    • A VertexInputFormat to read your graph, e.g. <vertex><neighbor1><neighbor2>
    • A VertexOutputFormat to write back the results, e.g. <vertex> <pageRank>
    • You could define:
      • A Combiner (for reducing the number of messages sent/received).
      • An Aggregator (for enabling global computation).
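The two optional pieces can be sketched in a few lines. A combiner merges messages addressed to the same vertex before they are delivered (here keeping only the minimum, which is enough for shortest paths), and an aggregator folds one value per vertex into a single global value. The function names and sample data are hypothetical; this illustrates the idea only and is not Giraph's Combiner or Aggregator interface.

```python
def combine_min(messages):
    """Keep only the smallest value per target vertex.

    messages is a list of (target_vertex, value) pairs.
    """
    best = {}
    for target, value in messages:
        if target not in best or value < best[target]:
            best[target] = value
    return sorted(best.items())


def aggregate_sum(contributions):
    """Fold per-vertex contributions into one global value."""
    return sum(contributions)


# Four outgoing messages, three of them for the same vertex.
outgoing = [("d", 7.0), ("d", 5.0), ("c", 3.0), ("d", 4.0)]
combined = combine_min(outgoing)
print(len(outgoing), "->", len(combined))  # 4 -> 2
print(combined)                            # [('c', 3.0), ('d', 4.0)]

# Each vertex reports 1 if it changed its value in this superstep.
changed_flags = [1, 0, 1, 1]
print(aggregate_sum(changed_flags))        # 3
```

A minimum combiner is safe for shortest paths because a vertex only ever uses the smallest incoming distance; algorithms that need every message cannot use it.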
  • 25. Running a Giraph job
    • It is just like running Hadoop:
    $HADOOP_HOME/bin/hadoop jar giraph-examples-1.1.0-XXX-jar-with-dependencies.jar \
      o.a.g.GiraphRunner o.a.g.examples.SimpleShortestPathsComputation \
      -vif o.a.g.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
      -vip /user/hduser/input/tiny_graph.txt \
      -vof o.a.g.io.formats.IdWithValueTextOutputFormat \
      -op /user/hduser/output/shortestpaths \
      -w 1
  • 26. Apache Giraph + Apache Gora
  • 27. The project idea
    • Integrating Apache Gora with other cool projects.
    • Provide access to different data stores out of the box for Apache Giraph.
    • Give users more flexibility when deciding how to run graph algorithms.
    • Make the Hadoop environment bigger.
    • Apply for the Google Summer of Code Project.
  • 34. Parameters offered (label: description)
    • giraph.gora.datastore.class: Gora DataStore class to read data from (required).
    • giraph.gora.key.class: Gora key class used to query the data store (required).
    • giraph.gora.persistent.class: Gora Persistent class to read objects from Gora (required).
    • giraph.gora.keys.factory.class: Keys factory to convert strings into the desired keys (required).
    • giraph.gora.output.datastore.class: Gora DataStore class to write data to (required).
    • giraph.gora.output.key.class: Gora key class to write to the data store (required).
    • giraph.gora.output.persistent.class: Gora Persistent class to write to Gora (required).
    • giraph.gora.start.key: Gora start key to query the data store.
    • giraph.gora.end.key: Gora end key to query the data store.
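Since seven of these parameters are marked as required, a small helper that checks a settings dictionary before building the `-D` options can catch mistakes early. The sketch below is a hypothetical convenience script, not part of Giraph or Gora; the parameter names come from the slide, and the sample values mirror the HBase example command shown later in the deck.

```python
# Parameters marked "required" on the slide.
REQUIRED_KEYS = [
    "giraph.gora.datastore.class",
    "giraph.gora.key.class",
    "giraph.gora.persistent.class",
    "giraph.gora.keys.factory.class",
    "giraph.gora.output.datastore.class",
    "giraph.gora.output.key.class",
    "giraph.gora.output.persistent.class",
]


def build_d_options(settings):
    """Return a list of -Dkey=value options, failing if a required key is absent."""
    missing = [key for key in REQUIRED_KEYS if key not in settings]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    return ["-D{}={}".format(key, value) for key, value in settings.items()]


settings = {
    "giraph.gora.datastore.class": "org.apache.gora.hbase.store.HBaseStore",
    "giraph.gora.key.class": "java.lang.String",
    "giraph.gora.persistent.class": "org.apache.giraph.io.gora.generated.GEdge",
    "giraph.gora.keys.factory.class": "org.apache.giraph.io.gora.utils.KeyFactory",
    "giraph.gora.output.datastore.class": "org.apache.gora.hbase.store.HBaseStore",
    "giraph.gora.output.key.class": "java.lang.String",
    "giraph.gora.output.persistent.class": "org.apache.giraph.io.gora.generated.GEdgeResult",
    # Optional key range.
    "giraph.gora.start.key": "0",
    "giraph.gora.end.key": "10",
}
options = build_d_options(settings)
print(" \\\n  ".join(options))
```

The printed options can be pasted between the runner class and the computation class of a `hadoop jar` invocation.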
  • 35. Rocks in the way
    • Dependency issues.
      • Supported versions by each project.
      • Maven war for handling cyclic dependencies.
    • Hadoop issues.
      • Not all data stores support MapReduce out of the box.
      • Finding what needs to be on the classpath.
    • Providing an API between both projects that is:
      • Flexible.
      • Simple.
      • Pluggable.
  • 36. So now what? 1.Create your data beans with Gora.
  • 37. So now what? 2. Compile them.
    java -jar gora-core-XYZ.jar o.a.gora.compiler.GoraCompiler.class vertex.avsc gora-app/src/main/java/
  • 38. So now what? 3. Get your Gora files set up for passing them to Giraph: gora.properties and gora-{datastore}-mapping.xml (for example, gora-hbase-mapping.xml).
  • 39. So now what? 4. Get your hooks in place. GVertexInputFormat
  • 41. So now what? 4. Get your hooks in place. GVertexOutputFormat
  • 42. So now what? 4. Get your hooks in place. GVertexOutputFormat
  • 43. So now what? 4. Get your hooks in place. KeyFactory
  • 44. So now what? 5. Run Giraph!
    hadoop jar $GIRAPH_EXAMPLES_JAR org.apache.giraph.GiraphRunner \
      -files ../conf/gora.properties,../conf/gora-hbase-mapping.xml,../conf/hbase-site.xml \
      -Dio.serializations=o.a.h.io.serializer.WritableSerialization,o.a.h.io.serializer.JavaSerialization \
      -Dgiraph.gora.datastore.class=org.apache.gora.hbase.store.HBaseStore \
      -Dgiraph.gora.key.class=java.lang.String \
      -Dgiraph.gora.persistent.class=org.apache.giraph.io.gora.generated.GEdge \
      -Dgiraph.gora.start.key=0 \
      -Dgiraph.gora.end.key=10 \
      -Dgiraph.gora.keys.factory.class=org.apache.giraph.io.gora.utils.KeyFactory \
      -Dgiraph.gora.output.datastore.class=org.apache.gora.hbase.store.HBaseStore \
      -Dgiraph.gora.output.key.class=java.lang.String \
      -Dgiraph.gora.output.persistent.class=org.apache.giraph.io.gora.generated.GEdgeResult \
      -libjars $GIRAPH_GORA_JAR,$GORA_HBASE_JAR,$HBASE_JAR \
      org.apache.giraph.examples.SimpleShortestPathsComputation \
      -eif org.apache.giraph.io.gora.GoraGEdgeEdgeInputFormat \
      -eof org.apache.giraph.io.gora.GoraGEdgeEdgeOutputFormat \
      -w 1
  • 47. Adding more data stores Send us an email on the mailing lists
  • 48. New serialization formats
    • Different serialization formats besides Apache Avro.
    • And others that could be interesting for handling different use cases.
  • 50. Q&A
  • 52. Bulk synchronous parallel model
    • Computation consists of a series of “supersteps”.
    • A superstep is an atomic unit of computation in which operations can happen in parallel.
    • During a superstep, components are assigned to tasks and receive unordered messages from the previous superstep.
    • Point-to-point messages:
      • Sent during a superstep from one component to another, then delivered in the following superstep.
    • Computation completes when all components complete.