SlideShare una empresa de Scribd logo
1 de 31
Descargar para leer sin conexión
Krishna Sankar
https://www.linkedin.com/in/ksankar
June 1, 2016
1. Paco Nathan, Scala Days 2015, https://www.youtube.com/watch?v=P_V71n-gtDs
2. Big Data Analytics withSpark-Apress http://www.amazon.com/Big-Data-Analytics-Spark-
Practitioners/dp/1484209656
3. Apache Spark Graph Processing-Packt https://www.packtpub.com/big-data-and-business-intelligence/apache-
spark-graph-processing
4. Spark GarphX in Action - Manning
5. http://hortonworks.com/blog/introduction-to-data-science-with-apache-spark/
6. http://stanford.edu/~rezab/nips2014workshop/slides/ankur.pdf
7. Mining Massive Datasets book v2 http://infolab.stanford.edu/~ullman/mmds/ch10.pdf
- http://web.stanford.edu/class/cs246/handouts.html
8. http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm
9. http://kukuruku.co/hub/algorithms/social-network-analysis-spark-graphx
10. Zeppelin Setup
- http://sparktutorials.net/setup-your-zeppelin-notebook-for-data-science-in-apache-spark
11. Data
- http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time
- http://openflights.org/data.html
12. https://adventuresincareerdevelopment.files.wordpress.com/2015/04/standing-on-the-shoulders-of-giants.png
Thanks to the Giants whose work helped to prepare this tutorial
Agenda-Topic Time Data Description
1. Graph Processing Is Introduced 10 Min Graph Processing vs GraphDB,
Introduction to Spark GraphX
2. Basics are Discussed & A Graph Is
Built
10 Min Simple
Graph
Edge,Vertex, Graph
3. GraphX API Landscape Is discussed 5 Min Explain APIs
4. Graph Structures Are Examined 5 Min Indegree, outdegree et al
5. Community, Affiliation & Strengths
Are Explored
5 Min Connected components, Triangle et al
6. Algorithms Are Developed 10 Min PartitionStrategy (is what makes GraphParallel work),
PageRank, Connected Components; APIs to implement
Algorithms viz. aggregateMessages, Pregel
Type Signature - aggregateMessages
Algorithms with aggregateMessages
7. Case Study (AlphaGo Tweets) Is
Conducted
10 Min AlphaGo
Retweet
Data
E2E Pipeline w/real data, Map attributes to properties,
vertices & edges, create graph and run algorithms
8. Questions Are Asked & Answered 10 Min
AGENDA Titles Styled after Asimov’s ROBOT SERIES !!!
Graph Applications
§ Many applications from social network to understanding collaboration,diseases, routing
logistics and others
§ Came across 3 interesting applications, as I was preparing the materials !!
1) Project effectiveness by social network analysis on the projects' collaboration structures :
“EC is interested in the impact generated by those projects … analyzing this latent impact
of TEL projects by applying social network analysis on the projects' collaboration structures
… TEL projects funded under FP6, FP7 and eContentplus, and identifies central projects
and strong, sustained organizational collaboration ties.”
2) Weak ties in pathogen transmission : “structural motif … Giraffe social networks are
characterized by social cliques in which individuals associate more with members of their
own social clique than with those outside their clique … Individuals involved in weak,
between-clique social interactions are hypothesized to serve as bridges by which an
infection may enter a clique and, hence, may experience higher infection risk … individuals
who engaged in more between-clique associations, that is, those with multiple weak ties,
were more likely to be infected with gastrointestinal helminth parasites ... “
3) Panama papers : The leak presented them with a wealth of information, millions of
documents, but no guide to structure … (ie community detection)
Graph based systems
§ Graphs are everywhere – web pages, social
networks, people’s web browsingbehavior,
logisticsetc.
- Workingw/ graphshas become more
important. And harder due to the scaleand
complexityof algorithms
§ Two kinds– processingandquerying
§ Processing– GraphX, Pregel BSP, GraphLab
§ Querying– AllegroGraph,Titan, Neo4j, RDF
stores.
- SPARQL query language
§ Graph DBs have queries, canprocess
large dataset
§ Graph processingcanrun complex
algorithms
§ For Graph Analyticssystemsboth are
required
- Processon graphx, store in neo4j is a
good alternative.
- E.g. Panamapapers analytics
Graph Processing Frameworks
§ History – Graph based systems were specializedinnature
- Graph partitioningis hard
- Primitives of record based systemswere different than what graphbased systems
needed
§ Rapid innovationin distributed data processingframeworks– MapReduce etc
- Disk based, with limited partitioningabilities
- Still an impedancemismatch
§ Graph-parallelover Data ParallelRDD ! – Best of both worlds ?
ü We will see, later, GrapnX’s PartitionStrategy
Enter Spark§ More powerful partitioning mechanism
§ In-memory system makes iterative processing
easier
§ GraphX – Graph processing built ontop of Spark
§ Graph processing at scale (distributed system)
§ Fast evolving project
§
§
-
-
§
§
§
GraphX …§ Graph processing at Scale - “Graph Parallel System”
§ Has a rich
- Computational Model
- Edges, Vertices, Property Graph
- Bipartite & tripartite graphs (Users-Tags-Web pages)
- Algorithms (PageRank,…)
§ Current focus on computationthan query
§ APIs include :
§ Attribute Transformers,
§ Structure transformers,
§ Connection Mining & Algorithms
§ GraphFrames – interesting development
§ Exercises (lotsof interesting graph data)
- Airline data,co-occurrence &co-citation from
papers
- The AlphaGo Community – ReTweet network
- Wikipedia Page Rank Analysisusing GraphX
Graphx Paper : https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-gonzalez.pdf
Pragel paper : http://kowshik.github.io/JPregel/pregel_paper.pdf
Scala, python support https://issues.apache.org/jira/browse/SPARK-3789
Java API for Graphxhttps://issues.apache.org/jira/browse/SPARK-3665
LDA https://issues.apache.org/jira/browse/SPARK-1405
Vertex Vertex(VertexId,VD) VD can be any object
Edges Edge(ED) ED can be any object
Graph Graph(VD,ED)
GraphX-Basics
V
StructuralQueries Indegrees, vertices,…
AttributeTransformers mapVertices, …
StructureTransformers, Join Reverse, subgraph
Connection Mining connectedComponents, triangles
Algorithms Aggregate messages, PageRank
InterestingObjects
o EdgeTriplet
o EdgeContext
http://ampcamp.berkeley.edu/big-data-mini-course/graph-analytics-with-graphx.html
GraphX-Basics
o Computational Model
o Directed MultiGraph
o Directed - so in-degree & out-degree
o MultiGraph – so multiple parallel edges between nodes including loops !
o Algorithms beware – can cyclic, loops
o Property Graph
o vertexID(64bit long) int
o Need property, cannot ignore it
o Vertex,Edge Parameterized over object types
o Vertex[(VertexId,VD)], Edge[ED]
o Attach user-defined objects to edges and vertices (ED/VD)
Graphs and Social Networks-Prof. Jeffrey D. Ullman Stanford University
http://web.stanford.edu/class/cs246/handouts.html
G-N Algorithm for inbetweennessGood exercises https://www.sics.se/~amir/files/download/dic/answers6.pdf
GraphX-Basics
§ For our hands-on we will run thru 2 Zeppelin notebooks
(https://goo.gl/qCwZiq & https://goo.gl/EKHCFq):
a. First we will work with the following Giraffe graph
o … carefully chosen with interesting betweenness and properties
o … for understanding the APIs
b. Second, we will develop a retweet pipeline
o With the AlphaGo twitter topic
o 2 GB/330K tweets, 200K edges,…
A D
C
E
FG
B
125
5
4.5
4.5
4
1.5
1.5
1
Graphs and Social Networks-Prof. Jeffrey D. Ullman Stanford University
http://web.stanford.edu/class/cs246/handouts.html
G-N Algorithm for inbetweenness
val vertexList = List(
(1L, Person("Alice", 18)),
(2L, Person("Bernie", 17)),
(3L, Person("Cruz", 15)),
(4L, Person("Donald", 12)),
(5L, Person("Ed", 15)),
(6L, Person("Fran", 10)),
(7L, Person("Genghis",854)))
Good exercises https://www.sics.se/~amir/files/download/dic/answers6.pdf
Fun facts about centrality betweenness : It shows how many paths an
edge is part of, i.e. relevancy. High centrality betweenness is the sign of a
bottleneck, point of single failure – these edges need HA, probably need
alternate paths for re routing, are susceptible to parasite infections and
good candidates for a cut !
Build A Graph
§ There are 4 ways
- GraphLoader.edgeListFile(…)
- From RDDs (shown above)
- fromEdgeTuples <- Id tuples
- fromEdges <- edgeList
Graph API Landscape
Separate algorithm from
Graph Implementation
Implementati
on of analytic
functions
Hint:
Many details are documented in the
object not in the class e.g.
PartitionStrategy, Graph,…
Graphx Structure API
Community-Affiliation-Strengths
§ Applied in many ways
§ For example in Fraud & Security Applications
§ Triangle detection – for spam servers
§ The age of a community isrelated to the density of
triangles
- New community will have few triangles,then triangles
start to form
§ Strong affiliation ie Heavy hitter = sqrt(m) degrees!
- Heavy hitter triangle !
§ Connected Communities– structure
Algorithms
§ Graph-Parallel Computation
- aggregateMessages()Function
- Pregel() (https://issues.apache.org/jira/browse/SPARK-5062)
§ pageRank()
- Influential papers in a citation network
- Influencer in retweet
§ staticPageRank()
- Static no of iterations, dynamic tolerance – see the parameters (tol vs.
numIter)
§ personalizedPageRank()
- Personalized PageRank is a variation on PageRank that gives a rank
relative to a specified “source” vertex in the graph – “People You May
Know”
§ shortestPaths, SVD++
§ LabelPropagation(LPA) as described by Raghavanet al
in their 2007 paper “Near linear time algorithm to detect
community structures in large-scale networks”
- Computationally inexpensive way to Identify
communities
- Convergence not guaranteed
- Might end up with trivial solution i.e. single community
§ SDVPlusPlus takes an RDD of Edges
§ The Global Clustering Coefficient, is better in that it
always returns a number between 0 and 1
- For comparing connectnedness between different
sized communities
http://graphframes.github.io/user-guide.html
§ Versatile Function useful for implementing PageRank et al
- Can be difficult at first, but easier to comprehend if treated
as a combined map-reduce function (it was called
MapReduceTriplets !! With a slightly different signature)
o aggregateMessage[Msg](
o map(edgeContext=>mapFun,<- this can be up or downie sendToDst or Src!
o redcuce(Msg,Msg) => reduceFun)
If you really want to know what is
underneath the aggregateMessages()…
sendToDst vs sendToSrc
Pregel BSP API
§ Bulk SynchronousParallelMessagePassingAPI
- Developed by Leslie Valiant1
- Synchronizeddistributedsuper steps
- Super Step – local,in-memorycomputations
§ Computations on the vertex – “Think likea Vertex”
§ graphx.lib.LabelPropagationAlgorithmuses Pregelinternally
§ graphx.lib.PageRankuses Pragel(for the runUntilConvergenceWithOptionsversion,while
it uses the aggregateMessagesfor the runWithOptions(staticPageRank)version)
§ Pregel is implementedsuccinctlywith the old mapReduceTriplets API
1http://delivery.acm.org/10.1145/80000/79181/p103-
valiant.pdfhttp://www.slideshare.net/riyadparvez/pregel-35504069
Graphx Partitioning & Processing
§ Unlike a relationaltable,graph processing is contextual w.r.t
a neighborhood
- Maintain locality,equal size partitioning
§ Edge-cut : The communication& storage overhead of an
edge-cut is directlyproportional to thenumber ofedges
that are cut
§ Vertex-cut : Thecommunication and storage overhead ofa
vertex-cut is directlyproportional to the sum of the
number of machines spanned byeach vertex
§ Vertex-cut strategy by default (balance hot-spot issue due to
power law/Zipf’s Law) w/ min replication ofedges
§ Batch processing/not streaming
§ Org.apache.spark.graphx.Graph.
- partitionBy(partitionStrategy: PartitionStrategy, numPartitions: Int)
- partitionBy(partitionStrategy: PartitionStrategy)
- 4 PartitionStrategies
1) RandomVertexCut(usually the best) : random vertex cut that colocates all same-
direction edges between two vertices (hashingthe source and destination vertexIDs)
2) CanonicalRandomVertexCut - a random vertexcut that colocates all edges between
twovertices, regardless of direction (hashing the source and destination vertex IDs in
a canonical direction) … remember GraphX is multi graph
3) EdgePartition1D - Assigns edges to partitions using only the source vertex ID,
colocating edges withthe same source
4) EdgePartition2D : 2D partitioning of the sparse edge adjacency matrix
http://www.istc-cc.cmu.edu/publications/papers/2013/grades-graphx_with_fonts.pdf,https://www.sics.se/~amir/files/download/papers/jabeja-vc.pdf
AlphaGo Tweets Analytics Pipeline
Collect Data Transform & Extract
Features
GraphX Model
GraphX
Algorithms
Store
§ Initiallyprimed data (7days from
twitter)
§ Then used thesinceID to get
incremental tweets
§ Use application authentication for higher
rate
§ Used tweepy, wait_on_rate_limit=True,
wait_on_rate_limit_notify=True
§ ~330K tweets (see Download program),
2GB
§ => MongoDB ~820 MB w/compression)
§ Rich data (see MongoDB)
- Retweet, user ids, hash tags,
locations et al
§ This exercise only covers the retweet
interest network
1-Extract Tweets
2-Pipeline screen shots§ Get max tweet id
§ db.alphago.find().sort({id:-1}).limit(1).pretty()
- "id" : NumberLong("709550216467185664"),
§ db.alphago.find().sort({id:+1}).limit(1).pretty()
- "id" : NumberLong("709211132498714627")
§ /usr/local/mongo/bin/mongoimport --db admin --collection alphago --drop --file tweets-20160313.txt
- 232296
- Min : "id" : NumberLong("705845567537221632")
- Max : "id" : NumberLong("709211104719872000")
§ /usr/local/mongo/bin/mongoimport --db admin --collection alphago --file tweets-20160314.txt
- +21368
- Min : "id" : NumberLong("705845567537221632"),
- Max : "id" : NumberLong("709550216467185664"),
- Count : 253664
§ /usr/local/mongo/bin/mongoimport --db admin --collection alphago --file tweets-20160315.txt
- +38452
- Min: "id_str" : "705845567537221632",
- Max: "id" : NumberLong("709817136437403649")
- Count : 292116
§ /usr/local/mongo/bin/mongoimport --db admin --collection alphago --file tweets-20160316.txt
- +17331
- Min: "id" : NumberLong("705845567537221632"),
- Max: "id" : NumberLong("710312094227300352"),
- Count : 309447
§ /usr/local/mongo/bin/mongoimport --db admin --collection alphago --file tweets-20160318.txt
- +10797
- Min: "id" : NumberLong("705845567537221632"),
- Max: "id" : NumberLong("710883718903140353"),
- Count : 320244
§ /usr/local/mongo/bin/mongoimport --db admin --collection alphago --file tweets-20160320.txt
- +11511
- Min: "id" : NumberLong("705845567537221632"),
- Max: "id" : NumberLong("711720090954108928"),
- Count : 331755
§ /usr/local/mongo/bin/mongoimport --db admin --collection alphago --file tweets-20160322.txt
- +5166
- Min: "id" : NumberLong("705845567537221632"),
- Max: "id" : NumberLong("712410062820511744"),
- Count : 336921
3-Twitter gives lots of fields in Mongo
4-Extract Retweet Fields
&
5-Store in csv
6-Read as dataframe
7-Create Vertices, Edges & Objects
8-Finally the graph & run algorithms
The Art of an AlphaGo GraphX Vertex Vertex(VertexId,VD)
VD can be
any object
Edges Edge(ED)
ED can be
any object
Graph Graph(VD,ED)
[(VertexId , VD)] [(VertexId, VertexId, [(VertexId, VD)]ED)]
Vertex VertexEdge
1
2
3
4
5
An excursion into Graph Analytics with Apache Spark GraphX

Más contenido relacionado

La actualidad más candente

Machine Learning and GraphX
Machine Learning and GraphXMachine Learning and GraphX
Machine Learning and GraphXAndy Petrella
 
Databricks with R: Deep Dive
Databricks with R: Deep DiveDatabricks with R: Deep Dive
Databricks with R: Deep DiveDatabricks
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MorePaco Nathan
 
TinkerPop: a story of graphs, DBs, and graph DBs
TinkerPop: a story of graphs, DBs, and graph DBsTinkerPop: a story of graphs, DBs, and graph DBs
TinkerPop: a story of graphs, DBs, and graph DBsJoshua Shinavier
 
Data Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoData Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoSpark Summit
 
H2O World - H2O Rains with Databricks Cloud
H2O World - H2O Rains with Databricks CloudH2O World - H2O Rains with Databricks Cloud
H2O World - H2O Rains with Databricks CloudSri Ambati
 
Signals from outer space
Signals from outer spaceSignals from outer space
Signals from outer spaceGraphAware
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learningPaco Nathan
 
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)Ankur Dave
 
GraphX is the blue ocean for scala engineers @ Scala Matsuri 2014
GraphX is the blue ocean for scala engineers @ Scala Matsuri 2014GraphX is the blue ocean for scala engineers @ Scala Matsuri 2014
GraphX is the blue ocean for scala engineers @ Scala Matsuri 2014鉄平 土佐
 
GraphFrames: Graph Queries in Spark SQL by Ankur Dave
GraphFrames: Graph Queries in Spark SQL by Ankur DaveGraphFrames: Graph Queries in Spark SQL by Ankur Dave
GraphFrames: Graph Queries in Spark SQL by Ankur DaveSpark Summit
 
Strata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesStrata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesPaco Nathan
 
Congressional PageRank: Graph Analytics of US Congress With Neo4j
Congressional PageRank: Graph Analytics of US Congress With Neo4jCongressional PageRank: Graph Analytics of US Congress With Neo4j
Congressional PageRank: Graph Analytics of US Congress With Neo4jWilliam Lyon
 
Why spark by Stratio - v.1.0
Why spark by Stratio - v.1.0Why spark by Stratio - v.1.0
Why spark by Stratio - v.1.0Stratio
 
The BDAS Open Source Community
The BDAS Open Source CommunityThe BDAS Open Source Community
The BDAS Open Source Communityjeykottalam
 
GraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQLGraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQLSpark Summit
 
Graphs are everywhere! Distributed graph computing with Spark GraphX
Graphs are everywhere! Distributed graph computing with Spark GraphXGraphs are everywhere! Distributed graph computing with Spark GraphX
Graphs are everywhere! Distributed graph computing with Spark GraphXAndrea Iacono
 
Building Scalable Big Data Pipelines
Building Scalable Big Data PipelinesBuilding Scalable Big Data Pipelines
Building Scalable Big Data PipelinesChristian Gügi
 
Extracting Insights from Data at Twitter
Extracting Insights from Data at TwitterExtracting Insights from Data at Twitter
Extracting Insights from Data at TwitterPrasad Wagle
 
Use of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsUse of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsPaco Nathan
 

La actualidad más candente (20)

Machine Learning and GraphX
Machine Learning and GraphXMachine Learning and GraphX
Machine Learning and GraphX
 
Databricks with R: Deep Dive
Databricks with R: Deep DiveDatabricks with R: Deep Dive
Databricks with R: Deep Dive
 
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and MoreStrata 2015 Data Preview: Spark, Data Visualization, YARN, and More
Strata 2015 Data Preview: Spark, Data Visualization, YARN, and More
 
TinkerPop: a story of graphs, DBs, and graph DBs
TinkerPop: a story of graphs, DBs, and graph DBsTinkerPop: a story of graphs, DBs, and graph DBs
TinkerPop: a story of graphs, DBs, and graph DBs
 
Data Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoData Science at Scale by Sarah Guido
Data Science at Scale by Sarah Guido
 
H2O World - H2O Rains with Databricks Cloud
H2O World - H2O Rains with Databricks CloudH2O World - H2O Rains with Databricks Cloud
H2O World - H2O Rains with Databricks Cloud
 
Signals from outer space
Signals from outer spaceSignals from outer space
Signals from outer space
 
Microservices, containers, and machine learning
Microservices, containers, and machine learningMicroservices, containers, and machine learning
Microservices, containers, and machine learning
 
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)
 
GraphX is the blue ocean for scala engineers @ Scala Matsuri 2014
GraphX is the blue ocean for scala engineers @ Scala Matsuri 2014GraphX is the blue ocean for scala engineers @ Scala Matsuri 2014
GraphX is the blue ocean for scala engineers @ Scala Matsuri 2014
 
GraphFrames: Graph Queries in Spark SQL by Ankur Dave
GraphFrames: Graph Queries in Spark SQL by Ankur DaveGraphFrames: Graph Queries in Spark SQL by Ankur Dave
GraphFrames: Graph Queries in Spark SQL by Ankur Dave
 
Strata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesStrata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case Studies
 
Congressional PageRank: Graph Analytics of US Congress With Neo4j
Congressional PageRank: Graph Analytics of US Congress With Neo4jCongressional PageRank: Graph Analytics of US Congress With Neo4j
Congressional PageRank: Graph Analytics of US Congress With Neo4j
 
Why spark by Stratio - v.1.0
Why spark by Stratio - v.1.0Why spark by Stratio - v.1.0
Why spark by Stratio - v.1.0
 
The BDAS Open Source Community
The BDAS Open Source CommunityThe BDAS Open Source Community
The BDAS Open Source Community
 
GraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQLGraphFrames: Graph Queries In Spark SQL
GraphFrames: Graph Queries In Spark SQL
 
Graphs are everywhere! Distributed graph computing with Spark GraphX
Graphs are everywhere! Distributed graph computing with Spark GraphXGraphs are everywhere! Distributed graph computing with Spark GraphX
Graphs are everywhere! Distributed graph computing with Spark GraphX
 
Building Scalable Big Data Pipelines
Building Scalable Big Data PipelinesBuilding Scalable Big Data Pipelines
Building Scalable Big Data Pipelines
 
Extracting Insights from Data at Twitter
Extracting Insights from Data at TwitterExtracting Insights from Data at Twitter
Extracting Insights from Data at Twitter
 
Use of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsUse of standards and related issues in predictive analytics
Use of standards and related issues in predictive analytics
 

Similar a An excursion into Graph Analytics with Apache Spark GraphX

PERFORMANCE EVALUATION OF SOCIAL NETWORK ANALYSIS ALGORITHMS USING DISTRIBUTE...
PERFORMANCE EVALUATION OF SOCIAL NETWORK ANALYSIS ALGORITHMS USING DISTRIBUTE...PERFORMANCE EVALUATION OF SOCIAL NETWORK ANALYSIS ALGORITHMS USING DISTRIBUTE...
PERFORMANCE EVALUATION OF SOCIAL NETWORK ANALYSIS ALGORITHMS USING DISTRIBUTE...Journal For Research
 
Experiments in Data Portability 2
Experiments in Data Portability 2Experiments in Data Portability 2
Experiments in Data Portability 2Glenn Jones
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
 
Applying large scale text analytics with graph databases
Applying large scale text analytics with graph databasesApplying large scale text analytics with graph databases
Applying large scale text analytics with graph databasesData Ninja API
 
Tripal within the Arabidopsis Information Portal - PAG XXIII
Tripal within the Arabidopsis Information Portal - PAG XXIIITripal within the Arabidopsis Information Portal - PAG XXIII
Tripal within the Arabidopsis Information Portal - PAG XXIIIVivek Krishnakumar
 
State of the Semantic Web
State of the Semantic WebState of the Semantic Web
State of the Semantic WebIvan Herman
 
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...BigDataEverywhere
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapePaco Nathan
 
Introduction to Property Graph Features (AskTOM Office Hours part 1)
Introduction to Property Graph Features (AskTOM Office Hours part 1) Introduction to Property Graph Features (AskTOM Office Hours part 1)
Introduction to Property Graph Features (AskTOM Office Hours part 1) Jean Ihm
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapePaco Nathan
 
What is the "Big Data" version of the Linpack Benchmark? ; What is “Big Data...
What is the "Big Data" version of the Linpack Benchmark?; What is “Big Data...What is the "Big Data" version of the Linpack Benchmark?; What is “Big Data...
What is the "Big Data" version of the Linpack Benchmark? ; What is “Big Data...Geoffrey Fox
 
QCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingQCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingPaco Nathan
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsDatabricks
 
Hadoop Application Architectures - Fraud Detection
Hadoop Application Architectures - Fraud  DetectionHadoop Application Architectures - Fraud  Detection
Hadoop Application Architectures - Fraud Detectionhadooparchbook
 
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...Gezim Sejdiu
 
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15MLconf
 
Atlanta MLConf
Atlanta MLConfAtlanta MLConf
Atlanta MLConfQubole
 
Multiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier DominguezMultiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier DominguezBig Data Spain
 

Similar a An excursion into Graph Analytics with Apache Spark GraphX (20)

PERFORMANCE EVALUATION OF SOCIAL NETWORK ANALYSIS ALGORITHMS USING DISTRIBUTE...
PERFORMANCE EVALUATION OF SOCIAL NETWORK ANALYSIS ALGORITHMS USING DISTRIBUTE...PERFORMANCE EVALUATION OF SOCIAL NETWORK ANALYSIS ALGORITHMS USING DISTRIBUTE...
PERFORMANCE EVALUATION OF SOCIAL NETWORK ANALYSIS ALGORITHMS USING DISTRIBUTE...
 
Experiments in Data Portability 2
Experiments in Data Portability 2Experiments in Data Portability 2
Experiments in Data Portability 2
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
Applying large scale text analytics with graph databases
Applying large scale text analytics with graph databasesApplying large scale text analytics with graph databases
Applying large scale text analytics with graph databases
 
Tripal within the Arabidopsis Information Portal - PAG XXIII
Tripal within the Arabidopsis Information Portal - PAG XXIIITripal within the Arabidopsis Information Portal - PAG XXIII
Tripal within the Arabidopsis Information Portal - PAG XXIII
 
State of the Semantic Web
State of the Semantic WebState of the Semantic Web
State of the Semantic Web
 
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
 
Introduction to Property Graph Features (AskTOM Office Hours part 1)
Introduction to Property Graph Features (AskTOM Office Hours part 1) Introduction to Property Graph Features (AskTOM Office Hours part 1)
Introduction to Property Graph Features (AskTOM Office Hours part 1)
 
How Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscapeHow Apache Spark fits into the Big Data landscape
How Apache Spark fits into the Big Data landscape
 
What is the "Big Data" version of the Linpack Benchmark? ; What is “Big Data...
What is the "Big Data" version of the Linpack Benchmark?; What is “Big Data...What is the "Big Data" version of the Linpack Benchmark?; What is “Big Data...
What is the "Big Data" version of the Linpack Benchmark? ; What is “Big Data...
 
QCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark StreamingQCon São Paulo: Real-Time Analytics with Spark Streaming
QCon São Paulo: Real-Time Analytics with Spark Streaming
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Hadoop Application Architectures - Fraud Detection
Hadoop Application Architectures - Fraud  DetectionHadoop Application Architectures - Fraud  Detection
Hadoop Application Architectures - Fraud Detection
 
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...
 
04 open source_tools
04 open source_tools04 open source_tools
04 open source_tools
 
INFO491FinalPaper
INFO491FinalPaperINFO491FinalPaper
INFO491FinalPaper
 
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
Jason Huang, Solutions Engineer, Qubole at MLconf ATL - 9/18/15
 
Atlanta MLConf
Atlanta MLConfAtlanta MLConf
Atlanta MLConf
 
Multiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier DominguezMultiplatform Spark solution for Graph datasources by Javier Dominguez
Multiplatform Spark solution for Graph datasources by Javier Dominguez
 

Más de Krishna Sankar

Pandas, Data Wrangling & Data Science
Pandas, Data Wrangling & Data SciencePandas, Data Wrangling & Data Science
Pandas, Data Wrangling & Data ScienceKrishna Sankar
 
An excursion into Text Analytics with Apache Spark
An excursion into Text Analytics with Apache SparkAn excursion into Text Analytics with Apache Spark
An excursion into Text Analytics with Apache SparkKrishna Sankar
 
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesBig Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesKrishna Sankar
 
Data Science with Spark
Data Science with SparkData Science with Spark
Data Science with SparkKrishna Sankar
 
Architecture in action 01
Architecture in action 01Architecture in action 01
Architecture in action 01Krishna Sankar
 
Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)Krishna Sankar
 
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538Krishna Sankar
 
R, Data Wrangling & Kaggle Data Science Competitions
R, Data Wrangling & Kaggle Data Science CompetitionsR, Data Wrangling & Kaggle Data Science Competitions
R, Data Wrangling & Kaggle Data Science CompetitionsKrishna Sankar
 
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkThe Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkKrishna Sankar
 
Data Science Folk Knowledge
Data Science Folk KnowledgeData Science Folk Knowledge
Data Science Folk KnowledgeKrishna Sankar
 
Data Wrangling For Kaggle Data Science Competitions
Data Wrangling For Kaggle Data Science CompetitionsData Wrangling For Kaggle Data Science Competitions
Data Wrangling For Kaggle Data Science CompetitionsKrishna Sankar
 
Bayesian Machine Learning - Naive Bayes
Bayesian Machine Learning - Naive BayesBayesian Machine Learning - Naive Bayes
Bayesian Machine Learning - Naive BayesKrishna Sankar
 
AWS VPC distilled for MongoDB devOps
AWS VPC distilled for MongoDB devOpsAWS VPC distilled for MongoDB devOps
AWS VPC distilled for MongoDB devOpsKrishna Sankar
 
The Art of Social Media Analysis with Twitter & Python
The Art of Social Media Analysis with Twitter & PythonThe Art of Social Media Analysis with Twitter & Python
The Art of Social Media Analysis with Twitter & PythonKrishna Sankar
 
Big Data Engineering - Top 10 Pragmatics
Big Data Engineering - Top 10 PragmaticsBig Data Engineering - Top 10 Pragmatics
Big Data Engineering - Top 10 PragmaticsKrishna Sankar
 
Scrum debrief to team
Scrum debrief to team Scrum debrief to team
Scrum debrief to team Krishna Sankar
 
Precision Time Synchronization
Precision Time SynchronizationPrecision Time Synchronization
Precision Time SynchronizationKrishna Sankar
 
The Hitchhiker’s Guide to Kaggle
The Hitchhiker’s Guide to KaggleThe Hitchhiker’s Guide to Kaggle
The Hitchhiker’s Guide to KaggleKrishna Sankar
 
Nosql hands on handout 04
Nosql hands on handout 04Nosql hands on handout 04
Nosql hands on handout 04Krishna Sankar
 

Más de Krishna Sankar (20)

Pandas, Data Wrangling & Data Science
Pandas, Data Wrangling & Data SciencePandas, Data Wrangling & Data Science
Pandas, Data Wrangling & Data Science
 
An excursion into Text Analytics with Apache Spark
An excursion into Text Analytics with Apache SparkAn excursion into Text Analytics with Apache Spark
An excursion into Text Analytics with Apache Spark
 
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & AntidotesBig Data Analytics - Best of the Worst : Anti-patterns & Antidotes
Big Data Analytics - Best of the Worst : Anti-patterns & Antidotes
 
Data Science with Spark
Data Science with SparkData Science with Spark
Data Science with Spark
 
Architecture in action 01
Architecture in action 01Architecture in action 01
Architecture in action 01
 
Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)Data Science with Spark - Training at SparkSummit (East)
Data Science with Spark - Training at SparkSummit (East)
 
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538
 
R, Data Wrangling & Kaggle Data Science Competitions
R, Data Wrangling & Kaggle Data Science CompetitionsR, Data Wrangling & Kaggle Data Science Competitions
R, Data Wrangling & Kaggle Data Science Competitions
 
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache SparkThe Hitchhiker's Guide to Machine Learning with Python & Apache Spark
The Hitchhiker's Guide to Machine Learning with Python & Apache Spark
 
Data Science Folk Knowledge
Data Science Folk KnowledgeData Science Folk Knowledge
Data Science Folk Knowledge
 
Data Wrangling For Kaggle Data Science Competitions
Data Wrangling For Kaggle Data Science CompetitionsData Wrangling For Kaggle Data Science Competitions
Data Wrangling For Kaggle Data Science Competitions
 
Bayesian Machine Learning - Naive Bayes
Bayesian Machine Learning - Naive BayesBayesian Machine Learning - Naive Bayes
Bayesian Machine Learning - Naive Bayes
 
AWS VPC distilled for MongoDB devOps
AWS VPC distilled for MongoDB devOpsAWS VPC distilled for MongoDB devOps
AWS VPC distilled for MongoDB devOps
 
The Art of Social Media Analysis with Twitter & Python
The Art of Social Media Analysis with Twitter & PythonThe Art of Social Media Analysis with Twitter & Python
The Art of Social Media Analysis with Twitter & Python
 
Big Data Engineering - Top 10 Pragmatics
Big Data Engineering - Top 10 PragmaticsBig Data Engineering - Top 10 Pragmatics
Big Data Engineering - Top 10 Pragmatics
 
Scrum debrief to team
Scrum debrief to team Scrum debrief to team
Scrum debrief to team
 
The Art of Big Data
The Art of Big DataThe Art of Big Data
The Art of Big Data
 
Precision Time Synchronization
Precision Time SynchronizationPrecision Time Synchronization
Precision Time Synchronization
 
The Hitchhiker’s Guide to Kaggle
The Hitchhiker’s Guide to KaggleThe Hitchhiker’s Guide to Kaggle
The Hitchhiker’s Guide to Kaggle
 
Nosql hands on handout 04
Nosql hands on handout 04Nosql hands on handout 04
Nosql hands on handout 04
 

Último

Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxUnduhUnggah1
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 

Último (20)

Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docx
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 

An excursion into Graph Analytics with Apache Spark GraphX

  • 2. 1. Paco Nathan, Scala Days 2015, https://www.youtube.com/watch?v=P_V71n-gtDs 2. Big Data Analytics withSpark-Apress http://www.amazon.com/Big-Data-Analytics-Spark- Practitioners/dp/1484209656 3. Apache Spark Graph Processing-Packt https://www.packtpub.com/big-data-and-business-intelligence/apache- spark-graph-processing 4. Spark GarphX in Action - Manning 5. http://hortonworks.com/blog/introduction-to-data-science-with-apache-spark/ 6. http://stanford.edu/~rezab/nips2014workshop/slides/ankur.pdf 7. Mining Massive Datasets book v2 http://infolab.stanford.edu/~ullman/mmds/ch10.pdf - http://web.stanford.edu/class/cs246/handouts.html 8. http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm 9. http://kukuruku.co/hub/algorithms/social-network-analysis-spark-graphx 10. Zeppelin Setup - http://sparktutorials.net/setup-your-zeppelin-notebook-for-data-science-in-apache-spark 11. Data - http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time - http://openflights.org/data.html 12. https://adventuresincareerdevelopment.files.wordpress.com/2015/04/standing-on-the-shoulders-of-giants.png Thanks to the Giants whose work helped to prepare this tutorial
  • 3. Agenda-Topic Time Data Description 1. Graph Processing Is Introduced 10 Min Graph Processing vs GraphDB, Introduction to Spark GraphX 2. Basics are Discussed & A Graph Is Built 10 Min Simple Graph Edge,Vertex, Graph 3. GraphX API Landscape Is discussed 5 Min Explain APIs 4. Graph Structures Are Examined 5 Min Indegree, outdegree et al 5. Community, Affiliation & Strengths Are Explored 5 Min Connected components, Triangle et al 6. Algorithms Are Developed 10 Min PartitionStrategy (is what makes GraphParallel work), PageRank, Connected Components; APIs to implement Algorithms viz. aggregateMessages, Pregel Type Signature - aggregateMessages Algorithms with aggregateMessages 7. Case Study (AlphaGo Tweets) Is Conducted 10 Min AlphaGo Retweet Data E2E Pipeline w/real data, Map attributes to properties, vertices & edges, create graph and run algorithms 8. Questions Are Asked & Answered 10 Min AGENDA Titles Styled after Asimov’s ROBOT SERIES !!!
  • 4. Graph Applications § Many applications from social network to understanding collaboration,diseases, routing logistics and others § Came across 3 interesting applications, as I was preparing the materials !! 1) Project effectiveness by social network analysis on the projects' collaboration structures : “EC is interested in the impact generated by those projects … analyzing this latent impact of TEL projects by applying social network analysis on the projects' collaboration structures … TEL projects funded under FP6, FP7 and eContentplus, and identifies central projects and strong, sustained organizational collaboration ties.” 2) Weak ties in pathogen transmission : “structural motif … Giraffe social networks are characterized by social cliques in which individuals associate more with members of their own social clique than with those outside their clique … Individuals involved in weak, between-clique social interactions are hypothesized to serve as bridges by which an infection may enter a clique and, hence, may experience higher infection risk … individuals who engaged in more between-clique associations, that is, those with multiple weak ties, were more likely to be infected with gastrointestinal helminth parasites ... “ 3) Panama papers : The leak presented them with a wealth of information, millions of documents, but no guide to structure … (ie community detection)
  • 5. Graph based systems § Graphs are everywhere – web pages, social networks, people’s web browsingbehavior, logisticsetc. - Workingw/ graphshas become more important. And harder due to the scaleand complexityof algorithms § Two kinds– processingandquerying § Processing– GraphX, Pregel BSP, GraphLab § Querying– AllegroGraph,Titan, Neo4j, RDF stores. - SPARQL query language § Graph DBs have queries, canprocess large dataset § Graph processingcanrun complex algorithms § For Graph Analyticssystemsboth are required - Processon graphx, store in neo4j is a good alternative. - E.g. Panamapapers analytics
  • 6. Graph Processing Frameworks § History – Graph based systems were specializedinnature - Graph partitioningis hard - Primitives of record based systemswere different than what graphbased systems needed § Rapid innovationin distributed data processingframeworks– MapReduce etc - Disk based, with limited partitioningabilities - Still an impedancemismatch § Graph-parallelover Data ParallelRDD ! – Best of both worlds ? ü We will see, later, GrapnX’s PartitionStrategy
  • 7. Enter Spark§ More powerful partitioning mechanism § In-memory system makes iterative processing easier § GraphX – Graph processing built ontop of Spark § Graph processing at scale (distributed system) § Fast evolving project § § - - § § §
  • 8. GraphX …§ Graph processing at Scale - “Graph Parallel System” § Has a rich - Computational Model - Edges, Vertices, Property Graph - Bipartite & tripartite graphs (Users-Tags-Web pages) - Algorithms (PageRank,…) § Current focus on computationthan query § APIs include : § Attribute Transformers, § Structure transformers, § Connection Mining & Algorithms § GraphFrames – interesting development § Exercises (lotsof interesting graph data) - Airline data,co-occurrence &co-citation from papers - The AlphaGo Community – ReTweet network - Wikipedia Page Rank Analysisusing GraphX Graphx Paper : https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-gonzalez.pdf Pragel paper : http://kowshik.github.io/JPregel/pregel_paper.pdf Scala, python support https://issues.apache.org/jira/browse/SPARK-3789 Java API for Graphxhttps://issues.apache.org/jira/browse/SPARK-3665 LDA https://issues.apache.org/jira/browse/SPARK-1405
  • 9. Vertex Vertex(VertexId,VD) VD can be any object Edges Edge(ED) ED can be any object Graph Graph(VD,ED) GraphX-Basics V StructuralQueries Indegrees, vertices,… AttributeTransformers mapVertices, … StructureTransformers, Join Reverse, subgraph Connection Mining connectedComponents, triangles Algorithms Aggregate messages, PageRank InterestingObjects o EdgeTriplet o EdgeContext http://ampcamp.berkeley.edu/big-data-mini-course/graph-analytics-with-graphx.html
  • 10. GraphX-Basics o Computational Model o Directed MultiGraph o Directed - so in-degree & out-degree o MultiGraph – so multiple parallel edges between nodes including loops ! o Algorithms beware – can cyclic, loops o Property Graph o vertexID(64bit long) int o Need property, cannot ignore it o Vertex,Edge Parameterized over object types o Vertex[(VertexId,VD)], Edge[ED] o Attach user-defined objects to edges and vertices (ED/VD) Graphs and Social Networks-Prof. Jeffrey D. Ullman Stanford University http://web.stanford.edu/class/cs246/handouts.html G-N Algorithm for inbetweennessGood exercises https://www.sics.se/~amir/files/download/dic/answers6.pdf
  • 11. GraphX-Basics § For our hands-on we will run thru 2 Zeppelin notebooks (https://goo.gl/qCwZiq & https://goo.gl/EKHCFq): a. First we will work with the following Giraffe graph o … carefully chosen with interesting betweenness and properties o … for understanding the APIs b. Second, we will develop a retweet pipeline o With the AlphaGo twitter topic o 2 GB/330K tweets, 200K edges,… A D C E FG B 125 5 4.5 4.5 4 1.5 1.5 1 Graphs and Social Networks-Prof. Jeffrey D. Ullman Stanford University http://web.stanford.edu/class/cs246/handouts.html G-N Algorithm for inbetweenness val vertexList = List( (1L, Person("Alice", 18)), (2L, Person("Bernie", 17)), (3L, Person("Cruz", 15)), (4L, Person("Donald", 12)), (5L, Person("Ed", 15)), (6L, Person("Fran", 10)), (7L, Person("Genghis",854))) Good exercises https://www.sics.se/~amir/files/download/dic/answers6.pdf Fun facts about centrality betweenness : It shows how many paths an edge is part of, i.e. relevancy. High centrality betweenness is the sign of a bottleneck, point of single failure – these edges need HA, probably need alternate paths for re routing, are susceptible to parasite infections and good candidates for a cut !
  • 12. Build A Graph § There are 4 ways - GraphLoader.edgeListFile(…) - From RDDs (shown above) - fromEdgeTuples <- Id tuples - fromEdges <- edgeList
  • 13. Graph API Landscape Separate algorithm from Graph Implementation Implementati on of analytic functions
  • 14. Hint: Many details are documented in the object not in the class e.g. PartitionStrategy, Graph,…
  • 16. Community-Affiliation-Strengths § Applied in many ways § For example in Fraud & Security Applications § Triangle detection – for spam servers § The age of a community isrelated to the density of triangles - New community will have few triangles,then triangles start to form § Strong affiliation ie Heavy hitter = sqrt(m) degrees! - Heavy hitter triangle ! § Connected Communities– structure
  • 17. Algorithms § Graph-Parallel Computation - aggregateMessages()Function - Pregel() (https://issues.apache.org/jira/browse/SPARK-5062) § pageRank() - Influential papers in a citation network - Influencer in retweet § staticPageRank() - Static no of iterations, dynamic tolerance – see the parameters (tol vs. numIter) § personalizedPageRank() - Personalized PageRank is a variation on PageRank that gives a rank relative to a specified “source” vertex in the graph – “People You May Know” § shortestPaths, SVD++ § LabelPropagation(LPA) as described by Raghavanet al in their 2007 paper “Near linear time algorithm to detect community structures in large-scale networks” - Computationally inexpensive way to Identify communities - Convergence not guaranteed - Might end up with trivial solution i.e. single community § SDVPlusPlus takes an RDD of Edges § The Global Clustering Coefficient, is better in that it always returns a number between 0 and 1 - For comparing connectnedness between different sized communities http://graphframes.github.io/user-guide.html
  • 18. § Versatile Function useful for implementing PageRank et al - Can be difficult at first, but easier to comprehend if treated as a combined map-reduce function (it was called MapReduceTriplets !! With a slightly different signature) o aggregateMessage[Msg]( o map(edgeContext=>mapFun,<- this can be up or downie sendToDst or Src! o redcuce(Msg,Msg) => reduceFun)
  • 19. If you really want to know what is underneath the aggregateMessages()…
  • 21. Pregel BSP API § Bulk SynchronousParallelMessagePassingAPI - Developed by Leslie Valiant1 - Synchronizeddistributedsuper steps - Super Step – local,in-memorycomputations § Computations on the vertex – “Think likea Vertex” § graphx.lib.LabelPropagationAlgorithmuses Pregelinternally § graphx.lib.PageRankuses Pragel(for the runUntilConvergenceWithOptionsversion,while it uses the aggregateMessagesfor the runWithOptions(staticPageRank)version) § Pregel is implementedsuccinctlywith the old mapReduceTriplets API 1http://delivery.acm.org/10.1145/80000/79181/p103- valiant.pdfhttp://www.slideshare.net/riyadparvez/pregel-35504069
  • 22. Graphx Partitioning & Processing § Unlike a relationaltable,graph processing is contextual w.r.t a neighborhood - Maintain locality,equal size partitioning § Edge-cut : The communication& storage overhead of an edge-cut is directlyproportional to thenumber ofedges that are cut § Vertex-cut : Thecommunication and storage overhead ofa vertex-cut is directlyproportional to the sum of the number of machines spanned byeach vertex § Vertex-cut strategy by default (balance hot-spot issue due to power law/Zipf’s Law) w/ min replication ofedges § Batch processing/not streaming § Org.apache.spark.graphx.Graph. - partitionBy(partitionStrategy: PartitionStrategy, numPartitions: Int) - partitionBy(partitionStrategy: PartitionStrategy) - 4 PartitionStrategies 1) RandomVertexCut(usually the best) : random vertex cut that colocates all same- direction edges between two vertices (hashingthe source and destination vertexIDs) 2) CanonicalRandomVertexCut - a random vertexcut that colocates all edges between twovertices, regardless of direction (hashing the source and destination vertex IDs in a canonical direction) … remember GraphX is multi graph 3) EdgePartition1D - Assigns edges to partitions using only the source vertex ID, colocating edges withthe same source 4) EdgePartition2D : 2D partitioning of the sparse edge adjacency matrix http://www.istc-cc.cmu.edu/publications/papers/2013/grades-graphx_with_fonts.pdf,https://www.sics.se/~amir/files/download/papers/jabeja-vc.pdf
  • 23. AlphaGo Tweets Analytics Pipeline Collect Data Transform & Extract Features GraphX Model GraphX Algorithms Store § Initiallyprimed data (7days from twitter) § Then used thesinceID to get incremental tweets § Use application authentication for higher rate § Used tweepy, wait_on_rate_limit=True, wait_on_rate_limit_notify=True § ~330K tweets (see Download program), 2GB § => MongoDB ~820 MB w/compression) § Rich data (see MongoDB) - Retweet, user ids, hash tags, locations et al § This exercise only covers the retweet interest network
  • 25. 2-Pipeline screen shots§ Get max tweet id § db.alphago.find().sort({id:-1}).limit(1).pretty() - "id" : NumberLong("709550216467185664"), § db.alphago.find().sort({id:+1}).limit(1).pretty() - "id" : NumberLong("709211132498714627") § /usr/local/mongo/bin/mongoimport --db admin --collection alphago --drop --file tweets-20160313.txt - 232296 - Min : "id" : NumberLong("705845567537221632") - Max : "id" : NumberLong("709211104719872000") § /usr/local/mongo/bin/mongoimport --db admin --collection alphago --file tweets-20160314.txt - +21368 - Min : "id" : NumberLong("705845567537221632"), - Max : "id" : NumberLong("709550216467185664"), - Count : 253664 § /usr/local/mongo/bin/mongoimport --db admin --collection alphago --file tweets-20160315.txt - +38452 - Min: "id_str" : "705845567537221632", - Max: "id" : NumberLong("709817136437403649") - Count : 292116 § /usr/local/mongo/bin/mongoimport --db admin --collection alphago --file tweets-20160316.txt - +17331 - Min: "id" : NumberLong("705845567537221632"), - Max: "id" : NumberLong("710312094227300352"), - Count : 309447 § /usr/local/mongo/bin/mongoimport --db admin --collection alphago --file tweets-20160318.txt - +10797 - Min: "id" : NumberLong("705845567537221632"), - Max: "id" : NumberLong("710883718903140353"), - Count : 320244 § /usr/local/mongo/bin/mongoimport --db admin --collection alphago --file tweets-20160320.txt - +11511 - Min: "id" : NumberLong("705845567537221632"), - Max: "id" : NumberLong("711720090954108928"), - Count : 331755 § /usr/local/mongo/bin/mongoimport --db admin --collection alphago --file tweets-20160322.txt - +5166 - Min: "id" : NumberLong("705845567537221632"), - Max: "id" : NumberLong("712410062820511744"), - Count : 336921
  • 26. 3-Twitter gives lots of fields in Mongo
  • 28. 6-Read as dataframe 7-Create Vertices, Edges & Objects 8-Finally the graph & run algorithms
  • 29. The Art of an AlphaGo GraphX Vertex Vertex(VertexId,VD) VD can be any object Edges Edge(ED) ED can be any object Graph Graph(VD,ED) [(VertexId , VD)] [(VertexId, VertexId, [(VertexId, VD)]ED)] Vertex VertexEdge