SlideShare a Scribd company logo
1 of 29
Download to read offline
BUILDING  A  GRAPH  OF  ALL  
U.S.  BUSINESSES  USING  
SPARK  TECHNOLOGIES.  
Alexis  Roos @alexisroos
Engineering  manager  /  Lead  Data  Science  Engineer
Radius  Intelligence
AGENDA
• Radius  Intelligence
• Objectives  and  terminology
• Business  graph  data  pipeline
• Lessons  learned
• Q&As
2
Radius  Messaging  Framework
June  8,  2015  
Radius transforms the way B2B
marketers discover markets, acquire
customers, and measure success.
RADIUS  INTELLIGENCE
Our software is powered by the
Radius Intelligence Cloud
a proprietary data science engine
which provides predictive analytics,
powerful segmentation, and
seamless integrations.
Predictive Marketing software
RADIUS  INTELLIGENCE
➔ Founded  in  2009
➔ 125+  employees (and  growing)
➔ Headquartered  in  San  Francisco,  CA
➔ $125  million in  funding  to  date
➔ Cutting  edge  in  Data  Science and  Data  
Engineering  Technologies
4
Produce  an  accurate and comprehensive  Business  Graph from  records coming  from  
dozens  of  sources  comprising  hundreds  of  data  points  such  as:
BUSINESS  GRAPH  OBJECTIVE
5
…
Answer  complex  connected  questions  such  as:
6
BUSINESS  GRAPH  BENEFITS
• Employees:  Find  engineering  executives  at  a  given  company  
focusing  on  Big  Data  or  security
• News:  Show  News related  to  companies  in  my  marketing  segment  
such  as  Funding  announcements  or  mergers  and  acquisitions
• Intent:  Show  companies  with  a  buying  Intent for  my  product  or  
engineering  managers  looking  for  security  product
• Organization:  Show  all  locations  or  Employees  for  a  given  business
etc
o Business:  a  company  or  organization
ex:  Blue  Bottle  company
BUSINESS  GRAPH  TERMINOLOGY
7
E I
N T
o Location:  a  business  at  a  particular  address
ex:  Blue  Bottle  at  Sansome St,  San  Francisco
o Attributes:  information  associated  with  a
business  and/or  location
ex:  address,  website,  phone  number,  etc.
o Additional  Data  types:
ex:  Employees,  Intent,  News,  Technologies,  etc.
BUSINESS  GRAPH  DATA  PIPELINE
Data  preparation Clustering ConstructionData  acquisition
8
Dozens sources
7B+ records
50B+ Data  Points
• Crawling
• Licensing
Raw  Records
BUSINESS  GRAPH  DATA  PIPELINE
Data  preparation Clustering ConstructionData  acquisition
9
• Standardization
• Validation
• Normalization
Normalized  RecordsRaw  Records
BUSINESS  GRAPH  DATA  PIPELINE
Data  preparation Clustering ConstructionData  acquisition
10
Cluster  records  on:
• Address
• Name  of  the  business
• Websites
• Phone  numbers
• Employees
• Description
• Category  codes
• Etc.
Clusters  of  records  by  
address/business
Normalized  Records
BUSINESS  GRAPH  DATA  PIPELINE
Data  preparation Clustering Construction
-­ Locations      
-­ Businesses  
&  more
Data  acquisition
11
Construct  Locations:
• Filter  bad  clusters
• Use  all  records  in  cluster  to  
create  Location  with  attributes:
o Select  best  value(s)
o Score  values:  confidence  
based  on  specific  record  
features  including  historical
o Impute  missing  values
• Etc.
Locations
MLLib
Clusters
BUSINESS  GRAPH  DATA  PIPELINE
Data  preparation ClusteringData  acquisition Construction
Businesses  
&  more
12
Construct  business  &  more:
• Create  business  and  link  
locations
• Add  business  level  info:  
news,  employees,  intent,  etc.
• Correct/  impute  fields  at  
business  level
• Etc.
Locations Business  graph
MLLib
E
I
N
T
E
GraphX models  a  “property  graph”  which is  an  multi  directed,  attributed graph.
SPARK  /  GRAPHX
13
(1,  Location) (2,  Domain)
(1,  2,  source1)
(3,  Phone)
(1,  2,  source2)
class Graph[VD,  ED]  {
val vertices: VertexRDD[VD]
val edges: EdgeRDD[ED]
val triplets: RDD[EdgeTriplet[VD,  ED]]  
}
VertexRDD[VD] extends RDD[(VertexID,  VD)]
EdgeRDD[ED] extends RDD[Edge[ED]]
VD  -­ Vertex  data attribute type
ED  -­ Edge data attribute type
Graph  operations  are  like  RDD  operations:  functional   and  lazily  evaluated
SPARK  /  GRAPHX
14
Operations  from  Graph  and  GraphOps:
• Information:  numEdges,  numVertices,  inDegrees,  outDegrees,  degrees
• Collections:  vertices,  edges,  triplets
• Transformations  and  modifications:  mapVertices,  mapEdges,  mapTriplets,
filter,  reverse,  subgraph,  mask,  groupEdges,  convertToCanonicalEdges
• Join:  joinVertices,  outerJoinVertices
• Aggregations:  collectNeighbor(Id)s,  aggregateMessages,  sendMsg,  mergeMsg,  tripletFields
• Caching  and  partitioning: persist,  cache,  checkpoint,  unpersist(Vertices),  partitionBy
• Graph  algorithms:  pregel,  pageRank,  connectedComponents,  triangleCount
And  data  operations  on  VertexRDD and  EdgeRDD:
• VertexRDD:  filter,  mapValues,  minus,  diff,  innerJoin,  leftJoin,  aggregateUsingIndex
• EdgeRDD:  mapValues,  reverse,  innerJoin
GRAPH  DATA  PIPELINE  CONSTRUCTION  ZOOM  IN
Data  preparation ClusteringData  acquisition
Construct  business  graph:
• Create  business  and  link  
locations
• Add  business  level  info:  
news,  employees,  intent,  
etc.
• Correct/  impute  fields  at  
business  level
• Etc.
Locations Business  graph
Construction
Businesses  
&  more
15
E
I
N
T
E
1. Create  vertex  for  each  location  and  each  known  business
GRAPH  DATA  PIPELINE  CONSTRUCTION  ZOOM  IN
16
VertexRDD[Location]
Clusters
map
VertexRDD[Business]
Curated  list
map
2. Link  Locations  to  Known  Business  per  relationships  (such  as  domain)
GRAPH  DATA  PIPELINE  CONSTRUCTION  ZOOM  IN
flatMap
flatMap
attributes
Create
Edges
Location Business
17
Locations
Businesses
join
3.  Create  vertex  for  unique  top  attributes  such  as  address,  phone,  domain,  etc.
and  link  locations  to  these  attributes
GRAPH  DATA  PIPELINE  CONSTRUCTION  ZOOM  IN
flatMap
attributes
groupBy
Create
Vertices
& Edges
Location  and  attributes
18
Locations
Aggregate  messages:  
sendMsg,  mergeMsg
SPARK  /  GRAPHX
Seq
19
sendMsg:  (srcID,  name) mergeMsg:  x  ++  y
Location  1
Location  2
Domain
sendMsg mergeMsg
4.  Create  location  to  location  edges  using  attributes  relationship  and  graph  cutting
GRAPH  DATA  PIPELINE  CONSTRUCTION  ZOOM  IN
Locations  and  attributes
aggregate
Messages
.combinations(2)
20
Locations
.filter(isSimilar)  *
*  such  as  name  matching
Note:  .combinations  (2)  and  bottleneck
PERFORMANCE  TIP
.combinations(2)
O(n):  N^2
5
1
4
2
3
21
5
1
4
2
3
smarter  combination
O(n):  N
1 2 3 4 5
SPARK  /  GRAPHX
22
Connected  components
Group  1
Group  2
connected
Components
Locations
5.  Call  connected  components  on  linked  locations  and  create  businesses
GRAPH  DATA  PIPELINE  CONSTRUCTION  ZOOM  IN
connected
Components
Locations
Location Business
Create
Vertices
& Edges
23
GraphX Connected  components  implementation  already  performs  optimizations:
– Incremental  Updates  to  Mirror  Caches
– Join  Elimination
– Index  Scanning  for  Active  Sets
– Local  Vertex  and  Edge  Indices
– Index  and  Routing  Table  Reuse
But  algorithm  is  iterative  so  optimization  matters:
PERFORMANCE  TIP
24
• Vertices  and  edges  should  be  in  memory:  more  efficient  to
subset  /  keep  the  graph  minimal  or  filter  and  rejoin  later
• Check-­pointing  improves  performance  (less  lineage)
6.  More  steps:
• Assign  headquarter
• Join  additional  information:  Employees,  News,  Intent,  etc.
• Further  corrections  /  imputations:  attributes,  revenues,  etc.
GRAPH  DATA  PIPELINE  CONSTRUCTION  ZOOM  IN
Location Business
more
operations
25
Business  graph
E
I
N
T
E
GRAPHICAL  VIEW
26
From  Locations  with  attributes  to  Locations  with  Business
GRAPHICAL  VIEW
27
From  Locations  with  attributes  to  Locations  with  Business
• Using  a  Graph  allows  us  to  naturally  model  business  relationships
and  append  new  data  types  over  time.  Also  allowed  us  to  debug  data!
LESSONS  LEARNED
28
• GraphX scales:  250M  vertices  and  1b  edges  in  clustering
• Creating  and  updating  the  graph  is  easy  using  RDD  operations
• Edge  cutting  is  an  art  as  much  as  a  science
• Connected  Components  is  an  expensive  operation
THANK  YOU.
WE  ARE  HIRING!
http://radius.com/jobs/
Alexis  Roos
alexis@radius.com
@alexisroos

More Related Content

What's hot

Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Databricks
 
Managing the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflowManaging the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflow
Databricks
 
Machine Learning with Apache Spark
Machine Learning with Apache SparkMachine Learning with Apache Spark
Machine Learning with Apache Spark
IBM Cloud Data Services
 

What's hot (20)

Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache Spark
 
Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...Continuous Evaluation of Deployed Models in Production Many high-tech industr...
Continuous Evaluation of Deployed Models in Production Many high-tech industr...
 
Spark ML with High Dimensional Labels Michael Zargham and Stefan Panayotov
Spark ML with High Dimensional Labels Michael Zargham and Stefan PanayotovSpark ML with High Dimensional Labels Michael Zargham and Stefan Panayotov
Spark ML with High Dimensional Labels Michael Zargham and Stefan Panayotov
 
Spark Summit EU talk by Javier Aguedes
Spark Summit EU talk by Javier AguedesSpark Summit EU talk by Javier Aguedes
Spark Summit EU talk by Javier Aguedes
 
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
Lessons Learned Replatforming A Large Machine Learning Application To Apache ...
 
Generative Hyperloop Design: Managing Massively Scaled Simulations Focused on...
Generative Hyperloop Design: Managing Massively Scaled Simulations Focused on...Generative Hyperloop Design: Managing Massively Scaled Simulations Focused on...
Generative Hyperloop Design: Managing Massively Scaled Simulations Focused on...
 
Building Identity Graphs over Heterogeneous Data
Building Identity Graphs over Heterogeneous DataBuilding Identity Graphs over Heterogeneous Data
Building Identity Graphs over Heterogeneous Data
 
Big Data Heterogeneous Mixture Learning on Spark
Big Data Heterogeneous Mixture Learning on SparkBig Data Heterogeneous Mixture Learning on Spark
Big Data Heterogeneous Mixture Learning on Spark
 
Gender Prediction with Databricks AutoML Pipeline
Gender Prediction with Databricks AutoML PipelineGender Prediction with Databricks AutoML Pipeline
Gender Prediction with Databricks AutoML Pipeline
 
Building Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemMLBuilding Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemML
 
Managing the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflowManaging the Complete Machine Learning Lifecycle with MLflow
Managing the Complete Machine Learning Lifecycle with MLflow
 
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
Building a Data Warehouse for Business Analytics using Spark SQL-(Blagoy Kalo...
 
Database Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, Blazegraph
Database Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, BlazegraphDatabase Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, Blazegraph
Database Camp 2016 @ United Nations, NYC - Brad Bebee, CEO, Blazegraph
 
Operationalizing Big Data Pipelines At Scale
Operationalizing Big Data Pipelines At ScaleOperationalizing Big Data Pipelines At Scale
Operationalizing Big Data Pipelines At Scale
 
Extending Machine Learning Algorithms with PySpark
Extending Machine Learning Algorithms with PySparkExtending Machine Learning Algorithms with PySpark
Extending Machine Learning Algorithms with PySpark
 
Spark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with SparkSpark and Couchbase: Augmenting the Operational Database with Spark
Spark and Couchbase: Augmenting the Operational Database with Spark
 
Distributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On SparkDistributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On Spark
 
Frequently Bought Together Recommendations Based on Embeddings
Frequently Bought Together Recommendations Based on EmbeddingsFrequently Bought Together Recommendations Based on Embeddings
Frequently Bought Together Recommendations Based on Embeddings
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
 
Machine Learning with Apache Spark
Machine Learning with Apache SparkMachine Learning with Apache Spark
Machine Learning with Apache Spark
 

Viewers also liked

Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Spark Summit
 
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Spark Summit
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
Paco Nathan
 
Machine Learning and GraphX
Machine Learning and GraphXMachine Learning and GraphX
Machine Learning and GraphX
Andy Petrella
 
Sphinx autodoc - automated API documentation (PyCon APAC 2015 in Taiwan)
Sphinx autodoc - automated API documentation (PyCon APAC 2015 in Taiwan)Sphinx autodoc - automated API documentation (PyCon APAC 2015 in Taiwan)
Sphinx autodoc - automated API documentation (PyCon APAC 2015 in Taiwan)
Takayuki Shimizukawa
 
5 Reasons Enterprise Adoption of Spark is Unstoppable by Mike Gualtieri
 5 Reasons Enterprise Adoption of Spark is Unstoppable by Mike Gualtieri 5 Reasons Enterprise Adoption of Spark is Unstoppable by Mike Gualtieri
5 Reasons Enterprise Adoption of Spark is Unstoppable by Mike Gualtieri
Spark Summit
 

Viewers also liked (20)

Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
Implementing Near-Realtime Datacenter Health Analytics using Model-driven Ver...
 
GraphX and Pregel - Apache Spark
GraphX and Pregel - Apache SparkGraphX and Pregel - Apache Spark
GraphX and Pregel - Apache Spark
 
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)
 
Using spark for timeseries graph analytics
Using spark for timeseries graph analyticsUsing spark for timeseries graph analytics
Using spark for timeseries graph analytics
 
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
Relationship Extraction from Unstructured Text-Based on Stanford NLP with Spa...
 
Graphs are everywhere! Distributed graph computing with Spark GraphX
Graphs are everywhere! Distributed graph computing with Spark GraphXGraphs are everywhere! Distributed graph computing with Spark GraphX
Graphs are everywhere! Distributed graph computing with Spark GraphX
 
GraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communitiesGraphX: Graph analytics for insights about developer communities
GraphX: Graph analytics for insights about developer communities
 
Big Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache SparkBig Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache Spark
 
Machine Learning and GraphX
Machine Learning and GraphXMachine Learning and GraphX
Machine Learning and GraphX
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in Spark
 
Machine learning in finance using python
Machine learning in finance using pythonMachine learning in finance using python
Machine learning in finance using python
 
Sphinx autodoc - automated API documentation (PyCon APAC 2015 in Taiwan)
Sphinx autodoc - automated API documentation (PyCon APAC 2015 in Taiwan)Sphinx autodoc - automated API documentation (PyCon APAC 2015 in Taiwan)
Sphinx autodoc - automated API documentation (PyCon APAC 2015 in Taiwan)
 
快快樂樂成為 Coding Ninja (by pytest) @ PyConAPAC2015
快快樂樂成為 Coding Ninja (by pytest) @ PyConAPAC2015快快樂樂成為 Coding Ninja (by pytest) @ PyConAPAC2015
快快樂樂成為 Coding Ninja (by pytest) @ PyConAPAC2015
 
Introductory Keynote at Hadoop Workshop by Ospcon (2014)
Introductory Keynote at Hadoop Workshop by Ospcon (2014)Introductory Keynote at Hadoop Workshop by Ospcon (2014)
Introductory Keynote at Hadoop Workshop by Ospcon (2014)
 
3rd Moscow cassandra meetup (Fast In-memory Analytics Over Cassandra Data )
3rd Moscow cassandra meetup (Fast In-memory Analytics Over Cassandra Data )3rd Moscow cassandra meetup (Fast In-memory Analytics Over Cassandra Data )
3rd Moscow cassandra meetup (Fast In-memory Analytics Over Cassandra Data )
 
Building a Location Based Social Graph in Spark at InMobi-(Seinjuti Chatterje...
Building a Location Based Social Graph in Spark at InMobi-(Seinjuti Chatterje...Building a Location Based Social Graph in Spark at InMobi-(Seinjuti Chatterje...
Building a Location Based Social Graph in Spark at InMobi-(Seinjuti Chatterje...
 
Apache spark
Apache sparkApache spark
Apache spark
 
Выступление Александра Крота из "Вымпелком" на Hadoop Meetup в рамках RIT++
Выступление Александра Крота из "Вымпелком" на Hadoop Meetup в рамках RIT++Выступление Александра Крота из "Вымпелком" на Hadoop Meetup в рамках RIT++
Выступление Александра Крота из "Вымпелком" на Hadoop Meetup в рамках RIT++
 
5 Reasons Enterprise Adoption of Spark is Unstoppable by Mike Gualtieri
 5 Reasons Enterprise Adoption of Spark is Unstoppable by Mike Gualtieri 5 Reasons Enterprise Adoption of Spark is Unstoppable by Mike Gualtieri
5 Reasons Enterprise Adoption of Spark is Unstoppable by Mike Gualtieri
 
Fighting financial crime with graph analysis at BIWA Summit 2017
Fighting financial crime with graph analysis at BIWA Summit 2017Fighting financial crime with graph analysis at BIWA Summit 2017
Fighting financial crime with graph analysis at BIWA Summit 2017
 

Similar to Building a Graph of all US Businesses Using Spark Technologies by Alexis Roos

L’architettura di classe enterprise di nuova generazione
L’architettura di classe enterprise di nuova generazioneL’architettura di classe enterprise di nuova generazione
L’architettura di classe enterprise di nuova generazione
MongoDB
 
Lightning-fast Analytics for Workday transactional data
Lightning-fast Analytics for Workday transactional dataLightning-fast Analytics for Workday transactional data
Lightning-fast Analytics for Workday transactional data
Pavel Hardak
 
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...
Databricks
 

Similar to Building a Graph of all US Businesses Using Spark Technologies by Alexis Roos (20)

OLAP Cubes in Datawarehousing
OLAP Cubes in DatawarehousingOLAP Cubes in Datawarehousing
OLAP Cubes in Datawarehousing
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy
 
Neo4j GraphTour New York_EY Presentation_Michael Moore
Neo4j GraphTour New York_EY Presentation_Michael MooreNeo4j GraphTour New York_EY Presentation_Michael Moore
Neo4j GraphTour New York_EY Presentation_Michael Moore
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy
 
Roadmap for Enterprise Graph Strategy
Roadmap for Enterprise Graph StrategyRoadmap for Enterprise Graph Strategy
Roadmap for Enterprise Graph Strategy
 
L’architettura di classe enterprise di nuova generazione
L’architettura di classe enterprise di nuova generazioneL’architettura di classe enterprise di nuova generazione
L’architettura di classe enterprise di nuova generazione
 
Big Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft AzureBig Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft Azure
 
Graph all the things - PRathle
Graph all the things - PRathleGraph all the things - PRathle
Graph all the things - PRathle
 
Introduction: Relational to Graphs
Introduction: Relational to GraphsIntroduction: Relational to Graphs
Introduction: Relational to Graphs
 
Lightning-fast Analytics for Workday transactional data
Lightning-fast Analytics for Workday transactional dataLightning-fast Analytics for Workday transactional data
Lightning-fast Analytics for Workday transactional data
 
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...
Lightning-Fast Analytics for Workday Transactional Data with Pavel Hardak and...
 
GraphTalk Copenhagen - Introduction to Graphs and Neo4j
GraphTalk Copenhagen - Introduction to Graphs and Neo4jGraphTalk Copenhagen - Introduction to Graphs and Neo4j
GraphTalk Copenhagen - Introduction to Graphs and Neo4j
 
L’architettura di Classe Enterprise di Nuova Generazione
L’architettura di Classe Enterprise di Nuova GenerazioneL’architettura di Classe Enterprise di Nuova Generazione
L’architettura di Classe Enterprise di Nuova Generazione
 
2018 data warehouse features in spark
2018   data warehouse features in spark2018   data warehouse features in spark
2018 data warehouse features in spark
 
Managing data analytics in a hybrid cloud
Managing data analytics in a hybrid cloudManaging data analytics in a hybrid cloud
Managing data analytics in a hybrid cloud
 
RahulSoni_ETL_resume
RahulSoni_ETL_resumeRahulSoni_ETL_resume
RahulSoni_ETL_resume
 
Sandeep Grandhi (1)
Sandeep Grandhi (1)Sandeep Grandhi (1)
Sandeep Grandhi (1)
 
Big Data Modeling
Big Data ModelingBig Data Modeling
Big Data Modeling
 

More from Spark Summit

Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Spark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Spark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
Spark Summit
 

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Recently uploaded

怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
vexqp
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
gajnagarg
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Klinik kandungan
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
gajnagarg
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
gajnagarg
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Computer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdfComputer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdf
SayantanBiswas37
 

Recently uploaded (20)

怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
 
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Latur [ 7014168258 ] Call Me For Genuine Models We ...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
 
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Computer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdfComputer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdf
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 

Building a Graph of all US Businesses Using Spark Technologies by Alexis Roos

  • 1. BUILDING  A  GRAPH  OF  ALL   U.S.  BUSINESSES  USING   SPARK  TECHNOLOGIES.   Alexis  Roos @alexisroos Engineering  manager  /  Lead  Data  Science  Engineer Radius  Intelligence
  • 2. AGENDA • Radius  Intelligence • Objectives  and  terminology • Business  graph  data  pipeline • Lessons  learned • Q&As 2
  • 3. Radius  Messaging  Framework June  8,  2015   Radius transforms the way B2B marketers discover markets, acquire customers, and measure success. RADIUS  INTELLIGENCE Our software is powered by the Radius Intelligence Cloud a proprietary data science engine which provides predictive analytics, powerful segmentation, and seamless integrations. Predictive Marketing software
  • 4. RADIUS  INTELLIGENCE ➔ Founded  in  2009 ➔ 125+  employees (and  growing) ➔ Headquartered  in  San  Francisco,  CA ➔ $125  million in  funding  to  date ➔ Cutting  edge  in  Data  Science and  Data   Engineering  Technologies 4
  • 5. Produce  an  accurate and comprehensive  Business  Graph from  records coming  from   dozens  of  sources  comprising  hundreds  of  data  points  such  as: BUSINESS  GRAPH  OBJECTIVE 5 …
  • 6. Answer  complex  connected  questions  such  as: 6 BUSINESS  GRAPH  BENEFITS • Employees:  Find  engineering  executives  at  a  given  company   focusing  on  Big  Data  or  security • News:  Show  News related  to  companies  in  my  marketing  segment   such  as  Funding  announcements  or  mergers  and  acquisitions • Intent:  Show  companies  with  a  buying  Intent for  my  product  or   engineering  managers  looking  for  security  product • Organization:  Show  all  locations  or  Employees  for  a  given  business etc
  • 7. o Business:  a  company  or  organization ex:  Blue  Bottle  company BUSINESS  GRAPH  TERMINOLOGY 7 E I N T o Location:  a  business  at  a  particular  address ex:  Blue  Bottle  at  Sansome St,  San  Francisco o Attributes:  information  associated  with  a business  and/or  location ex:  address,  website,  phone  number,  etc. o Additional  Data  types: ex:  Employees,  Intent,  News,  Technologies,  etc.
  • 8. BUSINESS  GRAPH  DATA  PIPELINE Data  preparation Clustering ConstructionData  acquisition 8 Dozens sources 7B+ records 50B+ Data  Points • Crawling • Licensing Raw  Records
  • 9. BUSINESS  GRAPH  DATA  PIPELINE Data  preparation Clustering ConstructionData  acquisition 9 • Standardization • Validation • Normalization Normalized  RecordsRaw  Records
  • 10. BUSINESS  GRAPH  DATA  PIPELINE Data  preparation Clustering ConstructionData  acquisition 10 Cluster  records  on: • Address • Name  of  the  business • Websites • Phone  numbers • Employees • Description • Category  codes • Etc. Clusters  of  records  by   address/business Normalized  Records
  • 11. BUSINESS  GRAPH  DATA  PIPELINE Data  preparation Clustering Construction -­ Locations       -­ Businesses   &  more Data  acquisition 11 Construct  Locations: • Filter  bad  clusters • Use  all  records  in  cluster  to   create  Location  with  attributes: o Select  best  value(s) o Score  values:  confidence   based  on  specific  record   features  including  historical o Impute  missing  values • Etc. Locations MLLib Clusters
  • 12. BUSINESS  GRAPH  DATA  PIPELINE Data  preparation ClusteringData  acquisition Construction Businesses   &  more 12 Construct  business  &  more: • Create  business  and  link   locations • Add  business  level  info:   news,  employees,  intent,  etc. • Correct/  impute  fields  at   business  level • Etc. Locations Business  graph MLLib E I N T E
  • 13. GraphX models  a  “property  graph”  which is  an  multi  directed,  attributed graph. SPARK  /  GRAPHX 13 (1,  Location) (2,  Domain) (1,  2,  source1) (3,  Phone) (1,  2,  source2) class Graph[VD,  ED]  { val vertices: VertexRDD[VD] val edges: EdgeRDD[ED] val triplets: RDD[EdgeTriplet[VD,  ED]]   } VertexRDD[VD] extends RDD[(VertexID,  VD)] EdgeRDD[ED] extends RDD[Edge[ED]] VD  -­ Vertex  data attribute type ED  -­ Edge data attribute type
  • 14. Graph  operations  are  like  RDD  operations:  functional   and  lazily  evaluated SPARK  /  GRAPHX 14 Operations  from  Graph  and  GraphOps: • Information:  numEdges,  numVertices,  inDegrees,  outDegrees,  degrees • Collections:  vertices,  edges,  triplets • Transformations  and  modifications:  mapVertices,  mapEdges,  mapTriplets, filter,  reverse,  subgraph,  mask,  groupEdges,  convertToCanonicalEdges • Join:  joinVertices,  outerJoinVertices • Aggregations:  collectNeighbor(Id)s,  aggregateMessages,  sendMsg,  mergeMsg,  tripletFields • Caching  and  partitioning: persist,  cache,  checkpoint,  unpersist(Vertices),  partitionBy • Graph  algorithms:  pregel,  pageRank,  connectedComponents,  triangleCount And  data  operations  on  VertexRDD and  EdgeRDD: • VertexRDD:  filter,  mapValues,  minus,  diff,  innerJoin,  leftJoin,  aggregateUsingIndex • EdgeRDD:  mapValues,  reverse,  innerJoin
  • 15. GRAPH  DATA  PIPELINE  CONSTRUCTION  ZOOM  IN Data  preparation ClusteringData  acquisition Construct  business  graph: • Create  business  and  link   locations • Add  business  level  info:   news,  employees,  intent,   etc. • Correct/  impute  fields  at   business  level • Etc. Locations Business  graph Construction Businesses   &  more 15 E I N T E
  • 16. 1. Create  vertex  for  each  location  and  each  known  business GRAPH  DATA  PIPELINE  CONSTRUCTION  ZOOM  IN 16 VertexRDD[Location] Clusters map VertexRDD[Business] Curated  list map
  • 17. 2. Link  Locations  to  Known  Business  per  relationships  (such  as  domain) GRAPH  DATA  PIPELINE  CONSTRUCTION  ZOOM  IN flatMap flatMap attributes Create Edges Location Business 17 Locations Businesses join
  • 18. 3.  Create  vertex  for  unique  top  attributes  such  as  address,  phone,  domain,  etc. and  link  locations  to  these  attributes GRAPH  DATA  PIPELINE  CONSTRUCTION  ZOOM  IN flatMap attributes groupBy Create Vertices & Edges Location  and  attributes 18 Locations
  • 19. Aggregate  messages:   sendMsg,  mergeMsg SPARK  /  GRAPHX Seq 19 sendMsg:  (srcID,  name) mergeMsg:  x  ++  y Location  1 Location  2 Domain sendMsg mergeMsg
  • 20. 4.  Create  location  to  location  edges  using  attributes  relationship  and  graph  cutting GRAPH  DATA  PIPELINE  CONSTRUCTION  ZOOM  IN Locations  and  attributes aggregate Messages .combinations(2) 20 Locations .filter(isSimilar)  * *  such  as  name  matching
  • 21. Note:  .combinations  (2)  and  bottleneck PERFORMANCE  TIP .combinations(2) O(n):  N^2 5 1 4 2 3 21 5 1 4 2 3 smarter  combination O(n):  N 1 2 3 4 5
  • 22. SPARK  /  GRAPHX 22 Connected  components Group  1 Group  2 connected Components Locations
  • 23. 5.  Call  connected  components  on  linked  locations  and  create  businesses GRAPH  DATA  PIPELINE  CONSTRUCTION  ZOOM  IN connected Components Locations Location Business Create Vertices & Edges 23
  • 24. GraphX Connected  components  implementation  already  performs  optimizations: – Incremental  Updates  to  Mirror  Caches – Join  Elimination – Index  Scanning  for  Active  Sets – Local  Vertex  and  Edge  Indices – Index  and  Routing  Table  Reuse But  algorithm  is  iterative  so  optimization  matters: PERFORMANCE  TIP 24 • Vertices  and  edges  should  be  in  memory:  more  efficient  to subset  /  keep  the  graph  minimal  or  filter  and  rejoin  later • Check-­pointing  improves  performance  (less  lineage)
  • 25. 6.  More  steps: • Assign  headquarter • Join  additional  information:  Employees,  News,  Intent,  etc. • Further  corrections  /  imputations:  attributes,  revenues,  etc. GRAPH  DATA  PIPELINE  CONSTRUCTION  ZOOM  IN Location Business more operations 25 Business  graph E I N T E
  • 26. GRAPHICAL  VIEW 26 From  Locations  with  attributes  to  Locations  with  Business
  • 27. GRAPHICAL  VIEW 27 From  Locations  with  attributes  to  Locations  with  Business
  • 28. • Using  a  Graph  allows  us  to  naturally  model  business  relationships and  append  new  data  types  over  time.  Also  allowed  us  to  debug  data! LESSONS  LEARNED 28 • GraphX scales:  250M  vertices  and  1b  edges  in  clustering • Creating  and  updating  the  graph  is  easy  using  RDD  operations • Edge  cutting  is  an  art  as  much  as  a  science • Connected  Components  is  an  expensive  operation
  • 29. THANK  YOU. WE  ARE  HIRING! http://radius.com/jobs/ Alexis  Roos alexis@radius.com @alexisroos