http://purdygoodengineering.com http://anant.us
Accumulo and Spark
With MLlib and GraphX
Introduction
● Section 1: Understanding the Technology
○ Big Picture
○ Accumulo
○ Spark
○ Example Code
● Section 2: Use Cases
○ Multi-Tenant Data Processing
○ Machine Learning / Graph Processing in Spark
○ Example ML + Graph on Business Data
● Questions and Answers
● Contact Information
● Section 1: Understanding the Technology
○ Big Picture
■ Why Accumulo
■ Why Spark
○ Accumulo
■ Key/Value Structure
■ Table Structure
■ Cell Level Security
■ Splits
■ Reads (scans)
■ Writes (upserts)
■ Deletes
○ Spark
■ Batch/Streaming
■ Machine Learning
■ Graph Processing
○ Example Code
■ Writing to Accumulo
■ Reading from Accumulo
■ Shell
Section 1: Understanding the Technology
Section 1: Big Picture
● Accumulo
○ Scalable, sorted, distributed key/value store with cell level security
● Spark
○ General compute engine for large-scale data processing
■ Batch Processing
■ Streaming
■ Machine Learning Library
■ Graph Processing
● Use Spark for compute and Accumulo for storage to get a secure, distributed,
scalable solution
Section 1: Accumulo: Key Structure
(image from accumulo.apache.org)
Section 1: Accumulo: Key Structure
(figure: Accumulo table design compared with RDBMS table design)
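The key structure above can be sketched in plain Python. This is only an illustration, not Accumulo client code: the `Key` class and the sample entries are hypothetical, and Python string ordering stands in for Accumulo's byte-wise ordering. It shows how one relational "row" becomes several key/value entries, sorted ascending by row, column family, column qualifier, and visibility, with the newest timestamp first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Key:
    """Toy stand-in for an Accumulo key (illustrative, not the real class)."""
    row: str
    family: str
    qualifier: str
    visibility: str
    timestamp: int

    def sort_key(self):
        # Ascending on every field except timestamp, which sorts newest first.
        return (self.row, self.family, self.qualifier,
                self.visibility, -self.timestamp)


# Hypothetical entries: each field of a record is its own key/value pair.
entries = [
    (Key("user_002", "contact", "email", "public", 10), "b@example.com"),
    (Key("user_001", "contact", "email", "public", 10), "a@example.com"),
    (Key("user_001", "contact", "email", "public", 20), "a2@example.com"),
    (Key("user_001", "account", "plan", "private", 10), "gold"),
]

sorted_entries = sorted(entries, key=lambda kv: kv[0].sort_key())

for key, value in sorted_entries:
    print(key.row, key.family, key.qualifier,
          key.visibility, key.timestamp, value)
```

Because entries for the same row sort next to each other, a scan over a row range reads one contiguous block of keys.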
Section 1: Accumulo: Table Structure
● Each table has many tablets (distributed across nodes)
● Tablet data is stored in HDFS, which replicates it (default replication factor is 3)
● All entries for a given row reside on the same tablet
○ A Row Id design strategy needs to ensure rows are binned
evenly across tablets
○ Each table has “splits” which determine binning
○ If individual rows are still too large, a sharding strategy is
required
Section 1: Accumulo: Cell Level Security
● Each cell (or field) has its own access control determined
by visibility
● Each user has authorizations which correspond to
visibilities
● Only fields with visibilities which a user has authorization
to access can be retrieved by that user
● Visibilities support limited Boolean logic: AND (&) and OR (|)
○ e.g. private | system, public & dna_partner
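The visibility check can be sketched as a small expression evaluator. This is a toy model, not Accumulo's implementation: the function names are hypothetical, and spaces are tolerated only to match the examples on the slide. It assumes two rules that Accumulo also applies: an empty visibility is readable by everyone, and mixing & and | requires parentheses.

```python
import re

# A label, or one of the operators/parentheses; leading spaces are skipped.
TOKEN = re.compile(r"\s*([A-Za-z0-9_\-.:]+|[&|()])")


def tokenize(expression):
    tokens, pos = [], 0
    expression = expression.strip()
    while pos < len(expression):
        match = TOKEN.match(expression, pos)
        if not match:
            raise ValueError(f"bad character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def is_visible(expression, authorizations):
    """Return True if a user holding `authorizations` may read the cell."""
    tokens = tokenize(expression)
    if not tokens:
        return True  # empty visibility: readable by everyone
    pos = 0

    def parse_term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        if token == "(":
            result = parse_expr()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return result
        if token in ("&", "|", ")"):
            raise ValueError(f"unexpected token {token!r}")
        return token in authorizations

    def parse_expr():
        nonlocal pos
        result = parse_term()
        operator = None
        while pos < len(tokens) and tokens[pos] in ("&", "|"):
            if operator is None:
                operator = tokens[pos]
            elif tokens[pos] != operator:
                raise ValueError("mixing & and | requires parentheses")
            pos += 1
            right = parse_term()
            result = (result and right) if operator == "&" else (result or right)
        return result

    result = parse_expr()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result


print(is_visible("private | system", {"system"}))
print(is_visible("public & dna_partner", {"public"}))
```

In a real deployment the tablet servers enforce this check during scans, so cells a user is not authorized to see are never returned to the client.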
Section 1: Splits
● Each table has a default split
● Splits can be added to tables
● Accumulo auto-splits when tablets get too large
● Table splits and maximum tablet size are configurable
● Row ids are generally hashed to support distribution
● Example splits based on hashing
○ 0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f
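A rough sketch of how hashed row ids land in tablets, using the split points from the example above. The row id scheme (one hex character of a SHA-256 digest prefixed to the natural key) and the helper names are assumptions for illustration. A split point is treated as the last row of its tablet, so N split points give N + 1 tablets.

```python
import bisect
import hashlib
from collections import Counter

# Split points from the slide: one per hex digit.
SPLITS = list("0123456789abcdef")


def hashed_row_id(natural_key):
    """Hypothetical scheme: prefix the key with one hex character of its hash."""
    digest = hashlib.sha256(natural_key.encode("utf-8")).hexdigest()
    return f"{digest[0]}_{natural_key}"


def tablet_index(row_id, splits=SPLITS):
    """Index of the tablet holding row_id; each split is an inclusive end row."""
    return bisect.bisect_left(splits, row_id)


# Sequential natural keys would pile into one tablet; hashed ids spread out.
counts = Counter(tablet_index(hashed_row_id(f"user{i}")) for i in range(1000))
for index in sorted(counts):
    print(index, counts[index])
```

With this prefix scheme, tablet 0 (rows up to and including "0") stays empty and the ids are spread over the remaining 16 tablets.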
Section 1: Accumulo Reads
● Reads (are scans)
○ Scanner
○ BatchScanner (parallelizes over ranges)
● MapReduce/Spark
○ AccumuloInputFormat (one field at a time)
○ AccumuloRowInputFormat (one row at a time)
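The read paths above can be modelled with a sorted list. This is a toy model, not the Accumulo client API: `scan`, `batch_scan`, `group_by_row`, and the table contents are hypothetical. It shows a single range scan, a multi-range scan, and the difference between reading one field at a time and one row at a time.

```python
import bisect

# Hypothetical table contents, kept sorted by (row, column) like an Accumulo table.
TABLE = sorted([
    (("acct_001", "name"), "Acme"),
    (("acct_001", "status"), "closed"),
    (("acct_002", "name"), "Globex"),
    (("acct_003", "name"), "Initech"),
    (("acct_003", "status"), "lead"),
])


def scan(table, start_row, end_row):
    """Yield entries with start_row <= row <= end_row, in sorted order."""
    rows = [key[0] for key, _ in table]
    lo = bisect.bisect_left(rows, start_row)
    hi = bisect.bisect_right(rows, end_row)
    yield from table[lo:hi]


def batch_scan(table, ranges):
    """Collect entries for many ranges.

    A real BatchScanner fetches ranges in parallel and does not guarantee
    result order; this sketch simply walks the ranges one by one.
    """
    results = []
    for start_row, end_row in ranges:
        results.extend(scan(table, start_row, end_row))
    return results


def group_by_row(entries):
    """Group field-level entries into whole rows (the row-at-a-time view)."""
    rows = {}
    for (row, column), value in entries:
        rows.setdefault(row, {})[column] = value
    return rows


fields = batch_scan(TABLE, [("acct_001", "acct_001"), ("acct_003", "acct_003")])
print(group_by_row(fields))
```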
Section 1: Accumulo: Writes
● Writes
○ Writer
○ BatchWriter (parallelizes over tablets)
● MapReduce/Spark
○ AccumuloOutputFormat
○ AccumuloFileOutputFormat (bulk ingest)
● Both use Mutations to write to Accumulo
Section 1: Accumulo: Mutations (write and delete)
● Mutations are used to write and delete
● Mutation.put (to write)
● Mutation.putDelete (to delete)
● Writes are Upserts (insert or updates)
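The upsert and delete semantics can be sketched with a toy in-memory table. `ToyMutation` and `ToyTable` are hypothetical stand-ins whose method names only echo Mutation.put and Mutation.putDelete. The sketch ignores timestamps and versioning, and it removes a cell immediately, whereas Accumulo writes a delete marker.

```python
class ToyMutation:
    """Collects changes to one row, loosely modelled on an Accumulo Mutation."""

    def __init__(self, row):
        self.row = row
        self.updates = []  # (family, qualifier, value, is_delete)

    def put(self, family, qualifier, value):
        self.updates.append((family, qualifier, value, False))

    def put_delete(self, family, qualifier):
        self.updates.append((family, qualifier, None, True))


class ToyTable:
    """In-memory stand-in for a table; applies mutations in order."""

    def __init__(self):
        self.cells = {}  # (row, family, qualifier) -> value

    def apply(self, mutation):
        for family, qualifier, value, is_delete in mutation.updates:
            key = (mutation.row, family, qualifier)
            if is_delete:
                self.cells.pop(key, None)  # deleting a missing cell is a no-op
            else:
                self.cells[key] = value  # insert or overwrite: an upsert

    def get(self, row, family, qualifier):
        return self.cells.get((row, family, qualifier))


table = ToyTable()

insert = ToyMutation("acct_001")
insert.put("info", "status", "lead")
table.apply(insert)

update = ToyMutation("acct_001")
update.put("info", "status", "closed")  # same key: overwrites, no error
table.apply(update)
status_after_update = table.get("acct_001", "info", "status")

delete = ToyMutation("acct_001")
delete.put_delete("info", "status")
table.apply(delete)
status_after_delete = table.get("acct_001", "info", "status")

print(status_after_update, status_after_delete)
```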
Section 1: Accumulo
● accumulo.apache.org
● Download accumulo
● Examples
● Documentation
Concerned about scaling? How about a graph with 4T nodes and 70T edges? See:
http://www.pdl.cmu.edu/SDI/2013/slides/big_graph_nsa_rd_2013_56002v1.pdf
Section 1: Spark: MapReduce first
● Hadoop MapReduce (batch processing)
○ Mapping
○ Reducing
○ Chain jobs
○ 95% IO (each job must read/write to disk)
○ scalable
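The map, reduce, and chain steps can be sketched in a few lines of plain Python. This is a single-process illustration, not Hadoop code: `run_job` and the input lines are hypothetical. In Hadoop each job's output would be written to disk before the next job reads it, which is where the I/O cost comes from; here everything stays in memory.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Run one map/shuffle/reduce pass over the records."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: group mapped values by key
    return [reducer(key, values) for key, values in sorted(groups.items())]


lines = ["spark and accumulo", "accumulo and hadoop"]

# Job 1: classic word count.
counts = run_job(
    lines,
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=lambda word, ones: (word, sum(ones)),
)

# Job 2, chained on the output of job 1: group words by their count.
by_count = run_job(
    counts,
    mapper=lambda pair: [(pair[1], pair[0])],
    reducer=lambda count, words: (count, sorted(words)),
)

print(counts)
print(by_count)
```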
Section 1: Spark
● Batch Processing - MapReduce (many more functions)
● Streaming - mini batch processing
● Machine Learning - MLlib
● Graph Processing - GraphX
● Many Languages - (Java, Scala, Python, R)
Section 1: Spark
● spark.apache.org
● Download spark
● Example code
● Documentation
Section 1: Example Code
Simple examples for bookkeeping with Spark and Accumulo
https://github.com/matthewpurdy/purdy-good/tree/master/purdy-good-spark/purdy-good-spark-accumulo
Section 2: Use Case(s) Machine Learning and
Graph Processing
● Multi-Tenant Data Processing
● Machine Learning / Graph Processing in Spark
● Example Use Case of ML + Graph on Business Data
Section 2: Multi-Tenant Data Processing Needs
Customer (C) | (P) & (C) | Provider (P)
Team                      | Customer Private     | Customer Data shared w/ Provider       | Private Provider Data for Economy of Scale
Sales, Marketing          | IBM Indicators       | Relationships, Classification          | Classification Model, Relationship Graph
Marketing, Finance        | Apple Indicators     | Correlation, Prediction                | Correlation Model, Prediction Model
Sales, Marketing, Finance | Microsoft Indicators | Relationships, Correlation, Prediction | Correlation Model, Prediction Model, Relationship Graph
Finance                   | Google Indicators    | Correlation, Prediction                | Correlation Model, Prediction Model
Section 2: Multi-Tenant Data Processing Needs
Customer (C) | (P) & (C) | Provider (P)
C User                  | C Team       | C Management | C Management, P Analytics             | P Analytics, P Support
CU Manager, CU Employee | CT Sales     | CM Executive | CM Executive, CU Manager, PA * / PS * | PA * / PS *
CU Manager, CU Employee | CT Marketing | CM Executive | CM Executive, CU Manager, PA * / PS * | PA * / PS *
CU Employee             | CT Research  | CM Executive | CM Executive, CU Manager, PA * / PS * | PA * / PS *
CU Employee             | CT Finance   | CM Executive | CM Executive, CU Manager, PA * / PS * | PA * / PS *
Section 2: Multi-Tenant Data Processing Needs
● Analyze Sales Team successes (Closed Accounts) to recommend companies
to target for Marketing campaigns.
● Match Sales Team users' social accounts against social network users at the
recommended companies to create Call Lists.
● Correlate historic Marketing (Traffic & Conversions), Sales (Leads & Closed
Accounts), and Finance (Revenue & Profit) data to predict Sales from current
Marketing & Sales activities.
Section 2: Out of the Box : MLlib in Spark
● Classification
● Regression
● Decision Trees
● Recommendation
● Clustering
● Topic Modeling
● Feature Transformations
● ML Pipelining / Persistence
● “Based on past performance in the companies in the CRM, the most successful
sales have come from these categories, so go after these companies.”
Section 2: Out of the Box : MLlib in Spark
● Load Data
● Extract Features
● Train Model
● Find Best Model
● Use Model to Predict
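The five steps above can be walked through in plain Python. This is not MLlib code: the data is synthetic (generated from y = 2x + 1), and the feature extractors, the one-feature least-squares trainer, and all names are hypothetical stand-ins. It only shows the shape of the workflow: load, extract features, train candidates, keep the one with the lowest validation error, and predict.

```python
def load_data():
    """Synthetic (marketing_visits, closed_sales) pairs from y = 2x + 1."""
    return [(x, 2 * x + 1) for x in range(1, 9)]


# Candidate feature extractors (stand-ins for feature transformers).
FEATURES = {
    "linear": lambda x: float(x),
    "squared": lambda x: float(x * x),
}


def train(features, targets):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    n = len(features)
    mean_x = sum(features) / n
    mean_y = sum(targets) / n
    sxx = sum((x - mean_x) ** 2 for x in features)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(features, targets))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(model, features, targets):
    slope, intercept = model
    errors = [(slope * x + intercept - y) ** 2 for x, y in zip(features, targets)]
    return sum(errors) / len(errors)


data = load_data()
train_rows, validation_rows = data[:6], data[6:]

# Train one model per feature extractor and score it on held-out rows.
results = {}
for name, extract in FEATURES.items():
    model = train([extract(x) for x, _ in train_rows],
                  [y for _, y in train_rows])
    error = mean_squared_error(model,
                               [extract(x) for x, _ in validation_rows],
                               [y for _, y in validation_rows])
    results[name] = (error, model)

# Find the best model: lowest validation error.
best_name = min(results, key=lambda name: results[name][0])
best_model = results[best_name][1]


def predict(x):
    slope, intercept = best_model
    return slope * FEATURES[best_name](x) + intercept


print(best_name, predict(10))
```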
Section 2: KeystoneML - End to End ML
http://keystone-ml.org/
Section 2: Out of the Box : GraphX in Spark
● PageRank
● Connected components
● Label propagation
● SVD++
● Strongly connected components
● Triangle count
● “Based on the social graph of sales team members and the companies in your
CRM, talk to the companies you are “closest” to.”
Section 2: Out of the Box : GraphX in Spark
● Load Vertices (Nodes) RDD
● Load Edges RDD
● Create Graph from Vertices & Edges RDDs
● Run Graph Process / Query
● Get Data
http://ampcamp.berkeley.edu/big-data-mini-course/graph-analytics-with-graphx.html
Section 2: Out of the Box : GraphX in Spark
● Load Edges into Graph
● Run PageRank
● Load Nodes into RDD
● Join Users RDD with Rank
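The four steps above can be sketched without Spark. This is a plain-Python illustration, not GraphX: the edge list, the user names, and the `pagerank` helper are hypothetical. It uses a textbook PageRank whose ranks sum to 1, which differs from GraphX's normalization.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Iterative PageRank over a list of (source, destination) edges."""
    vertices = sorted({vertex for edge in edges for vertex in edge})
    out_links = {vertex: [] for vertex in vertices}
    for source, destination in edges:
        out_links[source].append(destination)

    count = len(vertices)
    rank = {vertex: 1.0 / count for vertex in vertices}
    for _ in range(iterations):
        new_rank = {vertex: (1.0 - damping) / count for vertex in vertices}
        for source, targets in out_links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for destination in targets:
                    new_rank[destination] += share
            else:
                # Dangling vertex: spread its rank evenly over all vertices.
                share = damping * rank[source] / count
                for vertex in vertices:
                    new_rank[vertex] += share
        rank = new_rank
    return rank


# Load edges into the graph (hypothetical "follows" relationships).
edges = [(1, 2), (3, 2), (4, 2), (2, 1)]

# Run PageRank.
ranks = pagerank(edges)

# Load nodes (user id -> name) and join them with the ranks.
users = {1: "alice", 2: "bob", 3: "carol", 4: "dave"}
ranked_users = sorted(
    ((users[vertex], score) for vertex, score in ranks.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, score in ranked_users:
    print(f"{name}: {score:.4f}")
```

In Spark the join step would be an RDD join keyed on the vertex id; the dictionary lookup here plays the same role.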
Questions and Answers
?
Contact Information
Matthew Purdy
● matthew.purdy@purdygoodengineering.com
● http://www.purdygoodengineering.com
● https://www.linkedin.com/in/matthewpurdy
● https://github.com/matthewpurdy
Rahul Singh
● rahul.singh@anant.us
● http://www.anant.us
● http://www.linkedin.com/in/xingh
● https://github.com/xingh

modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
Business Analytics using Microsoft Excel
Business Analytics using Microsoft ExcelBusiness Analytics using Microsoft Excel
Business Analytics using Microsoft Excelysmaelreyes
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 

Último (20)

Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
办理学位证加利福尼亚大学洛杉矶分校毕业证,UCLA成绩单原版一比一
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Business Analytics using Microsoft Excel
Business Analytics using Microsoft ExcelBusiness Analytics using Microsoft Excel
Business Analytics using Microsoft Excel
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 

Machine Learning & Graph Processing w/ Spark and Accumulo

  • 2. Introduction ● Section 1: Understanding the Technology ○ Big Picture ○ Accumulo ○ Spark ○ Example Code ● Section 2: Use Cases ○ Multi-Tenant Data Processing ○ Machine Learning / Graph Processing in Spark ○ Example ML + Graph on Business Data ● Questions and Answers ● Contact Information
  • 3. Section 1: Understanding the Technology ○ Big Picture ■ Why Accumulo ■ Why Spark ○ Accumulo ■ Key/Value Structure ■ Table Structure ■ Cell Level Security ■ Splits ■ Reads (scans) ■ Writes (upserts) ■ Deletes ○ Spark ■ Batch/Streaming ■ Machine Learning ■ Graph Processing ○ Example Code ■ Writing to Accumulo ■ Reading from Accumulo ■ Shell
  • 4. Section 1: Big Picture ● Accumulo ○ Scalable, sorted, distributed key/value store with cell-level security ● Spark ○ General compute engine for large-scale data processing ■ Batch Processing ■ Streaming ■ Machine Learning Library ■ Graph Processing ● Use Spark for compute and Accumulo for storage to get a secure, distributed, scalable solution
  • 6. Section 1: Accumulo: Key Structure (image from accumulo.apache.org)
  • 7. Section 1: Accumulo: Key Structure ● Accumulo Table Design ● RDBMS Table Design
  • 8. Section 1: Accumulo: Table Structure ● Each table has many tablets (distributed across nodes) ● Tablet data is replicated by the underlying HDFS (default replication factor is 3) ● All entries for a row reside on the same tablet ○ A Row ID design strategy needs to ensure binning is evenly distributed ○ Each table has “splits” which determine binning ○ If rows are still too large, a sharding strategy is required
  • 9. Section 1: Accumulo: Cell Level Security ● Each cell (or field) has its own access control, determined by its visibility ● Each user has authorizations which correspond to visibilities ● A user can retrieve only the fields whose visibilities that user is authorized to access ● Visibilities support limited logic such as AND (&) and OR (|) ○ e.g. private|system, public&dna_partner
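The visibility rule on slide 9 can be sketched in plain Python. This is only a model of the idea, not the Accumulo API: `is_visible` is a hypothetical helper, labels are assumed to be made of letters, digits, and underscores, and, as Accumulo does, the sketch rejects expressions that mix `&` and `|` without parentheses.

```python
import re

def is_visible(expression, authorizations):
    """Return True if a holder of `authorizations` may read a cell labeled `expression`."""
    tokens = re.findall(r"[A-Za-z0-9_]+|[&|()]", expression)
    if not tokens:
        return True  # an empty visibility is readable by everyone
    position = 0

    def parse_term():
        nonlocal position
        token = tokens[position]
        position += 1
        if token == "(":
            value = parse_expression()
            position += 1  # step over the closing ")"
            return value
        return token in authorizations

    def parse_expression():
        nonlocal position
        value = parse_term()
        seen_operator = None
        while position < len(tokens) and tokens[position] in ("&", "|"):
            operator = tokens[position]
            if seen_operator not in (None, operator):
                raise ValueError("mixing & and | requires parentheses")
            seen_operator = operator
            position += 1
            right = parse_term()
            value = (value and right) if operator == "&" else (value or right)
        return value

    return parse_expression()

# A user holding only "system" can read private|system but not public&dna_partner.
print(is_visible("private|system", {"system"}))
print(is_visible("public&dna_partner", {"system"}))
```

In a real deployment the tablet servers perform this check during a scan, so a client never receives cells it is not authorized to see.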
  • 10. Section 1: Splits ● Each table has a default split ● Splits can be added to tables ● Accumulo auto-splits when tablets get too large ● Table splits and tablet max size are configurable ● Row IDs are generally hashed to support distribution ● Example splits based on hashing ○ 0,1,2,3,4,5,6,7,8,9,a,b,c,d,e,f
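A minimal sketch of how hashed row IDs and the hex split points on slide 10 interact, in plain Python rather than the Accumulo client. The names `hashed_row_id` and `tablet_index`, the choice of SHA-256, the one-character prefix, and the `customer-N` keys are all assumptions made for illustration. The sketch relies on a split point being the inclusive end row of a tablet.

```python
import bisect
import hashlib
from collections import Counter

SPLITS = list("0123456789abcdef")  # the example split points from the slide

def hashed_row_id(natural_key):
    """Prefix the key with one hex character of its hash so rows spread across tablets."""
    prefix = hashlib.sha256(natural_key.encode("utf-8")).hexdigest()[0]
    return f"{prefix}_{natural_key}"

def tablet_index(row_id, splits=SPLITS):
    """Return the index of the tablet holding row_id.

    Tablet i covers rows greater than splits[i - 1] and up to and including splits[i];
    the last tablet covers everything after the final split point.
    """
    return bisect.bisect_left(splits, row_id)

# Count how many of 1000 hypothetical customer rows land in each tablet.
distribution = Counter(
    tablet_index(hashed_row_id(f"customer-{n}")) for n in range(1000)
)
print(sorted(distribution.items()))
```

With sixteen split points there are seventeen tablets. In this layout the first tablet stays empty, because every hashed row ID sorts after its own one-character prefix.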
  • 11. Section 1: Accumulo: Reads ● Reads are scans ○ Scanner ○ BatchScanner (parallelizes over ranges) ● MapReduce/Spark ○ AccumuloInputFormat (one field at a time) ○ AccumuloRowInputFormat (one row at a time)
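Because the table is sorted by key, a read is a range scan. The sketch below models that in plain Python over an in-memory list. `scan`, `batch_scan`, and `rows` are hypothetical stand-ins for the ideas behind Scanner, BatchScanner, and row-at-a-time reading, not the real classes, and the entries are made-up sample data.

```python
import bisect

# Entries are kept sorted by (row, column family, column qualifier).
ENTRIES = sorted([
    (("row_a", "contact", "email"), "a@example.com"),
    (("row_a", "contact", "phone"), "555-0100"),
    (("row_b", "contact", "email"), "b@example.com"),
    (("row_c", "contact", "email"), "c@example.com"),
])
KEYS = [key for key, _ in ENTRIES]

def scan(start_row, end_row):
    """Return entries with start_row <= row <= end_row, in sorted key order."""
    first = bisect.bisect_left(KEYS, (start_row,))  # seek to the first matching key
    results = []
    for key, value in ENTRIES[first:]:
        if key[0] > end_row:
            break
        results.append((key, value))
    return results

def batch_scan(ranges):
    """Collect entries for several row ranges (a real BatchScanner returns them unordered)."""
    results = []
    for start_row, end_row in ranges:
        results.extend(scan(start_row, end_row))
    return results

def rows(entries):
    """Group entries by row, similar in spirit to reading one row at a time."""
    grouped = {}
    for (row, family, qualifier), value in entries:
        grouped.setdefault(row, {})[(family, qualifier)] = value
    return grouped

print(scan("row_a", "row_a"))
print(rows(batch_scan([("row_a", "row_a"), ("row_c", "row_c")])))
```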
  • 12. Section 1: Accumulo: Writes ● Writes ○ Writer ○ BatchWriter (parallelizes over tablets) ● MapReduce/Spark ○ AccumuloOutputFormat ○ AccumuloFileOutputFormat (bulk ingest) ● Both use Mutations to write to Accumulo
  • 13. Section 1: Accumulo: Mutations (write and delete) ● Mutations are used to write and delete ● Mutation.put (to write) ● Mutation.putDelete (to delete) ● Writes are upserts (inserts or updates)
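The upsert and delete behaviour on slide 13 can be illustrated with a toy model in plain Python. The `Mutation` and `ToyTable` classes below are hypothetical stand-ins, not the Accumulo classes. They also ignore timestamps, versioning, and delete markers, which a real table uses to implement the same visible result.

```python
class Mutation:
    """A batch of changes to a single row (toy stand-in for illustration)."""

    def __init__(self, row):
        self.row = row
        self.updates = []  # (family, qualifier, value, is_delete)

    def put(self, family, qualifier, value):
        self.updates.append((family, qualifier, value, False))

    def put_delete(self, family, qualifier):
        self.updates.append((family, qualifier, None, True))


class ToyTable:
    """An in-memory key/value table keyed by (row, family, qualifier)."""

    def __init__(self):
        self.cells = {}

    def apply(self, mutation):
        for family, qualifier, value, is_delete in mutation.updates:
            key = (mutation.row, family, qualifier)
            if is_delete:
                self.cells.pop(key, None)   # deleting a missing cell is not an error
            else:
                self.cells[key] = value     # insert or overwrite: an upsert


table = ToyTable()

first = Mutation("acct_001")
first.put("ledger", "balance", "100")
table.apply(first)

second = Mutation("acct_001")
second.put("ledger", "balance", "250")      # same key: the value is replaced
second.put("ledger", "currency", "USD")     # new key: the value is inserted
table.apply(second)

third = Mutation("acct_001")
third.put_delete("ledger", "currency")
table.apply(third)

print(table.cells)
```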
  • 14. Section 1: Accumulo ● accumulo.apache.org ● Download Accumulo ● Examples ● Documentation ● Concerned about scaling? How about 4T nodes and 70T edges in a graph; see http://www.pdl.cmu.edu/SDI/2013/slides/big_graph_nsa_rd_2013_56002v1.pdf
  • 16. Section 1: Spark: MapReduce first ● Hadoop MapReduce (batch processing) ○ Mapping ○ Reducing ○ Chain jobs ○ 95% IO (each job must read/write to disk) ○ Scalable
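To make the map, reduce, and chaining steps on slide 16 concrete, here is a small single-process sketch in plain Python. It is not Hadoop code: `run_job` is a hypothetical helper that runs one map/shuffle/reduce pass in memory, and the input lines are made up. In Hadoop, the intermediate `word_counts` would be written to disk and read back by the second job, which is the IO cost the slide refers to.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Run one map/shuffle/reduce pass and return the reducer outputs sorted by key."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: group mapped values by key
    return [reducer(key, values) for key, values in sorted(groups.items())]

# Job 1: count words.
def count_mapper(line):
    return [(word, 1) for word in line.split()]

def count_reducer(word, ones):
    return (word, sum(ones))

# Job 2 (chained on the output of job 1): list the words for each count.
def invert_mapper(pair):
    word, count = pair
    return [(count, word)]

def collect_reducer(count, words):
    return (count, sorted(words))

lines = ["spark and accumulo", "spark and graphx"]
word_counts = run_job(lines, count_mapper, count_reducer)
words_by_count = run_job(word_counts, invert_mapper, collect_reducer)

print(word_counts)
print(words_by_count)
```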
  • 17. Section 1: Spark ● Batch Processing - MapReduce-style, with many more functions ● Streaming - mini-batch processing ● Machine Learning - MLlib ● Graph Processing - GraphX ● Many Languages - Java, Scala, Python, R
  • 18. Section 1: Spark ● spark.apache.org ● Download Spark ● Example code ● Documentation
  • 20. Section 1: Example Code ● Simple examples for bookkeeping with Spark and Accumulo: https://github.com/matthewpurdy/purdy-good/tree/master/purdy-good-spark/purdy-good-spark-accumulo
  • 21. Section 2: Use Case(s): Machine Learning and Graph Processing ● Multi-Tenant Data Processing ● Machine Learning / Graph Processing in Spark ● Example Use Case of ML + Graph on Business Data
  • 22. Section 2: Multi-Tenant Data Processing Needs Customer (C) (P) & (C) Provider (P) Team Customer Private Customer Data shared w/ Provider Private Provider Data for Economy of Scale Sales Marketing IBM Indicators Relationships Classification Classification Model Relationship Graph Marketing Finance Apple Indicators Correlation Prediction Correlation Model Prediction Model Sales Marketing Finance Microsoft Indicators Relationships Correlation Prediction Correlation Model Prediction Model Relationship Graph Finance Google Indicators Correlation Prediction Correlation Model Prediction Model
  • 23. Section 2: Multi-Tenant Data Processing Needs Customer (C) (P) & (C) Provider (P) C User C Team C Management C Management P Analytics P Analytics P Support CU Manager CU Employee CT Sales CM Executive CM Executive CU Manager PA * / PS * PA * / PS * CU Manager CU Employee CT Marketing CM Executive CM Executive CU Manager PA * / PS * PA * / PS * CU Employee CT Research CM Executive CM Executive CU Manager PA * / PS * PA * / PS * CU Employee CT Finance CM Executive CM Executive CU Manager PA * / PS * PA * / PS *
  • 24. Section 2: Multi-Tenant Data Processing Needs ● Analyze Sales Team successes (Closed Accounts) to recommend companies to target for Marketing campaigns ● Analyze Sales Team users' social accounts against social network users and the recommended companies to create Call Lists ● Correlate historic Marketing (Traffic & Conversions), Sales (Leads & Closed Accounts), and Finance (Revenue & Profit) data to predict Sales from current Marketing & Sales activities
  • 25. Section 2: Out of the Box : MLlib in Spark ● Classification ● Regression ● Decision Trees ● Recommendation ● Clustering ● Topic Modeling ● Feature Transformations ● ML Pipelining / Persistence ● “Based on past performance of the companies in the CRM, the most successful sales have come from these categories, so go after these companies.”
  • 26. Section 2: Out of the Box : MLlib in Spark ● Load Data ● Extract Features ● Train Model ● Find Best Model ● Use Model to Predict
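The five steps on slide 26 can be walked through with a deliberately tiny stand-in written in plain Python. It does not use MLlib: the records are made-up numbers, the field names `leads` and `closed` are hypothetical, and the "model" is a single weight fitted by least squares with an L2 penalty. The point is only the shape of the workflow: load, extract, train several candidates, keep the one with the lowest validation error, predict.

```python
# Step 1: load data (made-up numbers, for illustration only).
records = [
    {"leads": 1.0, "closed": 2.0},
    {"leads": 2.0, "closed": 4.0},
    {"leads": 3.0, "closed": 6.0},
    {"leads": 4.0, "closed": 8.0},
]

# Step 2: extract a (feature, label) pair from each record.
def extract(record):
    return record["leads"], record["closed"]

# Step 3: train a model of the form closed ~ weight * leads.
def train(points, penalty):
    """Least-squares fit through the origin with an L2 penalty on the weight."""
    sum_xy = sum(x * y for x, y in points)
    sum_xx = sum(x * x for x, _ in points)
    return sum_xy / (sum_xx + penalty)

def mean_squared_error(weight, points):
    return sum((weight * x - y) ** 2 for x, y in points) / len(points)

points = [extract(record) for record in records]
training, validation = points[:3], points[3:]

# Step 4: find the best model among a few candidate penalties.
candidates = {penalty: train(training, penalty) for penalty in (0.0, 1.0, 10.0)}
best_penalty = min(
    candidates, key=lambda p: mean_squared_error(candidates[p], validation)
)
best_weight = candidates[best_penalty]

# Step 5: use the chosen model to predict for a new input.
prediction = best_weight * 5.0
print(best_penalty, best_weight, prediction)
```

In Spark the same shape would typically be expressed with a pipeline and a model-selection step over a parameter grid, operating on distributed data instead of Python lists.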
  • 27. Section 2: KeystoneML - End to End ML http://keystone-ml.org/
  • 28. Section 2: Out of the Box : GraphX in Spark ● PageRank ● Connected components ● Label propagation ● SVD++ ● Strongly connected components ● Triangle count ● “Based on the social graph of sales team members and the companies in your CRM, talk to the companies you are “closest” to.”
  • 29. Section 2: Out of the Box : GraphX in Spark ● Load Nodes RDD ● Load Edges RDD ● Create Graph from Nodes & Edges RDDs ● Run Graph Process / Query ● Get Data ● http://ampcamp.berkeley.edu/big-data-mini-course/graph-analytics-with-graphx.html
  • 30. Section 2: Out of the Box : GraphX in Spark ● Load Edges into Graph ● Run Page Rank ● Load Nodes into RDD ● Join Users RDD with Rank
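The four steps on slide 30 can be mirrored in a few lines of plain Python. This is a single-machine sketch, not GraphX: the edge list and user names are made up, `page_rank` is a hypothetical helper, and it uses the unnormalized update rank = 0.15 + 0.85 × (sum of incoming contributions) with every node starting at 1.0.

```python
def page_rank(edges, damping=0.85, iterations=50):
    """Iterative PageRank over a list of (source, destination) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_degree = {node: 0 for node in nodes}
    for source, _ in edges:
        out_degree[source] += 1
    ranks = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        incoming = {node: 0.0 for node in nodes}
        for source, destination in edges:
            # Each node splits its current rank evenly across its outgoing edges.
            incoming[destination] += ranks[source] / out_degree[source]
        ranks = {node: (1 - damping) + damping * incoming[node] for node in nodes}
    return ranks

# Load edges into the graph: (follower, followed) pairs, made up for illustration.
edges = [(2, 1), (3, 1), (1, 3)]

# Run PageRank.
ranks = page_rank(edges)

# Load nodes (users) and join them with their rank, highest rank first.
users = {1: "alice", 2: "bob", 3: "carol"}
ranked_users = sorted(
    ((users[node], rank) for node, rank in ranks.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, rank in ranked_users:
    print(name, round(rank, 3))
```

In GraphX the join step would combine the users RDD with the ranks produced by the graph computation, keyed by vertex ID, which is what the dictionary lookup stands in for here.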
  • 32. Contact Information Matthew Purdy ● matthew.purdy@purdygoodengineering.com ● http://www.purdygoodengineering.com ● https://www.linkedin.com/in/matthewpurdy ● https://github.com/matthewpurdy Rahul Singh ● rahul.singh@anant.us ● http://www.anant.us ● http://www.linkedin.com/in/xingh ● https://github.com/xingh