Jose Quesada
Director, Data Science Retreat
jose@datascienceretreat.com
@quesada
• Mentors are world-class: CTOs, library authors, inventors,
founders of fast-growing companies, etc.
• DSR accepts fewer than 5% of the applications
• Strong focus on commercial awareness
• 5 years of working experience on average
• 30+ partner companies in Europe
DSR participants do a portfolio
project
Why is DSR talking about Scala/Spark?
They are b
IBM is behind this
They hired
What is a good question?
• Business case
• Data available
• Technology to answer the question is available
• We know when the solution worked
Does he look like a bitch?
The question: When should I tweet
to influence the right account?

Or ‘beat Buffer at their own game’
What is a good question?
• Business case
DJ J & MAX RECORDS
Overlap Tweet hours
Tweet frequency per UTC hour
What is a good question?
• Business case
• Data available
24GB
What is a good question?
• Business case
• Data available
• Technology to answer the question is available
What is a good question?
• Business case
• Data available
• Technology to answer the question is available
• We know when the solution worked
Graph theory parts we can
use to solve this problem
Graph theory primer
• Random walk
• Shortest path
• Sampling
Sampling in networks
Note that sampling in networks is fraught with difficulties. One cannot simply
sample the edges and nodes and expect the sample to be representative of the
original network. In the graph below, a sample that missed node 1 or 2 would
disconnect the two clusters and would not have the same properties as the
original.
[Figure: two clusters joined through two bridging nodes; labels: Node 11, Node 2]
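The point about bridging nodes can be checked on a toy graph. The sketch below is only an illustration: the six-node graph and the helper name connected_components are made up here, not taken from the talk. It counts connected components before and after a sample drops node 1.

```python
from collections import deque


def connected_components(nodes, edges):
    """Count connected components of an undirected graph with BFS."""
    adjacency = {node: set() for node in nodes}
    for u, v in edges:
        # Edges touching a node that was not sampled are dropped.
        if u in adjacency and v in adjacency:
            adjacency[u].add(v)
            adjacency[v].add(u)
    seen = set()
    components = 0
    for start in adjacency:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
    return components


# Hypothetical toy graph: two triangles joined only by the edge (1, 2).
nodes = {1, 2, 3, 4, 5, 6}
edges = [(1, 3), (1, 4), (3, 4),   # cluster around node 1
         (2, 5), (2, 6), (5, 6),   # cluster around node 2
         (1, 2)]                   # the only bridge

full = connected_components(nodes, edges)
sample_without_1 = connected_components(nodes - {1}, edges)
print(full, sample_without_1)
```

Dropping a single well-placed node changes a global property (connectivity), which is why a naive node or edge sample may not look like the original network.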
Random surfer
[Figure sequence: a random surfer walking over a graph with nodes A, B, C, D, E]
Visited more often:
• Nodes with many links
• Coming from frequently visited nodes
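A small simulation makes the two bullets concrete. This is a sketch under assumptions: the node names follow the A to E labels of the slides, but the edges are invented for illustration, and random_surfer is a hypothetical helper.

```python
import random
from collections import Counter

# Hypothetical directed toy graph (edges are made up for illustration).
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
    "E": ["C", "D"],
}


def random_surfer(links, steps, seed=0):
    """Follow a uniformly random outgoing link at every step and count visits."""
    rng = random.Random(seed)
    node = rng.choice(sorted(links))
    visits = Counter()
    for _ in range(steps):
        visits[node] += 1
        node = rng.choice(links[node])
    return visits


visits = random_surfer(links, steps=10_000)
print(visits.most_common())
```

In this toy graph C is visited often because many nodes link to it, and A is visited often because its only incoming link comes from C. D and E are almost never visited, since no path leads back to them once the surfer has left.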
Computing PageRank
[Figure sequence: PageRank computation on the graph with nodes A, B, C, D, E]
Teleport
[Figure sequence: graph with nodes A, B, C, D, E; edges labeled (1-α) for the random walk and α for the teleport]
At regular node: invoke teleport operation with probability α and standard
random walk with probability (1-α)
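The teleport rule can be written as a short power iteration. This is a minimal sketch, assuming a uniform teleport target and an invented toy graph; alpha is the teleport probability α from the slide, and the function name pagerank is chosen here for illustration.

```python
# Hypothetical directed toy graph (edges are made up for illustration).
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
    "E": ["C", "D"],
}


def pagerank(links, alpha=0.15, iterations=50):
    """Power iteration: teleport with probability alpha, walk with 1 - alpha."""
    nodes = sorted(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Teleport share: jump to a uniformly random node.
        new_rank = {node: alpha / n for node in nodes}
        for node in nodes:
            # A node without outgoing links is treated as linking to every node.
            targets = links[node] or nodes
            share = (1 - alpha) * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


rank = pagerank(links)
for node, score in sorted(rank.items(), key=lambda item: -item[1]):
    print(node, round(score, 4))
```

Thanks to the teleport, even E, which has no incoming links, keeps a nonzero score of alpha / n.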
Personalized PageRank
[Figure sequence: graph with nodes A, B, C, D, E; edges labeled (1-α) and α]
At regular node: invoke teleport operation with probability α and standard
random walk with probability (1-α). When teleporting, go to target node
• Special case of PageRank with priors (distribution of weights
over the nodes)
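One way to read "PageRank with priors" is that the teleport step draws its destination from a prior distribution instead of uniformly. The sketch below assumes that reading; the toy graph, the function name personalized_pagerank, and the choice of E as target node are all illustrative.

```python
# Hypothetical directed toy graph (edges are made up for illustration).
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
    "E": ["C", "D"],
}


def personalized_pagerank(links, prior, alpha=0.15, iterations=100):
    """PageRank where teleporting lands on nodes according to `prior`."""
    nodes = sorted(links)
    total = sum(prior.get(node, 0.0) for node in nodes)
    teleport = {node: prior.get(node, 0.0) / total for node in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {node: alpha * teleport[node] for node in nodes}
        for node in nodes:
            targets = links[node]
            if targets:
                share = (1 - alpha) * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A node without outgoing links sends its mass back to the prior.
                for target in nodes:
                    new_rank[target] += (1 - alpha) * rank[node] * teleport[target]
        rank = new_rank
    return rank


# All teleports go to the target node E (personalized PageRank).
personalized = personalized_pagerank(links, prior={"E": 1.0})
# A uniform prior gives back ordinary PageRank.
uniform = personalized_pagerank(links, prior={node: 1.0 for node in links})
print(round(personalized["E"], 4), round(uniform["E"], 4))
```

With the prior concentrated on E, scores measure importance as seen from E rather than from the graph as a whole.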
Implementation
A partitioned, distributed graph processing engine
is significantly more complex and difficult to build
GraphX and GraphFrames (new in Spark
2.0)
• GraphX is to RDD as GraphFrames is to DataFrame
• GraphX is lower level, and the API is Scala-only. GraphFrames is
very new:
• It’s not designed to be a graph database like Neo4j. Nodes and
edges can contain metadata, but the query engine is not as
complete as Cypher
Advantages of GraphFrames
• GraphFrames has a Python API
• GraphFrames gives you simple querying for free: vertices and
edges are stored as DataFrames, so many queries are just
DataFrame (or SQL) queries
• It contains most of the algorithms in GraphX, but the API is
less well-tested
• You work in the PySpark shell instead of spark-shell
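A rough sketch of what this looks like from Python, under stated assumptions: `spark` is an active SparkSession started with the graphframes package available, the function name rank_accounts is hypothetical, and the accounts and edges are made up. The import is done inside the function so the file can be loaded without Spark installed.

```python
def rank_accounts(spark, source_id=None):
    """Build a tiny GraphFrame, query it as a DataFrame, and run PageRank.

    Assumes `spark` is an active SparkSession with the graphframes package
    on its classpath. All account names and edges below are invented.
    """
    from graphframes import GraphFrame  # lazy import, needs graphframes

    # Vertices need an "id" column; edges need "src" and "dst" columns.
    vertices = spark.createDataFrame(
        [("a", "alice"), ("b", "bob"), ("c", "carol")], ["id", "name"])
    edges = spark.createDataFrame(
        [("a", "b", "follows"), ("b", "c", "follows"), ("c", "a", "follows")],
        ["src", "dst", "relationship"])
    graph = GraphFrame(vertices, edges)

    # Simple querying for free: the edges are an ordinary DataFrame.
    follows = graph.edges.filter("relationship = 'follows'")

    # resetProbability plays the role of the teleport probability.
    # Passing sourceId asks for personalized PageRank from that vertex.
    options = {"resetProbability": 0.15, "maxIter": 10}
    if source_id is not None:
        options["sourceId"] = source_id
    ranked = graph.pageRank(**options)
    return follows, ranked.vertices.select("id", "pagerank")
```

In a PySpark shell this could be called as rank_accounts(spark) for ordinary PageRank, or rank_accounts(spark, source_id="a") for the personalized variant.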
Distributed PageRank
• Problem: Computing PageRank on graph too large for one
machine
• Algorithm:
– Shard edges randomly
– Compute PageRank on each machine
– Average the results
• Basic idea: Duplicate edges from low-degree nodes. Gives an
unbiased estimator
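The three steps can be sketched on a single machine, with a loop standing in for the cluster. This is a naive illustration only: it does not implement the edge duplication for low-degree nodes mentioned in the last bullet, so the averaged scores are a rough approximation. The toy graph and the function names are assumptions.

```python
import random


def pagerank(nodes, edges, alpha=0.15, iterations=50):
    """Plain power iteration over an edge list."""
    out = {node: [] for node in nodes}
    for src, dst in edges:
        out[src].append(dst)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: alpha / n for node in nodes}
        for node in nodes:
            # A node with no edges in this shard jumps uniformly.
            targets = out[node] or nodes
            share = (1 - alpha) * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


def sharded_pagerank(nodes, edges, num_shards=3, seed=0):
    """Shard edges randomly, rank each shard separately, average the results."""
    rng = random.Random(seed)
    shards = [[] for _ in range(num_shards)]
    for edge in edges:
        shards[rng.randrange(num_shards)].append(edge)
    # On a cluster each shard would be ranked on a different machine.
    results = [pagerank(nodes, shard) for shard in shards]
    return {node: sum(r[node] for r in results) / num_shards for node in nodes}


# Hypothetical toy graph (edges are made up for illustration).
nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A"),
         ("D", "C"), ("E", "C"), ("E", "D")]

estimate = sharded_pagerank(nodes, edges)
exact = pagerank(nodes, edges)
for node in nodes:
    print(node, round(estimate[node], 4), round(exact[node], 4))
```

Comparing the two printed columns shows how far the naive average drifts from the exact scores on a given graph.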
• Nodes: 41,652,230
• Edges: 1,468,365,182
Summary of implementation, benefits
• Graph theory is a really flexible way to represent a problem
• Data structures to represent graphs are mature
• You can now do out-of-core, distributed graph analysis
cheaply
• Implementations are there for even state-of-the-art methods
Summary, finding a problem
• We live in an age of abundance (methods, data, hardware, ideas)
• Finding the question is more than half of the battle
• I had about a week to prepare this talk, but I managed to put
together something that showcases what you can do with large
graphs today, and it could be effective as a startup idea
• My question is not great because you cannot demonstrate that it
works till you use it (common problem for unsupervised methods)
The question: When should I tweet
to influence the right account?

Or ‘beat Buffer at their own game’
References: Drawing graphs
• Graphs in this slide set have been drawn with Gephi
• If you use Zeppelin notebook, you can draw graphs with:
drawGraph(org.apache.spark.graphx.util.GraphGenerators.rmatGraph(sc, 32, 60))


25 videos explaining ML on Spark, 50 more
to come. A bunch on GraphX
• For people who already know ML
• http://datascienceretreat.com/videos/data-science-with-scala-and-spark
About learning new tech over seven
weekends
• You have time and enjoy using it to learn alone: learn it ‘the
hard way’
• You are extremely motivated and talented, have money: Apply
for DSR
• You want your weekends for yourself. You are already very
good but want to switch jobs. Apply for codekitt
Thanks!
Jose Quesada
Director, Data Science Retreat
jose@datascienceretreat.com
@quesada
http://datascienceretreat.com/
codekitt.com

 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men  🔝Thrissur🔝   Escor...
➥🔝 7737669865 🔝▻ Thrissur Call-girls in Women Seeking Men 🔝Thrissur🔝 Escor...
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts ServiceCall Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
Call Girls In Shalimar Bagh ( Delhi) 9953330565 Escorts Service
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 

Distributed processing of large graphs in Python

  • 1. Jose Quesada Director, Data Science Retreat jose@datascienceretreat.com @quesada
  • 3. • Mentors are world-class. CTOs, library authors, inventors, founders of fast-growing companies, etc • DSR accepts fewer than 5% of the applications • Strong focus on commercial awareness • 5 years of working experience on average • 30+ partner companies in Europe
  • 5. DSR participants do a portfolio project
  • 7. Why is DSR talking about Scala/Spark? They are b IBM is behind this They hired
  • 9. What is a good question?
  • 10. What is a good question? • Business case • Data available • Technology to answer the question is available • We know when the solution worked
  • 11. Does he look like a bitch?
  • 12. What is a good question? • Business case • Data available • Technology to answer the question is available • We know when the solution worked
  • 13. The question: When should I tweet to influence the right account?
 Or ‘beat Buffer at their own game’
  • 14. What is a good question? • Business case
  • 15. DJ J & MAX RECORDS
  • 23. Overlap Tweet hours Tweet frequency per UTC hour
  • 24. What is a good question? • Business case • Data available
  • 25. 24GB
  • 26. What is a good question? • Business case • Data available • Technology to answer the question is available
  • 27. What is a good question? • Business case • Data available • Technology to answer the question is available • We know when the solution worked
  • 28. Graph theory parts we can use to solve this problem
  • 29. Graph theory primer • Random walk • Shortest path • Sampling
  • 31. Sampling in networks. Note that sampling in networks is fraught with difficulties. One cannot simply sample the edges and nodes and expect the sample to be representative of the original network. In the graph below, a sample that missed node 1 or 2 would disconnect the two clusters and would not have the same properties as the original. [Figure labels: Node 11, Node 2]
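To make the sampling problem concrete, here is a minimal Python sketch. The toy graph, its node numbers, and the helper `connected_components` are assumptions made up for illustration; they are not the graph shown on the slide. Two clusters are joined only through the edge between nodes 1 and 2, and a sample that happens to miss node 1 splits the network in two.

```python
from collections import deque


def connected_components(nodes, edges):
    """Count connected components of an undirected graph using BFS."""
    adjacency = {n: set() for n in nodes}
    for u, v in edges:
        # Ignore edges that touch a node missing from the sample.
        if u in adjacency and v in adjacency:
            adjacency[u].add(v)
            adjacency[v].add(u)
    seen, count = set(), 0
    for start in nodes:
        if start in seen:
            continue
        count += 1
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
    return count


# Hypothetical toy graph: cluster {1, 3, 4, 5} and cluster {2, 6, 7, 8},
# connected only by the bridge edge (1, 2).
edges = [(1, 3), (1, 4), (1, 5), (3, 4), (4, 5),
         (2, 6), (2, 7), (2, 8), (6, 7), (7, 8),
         (1, 2)]
nodes = sorted({n for edge in edges for n in edge})

full = connected_components(nodes, edges)        # whole network
sample = [n for n in nodes if n != 1]            # a node sample that misses node 1
sampled = connected_components(sample, edges)    # same edges, restricted to the sample
```

In this toy case `full` is 1 and `sampled` is 2, so any statistic that depends on paths between the two clusters would come out differently on the sample.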
  • 37. Random surfer (nodes A, B, C, D, E). Visited more often: • Nodes with many incoming links • Nodes linked from frequently visited nodes
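The random surfer can be simulated in a few lines of Python. This is only a sketch: the link structure between A, B, C, D and E below is a hypothetical example (the slide's actual edges are not recoverable from the transcript), and `random_surfer` is a name chosen here for illustration.

```python
import random


def random_surfer(links, start, steps, seed=0):
    """Follow random outgoing links and count how often each node is visited."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    visits = {node: 0 for node in links}
    current = start
    for _ in range(steps):
        current = rng.choice(links[current])  # pick one outgoing link uniformly
        visits[current] += 1
    return visits


# Hypothetical directed graph: node -> list of nodes it links to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
    "E": ["C", "D"],
}

visits = random_surfer(links, start="A", steps=10_000)
```

In this toy graph C collects links from every other node and is visited far more often than B. D and E are never reached from A, because nothing in the reachable part of the graph links to them; a pure random walk can get stuck in one region like this.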
  • 48. Teleport (nodes A, B, C, D, E). At a regular node: invoke the teleport operation with probability α and a standard random walk step with probability (1-α).
  • 49. Personalized PageRank (nodes A, B, C, D, E). At a regular node: invoke the teleport operation with probability α and a standard random walk step with probability (1-α). When teleporting, go to the target node.
  • 51. Personalized PageRank • Special case of PageRank with priors (a distribution of weights over the nodes)
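A minimal power-iteration sketch of PageRank with priors, following the slides' convention that α is the teleport probability. It assumes every node has at least one outgoing link and that the priors sum to 1; the graph and the function name `pagerank` are hypothetical, chosen only for illustration.

```python
def pagerank(links, alpha=0.15, priors=None, iterations=100):
    """PageRank by power iteration.

    alpha  : probability of teleporting at each step
    priors : where a teleport lands (uniform if omitted); putting all the
             weight on one node gives personalized PageRank for that node
    Assumes every node in `links` has at least one outgoing link.
    """
    nodes = list(links)
    if priors is None:
        priors = {n: 1.0 / len(nodes) for n in nodes}
    rank = dict(priors)
    for _ in range(iterations):
        # Teleport part: alpha of the total mass is redistributed by the priors.
        new_rank = {n: alpha * priors[n] for n in nodes}
        # Random-walk part: each node splits (1 - alpha) of its rank over its links.
        for node, targets in links.items():
            share = (1 - alpha) * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical directed graph: node -> list of nodes it links to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
    "E": ["C", "D"],
}

global_rank = pagerank(links)
# Personalized for target node D: every teleport lands on D.
personal_rank = pagerank(links, priors={"A": 0, "B": 0, "C": 0, "D": 1.0, "E": 0})
```

With uniform priors this is ordinary PageRank. Concentrating the priors on D raises D's score relative to the global ranking, and E drops to zero because it has no incoming links and is never a teleport destination.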
  • 101. A partitioned, distributed graph processing engine is significantly more complex and difficult to build
  • 102. GraphX and GraphFrames (new in Spark 2.0) • GraphX is to RDD as GraphFrame is to DataFrame • GraphX is lower level, and its API is Scala-only • GraphFrames is very new • It is not designed to be a graph database like Neo4j: nodes and edges can carry metadata, but the query engine is not as complete as Cypher
  • 103. Advantages of GraphFrames • GraphFrames have a Python API • GraphFrames give you simple querying for free: because GraphFrame vertices and edges are stored as DataFrames, many queries are just DataFrame (or SQL) queries • They contain most of the algorithms in GraphX, but the API is less well-tested • You can work in the PySpark shell instead of spark-shell
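As a rough sketch of what "queries are just DataFrame queries" looks like from Python, the function below builds a tiny follower graph and queries it. It assumes a running SparkSession with the graphframes package available; the account names and the function name `demo_graphframe_queries` are made up for illustration. GraphFrames expects an `id` column on vertices and `src`/`dst` columns on edges.

```python
def demo_graphframe_queries(spark):
    """Build a tiny follower graph and query it; `spark` is an active SparkSession."""
    # Imported inside the function so this file still loads where
    # graphframes is not installed.
    from graphframes import GraphFrame

    # Hypothetical data: three accounts and who follows whom.
    vertices = spark.createDataFrame(
        [("a", "Alice"), ("b", "Bob"), ("c", "Carol")],
        ["id", "name"],
    )
    edges = spark.createDataFrame(
        [("a", "b", "follows"), ("c", "b", "follows"), ("b", "a", "follows")],
        ["src", "dst", "relationship"],
    )
    graph = GraphFrame(vertices, edges)

    # Plain DataFrame queries on the graph's parts.
    followers = graph.inDegrees.orderBy("inDegree", ascending=False)
    follows_of_a = graph.edges.filter("src = 'a'")

    # A built-in graph algorithm; the result is another GraphFrame whose
    # vertices carry a `pagerank` column.
    ranked = graph.pageRank(resetProbability=0.15, maxIter=10)
    return followers, follows_of_a, ranked.vertices.select("id", "pagerank")
```

In the PySpark shell the session is already available as `spark`, so calling `demo_graphframe_queries(spark)` and then `.show()` on each returned DataFrame is enough to try it.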
  • 104. Distributed PageRank • Problem: computing PageRank on a graph too large for one machine • Algorithm: shard the edges randomly, compute on each machine, average the results • Basic idea: duplicate edges from low-degree nodes, which gives an unbiased estimator
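The shard, compute, average loop can be sketched on a single machine as follows. This is a naive illustration under stated assumptions, not the algorithm from the talk: the shards are processed one after another instead of on separate machines, the edge list is a made-up example, and the step of duplicating edges from low-degree nodes (which the slide credits for the unbiased estimate) is left out because the slide does not spell it out.

```python
import random


def pagerank_edges(nodes, edges, alpha=0.15, iterations=50):
    """PageRank over an edge list; nodes without out-edges spread their rank uniformly."""
    out = {v: [] for v in nodes}
    for src, dst in edges:
        out[src].append(dst)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        dangling = sum(rank[v] for v in nodes if not out[v])
        # Teleport mass plus the mass of dangling nodes, shared by everyone.
        base = (alpha + (1 - alpha) * dangling) / n
        new_rank = {v: base for v in nodes}
        for v in nodes:
            if out[v]:
                share = (1 - alpha) * rank[v] / len(out[v])
                for dst in out[v]:
                    new_rank[dst] += share
        rank = new_rank
    return rank


def sharded_pagerank(nodes, edges, shards=3, seed=0):
    """Assign each edge to a random shard, rank each shard alone, average the results."""
    rng = random.Random(seed)
    buckets = [[] for _ in range(shards)]
    for edge in edges:
        buckets[rng.randrange(shards)].append(edge)
    # Each call below stands in for the work one machine would do.
    partial = [pagerank_edges(nodes, bucket) for bucket in buckets]
    return {v: sum(p[v] for p in partial) / shards for v in nodes}


# Hypothetical directed edge list (source, destination).
nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A"),
         ("D", "C"), ("E", "C"), ("E", "D")]

exact = pagerank_edges(nodes, edges)
approx = sharded_pagerank(nodes, edges)
```

Comparing `approx` with `exact` on a small graph shows how much the naive average drifts from the full computation, which is the gap the edge-duplication idea is meant to address.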
  • 105. • Nodes: 41,652,230 • Edges: 1,468,365,182
  • 108. Summary of implementation, benefits • Graph theory is a really flexible way to represent a problem • Data structures to represent graphs are mature • You can now do out-of-core, distributed graph analysis cheaply • Implementations exist even for state-of-the-art methods
  • 109. Summary, finding a problem • We live in an age of abundance (methods, data, hardware, ideas) • Finding the question is more than half of the battle • I had about a week to prepare this talk, but I managed to put together something that showcases what you can do with large graphs today, and it could be effective as a startup idea • My question is not great because you cannot demonstrate that it works till you use it (common problem for unsupervised methods)
  • 110. The question: When should I tweet to influence the right account?
 Or ‘beat Buffer at their own game’
  • 111. References: Drawing graphs • Graphs in this slide set have been drawn with Gephi • If you use a Zeppelin notebook, you can draw graphs with: drawGraph(org.apache.spark.graphx.util.GraphGenerators.rmatGraph(sc, 32, 60))

  • 112. 25 videos explaining ML on Spark, 50 more to come. A bunch on GraphX • For people who already know ML • http://datascienceretreat.com/videos/data-science-with-scala-and-spark
  • 113. About learning new tech over seven weekends…
  • 114. About learning new tech over seven weekends • You have time and enjoy using it to learn alone: learn it ‘the hard way’ • You are extremely motivated and talented, have money: Apply for DSR • You want your weekends for yourself. You are already very good but want to switch jobs. Apply for codekitt
  • 115. Thanks! Jose Quesada Director, Data Science Retreat jose@datascienceretreat.com @quesada http://datascienceretreat.com/ codekitt.com