SlideShare una empresa de Scribd logo
1 de 35
Descargar para leer sin conexión
Scala SWAT
Artur Bańkowski
artur@evojam.com
@abankowski
TACKLING
A 1 BILLION MEMBER
SOCIAL NETWORK
This is a story about
our participation in
one of our customer’s
project
1
Previous Experience
~10 millions records
~60 millions documents
~60 million vertices graph (non-commercial)
Datasets I have previously worked on were much smaller!
This time we were about to deal with a much bigger social network, which
can easily be represented as a graph. We were going to join the team…
Graph structure
User
User
User
Foo Inc.
User
User
User
worked
worked
knows
knows
knows
knows
Bar Ltd.
worked
knows
2 types of vertices
2 types of relations
Companies and Users
User
User
User
Foo Inc.
User
User
User
worked
worked
knows
knows
knows
knows
The concept: provide graph subset
Bar Ltd.
worked
Foo Inc.
User
User
User
User
worked
worked
knows
knows
knows
knows
knows
Fulltext searchable,
sortable, filterable
result for Foo Inc.
1 billion challenge
835,272,759
835,272,759 vertices
751,857,081 users
83,415,678 companies
6,956,990,209 relations
Subsets from thousands to millions users
When we finally put
our hands on the data,
we have realized that
datasize is smaller
than expected
What more have we
found?
Existing app workflow
One Engineer-Evening
Multiple custom scripts
JSON files
extract subset
eg: 3m profiles
750m profiles
data duplication
only manually selected
60m incl. duplicates
Manual process,
takes few days from
user perspective
This was already used by final customers
PoC - The Goal
Handle 1 billion profiles automatically
Our primary goal
PoC - Definition of done
• Graph traversable on demand
• First results available under 1 minute
• Entire subset ready in few minutes
REST API with search
PoC - Concept
Graph DB
Document
DB
API
6,956,990,209 relations
751,857,081 profiles
Whole dataset
stored in two
engines
All relations
All profiles
Why graph DB?
•Fast graph traversal
•Easily extendable with new relations and vertices
•Convenient algorithm description
PoC - Flow
Graph DB
Document
DB
API
1. Generate subset
2. Tag documents
3. Perform search with
tag as a filter
Generate subset from
graph with users and
companies relations
Tag documents
with unique id in
database with all
profiles
Search with
unique id as a
primary filter
Weapon of choice
? ?Graph DB
Document
DB
API
Scala and Play were
natural choice for API app
Some research
required for
databases
First Steps: Extraction
Extract anonymized
profiles, companies
and relations
Cleanup data, sort
and generate input
files
few days to pull, streaming
with Akka and Slick
To make a research we had to put our hands on the real data
First Steps: Pushing forward
Push profiles for
searching purposes
Push vertices and
relations for traversal
Document
DB
Graph DB
two tools,
highly dependent on db engines
Fulltext Searchable Document DB
• Mature
• Horizontally scalable
• Fast indexing (~3k documents per second on the single node)
• Well documented
• With scala libraries:
• https://github.com/sksamuel/elastic4s
• https://github.com/evojam/play-elastic4s
We already had significant experience with scaling Elastic Search
for 80 millions
Which graph DB?
Considered three engines,
OrientDB and Neo4j in
community editions
Goodbye Titan
• Performed well in ~60M
tests
• fast traversing
• Development has been put
on hold?
• No batch insertion, slow initial
load
• Stalled writes after 200
millions of vertices with
relations
• Horror stories in the internet
Neo4j FTW
https://github.com/AnormCypher/AnormCypher
• Fast
• Already known
• Convenient in Scala with
AnormCypher
• Offline bulk loading
• Result streaming
Weapon of choice
Graph DB
Document
DB
API
Final stack :)
Final Setup on AWS
Neo4j
API
hr-1 hr-2
ES Cluster
i2.xlarge
4vCPU 30.5GB
i2.2xlarge
8vCPU 61GB
m4.large
2vCPU 8GB
• 2 nodes
• 2 indexes
• 10 shards each index
• 0 replicas
Step #1 - Bulk loading into Neo4j
Importing the contents of these files into data/graph.db.
[…]
IMPORT DONE in 3h 24m 58s 140ms. Imported:
 835273352 nodes
 6956990209 relationships
 0 properties
Importing the contents of these files into data/graph.db.
[…]
IMPORT DONE in 3h 24m 58s 140ms. Imported:
 835273352 nodes
 6956990209 relationships
 0 properties
Importing the contents of these files into data/graph.db.
[…]
IMPORT DONE in 3h 24m 58s 140ms. Imported:
 835273352 nodes
 6956990209 relationships
 0 properties
Importing the contents of these files into data/graph.db.
[…]
IMPORT DONE in 3h 24m 58s 140ms. Imported:
 835273352 nodes
 6956990209 relationships
 0 properties
It took 12 hours on 2 times smaller Amazon instance
Step #2 - Bulk loading into ES
grouped insert
1. Create Source from CSV file, frame by n
2. Decode id from ByteString, generate user json
3. Group by the bulk size (eg.: 4500)
4. Throttle
5. Execute bulk insert into the ElasticSearch
throttle
Akka advantage:
CPU utilization,
parallel data
enrichment,
human readable
Step #2 - Bulk loading into ES
FileIO.fromFile(new File(sourceOfUserIds))

.via(
Framing.delimiter(ByteString('n'),
maximumFrameLength = 1024,
allowTruncation = false))

.mapAsyncUnordered(16)(prepareUserJson)

.grouped(4500)

.throttle(

elements = 2,

per = 2 seconds,

maximumBurst = 2,

mode = ThrottleMode.Shaping)

.mapAsyncUnordered(2)(executeBulkInsert)

.runWith(Sink.ignore)
Flow description
with Akka
Streams
User
User
User
Foo Inc.
User
User
User
worked
worked
knows
knows
knows
knows
Step #3 - Tagging
Bar Ltd.
worked
Foo Inc.
User
User
User
User
worked
worked
knows
knows
knows
knows
knows
MATCH (c:Company)-[:worked]-(employees:User)-[:knows]-(aquitance:User)

WHERE ID(c)={foo-inc-id} AND NOT (aquitance)-[:worked]-(c)

RETURN DISTINCT ID(aquitance)
Neo4j traversal query in CYPHER
Akka Streams to the rescue
val idEnum : Enumeratee[CypherRow] = _
val src =
Source.fromPublisher(Streams.enumeratorToPublisher(idEnum))

.map(_.data.head.asInstanceOf[BigDecimal].toInt)

.via(new TimeoutOnEmptyBuffer())

.map(UserId(_))
.mapAsyncUnordered(parallelism = 1)(id =>
dao.tagProfiles(id, companyId))
Readable flow
Buffering to protect Neo4j when indexing is too slow
Timeout due to the bug in the underlying implementation
Bottleneck
1.5 hour for 3 million subset, that’s too long!
Bulk update with AkkaStream tuning
src
.grouped(20000)

.throttle(
elements = 2,
per = 6 second,
maximumBurst = 2,
mode = ThrottleMode.Shaping)
.mapAsyncUnordered(parallelism = 1)(ids =>
dao.bulkTag(ids, ...))
Bulk tagging,
few lines
Tagging Foo-Company
~14 seconds until first batch is tagged
~7 minutes 11 seconds until all tagged
reference implementation - few hours
2,222,840 profiles matching criteria
Sample subset :)
Step #5 - Search
Neo4j
API
hr-1 hr-2
ES Cluster
i2.xlarge
4vCPU 30.5GBi2.2xlarge
8vCPU 61GB
m4.large
2vCPU 8GB
With data ready in ES
search implementation was
pretty straightforward
Step #5 - Search benchmark
Fulltext search on 2 millions subset
GET /users?company=foo-company&phrase=John
2000 phrases for the benchmarking
Response JSON with 50 profiles
Searching in 750 millions database
Phrases based on
real names and
surnames, used
during profiles
enrichment
Random
requests ordering
Step #5 - Search under siege
Response time ~ 0.14s50 concurrent users
constant
latency
constant
search rate
Objective achieved
search response time from API ~0.14s
14 seconds until first batch is tagged
7 minutes until 2 millions ready
Summary
reference implementation PoC
manual automatic
few days 14 seconds
few days 7 minutes
~40 millions ~750 millions
no analytics GraphX ready
Tool ready for data scientists
W can implement core traversal modifications almost instantly
Summary
https://snap.stanford.edu/data/
http://neo4j.com/
https://playframework.com/
http://doc.akka.io/docs/akka-stream-and-http-experimental/2.0/scala/stream-index.html
Try at home! Sample graphs
ready to use
https://www.elastic.co/
Artur Bańkowski
artur@evojam.com
@abankowski

Más contenido relacionado

La actualidad más candente

[Japanese] How Reactive Streams and Akka Streams change the JVM Ecosystem @ R...
[Japanese] How Reactive Streams and Akka Streams change the JVM Ecosystem @ R...[Japanese] How Reactive Streams and Akka Streams change the JVM Ecosystem @ R...
[Japanese] How Reactive Streams and Akka Streams change the JVM Ecosystem @ R...Konrad Malawski
 
Kafka and Storm - event processing in realtime
Kafka and Storm - event processing in realtimeKafka and Storm - event processing in realtime
Kafka and Storm - event processing in realtimeGuido Schmutz
 
Managing your Black Friday Logs NDC Oslo
Managing your  Black Friday Logs NDC OsloManaging your  Black Friday Logs NDC Oslo
Managing your Black Friday Logs NDC OsloDavid Pilato
 
Managing your black friday logs Voxxed Luxembourg
Managing your black friday logs Voxxed LuxembourgManaging your black friday logs Voxxed Luxembourg
Managing your black friday logs Voxxed LuxembourgDavid Pilato
 
End to End Akka Streams / Reactive Streams - from Business to Socket
End to End Akka Streams / Reactive Streams - from Business to SocketEnd to End Akka Streams / Reactive Streams - from Business to Socket
End to End Akka Streams / Reactive Streams - from Business to SocketKonrad Malawski
 
Genomic Computation at Scale with Serverless, StackStorm and Docker Swarm
Genomic Computation at Scale with Serverless, StackStorm and Docker SwarmGenomic Computation at Scale with Serverless, StackStorm and Docker Swarm
Genomic Computation at Scale with Serverless, StackStorm and Docker SwarmDmitri Zimine
 
Real-Time Big Data at In-Memory Speed, Using Storm
Real-Time Big Data at In-Memory Speed, Using StormReal-Time Big Data at In-Memory Speed, Using Storm
Real-Time Big Data at In-Memory Speed, Using StormNati Shalom
 
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10 an integration story
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story[Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10 an integration storyJoan Viladrosa Riera
 
Serverless on OpenStack with Docker Swarm, Mistral, and StackStorm
Serverless on OpenStack with Docker Swarm, Mistral, and StackStormServerless on OpenStack with Docker Swarm, Mistral, and StackStorm
Serverless on OpenStack with Docker Swarm, Mistral, and StackStormDmitri Zimine
 
Spark summit-east-dowling-feb2017-full
Spark summit-east-dowling-feb2017-fullSpark summit-east-dowling-feb2017-full
Spark summit-east-dowling-feb2017-fullJim Dowling
 
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...Folio3 Software
 
Real-time streaming and data pipelines with Apache Kafka
Real-time streaming and data pipelines with Apache KafkaReal-time streaming and data pipelines with Apache Kafka
Real-time streaming and data pipelines with Apache KafkaJoe Stein
 
How Reactive Streams & Akka Streams change the JVM Ecosystem
How Reactive Streams & Akka Streams change the JVM EcosystemHow Reactive Streams & Akka Streams change the JVM Ecosystem
How Reactive Streams & Akka Streams change the JVM EcosystemKonrad Malawski
 
Putting the 'I' in IoT - Building Digital Twins with Akka Microservices
Putting the 'I' in IoT - Building Digital Twins with Akka MicroservicesPutting the 'I' in IoT - Building Digital Twins with Akka Microservices
Putting the 'I' in IoT - Building Digital Twins with Akka MicroservicesLightbend
 

La actualidad más candente (20)

[Japanese] How Reactive Streams and Akka Streams change the JVM Ecosystem @ R...
[Japanese] How Reactive Streams and Akka Streams change the JVM Ecosystem @ R...[Japanese] How Reactive Streams and Akka Streams change the JVM Ecosystem @ R...
[Japanese] How Reactive Streams and Akka Streams change the JVM Ecosystem @ R...
 
Kafka and Storm - event processing in realtime
Kafka and Storm - event processing in realtimeKafka and Storm - event processing in realtime
Kafka and Storm - event processing in realtime
 
Managing your Black Friday Logs NDC Oslo
Managing your  Black Friday Logs NDC OsloManaging your  Black Friday Logs NDC Oslo
Managing your Black Friday Logs NDC Oslo
 
Testing at Stream-Scale
Testing at Stream-ScaleTesting at Stream-Scale
Testing at Stream-Scale
 
Managing your black friday logs Voxxed Luxembourg
Managing your black friday logs Voxxed LuxembourgManaging your black friday logs Voxxed Luxembourg
Managing your black friday logs Voxxed Luxembourg
 
Spark streaming + kafka 0.10
Spark streaming + kafka 0.10Spark streaming + kafka 0.10
Spark streaming + kafka 0.10
 
End to End Akka Streams / Reactive Streams - from Business to Socket
End to End Akka Streams / Reactive Streams - from Business to SocketEnd to End Akka Streams / Reactive Streams - from Business to Socket
End to End Akka Streams / Reactive Streams - from Business to Socket
 
Genomic Computation at Scale with Serverless, StackStorm and Docker Swarm
Genomic Computation at Scale with Serverless, StackStorm and Docker SwarmGenomic Computation at Scale with Serverless, StackStorm and Docker Swarm
Genomic Computation at Scale with Serverless, StackStorm and Docker Swarm
 
Real-Time Big Data at In-Memory Speed, Using Storm
Real-Time Big Data at In-Memory Speed, Using StormReal-Time Big Data at In-Memory Speed, Using Storm
Real-Time Big Data at In-Memory Speed, Using Storm
 
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10 an integration story
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story[Spark Summit EU 2017] Apache spark streaming + kafka 0.10  an integration story
[Spark Summit EU 2017] Apache spark streaming + kafka 0.10 an integration story
 
Serverless on OpenStack with Docker Swarm, Mistral, and StackStorm
Serverless on OpenStack with Docker Swarm, Mistral, and StackStormServerless on OpenStack with Docker Swarm, Mistral, and StackStorm
Serverless on OpenStack with Docker Swarm, Mistral, and StackStorm
 
Spark summit-east-dowling-feb2017-full
Spark summit-east-dowling-feb2017-fullSpark summit-east-dowling-feb2017-full
Spark summit-east-dowling-feb2017-full
 
Apache Spark v3.0.0
Apache Spark v3.0.0Apache Spark v3.0.0
Apache Spark v3.0.0
 
Have your cake and eat it too
Have your cake and eat it tooHave your cake and eat it too
Have your cake and eat it too
 
Introduction to Storm
Introduction to StormIntroduction to Storm
Introduction to Storm
 
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
 
Tutorial Kafka-Storm
Tutorial Kafka-StormTutorial Kafka-Storm
Tutorial Kafka-Storm
 
Real-time streaming and data pipelines with Apache Kafka
Real-time streaming and data pipelines with Apache KafkaReal-time streaming and data pipelines with Apache Kafka
Real-time streaming and data pipelines with Apache Kafka
 
How Reactive Streams & Akka Streams change the JVM Ecosystem
How Reactive Streams & Akka Streams change the JVM EcosystemHow Reactive Streams & Akka Streams change the JVM Ecosystem
How Reactive Streams & Akka Streams change the JVM Ecosystem
 
Putting the 'I' in IoT - Building Digital Twins with Akka Microservices
Putting the 'I' in IoT - Building Digital Twins with Akka MicroservicesPutting the 'I' in IoT - Building Digital Twins with Akka Microservices
Putting the 'I' in IoT - Building Digital Twins with Akka Microservices
 

Destacado

OrientDB & Lucene
OrientDB & LuceneOrientDB & Lucene
OrientDB & Lucenewolf4ood
 
Aws cost optimization: lessons learned, strategies, tips and tools
Aws cost optimization: lessons learned, strategies, tips and toolsAws cost optimization: lessons learned, strategies, tips and tools
Aws cost optimization: lessons learned, strategies, tips and toolsFelipe
 
Cloudwatch: Monitoring your Services with Metrics and Alarms
Cloudwatch: Monitoring your Services with Metrics and AlarmsCloudwatch: Monitoring your Services with Metrics and Alarms
Cloudwatch: Monitoring your Services with Metrics and AlarmsFelipe
 
Cloudwatch: Monitoring your AWS services with Metrics and Alarms
Cloudwatch: Monitoring your AWS services with Metrics and AlarmsCloudwatch: Monitoring your AWS services with Metrics and Alarms
Cloudwatch: Monitoring your AWS services with Metrics and AlarmsFelipe
 
Benchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionBenchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionSymeon Papadopoulos
 
Elasticsearch for Data Analytics
Elasticsearch for Data AnalyticsElasticsearch for Data Analytics
Elasticsearch for Data AnalyticsFelipe
 
Online Machine Learning: introduction and examples
Online Machine Learning:  introduction and examplesOnline Machine Learning:  introduction and examples
Online Machine Learning: introduction and examplesFelipe
 
Reactive integrations with Akka Streams
Reactive integrations with Akka StreamsReactive integrations with Akka Streams
Reactive integrations with Akka StreamsKonrad Malawski
 
OrientDB vs Neo4j - Comparison of query/speed/functionality
OrientDB vs Neo4j - Comparison of query/speed/functionalityOrientDB vs Neo4j - Comparison of query/speed/functionality
OrientDB vs Neo4j - Comparison of query/speed/functionalityCurtis Mosters
 

Destacado (10)

OrientDB & Lucene
OrientDB & LuceneOrientDB & Lucene
OrientDB & Lucene
 
Aws cost optimization: lessons learned, strategies, tips and tools
Aws cost optimization: lessons learned, strategies, tips and toolsAws cost optimization: lessons learned, strategies, tips and tools
Aws cost optimization: lessons learned, strategies, tips and tools
 
Cloudwatch: Monitoring your Services with Metrics and Alarms
Cloudwatch: Monitoring your Services with Metrics and AlarmsCloudwatch: Monitoring your Services with Metrics and Alarms
Cloudwatch: Monitoring your Services with Metrics and Alarms
 
Cloudwatch: Monitoring your AWS services with Metrics and Alarms
Cloudwatch: Monitoring your AWS services with Metrics and AlarmsCloudwatch: Monitoring your AWS services with Metrics and Alarms
Cloudwatch: Monitoring your AWS services with Metrics and Alarms
 
Benchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detectionBenchmarking graph databases on the problem of community detection
Benchmarking graph databases on the problem of community detection
 
Elasticsearch for Data Analytics
Elasticsearch for Data AnalyticsElasticsearch for Data Analytics
Elasticsearch for Data Analytics
 
Online Machine Learning: introduction and examples
Online Machine Learning:  introduction and examplesOnline Machine Learning:  introduction and examples
Online Machine Learning: introduction and examples
 
Scala Days NYC 2016
Scala Days NYC 2016Scala Days NYC 2016
Scala Days NYC 2016
 
Reactive integrations with Akka Streams
Reactive integrations with Akka StreamsReactive integrations with Akka Streams
Reactive integrations with Akka Streams
 
OrientDB vs Neo4j - Comparison of query/speed/functionality
OrientDB vs Neo4j - Comparison of query/speed/functionalityOrientDB vs Neo4j - Comparison of query/speed/functionality
OrientDB vs Neo4j - Comparison of query/speed/functionality
 

Similar a Tackling a 1 billion member social network

Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...BigDataEverywhere
 
What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017 What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017 Databricks
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksDatabricks
 
Off-Label Data Mesh: A Prescription for Healthier Data
Off-Label Data Mesh: A Prescription for Healthier DataOff-Label Data Mesh: A Prescription for Healthier Data
Off-Label Data Mesh: A Prescription for Healthier DataHostedbyConfluent
 
Introduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterIntroduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterPaolo Castagna
 
What's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareWhat's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareDatabricks
 
Hadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game ForeverHadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game ForeverDataWorks Summit
 
Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei ZahariaDeep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei ZahariaGoDataDriven
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformEva Tse
 
How a Small Team Scales Instagram
How a Small Team Scales InstagramHow a Small Team Scales Instagram
How a Small Team Scales InstagramC4Media
 
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...Landon Robinson
 
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringDatabricks
 
Scio - Moving to Google Cloud, A Spotify Story
 Scio - Moving to Google Cloud, A Spotify Story Scio - Moving to Google Cloud, A Spotify Story
Scio - Moving to Google Cloud, A Spotify StoryNeville Li
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data PlatformAmazon Web Services
 
AI 클라우드로 완전 정복하기 - 데이터 분석부터 딥러닝까지 (윤석찬, AWS테크에반젤리스트)
AI 클라우드로 완전 정복하기 - 데이터 분석부터 딥러닝까지 (윤석찬, AWS테크에반젤리스트)AI 클라우드로 완전 정복하기 - 데이터 분석부터 딥러닝까지 (윤석찬, AWS테크에반젤리스트)
AI 클라우드로 완전 정복하기 - 데이터 분석부터 딥러닝까지 (윤석찬, AWS테크에반젤리스트)Amazon Web Services Korea
 
Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco IntercloudRick Bilodeau
 
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco IntercloudStreamsets Inc.
 

Similar a Tackling a 1 billion member social network (20)

Yahoo compares Storm and Spark
Yahoo compares Storm and SparkYahoo compares Storm and Spark
Yahoo compares Storm and Spark
 
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...
 
What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017 What to Expect for Big Data and Apache Spark in 2017
What to Expect for Big Data and Apache Spark in 2017
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Off-Label Data Mesh: A Prescription for Healthier Data
Off-Label Data Mesh: A Prescription for Healthier DataOff-Label Data Mesh: A Prescription for Healthier Data
Off-Label Data Mesh: A Prescription for Healthier Data
 
Big data apache spark + scala
Big data   apache spark + scalaBig data   apache spark + scala
Big data apache spark + scala
 
Introduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matterIntroduction to apache kafka, confluent and why they matter
Introduction to apache kafka, confluent and why they matter
 
What's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareWhat's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You Care
 
Data Science
Data ScienceData Science
Data Science
 
Hadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game ForeverHadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game Forever
 
Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei ZahariaDeep learning and streaming in Apache Spark 2.2 by Matei Zaharia
Deep learning and streaming in Apache Spark 2.2 by Matei Zaharia
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data Platform
 
How a Small Team Scales Instagram
How a Small Team Scales InstagramHow a Small Team Scales Instagram
How a Small Team Scales Instagram
 
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
 
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
 
Scio - Moving to Google Cloud, A Spotify Story
 Scio - Moving to Google Cloud, A Spotify Story Scio - Moving to Google Cloud, A Spotify Story
Scio - Moving to Google Cloud, A Spotify Story
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
 
AI 클라우드로 완전 정복하기 - 데이터 분석부터 딥러닝까지 (윤석찬, AWS테크에반젤리스트)
AI 클라우드로 완전 정복하기 - 데이터 분석부터 딥러닝까지 (윤석찬, AWS테크에반젤리스트)AI 클라우드로 완전 정복하기 - 데이터 분석부터 딥러닝까지 (윤석찬, AWS테크에반젤리스트)
AI 클라우드로 완전 정복하기 - 데이터 분석부터 딥러닝까지 (윤석찬, AWS테크에반젤리스트)
 
Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets at Cisco Intercloud
 
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco IntercloudCase Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
Case Study: Elasticsearch Ingest Using StreamSets @ Cisco Intercloud
 

Último

APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSKurinjimalarL3
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...Call Girls in Nagpur High Profile
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingrknatarajan
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Christo Ananth
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performancesivaprakash250
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVRajaP95
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escortsranjana rawat
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...ranjana rawat
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)Suman Mia
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Dr.Costas Sachpazis
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsCall Girls in Nagpur High Profile
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations120cr0395
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxupamatechverse
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 

Último (20)

APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IVHARMONY IN THE NATURE AND EXISTENCE - Unit-IV
HARMONY IN THE NATURE AND EXISTENCE - Unit-IV
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
(MEERA) Dapodi Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Escorts
 
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
(SHREYA) Chakan Call Girls Just Call 7001035870 [ Cash on Delivery ] Pune Esc...
 
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINEDJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
DJARUM4D - SLOT GACOR ONLINE | SLOT DEMO ONLINE
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Meera Call 7001035870 Meet With Nagpur Escorts
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 

Tackling a 1 billion member social network

  • 1. Scala SWAT Artur Bańkowski artur@evojam.com @abankowski TACKLING A 1 BILLION MEMBER SOCIAL NETWORK This is a story about our participation in one of our customer’s project 1
  • 2. Previous Experience ~10 millions records ~60 millions documents ~60 million vertices graph (non-commercial) Datasets I have previously worked on were much smaller! This time we were about to deal with a much bigger social network, which can easily be represented as a graph. We were going to join the team…
  • 3. Graph structure User User User Foo Inc. User User User worked worked knows knows knows knows Bar Ltd. worked knows 2 types of vertices 2 types of relations Companies and Users
  • 4. User User User Foo Inc. User User User worked worked knows knows knows knows The concept: provide graph subset Bar Ltd. worked Foo Inc. User User User User worked worked knows knows knows knows knows Fulltext searchable, sortable, filterable result for Foo Inc.
  • 5. 1 billion challenge 835,272,759 835,272,759 vertices 751,857,081 users 83,415,678 companies 6,956,990,209 relations Subsets from thousands to millions users When we finally put our hands on the data, we have realized that datasize is smaller than expected What more have we found?
  • 6. Existing app workflow One Engineer-Evening Multiple custom scripts JSON files extract subset eg: 3m profiles 750m profiles data duplication only manually selected 60m incl. duplicates Manual process, takes few days from user perspective This was already used by final customers
  • 7. PoC - The Goal Handle 1 billion profiles automatically Our primary goal
  • 8. PoC - Definition of done • Graph traversable on demand • First results available under 1 minute • Entire subset ready in few minutes REST API with search
  • 9. PoC - Concept Graph DB Document DB API 6,956,990,209 relations 751,857,081 profiles Whole dataset stored in two engines All relations All profiles
  • 10. Why graph DB? •Fast graph traversal •Easily extendable with new relations and vertices •Convenient algorithm description
  • 11. PoC - Flow Graph DB Document DB API 1. Generate subset 2. Tag documents 3. Perform search with tag as a filter Generate subset from graph with users and companies relations Tag documents with unique id in database with all profiles Search with unique id as a primary filter
  • 12. Weapon of choice ? ?Graph DB Document DB API Scala and Play were natural choice for API app Some research required for databases
  • 13. First Steps: Extraction Extract anonymized profiles, companies and relations Cleanup data, sort and generate input files few days to pull, streaming with Akka and Slick To make a research we had to put our hands on the real data
  • 14. First Steps: Pushing forward Push profiles for searching purposes Push vertices and relations for traversal Document DB Graph DB two tools, highly dependent on db engines
  • 15. Fulltext Searchable Document DB • Mature • Horizontally scalable • Fast indexing (~3k documents per second on the single node) • Well documented • With scala libraries: • https://github.com/sksamuel/elastic4s • https://github.com/evojam/play-elastic4s We already had significant experience with scaling Elastic Search for 80 millions
  • 16. Which graph DB? Considered three engines, OrientDB and Neo4j in community editions
  • 17. Goodbye Titan • Performed well in ~60M tests • fast traversing • Development has been put on hold?
  • 18. • No batch insertion, slow initial load • Stalled writes after 200 millions of vertices with relations • Horror stories in the internet
  • 19. Neo4j FTW https://github.com/AnormCypher/AnormCypher • Fast • Already known • Convenient in Scala with AnormCypher • Offline bulk loading • Result streaming
  • 20. Weapon of choice Graph DB Document DB API Final stack :)
  • 21. Final Setup on AWS Neo4j API hr-1 hr-2 ES Cluster i2.xlarge 4vCPU 30.5GB i2.2xlarge 8vCPU 61GB m4.large 2vCPU 8GB • 2 nodes • 2 indexes • 10 shards each index • 0 replicas
  • 22. Step #1 - Bulk loading into Neo4j Importing the contents of these files into data/graph.db. […] IMPORT DONE in 3h 24m 58s 140ms. Imported:  835273352 nodes  6956990209 relationships  0 properties Importing the contents of these files into data/graph.db. […] IMPORT DONE in 3h 24m 58s 140ms. Imported:  835273352 nodes  6956990209 relationships  0 properties Importing the contents of these files into data/graph.db. […] IMPORT DONE in 3h 24m 58s 140ms. Imported:  835273352 nodes  6956990209 relationships  0 properties Importing the contents of these files into data/graph.db. […] IMPORT DONE in 3h 24m 58s 140ms. Imported:  835273352 nodes  6956990209 relationships  0 properties It took 12 hours on 2 times smaller Amazon instance
  • 23. Step #2 - Bulk loading into ES grouped insert 1. Create Source from CSV file, frame by n 2. Decode id from ByteString, generate user json 3. Group by the bulk size (eg.: 4500) 4. Throttle 5. Execute bulk insert into the ElasticSearch throttle Akka advantage: CPU utilization, parallel data enrichment, human readable
  • 24. Step #2 - Bulk loading into ES FileIO.fromFile(new File(sourceOfUserIds))
 .via( Framing.delimiter(ByteString('n'), maximumFrameLength = 1024, allowTruncation = false))
 .mapAsyncUnordered(16)(prepareUserJson)
 .grouped(4500)
 .throttle(
 elements = 2,
 per = 2 seconds,
 maximumBurst = 2,
 mode = ThrottleMode.Shaping)
 .mapAsyncUnordered(2)(executeBulkInsert)
 .runWith(Sink.ignore) Flow description with Akka Streams
  • 25. User User User Foo Inc. User User User worked worked knows knows knows knows Step #3 - Tagging Bar Ltd. worked Foo Inc. User User User User worked worked knows knows knows knows knows MATCH (c:Company)-[:worked]-(employees:User)-[:knows]-(aquitance:User)
 WHERE ID(c)={foo-inc-id} AND NOT (aquitance)-[:worked]-(c)
 RETURN DISTINCT ID(aquitance) Neo4j traversal query in CYPHER
  • 26. Akka Streams to the rescue val idEnum : Enumeratee[CypherRow] = _ val src = Source.fromPublisher(Streams.enumeratorToPublisher(idEnum))
 .map(_.data.head.asInstanceOf[BigDecimal].toInt)
 .via(new TimeoutOnEmptyBuffer())
 .map(UserId(_)) .mapAsyncUnordered(parallelism = 1)(id => dao.tagProfiles(id, companyId)) Readable flow Buffering to protect Neo4j when indexing is too slow Timeout due to the bug in the underlying implementation
  • 27. Bottleneck 1.5 hour for 3 million subset, that’s too long!
  • 28. Bulk update with AkkaStream tuning src .grouped(20000)
 .throttle( elements = 2, per = 6 second, maximumBurst = 2, mode = ThrottleMode.Shaping) .mapAsyncUnordered(parallelism = 1)(ids => dao.bulkTag(ids, ...)) Bulk tagging, few lines
  • 29. Tagging Foo-Company ~14 seconds until first batch is tagged ~7 minutes 11 seconds until all tagged reference implementation - few hours 2,222,840 profiles matching criteria Sample subset :)
  • 30. Step #5 - Search Neo4j API hr-1 hr-2 ES Cluster i2.xlarge 4vCPU 30.5GBi2.2xlarge 8vCPU 61GB m4.large 2vCPU 8GB With data ready in ES search implementation was pretty straightforward
  • 31. Step #5 - Search benchmark Fulltext search on 2 millions subset GET /users?company=foo-company&phrase=John 2000 phrases for the benchmarking Response JSON with 50 profiles Searching in 750 millions database Phrases based on real names and surnames, used during profiles enrichment Random requests ordering
  • 32. Step #5 - Search under siege Response time ~ 0.14s50 concurrent users constant latency constant search rate
  • 33. Objective achieved search response time from API ~0.14s 14 seconds until first batch is tagged 7 minutes until 2 millions ready
  • 34. Summary reference implementation PoC manual automatic few days 14 seconds few days 7 minutes ~40 millions ~750 millions no analytics GraphX ready Tool ready for data scientists W can implement core traversal modifications almost instantly