Apache Druid @
London Apache Druid Meetup
15th of January 2020
Agenda
➢ Introduction to GameAnalytics
➢ Backend overview
➢ Druid based solution
○ Schema design
○ Ingestion
○ Performance
○ Monitoring
➢ Planned next steps
➢ Summary / Takeaways
➢ Questions
Introduction to GameAnalytics
We provide user behaviour analytics for video game developers
Similar to services like Google Analytics, Firebase, Facebook Analytics, and so on.
In contrast to those services, we are focused on just gaming
We provide SDKs for the most popular game development tools
We also provide a REST API: https://gameanalytics.com/docs/item/rest-api-doc
The main tool game developers interact with is our web application, where they can
see results in real time and also historical aggregates.
Introduction to GameAnalytics
How much data do we process?
● 125M+ DAU (Daily Active Users)
● 1.2B+ MAU (Monthly Active Users)
● 25,000+ Daily Active Games
● 15B+ events per day (on peak days)
● All of our data is in JSON format
Analytics for 90,000 Game Developers
Interactive Filtering
Technical Requirements
What are the high-level technical requirements for a service like GameAnalytics?
● Low query latency (responsive Frontend)
● Streaming ingestion and real time queries with relatively small delay
● Reliability
● Keep infrastructure cost low
● Provide flexible querying for users
● Most queries involve counting unique users
Backend Overview
We can talk about three main components or services
● Data collection
● Data annotation (enrichment)
● Aggregation and reporting
Data Collection
We run a web service with an auto scaling group
It simply writes the raw JSON events to S3 with
some buffering
We have some articles on our blog about this topic
Data Annotation System
(Architecture diagram: events flow from S3 through the annotation system and back to S3, with state kept in DynamoDB)
Data Annotation System
We run micro-batching
We keep the state in DynamoDB, with cache in Redis
We are moving to reading from and writing to Kinesis (about to deploy to production)
The service annotates events to make querying easier in the follow-up service
More on this topic later
Aggregation and Reporting: Legacy System
Legacy system
Aggregation and Reporting: Legacy System
Implemented using Erlang
Data storage in-memory (recent data) and DynamoDB (historical)
It supported streaming (micro batching) and real time queries
Query latency was low
In-house implementation of HyperLogLog algorithm is open source:
https://github.com/GameAnalytics/hyper
Aggregation and Reporting: Challenges
We had several problems
● Cost: traffic was increasing and the system was not cost efficient enough
● Reliability: the master-slave architecture made the system difficult to keep stable
● Difficult to implement new features
● Knowledge of the code base was lost
● It was only possible to filter using one dimension
We needed a replacement that would let us spend more time delivering valuable
features for our customers, while keeping costs under control and scaling easily
Aggregation and Reporting: Druid
(Architecture diagram: events flow from S3 into the Druid-based pipeline)
Druid: Schema Design
Schema design is key for optimizing performance and cost
When implementing Druid, we ran several ingestion tests to evaluate the resulting
rollups. You want to achieve the best rollup that still satisfies your query
requirements
We have mostly one big datasource and one streaming ingestion
We use HyperLogLog sketches for most of our queries
We ingest at hourly granularity, and later roll up to daily granularity using EMR
Druid documentation is your friend:
https://druid.apache.org/docs/latest/ingestion/schema-design.html
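To make the rollup discussion concrete, here is a minimal sketch of a rollup-oriented dataSchema. This is an illustration only: the datasource, dimension, and metric names are hypothetical, not our production schema, and the exact field layout varies across Druid versions.

```json
{
  "dataSchema": {
    "dataSource": "events",
    "timestampSpec": {"column": "ts", "format": "iso"},
    "dimensionsSpec": {
      "dimensions": ["game_id", "platform", "country", "event_type"]
    },
    "metricsSpec": [
      {"type": "count", "name": "event_count"},
      {"type": "doubleSum", "name": "revenue", "fieldName": "amount"},
      {"type": "hyperUnique", "name": "unique_users", "fieldName": "user_id"}
    ],
    "granularitySpec": {
      "segmentGranularity": "hour",
      "queryGranularity": "hour",
      "rollup": true
    }
  }
}
```

With rollup enabled, all events sharing the same truncated timestamp and dimension values collapse into one row, which is where the roughly 10x row reduction mentioned below comes from.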
Druid: Schema Design. HyperLogLog
From Wikipedia:
HyperLogLog is an algorithm for the count-distinct problem, approximating the
number of distinct elements in a multiset. Calculating the exact cardinality of a multiset
requires an amount of memory proportional to the cardinality, which is impractical for
very large data sets.
Druid provides an HLL-based aggregator
We leverage this at GA, since our queries mostly report on a per-user basis (see the query sketch after this list):
● Active Users (Daily, Weekly, Monthly)
● Average Revenue Per Daily Active User (ARPDAU)
● Retention
● ….
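As an illustration, a DAU-style timeseries query against an HLL metric like the one sketched above could look as follows (datasource and field names are again hypothetical):

```json
{
  "queryType": "timeseries",
  "dataSource": "events",
  "granularity": "day",
  "intervals": ["2020-01-01/2020-01-08"],
  "aggregations": [
    {"type": "hyperUnique", "name": "dau", "fieldName": "unique_users"}
  ]
}
```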
Druid: Schema Design. Metrics and Dimensions
We currently have 53 dimensions and 10 metrics
The resulting rollup ends up with about 10x fewer rows than the raw data
Druid: Real-time Ingestion
We ingest data in a streaming fashion using the Kinesis indexing service (KIS)
You need one Kinesis ingestion supervisor per datasource
We re-ingest our main datasource after 48 hours with daily granularity
Kinesis ingestion does not generate perfectly rolled-up segments, so re-ingestion or
segment compaction is needed for optimal performance
There are different approaches we could take
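For reference, a Kinesis supervisor spec has roughly the following shape. This is a hedged sketch: the stream name and settings are hypothetical, the dataSchema is the same shape as the earlier sketch and is elided here, and the field layout varies across Druid versions.

```json
{
  "type": "kinesis",
  "spec": {
    "dataSchema": {"dataSource": "events", "...": "(as in the schema sketch above)"},
    "ioConfig": {
      "stream": "annotated-events",
      "endpoint": "kinesis.eu-west-1.amazonaws.com",
      "taskCount": 2,
      "replicas": 1,
      "taskDuration": "PT1H",
      "useEarliestSequenceNumber": false
    },
    "tuningConfig": {"type": "kinesis"}
  }
}
```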
Druid: Batch Ingestion
You can use segment compaction to create better segments from the KIS-generated ones
However, we must re-ingest to change the granularity from hour to day
It is also possible to re-ingest from the Kinesis-generated segments instead of ingesting from
raw data again (both using EMR and native ingestion)
We use EMR, but it is also possible to use Druid native ingestion (index_parallel),
which can guarantee perfect rollup (in the latest Druid versions)
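A hedged sketch of what a native index_parallel re-ingestion with guaranteed rollup can look like, reading back the datasource's own segments via the Druid input source available in recent versions; the names and values are illustrative, and the metricsSpec (as in the earlier sketch) is elided:

```json
{
  "type": "index_parallel",
  "spec": {
    "dataSchema": {
      "dataSource": "events",
      "timestampSpec": {"column": "__time", "format": "millis"},
      "dimensionsSpec": {"dimensions": ["game_id", "platform", "country", "event_type"]},
      "granularitySpec": {"segmentGranularity": "day", "queryGranularity": "day", "rollup": true}
    },
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {"type": "druid", "dataSource": "events", "interval": "2020-01-01/2020-01-02"}
    },
    "tuningConfig": {
      "type": "index_parallel",
      "forceGuaranteedRollup": true,
      "partitionsSpec": {"type": "hashed", "numShards": 4}
    }
  }
}
```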
Druid: Batch Ingestion Coordination
We use AWS Step Functions and Lambdas to coordinate EMR ingestion for Druid clusters
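As an illustration of the coordination pattern (not our actual state machine; the Lambda functions and ARNs are hypothetical), a Step Functions definition that submits an ingestion task and polls until it completes might look like this:

```json
{
  "Comment": "Hypothetical sketch: submit a Druid batch ingestion task, then poll until it finishes",
  "StartAt": "SubmitIngestionTask",
  "States": {
    "SubmitIngestionTask": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:submit-druid-task",
      "ResultPath": "$.taskId",
      "Next": "WaitBeforePolling"
    },
    "WaitBeforePolling": {"Type": "Wait", "Seconds": 300, "Next": "CheckTaskStatus"},
    "CheckTaskStatus": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:check-druid-task",
      "ResultPath": "$.status",
      "Next": "IsTaskDone"
    },
    "IsTaskDone": {
      "Type": "Choice",
      "Choices": [
        {"Variable": "$.status", "StringEquals": "SUCCESS", "Next": "Done"},
        {"Variable": "$.status", "StringEquals": "FAILED", "Next": "IngestionFailed"}
      ],
      "Default": "WaitBeforePolling"
    },
    "Done": {"Type": "Succeed"},
    "IngestionFailed": {"Type": "Fail"}
  }
}
```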
Druid: Cluster Topology
Druid: Tiering
Two tiers: our oldest data (older than 6 months) is accessed less often, so we can manage serving it with less powerful hardware
Druid: Tiering
You should leverage tiering according to your query patterns
In our case, we use it to lower costs by serving data that is accessed less often with cheaper hardware
Another use case is serving the most frequently accessed data with more powerful hardware (in-memory, AWS R-type instances)
We are considering doing this in the future
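Tier assignment is driven by the coordinator's load rules, which are evaluated top to bottom. A hedged sketch (tier names and replica counts are illustrative; the periods mirror the 6-month split and 1-year retention mentioned in this deck):

```json
[
  {"type": "loadByPeriod", "period": "P6M", "tieredReplicants": {"hot": 2}},
  {"type": "loadByPeriod", "period": "P1Y", "tieredReplicants": {"cold": 1}},
  {"type": "dropForever"}
]
```

Historicals join a tier via the druid.server.tier runtime property, so the cold tier can simply run on cheaper instance types.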
Druid: Query Layer
We decided to build our own query layer, which allows us to
● Provide higher abstraction for Frontend, and define metrics on backend side
● Implement authentication that works well with our other backend systems
● Fine tune things like caching, query priorities, rate limiting and so on
● Use a programming language that we are comfortable with
We implemented the query layer using the Elixir language
There was no available Druid client for Elixir, so we created our own. It allows you to
build Druid queries using macros and translates those to Druid JSON
It is open source: https://github.com/GameAnalytics/panoramix
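A hedged sketch of building and posting a query with Panoramix, based on the library's documented query-building style (the datasource and metric names are illustrative):

```elixir
use Panoramix

# Build a daily-uniques timeseries query over a hypothetical HLL column.
query =
  from "events",
    query_type: "timeseries",
    intervals: ["2020-01-01T00:00:00+00:00/2020-01-08T00:00:00+00:00"],
    granularity: :day,
    aggregations: [dau: hyperUnique(:unique_users)]

# Send the generated Druid JSON to the configured broker.
Panoramix.post_query(query)
```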
Druid: Query Layer Caching
Initially, we went live without a cache in the query layer
That was a huge mistake, as we assumed caching on the Druid brokers would be enough
Lesson learned: always implement good caching in front of your database in a use case
like this one
After adding caching, query latency improved dramatically
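The deck does not say which caching mechanism the query layer uses, so the following is only a minimal Elixir illustration of the pattern: a read-through cache keyed on a hash of the query, with a TTL, sitting in front of whatever function actually posts to Druid.

```elixir
defmodule QueryCache do
  @moduledoc "Minimal read-through cache in front of Druid (illustrative sketch only)."
  @table :druid_query_cache
  @ttl_seconds 300

  # Create the ETS table once at application start.
  def start do
    :ets.new(@table, [:named_table, :public, read_concurrency: true])
  end

  # Return a cached result if still fresh, otherwise run the query and cache it.
  def fetch(query, run_fun) do
    key = :erlang.phash2(query)
    now = System.system_time(:second)

    case :ets.lookup(@table, key) do
      [{^key, result, expires}] when expires > now ->
        result

      _ ->
        result = run_fun.(query)
        :ets.insert(@table, {key, result, now + @ttl_seconds})
        result
    end
  end
end

# Usage: QueryCache.fetch(query, &Panoramix.post_query/1)
```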
Druid: Query Layer Caching
Druid: Performance on Druid brokers
Druid: Performance on Query Layer
Druid: Performance
We keep data for 1 year in the Druid cluster
Around 40k queries per hour hit the Druid cluster
We process 15 billion events per day, with peaks of over 250k events per
second
Druid: Imply Pivot
Imply Pivot is a potential alternative to writing your own query layer and Frontend
We leverage Pivot for internal use within GA
It is also possible to use it as an interface for external customers
Druid: Pivot
Druid: Monitoring and Upgrades
We use Graphite and Grafana for application level monitoring on the query layer
To monitor the Druid cluster we use Imply Clarity
And of course, we can also use Cloudwatch for monitoring both
Rolling cluster upgrades are automated by Imply Cloud
Druid: Monitoring with Clarity
Druid: What about the annotations?
We talked about the annotation service before
Druid: Annotation System
Two sources of data: the SDK (devices) and attribution partners
The SDK provides user behaviour, while attribution is about where the user comes from
Users want to filter the data on both user behaviour and attribution
We need to join both data streams
Druid: Annotation System
From Druid documentation:
If you need to join two large distributed tables with each other, you must do this before
loading the data into Druid. Druid does not support query-time joins of two datasources.
Our annotation service prepares data for ingestion into Druid
Druid: Annotation System
This is a design choice as stated in Druid documentation:
https://druid.apache.org/docs/latest/querying/joins.html
Support for joins is on the Druid roadmap; however, it is already possible to perform simple joins
using lookups
Some preparation before ingestion into Druid is generally needed; however, Druid
ingestion handles things such as aggregation, rollup, filtering, and exactly-once processing
guarantees for you
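A minimal sketch of the pre-ingestion join idea in Elixir (the field names are hypothetical): attribution attributes are merged into each behaviour event, keyed on the user id, so that a single denormalized row reaches Druid:

```elixir
defmodule Annotator do
  @moduledoc "Illustrative sketch of joining behaviour events with attribution data before ingestion."

  # events: list of maps, e.g. %{"user_id" => "u1", "event_type" => "session_start", ...}
  # attribution_by_user: %{"u1" => %{"network" => "...", "campaign" => "..."}, ...}
  def annotate(events, attribution_by_user) do
    Enum.map(events, fn event ->
      attribution = Map.get(attribution_by_user, event["user_id"], %{})
      Map.merge(event, attribution)
    end)
  end
end
```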
Druid: Lookups
The lookups feature allows simple joins with data stored outside of Druid
In our case, we have the following relation:
game_id -> studio_id -> organization_id
We only ingest game_id into Druid; using lookups, we can also query at the studio and
organization level, with the mappings stored in a MySQL DB
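For reference, a globally cached JDBC lookup (the lookups-cached-global extension) can be registered with a config roughly like the following hedged sketch; the table, column, and connection details are hypothetical:

```json
{
  "type": "cachedNamespace",
  "extractionNamespace": {
    "type": "jdbc",
    "connectorConfig": {
      "connectURI": "jdbc:mysql://mysql.example.com:3306/games",
      "user": "druid",
      "password": "secret"
    },
    "table": "games",
    "keyColumn": "game_id",
    "valueColumn": "studio_id",
    "pollPeriod": "PT10M"
  }
}
```

Queries can then map the ingested game_id dimension through the lookup at query time, for example via the LOOKUP function in Druid SQL.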
Druid: Calculating Player Retention
We also make use of the annotation service for other things such as player retention
calculation
The most common way to calculate retention in Druid would be using the Theta
sketches feature.
Adding the sketches to our datasource increases the size by 30% (and therefore, the
cost)
We annotate events with the (truncated) installation timestamp so that we do not need
the sketches to calculate retention
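For example, with a hypothetical install_date dimension annotated onto every event, day-7 retention for the 2020-01-01 cohort reduces to a plain filtered unique count, with no sketch set operations needed:

```json
{
  "queryType": "timeseries",
  "dataSource": "events",
  "granularity": "all",
  "intervals": ["2020-01-08/2020-01-09"],
  "filter": {"type": "selector", "dimension": "install_date", "value": "2020-01-01"},
  "aggregations": [
    {"type": "hyperUnique", "name": "retained_users", "fieldName": "unique_users"}
  ]
}
```

Dividing retained_users by the cohort's unique count on its install day gives the retention ratio.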
Druid: Other considerations
Data Partitioning
It is possible to ingest data partitioned by tenant_id (game_id); see the sketch below
Multiple datasources instead of just one
Ingesting the data several times into different datasources, removing certain
dimensions, might speed up query times
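As a hedged illustration of tenant partitioning, recent Druid versions support single-dimension range partitioning in the batch tuningConfig (exact option names vary by version and ingestion method):

```json
{
  "tuningConfig": {
    "type": "index_parallel",
    "partitionsSpec": {
      "type": "single_dim",
      "partitionDimension": "game_id",
      "targetRowsPerSegment": 5000000
    }
  }
}
```

Partitioning by game_id means a per-game query only has to scan the segments that hold that game's rows.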
Druid @ GameAnalytics: What next?
We are building an A/B testing solution, and Druid plays an important role
Maybe implementing a funnels feature using Druid Theta sketches?
https://imply.io/post/clickstream-funnel-analysis-with-apache-druid
We are about to enable query vectorization in production
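Vectorized query execution is toggled per query through the query context; a minimal sketch (the flag shipped as experimental in Druid 0.16 and is off by default there):

```json
{
  "queryType": "timeseries",
  "dataSource": "events",
  "granularity": "hour",
  "intervals": ["2020-01-01/2020-01-02"],
  "aggregations": [{"type": "count", "name": "rows"}],
  "context": {"vectorize": "force"}
}
```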
Resources
GameAnalytics technical blog
https://gameanalytics.com/blog/category/game-development/engineering
GameAnalytics case study (Imply blog)
https://imply.io/post/why-gameanalytics-migrated-to-druid
Druid Joins design proposal
https://github.com/apache/druid/issues/8728