3. Introduction to GameAnalytics
We provide user behaviour analytics for video game developers
Similar to services like Google Analytics, Firebase, Facebook Analytics and so on
In contrast to those services, we focus exclusively on gaming
We provide SDKs for the most popular game development tools
We also provide a REST API: https://gameanalytics.com/docs/item/rest-api-doc
The main tool game developers interact with is our web application, where they can
see results in real time and also historical aggregates.
4. Introduction to GameAnalytics
How much data do we process?
● 125M+ DAU (Daily Active Users)
● 1.2B+ MAU (Monthly Active Users)
● 25,000+ Daily Active Games
● 15B+ events per day (at peak days)
● All of our data is in JSON format
9. Technical Requirements
What are the high-level technical requirements for a service like GameAnalytics?
● Low query latency (responsive Frontend)
● Streaming ingestion and real time queries with relatively small delay
● Reliability
● Keep infrastructure cost low
● Provide flexible querying for users
● Most queries involve counting unique users
10. Backend Overview
We can talk about three main components or services
● Data collection
● Data annotation (enrichment)
● Aggregation and reporting
11. Data Collection
We run a web service behind an auto-scaling group
It simply writes the raw JSON events to S3, with some buffering
We have some articles in our blog about this topic
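The buffering idea above can be sketched as follows. This is a minimal illustration, not the actual service: the class, the flush threshold, and the batch format are all hypothetical, and the real target is an S3 PUT rather than an in-memory list.

```python
import json

class EventBuffer:
    """Hypothetical sketch: buffer raw JSON events and flush them in batches.
    In production the flush target would be S3; here we just collect batches."""

    def __init__(self, max_events=100):
        self.max_events = max_events
        self.pending = []
        self.flushed_batches = []

    def add(self, event):
        self.pending.append(json.dumps(event))
        if len(self.pending) >= self.max_events:
            self.flush()

    def flush(self):
        if self.pending:
            # In the real service this would be a single S3 object write
            self.flushed_batches.append("\n".join(self.pending))
            self.pending = []

buf = EventBuffer(max_events=2)
buf.add({"event": "session_start", "user_id": "u1"})
buf.add({"event": "purchase", "user_id": "u1", "amount": 0.99})
print(len(buf.flushed_batches))  # one batch flushed after hitting the threshold
```

Buffering trades a small delivery delay for far fewer, larger S3 writes, which keeps request costs down.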
13. Data Annotation System
We run micro-batching
We keep the state in DynamoDB, with cache in Redis
We are moving to reading from and writing to Kinesis (about to deploy to production)
The service annotates events to make querying easier in the follow-up service
More on this topic later
15. Aggregation and Reporting: Legacy System
Implemented using Erlang
Data storage in-memory (recent data) and DynamoDB (historical)
It supported streaming (micro batching) and real time queries
Query latency was low
In-house implementation of HyperLogLog algorithm is open source:
https://github.com/GameAnalytics/hyper
16. Aggregation and Reporting: Challenges
We had several problems
● Cost: traffic was increasing and the system was not cost efficient enough
● Reliability: the master-slave architecture made it difficult to keep the system stable
● Difficult to implement new features
● Knowledge of the code base was lost
● It was only possible to filter using one dimension
We needed a replacement that would let us spend more time delivering valuable features for our customers, while also controlling cost and scaling easily
18. Druid: Schema Design
Schema design is key for optimizing performance and cost
When implementing Druid, we ran several ingestion tests to evaluate the resulting rollup. You want to achieve the best rollup possible that still fulfills your query requirements
We have mostly one big datasource and one streaming ingestion
We use HyperLogLog sketches for most of our queries
We ingest with hourly granularity, and later roll up to daily using EMR
Druid documentation is your friend:
https://druid.apache.org/docs/latest/ingestion/schema-design.html
19. Druid: Schema Design. HyperLogLog
From Wikipedia:
HyperLogLog is an algorithm for the count-distinct problem, approximating the
number of distinct elements in a multiset. Calculating the exact cardinality of a multiset
requires an amount of memory proportional to the cardinality, which is impractical for
very large data sets.
Druid provides an HLL-based aggregator
We leverage this at GA, since our queries report mostly on a per user basis:
● Active Users (Daily, Weekly, Monthly)
● Average Revenue Per Daily Active User (ARPDAU)
● Retention
● ….
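To make the idea concrete, here is a minimal, illustrative HyperLogLog in plain Python: hash each value, use the first bits to pick a register, and record the longest run of leading zeros seen per register. Druid's implementation (and GameAnalytics' open-source `hyper` library) is far more optimized; this sketch only shows the principle.

```python
import hashlib
import math

P = 10                  # 2^10 = 1024 registers
M = 1 << P
ALPHA = 0.7213 / (1 + 1.079 / M)   # standard bias-correction constant

def _hash64(value: str) -> int:
    # Deterministic 64-bit hash for the example
    return int.from_bytes(hashlib.sha1(value.encode()).digest()[:8], "big")

class HyperLogLog:
    def __init__(self):
        self.registers = [0] * M

    def add(self, value: str):
        h = _hash64(value)
        idx = h >> (64 - P)                  # first P bits pick a register
        rest = h & ((1 << (64 - P)) - 1)
        # position of the leftmost 1-bit in the remaining 54 bits
        rho = (64 - P) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rho)

    def estimate(self) -> float:
        raw = ALPHA * M * M / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * M and zeros:         # small-range (linear counting) correction
            return M * math.log(M / zeros)
        return raw

hll = HyperLogLog()
for i in range(10_000):
    hll.add(f"user-{i}")
est = hll.estimate()
print(round(est))   # typically within a few percent of the true 10,000
```

The key property is that the sketch uses a fixed 1024 registers regardless of cardinality, and two sketches can be merged by taking the per-register maximum, which is what makes per-user metrics cheap to aggregate.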
20. Druid: Schema Design. Metrics and Dimensions
We currently have 53 dimensions and 10 metrics
The resulting roll-up ends up with about 10x fewer rows than the raw data
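Rollup works by grouping rows that share a truncated timestamp and identical dimension values, and pre-aggregating the metrics. A small sketch with made-up events and dimensions:

```python
from collections import defaultdict

# Hypothetical raw events; rollup groups by (hour, dimensions) and aggregates metrics
raw_events = [
    {"ts": "2020-01-01T10:05", "game_id": "g1", "country": "DE", "revenue": 0.99},
    {"ts": "2020-01-01T10:17", "game_id": "g1", "country": "DE", "revenue": 1.99},
    {"ts": "2020-01-01T10:42", "game_id": "g1", "country": "US", "revenue": 0.99},
    {"ts": "2020-01-01T11:03", "game_id": "g1", "country": "DE", "revenue": 4.99},
]

rollup = defaultdict(lambda: {"count": 0, "revenue": 0.0})
for e in raw_events:
    hour = e["ts"][:13]                      # truncate timestamp to hourly granularity
    key = (hour, e["game_id"], e["country"])
    rollup[key]["count"] += 1
    rollup[key]["revenue"] += e["revenue"]

print(len(raw_events), "raw rows ->", len(rollup), "rolled-up rows")
```

The achievable ratio depends directly on the number of dimensions and their cardinality, which is why schema design drives both performance and cost.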
21. Druid: Real-time Ingestion
We ingest data in a streaming fashion using Kinesis Ingestion Service (KIS)
You need one Kinesis ingestion per datasource
We re-ingest our main datasource after 48 hours with daily granularity
Kinesis ingestion does not generate perfectly rolled-up segments, so re-ingestion or segment compaction is needed for optimal performance
There are different approaches we could take
22. Druid: Batch Ingestion
You can use segment compaction to create better segments from the KIS ones
However, we must re-ingest to change granularity from hour to day
It is also possible to re-ingest from the Kinesis segments instead of ingesting from raw data again (using either EMR or native ingestion)
We use EMR, but it is also possible to use Druid native ingestion (index parallel), which can guarantee perfect rollup (in the latest Druid versions)
23. Druid: Batch Ingestion Coordination
We use AWS Step Functions and Lambdas to coordinate EMR ingestion for Druid clusters
25. Druid: Tiering
Two tiers
Our oldest data (older than 6 months) is accessed less often, so we can serve it with less powerful hardware
26. Druid: Tiering
You should leverage tiering according to your query patterns
In our case, we use it to lower costs by serving less frequently accessed data with cheaper hardware
Other use cases include serving the most frequently accessed data with more powerful hardware (in-memory, AWS R-type instances)
We are considering doing this in the future
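In Druid, tiering of this kind is expressed through load rules on the coordinator. A sketch of what a two-tier rule set could look like, assuming tiers named "hot" and "_default_tier" (the tier names, periods, and replicant counts here are illustrative, not GameAnalytics' actual configuration):

```json
[
  {
    "type": "loadByPeriod",
    "period": "P6M",
    "includeFuture": true,
    "tieredReplicants": { "hot": 2 }
  },
  {
    "type": "loadForever",
    "tieredReplicants": { "_default_tier": 1 }
  }
]
```

Rules are evaluated top to bottom, so recent data (the last 6 months) lands on the "hot" tier and everything older falls through to the default tier.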
27. Druid: Query Layer
We decided to build our own query layer, which allows us to
● Provide higher abstraction for Frontend, and define metrics on backend side
● Implement authentication that works well with our other backend systems
● Fine tune things like caching, query priorities, rate limiting and so on
● Use a programming language that we are comfortable with
We implemented the query layer using the Elixir language
There was no available Druid client for Elixir, so we created our own. It allows you to
build Druid queries using macros and translates those to Druid JSON
It is open source: https://github.com/GameAnalytics/panoramix
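The essence of a query layer like this is translating a high-level metric request into Druid's native JSON query format. A hedged sketch in Python (Panoramix does this in Elixir with macros; the datasource and sketch column names below are hypothetical):

```python
import json

# Sketch: build the kind of Druid native timeseries query a query layer
# might emit for a DAU metric, using the hyperUnique (HLL) aggregator
def dau_query(datasource: str, interval: str) -> str:
    query = {
        "queryType": "timeseries",
        "dataSource": datasource,
        "granularity": "day",
        "intervals": [interval],
        "aggregations": [
            # aggregates the pre-built HLL sketch column at query time
            {"type": "hyperUnique", "name": "dau", "fieldName": "unique_users"}
        ],
    }
    return json.dumps(query)

q = json.loads(dau_query("events", "2020-01-01/2020-01-08"))
print(q["queryType"], q["aggregations"][0]["type"])
```

Centralizing this translation in one backend service means metric definitions live in one place, and the frontend never needs to know Druid's query language.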
28. Druid: Query Layer Caching
Initially, we went live without cache in the query layer
It was a huge mistake, as we assumed caching on Druid brokers would be enough
Lesson learned: always implement good caching in front of your database in a use case like this one
After adding caching, query latency improved dramatically
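A minimal sketch of the pattern: a TTL cache sitting between the frontend and the (here simulated) Druid round trip. The class, TTL, and keys are hypothetical; a production query layer would also handle concurrency, invalidation, and cache sizing.

```python
import time

# Minimal TTL cache in front of a (hypothetical) Druid query function
class QueryCache:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.store = {}          # key -> (expires_at, result)
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, key, compute):
        now = time.monotonic()
        entry = self.store.get(key)
        if entry and entry[0] > now:
            self.hits += 1
            return entry[1]
        self.misses += 1
        result = compute()
        self.store[key] = (now + self.ttl, result)
        return result

def expensive_druid_query():
    return {"dau": 125_000_000}   # stand-in for a real Druid round trip

cache = QueryCache(ttl_seconds=60)
cache.get_or_compute("dau:2020-01-01", expensive_druid_query)
cache.get_or_compute("dau:2020-01-01", expensive_druid_query)
print(cache.hits, cache.misses)  # second identical request is served from cache
```

Since dashboards tend to re-issue the same queries repeatedly, even a short TTL removes a large share of the load from the brokers.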
32. Druid: Performance
We keep data for 1 year in Druid cluster
Around 40k queries per hour to Druid cluster
We process 15 billion events per day, with peaks of over 250k events per
second
33. Druid: Imply Pivot
Imply Pivot is a potential alternative to writing your own query layer and Frontend
We leverage Pivot for internal use within GA
It is also possible to use it as an interface for external customers
35. Druid: Monitoring and Upgrades
We use Graphite and Grafana for application level monitoring on the query layer
To monitor the Druid cluster we use Imply Clarity
And of course, we can also use CloudWatch for monitoring both
Rolling cluster upgrades are automated by Imply Cloud
37. Druid: What about the annotations?
We talked about the annotation service before
38. Druid: Annotation System
Two sources of data: the SDK (devices) and attribution partners
The SDK provides user behaviour, while attribution tells us where the user comes from
Users want to filter the data on both user behaviour and attribution
We need to join both data streams
39. Druid: Annotation System
From Druid documentation:
If you need to join two large distributed tables with each other, you must do this before
loading the data into Druid. Druid does not support query-time joins of two datasources.
Our annotation service prepares data for ingestion into Druid
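The pre-ingestion join can be sketched as annotating each SDK event with its user's attribution record before handing rows to Druid. All field names and the "unknown" fallback below are illustrative, not the actual annotation service:

```python
# Hedged sketch: join SDK behaviour events with attribution data by user,
# before ingestion into Druid
attribution = {
    "u1": {"network": "organic"},
    "u2": {"network": "some_ad_network"},
}

sdk_events = [
    {"user_id": "u1", "event": "session_start"},
    {"user_id": "u2", "event": "purchase"},
    {"user_id": "u3", "event": "session_start"},   # no attribution known yet
]

annotated = [
    {**e, "network": attribution.get(e["user_id"], {}).get("network", "unknown")}
    for e in sdk_events
]

print([e["network"] for e in annotated])
```

After annotation, "network" is just another dimension in the datasource, so users can filter behaviour metrics by acquisition source without any query-time join.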
40. Druid: Annotation System
This is a design choice as stated in Druid documentation:
https://druid.apache.org/docs/latest/querying/joins.html
Support for joins is on the Druid roadmap; however, it is already possible to perform simple joins using lookups
Some preparation before ingestion into Druid is generally needed; however, Druid ingestion handles things such as aggregation, rollup, filtering, exactly-once processing guarantees, and so on for you
41. Druid: Lookups
The lookups feature allows simple joins with data stored outside of Druid
In our case, we have the following relation:
game_id -> studio_id -> organization_id
We only ingest game_id into Druid; using lookups, we can query at the studio and organization level, which are stored in a MySQL DB
42. Druid: Calculating Player Retention
We also make use of the annotation service for other things such as player retention
calculation
The most common way to calculate retention in Druid would be using the Theta Sketches feature.
Adding the sketches to our datasource increases the size by 30% (and therefore, the
cost)
We annotate events with installation timestamp (truncated) so that we do not need
the sketches to calculate retention
43. Druid: Other considerations
Data Partitioning
It is possible to ingest data partitioned by tenant_id (game_id)
Multiple datasources instead of just one
Ingesting data several times into different datasources, removing certain dimensions, might speed up query times
44. Druid @ GameAnalytics: What next?
We are building an A/B testing solution, and Druid plays an important role
Maybe implementing a funnels feature using Druid Theta Sketches?
https://imply.io/post/clickstream-funnel-analysis-with-apache-druid
We are about to enable query vectorization in production