This document discusses evolving an analytics stack to match business changes. It recommends defining self-describing event and entity schemas that can be updated over time. Event data modeling aggregates raw events into modeled data like users and sessions for easier analysis. To evolve the data pipeline, businesses should use self-describing data that allows recomputing models on historical data when new questions arise or data collection changes.
2. SNOWPLOW - LONDON MEETUP #4
BUSINESSES ARE CONSTANTLY EVOLVING…
▸ Your products (apps & platforms) change
▸ Your questions should change too
▸ It’s critical that the analytics stack can evolve with your business
HOW?
SELF-DESCRIBING DATA + EVENT DATA MODELING
EVOLVING EVENT DATA PIPELINE
DEFINE YOUR OWN EVENTS AND ENTITIES
Events
‣ Build castle
‣ Form alliance
‣ Declare war
Entities
‣ Player
‣ Game
‣ Level
‣ Castle

Events
‣ View product
‣ Buy product
‣ Deliver product
Entities
‣ Product
‣ Customer
‣ Basket
‣ Vehicle
YOU THEN DEFINE A SCHEMA FOR EACH EVENT AND ENTITY
{
  "description": "Schema for a fighter context",
  "vendor": "com.ufc",
  "name": "fighter",
  "version": "1-0-2",
  "properties": {
    "FirstName": {"type": "string"},
    "LastName": {"type": "string"},
    "Nickname": {"type": "string"},
    "FacebookProfile": {"type": "string"},
    "WeightLbs": {"type": ["integer", "null"]},
    "Record": {"type": "string", "pattern": "^[0-9]+-[0-9]+-[0-9]+$"}
  }
}
I DON’T DO EVENTS THAT AREN’T SCHEMA’ED
YOU THEN DEFINE A SCHEMA FOR EACH EVENT AND ENTITY
{
  "schema": "iglu:com.ufc/fighter/jsonschema/1-0-2",
  "data": {
    "FirstName": "Daniel",
    "LastName": "Cormier",
    "Nickname": "DC",
    "FacebookProfile": "Daniel-Cormier",
    "TwitterName": "dc_mma",
    "WeightLbs": 205
  }
}
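The "schema" field is what makes the data self-describing: it names the schema the payload claims to follow. A minimal Python sketch of reading that reference is below. It assumes the URI layout seen in the example (iglu:vendor/name/format/version) and that the three-part version reads as model-revision-addition. The names SchemaKey, parse_schema_key and split_self_describing are hypothetical, chosen for illustration.

```python
import re
from typing import NamedTuple, Tuple


class SchemaKey(NamedTuple):
    vendor: str
    name: str
    format: str
    model: int
    revision: int
    addition: int


# Assumed layout: iglu:<vendor>/<name>/<format>/<model>-<revision>-<addition>
_SCHEMA_URI = re.compile(
    r"^iglu:(?P<vendor>[^/]+)/(?P<name>[^/]+)/(?P<format>[^/]+)/"
    r"(?P<model>\d+)-(?P<revision>\d+)-(?P<addition>\d+)$"
)


def parse_schema_key(uri: str) -> SchemaKey:
    """Split a schema URI into its parts, or raise ValueError if malformed."""
    match = _SCHEMA_URI.match(uri)
    if match is None:
        raise ValueError(f"not a valid schema URI: {uri!r}")
    parts = match.groupdict()
    return SchemaKey(
        vendor=parts["vendor"],
        name=parts["name"],
        format=parts["format"],
        model=int(parts["model"]),
        revision=int(parts["revision"]),
        addition=int(parts["addition"]),
    )


def split_self_describing(payload: dict) -> Tuple[SchemaKey, dict]:
    """Return the parsed schema reference and the data it describes."""
    return parse_schema_key(payload["schema"]), payload["data"]
```

A downstream consumer could call split_self_describing on each payload and use the returned key to decide which schema to validate against and which table to load into.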
THE SCHEMAS CAN THEN BE USED IN A NUMBER OF WAYS
▸ Validate the data (important for data quality)
▸ Load the data into tidy tables in your data warehouse
▸ Make it easy and safe to write downstream data processing applications (e.g. for real-time users)
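The first two uses can be sketched in a few lines of Python. This is a hand-rolled check that covers only the "type" and "pattern" keywords appearing in the fighter schema; a real pipeline would more likely rely on a full JSON Schema validator. The function names validate and to_row are hypothetical.

```python
import re

# Maps the JSON Schema type names used in the fighter schema to Python types.
_TYPES = {"string": str, "integer": int, "null": type(None)}

FIGHTER_SCHEMA = {
    "properties": {
        "FirstName": {"type": "string"},
        "LastName": {"type": "string"},
        "Nickname": {"type": "string"},
        "FacebookProfile": {"type": "string"},
        "WeightLbs": {"type": ["integer", "null"]},
        "Record": {"type": "string", "pattern": "^[0-9]+-[0-9]+-[0-9]+$"},
    }
}


def validate(data: dict, schema: dict) -> list:
    """Return a list of error messages; an empty list means the data passed."""
    errors = []
    for field, rules in schema["properties"].items():
        if field not in data:
            continue  # the schema has no "required" list, so fields are optional
        value = data[field]
        allowed = rules["type"]
        allowed = [allowed] if isinstance(allowed, str) else allowed
        # bool is a subclass of int in Python, so exclude it for "integer".
        type_ok = any(
            isinstance(value, _TYPES[t])
            and not (t == "integer" and isinstance(value, bool))
            for t in allowed
        )
        if not type_ok:
            errors.append(f"{field}: expected {allowed}, got {type(value).__name__}")
        elif (
            "pattern" in rules
            and isinstance(value, str)
            and not re.search(rules["pattern"], value)
        ):
            errors.append(f"{field}: {value!r} does not match {rules['pattern']}")
    return errors


def to_row(data: dict, schema: dict) -> dict:
    """Flatten data into one row with exactly one column per schema property."""
    return {field: data.get(field) for field in schema["properties"]}
```

Because to_row takes its columns from the schema rather than from the payload, fields the schema does not declare (such as "TwitterName" in the example payload) never reach the table, and missing optional fields load as nulls.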
WHAT IS EVENT DATA MODELING?
▸ Event data modeling is the process of using business logic to aggregate over event-level data to produce 'modeled' data that is simpler for querying.
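As an illustration of that definition, the Python sketch below aggregates raw events into one row per session. The business logic here is a 30-minute inactivity timeout, which is an assumption and not a rule from the slides; the event keys "user_id" and "timestamp" are likewise hypothetical.

```python
from collections import defaultdict

SESSION_TIMEOUT_SECONDS = 30 * 60  # assumed business rule, not from the slides


def sessionize(events, timeout=SESSION_TIMEOUT_SECONDS):
    """Aggregate event-level data into one row per session.

    Each event is a dict with the hypothetical keys "user_id" and
    "timestamp" (seconds since the epoch). Events may arrive in any order.
    """
    by_user = defaultdict(list)
    for event in events:
        by_user[event["user_id"]].append(event["timestamp"])

    sessions = []
    for user_id, timestamps in by_user.items():
        timestamps.sort()
        start = previous = timestamps[0]
        count = 1
        for ts in timestamps[1:]:
            if ts - previous > timeout:
                # Gap too long: close the current session and open a new one.
                sessions.append(
                    {"user_id": user_id, "start": start, "end": previous, "events": count}
                )
                start, count = ts, 0
            previous = ts
            count += 1
        sessions.append(
            {"user_id": user_id, "start": start, "end": previous, "events": count}
        )
    return sessions
```

The resulting session rows are far easier to query than the raw events, but they encode an opinion (the timeout). Changing that opinion means running the function again over the events.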
MODELED VS UNMODELED DATA
Unmodeled data
‣ event 1
‣ …
‣ event n
IMMUTABLE. UNOPINIONATED. HARD TO CONSUME. NOT
Modeled data
‣ Users
‣ Sessions
‣ …
‣ Funnels
MUTABLE AND OPINIONATED. EASY TO CONSUME.
IN GENERAL, EVENT DATA MODELING IS PERFORMED ON THE COMPLETE EVENT STREAM
▸ Late-arriving events can change the way you understand earlier-arriving events
▸ If we change our data models, working from the complete event stream gives us the flexibility to recompute historical data based on the new model
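Both points can be sketched in Python by rebuilding a derived table from the complete event stream on every run. The table layout, the event keys ("user_id", "name", "timestamp") and the names build_users and count_purchases are hypothetical, chosen for illustration.

```python
def build_users(events, derive_extra=None):
    """Recompute the users table from the complete event stream.

    derive_extra is an optional hook standing in for a change to the data
    model: it receives the user's row and the current event and may add or
    update columns on the row.
    """
    users = {}
    for event in sorted(events, key=lambda e: e["timestamp"]):
        row = users.setdefault(
            event["user_id"],
            {"first_event": event["name"], "first_seen": event["timestamp"], "events": 0},
        )
        row["events"] += 1
        if derive_extra is not None:
            derive_extra(row, event)
    return users


def count_purchases(row, event):
    """Example model change: add a "purchases" column to the users table."""
    is_purchase = 1 if event["name"] == "buy_product" else 0
    row["purchases"] = row.get("purchases", 0) + is_purchase


# A late-arriving event with an earlier timestamp changes what we believe
# about the user's first interaction once the table is recomputed.
stream = [{"user_id": "u1", "name": "buy_product", "timestamp": 200}]
before = build_users(stream)
stream.append({"user_id": "u1", "name": "view_product", "timestamp": 100})
after = build_users(stream)

# Changing the model and recomputing backfills the new column for all history.
with_purchases = build_users(stream, derive_extra=count_purchases)
```

Here before["u1"]["first_event"] is "buy_product" while after["u1"]["first_event"] is "view_product", and with_purchases carries the new column for events that were collected before the column existed.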
HOW DO WE HANDLE PIPELINE EVOLUTION?
▸ Businesses change over time
▸ The events that occur are going to change
▸ Use of the data will change
▸ Insight -> more questions -> more insight -> more questions
▸ Two types of evolution: push and pull
BUSINESSES ARE NOT STATIC, SO EVENT PIPELINES SHOULD NOT BE EITHER
PUSH EXAMPLE:
▸ If data is self-describing, it is easy to add additional sources
▸ Self-describing data is good for managing bad data and pipeline evolution
I’M AN EMAIL SEND EVENT AND I HAVE INFORMATION ABOUT THE RECIPIENT (EMAIL
ANSWERING THE QUESTION:
1. EXISTING DATA MODEL SUPPORTS ANSWER
2. NEED TO UPDATE DATA MODEL
3. NEED TO UPDATE DATA MODEL AND DATA COLLECTION
SELF-DESCRIBING DATA AND THE ABILITY TO RECOMPUTE DATA MODELS ARE ESSENTIAL TO ENABLE PIPELINE EVOLUTION
SELF-DESCRIBING DATA
‣ Update existing events and entities in a backward-compatible way, e.g. add optional new fields
‣ Update existing events and entities in a backward-incompatible way, e.g. change field types, remove fields, add compulsory fields
‣ Add new event and entity types
RECOMPUTE DATA MODELS ON ENTIRE DATA SET
‣ Add new columns to existing derived tables, e.g. add new audience segmentation
‣ Change the way existing derived tables are generated, e.g. change sessionization logic
‣ Create new derived tables
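The difference between compatible and incompatible schema updates can be checked mechanically. The Python sketch below classifies a change using only the examples named in the list (removed fields, changed field types, newly compulsory fields) and suggests the next version. It assumes that the three-part version reads as model-revision-addition, that an incompatible change bumps the first part and that a compatible one bumps the last. The function names classify_change and bump_version are hypothetical.

```python
def classify_change(old: dict, new: dict) -> str:
    """Classify a schema update as "compatible" or "incompatible".

    Simplified on purpose: any difference in a field's "type" counts as
    incompatible (even a widening), and other keywords such as "pattern"
    are not compared.
    """
    old_props = old.get("properties", {})
    new_props = new.get("properties", {})

    removed = set(old_props) - set(new_props)
    retyped = {
        field
        for field in set(old_props) & set(new_props)
        if old_props[field].get("type") != new_props[field].get("type")
    }
    newly_required = set(new.get("required", [])) - set(old.get("required", []))

    if removed or retyped or newly_required:
        return "incompatible"
    return "compatible"


def bump_version(version: str, change: str) -> str:
    """Suggest the next version under the assumed model-revision-addition scheme."""
    model, revision, addition = (int(part) for part in version.split("-"))
    if change == "incompatible":
        return f"{model + 1}-0-0"
    return f"{model}-{revision}-{addition + 1}"
```

Under these assumptions, adding an optional field to the fighter schema at version 1-0-2 would suggest 1-0-3, while removing a field or making one compulsory would suggest 2-0-0.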