A summary of the philosophy and approach taken by the TravelBird Data Science team (and the company as a whole) that allow rapid development of new machine learning algorithms and data insights, and their integration into production and operations.
Architecting for Analytics
1. Rob Winters
Head of Data Science
Architecting for Analytics
Data Science at TravelBird
2. Founded in 2010, TravelBird focuses on bringing back the joy of travel by providing inspiration to explore and simplicity in discovering new destinations. It is active in eleven markets across Europe and inspires three million travelers daily via email, web, and mobile app.
Our Values
Inspiring
Prompting you to visit a place you'd never thought about before.
Curated & local
Proudly introducing travellers to the very best their destinations have to offer, with insider tips and local insight.
Simple & easy
Taking care of the core elements of your journey, and there for you every step of the way.
3. Data Science @ TB
Team Composition
● Team Lead: Rob
● Data Engineering: Niels
● Data Science: Tedy, Egle, Bastien
● Reporting: Jeff, Enzo
Summary Stats
● 30 million events processed/day
● 2.5 million personalized interactions/day
● 700 discrete dashboards/ad-hoc analyses; 300 FTE supported with 60% daily reporting utilization
5. Systems Engineering
The Data Science team independently manages >50% of the company's technology stack, composed of over a dozen systems and services supporting all components of data capture, management, storage, and utilization
8. Marketing channel attribution and spend decision support
[Diagram: an example journey of touchpoints (Affiliate, Email, Affiliate, Affiliate, Email, Email, Email, SEA, Organic, Email, Email) separated by gaps of 3d, 5d, 2d, 1d, 1d, 1d, 3d, 3d, 1d, and 5d, ending in an order]
A journey's net revenue is spread backwards across its touchpoints according to:
- the channel's weight,
- the number of touchpoints in the journey, and
- whether the touchpoint is the first or the last click of the journey.
A sketch of this weighting scheme follows below.
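A minimal sketch of the idea in Python, with invented channel weights and first/last-click multipliers (the production model's actual parameters are not shown here):

# Backwards revenue spreading; all weights below are illustrative assumptions.
CHANNEL_WEIGHT = {"Email": 0.5, "Affiliate": 1.0, "SEA": 1.2, "Organic": 0.8}
FIRST_CLICK_BONUS = 1.5  # hypothetical multiplier for the first touchpoint
LAST_CLICK_BONUS = 2.0   # hypothetical multiplier for the last touchpoint

def attribute(revenue, journey):
    """Spread a journey's net revenue across its touchpoints (oldest first)."""
    scores = []
    for i, channel in enumerate(journey):
        score = CHANNEL_WEIGHT.get(channel, 1.0)
        if i == 0:
            score *= FIRST_CLICK_BONUS
        if i == len(journey) - 1:
            score *= LAST_CLICK_BONUS
        scores.append(score)
    total = sum(scores)
    return [(c, revenue * s / total) for c, s in zip(journey, scores)]

print(attribute(100.0, ["Affiliate", "Email", "SEA", "Organic", "Email"]))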
9. Data Science in Operations
Forecasting
Using a mixture of internal and external data fed through ARIMA and neural-net models, we predict expected travel demand and how it matches our negotiated availability
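A minimal forecasting sketch using statsmodels' ARIMA on a made-up daily demand series (the model order and data are illustrative; the production mixture of models is not shown):

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical daily demand series.
demand = pd.Series(
    [120, 135, 128, 150, 160, 155, 170, 180, 175, 190, 200, 195, 210, 220],
    index=pd.date_range("2018-01-01", periods=14, freq="D"),
)

model = ARIMA(demand, order=(1, 1, 1))  # (p, d, q) chosen for illustration
fitted = model.fit()
print(fitted.forecast(steps=7))         # expected demand for the next week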
Decision Support
By analyzing offer features and user interactions, we automatically recommend changes to images, text, calendars, prices, etc.
Calendar Analysis
By using CNNs to analyze calendar features, we can identify how users interpret calendars and how that changes over time, allowing better negotiation with partners
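A minimal Keras sketch of what such a calendar CNN might look like, assuming calendars are encoded as 6-week x 7-day single-channel grids; the layers and target below are illustrative, not the production model:

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(16, (3, 3), activation="relu", input_shape=(6, 7, 1)),  # 6 weeks x 7 days
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.Flatten(),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # e.g. probability the user books from this calendar
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()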
10. Other Tasks
● Customer base management and strategy
● A/B and multivariate testing
● Financial target setting
● Liability risk management
● Organizational coaching/training
● GDPR management of 3rd parties
● Website optimization algorithms
● Partner billing/invoicing
If it has data, we deal with it
12. Self-service everywhere
The number one focus in reporting is self-service, and everyone from the CEO down is expected to be comfortable in BI. Reporting is a standard part of new-hire onboarding, and advanced trainings are conducted on a monthly basis.
The best way to get value from data is if everyone mines it
13. Never deliver more than MVP
With our stakeholders we have agreed to always deliver an MVP and iterate together, focusing on shipping quickly and perfecting later. This often means our initial products have incomplete features, only partially validated data, known bugs, etc. Once the core problems are solved, the incomplete solution may remain sufficient for several quarters.
Agile Always
14. Close partners with all teams
Data Science acts as a peer operational team with Marketing, Sales, etc. This means that we join the same operational meetings, have similar targets, and work directly together to solve problems. It also means that every project is jointly run start-to-finish with one or more people from an ops team.
Partnership, not Service
16. T-Shaped People
Every person is expected to be full-stack capable and to understand the general mechanics of everything relating to their domain. This includes technical components (e.g. reporting analysts write ETL and API integrations) as well as functional ones (everyone does their own stakeholder management).
Everyone can do everything
17. Specialists lower complexity for others
We place a large focus on reducing complexity for others when they need to interface with different domains. This means building standardized tools, conducting trainings, and continuous side-by-side coaching and pair programming.
Always be helping
18. Continuous Improvement
Learning and coaching are central to our work. Each quarter, everyone works on projects that are outside their expertise but within their learning goals; 15% of time is reserved for learning and “hack time”; and each person is paired quarterly with another to coach and be coached.
Always Be Learning
20. Architectural Goals
Cheap
We don't want to pay for anything unless we have to, and even then we try not to pay. Our architecture is designed to minimize costs whenever possible by using open source, lots of flexible scaling, and low-cost hosted solutions
Auto-scaling/recovering
With only one engineer and no on-call, our systems should automatically adjust to demand and handle near-catastrophic failure gracefully
Easily flexible
As every person must be able to manage parts of the infrastructure, we have to build tooling and functionality that allow non-engineers to build and destroy servers, scale clusters, and productionalize jobs without any support
21. Our Architecture (Overall)
● Fully AWS hosted
● Mixture of permanent hosts, auto-scaled hosts, and dynamically launched hosts (e.g. for ML jobs)
● Production is built in Django + MySQL
● Data Science architecture is:
○ Postgres + Vertica for databases
○ Kinesis for event buffering
○ Spark, Keras, TensorFlow for ML
○ Airflow + Rundeck for scheduling
○ Redis for real-time data
○ S3 + HDFS + GFS for storage
And Python for EVERYTHING
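To illustrate the scheduling layer, a minimal Airflow DAG sketch (Airflow 1.x style; the DAG and task names are hypothetical, not actual TravelBird jobs):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {"owner": "data-science", "retries": 2, "retry_delay": timedelta(minutes=5)}

with DAG(
    dag_id="nightly_recommender_refresh",  # hypothetical job
    default_args=default_args,
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
) as dag:
    extract = BashOperator(task_id="extract_events", bash_command="python extract_events.py")
    train = BashOperator(task_id="train_model", bash_command="python train_model.py")
    extract >> train  # train only after extraction succeeds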
22. Reporting in Detail: Self-Service
● Structural trainings every month, with a total of 12 hours of training material prepared by the team
● Two tools:
○ Tableau for general reporting
○ Metabase for more technical users (allows raw SQL)
● >80% of all reporting is end-user created and maintained
23. Events in Detail: Real-Time + Micro-batch
Our inspiration: the Lambda architecture
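A minimal sketch of the micro-batch side: draining one Kinesis shard with boto3 (the stream and shard names are hypothetical; a real consumer would iterate all shards and page with NextShardIterator):

import json
import boto3

kinesis = boto3.client("kinesis", region_name="eu-west-1")

iterator = kinesis.get_shard_iterator(
    StreamName="frontend-events",      # hypothetical stream
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON",  # start from the oldest buffered record
)["ShardIterator"]

batch = kinesis.get_records(ShardIterator=iterator, Limit=1000)
events = [json.loads(record["Data"]) for record in batch["Records"]]
print(len(events), "events in this micro-batch")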
24. Machine Learning in Detail
● Used for all the big, sexy analytics
○ Regression on billions of records (see the sketch below)
○ Collaborative filtering
■ The average domain has 15k products and 1.5M training users
● PySpark instead of Scala allows recycling of all our custom Python libraries into ML jobs (rather than rewriting)
● In modern Spark, performance in Python and Scala is about the same (when using Spark functionality)
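A minimal sketch of the regression-at-scale pattern in PySpark (the path and column names are hypothetical):

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("demand-regression").getOrCreate()

df = spark.read.parquet("s3://bucket/events/")  # hypothetical event store
assembler = VectorAssembler(
    inputCols=["price", "discount", "days_to_departure"], outputCol="features"
)
lr = LinearRegression(featuresCol="features", labelCol="bookings")
model = lr.fit(assembler.transform(df))
print(model.coefficients)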
● Used for all the small, sexy analytics
○ Deep learning on session purchase propensity
○ Predicting sellout dates using RNNs (see the sketch below)
● Keras is easier and cleaner to read than raw TensorFlow
● Spark deep learning functionality is underdeveloped at this time
● In deep learning, TF is #1 and Keras #2, so Keras + TF is … #12? Great community and development
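A minimal Keras sketch of the RNN idea, assuming fixed-length sequences of daily sales features; the shapes, layers, and random training data are illustrative only:

import numpy as np
from tensorflow.keras import layers, models

X = np.random.rand(1000, 30, 4)  # 1000 offers x 30 days x 4 features (fabricated)
y = np.random.rand(1000)         # normalized days until sellout (fabricated)

model = models.Sequential([
    layers.LSTM(32, input_shape=(30, 4)),
    layers.Dense(1),             # regression head: predicted days to sellout
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=2, batch_size=64, verbose=0)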
25. The BI-brary and Central Config
Tools we've built to facilitate data science
The Bibrary is a Python library everyone contributes to, containing standardized functionality to be reused for any conceivable task: everything from data management to Spark and TensorFlow functionality
The Executor
The Executor allows anyone to launch servers or clusters, execute code remotely, process data into the database, etc., all from models using a simple JSON configuration block (see the example below)
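A hypothetical example of what such a configuration block might look like; this schema is invented for illustration and is not the actual Bibrary format:

config = {
    "cluster": {"type": "spark", "workers": 4, "instance_type": "r4.xlarge"},
    "job": {"entrypoint": "models/recommender.py", "schedule": "daily"},
    "output": {"table": "dwh.recommendations", "mode": "overwrite"},
}
# executor.run(config)  # hypothetical call that launches the cluster and runs the job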
Auto-DBA
A large part of performance management and optimization is automated, including storage management, likely foreign-key identification, and data security
26. Working in Python exclusively means that
data science is easy
● A simplified recommender model fits in ~20 lines of Python (see the sketch below)
● A data scientist familiar with Python can be working productively in Spark in a few days
● Easy, fast modeling means we can keep iteration time low, increasing the number of tests we can run
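The code image from the original slide is not reproduced here; a minimal PySpark sketch of what a ~20-line recommender might look like (the path and column names are hypothetical):

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("recommender").getOrCreate()

# Hypothetical interaction table: user_id, product_id, implicit rating.
ratings = spark.read.parquet("s3://bucket/interactions/")

als = ALS(
    userCol="user_id",
    itemCol="product_id",
    ratingCol="rating",
    implicitPrefs=True,  # clicks/views rather than explicit scores
    rank=32,
    regParam=0.1,
)
model = als.fit(ratings)
model.recommendForAllUsers(10).show(5)  # top-10 products per user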
27. But the production code is equally easy
This Bibrary function interprets a JSON blob into SQL to determine what content should be sent in an email
● SQL + Python makes it easy for data scientists to understand
● Using a consistent input/output structure means that very little testing is needed when introducing new models, templates, or products
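The real Bibrary function is not reproduced here; a hypothetical sketch of the JSON-to-SQL idea (the schema and helper are invented, and a production version would need parameterized queries rather than string interpolation):

def selection_to_sql(blob):
    """Translate a declarative selection block into a SQL string (illustrative only)."""
    clauses = ["{} {} {}".format(f["column"], f["op"], repr(f["value"]))
               for f in blob["filters"]]
    return "SELECT {} FROM {} WHERE {} LIMIT {}".format(
        ", ".join(blob["select"]), blob["table"], " AND ".join(clauses), blob["limit"])

selection = {
    "table": "offers",
    "select": ["offer_id", "title", "image_url"],
    "filters": [{"column": "market", "op": "=", "value": "NL"},
                {"column": "margin", "op": ">", "value": 0.15}],
    "limit": 6,
}
print(selection_to_sql(selection))
# SELECT offer_id, title, image_url FROM offers WHERE market = 'NL' AND margin > 0.15 LIMIT 6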
29. The Goal
The primary goal of the project was to improve the effectiveness of our marketing attribution model, and thereby the team's ability to spend effectively. To achieve this, the secondary goals were to:
● Identify the largest opportunities for model improvement
● Build, test, and accept model changes for the two largest opportunities
● Conduct a workshop with Marketing on how the changes will impact their channel strategies
30. Project Start: Team and Kickoff
The team consisted of:
● Enzo: Data Analyst studying Data Science
● Bastien: Data Scientist
● Noah: Display marketer
● Colin: SEA marketer
Together they reviewed the products with the lowest performance in attribution and identified likely model factors that could be adjusted to account for the product variances
31. Analysis and Planning
Together they prioritized two changes:
● Last click: use channel conversion propensity to re-weight the last session
● Dynamic journey decay: based on a product's average time-to-purchase, dynamically re-weight older sessions (see the sketch below)
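A minimal sketch of the dynamic-decay idea, assuming exponential decay scaled by the product's average time-to-purchase (the production model's actual functional form is not shown):

import math

def session_weight(days_before_purchase, avg_time_to_purchase):
    """Older sessions count less; fast-selling products decay faster."""
    return math.exp(-days_before_purchase / avg_time_to_purchase)

print(session_weight(10, 5.0))   # old touchpoint, fast-selling product: small weight
print(session_weight(10, 30.0))  # same age, slow-selling product: larger weight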
Together they defined deliverables, timelines, and scope of work, and jointly divided tasks including learning goals:
● Enzo is the better engineer and would supervise Bastien in data pipeline changes
● Bastien is the more experienced Data Scientist and would support Enzo in algorithm development
32. Development and Acceptance
Development
The team used standard deployment scripting to create a sandbox DWH environment and to build new model workers each day, allowing them to easily test and evaluate on 100% of historical data (>300M rows)
Communication
The team communicated progress directly with the CMO and stakeholders, with intermediary acceptance conducted via Slack messages. Colin regularly inspected intermediary output using SQL and Tableau
Final Acceptance
After shipping the model changes, acceptance was conducted as a joint review with marketing team leads. Start to finish, the project took two weeks from agreement to production
33. Knowledge Sharing
To conclude, Bastien conducted a workshop and attribution Q&A with all of Marketing, senior leadership, and other operational folks to explain attribution and how Markov chains work