- How do we currently think about Data Science?
- Why is infrastructure important to our field?
- Two tools we've built on Sailthru's Data Science team to deal with these problems are "Stolos" and "Relay.Mesos".
4. Talk Outline
Part 1:
● What is Data Science
● Where should we spend our time as data scientists?
Part 2:
● How we balance infrastructure, optimization and problem formulation at Sailthru.
19. Components of a Solid Infrastructure
● Lots of Machinery: VMs, Containers
● Machines require coordination, redundancy and fault tolerance: CAP Theorem
20. Components of a Solid Infrastructure
● Resource Allocation: Fair Scheduling, Bin Packing
● Control strategies: Auto Scaling, Feedback, PID
● Communication algorithms: Gossip, Paxos, ...
● Configuration: Dynamic Persistence, Namespaces
● Monitoring: Anomaly Detection, Visualization
● Data Storage: Relational, Graph, Key-Value
● SO MANY TOOLS!
21. So What is Data Science?
● Problem Formulation
● Infrastructure
● Optimization
23. As a Data Scientist, ...
...when do I:
○ build infrastructure that supports my ideas
○ optimize my existing models and problems
○ find new problems to work on
27. ● Sailthru is a personalization platform.
● We help our clients communicate with their customers.
● Our goal is to maximize the lifetime value of these customers so that our clients do well, customers are happy, and Sailthru is successful.
29. Sightlines - Example Use Cases
● Incentivize users with low chance of purchasing
● Personalize discounts above expected order value
● Suppress users likely to opt out of messages
● Engage users unlikely to open on other channels
34. What problem does it solve?
Stolos is a Directed Acyclic Multi-Graph task dependency scheduler designed to simplify complex, distributed pipelines.
It creates application queues that can be consumed from in any order.
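A minimal sketch of dependency-aware queuing, under stated assumptions: this is not Stolos's API, it uses a plain DAG rather than a multi-graph, and it uses one queue rather than one queue per application. The `deps` graph, the task names and the `run_pipeline` helper are hypothetical. A task is queued only once all of its parents have completed.

```python
from collections import defaultdict, deque

# Hypothetical pipeline: task -> set of parent tasks it depends on.
deps = {
    "extract": set(),
    "features": {"extract"},
    "train": {"features"},
    "score": {"features", "train"},
}


def run_pipeline(deps):
    """Return the order in which tasks complete.

    A task is placed on the queue only after every parent has completed.
    """
    # Parents each task is still waiting on.
    remaining = {task: set(parents) for task, parents in deps.items()}

    # Reverse edges: parent -> tasks that depend on it.
    children = defaultdict(set)
    for task, parents in deps.items():
        for parent in parents:
            children[parent].add(task)

    # Start with the tasks that have no parents.
    queue = deque(sorted(t for t, p in remaining.items() if not p))
    completed = []
    while queue:
        task = queue.popleft()
        completed.append(task)  # stand-in for actually running the task
        for child in sorted(children[task]):
            remaining[child].discard(task)
            if not remaining[child]:
                queue.append(child)

    if len(completed) != len(deps):
        raise ValueError("cycle detected: the graph is not acyclic")
    return completed


order = run_pipeline(deps)
print(order)
```

In a distributed setting the single `deque` would be replaced by a shared queue per application, so that separate consumers can drain each application's queue independently.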
38. What problem does it solve?
Relay actively minimizes the difference between a measured signal and a target signal.
Relay.Mesos plugs Relay into a tool called Mesos.
→ Lets us auto-scale consumers of queued Stolos jobs
42. The PID Algorithm
PV = Process Variable (Signal)
SP = Set Point (Target)
MV = Manipulated Variable (Output)
t = index on timesteps
**The “D” in PID is excluded here
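For reference, the textbook discrete PI controller (PID without the derivative term) can be written in the notation above as follows. The gains $K_p$ and $K_i$ and the error term $e_t$ are standard symbols assumed here, not taken from the slide, and the set point is treated as constant for simplicity.

```latex
e_t = SP - PV_t, \qquad
MV_t = K_p \, e_t + K_i \sum_{\tau = 0}^{t} e_\tau
```

The proportional term reacts to the current gap between target and signal, while the integral term accumulates past gaps so that a persistent offset is eventually corrected.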
43. The PID Algorithm
(Same notation as the previous slide.)
The "D" term excluded here is the derivative term: + Kd · Δ(SP − PV)/Δt
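A minimal sketch of a PI control loop, assuming the textbook form of the algorithm rather than Relay's actual implementation. The `PIController` class, the gain values and the toy process (where the output is simply added to the signal at each timestep) are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class PIController:
    kp: float  # proportional gain (assumed value, for illustration)
    ki: float  # integral gain (assumed value, for illustration)
    integral: float = 0.0  # running sum of past errors

    def update(self, pv: float, sp: float) -> float:
        """Return the output MV for this timestep, given signal PV and target SP."""
        error = sp - pv
        self.integral += error
        return self.kp * error + self.ki * self.integral


def simulate(steps: int = 50, sp: float = 10.0) -> list[float]:
    """Drive a toy process toward the set point and return the PV history."""
    controller = PIController(kp=0.5, ki=0.1)
    pv = 0.0
    history = [pv]
    for _ in range(steps):
        mv = controller.update(pv, sp)
        pv += mv  # toy process: the output directly shifts the signal
        history.append(pv)
    return history


history = simulate()
print(f"final PV = {history[-1]:.4f}")
```

In an auto-scaling setting, PV might be a measured load such as queue size and MV the number of consumers to add or remove; that mapping is an assumption here, not a description of how Relay.Mesos is configured.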
48. Sightlines - On Mesos
[Figure: resource diagram with axes labeled CPU Units and RAM, shown twice]