Adam Cataldo discusses how Wealthfront uses data analytics and data flows. Wealthfront is an automated financial advisor that manages portfolios for a low fee. Cataldo works on Wealthfront's data platform, which uses Hadoop and Cascading to process large amounts of data from users, investments, and business operations. This data is used for website optimization, investment research, and monitoring systems. Cascading provides a data flow abstraction to specify transformations across multiple MapReduce jobs. Avro is used to store and transport data efficiently in Hadoop. Results are analyzed in Amazon Redshift for ad-hoc queries.
2. Wealthfront & Me
• Wealthfront is the largest and fastest-growing software-based financial advisor
• We manage the first $10,000 for free, the rest for only 0.25% a year
• Our automated trading system continuously rebalances
a portfolio of low-cost ETFs, with continuous tax-loss
harvesting for accounts over $100,000
• I’ve been working on the data platform we use for
website optimization, investment research, business
analytics, and operations
wealthfront.com | 2
3. Why the Ptolemy conference?
• This is not a talk about modeling, simulation, and
design of concurrent, real-time embedded systems
• This is a talk about the design of a data analytics
system
• It turns out many of the patterns are the same in both
fields
5. Hadoop at a Glance
• Scales well for large data sets
• Industry standard for data processing
• Optimized for throughput batch-processing
• Long latency
• Overkill for small data sets
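Hadoop's batch model boils down to a map step over records followed by a grouped reduce. As a purely illustrative sketch (plain Java streams, not the Hadoop APIs), a word count shows the map/shuffle/reduce shape:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

public class WordCountSketch {
    // "Map" phase: split each record into words;
    // "shuffle + reduce" phase: group by word and sum the counts.
    static Map<String, Long> wordCount(String[] records) {
        return Arrays.stream(records)
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                .filter(w -> !w.isEmpty())
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = wordCount(new String[] {
            "hadoop scales well", "hadoop is batch oriented"
        });
        System.out.println(counts.get("hadoop")); // prints 2
    }
}
```

On a cluster, the map and reduce phases run on different machines and the grouping step becomes a network shuffle; the throughput-over-latency trade-off in the bullets above comes from that batch structure.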
7. Why Cascading?
• Most real problems require multiple MapReduce jobs
• Provides a data-flow abstraction to specify data
transformations
• Builds on standard database concepts: joins, groups,
and so on
• Provides decent testing capabilities, which we’ve
extended
8. From SQL to Cascading
select name from users join mails on users.email = mails.to

Pipe joined = new CoGroup(users, new Fields("email"), mails, new Fields("to"));
Pipe name = new Retain(joined, new Fields("name"));
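To show what the CoGroup computes, the same join can be sketched in plain Java as a hash join: index one side by the join key, then stream the other side and look up matches. The data here is invented, and a real Cascading flow would run this across a cluster:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class JoinSketch {
    // Hash join: build an index on users.email, probe with mails.to.
    static List<String> namesWithMail(Map<String, String> userNameByEmail,
                                      List<String> mailRecipients) {
        List<String> names = new ArrayList<>();
        for (String to : mailRecipients) {
            String name = userNameByEmail.get(to);
            if (name != null) {        // inner join: drop unmatched rows
                names.add(name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        Map<String, String> users = new HashMap<>();
        users.put("ann@example.com", "Ann");
        users.put("bob@example.com", "Bob");
        List<String> mails = List.of("bob@example.com", "ann@example.com",
                                     "eve@example.com");
        System.out.println(namesWithMail(users, mails)); // prints [Bob, Ann]
    }
}
```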
10. Getting data ready for Cascading
[Diagram] Production MySQL DB → extract → Avro files → transform → Avro files → load → Amazon Simple Storage Service
11. Why Avro?
• A compact data format, capable of storing large data sets
• We compress with Google Snappy
• Compressed files are still splittable into 128MB chunks
• De facto file format for Hadoop
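For illustration, a hypothetical Avro schema for a page-view record might look like the following; the record and field names are invented, not Wealthfront's actual schema:

```json
{
  "type": "record",
  "name": "PageView",
  "namespace": "com.example.analytics",
  "fields": [
    {"name": "userId",    "type": "long"},
    {"name": "url",       "type": "string"},
    {"name": "timestamp", "type": "long"},
    {"name": "referrer",  "type": ["null", "string"], "default": null}
  ]
}
```

The schema travels with the data file, which is what lets Hadoop jobs split and deserialize Avro files without out-of-band coordination.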
12. Running Cascading Jobs
[Diagram] Elastic MapReduce connects Amazon Simple Storage Service, the Production MySQL DB, Online Systems, and the Redshift data warehouse
13. What do we do with the data?
• We use it to track how well the investment product is
performing
• We use it to track how well the business is performing
• We use it to monitor our production systems
• We use it to test how well new features perform on the
website
14. Bandit Testing
• When rolling new features out, we expose
the new version to some users and the old
version to the rest
• We monitor what percent of users
“convert”: sign up, fund account, etc.
• We gradually send more traffic to the
winning variant of the experiment
• Similar to A/B testing, but way faster
16. Thompson Sampling
1. Estimate the probability for each variant of the
experiment that it performs best, using Bayesian
inference
2. Weight the percentage of traffic sent to each variant
according to this probability
3. End the experiment when one variant has a 95%
chance of winning, or when the losing arms have no
more than a 5% chance of beating the winner by more
than 1%
4. In 2012, Kaufmann et al. proved the optimality of
Thompson sampling
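Steps 1 and 2 can be sketched with a Beta-Bernoulli model: each arm keeps a Beta posterior over its conversion rate, and each round we play the arm with the largest posterior draw, so traffic weights itself by the probability of being best. This is a generic illustration with made-up conversion rates, not Wealthfront's implementation:

```java
import java.util.Random;

public class ThompsonSketch {
    // Beta(a, b) sampled as X/(X+Y) with X~Gamma(a,1), Y~Gamma(b,1).
    static double sampleBeta(double a, double b, Random rng) {
        double x = sampleGamma(a, rng), y = sampleGamma(b, rng);
        return x / (x + y);
    }

    // Marsaglia-Tsang gamma sampler (shape >= 1), with the usual
    // boosting trick for shape < 1.
    static double sampleGamma(double shape, Random rng) {
        if (shape < 1.0) {
            return sampleGamma(shape + 1.0, rng)
                    * Math.pow(rng.nextDouble(), 1.0 / shape);
        }
        double d = shape - 1.0 / 3.0, c = 1.0 / Math.sqrt(9.0 * d);
        while (true) {
            double z = rng.nextGaussian();
            double v = Math.pow(1.0 + c * z, 3);
            if (v <= 0) continue;
            double u = rng.nextDouble();
            if (Math.log(u) < 0.5 * z * z + d - d * v + d * Math.log(v)) {
                return d * v;
            }
        }
    }

    // One bandit round: sample a conversion rate for each arm from its
    // Beta posterior and pick the arm with the largest draw.
    static int chooseArm(int[] successes, int[] failures, Random rng) {
        int best = 0;
        double bestDraw = -1.0;
        for (int i = 0; i < successes.length; i++) {
            double draw = sampleBeta(successes[i] + 1, failures[i] + 1, rng);
            if (draw > bestDraw) { bestDraw = draw; best = i; }
        }
        return best;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        double[] trueRates = {0.05, 0.11};  // hypothetical conversion rates
        int[] successes = new int[2], failures = new int[2];
        int[] pulls = new int[2];
        for (int t = 0; t < 5000; t++) {
            int arm = chooseArm(successes, failures, rng);
            pulls[arm]++;
            if (rng.nextDouble() < trueRates[arm]) successes[arm]++;
            else failures[arm]++;
        }
        // Traffic concentrates on the better arm as evidence accumulates.
        System.out.println(pulls[1] > pulls[0]);
    }
}
```

The stopping rule in step 3 can then be estimated by Monte Carlo: draw repeatedly from every arm's posterior and count how often each arm's draw is the largest.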
17. What’s Redshift?
• Amazon’s cloud-based data
warehouse database
• To support ad-hoc analysis,
we copy all raw and computed
data into Redshift
• It’s a column-oriented
database, optimized for
aggregate queries and joins
over large batch sizes
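As a hypothetical example of the load-then-query pattern (the table, bucket, and IAM role names are invented), Redshift can ingest Avro files straight from S3 with COPY and then answer the kind of aggregate query a columnar store handles well:

```sql
-- Load computed Avro results from S3
COPY conversions
FROM 's3://example-bucket/output/conversions/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS AVRO 'auto';

-- Ad-hoc aggregate: signups per experiment variant
SELECT variant, COUNT(*) AS signups
FROM conversions
WHERE event = 'signup'
GROUP BY variant
ORDER BY signups DESC;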
18. What are the technical challenges?
• Testing complicated analytics computations is nontrivial
  - We ended up writing a small library to make testing
    Cascading jobs simpler
• Running multiple Hadoop jobs on large datasets takes a
  long time
  - We use Spark for prototyping, to get a speedup
• Your assumptions about the constraints on the data are
  always wrong
19. Where’s this heading?
• We have a unique collection of
consumer web data and
financial data
• There are many ways we can
combine this data to make our
product better
• Hypothetical example: suggest
portfolio risk adjustments
based on a client’s withdrawal
patterns
20. How is this relevant?
• We use data flow as the
primary model of computation
• While the time scales are much
slower, we have timing
constraints, called SLAs,
imposed by production use
cases
• We have to make sure all code
can safely execute
concurrently on multiple
machines, cores, and threads
21. Disclosure
Nothing in this presentation should be construed as
a solicitation or offer, or recommendation, to buy
or sell any security. Financial advisory services
are only provided to investors who become
Wealthfront clients pursuant to a written agreement,
which investors are urged to read and carefully
consider in determining whether such agreement is
suitable for their individual facts and
circumstances. Past performance is no guarantee of
future results, and any hypothetical returns,
expected returns, or probability projections may not
reflect actual future performance. Investors should
review Wealthfront’s website for additional
information about advisory services.