Adam Cataldo discusses how Wealthfront uses data analytics and data flows. Wealthfront is an automated financial advisor that manages portfolios for a low fee. Cataldo works on Wealthfront's data platform, which uses Hadoop and Cascading to process large amounts of data from users, investments, and business operations. This data is used for website optimization, investment research, and monitoring systems. Cascading provides a data flow abstraction to specify transformations across multiple MapReduce jobs. Avro is used to store and transport data efficiently in Hadoop. Results are analyzed in Amazon Redshift for ad-hoc queries.
2. Wealthfront & Me
• Wealthfront is the largest and fastest-growing software-based financial advisor
• We manage the first $10,000 for free, the rest for only 0.25% a year
• Our automated trading system continuously rebalances
a portfolio of low-cost ETFs, with continuous tax-loss
harvesting for accounts over $100,000
• I’ve been working on the data platform we use for
website optimization, investment research, business
analytics, and operations
wealthfront.com | 2
3. Why the Ptolemy conference?
• This is not a talk about modeling, simulation, and
design of concurrent, real-time embedded systems
• This is a talk about the design of a data analytics
system
• It turns out many of the patterns are the same in both
fields
5. Hadoop at a Glance
• Scales well for large data sets
• Industry standard for data processing
• Optimized for throughput batch-processing
• Long latency
• Overkill for small data sets
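Hadoop's batch model boils down to a map step over records followed by a grouped reduce. As a purely illustrative sketch (plain Java streams, not the Hadoop APIs), a word count shows the map/shuffle/reduce shape:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

public class WordCountSketch {
    // "Map" phase: split each record into words;
    // "shuffle + reduce" phase: group by word and sum the counts.
    static Map<String, Long> wordCount(String[] records) {
        return Arrays.stream(records)
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                .filter(w -> !w.isEmpty())
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = wordCount(new String[] {
            "hadoop scales well", "hadoop is batch oriented"
        });
        System.out.println(counts.get("hadoop")); // prints 2
    }
}
```

On a cluster, the map and reduce phases run on different machines and the grouping step becomes a network shuffle; the throughput-over-latency trade-off in the bullets above comes from that batch structure.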
7. Why Cascading?
• Most real problems require multiple MapReduce jobs
• Provides a data-flow abstraction to specify data
transformations
• Builds on standard database concepts: joins, groups,
and so on
• Provides decent testing capabilities, which we’ve
extended
8. From SQL to Cascading
select name from users join mails on users.email = mails.to

Pipe joined = new CoGroup(users, new Fields("email"), mails, new Fields("to"));
Pipe name = new Retain(joined, new Fields("name"));
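To show what the CoGroup computes, the same join can be sketched in plain Java as a hash join: index one side by the join key, then stream the other side and look up matches. The data here is invented, and a real Cascading flow would run this across a cluster:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class JoinSketch {
    // Hash join: build an index on users.email, probe with mails.to.
    static List<String> namesWithMail(Map<String, String> userNameByEmail,
                                      List<String> mailRecipients) {
        List<String> names = new ArrayList<>();
        for (String to : mailRecipients) {
            String name = userNameByEmail.get(to);
            if (name != null) {        // inner join: drop unmatched rows
                names.add(name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        Map<String, String> users = new HashMap<>();
        users.put("ann@example.com", "Ann");
        users.put("bob@example.com", "Bob");
        List<String> mails = List.of("bob@example.com", "ann@example.com",
                                     "eve@example.com");
        System.out.println(namesWithMail(users, mails)); // prints [Bob, Ann]
    }
}
```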
10. Getting data ready for Cascading
[Diagram] Production MySQL DB → extract → Avro files → transform → Avro files → load → Amazon Simple Storage Service
11. Why Avro?
• A compact data format, capable of storing large data sets
• We compress with Google Snappy
• Compressed files are still splittable into 128MB chunks
• De facto file format for Hadoop
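For illustration, a hypothetical Avro schema for a page-view record might look like the following; the record and field names are invented, not Wealthfront's actual schema:

```json
{
  "type": "record",
  "name": "PageView",
  "namespace": "com.example.analytics",
  "fields": [
    {"name": "userId",    "type": "long"},
    {"name": "url",       "type": "string"},
    {"name": "timestamp", "type": "long"},
    {"name": "referrer",  "type": ["null", "string"], "default": null}
  ]
}
```

The schema travels with the data file, which is what lets Hadoop jobs split and deserialize Avro files without out-of-band coordination.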
12. Running Cascading Jobs
[Diagram] Elastic MapReduce connects Amazon Simple Storage Service, the Production MySQL DB, Online Systems, and the Redshift data warehouse
13. What do we do with the data?
• We use it to track how well the investment product is
performing
• We use it to track how well the business is performing
• We use it to monitor our production systems
• We use it to test how well new features perform on the
website
14. Bandit Testing
• When rolling new features out, we expose
the new version to some users and the old
version to the rest
• We monitor what percent of users
“convert”: sign up, fund account, etc.
• We gradually send more traffic to the
winning variant of the experiment
• Similar to A/B testing, but way faster
16. Thompson Sampling
1. Estimate the probability for each variant of the
experiment that it performs best, using Bayesian
inference
2. Weight the percentage of traffic sent to each variant
according to this probability
3. End the experiment when one variant has a 95%
chance of winning, or when the losing arms have no
more than a 5% chance of beating the winner by more
than 1%
4. In 2012, Kaufmann et al. proved the optimality of
Thompson sampling
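Steps 1 and 2 can be sketched with a Beta-Bernoulli model: each arm keeps a Beta posterior over its conversion rate, and each round we play the arm with the largest posterior draw, so traffic weights itself by the probability of being best. This is a generic illustration with made-up conversion rates, not Wealthfront's implementation:

```java
import java.util.Random;

public class ThompsonSketch {
    // Beta(a, b) sampled as X/(X+Y) with X~Gamma(a,1), Y~Gamma(b,1).
    static double sampleBeta(double a, double b, Random rng) {
        double x = sampleGamma(a, rng), y = sampleGamma(b, rng);
        return x / (x + y);
    }

    // Marsaglia-Tsang gamma sampler (shape >= 1), with the usual
    // boosting trick for shape < 1.
    static double sampleGamma(double shape, Random rng) {
        if (shape < 1.0) {
            return sampleGamma(shape + 1.0, rng)
                    * Math.pow(rng.nextDouble(), 1.0 / shape);
        }
        double d = shape - 1.0 / 3.0, c = 1.0 / Math.sqrt(9.0 * d);
        while (true) {
            double z = rng.nextGaussian();
            double v = Math.pow(1.0 + c * z, 3);
            if (v <= 0) continue;
            double u = rng.nextDouble();
            if (Math.log(u) < 0.5 * z * z + d - d * v + d * Math.log(v)) {
                return d * v;
            }
        }
    }

    // One bandit round: sample a conversion rate for each arm from its
    // Beta posterior and pick the arm with the largest draw.
    static int chooseArm(int[] successes, int[] failures, Random rng) {
        int best = 0;
        double bestDraw = -1.0;
        for (int i = 0; i < successes.length; i++) {
            double draw = sampleBeta(successes[i] + 1, failures[i] + 1, rng);
            if (draw > bestDraw) { bestDraw = draw; best = i; }
        }
        return best;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        double[] trueRates = {0.05, 0.11};  // hypothetical conversion rates
        int[] successes = new int[2], failures = new int[2];
        int[] pulls = new int[2];
        for (int t = 0; t < 5000; t++) {
            int arm = chooseArm(successes, failures, rng);
            pulls[arm]++;
            if (rng.nextDouble() < trueRates[arm]) successes[arm]++;
            else failures[arm]++;
        }
        // Traffic concentrates on the better arm as evidence accumulates.
        System.out.println(pulls[1] > pulls[0]);
    }
}
```

The stopping rule in step 3 can then be estimated by Monte Carlo: draw repeatedly from every arm's posterior and count how often each arm's draw is the largest.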
17. What’s Redshift?
• Amazon’s cloud-based data
warehouse database
• To support ad-hoc analysis,
we copy all raw and computed
data into Redshift
• It’s a column-oriented
database, optimized for
aggregate queries and joins
over large batch sizes
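As a hypothetical example of the load-then-query pattern (the table, bucket, and IAM role names are invented), Redshift can ingest Avro files straight from S3 with COPY and then answer the kind of aggregate query a columnar store handles well:

```sql
-- Load computed Avro results from S3
COPY conversions
FROM 's3://example-bucket/output/conversions/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
FORMAT AS AVRO 'auto';

-- Ad-hoc aggregate: signups per experiment variant
SELECT variant, COUNT(*) AS signups
FROM conversions
WHERE event = 'signup'
GROUP BY variant
ORDER BY signups DESC;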
18. What are the technical challenges?
• Testing complicated analytics computations is nontrivial
  - We ended up writing a small library to make testing
    Cascading jobs simpler
• Running multiple Hadoop jobs on large datasets takes a
  long time
  - We use Spark for prototyping, to get a speedup
• Your assumptions about the constraints on the data are
  always wrong
19. Where’s this heading?
• We have a unique collection of
consumer web data and
financial data
• There are many ways we can
combine this data to make our
product better
• Hypothetical example: suggest
portfolio risk adjustments
based on a client’s withdrawal
patterns
20. How is this relevant?
• We use data flow as the
primary model of computation
• While the time scales are much
slower, we have timing
constraints, called SLAs,
imposed by production use
cases
• We have to make sure all code
can safely execute
concurrently on multiple
machines, cores, and threads
21. Disclosure
Nothing in this presentation should be construed as
a solicitation or offer, or recommendation, to buy
or sell any security. Financial advisory services
are only provided to investors who become
Wealthfront clients pursuant to a written agreement,
which investors are urged to read and carefully
consider in determining whether such agreement is
suitable for their individual facts and
circumstances. Past performance is no guarantee of
future results, and any hypothetical returns,
expected returns, or probability projections may not
reflect actual future performance. Investors should
review Wealthfront’s website for additional
information about advisory services.