wealthfront.com

DATA FLOW
IN THE DATA CENTER

Adam Cataldo @djscrooge
November 7, 2013
Wealthfront & Me
• Wealthfront is the largest and fastest growing software-based financial advisor
• We manage the first $10,000 for free and the rest for only
0.25% a year
• Our automated trading system continuously rebalances
a portfolio of low-cost ETFs, with continuous tax-loss
harvesting for accounts over $100,000
• I’ve been working on the data platform we use for
website optimization, investment research, business
analytics, and operations

Why the Ptolemy conference?
• This is not a talk about modeling, simulation, and
design of concurrent, real-time embedded systems
• This is a talk about the design of a data analytics
system
• It turns out many of the patterns are the same in both
fields

MapReduce & Hadoop

Hadoop at a Glance
• Scales well for large data sets
• Industry standard for data processing
• Optimized for high-throughput batch processing

• Long latency
• Overkill for small data sets

Cascading

Why Cascading?
• Most real problems require multiple MapReduce jobs
• Provides a data-flow abstraction to specify data
transformations
• Builds on standard database concepts: joins, groups,
and so on
• Provides decent testing capabilities, which we’ve
extended

From SQL to Cascading

select name from users join mails on users.email = mails.to

// Cascading passes field names as cascading.tuple.Fields
Pipe joined = new CoGroup(users, new Fields("email"), mails, new Fields("to"));
Pipe name = new Retain(joined, new Fields("name"));

Cascading to Hadoop

[Diagram: the users and mails inputs each feed their own mappers; the join reducers combine the mapper outputs into the result.]
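
The mapper and reducer stages named on this slide can be read as a reduce-side join. The following is a minimal plain-Java sketch of that idea under stated assumptions, not what Cascading or Hadoop actually generates: rows are grouped by join key (the role played by the mappers and the shuffle), and each key present on both sides is then paired up (the role played by the join reducers). The class name, the column layout, and the sample records are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative reduce-side join; all names and data here are hypothetical. */
public class ReduceSideJoinSketch {

    /** Groups rows by the value in the given column, as the shuffle phase would. */
    public static Map<String, List<String[]>> groupByKey(List<String[]> rows, int keyColumn) {
        Map<String, List<String[]>> groups = new TreeMap<>();
        for (String[] row : rows) {
            groups.computeIfAbsent(row[keyColumn], k -> new ArrayList<>()).add(row);
        }
        return groups;
    }

    /**
     * Joins users (name, email) with mails (to, subject) on email = to and
     * returns the user name once per matching mail, mirroring
     * "select name from users join mails on users.email = mails.to".
     */
    public static List<String> joinNames(List<String[]> users, List<String[]> mails) {
        Map<String, List<String[]>> usersByEmail = groupByKey(users, 1);
        Map<String, List<String[]>> mailsByTo = groupByKey(mails, 0);
        List<String> names = new ArrayList<>();
        // "Reduce" step: for each key present on both sides, emit the cross product.
        for (Map.Entry<String, List<String[]>> entry : usersByEmail.entrySet()) {
            List<String[]> matchingMails = mailsByTo.get(entry.getKey());
            if (matchingMails == null) {
                continue; // inner join: users without mail are dropped
            }
            for (String[] user : entry.getValue()) {
                for (int i = 0; i < matchingMails.size(); i++) {
                    names.add(user[0]);
                }
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<String[]> users = List.of(
                new String[] {"Ada", "ada@example.com"},
                new String[] {"Bob", "bob@example.com"});
        List<String[]> mails = List.of(
                new String[] {"ada@example.com", "Welcome"},
                new String[] {"ada@example.com", "Statement"},
                new String[] {"carol@example.com", "Hello"});
        System.out.println(joinNames(users, mails)); // prints [Ada, Ada]
    }
}
```

In a real cluster the grouping happens across machines, so the join key decides which reducer sees a given record; the in-memory maps here only stand in for that routing.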

Getting data ready for Cascading

[Diagram: an extract, transform, load pipeline. Labels shown: Production MySQL DB, Avro files, Amazon Simple Storage Service.]

Why Avro?

• A compact data format, capable of storing large data sets
• We compress with Google
Snappy
• Compressed files remain splittable
into 128 MB chunks
• De facto file format for
Hadoop

Running Cascading Jobs
[Diagram: components shown are Production MySQL DB, Amazon Simple Storage Service, Elastic MapReduce, Online Systems, and the Redshift data warehouse.]

What do we do with the data?
• We use it to track how well the investment product is
performing
• We use it to track how well the business is performing
• We use it to monitor our production systems
• We use it to test how well new features perform on the
website

Bandit Testing
• When rolling new features out, we expose
the new version to some users and the old
version to the rest
• We monitor what percent of users
“convert”: sign up, fund account, etc.
• We gradually send more traffic to the
winning variant of the experiment
• Similar to A/B testing, but way faster

Does anyone know where the name bandit testing comes from?

Thompson Sampling
1. Estimate the probability for each variant of the
experiment that it performs best, using Bayesian
inference
2. Weight the percentage of traffic sent to each variant
according to this probability
3. End the experiment when one variant has a 95%
chance of winning, or when the losing arms have no
more than a 5% chance of beating the winner by more
than 1%
4. In 2012, Kaufmann et al. proved the asymptotic optimality of
Thompson sampling
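
Steps 1 to 3 can be sketched in plain Java as follows. This is a minimal illustration under stated assumptions, not Wealthfront's implementation: conversions are treated as Bernoulli outcomes, each variant starts from a uniform Beta(1, 1) prior, the probability of being best is estimated by Monte Carlo, and only the first stopping condition (95% chance of winning) is checked. The class name, method names, and counts are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

/** Illustrative Thompson sampling for conversion experiments; names are hypothetical. */
public class ThompsonSamplingSketch {

    /**
     * Draws from Beta(a, b) for positive integers a and b, using the fact that the
     * a-th smallest of (a + b - 1) uniform draws is Beta(a, b) distributed.
     * Simple but slow for large counts; a real system would use a library sampler.
     */
    public static double sampleBeta(int a, int b, Random rng) {
        double[] draws = new double[a + b - 1];
        for (int i = 0; i < draws.length; i++) {
            draws[i] = rng.nextDouble();
        }
        Arrays.sort(draws);
        return draws[a - 1];
    }

    /**
     * Step 1: estimates, by Monte Carlo, the probability that each variant has the
     * highest conversion rate, assuming a uniform Beta(1, 1) prior per variant.
     */
    public static double[] probabilityBest(int[] conversions, int[] visitors, int rounds, Random rng) {
        int[] wins = new int[conversions.length];
        for (int r = 0; r < rounds; r++) {
            int best = 0;
            double bestRate = -1.0;
            for (int v = 0; v < conversions.length; v++) {
                // Posterior is Beta(1 + successes, 1 + failures).
                double rate = sampleBeta(1 + conversions[v], 1 + visitors[v] - conversions[v], rng);
                if (rate > bestRate) {
                    bestRate = rate;
                    best = v;
                }
            }
            wins[best]++;
        }
        double[] probabilities = new double[wins.length];
        for (int v = 0; v < wins.length; v++) {
            probabilities[v] = (double) wins[v] / rounds;
        }
        return probabilities;
    }

    /** Step 3, first condition only: true once some variant is at least 95% likely to be best. */
    public static boolean hasWinner(double[] probabilities) {
        for (double p : probabilities) {
            if (p >= 0.95) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        int[] conversions = {12, 30};   // hypothetical counts: old version, new version
        int[] visitors = {200, 200};
        // Step 2: these probabilities double as the traffic weights for the next period.
        double[] weights = probabilityBest(conversions, visitors, 2000, rng);
        System.out.println(Arrays.toString(weights) + " winner=" + hasWinner(weights));
    }
}
```

Because the traffic weights track the posterior, a clearly losing variant quickly receives little traffic, which is one way to read the "way faster" comparison with fixed-split A/B testing.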
What’s Redshift?
• Amazon’s cloud-based data
warehouse database
• To support ad-hoc analysis,
we copy all raw and computed
data into Redshift
• It’s a column-oriented
database, optimized for
aggregate queries and joins
over large batch sizes
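
As a toy illustration of why a column-oriented layout suits aggregate queries, the sketch below sums one field over the same table stored two ways. It is not a description of Redshift internals; the class, the Deposit record shape, and the sample values are hypothetical.

```java
import java.util.List;

/** Toy contrast of row and column layouts; not a description of Redshift internals. */
public class ColumnStoreSketch {

    /** Row layout: each record keeps all of its fields together. */
    public static final class Deposit {
        final long accountId;
        final String channel;
        final double amount;

        public Deposit(long accountId, String channel, double amount) {
            this.accountId = accountId;
            this.channel = channel;
            this.amount = amount;
        }
    }

    /** Row-oriented sum: every whole record is visited to read one field. */
    public static double sumAmountsByRow(List<Deposit> rows) {
        double total = 0.0;
        for (Deposit row : rows) {
            total += row.amount;
        }
        return total;
    }

    /** Column-oriented sum: only the contiguous "amount" column is scanned. */
    public static double sumAmountsByColumn(double[] amountColumn) {
        double total = 0.0;
        for (double amount : amountColumn) {
            total += amount;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Deposit> rows = List.of(
                new Deposit(1L, "web", 100.0),
                new Deposit(2L, "mobile", 250.0),
                new Deposit(3L, "web", 50.0));
        // The same table stored column by column; only one column is needed here.
        double[] amountColumn = {100.0, 250.0, 50.0};
        System.out.println(sumAmountsByRow(rows));            // prints 400.0
        System.out.println(sumAmountsByColumn(amountColumn)); // prints 400.0
    }
}
```

Both methods return the same total; the difference is how much unrelated data has to be read to compute it, which grows with the number of columns in a row-oriented table.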

What are the technical challenges?
• Testing complicated analytics computations is nontrivial
  - We ended up writing a small library to make testing Cascading jobs simpler
• Running multiple Hadoop jobs on large datasets takes a long time
  - We use Spark for prototyping, to get a speedup
• Your assumptions about the constraints on the data are always wrong

Where’s this heading?
• We have a unique collection of
consumer web data and
financial data
• There are many ways we can
combine this data to make our
product better
• Hypothetical example: suggest
portfolio risk adjustments
based on a client’s withdrawal
patterns

How is this relevant?
• We use data flow as the
primary model of computation
• While the time scales are much
slower, we have timing
constraints, called SLAs,
imposed by production use
cases
• We have to make sure all code
can safely execute
concurrently on multiple
machines, cores, and threads

Disclosure
Nothing in this presentation should be construed as
a solicitation or offer, or recommendation, to buy
or sell any security. Financial advisory services
are only provided to investors who become
Wealthfront clients pursuant to a written agreement,
which investors are urged to read and carefully
consider in determining
whether such agreement is
suitable for their individual facts and
circumstances. Past performance is no guarantee of
future results, and any hypothetical returns,
expected returns, or probability projections may not
reflect actual future performance. Investors should
review Wealthfront’s website for additional
information about advisory services.

Data flow in the data center

  • 1. wealthfront.com DATA FLOW IN THE DATA CENTER Adam Cataldo @djscrooge November 7, 2013
  • 2. Wealthfront & Me • Wealthfront is the largest and fastest growing softwarebased financial advisor • We manage the first $10,000 for free the rest for only 0.25% a year • Our automated trading system continuously rebalances a portfolio of low-cost ETFs, with continuous tax-loss harvesting for accounts over $100,000 • I’ve been working on the data platform we use for website optimization, investment research, business analytics, and operations wealthfront.com | 2
  • 3. Why the Ptolemy conference? • This is not a talk about modeling, simulation, and design of concurrent, real-time embedded systems • This is a talk about the design of a data analytics system • It turns out many of the patterns are the same in both fields
  • 5. Hadoop at a Glance • Scales well for large data sets • Industry standard for data processing • Optimized for throughput in batch processing • Long latency • Overkill for small data sets
  • 7. Why Cascading? • Most real problems require multiple MapReduce jobs • Provides a data-flow abstraction to specify data transformations • Builds on standard database concepts: joins, groups, and so on • Provides decent testing capabilities, which we’ve extended
  • 8. From SQL to Cascading select name from users join mails on users.email = mails.to Pipe joined = new CoGroup(users, new Fields("email"), mails, new Fields("to")); Pipe name = new Retain(joined, new Fields("name"));
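To make the join on slide 8 concrete, here is a minimal pure-Python sketch of a reduce-side join on users.email = mails.to. It does not use Cascading or Hadoop; the helper names `co_group` and `retain` and the sample records are hypothetical, chosen only to mirror the slide.

```python
from collections import defaultdict
from itertools import product


def co_group(lhs, lhs_key, rhs, rhs_key):
    """Inner-join two lists of dicts, mimicking a reduce-side join."""
    # "Map" phase: route each record into its join key's bucket,
    # keeping left-side and right-side records apart.
    buckets = defaultdict(lambda: ([], []))
    for row in lhs:
        buckets[row[lhs_key]][0].append(row)
    for row in rhs:
        buckets[row[rhs_key]][1].append(row)
    # "Reduce" phase: for each key, emit every left/right pairing.
    joined = []
    for left_rows, right_rows in buckets.values():
        for left, right in product(left_rows, right_rows):
            joined.append({**left, **right})
    return joined


def retain(rows, field):
    """Keep a single field of each joined record (the SELECT list)."""
    return [row[field] for row in rows]


# Hypothetical sample data, not taken from the talk.
users = [
    {"name": "Ada", "email": "ada@example.com"},
    {"name": "Bob", "email": "bob@example.com"},
]
mails = [
    {"to": "ada@example.com", "subject": "Welcome"},
    {"to": "ada@example.com", "subject": "Statement"},
    {"to": "nobody@example.com", "subject": "Bounced"},
]

names = retain(co_group(users, "email", mails, "to"), "name")
print(names)  # one entry per matching mail
```

Bob has no mail and the bounced mail has no user, so neither appears in the result, which is the inner-join behavior the SQL asks for.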
  • 10. Getting data ready for Cascading [Diagram: an extract, transform, load pipeline from the production MySQL DB through Avro files to Amazon Simple Storage Service]
  • 11. Why Avro? • A compact data format, capable of storing large data sets • We compress with Google Snappy • Compressed data is splittable into 128 MB chunks • De facto file format for Hadoop
  • 12. Running Cascading Jobs [Diagram: components are the production MySQL DB, Amazon Simple Storage Service, Elastic MapReduce, online systems, and the Redshift data warehouse]
  • 13. What do we do with the data? • We use it to track how well the investment product is performing • We use it to track how well the business is performing • We use it to monitor our production systems • We use it to test how well new features perform on the website
  • 14. Bandit Testing • When rolling new features out, we expose the new version to some users and the old version to the rest • We monitor what percentage of users “convert”: sign up, fund account, etc. • We gradually send more traffic to the winning variant of the experiment • Similar to A/B testing, but way faster
  • 15. Does anyone know where the name bandit testing comes from?
  • 16. Thompson Sampling 1. Estimate, for each variant of the experiment, the probability that it performs best, using Bayesian inference 2. Weight the percentage of traffic sent to each variant according to this probability 3. End the experiment when one variant has a 95% chance of winning, or when the losing arms have no more than a 5% chance of beating the winner by more than 1% 4. In 2012, Kaufmann et al. proved the asymptotic optimality of Thompson sampling
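The steps on slide 16 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the production system described in the talk. It assumes each conversion is a Bernoulli outcome with a uniform Beta(1, 1) prior and estimates the "probability of being best" by Monte Carlo sampling. It implements only the first stopping rule (95% chance of winning). The function names, batch size, and simulated `true_rates` are hypothetical.

```python
import random


def prob_best(successes, failures, draws=20000, rng=None):
    """Monte Carlo estimate of P(variant i has the highest conversion rate)."""
    rng = rng or random.Random(0)
    wins = [0] * len(successes)
    for _ in range(draws):
        # One sample from each variant's Beta posterior (uniform prior).
        samples = [rng.betavariate(1 + s, 1 + f)
                   for s, f in zip(successes, failures)]
        wins[samples.index(max(samples))] += 1
    return [w / draws for w in wins]


def run_experiment(true_rates, batch=200, max_batches=100, seed=1):
    """Simulate a bandit test against made-up true conversion rates."""
    rng = random.Random(seed)
    n = len(true_rates)
    successes, failures = [0] * n, [0] * n
    for _ in range(max_batches):
        # Step 1: estimate the probability that each variant is best.
        weights = prob_best(successes, failures, draws=2000, rng=rng)
        # Step 3 (first condition only): stop at a 95% chance of winning.
        if max(weights) >= 0.95:
            break
        # Step 2: split the next batch of traffic according to those weights.
        for _ in range(batch):
            arm = rng.choices(range(n), weights=weights)[0]
            if rng.random() < true_rates[arm]:
                successes[arm] += 1
            else:
                failures[arm] += 1
    return weights, successes, failures


# Simulated experiment: old version converts at 5%, new version at 10%.
weights, successes, failures = run_experiment([0.05, 0.10])
print("traffic weights:", weights)
print("users per variant:", [s + f for s, f in zip(successes, failures)])
```

Because traffic follows the posterior weights, a variant that looks worse receives fewer users as evidence accumulates, which is where the speedup over a fixed 50/50 A/B split comes from.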
  • 17. What’s Redshift? • Amazon’s cloud-based data warehouse database • To support ad-hoc analysis, we copy all raw and computed data into Redshift • It’s a column-oriented database, optimized for aggregate queries and joins over large batch sizes
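As a rough illustration of why a column-oriented layout suits aggregate queries, the following sketch stores the same small table both ways. It is plain Python, not Redshift, and the table and column names are made up for the example.

```python
# Hypothetical table of deposits; column names are illustrative only.
rows = [
    {"account_id": 1, "state": "CA", "amount": 5000.0},
    {"account_id": 2, "state": "NY", "amount": 12000.0},
    {"account_id": 3, "state": "CA", "amount": 800.0},
]

# Row-oriented layout: each record is stored together, so an aggregate
# over one field still walks every full record.
row_total = sum(row["amount"] for row in rows)

# Column-oriented layout: each column is stored as its own sequence.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# The same aggregate now reads only the "amount" column.
col_total = sum(columns["amount"])

# A grouped aggregate reads two columns and ignores the rest.
by_state = {}
for state, amount in zip(columns["state"], columns["amount"]):
    by_state[state] = by_state.get(state, 0.0) + amount

print(row_total, col_total, by_state)
```

On a real warehouse the gain comes from reading and decompressing far less data per query. The sketch only shows the access pattern.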
  • 18. What are the technical challenges? • Testing complicated analytics computations is nontrivial - We ended up writing a small library to make testing Cascading jobs simpler • Running multiple Hadoop jobs on large datasets takes a long time - We use Spark for prototyping, to get a speedup • Your assumptions about the constraints on the data are always wrong
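One common way to make analytics computations testable is to keep each transformation a pure function over in-memory tuples, then feed it small inputs that deliberately break the usual assumptions about the data. The sketch below shows that idea in plain Python. It is not the testing library mentioned on the slide, and the `conversion_rates` function and event layout are hypothetical.

```python
from collections import Counter


def conversion_rates(events):
    """Compute per-variant conversion rates from (user_id, variant, converted)
    tuples, counting each user at most once per variant."""
    converted_by_user = {}
    for user_id, variant, did_convert in events:
        if variant is None:
            continue  # tolerate records that are missing a variant
        key = (user_id, variant)
        # A user counts as converted if any of their events converted.
        converted_by_user[key] = converted_by_user.get(key, False) or bool(did_convert)

    exposed, converted = Counter(), Counter()
    for (_, variant), did_convert in converted_by_user.items():
        exposed[variant] += 1
        converted[variant] += did_convert
    return {variant: converted[variant] / exposed[variant] for variant in exposed}


# Tiny in-memory input with the kinds of surprises real data contains.
events = [
    (1, "old", False),
    (2, "old", True),
    (2, "old", True),   # duplicate event for the same user
    (3, "new", True),
    (4, "new", False),
    (4, "new", True),   # user converts on a later event
    (5, None, True),    # record with no variant assigned
]

rates = conversion_rates(events)
print(rates)
```

Because the function takes and returns ordinary Python values, the same logic can be checked in milliseconds before it is wired into a cluster job.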
  • 19. Where’s this heading? • We have a unique collection of consumer web data and financial data • There are many ways we can combine this data to make our product better • Hypothetical example: suggest portfolio risk adjustments based on a client’s withdrawal patterns
  • 20. How is this relevant? • We use data flow as the primary model of computation • While the time scales are much slower, we have timing constraints, called SLAs, imposed by production use cases • We have to make sure all code can safely execute concurrently on multiple machines, cores, and threads
  • 21. Disclosure Nothing in this presentation should be construed as a solicitation or offer, or recommendation, to buy or sell any security. Financial advisory services are only provided to investors who become Wealthfront clients pursuant to a written agreement, which investors are urged to read and carefully consider in determining whether such agreement is suitable for their individual facts and circumstances. Past performance is no guarantee of future results, and any hypothetical returns, expected returns, or probability projections may not reflect actual future performance. Investors should review Wealthfront’s website for additional information about advisory services.