While business analysis rapidly grows more data-driven, the analyst community is slow to adapt the best practices of data science workflows. Many parallels exists between data science “top topics” (e.g. reproducibility) and business pain points, but these common needs are obscured by the different “languages” of these two communities. The opportunity cost is greatest in heavily regulated industries such as finance and insurance where documentation and compliance are paramount.
In this talk, we will review our experience transitioning Capital One business analysts from legacy systems to open-source workflows by developing user-friendly tools. We incentivized business analysts to adopt the data science mindset by curating open-source tools and developing code packages which simplify workflows and eliminate pain points.
Our internal R package, tidycf, reimagines cumbersome Excel cashflow statements as dataframes and uses RMarkdown templates and the RStudio IDE for an intuitive, user-friendly experience without the overhead of maintaining a custom GUI. We tackle challenges in documentation and communication while immersing new users in the R language.
We will share best practices and lessons learned from our experience designing tools for non-technical end-users, standardizing workflows based on the RStudio IDE’s infrastructure, and evangelizing data science methods.
What's in your workflow? Bringing data science workflows to business analysis at Capital One
1. what’s in your workflow?
reproducible business analysis at Capital One
Emily Riederer
Sr. Analyst, Capital One
@EmilyRiederer / emily.riederer@capitalone.com
3. Emily Riederer, Capital One (@EmilyRiederer) References on GitHub: emilyriederer/references/WIYW.md
The replicability crisis empowered the open science movement and turned reproducibility
and workflows into hot topics
"The Myth of Self-Correcting Science"
The Atlantic
Sarah Estes
December 20, 2012
"How science goes wrong"
The Economist
Cover Story
October 21, 2013
"Psychology's Replication Crisis has a Silver Lining"
The Atlantic
Paul Bloom
Feb 18, 2016
"When the Revolution Came for Amy Cuddy"
New York Times Magazine
Susan Dominus
October 18, 2017
4. Emily Riederer, Capital One (@EmilyRiederer) References on GitHub: emilyriederer/references/WIYW.md
Many organizations emerged to develop and evangelize better practices for scientific
transparency and collaboration
• Offer two-day bootcamps on scientific computing
and reproducible research
• Target researchers in diverse academic disciplines
(e.g. ecology)
• Over 22,000 workshop attendees and 1,000
instructors since 2012
• “Good Enough Practices in Scientific Computing”
5. Emily Riederer, Capital One (@EmilyRiederer) References on GitHub: emilyriederer/references/WIYW.md
Industry embraced advancements in reproducibility for gains to quality and efficiency
• Developed internal R package for tooling and
community building among data scientists
• Promotes efficient work and standardized style
• Share analyses on Knowledge Repo (built on
GitHub)
6. Emily Riederer, Capital One (@EmilyRiederer) References on GitHub: emilyriederer/references/WIYW.md
Analyst
Scientist
Traditional business analysis’ focus on solving a specific case is a barrier to reproducible
thinking
Abstraction
Concrete Business Cases
Reproducibility Taxonomy
(Stodden, et al, 2013)
Open/Reproducible
Auditable
Confirmable
Replicable
Reviewable
7. Emily Riederer, Capital One (@EmilyRiederer) References on GitHub: emilyriederer/references/WIYW.md
The traditional lens of business analysis is reinforced by the standard tools of the trade
Abstraction
Concrete Business Cases
Spreadsheets expose
data, hide computation
Scripting logs computation,
abstracts data
Analyst
Scientist
8. Emily Riederer, Capital One (@EmilyRiederer) References on GitHub: emilyriederer/references/WIYW.md
The language business analysts use to articulate their problems obscures the link to
technology-driven solutions
Reproducibility
Version Control
Peer/Code Review
Re-Work
Tribal Knowledge
Team Work
Sanity Checking
9. the tidycf R package at Capital One
turning business analysis on its head by turning cashflows on their side
10. Emily Riederer, Capital One (@EmilyRiederer) References on GitHub: emilyriederer/references/WIYW.md
Decision
Making
Validation
&
Monitoring
Modeling
Scenario
Analysis
Cashflow analysis is integral to many interrelated pieces of business analytics
Documentation
& Governance
11. Emily Riederer, Capital One (@EmilyRiederer) References on GitHub: emilyriederer/references/WIYW.md
Decision
Making
Validation
&
Monitoring
Modeling
Scenario
Analysis
Documentation
& Governance
Database
System
BI Visualization
Tool
Legacy Statistical
Computing Platform
Legacy Statistical
Computing Platform
Legacy Statistical
Computing
Platform
FTP Client
FTP Client
Spreadsheet
Software
Spreadsheet
Software
Word Processor
Word Processor
Spreadsheet
Software
Presentation
Software
• Black box
• Limited capability
• Manual documentation
• Highly manual process
• System-specific
knowledge
• Slow iteration
Patchwork processes lead to inefficiency, poor documentation, and limited reproducibility
12. Emily Riederer, Capital One (@EmilyRiederer) References on GitHub: emilyriederer/references/WIYW.md
Decision
Making
Validation
&
Monitoring
Modeling
Scenario
Analysis
Building an end-to-end R package enabled an efficient and reproducible workflow
• Accessible code
• Extensible code
• Real-time
documentation
• Automated &
reproducible
• General versus system-
specific knowledge
• Rapid iteration
13. Emily Riederer, Capital One (@EmilyRiederer) References on GitHub: emilyriederer/references/WIYW.md
Nuanced business decisions are driven by a remarkably standard analytical “engine”
Business Processes
Workflow
(R for Data Science)
15. Emily Riederer, Capital One (@EmilyRiederer) References on GitHub: emilyriederer/references/WIYW.md
However, cashflow statements are not optimized for either human or machine readability
Time
Mix of time series and pointwise fields; data and assumptions
Variable information contained in formatting – bold, italics, indentations
Mix of data and calculations, with every cell exposed to potential typos
Fake data is provided for illustrative purposes only and does not represent Capital One performance
Data defined by specific location on sheet so adding a
new line-item can perturb downstream calculations
16. Emily Riederer, Capital One (@EmilyRiederer) References on GitHub: emilyriederer/references/WIYW.md
Instead, we used tidy cashflows, applying the principles of tidy data analysis
• “Happy families are all alike; every unhappy family is unhappy in its own way “ – Leo Tolstoy, Anna Karenina
• “Like families, tidy datasets are all alike but every messy dataset is messy in its own way.” – Hadley Wickham, “Tidy Data”
Segment Time Tot_Rev Tot_Exp NIBT Tax NIAT Eq_Flow Cashflow
Super 1 4.69 14.20 -9.50 -6.18 -3.33 -5.00 -8.33
Super 2 21.08 3.19 17.90 11.63 6.26 1.00 7.26
Super 3 47.07 9.94 37.13 24.13 13.00 1.00 14.00
Super 4 34.35 1.96 32.39 21.05 11.34 1.00 12.34
Super 5 1.59 8.89 -7.30 -4.75 -2.56 1.00 -1.56
Prime 1 57.61 47.45 10.16 6.60 3.56 -5.00 -1.44
Prime 2 93.78 5.52 88.26 57.37 30.89 1.00 31.89
Prime 3 17.74 54.17 -36.43 -23.68 -12.75 1.00 -11.75
Prime 4 36.98 3.93 33.05 21.48 11.57 1.00 12.57
Prime 5 78.72 55.98 22.74 14.78 7.96 1.00 8.96
Time 1 2 3 4 5
Superprime
Total Revenue $4.69 $21.08 $47.07 $34.35 $1.59
Total Expense $14.20 $3.19 $9.94 $1.96 $8.89
NIBT -$9.50 $17.90 $37.13 $32.39 -$7.30
Tax -$6.18 $11.63 $24.13 $21.05 -$4.75
NIAT -$3.33 $6.26 $13.00 $11.34 -$2.56
Equity Flow -$5.00 $1.00 $1.00 $1.00 $1.00
Cashflow -$8.33 $7.26 $14.00 $12.34 -$1.56
Prime
Total Revenue $57.61 $93.78 $17.74 $36.98 $78.72
Total Expense $47.45 $5.52 $54.17 $3.93 $55.98
NIBT $10.16 $88.26 -$36.43 $33.05 $22.74
Tax $6.60 $57.37 -$23.68 $21.48 $14.78
NIAT $3.56 $30.89 -$12.75 $11.57 $7.96
Equity Flow -$5.00 $1.00 $1.00 $1.00 $1.00
Cashflow -$1.44 $31.89 -$11.75 $12.57 $8.96
Fake data is provided for illustrative purposes only and does not represent Capital One performance
17. Emily Riederer, Capital One (@EmilyRiederer) References on GitHub: emilyriederer/references/WIYW.md
Tidy cashflows streamline the workflow to facilitate advanced analytics like bootstrapping
error bars and indifference curves
Fake data is provided for illustrative purposes only and does not represent Capital One performance
18. Emily Riederer, Capital One (@EmilyRiederer) References on GitHub: emilyriederer/references/WIYW.md
Organically evolving the tidycf package while addressing business problems led to
efficient and empathetic development
Task:
Valuations
Process 1:
Data Validation
Process 2:
Data Exploration
Process n-1:
Model Validation
Process n+1 … z:
Analysis with Model
Framework 1:
Data Validation
Framework 2:
Data Exploration
Framework n-1:
Model Validation
Framework n+1 … z:
Analysis with Model
calc
functions
viz
functions
tbl
functions
… …
Process 3:
Model Building
Framework 3:
Model Building
Process n:
Model Intuition
Framework n:
Model Intuition
Business Problems R Markdown Templates R functions
19. Emily Riederer, Capital One (@EmilyRiederer) References on GitHub: emilyriederer/references/WIYW.md
RMarkdown templates enable package discoverability, immersion into the broader R
language, and contextual knowledge transfer while generating documentation
Code comments explain syntax and
suggest new functions to try
Text commentary facilitates knowledge
transfer of business context and intuition
20. Emily Riederer, Capital One (@EmilyRiederer) References on GitHub: emilyriederer/references/WIYW.md
Functions manipulate tidy data and provide output consistent with their taxonomy and
compatible with any tidyverse pipeline
calc
functions
viz
functions
tbl
functions
data frame
data frame
graphic
pivoted data
get
functions
DB Conns,
Internet Proxies,
etc.
21. Emily Riederer, Capital One (@EmilyRiederer) References on GitHub: emilyriederer/references/WIYW.md
These functions are intuitively related to help users quickly generate standard views
Fake data is provided for illustrative purposes only and does not represent Capital One performance
22. Emily Riederer, Capital One (@EmilyRiederer) References on GitHub: emilyriederer/references/WIYW.md
Integration with the tidyverse allows functions to provide both structure and flexibility
Fake data is provided for illustrative purposes only and does not represent Capital One performance
23. Emily Riederer, Capital One (@EmilyRiederer) References on GitHub: emilyriederer/references/WIYW.md
tidycf’s R Project template standardizes file structure for better project management
RStudio > File > New Project > New Directory
• /analysis/ : core scripts (.Rmd) and final outputs (.HTML)
• /data/ : raw data (or in our case, pulled from SQL)
• /doc/: text files providing context and documentation
• /ext/ : external files needed for project
• /output/ : intermediate/final data formats
• /src/ : other helper scripts (e.g. SQL, python)
24. Emily Riederer, Capital One (@EmilyRiederer) References on GitHub: emilyriederer/references/WIYW.md
RMarkdown templates document data transformations and demonstrate relative paths to
save intermediate artifacts in the appropriate directory
Data Validation
Exploratory Data Analysis
Model Validation
Model Analytics
(Multiple Modeling Steps)
…
Model Monitoring
raw
out1
out_t
out_t+1
model
model
out1
out2
out_t+1
model
R Markdown Templates Output DataInput Data
./data/
./analysis/
./output/
Directoryexternal
External Source
25. Emily Riederer, Capital One (@EmilyRiederer) References on GitHub: emilyriederer/references/WIYW.md
Industry, academia, and the open-source community served as inspirations for our three-
pronged training approach
Community BuildingBootcampsIndividual Resources
Set-Up Guides, Self-Directed
Learning, and Prerequisites
Three day bootcamp –
conceptual and hands-on
Interaction (forum) and
engagement (contribution)
opportunities
26. Emily Riederer, Capital One (@EmilyRiederer) References on GitHub: emilyriederer/references/WIYW.md
Our resulting package emphasizes reproducibility while immersing business analysts in R
odbc
DBI
dbplyr
rmarkdown
knitr
httr
devtools
plotly
tidyverse
27. what’s in your workflow?
reproducible business analysis at Capital One
Emily Riederer
Sr. Analyst, Capital One
@EmilyRiederer / emily.riederer@capitalone.com