3. #SoCalDS17 #IDEAS17 | #rstats | @aj2z @SelfScore
Ajay Gopal, PhD
ajzz | @aj2z
2017: Chief Data Scientist, SelfScore Inc
#FinTech #ML #Underwriting #Risk #rstats
2016: VP, Data Science & Growth, CARD.com
#FinTech #MktgAutomation #BehavEcon #rstats
2012: Postdoc / Staff Researcher, UCLA
#BioInformatics #GraphTheory #StatMech #Python
2005: PhD, Univ of Chicago
#SurfacePhysics #BioPhysics #StatMech #Matlab
SelfScore: Financial Education & Inclusion
Industry: FinTech Alt-Lending Startup, Menlo Park, CA
What we do: Use ML models with alternative financial signals to help deserving but underserved populations gain access to fair credit, starting with international students (2 products in market)
Differentiator: Measure borrower’s potential instead of history (e.g. without SSN / FICO etc.)
Team: ~30 (4 in Data Science + You?)
Funding: Series B, Founded in 2013
This talk ... was born on Twitter
For Startups + New Teams
1) Evolving Data Science needs
2) What’s “Full Stack” DS?
3) Why use R (or Python)?
4) Cloud R-based DS Stack
   - Sample Infra
   - Open Source tools
-------------------------
5) Production Mindset
6) Buy or Build?
Data Science (VC) Expectations Evolve
Innovation Vertical + Optimization Laterally
[Diagram: Data Science in Modern (Gen-AI) Startups]
Verticals: Data Science (IP, AI, Innovation, R&D), Operations, Finance, Compliance, Technology, Product, CX, Demand Gen, Growth
Lateral: Infra Process Automation, Product Optimization, Ad / Comms Optim
Considerations:
● Disruptive if relying on resources from other verticals
● More ad-hoc work
● R&D timelines not predictable
● Faster cadence for analytics
Solution:
● “Full Stack” Infra & Teams!
● Tools & Training for others to self-serve
The “Full Stack” Analogy
“Full Stack” Web Services - Technologies
UX
  Function: Multi-Channel Engagement
  Technology: Email (SendGrid), SMS (Twilio), Push (SNS, Firebase), Msg Frmwks
Front End
  Function: Optimal Service Delivery
  Technology: HTML/CSS, JS (Node, React), Bootstrap, iOS, Android, Ionic, Cordova
APIs
  Function: Platform-agnostic function & information availability
  Technology: Restify, Django, Rails, ASP.NET, Lambda
Back End
  Function: Business Logic
  Technology: PHP, JS, Python, Ruby, ORMs, CI, Git
Data Store
  Function: Identities, Attribs, Relations
  Technology: MySQL, PostgreSQL, MongoDB, Redis, Memcached etc.
Devops
  Function: Scalable Services & Contingencies
  Technology: Puppet, Chef, Ansible, AWS EC2, Docker, ECS/GCE, Heroku
Goal: Scalable, Engaging, Valuable Web Service
“Full Stack” Data Science
UX
  Function: Multi-Channel Engagement
  Technology: httr, curl - API interactions for Email, SMS, Push, Slack, OR via CI tool
Front End
  Function: Optimal Service Delivery
  Technology: shiny, HTML, CSV, rook, googlesheets, htmlwidgets, shinyapps.io, Dropbox
APIs
  Function: Platform-agnostic function & information availability
  Technology: Generic: rapache, opencpu, plumber; ML: h2o/steam, Domino Data Lab
Back End
  Function: Business Logic
  Technology: Your internal pkgs, RServer, CI, Git, cron, (most R packages), SparkR
Data Store
  Function: Identities, Attribs, Relations
  Technology: DBI, RMySQL, RPostgreSQL, Redis, Hadoop, Kinesis (AWR), Spark etc.
Devops
  Function: Scalable Services & Contingencies
  Technology: rocker, EMIs, ECS, GCE, other cloud tools
Goal: Scalable, Timely, Intelligence/Economic Services
R is Sufficient For All Key Stack Functions
1) Retrieve Data
- Ad / Marketing
- Sales
- Transaction
- 3rd Party / Behavioral
2) Process (ETL)
- Fetch, clean up, store
3) Analyze
- Cross-Connectivity
- Aggregation & Features
- Algorithms
4) Predict
- Models in batch
- In-memory modeling
- REST APIs
5) Inform
- Customers (Services & API)
- Partners
Eg: Marketing, fulfillment
- Internal Stakeholders
Eg: Reporting / Dashboards
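The five stages above can be sketched end to end in base R. This is a minimal illustration under stated assumptions, not the speaker's pipeline: the data frames are simulated stand-ins for an ad platform and a sales database, and the function names (etl, add_features, fit_model, inform) are hypothetical.

```r
# Hypothetical, simulated data standing in for real sources (ad platform + sales DB).
set.seed(42)
ad_spend <- data.frame(day = 1:30, spend = round(runif(30, 80, 120), 2))
sales <- data.frame(day = 1:30)
sales$revenue <- 50 + 2.5 * ad_spend$spend + rnorm(30, sd = 10)

# 1) Retrieve + 2) Process (ETL): join the sources and drop incomplete rows.
etl <- function(ads, sales) {
  joined <- merge(ads, sales, by = "day")
  joined[complete.cases(joined), ]
}

# 3) Analyze: add a simple derived feature (revenue per unit of ad spend).
add_features <- function(df) {
  df$roas <- df$revenue / df$spend
  df
}

# 4) Predict: fit a linear model in batch.
fit_model <- function(df) lm(revenue ~ spend, data = df)

# 5) Inform: produce a one-line summary for a report or dashboard.
inform <- function(model, new_spend) {
  pred <- predict(model, newdata = data.frame(spend = new_spend))
  sprintf("Forecast revenue at spend %.0f: %.1f", new_spend, pred)
}

daily <- add_features(etl(ad_spend, sales))
model <- fit_model(daily)
report_line <- inform(model, 100)
print(report_line)
```

In a real stack the simulated frames would be replaced by database or API reads, and the final string would feed a dashboard or a notification rather than print().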
“Full Stack” Data Science with R
UX
  Function: Multi-Channel Engagement
  Technology: httr, curl - API interactions for Email, SMS, Push, Slack, OR via CI tool
Front End
  Function: Optimal Service Delivery
  Technology: shiny, HTML, CSV, rook, googlesheets, htmlwidgets, shinyapps.io, Dropbox
APIs
  Function: Platform-agnostic function & information availability
  Technology: Generic: rapache, opencpu, plumber; ML: h2o/steam, Domino, Lambda
Back End
  Function: Business Logic
  Technology: Your internal pkgs, RServer, CI, Git, H2O, (most R packages), Spark
Data Store
  Function: Identities, Attribs, Relations
  Technology: DBI, RMySQL, RPostgreSQL, Redis, Hadoop, Kinesis (AWR), SparkR etc.
Devops
  Function: Scalable Services & Contingencies
  Technology: rocker, EMIs, ECS, GCE, other cloud tools, Domino Data Lab, Azure
Goal: Scalable, Timely, Intelligence/Economic Services
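As one illustration of the APIs layer, a model can be exposed over REST with plumber, one of the generic options named on the slide. The sketch below is an assumption-laden example, not the speaker's service: the model is a toy regression on R's built-in mtcars data, and the /predict route and the predict_mpg helper are hypothetical names.

```r
# plumber.R - illustrative endpoint definition; plain R until plumber loads it.

# Toy model fitted at start-up on the built-in mtcars dataset.
model <- lm(mpg ~ wt, data = mtcars)

# Helper: query parameters arrive as strings, so coerce before predicting.
predict_mpg <- function(wt) {
  unname(predict(model, newdata = data.frame(wt = as.numeric(wt))))
}

#* Predict fuel efficiency from vehicle weight
#* @param wt Weight in 1000 lbs
#* @get /predict
function(wt = 3) {
  list(mpg = predict_mpg(wt))
}
```

Saved as plumber.R, a file like this could be served with plumber::plumb("plumber.R")$run(port = 8000), after which a GET request to /predict?wt=2.5 would return the prediction as JSON.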
R is great for startups!
Top Drivers for Startups
1. Instant Reactive Web Visualizations via Shiny (Zero front-end dev)
2. Low barrier for cross-training
3. Fantastic IDE (RStudio) (single-point access to stack)
4. Large ecosystem of packages (modeling + viz + utils)
5. Great client libraries for ML frameworks
6. Statistically Trained Prospects (Python / Pandas odds good too)
Detractors
- Fewer hard-core devs
- Only a handful of dev shops; no serious bandwidth for hire
- Memory mgmt (still?)
So how do we build an R-based stack?
Data Science Should Be This Easy
AUTOMATION
Data Science IDE
Interactive Dashboards
Predictive Models & APIs
Alerts, Notifications, Files
So how do we build this in the cloud?
Assembly of Cloud Container Services
1) Bastion - to connect to external world
(small, low memory, public IP)
2) Scheduler - do things triggered by time & events
(medium, run CI tools, invoke compute slaves)
3) Workers - heavy feature computations
(highmem, multi core, stateless)
4) Storage - DBs, pipelines & message queues
(distributed storage services or internal clusters)
5) Modeler - H2O Cluster, MLlib, scikit-learn etc
(multi-node cluster, available on demand)
6) Reporter - API Service / Shiny server
(medium, autoscaled containers)
Sample Production Workflows
“Staging” Shiny App
1. Git Commit App to “Dev” branch
2. Jenkins Sync Repo on Commit
3. Sync triggers next Jenkins job, which creates Docker container
4. Next job: AWS CLI tools deploy Docker container to ECS
5. “Dev” Shiny app live on staging
6. API call to notify Slack channel
SEM Cost Forecaster
1. Rscript fetches AdWords
spend & internal sales data
every 5 minutes.
2. Rscript runs existing anomaly
detection & forecast model
3. When check fails, API calls
from R to SMS (eg Twilio) and
Email (eg: SendGrid).
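The check-then-alert step of the SEM Cost Forecaster could look roughly like the sketch below. It swaps in a simple z-score rule for the talk's anomaly detection and forecast model, which the slides do not describe; is_anomalous, check_spend, and the spend values are hypothetical. The notifier is passed in as a function so that an SMS or email sender (for example a wrapper around an httr call to a provider's API) can replace the default message().

```r
# Flag the latest value if it sits more than `k` standard deviations away
# from the mean of the history (a simple z-score rule, assumed for illustration).
is_anomalous <- function(history, latest, k = 3) {
  stopifnot(is.numeric(history), length(history) >= 2, is.numeric(latest))
  s <- sd(history)
  if (s == 0) return(latest != mean(history))
  abs(latest - mean(history)) / s > k
}

# Run one check; `notify` is injected so SMS/email senders can be swapped in.
check_spend <- function(history, latest, notify = message) {
  if (is_anomalous(history, latest)) {
    notify(sprintf("Spend alert: latest value %.2f is outside the expected range", latest))
    return(invisible(TRUE))
  }
  invisible(FALSE)
}

spend_history <- c(100, 104, 98, 101, 97, 103, 99, 102)  # made-up values
check_spend(spend_history, 101)  # within range, no alert
check_spend(spend_history, 250)  # triggers the notifier
```

A scheduler such as cron or Jenkins would then invoke a script like this through Rscript at the chosen interval.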
Building Full-Stack Data Science Teams
People
- Data / Backend Engineer
- Data Scientist
- Modeller / Statistician
- Product Manager
- Devops Engineer
Team Output
- EDA / ad-hoc
- Scheduled Reporting
- Batch Predictions
- Stream Processing
- Real-Time Prediction APIs
Our “product” is scalable, actionable intelligence
… let’s adopt good software development practices
BetteR habits:
1. Write inline and offline tests for your code (testthat, checkmate)
2. Generate informational logs so you can debug later (futile.logger)
3. Add versioning (GitHub)
4. Save business logic as functions in package (selfscoRe)
5. Add examples (Rmd)
6. Write documentation (Rmd)
7. Create a web service (Shiny apps)
8. Put the service in a docker container
The Production Mindset for Data Scientists
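Habits 1, 2, and 4 can be illustrated together in a few lines. This is a sketch under assumptions: annualize_rate is a made-up piece of business logic, log_info is a base-R stand-in for a logging call such as futile.logger's flog.info, and the testthat block only runs if that package is installed.

```r
# Minimal logger built on message(); a logging package could replace it (habit 2).
log_info <- function(fmt, ...) {
  message(format(Sys.time(), "%Y-%m-%d %H:%M:%S"), " INFO ", sprintf(fmt, ...))
}

# Business logic kept in a small function that could live in a package (habit 4).
annualize_rate <- function(monthly_rate) {
  # Inline checks fail fast on bad input (habit 1).
  stopifnot(is.numeric(monthly_rate), all(monthly_rate > -1))
  log_info("annualizing %d rate(s)", length(monthly_rate))
  (1 + monthly_rate)^12 - 1
}

# Offline tests, run only when testthat is available (habit 1).
if (requireNamespace("testthat", quietly = TRUE)) {
  testthat::test_that("annualize_rate compounds monthly rates", {
    testthat::expect_equal(annualize_rate(0), 0)
    testthat::expect_equal(annualize_rate(0.01), 1.01^12 - 1)
    testthat::expect_error(annualize_rate("1%"))
  })
}
```

In a package layout the function would sit under R/ and the test block under tests/testthat/, so that a CI job can run the tests on every commit.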
Should we buy or build?
Should my company buy the infra? VS Should my team build it?
Buy vs Build Considerations
BUY / RENT
- If no dev/tech in-house
- If time-to-market is key
requires:
- Custom Integrations
- Higher Cost Tolerance
- Niche engagements
BUILD
- If compliance is major factor
(HIPAA, PCI)
- If cost control is key
- Full Control of Features Reqd
requires:
- In-house talent
- Longer time-to-market?
- Ongoing maintenance
Thank You!
In Summary
- Data Science is Vertical + Lateral!
- Colocate data sources
- Containerize services in the cloud
- Use R’s Rich Ecosystem (or something easy to cross-train other verticals on)
Hiring Sr “Full Stack” Data Scientist
*ML Models
Img Credits: http://daemon.co.za/2014/04/what-does-full-stack-mean