"Full Stack" Data Science with R
for Startups: Production-Ready
with Open Source Tools
#rstats #SoCalDS17 #IDEAS17
Oct 22, 2017
Ajay Gopal
#SoCalDS17 #IDEAS17 | #rstats | @aj2z @SelfScore
Me: (Data) Scientist, Technologist, Entrepreneur
Ajay Gopal, PhD
: ajzz : @aj2z
2017: Chief Data Scientist, SelfScore Inc
#FinTech #ML #Underwriting #Risk #rstats
2016: VP, Data Science & Growth, CARD.com
#FinTech #MktgAutomation #BehavEcon #rstats
2012: Postdoc / Staff Researcher, UCLA
#BioInformatics #GraphTheory #StatMech #Python
2005: PhD, Univ of Chicago
#SurfacePhysics #BioPhysics #StatMech #Matlab
SelfScore: Financial Education & Inclusion
Industry: FinTech Alt-Lending Startup, Menlo Park, CA
What we do: Use ML models with alternative financial signals to help deserving but underserved populations gain access to fair credit, started with international students (2 products in market)
Differentiator: Measure borrower’s potential instead of history (e.g. without SSN / FICO etc.)
Team: ~30 (4 in Data Science + You?)
Funding: Series B, Founded in 2013
This talk ... was born on Twitter
For Startups + New Teams
1) Evolving Data Science needs
2) What’s “Full Stack” DS?
3) Why use R (or Python)?
4) Cloud R-based DS Stack
   - Sample Infra
   - Open Source tools
-------------------------
5) Production Mindset
6) Buy or Build?
Data Science (VC) Expectations Evolve
Innovation Vertical + Optimization Laterally
Data Science in Modern (Gen-AI) Startups
Data Science vertical: IP, AI, Innovation, R&D
Other verticals: Operations, Finance, Compliance, Technology, Product, CX, Demand Gen, Growth
Lateral optimization: Infra, Process Automation, Product Optimization, Ad / Comms Optimization
Considerations:
● Disruptive if relying on resources from other verticals
● More ad-hoc work
● R&D timelines not predictable
● Faster cadence for analytics
Solution:
● “Full Stack” Infra & Teams!
● Tools & Training for others to self-serve
The “Full Stack” Analogy
Layer | Technology | Function
UX | Email (SendGrid), SMS (Twilio), Push (SNS, Firebase), messaging frameworks | Multi-Channel Engagement
Front End | HTML/CSS, JS (Node, React), Bootstrap, iOS, Android, Ionic, Cordova | Optimal Service Delivery
APIs | Restify, Django, Rails, ASP.NET, Lambda | Platform-agnostic function & information availability
Back End | PHP, JS, Python, Ruby, ORMs, CI, Git | Business Logic
Data Store | MySQL, PostgreSQL, MongoDB, Redis, Memcached etc. | Identities, Attributes, Relations
Devops | Puppet, Chef, Ansible, AWS EC2, Docker, ECS/GCE, Heroku | Scalable Services & Contingencies
Goal: Scalable, Engaging, Valuable Web Service
“Full Stack” Data Science
Layer | Technology | Function
UX | httr, curl (API interactions for Email, SMS, Push, Slack, or via CI tool) | Multi-Channel Engagement
Front End | shiny, HTML, CSV, rook, googlesheets, htmlwidgets, shinyapps.io, Dropbox | Optimal Service Delivery
APIs | Generic: rapache, opencpu, plumber; ML: h2o/steam, Domino Data Lab | Platform-agnostic function & information availability
Back End | Your internal pkgs, RServer, CI, Git, Chron, (most R packages), SparkR | Business Logic
Data Store | DBI, RMySQL, RPostgreSQL, Redis, Hadoop, Kinesis (AWR), Spark etc. | Identities, Attributes, Relations
Devops | rocker, EMIs, ECS, GCE, other cloud tools | Scalable Services & Contingencies
Goal: Scalable, Timely, Intelligence/Economic Services
R is Sufficient For All Key Stack Functions
1) Retrieve Data
- Ad / Marketing
- Sales
- Transaction
- 3rd Party / Behavioral
2) Process (ETL)
- Fetch, clean up, store
3) Analyze
- Cross-Connectivity
- Aggregation & Features
- Algorithms
4) Predict
- Models in batch
- In-memory modeling
- REST APIs
5) Inform
- Customers (Services & API)
- Partners
Eg: Marketing, fulfillment
- Internal Stakeholders
Eg: Reporting / Dashboards
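The five stages above can be sketched end to end in a few lines. The deck targets R; this illustration uses Python, which the talk names as an acceptable alternative, and sticks to the standard library. Everything in it is an assumption made for illustration: the inline CSV, the column names, the function names, and especially the `uplift` multiplier, which merely stands in for a real model.

```python
"""Toy pass through the five stages: retrieve, process, analyze, predict, inform."""
import csv
import io
import json
import statistics

# Hypothetical raw export; a real stack would pull from ad, sales, or transaction sources.
RAW_CSV = """customer,channel,spend
a1,search,120.0
a2,social,
a3,search,80.0
a4,email,40.0
"""


def retrieve(raw: str) -> list[dict]:
    """1) Retrieve: read rows from a source (here, an in-memory CSV string)."""
    return list(csv.DictReader(io.StringIO(raw)))


def process(rows: list[dict]) -> list[dict]:
    """2) Process (ETL): drop rows with missing spend and cast spend to float."""
    return [
        {"customer": r["customer"], "channel": r["channel"], "spend": float(r["spend"])}
        for r in rows
        if r["spend"]
    ]


def analyze(rows: list[dict]) -> dict[str, float]:
    """3) Analyze: aggregate one simple feature, mean spend per channel."""
    by_channel: dict[str, list[float]] = {}
    for r in rows:
        by_channel.setdefault(r["channel"], []).append(r["spend"])
    return {channel: statistics.mean(values) for channel, values in by_channel.items()}


def predict(features: dict[str, float], uplift: float = 1.1) -> dict[str, float]:
    """4) Predict: placeholder 'model' that scales each feature by an assumed uplift."""
    return {channel: round(value * uplift, 2) for channel, value in features.items()}


def inform(predictions: dict[str, float]) -> str:
    """5) Inform: serialize results as JSON for a dashboard or API consumer."""
    return json.dumps(predictions, sort_keys=True)


report = inform(predict(analyze(process(retrieve(RAW_CSV)))))
print(report)
```

Keeping each stage as its own function is what lets the same logic run in batch, behind a REST endpoint, or inside a dashboard without being rewritten.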
“Full Stack” Data Science with R
Layer | Technology | Function
UX | httr, curl (API interactions for Email, SMS, Push, Slack, or via CI tool) | Multi-Channel Engagement
Front End | shiny, HTML, CSV, rook, googlesheets, htmlwidgets, shinyapps.io, Dropbox | Optimal Service Delivery
APIs | Generic: rapache, opencpu, plumber; ML: h2o/steam, Domino, Lambda | Platform-agnostic function & information availability
Back End | Your internal pkgs, RServer, CI, Git, H2O, (most R packages), Spark | Business Logic
Data Store | DBI, RMySQL, RPostgreSQL, Redis, Hadoop, Kinesis (AWR), SparkR etc. | Identities, Attributes, Relations
Devops | rocker, EMIs, ECS, GCE, other cloud tools, Domino Data Lab, Azure | Scalable Services & Contingencies
Goal: Scalable, Timely, Intelligence/Economic Services
R is great for startups!
Detractors
- Fewer hard-core devs
- Only handful of dev shops;
no serious bandwidth for hire
- Memory mgmt (still?)
Top Drivers for Startups
1. Instant Reactive Web Visualizations
via Shiny (Zero front-end dev)
2. Low barrier for cross-training
3. Fantastic IDE (RStudio)
(single-point access to stack)
4. Large ecosystem of packages
(modeling + viz + utils)
5. Great client libraries
for ML frameworks
6. Statistically Trained Prospects
(Python / Pandas odds good too)
So how do we build an R-based stack?
Data Science Should Be This Easy
AUTOMATION
- Data Science IDE
- Interactive Dashboards
- Predictive Models & APIs
- Alerts, Notifications, Files
So how do we build this in the cloud?
Assembly of Cloud Container Services
1) Bastion - to connect to external world
(small, low memory, public IP)
2) Scheduler - do things triggered by time & events
(medium, run CI tools, invoke compute slaves)
3) Workers - heavy feature computations
(highmem, multi core, stateless)
4) Storage - DBs, pipelines & message queues
(distributed storage services or internal clusters)
5) Modeler - H2O Cluster, MLlib, scikit-learn etc
(multi-node cluster, available on demand)
6) Reporter - API Service / Shiny server
(medium, autoscaled containers)
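One way to keep such an assembly reviewable is to write the six roles down as data before provisioning anything. The sketch below is in Python rather than R and is only an illustration: the `Service` class, its field names, and the rule that the bastion alone is public are assumptions layered on the slide's notes, not part of the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """One role in the stack; fields paraphrase the parenthetical notes in the list above."""
    name: str
    role: str
    size: str
    public: bool = False
    stateless: bool = False


# Hypothetical inventory of the six roles.
STACK = [
    Service("bastion", "connect to the external world", "small", public=True),
    Service("scheduler", "time- and event-triggered jobs, CI tools", "medium"),
    Service("workers", "heavy feature computations", "highmem", stateless=True),
    Service("storage", "DBs, pipelines, message queues", "managed-or-cluster"),
    Service("modeler", "on-demand ML cluster", "multi-node"),
    Service("reporter", "API service / Shiny server", "medium"),
]


def public_entry_points(stack: list[Service]) -> list[str]:
    """Return the names of services exposed to the internet."""
    return [service.name for service in stack if service.public]


def has_single_entry_point(stack: list[Service]) -> bool:
    """Assumed invariant: only the bastion carries a public IP."""
    return public_entry_points(stack) == ["bastion"]


print(has_single_entry_point(STACK))
```

A check like this can run in CI so that a change exposing another container is caught in review rather than in production.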
Sample AWS Infra
Choice of Tools
Sample Production Workflows
“Staging” Shiny App
1. Git commit app to “Dev” branch
2. Jenkins syncs repo on commit
3. Sync triggers next Jenkins job, which creates a Docker container
4. Next job: AWS CLI tools deploy Docker container to ECS
5. “Dev” Shiny app live on staging
6. API call to notify Slack channel
SEM Cost Forecaster
1. Rscript fetches AdWords spend & internal sales data every 5 minutes.
2. Rscript runs existing anomaly detection & forecast model
3. When check fails, API calls from R to SMS (e.g. Twilio) and Email (e.g. SendGrid).
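The SEM Cost Forecaster loop (fetch, check, alert) can be sketched as follows. This is a Python illustration, not the talk's R script. The z-score rule is a deliberately simple stand-in for the anomaly detection and forecast model, which the slides do not describe. The spend figures are made up, and the notifiers are local stubs where a real job would call an SMS or email API.

```python
import statistics
from typing import Callable


def is_anomalous(history: list[float], latest: float, z_threshold: float = 3.0) -> bool:
    """Flag `latest` if it lies more than `z_threshold` standard deviations from the mean."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return latest != mean
    return abs(latest - mean) / stdev > z_threshold


def run_check(
    history: list[float], latest: float, notifiers: list[Callable[[str], None]]
) -> bool:
    """Run the check once; on failure, send one message per notifier (e.g. SMS, email)."""
    if not is_anomalous(history, latest):
        return False
    message = (
        f"SEM spend alert: latest={latest:.2f}, "
        f"recent mean={statistics.mean(history):.2f}"
    )
    for notify in notifiers:
        notify(message)
    return True


# Stub notifiers collect messages locally instead of calling an external service.
sent: list[str] = []
spend_history = [100.0, 104.0, 98.0, 101.0, 97.0]  # made-up five-minute spend samples
fired = run_check(spend_history, 180.0, [sent.append, sent.append])
print(fired, sent)
```

Passing the notifiers in as plain callables keeps the check testable offline. The scheduler would supply the real SMS and email senders.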
Building Full-Stack Data Science Teams
People
- Data / Backend Engineer
- Data Scientist
- Modeller / Statistician
- Product Manager
- Devops Engineer
Team Output
- EDA / ad-hoc
- Scheduled Reporting
- Batch Predictions
- Stream Processing
- Real-Time Prediction APIs
Our “product” is scalable, actionable intelligence
… let’s adopt good software development practices
The Production Mindset for Data Scientists
BetteR habits:
1. Write inline and offline tests for your code (testthat, checkmate)
2. Generate informational logs so you can debug later (futile.logger)
3. Add versioning (GitHub)
4. Save business logic as functions in a package (selfscoRe)
5. Add examples (Rmd)
6. Write documentation (Rmd)
7. Create a web service (Shiny apps)
8. Put the service in a Docker container
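Habits 1, 2 and 4 fit together in a small example. The talk names R packages (testthat, checkmate, futile.logger); this sketch uses rough Python standard-library analogues (`unittest`, explicit input checks, `logging`). The `debt_to_income` function is a hypothetical piece of business logic invented for illustration, not something from the talk.

```python
import logging
import unittest

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(name)s: %(message)s")
log = logging.getLogger("scoring")


def debt_to_income(monthly_debt: float, monthly_income: float) -> float:
    """Business logic kept in one reusable function (hypothetical example metric)."""
    # Inline checks: fail fast on inputs the function cannot handle.
    if monthly_income <= 0:
        raise ValueError("monthly_income must be positive")
    if monthly_debt < 0:
        raise ValueError("monthly_debt must be non-negative")
    ratio = monthly_debt / monthly_income
    # Informational log so a later debugging session can see what was computed.
    log.info("debt=%.2f income=%.2f ratio=%.3f", monthly_debt, monthly_income, ratio)
    return ratio


class DebtToIncomeTest(unittest.TestCase):
    """Offline tests that a CI job can run on every commit."""

    def test_typical_case(self):
        self.assertAlmostEqual(debt_to_income(500.0, 2000.0), 0.25)

    def test_rejects_zero_income(self):
        with self.assertRaises(ValueError):
            debt_to_income(500.0, 0.0)


# Run the suite programmatically so the script continues after the tests finish.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(DebtToIncomeTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```

Once the function lives in a package with its tests, wrapping it in a web service and a container (habits 7 and 8) needs no change to the logic itself.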
Should we buy or build?
Should my company buy the infra? vs. Should my team build it?
Buy vs Build Considerations
BUY / RENT
- If no dev/tech in-house
- If time-to-market is key
requires:
- Custom Integrations
- Higher Cost Tolerance
- Niche engagements
BUILD
- If compliance is major factor
(HIPAA, PCI)
- If cost control is key
- If full control of features is required
requires:
- In-house talent
- Longer time-to-market?
- Ongoing maintenance
Thank You!
In Summary
- Data Science is Vertical + Lateral!
- Colocate data sources
- Containerize services in the cloud
- Use R’s Rich Ecosystem (or something easy to cross-train other verticals on)
Hiring Sr “Full Stack” Data Scientist
*ML Models
Img Credits: http://daemon.co.za/2014/04/what-does-full-stack-mean

Más contenido relacionado

La actualidad más candente

Big data-science-oanyc
Big data-science-oanycBig data-science-oanyc
Big data-science-oanyc
Open Analytics
 
No sql and sql - open analytics summit
No sql and sql - open analytics summitNo sql and sql - open analytics summit
No sql and sql - open analytics summit
Open Analytics
 
Introducción al Machine Learning Automático
Introducción al Machine Learning AutomáticoIntroducción al Machine Learning Automático
Introducción al Machine Learning Automático
Sri Ambati
 

La actualidad más candente (20)

BI + Big Data
BI + Big DataBI + Big Data
BI + Big Data
 
AI as a service
AI as a serviceAI as a service
AI as a service
 
Get Started with Driverless AI Recipes - Hands-on Training
Get Started with Driverless AI Recipes - Hands-on TrainingGet Started with Driverless AI Recipes - Hands-on Training
Get Started with Driverless AI Recipes - Hands-on Training
 
Big data-science-oanyc
Big data-science-oanycBig data-science-oanyc
Big data-science-oanyc
 
DevOps for DataScience
DevOps for DataScienceDevOps for DataScience
DevOps for DataScience
 
Drive Away Fraudsters With Driverless AI - Venkatesh Ramanathan, Senior Data ...
Drive Away Fraudsters With Driverless AI - Venkatesh Ramanathan, Senior Data ...Drive Away Fraudsters With Driverless AI - Venkatesh Ramanathan, Senior Data ...
Drive Away Fraudsters With Driverless AI - Venkatesh Ramanathan, Senior Data ...
 
Vertex AI: Pipelines for your MLOps workflows
Vertex AI: Pipelines for your MLOps workflowsVertex AI: Pipelines for your MLOps workflows
Vertex AI: Pipelines for your MLOps workflows
 
Resume
ResumeResume
Resume
 
Stephen Cantrell, kdb+ Developer at Kx Systems “Kdb+: How Wall Street Tech c...
Stephen Cantrell, kdb+ Developer at Kx Systems  “Kdb+: How Wall Street Tech c...Stephen Cantrell, kdb+ Developer at Kx Systems  “Kdb+: How Wall Street Tech c...
Stephen Cantrell, kdb+ Developer at Kx Systems “Kdb+: How Wall Street Tech c...
 
Regulatory Reporting of Asset Trading Using Apache Spark-(Sudipto Shankar Das...
Regulatory Reporting of Asset Trading Using Apache Spark-(Sudipto Shankar Das...Regulatory Reporting of Asset Trading Using Apache Spark-(Sudipto Shankar Das...
Regulatory Reporting of Asset Trading Using Apache Spark-(Sudipto Shankar Das...
 
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and SparkVital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
Vital AI MetaQL: Queries Across NoSQL, SQL, Sparql, and Spark
 
Advanced Analytics for Any Data at Real-Time Speed
Advanced Analytics for Any Data at Real-Time SpeedAdvanced Analytics for Any Data at Real-Time Speed
Advanced Analytics for Any Data at Real-Time Speed
 
Arindam Sengupta _ Resume
Arindam Sengupta _ ResumeArindam Sengupta _ Resume
Arindam Sengupta _ Resume
 
Should a Graph Database Be in Your Next Data Warehouse Stack?
Should a Graph Database Be in Your Next Data Warehouse Stack?Should a Graph Database Be in Your Next Data Warehouse Stack?
Should a Graph Database Be in Your Next Data Warehouse Stack?
 
No sql and sql - open analytics summit
No sql and sql - open analytics summitNo sql and sql - open analytics summit
No sql and sql - open analytics summit
 
Data engineering at the interface of art and analytics: the why, what, and ho...
Data engineering at the interface of art and analytics: the why, what, and ho...Data engineering at the interface of art and analytics: the why, what, and ho...
Data engineering at the interface of art and analytics: the why, what, and ho...
 
American Century (Revolution Analytics Customer Day)
American Century (Revolution Analytics Customer Day)American Century (Revolution Analytics Customer Day)
American Century (Revolution Analytics Customer Day)
 
H2O AutoML roadmap - Ray Peck
H2O AutoML roadmap - Ray PeckH2O AutoML roadmap - Ray Peck
H2O AutoML roadmap - Ray Peck
 
Functional programming
 for optimization problems 
in Big Data
Functional programming
  for optimization problems 
in Big DataFunctional programming
  for optimization problems 
in Big Data
Functional programming
 for optimization problems 
in Big Data
 
Introducción al Machine Learning Automático
Introducción al Machine Learning AutomáticoIntroducción al Machine Learning Automático
Introducción al Machine Learning Automático
 

Similar a “Full Stack” Data Science with R for Startups: Production-ready with Open-Source Tools

Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Shirshanka Das
 
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Yael Garten
 
Bluegranite AA Webinar FINAL 28JUN16
Bluegranite AA Webinar FINAL 28JUN16Bluegranite AA Webinar FINAL 28JUN16
Bluegranite AA Webinar FINAL 28JUN16
Andy Lathrop
 
Sean Java Arch
Sean Java ArchSean Java Arch
Sean Java Arch
Sean Bob
 

Similar a “Full Stack” Data Science with R for Startups: Production-ready with Open-Source Tools (20)

Architecting an Open Source AI Platform 2018 edition
Architecting an Open Source AI Platform   2018 editionArchitecting an Open Source AI Platform   2018 edition
Architecting an Open Source AI Platform 2018 edition
 
DDDP 2019 - Brown to Green
DDDP 2019  - Brown to GreenDDDP 2019  - Brown to Green
DDDP 2019 - Brown to Green
 
Introdution to Dataops and AIOps (or MLOps)
Introdution to Dataops and AIOps (or MLOps)Introdution to Dataops and AIOps (or MLOps)
Introdution to Dataops and AIOps (or MLOps)
 
Apache Spark – The New Enterprise Backbone for ETL, Batch Processing and Real...
Apache Spark – The New Enterprise Backbone for ETL, Batch Processing and Real...Apache Spark – The New Enterprise Backbone for ETL, Batch Processing and Real...
Apache Spark – The New Enterprise Backbone for ETL, Batch Processing and Real...
 
Abhishek jaiswal
Abhishek jaiswalAbhishek jaiswal
Abhishek jaiswal
 
Bhadale group of companies projects portfolio
Bhadale group of companies  projects portfolioBhadale group of companies  projects portfolio
Bhadale group of companies projects portfolio
 
Democratization of Data @Indix
Democratization of Data @IndixDemocratization of Data @Indix
Democratization of Data @Indix
 
The Future of Data Science
The Future of Data ScienceThe Future of Data Science
The Future of Data Science
 
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
Strata 2017 (San Jose): Building a healthy data ecosystem around Kafka and Ha...
 
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
Building a healthy data ecosystem around Kafka and Hadoop: Lessons learned at...
 
Developing and deploying AI solutions on the cloud using Team Data Science Pr...
Developing and deploying AI solutions on the cloud using Team Data Science Pr...Developing and deploying AI solutions on the cloud using Team Data Science Pr...
Developing and deploying AI solutions on the cloud using Team Data Science Pr...
 
Talend introduction v1
Talend introduction v1Talend introduction v1
Talend introduction v1
 
Bhadale group of companies our technology ecosystem
Bhadale group of companies our technology ecosystemBhadale group of companies our technology ecosystem
Bhadale group of companies our technology ecosystem
 
Ravi Sundriyal
Ravi SundriyalRavi Sundriyal
Ravi Sundriyal
 
RedisConf17 - Real-time Intelligence with Redis-ML and Apache Spark
RedisConf17 - Real-time Intelligence with Redis-ML and Apache SparkRedisConf17 - Real-time Intelligence with Redis-ML and Apache Spark
RedisConf17 - Real-time Intelligence with Redis-ML and Apache Spark
 
Bluegranite AA Webinar FINAL 28JUN16
Bluegranite AA Webinar FINAL 28JUN16Bluegranite AA Webinar FINAL 28JUN16
Bluegranite AA Webinar FINAL 28JUN16
 
Developing Enterprise Consciousness: Building Modern Open Data Platforms
Developing Enterprise Consciousness: Building Modern Open Data PlatformsDeveloping Enterprise Consciousness: Building Modern Open Data Platforms
Developing Enterprise Consciousness: Building Modern Open Data Platforms
 
Sean Java Arch
Sean Java ArchSean Java Arch
Sean Java Arch
 
Mohamed-Rashad-Resume
Mohamed-Rashad-ResumeMohamed-Rashad-Resume
Mohamed-Rashad-Resume
 
USQL Trivadis Azure Data Lake Event
USQL Trivadis Azure Data Lake EventUSQL Trivadis Azure Data Lake Event
USQL Trivadis Azure Data Lake Event
 

Más de IDEAS - Int'l Data Engineering and Science Association

Más de IDEAS - Int'l Data Engineering and Science Association (20)

How to deliver effective data science projects
How to deliver effective data science projectsHow to deliver effective data science projects
How to deliver effective data science projects
 
Digital cracks in banking--Sid Nandi
Digital cracks in banking--Sid NandiDigital cracks in banking--Sid Nandi
Digital cracks in banking--Sid Nandi
 
Battling Skynet: The Role of Humanity in Artificial Intelligence
Battling Skynet: The Role of Humanity in Artificial IntelligenceBattling Skynet: The Role of Humanity in Artificial Intelligence
Battling Skynet: The Role of Humanity in Artificial Intelligence
 
Implementing Artificial Intelligence with Big Data
Implementing Artificial Intelligence with Big DataImplementing Artificial Intelligence with Big Data
Implementing Artificial Intelligence with Big Data
 
Data Architecture (i.e., normalization / relational algebra) and Database Sec...
Data Architecture (i.e., normalization / relational algebra) and Database Sec...Data Architecture (i.e., normalization / relational algebra) and Database Sec...
Data Architecture (i.e., normalization / relational algebra) and Database Sec...
 
Blockchain Application in Real Estate Transactions
Blockchain Application in Real Estate TransactionsBlockchain Application in Real Estate Transactions
Blockchain Application in Real Estate Transactions
 
Learning to learn Model Behavior: How to use "human-in-the-loop" to explain d...
Learning to learn Model Behavior: How to use "human-in-the-loop" to explain d...Learning to learn Model Behavior: How to use "human-in-the-loop" to explain d...
Learning to learn Model Behavior: How to use "human-in-the-loop" to explain d...
 
Practical Machine Learning at Work
Practical Machine Learning at WorkPractical Machine Learning at Work
Practical Machine Learning at Work
 
Artificial Intelligence: Hype, Reality, Vision.
Artificial Intelligence: Hype, Reality, Vision.Artificial Intelligence: Hype, Reality, Vision.
Artificial Intelligence: Hype, Reality, Vision.
 
Operationalizing your Data Lake: Get Ready for Advanced Analytics
Operationalizing your Data Lake: Get Ready for Advanced AnalyticsOperationalizing your Data Lake: Get Ready for Advanced Analytics
Operationalizing your Data Lake: Get Ready for Advanced Analytics
 
Introduction to Deep Reinforcement Learning
Introduction to Deep Reinforcement LearningIntroduction to Deep Reinforcement Learning
Introduction to Deep Reinforcement Learning
 
Best Practices in Data Partnerships Between Mayor's Office and Academia
Best Practices in Data Partnerships Between Mayor's Office and AcademiaBest Practices in Data Partnerships Between Mayor's Office and Academia
Best Practices in Data Partnerships Between Mayor's Office and Academia
 
Everything You Wish You Knew About Search
Everything You Wish You Knew About SearchEverything You Wish You Knew About Search
Everything You Wish You Knew About Search
 
AliMe Bot Platform Technical Practice - Alibaba`s Personal Intelligent Assist...
AliMe Bot Platform Technical Practice - Alibaba`s Personal Intelligent Assist...AliMe Bot Platform Technical Practice - Alibaba`s Personal Intelligent Assist...
AliMe Bot Platform Technical Practice - Alibaba`s Personal Intelligent Assist...
 
Data-Driven AI for Entertainment and Healthcare
Data-Driven AI for Entertainment and HealthcareData-Driven AI for Entertainment and Healthcare
Data-Driven AI for Entertainment and Healthcare
 
Generating Creative Works with AI
Generating Creative Works with AIGenerating Creative Works with AI
Generating Creative Works with AI
 
Using AI to Tackle the Future of Health Care Data
Using AI to Tackle the Future of Health Care DataUsing AI to Tackle the Future of Health Care Data
Using AI to Tackle the Future of Health Care Data
 
State of AI/ML in Real Estate
State of AI/ML in Real EstateState of AI/ML in Real Estate
State of AI/ML in Real Estate
 
Hot Dog, Not Hot Dog! Generate new training data without taking more photos.
Hot Dog, Not Hot Dog! Generate new training data without taking more photos.Hot Dog, Not Hot Dog! Generate new training data without taking more photos.
Hot Dog, Not Hot Dog! Generate new training data without taking more photos.
 
Machine Learning in Healthcare and Life Science
Machine Learning in Healthcare and Life ScienceMachine Learning in Healthcare and Life Science
Machine Learning in Healthcare and Life Science
 

Último

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Último (20)

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 

“Full Stack” Data Science with R for Startups: Production-ready with Open-Source Tools

  • 1. "Full Stack" Data Science with R Startups: Production-Ready with Open Source Tools #rstats #SoCalDS17 #IDEAS17 Oct 22, 2017 Ajay Gopal 1
  • 2. #SoCalDS17 #IDEAS17 | #rstats | @aj2z @SelfScore Me: (Data) Scientist, Technologist, Entrepreneur 2 Ajay Gopal, PhD : ajzz : @aj2z 2017: Chief Data Scientist, SelfScore Inc #FinTech #ML #Underwriting #Risk #rstats 2016: VP, Data Science & Growth, CARD.com #FinTech #MktgAutomation #BehavEcon #rstats 2012: Postdoc / Staff Researcher, UCLA #BioInformatics #GraphTheory #StatMech #Python 2005: PhD, Univ of Chicago #SurfacePhysics #BioPhysics #StatMech #Matlab
  • 3. #SoCalDS17 #IDEAS17 | #rstats | @aj2z @SelfScore Ajay Gopal, PhD : ajzz : @aj2z 2017: Chief Data Scientist, SelfScore Inc #FinTech #ML #Underwriting #Risk #rstats 2016: VP, Data Science & Growth, CARD.com #FinTech #MktgAutomation #BehavEcon #rstats 2012: Postdoc / Staff Researcher, UCLA #BioInformatics #GraphTheory #StatMech #Python 2005: PhD, Univ of Chicago #SurfacePhysics #BioPhysics #StatMech #Matlab SelfScore: Financial Education & Inclusion 3 SelfScore Industry FinTech Alt-Lending Startup, Menlo Park, CA What we do Use ML models with alternative financial signals to help deserving but underserved populations gain access to fair credit, started with international students (2 products in market) Differentiator Measure borrower’s potential instead of history (eg without SSN / FICO etc) Team ~ 30 (4 in Data Science + You?) Funding Series B, Founded in 2013
  • 4. #SoCalDS17 #IDEAS17 | #rstats | @aj2z @SelfScore ... was born on Twitter For Startups + New Teams 1) Evolving Data Science needs 2) What’s “Full Stack” DS? 3) Why use R (or Python)? 4) Cloud R-based DS Stack - Sample Infra - Open Source tools ------------------------- 5) Production Mindset 6) Buy or Build? This talk 4
  • 5. #SoCalDS17 #IDEAS17 | #rstats | @aj2z @SelfScore Data Science (VC) Expectations Evolve Innovation Vertical + Optimization Laterally 5 Data Science IP, AI, Innovation, R&D Operations Finance Compliance Technology Product CX Demand Gen Growth Infra Process Automation Product Optimization Ad / Comms Optim Considerations: ● Disruptive if relying on resources from other verticals ● More ad-hoc work ● R&D timelines not predictable ● Faster cadence for analytics Solution: ● “Full Stack” Infra & Teams! ● Tools & Training for others to self-serve Data Science in Modern (Gen-AI) Startups
  • 6. #SoCalDS17 #IDEAS17 | #rstats | @aj2z @SelfScore The “Full Stack” Analogy 6 Front End Back End Data Store Devops APIs UX Technology Puppet, Chef, Ansible, AWS EC2, Docker, ECS/GCE, Heroku MySQL, PostGres, MongoDB, Redis, MemCached etc. PHP, JS, Python, Ruby, ORMs, CI, Git Restify, Django, Rails, ASP.net, Lambda HTML/CSS, JS (Node, React), Bootstrap, iOS, Android, Ionic, Cordova Email (SendGrid), SMS (Twilio), Push (SNS, Firebase), Msg Frmwks Function Multi-Channel Engagement Optimal Service Delivery Platform-agnostic function & information availability Business Logic Identities, Attribs, Relations Scaleable Services & Contingencies Goal: Scalable, Engaging, Valuable Web Service
  • 7. “Full Stack” Web Services - Technologies (same layer, function and technology breakdown as slide 6)
  • 8. “Full Stack” Data Science
       Goal: Scalable, Timely, Intelligence/Economic Services
       Layer | Function | Technology
       - UX | Multi-Channel Engagement | httr, curl - API interactions for Email, SMS, Push, Slack, OR via CI tool
       - Front End | Optimal Service Delivery | shiny, HTML, CSV, rook, googlesheets, HtmlWidgets, shinyapps.io, Dropbox
       - APIs | Platform-agnostic function & information availability | Generic: rapache, opencpu, plumber; ML: h2o/steam, Domino Data Lab
       - Back End | Business Logic | Your internal pkgs, RServer, CI, Git, Chron, (most R packages), sparkR
       - Data Store | Identities, Attribs, Relations | DBI, RMySQL, RPostgreSQL, Redis, Hadoop, Kinesis (AWR), Spark etc.
       - Devops | Scalable Services & Contingencies | rocker, EMIs, ECS, GCE, other cloud tools
  • 9. R is Sufficient For All Key Stack Functions
       1) Retrieve Data
          - Ad / Marketing
          - Sales
          - Transaction
          - 3rd Party / Behavioral
       2) Process (ETL)
          - Fetch, clean up, store
       3) Analyze
          - Cross-Connectivity
          - Aggregation & Features
          - Algorithms
       4) Predict
          - Models in batch
          - In-memory modeling
          - REST APIs
       5) Inform
          - Customers (Services & API)
          - Partners, e.g. Marketing, fulfillment
          - Internal Stakeholders, e.g. Reporting / Dashboards
  • 10. “Full Stack” Data Science with R
        Goal: Scalable, Timely, Intelligence/Economic Services
        Layer | Function | Technology
        - UX | Multi-Channel Engagement | httr, curl - API interactions for Email, SMS, Push, Slack, OR via CI tool
        - Front End | Optimal Service Delivery | shiny, HTML, CSV, rook, googlesheets, HtmlWidgets, shinyapps.io, Dropbox
        - APIs | Platform-agnostic function & information availability | Generic: rapache, opencpu, plumber; ML: h2o/steam, Domino, Lambda
        - Back End | Business Logic | Your internal pkgs, RServer, CI, Git, H2O, (most R packages), Spark
        - Data Store | Identities, Attribs, Relations | DBI, RMySQL, RPostgreSQL, Redis, Hadoop, Kinesis (AWR), SparkR etc.
        - Devops | Scalable Services & Contingencies | rocker, EMIs, ECS, GCE, other cloud tools, Domino Data Lab, Azure
  • 11. R is great for startups!
        Top Drivers for Startups
        1. Instant Reactive Web Visualizations via Shiny (zero front-end dev)
        2. Low barrier for cross-training
        3. Fantastic IDE (RStudio) (single-point access to stack)
        4. Large ecosystem of packages (modeling + viz + utils)
        5. Great client libraries for ML frameworks
        6. Statistically Trained Prospects (Python / Pandas odds good too)
        Detractors
        - Fewer hard-core devs
        - Only handful of dev shops; no serious bandwidth for hire
        - Memory mgmt (still?)
        So how do we build an R based stack?
  • 12. Data Science Should Be This Easy
        Diagram labels: AUTOMATION; Data Science IDE; Interactive Dashboards; Predictive Models & APIs; Alerts Notification, Files
        So how do we build this in the cloud?
  • 13. Assembly of Cloud Container Services
        1) Bastion - to connect to external world (small, low memory, public IP)
        2) Scheduler - do things triggered by time & events (medium, run CI tools, invoke compute slaves)
        3) Workers - heavy feature computations (highmem, multi core, stateless)
        4) Storage - DBs, pipelines & message queues (distributed storage services or internal clusters)
        5) Modeler - H2O Cluster, MLlib, scikit-learn etc. (multi-node cluster, available on demand)
        6) Reporter - API Service / Shiny server (medium, autoscaled containers)
  • 14. Sample AWS Infra
  • 15. Choice of Tools
  • 16. Sample Production Workflows
        “Staging” Shiny App
        1. Git commit app to “Dev” branch
        2. Jenkins syncs repo on commit
        3. Sync triggers next Jenkins job, which creates Docker container
        4. Next job: AWS CLI tools deploy Docker container to ECS
        5. “Dev” Shiny app live on staging
        6. API call to notify Slack channel
        SEM Cost Forecaster
        1. Rscript fetches AdWords spend & internal sales data every 5 minutes
        2. Rscript runs existing anomaly detection & forecast model
        3. When check fails, API calls from R to SMS (e.g. Twilio) and Email (e.g. SendGrid)
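        The check-then-alert step of the SEM Cost Forecaster workflow above could be sketched roughly as follows. This is an illustration only, written in Python rather than the Rscript the slide describes. The z-score rule is a simple stand-in for the "existing anomaly detection & forecast model", which the slides do not specify. The names `is_anomalous`, `check_spend` and the `notify` callback are hypothetical; in a real job `notify` would wrap the SMS or email API call.

```python
from statistics import mean, stdev
from typing import Callable, Sequence


def is_anomalous(history: Sequence[float], latest: float,
                 z_threshold: float = 3.0) -> bool:
    """Flag `latest` if it lies more than `z_threshold` sample standard
    deviations away from the mean of `history`.

    Assumption: a plain z-score test stands in for the real model.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical points")
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Flat history: any change at all counts as an anomaly.
        return latest != mu
    return abs(latest - mu) / sigma > z_threshold


def check_spend(history: Sequence[float], latest: float,
                notify: Callable[[str], None],
                z_threshold: float = 3.0) -> bool:
    """Run the check; call `notify` with a message when it fails.

    `notify` is a hypothetical callback (e.g. a wrapper around an SMS or
    email API). Returns True if an alert was sent.
    """
    if is_anomalous(history, latest, z_threshold):
        notify(f"SEM spend anomaly: latest={latest:.2f}, "
               f"recent mean={mean(history):.2f}")
        return True
    return False


if __name__ == "__main__":
    sent = []  # collect alerts in a list instead of calling a real API
    spend = [100.0, 102.0, 98.0, 101.0, 99.0]
    check_spend(spend, 100.5, sent.append)  # within normal range
    check_spend(spend, 180.0, sent.append)  # far outside normal range
    print(sent)
```

        Passing the notifier in as a callback keeps the check testable without network access, and the scheduler only needs to run the script every 5 minutes.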
  • 17. Building Full-Stack Data Science Teams
        People
        - Data / Backend Engineer
        - Data Scientist
        - Modeller / Statistician
        - Product Manager
        - Devops Engineer
        Team Output
        - EDA / ad-hoc
        - Scheduled Reporting
        - Batch Predictions
        - Stream Processing
        - Real-Time Prediction APIs
        Our “product” is scalable, actionable intelligence
        … let’s adopt good software development practices
  • 18. The Production Mindset for Data Scientists
        BetteR habits:
        1. Write inline and offline tests for your code (testthat, checkmate)
        2. Generate informational logs so you can debug later (futile.logger)
        3. Add versioning (GitHub)
        4. Save business logic as functions in package (selfscoRe)
        5. Add examples (Rmd)
        6. Write documentation (Rmd)
        7. Create a web service (Shiny apps)
        8. Put the service in a Docker container
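        Habits 1, 2 and 4 above can be illustrated with a minimal sketch. It is written in Python for illustration, where the slide names the R packages testthat, checkmate and futile.logger for the same roles. The `debt_to_income` function is a hypothetical business rule invented for this example, not something taken from the talk.

```python
import logging
import unittest

logger = logging.getLogger("scoring")


def debt_to_income(monthly_debt: float, monthly_income: float) -> float:
    """Return the debt-to-income ratio (hypothetical business rule)."""
    # Inline checks (habit 1): fail fast on bad input.
    for name, value in (("monthly_debt", monthly_debt),
                        ("monthly_income", monthly_income)):
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise TypeError(f"{name} must be a number")
    if monthly_debt < 0:
        raise ValueError("monthly_debt must be non-negative")
    if monthly_income <= 0:
        raise ValueError("monthly_income must be positive")

    ratio = monthly_debt / monthly_income
    # Informational log (habit 2): enough detail to debug later.
    logger.info("debt_to_income: debt=%.2f income=%.2f ratio=%.4f",
                monthly_debt, monthly_income, ratio)
    return ratio


class DebtToIncomeTest(unittest.TestCase):
    """Offline tests (habit 1), run separately from production code."""

    def test_typical_value(self):
        self.assertAlmostEqual(debt_to_income(500, 2000), 0.25)

    def test_rejects_zero_income(self):
        with self.assertRaises(ValueError):
            debt_to_income(500, 0)

    def test_logs_result(self):
        with self.assertLogs("scoring", level="INFO") as captured:
            debt_to_income(500, 2000)
        self.assertIn("ratio=0.2500", captured.output[0])


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    unittest.main()
```

        Keeping the rule in a named, documented function (habit 4) is what lets the same code back a report, a batch job and a web service.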
  • 19. Should we buy or build?
        Should my company buy the infra? VS Should my team build it?
  • 20. Buy vs Build Considerations
        BUY / RENT
        - If no dev/tech in-house
        - If time-to-market is key
        requires:
        - Custom Integrations
        - Higher Cost Tolerance
        - Niche engagements
        BUILD
        - If compliance is major factor (HIPAA, PCI)
        - If cost control is key
        - Full Control of Features Reqd
        requires:
        - In-house talent
        - Longer time-to-market?
        - Ongoing maintenance
  • 21. In Summary
        - Data Science is Vertical + Lateral!
        - Colocate data sources
        - Containerize services in the cloud
        - Use R’s Rich Ecosystem (or something easy to cross-train other verticals on)
        Thank You!
        Hiring Sr “Full Stack” Data Scientist
        *ML Models
        Img Credits: http://daemon.co.za/2014/04/what-does-full-stack-mean