This document outlines a 10-step framework for developing data science applications. It begins with articulating the business problem and the data questions. The next steps are developing a data acquisition and preparation strategy, exploring and formatting the data, defining the goal, and shortlisting techniques. Later steps evaluate business constraints, establish evaluation criteria, fine-tune algorithms, and plan for deployment and monitoring. The document also provides background on the speaker and the organization, which offers data science, quant finance, and machine learning programs and consulting using Python, R, and MATLAB on its online sandbox platform.
1. Data Science in 10 steps:
A framework for developing data science applications
2018 Copyright QuantUniversity LLC.
Presented By:
Sri Krishnamurthy, CFA, CAP
sri@quantuniversity.com
www.analyticscertificate.com
2. 2
About us:
• Data Science, Quant Finance and Machine Learning Startup
• Technologies using MATLAB, Python and R
• Programs
▫ Analytics Certificate Program
▫ Fintech programs
• Platform
3. • Founder of QuantUniversity LLC and www.analyticscertificate.com
• Advisory and Consultancy for Financial Analytics
• Prior Experience at MathWorks, Citigroup and Endeca, and with 25+ financial services and energy customers
• Regular Columnist for the Wilmott Magazine
• Author of the forthcoming book “Financial Modeling: A case study approach”, published by Wiley
• Chartered Financial Analyst and Certified Analytics Professional
• Teaches Analytics in the Babson College MBA program and at Northeastern University, Boston
Sri Krishnamurthy
Founder and CEO
4. 4
Slides to be shared on https://researchhub.qusandbox.com
10. 10
The rise of Big Data and Data Science
Image Source: http://www.ibmbigdatahub.com/sites/default/files/infographic_file/4-Vs-of-big-data.jpg
11. 11
Smarter Algorithms
Parallel and Distributed Computing Frameworks
Deep Learning Frameworks
1. Our labeled datasets were thousands of times too small.
2. Our computers were millions of times too slow.
3. We initialized the weights in a stupid way.
4. We used the wrong type of non-linearity.
- Geoff Hinton
“Capital One was able to determine fraudulent credit
card applications in 100 milliseconds”*
* http://go.databricks.com/hubfs/pdfs/Databricks-for-FinTech-170306.pdf
13. 13
Typical data science workflows
[Figure: workflow stages: Data cleansing, Feature Engineering, Training and Testing, Model building, Model selection, Model Deployment]
14. 14
The reality of working on data science problems
[Figure: workflow stages: Data cleansing, Feature Engineering, Training and Testing, Model building, Model selection, Model Deployment]
18. 18
2. The Data questions
1. Do you know what data you need?
2. Do you know if the data is available?
3. Do you have the data?
4. Do you have the right data?
5. Will you continue to have the data?
Data science in 10 steps
19. 19
3. Develop a data acquisition and data prep strategy
1. Do you know how to get the data?
2. Who gets the data?
3. How do you process it?
4. How do you access it?
5. How do you version and govern the data?
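One lightweight way to approach the versioning question above is to fingerprint each dataset snapshot by its content, so that any change to the data yields a new version id. The sketch below is only an illustration, not the presenter's method; `fingerprint_rows` and the sample snapshots are hypothetical.

```python
import hashlib
import json


def fingerprint_rows(rows):
    """Return a SHA-256 hex digest that identifies the content of a dataset.

    Rows are dicts; keys are sorted so column order does not change the hash.
    Row order is treated as significant (an assumption of this sketch).
    """
    digest = hashlib.sha256()
    for row in rows:
        line = json.dumps(row, sort_keys=True, separators=(",", ":"))
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


# Hypothetical snapshots of the same table at two points in time.
snapshot_v1 = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 12.5}]
snapshot_v2 = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 13.0}]

version_v1 = fingerprint_rows(snapshot_v1)
version_v2 = fingerprint_rows(snapshot_v2)
print(version_v1 != version_v2)  # the changed amount yields a new version id
```

Storing the digest next to each trained model is one way to record which data a model was built from; a dedicated data versioning tool may be a better fit for large datasets.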
20. 20
4. Explore and evaluate your data and get it in the right format
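A first pass at exploring the data might count missing values per column and check which columns parse as numbers. The sketch below uses only the Python standard library; the `RAW_CSV` extract and the `profile` helper are hypothetical and stand in for whatever data source and tooling a project actually uses.

```python
import csv
import io
import statistics

# Hypothetical raw extract; blank cells stand for missing values.
RAW_CSV = """ticker,price,volume
AAA,10.5,100
BBB,,250
CCC,7.25,
"""


def profile(csv_text):
    """Summarize each column: missing count and, for numeric columns, the mean."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    summary = {}
    for column in rows[0].keys():
        values = [row[column] for row in rows]
        present = [v for v in values if v != ""]
        info = {"missing": len(values) - len(present)}
        try:
            # Convert to float so the column ends up in a usable numeric format.
            info["mean"] = statistics.mean(float(v) for v in present)
        except ValueError:
            info["mean"] = None  # non-numeric (or entirely empty) column
        summary[column] = info
    return summary


for name, info in profile(RAW_CSV).items():
    print(name, info)
```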
21. 21
5. Define your goal:
1. Summarization
2. Fact finding
3. Understanding relationships
4. Prediction
22. 22
6. Shortlist (not “Choose”) the techniques/methodologies/algorithms
23. 23
7. Evaluate/establish business constraints and narrow down your
choices of techniques/methodologies/algorithms
1. Cloud/Cost/Expertise/Cost-Value
2. Build/buy/access
[Figure: Outcomes, with Time, Quality and Cost]
24. 24
8. Establish criteria to know if the methodology/models/algorithms work
1. Is the process replicable?
2. What performance metrics do we choose?
3. Can you evaluate the performance and validate if the models meet
the criteria?
4. Does it provide business value?
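As one possible way to make the criteria above concrete, the sketch below computes precision, recall and F1 for a binary classifier and checks them against acceptance thresholds. The labels and the thresholds are made up for illustration; which metrics and cut-offs matter depends on the business problem.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


def meets_criteria(metrics, min_precision, min_recall):
    """Return True when both precision and recall reach their thresholds."""
    precision, recall, _ = metrics
    return precision >= min_precision and recall >= min_recall


# Illustrative labels only: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

metrics = precision_recall_f1(y_true, y_pred)
print(metrics)
print(meets_criteria(metrics, min_precision=0.7, min_recall=0.8))
```

Agreeing on the thresholds before modeling starts keeps the evaluation replicable and avoids tuning the criteria to fit the results.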
25. 25
9. Fine tune your algorithms and algorithm selection
1. Hyperparameter tuning
2. Bias-variance tradeoff
3. Handling imbalanced class problems
4. Ensemble techniques
5. AutoML
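For the imbalanced class item above, one common option (among several, such as resampling) is to weight each class inversely to its frequency. The sketch below uses the formula n_samples / (n_classes * class_count), which is the heuristic scikit-learn documents for its "balanced" class weights; the function name and the fraud/ok labels are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Illustrative labels: 9 legitimate applications for every fraudulent one.
labels = ["ok"] * 9 + ["fraud"] * 1
weights = inverse_frequency_weights(labels)
print(weights)  # the rare "fraud" class gets the larger weight
```

Many training routines accept per-class or per-sample weights, so rare classes contribute as much to the loss as common ones.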
https://support.sas.com/resources/papers/proceedings17/SAS0514-2017.pdf
26. 26
10. How will this process reach decision makers?
1. Deployment choices (On-prem/Cloud)
2. Frequency of data/model updates
3. Governance/Role/Responsibilities
4. Speed, Scale, Availability, Disaster recovery, Rollback, Pull-Plug
27. 27
How do you monitor the efficacy of your solution?
1. Retuning
2. Monitoring
3. Model decay
4. Data augmentation
5. Newer innovations
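Monitoring for model decay can start with something as simple as tracking accuracy over the most recent predictions and flagging when it drops below an agreed level. The sketch below is a minimal illustration; the class name, window size and threshold are assumptions, not part of the original talk.

```python
from collections import deque


class RollingAccuracyMonitor:
    """Track accuracy over the last `window` predictions."""

    def __init__(self, window, min_accuracy):
        self.outcomes = deque(maxlen=window)  # old outcomes fall off automatically
        self.min_accuracy = min_accuracy

    def record(self, prediction, actual):
        """Store whether a prediction matched the observed outcome."""
        self.outcomes.append(prediction == actual)

    def accuracy(self):
        """Return accuracy over the current window, or None if it is empty."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retuning(self):
        """Flag decay only once the window is full, to avoid noisy early alerts."""
        full = len(self.outcomes) == self.outcomes.maxlen
        return full and self.accuracy() < self.min_accuracy


monitor = RollingAccuracyMonitor(window=4, min_accuracy=0.75)
for prediction, actual in [(1, 1), (0, 0), (1, 1), (0, 0), (1, 0), (0, 1)]:
    monitor.record(prediction, actual)
print(monitor.accuracy(), monitor.needs_retuning())
```

In practice the true outcome often arrives with a delay, so the window would be filled as labels become available rather than at prediction time.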
Data science in 10 steps - Bonus
31. 31
The Veracity of Information also affects markets
"The goal of the securities law is to provide the capital markets with accurate
information, and people's motivation are really beside the point,"
- Prof. Jill Fisch, University of Pennsylvania Law School
39. 39
• If computers can understand language, it opens huge possibilities
▫ Read and summarize
▫ Translate
▫ Describe what’s happening
▫ Understand commands
▫ Answer questions
▫ Respond in plain language
Language allows understanding
40. 40
• Describe rules of grammar
• Describe meanings of words and their relationships
• …including all the special cases
• ...and idioms
• ...and special cases for the idioms
• ...
• ...understand language!
Traditional language AI
https://en.wikipedia.org/wiki/Formal_language
41. 41
What is NLP?
Jumping NLP Curves
https://ieeexplore.ieee.org/document/6786458/
43. 43
• Ambiguity:
▫ “ground”
▫ “jaguar”
▫ “The car hit the pole while it was moving”
▫ “One morning I shot an elephant in my pajamas. How he got into my pajamas, I’ll never know.”
▫ “The tank is full of soldiers.” / “The tank is full of nitrogen.”
Language is hard to deal with
45. 45
• Many ways to say the same thing
▫ “the same thing can be said in many ways”
▫ “language is versatile”
▫ “The same words can be arranged in many different ways to express the same idea”
▫ …
Language is hard to deal with
46. 46
• APIs
• Human Insight
• Expert Knowledge
• Build your own
Options?
47. 47
NLP pipeline
Stage 1: Data Ingestion from Edgar
Stage 2: Pre-Processing
Stage 3: Invoking APIs to label data
Stage 4: Compare APIs
Stage 5: Build a new model for sentiment Analysis
• Amazon Comprehend API
• Google API
• Watson API
• Azure API
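The "Compare APIs" stage could be approached by measuring how often the services assign the same sentiment label to the same text. The sketch below makes no real API calls: the `api_a`, `api_b` and `api_c` names and their labels are placeholders standing in for responses already collected from the services listed above.

```python
from itertools import combinations


def agreement_rate(labels_a, labels_b):
    """Fraction of items on which two label lists agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    matches = sum(1 for a, b in zip(labels_a, labels_b) if a == b)
    return matches / len(labels_a)


def pairwise_agreement(labels_by_source):
    """Agreement rate for every pair of sources, keyed by (source, source)."""
    return {
        (a, b): agreement_rate(labels_by_source[a], labels_by_source[b])
        for a, b in combinations(sorted(labels_by_source), 2)
    }


# Placeholder labels for four sentences; not real API output.
labels_by_source = {
    "api_a": ["pos", "neg", "neu", "pos"],
    "api_b": ["pos", "neg", "pos", "pos"],
    "api_c": ["neg", "neg", "neu", "pos"],
}

for pair, rate in pairwise_agreement(labels_by_source).items():
    print(pair, rate)
```

Raw agreement does not correct for agreement by chance; a chance-corrected statistic may be preferable when the label distribution is skewed.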