Most people think a successful data product requires just three things: data, the
right algorithm, and good execution. But as anyone who’s tried to create one
knows, an effective product requires much more. In this talk, Dr. Correa Bahnsen
will share his successes—and failures—in building data products for information
security, and why an isolated data science team is a recipe for failure.
How I Learned to Stop Worrying and Love Building Data Products
1. Behind the Scenes in
Building Data Products
From Data Science to Data Products
Experiences in Information Security
Alejandro Correa Bahnsen, PhD
Chief Data Scientist & Head of Research
acorrea@easysol.net
2. Who am I?
Chief Data Scientist at Easy Solutions
Industrial Engineer
PhD in Machine Learning from the University of Luxembourg
Scikit-Learn contributor
Organizer of Science Bogota Meetups
3. About Easy Solutions®
A leading global provider of electronic fraud
prevention for financial institutions and
enterprise customers
• 430+ customers in 30 countries
• 115 million users protected
• 30 billion online connections monitored
Industry recognition
4. Aims of this talk
Discuss what makes a data science project successful
10. Those are the pillars of data science: computing,
statistics, mathematics, and quantitative
disciplines, combined to analyze data for better
decision making
Data Science Is the Intersection of Hacking
Skills, Math & Statistics Knowledge and
Substantive Expertise
11. Hacking Skills
Ability to build things and find clever solutions to
problems
• Programming/Coding: Python and R (and others)
• Databases: MySQL, PostgreSQL, Cassandra, MongoDB
and CouchDB.
• Visualization: D3, Tableau, Qlikview and Markdown.
• Big Data: Hadoop, MapReduce and Spark.
13. Math & Statistics
Being able to understand the right solution to each
problem
• Linear algebra: Matrix manipulation
• Machine Learning: Random Forests, SVM, Boosting
• Descriptive statistics: Describe, Cluster
• Statistical inference: Generate new knowledge
14. Substantive Expertise
Asking good questions requires domain understanding,
which is why a data scientist can’t create data-based
solutions without good industry knowledge
• Is this A or B or C? (classification)
• Is this weird? (anomaly detection)
• How much/how many? (regression)
• How is it organized? (clustering)
• What should I do next? (reinforcement learning)
16. Research / Data Science Spectrum
• Basic Research: Maybe someday, someone can use this
• Applied Research: I might be able to use this
• Working Prototype: I can use this (sometimes)
• Quality Code: Software engineers can use this
• Tool or Service: People can use this
Axis labels: Innovation, Practicality
24. Ideal Phishing Detection System - Issues
Issues with full content analysis:
• Time consuming
• Impractical to process millions of websites per day
• Hard to implement for small devices
34. Long Short-Term Memory Networks (LSTM)
RNN contains a single layer
LSTM contains four interacting layers
Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
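The "four interacting layers" can be made concrete with a small sketch of one LSTM time step, following the standard formulation described in the post cited above. It is an illustration only, not the model used in the talk: the `dense` helper, the `params` layout and the `zero_params` helper are hypothetical names introduced here, and plain Python lists stand in for a real tensor library.

```python
import math


def sigmoid(z):
    """Logistic function, squashing z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def dense(weights, bias, vec, activation):
    """One layer: activation(W @ vec + b), written with plain lists."""
    return [activation(sum(w * v for w, v in zip(row, vec)) + b)
            for row, b in zip(weights, bias)]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step.

    params is a hypothetical layout: it maps 'f', 'i', 'g', 'o'
    to a (weights, bias) pair, one pair per layer.
    """
    z = h_prev + x  # concatenate previous hidden state and current input
    f = dense(*params["f"], z, sigmoid)    # layer 1: forget gate
    i = dense(*params["i"], z, sigmoid)    # layer 2: input gate
    g = dense(*params["g"], z, math.tanh)  # layer 3: candidate values
    o = dense(*params["o"], z, sigmoid)    # layer 4: output gate
    # New cell state: keep part of the old state, add part of the candidate.
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    # New hidden state: gated, squashed view of the cell state.
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c


def zero_params(hidden_size, input_size):
    """All-zero weights and biases, useful only for checking the wiring."""
    width = hidden_size + input_size
    layer = ([[0.0] * width for _ in range(hidden_size)], [0.0] * hidden_size)
    return {name: layer for name in ("f", "i", "g", "o")}


if __name__ == "__main__":
    h, c = lstm_step([1.0], [0.0, 0.0], [1.0, -1.0], zero_params(2, 1))
    print("hidden state:", h)
    print("cell state:", c)
```

A plain RNN would keep only a single `dense` call with `tanh`; the three sigmoid gates are what let the LSTM decide what to forget, what to store and what to expose at each step.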
38. Total Effort: 30 %, 50 %, 20 %
1. Let's build Swordphish
2. Data Collection
3. Random Forest Classifier
4. API
5. Product Evaluation
6. Recurrent Neural Networks
7. Distributed API
8. Port to C++
9. Sales & Marketing
42. BrandID - Scope
• Create a learning ML engine to label attacks
against any brand
• Not limited to current customers or known layouts
• Apply ML techniques to extract knowledge
• Enhance predictive capabilities
49. 1. Get Phishing Site Info
Splash takes 5s to render one URL
BrandID receives 33,000 URLs per day
It would take 4.6 days to process one day of URLs
The volume is expected to grow up to 1,000,000 URLs per day
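The throughput problem on this slide is simple arithmetic, sketched below as a back-of-the-envelope calculation. The function names are hypothetical and the model assumes strictly sequential processing with a fixed cost per URL. Using only the two figures stated here (5 s per URL, 33,000 URLs per day), rendering alone comes to about 1.9 days of work per day of URLs; the 4.6 days quoted on the slide corresponds to roughly 12 s per URL end to end, which may include steps beyond rendering (an assumption, since the slide does not break it down). The per-URL cost is therefore left as a parameter.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def sequential_days(urls_per_day, seconds_per_url):
    """Days of work needed to process one day of URLs, one at a time."""
    return urls_per_day * seconds_per_url / SECONDS_PER_DAY


def workers_needed(urls_per_day, seconds_per_url):
    """Minimum number of parallel workers needed to keep up with the inflow."""
    return math.ceil(urls_per_day * seconds_per_url / SECONDS_PER_DAY)


def implied_seconds_per_url(days, urls_per_day):
    """Per-URL cost implied by a quoted processing time in days."""
    return days * SECONDS_PER_DAY / urls_per_day


if __name__ == "__main__":
    print(round(sequential_days(33_000, 5), 2))            # 1.91
    print(round(implied_seconds_per_url(4.6, 33_000), 2))  # 12.04
    print(workers_needed(1_000_000, 5))                    # 58
```

Either way the conclusion is the same: a single sequential renderer falls behind at today's volume, and at 1,000,000 URLs per day the work has to be spread over dozens of workers.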
51. Transfer Learning and Siamese Networks
Main idea: find a function that maps input patterns into
a target space such that a simple distance in the target
space (say the Euclidean distance) approximates the
“semantic” distance in the input space
2. Analyze Images
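The main idea above can be sketched in a few lines: both inputs pass through the same embedding function, and a Euclidean distance in the target space is used as the similarity score. This is a toy illustration under stated assumptions, not the BrandID implementation: `embed` is a hypothetical stand-in (a single tanh layer) for the pretrained network that transfer learning would supply, and the contrastive loss shown is one common way to train such a pair, which the slides do not confirm.

```python
import math


def embed(x, weights):
    """Shared embedding: the SAME weights are applied to every input."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]


def euclidean_distance(a, b):
    """Simple distance in the target space."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def contrastive_loss(distance, same, margin=1.0):
    """Pull matching pairs together, push non-matching pairs past the margin."""
    if same:
        return 0.5 * distance ** 2
    return 0.5 * max(0.0, margin - distance) ** 2


def siamese_distance(x1, x2, weights):
    """Distance between two inputs after the shared embedding."""
    return euclidean_distance(embed(x1, weights), embed(x2, weights))


if __name__ == "__main__":
    # Hypothetical weights and inputs, chosen only to exercise the functions.
    weights = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
    page_a = [1.0, 0.0, 2.0]
    page_b = [1.0, 0.1, 2.0]   # near-duplicate of page_a
    page_c = [-2.0, 3.0, 0.0]  # unrelated input
    print("near pair:", siamese_distance(page_a, page_b, weights))
    print("far pair:", siamese_distance(page_a, page_c, weights))
```

Because the weights are shared, identical inputs always land at distance zero, and training only has to shape the target space so that visually similar pages end up close together.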
59. Total Effort: 10 %, 50 %, 40 %
Diagram labels: Business Case, Random Forest Classifier, Data Collection,
Product Evaluation, Image Analysis, Distributed API, Splash JS, NLP, Spark,
AKKA, Transfer Learning, HTML Analysis
60. At the end of the day, there is much more to Data Products than just
Machine Learning.
61. If you have any questions or comments, please let me know.
Alejandro Correa Bahnsen, PhD
Chief Data Scientist & Head of Research
acorrea@easysol.net
Thank you!