2016 CROWDSTRIKE, INC. ALL RIGHTS RESERVED.
A SOBER LOOK AT MACHINE LEARNING
DR. SVEN KRASSER, CHIEF SCIENTIST
@SVENKRASSER
Distinguishing Science…
Source: CERN, http://home.cern/sites/home.web.cern.ch/files/image/experiment/2013/01/cms_0.jpeg
…from Fiction
Source: “Chain Reaction,” 20th Century Fox
MACHINE LEARNING 101
EXAMPLES OF MACHINE LEARNING
• SPAM FILTERING
• MOVIE RECOMMENDATIONS
• SIRI (iPHONE)
TODAY’S FOCUS: SUPERVISED LEARNING
TODAY’S FOCUS: GEOMETRIC MODELS
EVERYTHING YOU WILL SEE TODAY IS REAL WORLD DATA
Some Data to Get Started:
1988 ANTHROPOMETRIC SURVEY OF ARMY PERSONNEL
Source: http://mreed.umtri.umich.edu/mreed/downloads.html#anthro
DATA
• Over 4000 soldiers surveyed
• Over 100 measurements
• Reported by gender
SELECTION BIAS
• Test subjects are in better shape than the rest of us...
FIRST LOOK
[Figure axes: Height [mm]; Density]
• Difference in distribution
• Significant overlap
SECOND DIMENSION
[Figure axes: Height [mm]; Weight [10^-1 kg]]
• Correlation
• Overlap
FEATURE SELECTION
[Figure axes: “Buttock Circumference” [mm]; Weight [10^-1 kg]]
• Correlation
• Gender-specific slope
• Reduced overlap
• Selection of features matters
• How to make a prediction?
K-NEAREST NEIGHBOR
[Figure axes: “Buttock Circumference” [mm]; Weight [10^-1 kg]; legend: m, f]
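The k-nearest-neighbor idea on this slide can be sketched in a few lines: a new point takes the majority label of the k closest training points. This is a minimal illustration, not the speaker's implementation. The function name `knn_predict`, the value k=3, and the six data points are made up for the example and are not taken from the survey.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Return the majority label among the k training points nearest to query."""
    # train is a list of ((circumference_mm, weight_decikg), label) pairs.
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical points for illustration only (not real survey data).
train = [
    ((950.0, 650.0), "f"),
    ((960.0, 640.0), "f"),
    ((970.0, 660.0), "f"),
    ((980.0, 800.0), "m"),
    ((990.0, 820.0), "m"),
    ((1000.0, 810.0), "m"),
]

print(knn_predict(train, (985.0, 805.0)))
print(knn_predict(train, (955.0, 645.0)))
```

Note that the two axes have different scales, so in practice the features would likely be normalized before distances are computed.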
SUPPORT VECTOR MACHINE
[Figure axes: “Buttock Circumference” [mm]; Weight [10^-1 kg]]
• Overfitting
• Classifier does not generalize
• Let’s take a closer look…
CROSS VALIDATION
TRAIN TRAIN TRAIN TEST
TRAIN TRAIN TEST TRAIN
TRAIN TEST TRAIN TRAIN
TEST TRAIN TRAIN TRAIN
• Divide data into k folds
• Train on k-1 folds, test on the remaining one
• Repeat k times for all folds
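The three steps above can be sketched as an index generator. This is a minimal sketch under simple assumptions: no shuffling, contiguous folds, and a hypothetical helper name `k_fold_splits`. The test fold here moves from first to last, whereas the diagram on the slide moves it from last to first; the coverage is the same.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for train, test in k_fold_splits(8, 4):
    print("train:", train, "test:", test)
```

Every sample lands in the test fold exactly once, so the averaged test score reflects performance on data the classifier did not see during training.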
LET’S CLASSIFY
[Figure axes: “Buttock Circumference” [mm]; Weight [10^-1 kg]]
• Classifier generalizes
• Note some misclassifications
• Let’s assume we want to detect males (blue)
§ I.e. “blue” is our positive class
LET’S CLASSIFY
[Figure axes: “Buttock Circumference” [mm]; Weight [10^-1 kg]]
• Get more “blue” right (true positives)
• Get more “red” wrong (false positives)
RECEIVER OPERATING CHARACTERISTICS CURVE
[Figure axes: False Positive Rate; True Positive Rate]
Detect more by accepting more false positives
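The trade-off behind the ROC curve can be sketched by sweeping a decision threshold over classifier scores and recording the false positive rate and true positive rate at each step. The scores and labels below are invented for illustration, and `roc_points` is a hypothetical helper name. It assumes at least one positive and one negative sample.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by lowering the threshold step by step."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is flagged
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical classifier scores; label 1 is the positive class.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]

for fpr, tpr in roc_points(scores, labels):
    print(f"FPR={fpr:.2f} TPR={tpr:.2f}")
```

Lowering the threshold never decreases either rate, which is the "detect more by accepting more false positives" behavior the slide describes.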
THREE DIMENSIONS
MORE DIMENSIONS
[Figure axes: Decision Value; Density]
• Linear model in ~160 dimensions
• Linearly separable
Source: http://playground.tensorflow.org/
TREES AND TREE ENSEMBLES
SPARSE FEATURES
Area codes:
400 401 402 403 404 405 406 407 408 409 410 411 412 413 414
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
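The table above is a one-hot encoding: each area code becomes its own dimension, and each row has a single 1. A minimal sketch that reproduces the six rows follows; the helper name `one_hot` is an assumption, and the observed values are read off the positions of the 1s in the table.

```python
def one_hot(value, categories):
    """Encode value as a 0/1 vector with a single 1 at its category position."""
    return [1 if value == category else 0 for category in categories]


area_codes = list(range(400, 415))          # the 15 columns of the table
observed = [403, 404, 401, 411, 407, 404]   # one value per table row

rows = [one_hot(value, area_codes) for value in observed]
for row in rows:
    print(" ".join(str(bit) for bit in row))
```

With thousands of possible categories the vectors become very long and almost entirely zero, which is why such features are called sparse.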
N-GRAMS
43 72 6F 77 64 53 74 72 69 6B 65
43726F 726F77 6F7764 776453 645374 537472 747269 72696B 696B65
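The byte sequence on this slide is the ASCII string "CrowdStrike", and the groups below it are its overlapping 3-grams. A short sketch of the sliding window, with `byte_ngrams` as a hypothetical helper name:

```python
def byte_ngrams(data, n=3):
    """Return all overlapping n-byte windows of data as uppercase hex strings."""
    return [data[i:i + n].hex().upper() for i in range(len(data) - n + 1)]


grams = byte_ngrams(b"CrowdStrike")
print(grams)
```

Each distinct n-gram can then serve as one sparse feature dimension, for instance as a presence flag or a count.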
MISSION ACCOMPLISHED: WE JUST ADD MORE DIMENSIONS… RIGHT?
CURSE OF DIMENSIONALITY
• REDUCED predictive performance
• INCREASED training time
• SLOWER classification
• LARGER memory footprint
Source: https://commons.wikimedia.org/w/index.php?curid=2257082
DIMENSIONALITY AND SPARSENESS
[Figure axes: Height (mm); Weight [10^-1 kg]]
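One way to see why data becomes sparse as dimensions are added is to count grid cells. If every feature is split into a fixed number of bins, the number of cells grows exponentially with the number of features, while the number of samples stays the same. The sketch below computes an upper bound on the fraction of cells that can contain any sample at all. The figures of 4000 samples and 10 bins per feature are assumptions chosen for illustration, and `max_occupied_fraction` is a hypothetical helper.

```python
def max_occupied_fraction(n_points, bins_per_dim, n_dims):
    """Upper bound on the share of grid cells that can hold at least one point."""
    cells = bins_per_dim ** n_dims
    return min(1.0, n_points / cells)


for n_dims in range(1, 7):
    fraction = max_occupied_fraction(4000, 10, n_dims)
    print(f"{n_dims} dimensions: at most {fraction:.4%} of cells occupied")
```

Under these assumptions, six dimensions already leave more than 99% of the cells empty, so most of the feature space has no training examples nearby.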
MANAGING DIMENSIONALITY
• FEATURE ELIMINATION
– Feature ranking
– Stop words
• FEATURE REDUCTION
– Principal Component Analysis
– Autoencoders
– Points on lower-dimensional manifold
– Stemming
• ENSEMBLE METHODS
– Classifier of classifiers, e.g. stacking
– Bagging and subspace sampling, e.g. Random Forests
• And much, much more…
SECURITY APPLICATIONS
FILE ANALYSIS
AKA Static Analysis
• THE GOOD
– Relatively fast
– Scalable
– No need to detonate
– Platform independent, can be done at gateway
– Can support file similarity analysis
• THE BAD
– Limited insight due to narrow view
– Different file types require different techniques
– Different subtypes need special consideration
  – Packed files
  – .NET
  – Installers
  – EXEs vs DLLs
– Obfuscations (yet good if detectable)
– Ineffective against exploitation and malware-less attacks
– Asymmetry: a fraction of a second to decide for the defender, months to craft for the attacker
EXAMPLE FEATURES
• 32/64 BIT EXECUTABLE
• GUI SUBSYSTEM
• COMMAND LINE SUBSYSTEM
• FILE SIZE
• TIMESTAMP
• DEBUG INFORMATION PRESENT
• PACKER TYPE
• FILE ENTROPY
• NUMBER OF SECTIONS
• NUMBER WRITABLE
• NUMBER READABLE
• NUMBER EXECUTABLE
• DISTRIBUTION OF SECTION ENTROPY
• IMPORTED DLL NAMES
• IMPORTED FUNCTION NAMES
• COMPILER ARTIFACTS
• LINKER ARTIFACTS
• RESOURCE DATA
• EMBEDDED PROTOCOL STRINGS
• EMBEDDED IPS/DOMAINS
• EMBEDDED PATHS
• EMBEDDED PRODUCT METADATA
• DIGITAL SIGNATURE
• ICON CONTENT
• …
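As an illustration of one feature from this list, file entropy is commonly computed as the Shannon entropy of the byte histogram, in bits per byte. The sketch below is one plausible way to compute it and is not necessarily how the speaker's system does so; `shannon_entropy` is a hypothetical helper name.

```python
import math
from collections import Counter


def shannon_entropy(data):
    """Return the Shannon entropy of a byte string in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # Sum p * log2(1/p) over every byte value that occurs.
    return sum((c / total) * math.log2(total / c) for c in counts.values())


print(shannon_entropy(b"\x00" * 16))       # a constant run carries no information
print(shannon_entropy(bytes(range(256))))  # every byte value equally likely
```

Values close to 8 bits per byte are typical of compressed or encrypted content, which is one reason entropy tends to be informative for packed files.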
COMBINING FEATURES
• Projection to show clusters
• For illustration, not the space in which we classify
EXECUTION ANALYSIS
AKA Dynamic Analysis
• THE GOOD
– Captures actual behavior of file
– Obfuscating behavior is hard
– Effective against exploitation
– Effective against malware-less attacks
– Not dependent on awareness of specific file types
• THE BAD
– File needs to be executed
– Takes additional time to observe execution
– Execution depends on environment (e.g. sandbox vs real world)
EXAMPLE: GLOBAL BEHAVIOR
§ Behavior across many executions of a file
§ Conducted on event data centrally located in the cloud
Krasser, S., Meyer, B., & Crenshaw, P. (2015). Valkyrie: Behavioral Malware Detection using Global Kernel-level Telemetry Data. In Proceedings of the 2015 IEEE International Workshop on Machine Learning for Signal Processing.
ML VS OTHER TECHNIQUES
§ ML output is probabilistic
§ Use other techniques where appropriate
§ Most ML-based engines use standard hashes or fuzzy hashes on top of a model
§ Example: credentials theft IoA
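The point about hashes sitting on top of a model can be sketched as a two-stage decision: exact hash lists give a deterministic answer for known files, and only unknown files fall through to the probabilistic model. Everything in this sketch is hypothetical, including the function name `classify`, the 0.5 threshold, and the toy scoring function standing in for a trained model.

```python
import hashlib


def classify(file_bytes, known_bad, known_good, model_score, threshold=0.5):
    """Return (verdict, source): hash lists are checked before the model is asked."""
    digest = hashlib.sha256(file_bytes).hexdigest()
    if digest in known_bad:
        return "malicious", "hash"
    if digest in known_good:
        return "clean", "hash"
    # Unknown file: fall back to the probabilistic model output.
    score = model_score(file_bytes)
    return ("malicious" if score >= threshold else "clean"), "model"


def toy_score(data):
    """Stand-in for a trained model; returns a made-up probability."""
    return 0.9 if b"evil" in data else 0.1


known_good = {hashlib.sha256(b"trusted installer").hexdigest()}
known_bad = {hashlib.sha256(b"known dropper").hexdigest()}

print(classify(b"known dropper", known_bad, known_good, toy_score))
print(classify(b"trusted installer", known_bad, known_good, toy_score))
print(classify(b"new evil sample", known_bad, known_good, toy_score))
```

A fuzzy hash would replace the exact set lookup with a similarity comparison; that variant is not shown here.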
EVALUATING ML SOLUTIONS
PRELIMINARIES
§ ML is not a feature, it is an implementation detail
§ Every solution must make trade-offs of conflicting objectives
§ FP vs TP
§ Speed vs accuracy
§ Memory footprint vs accuracy
§ Expressiveness vs explainability
§ Benchmarks under different assumptions are very hard to compare, even internally
§ Marchitecture
§ Looking at the right data: 60% of intrusions do not involve malware
SCOPE: SCALE
How much data is there to train on?
§ Volume of data generated by sources used
§ Aperture: footprint of deployment
§ Data collection
§ Point of analysis (endpoint, on-prem, cloud)
SCOPE: BREADTH
How many data sources are used?
§ Varied sources and techniques
§ Static analysis
§ Behavioral analysis
§ Proliferation
§ Indicators from other techniques
§ Access to historical data
§ Baseline
§ Process lineage
§ “Number of characteristics” is not a useful metric
DETECTION RATE
§ Detection rate w/o false positive rate is meaningless
§ Considering the base rate is important
  § System
    § 100k clean files, 1 malware file
    § 99% TPR at 0.1% FPR → 100 FPs, 1 TP
  § Downloads
    § 1k clean files, 1 malware file
    § 99% TPR at 0.1% FPR → 1 FP, 1 TP
§ Sourcing of test files skews results
§ Number of samples used to measure (often too small)
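The two base-rate scenarios above can be checked with a few lines of arithmetic. The sketch computes the expected number of false and true positives, plus the resulting precision, which the slide does not state but which follows directly from its numbers. `expected_outcomes` is a hypothetical helper name.

```python
def expected_outcomes(n_clean, n_malware, tpr, fpr):
    """Return expected (false_positives, true_positives, precision)."""
    false_positives = n_clean * fpr
    true_positives = n_malware * tpr
    precision = true_positives / (true_positives + false_positives)
    return false_positives, true_positives, precision


scenarios = {
    "System": (100_000, 1),   # 100k clean files, 1 malware file
    "Downloads": (1_000, 1),  # 1k clean files, 1 malware file
}

for name, (n_clean, n_malware) in scenarios.items():
    fp, tp, precision = expected_outcomes(n_clean, n_malware, tpr=0.99, fpr=0.001)
    print(f"{name}: {fp:.0f} FPs, {tp:.2f} TPs, precision {precision:.1%}")
```

With the same classifier, roughly 1 in 100 alerts is real in the first scenario and roughly 1 in 2 in the second, which is why a detection rate quoted without the false positive rate and the base rate says little.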
[Figure axes: False Positive Rate; True Positive Rate]
APTS & 99% OF MALWARE DETECTED…
APTS (CONT.)
§ Combine techniques to offset tradeoffs
§ Static and behavioral
§ ML and non-ML
§ Lean local techniques and heavy-weight cloud techniques
§ Avoid silent failure: what happens when the adversary made it onto the system?
§ Avoid brittle techniques: does the solution depend on the attacker not having access to detection results?
KEY POINTS
• Machine Learning is an important part of the security tool chest
• Hidden untapped structure in your data
• Various trade-offs, most importantly between true and false positives
• Dimensionality is good…until it’s not
• Not all dimensions are created equal
• Comprehensive coverage by combining techniques
A Sober Look at Machine Learning

  • 1. 2016 CROWDSTRIKE, INC. ALL RIGHTS RESERVED. A SOBER LOOK AT MACHINE LEARNING DR. SVEN KRASSER CHIEF SCIENTIST @SVENKRASSER
  • 2. 2016 CROWDSTRIKE, INC. ALL RIGHTS RESERVED. Distinguishing Science… Source: CERN, http://home.cern/sites/home.web.cern.ch/files/image/experiment/2013/01/cms_0.jpeg
  • 3. …from FictionSource: “Chain Reaction,” 20th Century Fox
  • 4. MACHINE LEARNING 101 2016 CROWDSTRIKE, INC. ALL RIGHTS RESERVED.
  • 5. EXAMPLES OF MACHINE LEARNING 2016 CROWDSTRIKE, INC. ALL RIGHTS RESERVED. SPAM FILTERING MOVIE RECOMMENDATIONS SIRI (iPHONE)
  • 6. TODAY’S FOCUS: SUPERVISED LEARNING 2016 CROWDSTRIKE, INC. ALL RIGHTS RESERVED.
  • 7. TODAY’S FOCUS: GEOMETRIC MODELS 2016 CROWDSTRIKE, INC. ALL RIGHTS RESERVED.
  • 8. EVERYTHING YOU WILL SEE TODAY IS REAL WORLD DATA 2016 CROWDSTRIKE, INC. ALL RIGHTS RESERVED.
  • 9. Some Data to Get Started: 1988 ANTHROPOMETRIC SURVEY OF ARMY PERSONNEL Source: http://mreed.umtri.umich.edu/mreed/downloads.html#anthro 2016 CROWDSTRIKE, INC. ALL RIGHTS RESERVED.
  • 10. • Over 4000 soldiers surveyed • Over 100 measurements • Reported by gender Test subjects are in better shape than the rest of us... Data Selection Bias 2016 CROWDSTRIKE, INC. ALL RIGHTS RESERVED.
  • 11. FIRST LOOK Height [mm] Density • Difference in distribution • Significant overlap 2016 CROWDSTRIKE, INC. ALL RIGHTS RESERVED.
  • 12. SECOND DIMENSION Height [mm] Weight[10-1 kg] • Correlation • Overlap 2016 CROWDSTRIKE, INC. ALL RIGHTS RESERVED.
  • 13. FEATURE SELECTION “Buttock Circumference” [mm] Weight[10-1 kg] • Correlation • Gender-specific slope • Reduced overlap • Selection of features matters • How to make a prediction? 2016 CROWDSTRIKE, INC. ALL RIGHTS RESERVED.
  • 14. K-NEAREST NEIGHBOR “Buttock Circumference” [mm] Weight[10-1 kg] m f 2016 CROWDSTRIKE, INC. ALL RIGHTS RESERVED.
  • 15. SUPPORT VECTOR MACHINE “Buttock Circumference” [mm] Weight[10-1 kg] 2016 CROWDSTRIKE, INC. ALL RIGHTS RESERVED.
  • 16. SUPPORT VECTOR MACHINE 2016 CrowdStrike, Inc. All rights reserved. “Buttock Circumference” [mm] Weight[10-1 kg] • Overfitting • Classifier does not generalize • Let’s take a closer look…
  • 17. CROSS VALIDATION 2016 CROWDSTRIKE, INC. ALL RIGHTS RESERVED. TRAIN TRAIN TRAIN TEST TRAIN TRAIN TEST TRAIN TRAIN TEST TRAIN TRAIN TEST TRAIN TRAIN TRAIN • Divide data into k folds • Train on k-1 folds, test on the remaining one • Repeat k times for all folds
  • 18. LET’S CLASSIFY “Buttock Circumference” [mm] Weight[10-1 kg] • Classifier generalizes • Note some misclassifications • Let’s assume we want to detect males (blue) § I.e. “blue” is our positive class 2016 CROWDSTRIKE, INC. ALL RIGHTS RESERVED.
  • 19. LET’S CLASSIFY “Buttock Circumference” [mm] Weight[10-1 kg] 2016 CROWDSTRIKE, INC. ALL RIGHTS RESERVED.
  • 20. LET’S CLASSIFY “Buttock Circumference” [mm] Weight[10-1 kg] 2016 CROWDSTRIKE, INC. ALL RIGHTS RESERVED.
  • 21. LET’S CLASSIFY “Buttock Circumference” [mm] Weight[10-1 kg] 2016 CROWDSTRIKE, INC. ALL RIGHTS RESERVED.
  • 22. LET’S CLASSIFY “Buttock Circumference” [mm] Weight[10-1 kg] 2016 CROWDSTRIKE, INC. ALL RIGHTS RESERVED.
  • 23. LET’S CLASSIFY “Buttock Circumference” [mm] Weight  [10-­1 kg] • Get more “blue” right (true positives) • Get more “red” wrong (false positives) 2016 CROWDSTRIKE, INC. ALL RIGHTS RESERVED.
  • 24. RECEIVER OPERATING CHARACTERISTICS CURVE False Positive Rate TruePositiveRate Detect  more  by  accepting  more  false  positives 2016 CROWDSTRIKE, INC. ALL RIGHTS RESERVED.
  • 25. THREE DIMENSIONS 2016 CROWDSTRIKE, INC. ALL RIGHTS RESERVED.
  • 26. MORE DIMENSIONS Decision Value Density • Linear model in ~160 dimensions • Linearly separable 2016 CROWDSTRIKE, INC. ALL RIGHTS RESERVED.
  • 28. 2016 CROWDSTRIKE, INC. ALL RIGHTS RESERVED. TREES AND TREE ENSEMBLES
  • 29. SPARSE FEATURES 2016 CROWDSTRIKE, INC. ALL RIGHTS RESERVED. 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 area codes 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
  • 30. N-GRAMS 43 72 6F 77 64 53 74 72 69 6B 65 43726F 776453 747269 726F77 645374 72696B 6F7764 537472 696B65 2016 CROWDSTRIKE, INC. ALL RIGHTS RESERVED.
  • 31. MISSION ACCOMPLISHED: WE JUST ADD MORE DIMENSIONS… RIGHT? 2016 CROWDSTRIKE, INC. ALL RIGHTS RESERVED.
  • 32. CURSE OF DIMENSIONALITY 2016 CROWDSTRIKE, INC. ALL RIGHTS RESERVED. REDUCED predictive performance INCREASED training time SLOWER classification LARGER memory footprint
  • 35.
  • 36. DIMENSIONALITY AND SPARSENESS (axes: Height [mm], Weight [10⁻¹ kg])
  • 37. DIMENSIONALITY AND SPARSENESS (axes: Height [mm], Weight [10⁻¹ kg])
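One way to see why more dimensions mean sparser data is to count grid cells. The sketch below is a back-of-the-envelope bound, not the plot from the slides: it assumes roughly 4000 samples (the survey size mentioned earlier in the deck) and a hypothetical grid of 10 bins per feature, and computes the largest fraction of cells that could contain any sample at all.

```python
def max_occupied_fraction(n_points, bins_per_axis, dimensions):
    """Upper bound on the share of grid cells that can hold at least one point."""
    cells = bins_per_axis ** dimensions
    return min(1.0, n_points / cells)


# Assumed: about 4000 samples, 10 bins per feature (both illustrative).
for d in range(1, 7):
    print(d, max_occupied_fraction(4000, 10, d))
```

With the same number of samples, each added feature multiplies the number of cells by ten, so the data covers a rapidly shrinking share of the space.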
  • 38. MANAGING DIMENSIONALITY • FEATURE ELIMINATION – Feature ranking – Stop words • FEATURE REDUCTION – Principal Component Analysis – Autoencoders – Points on lower-dimensional manifold – Stemming • ENSEMBLE METHODS – Classifier of classifiers, e.g. stacking – Bagging and subspace sampling, e.g. Random Forests • And much, much more…
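As one concrete instance of feature elimination by ranking, the sketch below scores each feature by how far apart the two class means are relative to the feature's overall spread, then keeps the top-ranked ones. The scoring rule, the function name `rank_features`, and the tiny dataset are all assumptions made for illustration; the slides do not say which ranking method is used.

```python
from statistics import mean, pstdev


def rank_features(rows, labels):
    """Return feature indices ordered by a simple class-separation score, best first."""
    scored = []
    for j in range(len(rows[0])):
        column = [r[j] for r in rows]
        pos = [v for v, y in zip(column, labels) if y == 1]
        neg = [v for v, y in zip(column, labels) if y == 0]
        spread = pstdev(column)
        # Constant features carry no information and get a score of zero.
        score = abs(mean(pos) - mean(neg)) / spread if spread > 0 else 0.0
        scored.append((score, j))
    return [j for _, j in sorted(scored, reverse=True)]


# Hypothetical data: feature 0 separates the classes, 1 is constant, 2 is noise.
rows = [[1.0, 5.0, 0.0], [1.1, 5.0, 1.0], [3.0, 5.0, 0.0], [3.2, 5.0, 1.0]]
labels = [0, 0, 1, 1]

ranked = rank_features(rows, labels)
kept = ranked[:1]  # eliminate everything but the best-ranked feature
print(ranked, kept)
```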
  • 39. SECURITY APPLICATIONS
  • 40. FILE ANALYSIS AKA Static Analysis • THE GOOD – Relatively fast – Scalable – No need to detonate – Platform independent, can be done at gateway – Can support file similarity analysis • THE BAD – Limited insight due to narrow view – Different file types require different techniques – Different subtypes need special consideration (packed files, .NET, installers, EXEs vs. DLLs) – Obfuscations (yet good if detectable) – Ineffective against exploitation and malware-less attacks – Asymmetry: a fraction of a second to decide for the defender, months to craft for the attacker
  • 41. EXAMPLE FEATURES • 32/64-BIT EXECUTABLE • GUI SUBSYSTEM • COMMAND LINE SUBSYSTEM • FILE SIZE • TIMESTAMP • DEBUG INFORMATION PRESENT • PACKER TYPE • FILE ENTROPY • NUMBER OF SECTIONS • NUMBER WRITABLE • NUMBER READABLE • NUMBER EXECUTABLE • DISTRIBUTION OF SECTION ENTROPY • IMPORTED DLL NAMES • IMPORTED FUNCTION NAMES • COMPILER ARTIFACTS • LINKER ARTIFACTS • RESOURCE DATA • EMBEDDED PROTOCOL STRINGS • EMBEDDED IPS/DOMAINS • EMBEDDED PATHS • EMBEDDED PRODUCT METADATA • DIGITAL SIGNATURE • ICON CONTENT • …
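Of the features listed, file entropy is easy to illustrate. The sketch below assumes the usual definition, Shannon entropy over the byte-value histogram in bits per byte; the slides do not specify the exact formula used, and the function name `byte_entropy` is hypothetical.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte histogram, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


uniform = byte_entropy(bytes(range(256)))   # every byte value once
constant = byte_entropy(b"\x00" * 64)       # a single repeated byte
text = byte_entropy(b"CrowdStrike")
print(uniform, constant, text)
```

Values close to 8 bits per byte are typical of compressed or encrypted content, which is one reason entropy is often considered alongside features such as packer type.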
  • 42. COMBINING FEATURES • Projection to show clusters • For illustration only; not the space in which we classify
  • 43. EXECUTION ANALYSIS AKA Dynamic Analysis • THE GOOD – Captures actual behavior of file – Obfuscating behavior is hard – Effective against exploitation – Effective against malware-less attacks – Not dependent on awareness of specific file types • THE BAD – File needs to be executed – Takes additional time to observe execution – Execution depends on environment (e.g. sandbox vs real world)
  • 44. EXAMPLE: GLOBAL BEHAVIOR • Behavior across many executions of a file • Conducted on event data centrally located in the cloud. Krasser, S., Meyer, B., & Crenshaw, P. (2015). Valkyrie: Behavioral Malware Detection using Global Kernel-level Telemetry Data. In Proceedings of the 2015 IEEE International Workshop on Machine Learning for Signal Processing.
  • 45. ML VS OTHER TECHNIQUES • ML output is probabilistic • Use other techniques where appropriate • Most ML-based engines use standard hashes or fuzzy hashes on top of a model • Example: credentials theft IoA
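The idea of layering exact hash lookups on top of a probabilistic model could look roughly like the sketch below. Everything here is an assumption for illustration: the function `classify`, the hash lists, the 0.5 threshold, and the stand-in scoring functions are hypothetical, and the choice to consult the hash lists before the model is one possible ordering, not something the slide specifies.

```python
import hashlib


def classify(data, known_bad, known_good, model_score, threshold=0.5):
    """Exact hash matches decide first; otherwise fall back to the model score."""
    digest = hashlib.sha256(data).hexdigest()
    if digest in known_bad:
        return "malicious (hash match)"
    if digest in known_good:
        return "clean (hash match)"
    return "malicious (model)" if model_score(data) >= threshold else "clean (model)"


# Hypothetical lists and stand-in models.
known_bad = {hashlib.sha256(b"evil").hexdigest()}
known_good = {hashlib.sha256(b"notepad").hexdigest()}

print(classify(b"evil", known_bad, known_good, lambda d: 0.0))
print(classify(b"notepad", known_bad, known_good, lambda d: 0.9))
print(classify(b"unknown", known_bad, known_good, lambda d: 0.9))
```

A hash match gives a definite answer for a known file, while the model covers files that have never been seen before.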
  • 46. EVALUATING ML SOLUTIONS
  • 47. PRELIMINARIES • ML is not a feature, it is an implementation detail • Every solution must make trade-offs between conflicting objectives – FP vs TP – Speed vs accuracy – Memory footprint vs accuracy – Expressiveness vs explainability • Benchmarks under different assumptions are very hard to compare, even internally • Marchitecture • Looking at the right data: 60% of intrusions do not involve malware
  • 48. SCOPE: SCALE How much data is there to train on? • Volume of data generated by sources used • Aperture: footprint of deployment • Data collection • Point of analysis (endpoint, on-prem, cloud)
  • 49. SCOPE: BREADTH How many data sources are used? • Varied sources and techniques • Static analysis • Behavioral analysis • Proliferation • Indicators from other techniques • Access to historical data • Baseline • Process lineage • “Number of characteristics” is not a useful metric
  • 50. DETECTION RATE • Detection rate w/o false positive rate is meaningless • Considering the base rate is important – System: 100k clean files, 1 malware file; 99% TPR at 0.1% FPR → 100 FPs, 1 TP – Downloads: 1k clean files, 1 malware file; 99% TPR at 0.1% FPR → 1 FP, 1 TP • Sourcing of test files skews results • Number of samples used to measure (often too small) (plot axes: False Positive Rate, True Positive Rate)
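The base-rate arithmetic on this slide can be reproduced directly. The sketch below uses the slide's own numbers; the function name `expected_alerts` and the added precision figure (the share of alerts that are real) are illustrative. Note that the expected number of true positives is 0.99, which the slide rounds to 1.

```python
def expected_alerts(clean, malicious, tpr, fpr):
    """Expected false positives, true positives, and precision for a file population."""
    fp = clean * fpr
    tp = malicious * tpr
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    return fp, tp, precision


# Scenarios from the slide: 99% TPR at 0.1% FPR.
system = expected_alerts(clean=100_000, malicious=1, tpr=0.99, fpr=0.001)
downloads = expected_alerts(clean=1_000, malicious=1, tpr=0.99, fpr=0.001)

for name, (fp, tp, precision) in (("System", system), ("Downloads", downloads)):
    print(f"{name}: {fp:.0f} FPs, {tp:.2f} TPs, precision {precision:.1%}")
```

The same classifier yields about one real detection per hundred alerts on a full system scan, but roughly one in two on a downloads folder, purely because the share of clean files differs.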
  • 51. APTS & 99% OF MALWARE DETECTED…
  • 52. APTS (CONT.) • Combine techniques to offset trade-offs – Static and behavioral – ML and non-ML – Lean local techniques and heavy-weight cloud techniques • Avoid silent failure: what happens when the adversary has made it onto the system? • Avoid brittle techniques: does the solution depend on the attacker not having access to detection results?
  • 53. KEY POINTS • Machine Learning is an important part of the security tool chest • Hidden untapped structure in your data • Various trade-offs, most importantly between true and false positives • Dimensionality is good…until it’s not • Not all dimensions are created equal • Comprehensive coverage by combining techniques