USING LANGUAGE MODELING
TO VERIFY USER IDENTITIES
Chris Poirel
Data Scientist
BlackHat USA | August 2018
Copyright © 2018 Forcepoint.
Eduardo Luiggi
Data Scientist
OUTLINE
Problem statement
Overview of Language Modeling
Data sets and data preparation
Case studies
USER AND ENTITY BEHAVIOR ANALYTICS
UEBA focuses on identifying
entities and assessing their risk to
an organization
Effort to recognize/prevent compromised
accounts, malicious activity, IP theft, etc.
Routinely investigating novel techniques
to extract additional analytic value from
existing data sources
Data challenges
Lack of "gold standard" data sets to
support traditional supervised ML
Increasing volume collected about the
entity – often requires pre-filtering or
"sessionizing" event sets
Requires highly accurate entity resolution
(e.g., email, IP, MAC, username)
Need to integrate and understand
structured vs. unstructured data
DEFINING MACHINE LEARNING
"Any computer program that improves performance at some task through experience."
"A computer program is said to learn from experience E with respect to some class of
tasks T and performance measure P, if its performance at tasks in T, as measured by P,
improves with experience E." - Tom Mitchell
In order to build an effective ML-based solution, we should clearly define…
The task we're trying to improve
Entity identification
A method for measuring the performance of the solution
Quantitatively by defining true positives / negatives and assessing precision
The experiences we're using to improve performance
Unstructured content generated by the entity over time
IMPROVE ENTITY IDENTIFICATION
How can we improve entity
detection from unstructured,
human-generated content?
Reliably identify an individual from the
content of their email
Predict if a user's account has been
compromised based on
their language
Extend the same approaches
to "less structured" data like command
line activity
Can we make use of advances in Language
Modeling to address these concerns?
A variety of biometrics have been used for verifying
user identities, including fingerprinting, facial
recognition, and keystroke analysis
NLP research has found that people's use of
language can also be uniquely identifying
Language modeling is a technique for measuring
how likely words and phrases are, given some
observations about previous language use
LM is used heavily in speaker and author
identification, as well as speech recognition and
machine translation
UNSUPERVISED TO THE RESCUE?
Techniques for assigning probabilities to sentences or phrases
Captures some level of syntax and semantics
P("the dog runs") > P("the dogs runs")
P("the dog runs") > P("the tables run")
Applications
Speech recognition:
P("Sweet dreams are made of cheese")
< P("Sweet dreams are made of this")
Language identification
P("Donde está la biblioteca" | Spanish)
> P("Donde está la biblioteca" | English)
Context-sensitive spell checking
P("Football is their favorite sport")
> P("Football is there favorite sport")
Machine translation
Autocomplete
N-GRAM LANGUAGE MODELS
N-gram models estimate the probability of a word
given the words that come before it
P("The boy walked to the circus") = P(The) *
P(boy | The) *
P(walked | The boy) *
P(to | boy walked) * …
Word n-grams are contiguous collections of n words:
Unigram   Bigram        Trigram           ...
The       The boy       The boy walked
boy       boy walked    boy walked to
walked    walked to     walked to the
to        to the        to the circus.
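The n-gram table and the chain-rule product above can be sketched in a few lines of Python. This is a minimal illustration, not the talk's implementation: the function names, the `<s>` padding tokens, and the two-sentence toy corpus are all assumptions made for the example. It estimates trigram probabilities by simple counting (maximum likelihood, no smoothing).

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def trigram_mle(sentences):
    """Estimate P(w | u v) by counting trigrams in tokenized sentences."""
    tri, ctx = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>", "<s>"] + tokens  # assumed start-of-sentence padding
        for u, v, w in ngrams(padded, 3):
            tri[(u, v, w)] += 1
            ctx[(u, v)] += 1

    def prob(w, u, v):
        # Unseen contexts get probability 0 (no smoothing in this sketch).
        return tri[(u, v, w)] / ctx[(u, v)] if ctx[(u, v)] else 0.0

    return prob

def sentence_prob(prob, tokens):
    """Multiply the conditional probabilities of each word given its history."""
    padded = ["<s>", "<s>"] + tokens
    p = 1.0
    for u, v, w in ngrams(padded, 3):
        p *= prob(w, u, v)
    return p

tokens = "The boy walked to the circus".split()
print(ngrams(tokens, 1))
print(ngrams(tokens, 2))
print(ngrams(tokens, 3))

# Hypothetical toy corpus, for illustration only.
prob = trigram_mle(["the boy walked".split(), "the boy ran".split()])
print(sentence_prob(prob, "the boy walked".split()))
print(sentence_prob(prob, "the boy walked to".split()))
```

The last call returns zero because the context "boy walked" never appears in the toy corpus, which is the missing-data problem that smoothing addresses.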
CHALLENGES BUILDING ML MODELS
Parameter tuning
Must determine an appropriate value for n,
i.e., how much history to include
Small values of n are easy to estimate and
tend to capture how likely terms are
in a general context
Large values are computationally more
burdensome and risk overfitting, yielding
a model that fails to generalize well
Training corpus
Depends heavily on the application and
the problem you're trying to solve
Learning a model for the industry (tech),
organization (Enron vs. Avocado), or
each individual
Each may require a different amount of
training history or different corpora to
achieve acceptable accuracy
CHALLENGES BUILDING ML MODELS
Prepare for the unknown
Must account for missing/incomplete data in the training corpus.
P("The boy walked to the circus") = P(walked | the boy) * … * P(circus | to the) = 0
Smoothing is a common technique for models whose training data can never cover the full universe of
observations, so we must assume any observation can occur
Suppose we trained a unigram model on "The boy walked the dog"
Laplace Smoothing helps solve this problem – there are numerous other smoothing techniques
Term     Freq      Prob          Smoothed Freq   Smoothed Prob
the      2         2/|C| = 0.4   2+1             3 / (|C| + |V|)
boy      1         0.2           1+1             2 / (|C| + |V|)
walked   1         0.2           1+1             2 / (|C| + |V|)
dog      1         0.2           1+1             2 / (|C| + |V|)
circus   0         0             0+1             1 / (|C| + |V|)
TOTAL    |C| = 5   1.0           |C| + |V|       1.0
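The add-one arithmetic in the table can be checked with a short Python sketch. It assumes lowercased tokens and a vocabulary of the five terms listed in the table (so |V| = 5, which is what makes the smoothed counts total |C| + |V| = 10); the function name `laplace_unigram` is made up for this example.

```python
from collections import Counter
from fractions import Fraction

def laplace_unigram(tokens, vocabulary):
    """Add-one (Laplace) smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(tokens)
    corpus_size = len(tokens)     # |C|
    vocab_size = len(vocabulary)  # |V|
    return {
        term: Fraction(counts[term] + 1, corpus_size + vocab_size)
        for term in vocabulary
    }

corpus = "the boy walked the dog".split()
# Assumed vocabulary: the four observed terms plus the unseen term "circus".
vocab = ["the", "boy", "walked", "dog", "circus"]

probs = laplace_unigram(corpus, vocab)
for term in vocab:
    print(term, probs[term])
print("TOTAL", sum(probs.values()))
```

Every term, including the unseen "circus", now has a nonzero probability, and the smoothed probabilities still sum to one.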
TEST DATASETS
Enron Corporation Corpus
~500k emails from
~150 internal employees
Heavily used in NLP and social-
networking research with several sources
of manual annotation
Significant amount of sensitive material
was removed before releasing to the
public, along with several redactions for
privacy concerns
Freely and publicly available
Avocado Collection
~1M emails from 280 employees of
a defunct IT company fictitiously
named Avocado
Includes attachments, contact lists,
calendars from those employees'
personal folders
All data has been de-identified
Requires license
DATA CLEANUP: PRE-PROCESSING ANALYTICS
Analytics of
unstructured data can
be particularly
difficult due to corpus
inconsistencies,
missing data, and
biased view (internal
sender/recipient only)
This process is
usually messy and
requires multiple
iterations
Need to focus on
analyzing relevant
text introduced by
the sender
DATA CLEANUP: PRE-PROCESSING ANALYTICS
A few things we've
done to home in on
personalized text:
Address various format issues – Strip extraneous markup (e.g., HTML, XML), header
info, and terse text blocks with no discernible human content
Noise filtering – Heuristics to identify bulk senders and exclude those mailboxes
from further analysis
Disclaimer detection – N-gram and bag of words models to classify paragraphs
as disclaimer text and exclude from personalized language models
Thread email
Remove signature blocks
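A rough Python sketch of a few of the cleanup steps listed above (markup stripping, dropping quoted thread text, cutting signature blocks). The heuristics here are assumptions chosen for illustration, not the rules used in the talk: the `-----Original Message-----` reply marker, the `>` quote prefix, and the `--` signature delimiter are common mail conventions, and real corpora need many more cases.

```python
import re
from html import unescape

TAG_RE = re.compile(r"<[^>]+>")   # crude HTML/XML tag matcher
QUOTE_RE = re.compile(r"^\s*>")   # lines quoted from an earlier message

# Assumed marker that introduces the quoted remainder of a thread.
REPLY_MARKERS = ("-----Original Message-----",)

def clean_body(raw):
    """Keep only text that was plausibly typed by the sender."""
    text = unescape(TAG_RE.sub(" ", raw))
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith(REPLY_MARKERS):
            break       # everything below belongs to an earlier email
        if stripped == "--":
            break       # conventional signature delimiter
        if not stripped or QUOTE_RE.match(line):
            continue    # skip blank and quoted lines
        kept.append(stripped)
    return "\n".join(kept)

raw = (
    "<p>Can we meet Friday?</p>\n"
    "> earlier quoted text\n"
    "\n"
    "Thanks\n"
    "--\n"
    "Jane Doe\n"
    "-----Original Message-----\n"
    "old content"
)
print(clean_body(raw))
```

Disclaimer detection and bulk-sender filtering are left out; those steps rely on trained classifiers and mailbox-level heuristics that cannot be reduced to a few patterns.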
LANGUAGE MODELING FOR USER IDENTIFICATION
Given a new block of unstructured content, can we …
Learn organizational language models to differentiate Enron vs. Avocado?
Learn personalized language models for individuals in each corpus?
Apply the same techniques to nontraditional "less structured" content?
The ability to address any of these and generalize
the techniques across industries can drastically
improve our ability to predict unusual activity.
LEARNING A LANGUAGE MODEL
Select sent emails from top-N senders in each of the datasets
12 entities/corpus
Each entity has between ~1000 and ~9000 sent emails
Not every email contributes to the model, e.g., forwarded emails are discarded
For each entity
Split their email set into training (~80%) and testing (~20%) datasets
Use KenLM Language Model Toolkit (https://kheafield.com/code/kenlm/)
to learn an n-gram language model on training dataset
Uses Kneser-Ney smoothing
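The per-entity workflow above (split, train one model per sender, score new text against every model) can be sketched without external tools. Note the substitution: the talk uses KenLM with Kneser-Ney smoothing, while this sketch swaps in a much simpler add-one-smoothed bigram model in pure Python. The class name, helper names, and the toy mailboxes are all hypothetical.

```python
import math
import random
from collections import Counter

def encode(tokens, vocab):
    """Add boundary markers and map out-of-vocabulary words to <unk>."""
    return ["<s>"] + [t if t in vocab else "<unk>" for t in tokens] + ["</s>"]

class AddOneBigramLM:
    """Bigram model with add-one smoothing (a stand-in for KenLM, not KenLM)."""

    def __init__(self, emails, vocab):
        self.vocab = vocab
        self.bigrams = Counter()
        self.contexts = Counter()
        for tokens in emails:
            seq = encode(tokens, vocab)
            for u, w in zip(seq, seq[1:]):
                self.bigrams[(u, w)] += 1
                self.contexts[u] += 1

    def log_prob(self, tokens):
        seq = encode(tokens, self.vocab)
        v = len(self.vocab) + 2  # vocabulary words plus <unk> and </s>
        return sum(
            math.log((self.bigrams[(u, w)] + 1) / (self.contexts[u] + v))
            for u, w in zip(seq, seq[1:])
        )

def split_train_test(emails, train_fraction=0.8, seed=0):
    """Shuffle one sender's emails and split them roughly 80/20."""
    shuffled = emails[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def train_models(train_sets):
    """Train one model per sender over a vocabulary shared by all senders."""
    vocab = {t for emails in train_sets.values() for msg in emails for t in msg}
    return {name: AddOneBigramLM(emails, vocab) for name, emails in train_sets.items()}

def predict_author(models, tokens):
    """Return the sender whose model assigns the text the highest log-probability."""
    return max(models, key=lambda name: models[name].log_prob(tokens))

# Hypothetical toy mailboxes; real entities have thousands of sent emails.
MAIL = {
    "alice": [s.split() for s in [
        "please review the attached agreement",
        "please sign the attached agreement",
        "the agreement is attached",
        "please review the agreement",
        "the attached agreement is signed",
    ]],
    "bob": [s.split() for s in [
        "restart the wireless server now",
        "the wireless output file is new",
        "restart the application server",
        "the application output is new",
        "restart the wireless application now",
    ]],
}

splits = {name: split_train_test(msgs) for name, msgs in MAIL.items()}
models = train_models({name: train for name, (train, _) in splits.items()})
for name, (_, test) in splits.items():
    for msg in test:
        print(name, "->", predict_author(models, msg))
```

Sharing one vocabulary across all models keeps the scores comparable, since every model then distributes probability over the same set of possible next words.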
CASE STUDY: ORGANIZATIONAL LANGUAGE MODELS
Types: 64370
Unigram tokens: 3950539
Types: 28937
Unigram tokens: 943170
Enron
Avocado
CASE STUDY: ORGANIZATIONAL LANGUAGE MODELS
Surprising?
Not really
What are the models learning?
Among top-25 terms:
Avocado        Enron
please         please
file           power
application    agreement
avocadoit      enron
wireless       attached
output         state
activityname   want
new            new
CASE STUDY: PERSONALIZED MODELS (AVOCADO)
[Four slides of figures; images not preserved in this transcript]
CASE STUDY: PERSONALIZED MODELS (ENRON)
[Two slides of figures; images not preserved in this transcript]
HOW MUCH DATA DO WE NEED?
Varied the size of the training sample
Fractions: 0.01, 0.05, 0.10, 0.25, 0.50, 0.75, 1.0
Test set size held constant
Observations
With a small training sample, we most often
predicted the model with the most tokens
Enron dataset is ~50% the size of
Avocado
Enron modeling performance is similar to Avocado at
50% of its training data
SOMETIMES MODELS ARE JUST BAD
Jeff Dasovich
Second largest training set from Enron
Most unique tokens
We are more likely to guess Richard Sanders
as the author
Common top-25 tokens include
'know', 'like', 'call', 'get', 'time', 'would', 'thanks'
Why do we fail to identify Jeff?
He liked to embed news articles in his emails
… This article showed up on Wednesday . Thought
you might be interested .
Texas Journal -- Energy traders cite gains , but some
math is missing -- Volatile prices for natural gas and
electricity are creating high-voltage counting on these
gains could be in for a jolt down the road ...
EXTENDING TO OTHER
STRUCTURED CONTENT
Demonstrated a solution that
Addresses the task of entity identification
Increases performance according to
quantitative precision assessment
Improves performance over time with
additional experience
Potential future applications
Chat or phone transcript
Command line activity
Database / SIEM queries
Questions?

One Year After WannaCry - Has Anything Changed? A Root Cause Analysis of Data...
 
Addressing Future Risks and Legal Challenges of Insider Threats
Addressing Future Risks and Legal Challenges of Insider ThreatsAddressing Future Risks and Legal Challenges of Insider Threats
Addressing Future Risks and Legal Challenges of Insider Threats
 
A Predictive “Precrime” Approach Requires a Human Focus
A Predictive “Precrime” Approach Requires a Human FocusA Predictive “Precrime” Approach Requires a Human Focus
A Predictive “Precrime” Approach Requires a Human Focus
 
Cyber Convergence, Warfare and You
Cyber Convergence, Warfare and YouCyber Convergence, Warfare and You
Cyber Convergence, Warfare and You
 
Securing the Global Mission: Enabling Effective Information Sharing (DoD MPE-IS)
Securing the Global Mission: Enabling Effective Information Sharing (DoD MPE-IS)Securing the Global Mission: Enabling Effective Information Sharing (DoD MPE-IS)
Securing the Global Mission: Enabling Effective Information Sharing (DoD MPE-IS)
 
Security Insights for Mission-Critical Networks
Security Insights for Mission-Critical NetworksSecurity Insights for Mission-Critical Networks
Security Insights for Mission-Critical Networks
 
Maintaining Visibility and Control as Workers and Apps Scatter
Maintaining Visibility and Control as Workers and Apps ScatterMaintaining Visibility and Control as Workers and Apps Scatter
Maintaining Visibility and Control as Workers and Apps Scatter
 
Embracing the Millennial Tsunami
Embracing the Millennial TsunamiEmbracing the Millennial Tsunami
Embracing the Millennial Tsunami
 
Shift the Burden
Shift the BurdenShift the Burden
Shift the Burden
 
Revolutionary, Not Evolutionary
Revolutionary, Not EvolutionaryRevolutionary, Not Evolutionary
Revolutionary, Not Evolutionary
 
Cybersecurity and the Human Psyche
Cybersecurity and the Human PsycheCybersecurity and the Human Psyche
Cybersecurity and the Human Psyche
 
The Human Point
The Human PointThe Human Point
The Human Point
 
An Inside-Out Approach to Security in Financial Services
An Inside-Out Approach to Security in Financial ServicesAn Inside-Out Approach to Security in Financial Services
An Inside-Out Approach to Security in Financial Services
 
Cloudy with a Chance of...Visibility, Accountability & Security
Cloudy with a Chance of...Visibility, Accountability & SecurityCloudy with a Chance of...Visibility, Accountability & Security
Cloudy with a Chance of...Visibility, Accountability & Security
 


Using Language Modeling to Verify User Identities

  • 1. USING LANGUAGE MODELING TO VERIFY USER IDENTITIES. Chris Poirel (Data Scientist) and Eduardo Luiggi (Data Scientist). BlackHat USA | August 2018. Copyright © 2018 Forcepoint.
  • 2. OUTLINE: Problem statement; Overview of Language Modeling; Data sets and data preparation; Case studies
  • 3. USER AND ENTITY BEHAVIOR ANALYTICS
      UEBA focuses on identifying entities and assessing their risk to an organization:
      - Effort to recognize/prevent compromised accounts, malicious activity, IP theft, etc.
      - Routinely investigating novel techniques to extract additional analytic value from existing data sources
      Data challenges:
      - Lack of "gold standard" data sets to support traditional supervised ML
      - Increasing volume of data collected about the entity, which often requires pre-filtering or "sessionizing" event sets
      - Requires highly accurate entity resolution (e.g., email, IP, MAC, username)
      - Need to integrate and understand structured vs. unstructured data
  • 4. DEFINING MACHINE LEARNING
      "Any computer program that improves performance at some task through experience."
      "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." - Tom Mitchell
      In order to build an effective ML-based solution, we should clearly define:
      - The task we're trying to improve: entity identification
      - A method for measuring the performance of the solution: quantitatively, by defining true positives / negatives and assessing precision
      - The experiences we're using to improve performance: unstructured content generated by the entity over time
  • 5. IMPROVE ENTITY IDENTIFICATION
      How can we improve entity detection from unstructured, human-generated content?
      - Reliably identify an individual from the content of their email
      - Predict if a user's account has been compromised based on their language
      - Extend the same approaches to "less structured" data like command line activity
  • 6. IMPROVE ENTITY IDENTIFICATION
      Can we make use of advances in Language Modeling to address these concerns?
      - A variety of biometrics have been used for verifying user identities, including fingerprinting, facial recognition, and keystroke analysis
      - NLP research has found that people's use of language can also be uniquely identifying
      - Language modeling (LM) is a technique for measuring how likely words and phrases are, given some observations about previous language use
      - LM is used heavily in speaker and author identification, as well as speech recognition and machine translation
  • 7. UNSUPERVISED TO THE RESCUE?
      Language models are techniques for assigning probabilities to sentences or phrases, capturing some level of syntax and semantics:
      - P("the dog runs") > P("the dogs runs")
      - P("the dog runs") > P("the tables run")
      Applications:
      - Speech recognition: P("Sweet dreams are made of cheese") < P("Sweet dreams are made of this")
      - Language identification: P("Donde está la biblioteca" | Spanish) > P("Donde está la biblioteca" | English)
      - Context-sensitive spell checking: P("Football is their favorite sport") > P("Football is there favorite sport")
      - Machine translation
      - Autocomplete
  • 8. N-GRAM LANGUAGE MODELS
      N-gram models estimate the probability of a word given the words that come before it:
      P("The boy walked to the circus") = P(The) * P(boy | The) * P(walked | The boy) * P(to | boy walked) * …
      Word n-grams are contiguous collections of n words:
      Unigram   Bigram       Trigram          ...
      The       The boy      The boy walked
      boy       boy walked   boy walked to
      walked    walked to    walked to the
      to        to the       to the circus
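
The n-gram table and the chain of conditional probabilities on slide 8 can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function names `word_ngrams` and `trigram_factors` are hypothetical, and tokenization is assumed to be a plain whitespace split.

```python
def word_ngrams(tokens, n):
    """Return every contiguous run of n words as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def trigram_factors(tokens):
    """Pair each word with the (at most two) words that precede it.

    Each pair (word, history) corresponds to one factor P(word | history)
    in the product shown on the slide.
    """
    return [(word, tuple(tokens[max(0, i - 2):i])) for i, word in enumerate(tokens)]


tokens = "The boy walked to the circus".split()
unigrams = word_ngrams(tokens, 1)
bigrams = word_ngrams(tokens, 2)
trigrams = word_ngrams(tokens, 3)
factors = trigram_factors(tokens)

for word, history in factors:
    print(f"P({word} | {' '.join(history)})" if history else f"P({word})")
```

Limiting the history to two words is what makes this a trigram model; a larger n keeps more history at the cost of sparser counts.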
  • 9. CHALLENGES BUILDING ML MODELS
      Parameter tuning:
      - Must determine an appropriate value for n, i.e., how much history to include
      - Small values of n are easy to estimate and are more likely to capture the probability of seeing terms in a general context
      - Large values of n are computationally more burdensome and risk overfitting, producing a model that fails to generalize well
      Training corpus:
      - Depends heavily on the application and the problem you're trying to solve
      - Learning a model for the industry (tech), the organization (Enron vs. Avocado), or each individual
      - Each may require a different amount of training history or different corpora to achieve acceptable accuracy
  • 10. CHALLENGES BUILDING ML MODELS
      Prepare for the unknown:
      - Must account for missing/incomplete data in the training corpus: a single unseen n-gram drives the whole product to zero, e.g. P("The boy walked to the circus") = P(walked | the boy) * … * P(circus | to the) = 0
      - Smoothing is a common technique applied to ML models when the universe of observations can never be fully covered by training, so we must assume any observation can occur
  • 11. CHALLENGES BUILDING ML MODELS (continued)
      Suppose we trained a unigram model on "The boy walked the dog". Without smoothing, the unseen term "circus" gets probability 0:
      Term     Freq      Prob
      the      2         2/|C| = 0.4
      boy      1         0.2
      walked   1         0.2
      dog      1         0.2
      circus   0         0
      TOTAL    |C| = 5   1.0
  • 12. CHALLENGES BUILDING ML MODELS (continued)
      Laplace smoothing helps solve this problem (there are numerous other smoothing techniques). For the unigram model trained on "The boy walked the dog":
      Term     Freq      Prob          Smoothed Freq   Smoothed Prob
      the      2         2/|C| = 0.4   2+1             3 / (|C| + |V|)
      boy      1         0.2           1+1             2 / (|C| + |V|)
      walked   1         0.2           1+1             2 / (|C| + |V|)
      dog      1         0.2           1+1             2 / (|C| + |V|)
      circus   0         0             1               1 / (|C| + |V|)
      TOTAL    |C| = 5   1.0           |C| + |V|       1.0
      Here |C| is the number of tokens in the training corpus and |V| is the number of distinct terms in the vocabulary (|C| = 5 and |V| = 5 in this table).
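
The Laplace (add-one) table on slide 12 can be reproduced with a short sketch. The function name `laplace_unigram` is hypothetical. The vocabulary is assumed to be the five terms listed in the table, with the unseen term "circus" included by hand.

```python
from collections import Counter


def laplace_unigram(tokens, vocab):
    """Add-one smoothed unigram probabilities: (count + 1) / (|C| + |V|)."""
    counts = Counter(tokens)
    denominator = len(tokens) + len(vocab)  # |C| + |V|
    return {term: (counts[term] + 1) / denominator for term in vocab}


corpus = "the boy walked the dog".split()            # |C| = 5 tokens
vocab = ["the", "boy", "walked", "dog", "circus"]    # |V| = 5 terms
probs = laplace_unigram(corpus, vocab)

for term in vocab:
    print(f"{term:8s} {probs[term]:.1f}")
```

Every term, seen or unseen, now has a non-zero probability, and the smoothed probabilities still sum to 1.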
  • 13. TEST DATASETS
      Enron Corporation Corpus:
      - ~500k emails from ~150 internal employees
      - Heavily used in NLP and social-networking research, with several sources of manual annotation
      - A significant amount of sensitive material was removed before release to the public, along with several redactions for privacy concerns
      - Freely and publicly available
      Avocado Collection:
      - ~1M emails from 280 employees of a defunct IT company fictitiously named Avocado
      - Includes attachments, contact lists, and calendars from those employees' personal folders
      - All data has been de-identified
      - Requires a license
  • 14. DATA CLEANUP: PRE-PROCESSING ANALYTICS
      - Analytics of unstructured data can be particularly difficult due to corpus inconsistencies, missing data, and a biased view (internal sender/recipient only)
      - This process is usually messy and requires multiple iterations
      - Need to focus on analyzing relevant text introduced by the sender
  • 15. DATA CLEANUP: PRE-PROCESSING ANALYTICS
      A few things we've done to home in on personalized text:
      - Address various format issues: strip extraneous markup (e.g., HTML, XML), header info, and terse text blocks with no discernible human content
      - Noise filtering: heuristics to identify bulk senders and exclude those mailboxes from further analysis
      - Disclaimer detection: n-gram and bag-of-words models to classify paragraphs as disclaimer text and exclude them from personalized language models
      - Thread email
      - Remove signature blocks
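
A rough sketch of the kind of per-message cleanup slide 15 describes is shown below. It is only an illustration under stated assumptions, not the authors' pipeline. The function `clean_email_body` is hypothetical, and the heuristics are simple conventions chosen for the example: a "--" line marks the start of a signature, "-----Original Message-----" marks forwarded or threaded text, and ">" marks quoted replies. Real corpora generally need more careful rules.

```python
import re

# Assumption: markup tags never span lines in this simplified example.
TAG = re.compile(r"<[^>\n]+>")


def clean_email_body(body):
    """Keep only the text the sender most likely typed themselves."""
    kept = []
    for raw in body.splitlines():
        line = raw.strip()
        # Stop at a signature delimiter or at embedded earlier messages.
        if line == "--" or line.startswith("-----Original Message-----"):
            break
        # Skip quoted reply lines.
        if line.startswith(">"):
            continue
        # Strip markup tags and collapse runs of whitespace.
        line = " ".join(TAG.sub(" ", line).split())
        if line:
            kept.append(line)
    return "\n".join(kept)


sample = (
    "<p>Can we meet on <b>Friday</b> at noon</p>\n"
    "> earlier quoted reply\n"
    "-- \n"
    "Jane Doe\n"
    "Example Corp"
)
cleaned = clean_email_body(sample)
print(cleaned)
```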
  • 16. LANGUAGE MODELING FOR USER IDENTIFICATION
      Given a new block of unstructured content, can we:
      - Learn organizational language models to differentiate Enron vs. Avocado?
      - Learn personalized language models for individuals in each corpus?
      - Apply the same techniques to nontraditional "less structured" content?
      The ability to address any of these and generalize the techniques across industries can drastically improve our ability to predict unusual activity.
  • 17. LEARNING A LANGUAGE MODEL
      Select sent emails from the top-N senders in each of the datasets:
      - 12 entities per corpus
      - Each entity has between ~1000 and ~9000 sent emails
      - Not every email contributes to the model, e.g., forwarded emails are discarded
      For each entity:
      - Split their email set into training (~80%) and testing (~20%) datasets
      - Use the KenLM Language Model Toolkit (https://kheafield.com/code/kenlm/) to learn an n-gram language model on the training dataset
      - Uses Kneser-Ney smoothing
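
The per-entity workflow on slide 17 (split each sender's mail, learn one model per sender, then score unseen text against every model) can be sketched without external dependencies. To stay self-contained, the sketch below swaps in a simple add-one smoothed bigram model in place of KenLM and Kneser-Ney smoothing, so it illustrates the workflow rather than the toolkit used in the talk. The class `AddOneBigramModel`, the helper names, and the toy senders "alice" and "bob" are all hypothetical.

```python
import math
import random
from collections import Counter


def split_train_test(emails, train_frac=0.8, seed=0):
    """Shuffle one sender's emails and split them roughly 80/20."""
    shuffled = emails[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]


def tokenize(text):
    """Lowercase, split on whitespace, add sentence boundary markers."""
    return ["<s>"] + text.lower().split() + ["</s>"]


class AddOneBigramModel:
    """Bigram model with add-one smoothing (a stand-in for KenLM)."""

    def __init__(self, emails, vocab_size):
        self.vocab_size = vocab_size
        self.bigrams = Counter()
        self.history = Counter()
        for text in emails:
            tokens = tokenize(text)
            for prev, word in zip(tokens, tokens[1:]):
                self.bigrams[(prev, word)] += 1
                self.history[prev] += 1

    def log_prob(self, text):
        tokens = tokenize(text)
        total = 0.0
        for prev, word in zip(tokens, tokens[1:]):
            numerator = self.bigrams[(prev, word)] + 1
            denominator = self.history[prev] + self.vocab_size
            total += math.log(numerator / denominator)
        return total


def most_likely_author(models, text):
    """Return the entity whose model assigns the text the highest log-probability."""
    return max(models, key=lambda name: models[name].log_prob(text))


# Toy data; a real run would use each sender's training split.
sent = {
    "alice": ["please review the power agreement", "the power agreement is attached"],
    "bob": ["the wireless application build failed", "please file the application output"],
}
vocab = {w for emails in sent.values() for e in emails for w in e.lower().split()} | {"</s>"}
models = {name: AddOneBigramModel(emails, len(vocab)) for name, emails in sent.items()}
guess = most_likely_author(models, "the power agreement")
print(guess)
```

Scoring in log space avoids numeric underflow when multiplying many small probabilities over long messages.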
  • 18. CASE STUDY: ORGANIZATIONAL LANGUAGE MODELS
      Avocado: 64370 types, 3950539 unigram tokens
      Enron: 28937 types, 943170 unigram tokens
  • 19. CASE STUDY: ORGANIZATIONAL LANGUAGE MODELS
      What are the models learning? Terms among the top 25 for each corpus:
      Avocado        Enron
      please         please
      file           power
      application    agreement
      avocadoit      enron
      wireless       attached
      output         state
      activityname   want
      new            new
      Surprising? Not really.
  • 20-23. CASE STUDY: PERSONALIZED MODELS (AVOCADO) (figure slides; no text in the transcript)
  • 24-25. CASE STUDY: PERSONALIZED MODELS (ENRON) (figure slides; no text in the transcript)
  • 26. HOW MUCH DATA DO WE NEED?
      - Changed the size of the training samples: 0.01, 0.05, 0.10, 0.25, 0.50, 0.75, 1.0
      - Constant test size
      Observations:
      - With a small training sample, we predicted the model with the most tokens most of the time
  • 30. HOW MUCH DATA DO WE NEED? (continued)
      Further observations:
      - Enron dataset is ~50% the size of Avocado
      - Enron modeling performance is similar to Avocado at the 50% training sample
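
The training-size experiment can be set up as sketched below. This is an assumption-laden illustration: the slides do not say how samples were drawn, so the sketch takes nested random prefixes of a shuffled training set, and the name `training_subsets` is hypothetical. The held-out test set is left untouched so every sample size is evaluated on the same messages.

```python
import random

FRACTIONS = [0.01, 0.05, 0.10, 0.25, 0.50, 0.75, 1.0]


def training_subsets(train_emails, fractions=FRACTIONS, seed=0):
    """Return {fraction: subset}, where smaller subsets are prefixes of larger ones."""
    shuffled = train_emails[:]
    random.Random(seed).shuffle(shuffled)
    return {f: shuffled[:max(1, round(len(shuffled) * f))] for f in fractions}


subsets = training_subsets([f"email {i}" for i in range(1000)])
sizes = {f: len(s) for f, s in subsets.items()}
print(sizes)
```

Using nested subsets means any change in accuracy between two sample sizes comes from the added messages rather than from a different random draw.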
  • 33. SOMETIMES MODELS ARE JUST BAD
      Jeff Dasovich:
      - Second largest training set from Enron
      - Most unique tokens
      - We are more likely to guess Richard Sanders as the author
      - Common top-25 tokens include 'know', 'like', 'call', 'get', 'time', 'would', 'thanks'
      Why do we fail to identify Jeff?
  • 34. SOMETIMES MODELS ARE JUST BAD (continued)
      Why do we fail to identify Jeff? He liked to embed news articles in his emails:
      "… This article showed up on Wednesday . Thought you might be interested . Texas Journal -- Energy traders cite gains , but some math is missing -- Volatile prices for natural gas and electricity are creating high-voltage counting on these gains could be in for a jolt down the road ..."
  • 35. EXTENDING TO OTHER STRUCTURED CONTENT
      Demonstrated a solution that:
      - Addresses the task of entity identification
      - Increases performance according to quantitative precision assessment
      - Improves performance over time with additional experience
      Potential future applications:
      - Chat or phone transcript
      - Command line activity
      - Database / SIEM queries
  • 36. Questions? Chris Poirel (Data Scientist) and Eduardo Luiggi (Data Scientist). BlackHat USA | August 2018

Editor's notes

  1. Enron: 943170 unigram tokens, 28937 types. Avocado: 3950539 unigram tokens, 64370 types.