SlideShare una empresa de Scribd logo
1 de 36
Descargar para leer sin conexión
Visualizing Clustered Botnet
Traffic Using t-SNE on
Aggregated NetFlows
Muayyad Alsadi Dr. Ali Hadi
Studying high
volume flood of
network traffic,
droplet by droplet
Objective
Visualize network traffic by clustering
similar behaviors together so that
researchers can easily analyse and
study interesting clusters.
An ideal clustering would look like
this figure: →
Objective is to visualize
not to detect
Analysis of the visualization might help in
detection
Introduction
Rule-based Detection
● Rule examples:
○ Blacklist of IPs
○ Blacklist of content checksums
● Require previous knowledge of malware
○ Known previous compromise
○ Can’t detect new, different or adaptive malware
● Botnets are dynamic, their IPs can change (as nodes join and leave)
● Malware payloads might include random data regions (resulting in
changing checksums)
Anomaly Detection
● Hard to define baseline in dynamic changing environment
● Require previous knowledge of environment or baseline
High-level Methodology
Collect
Capture traffic
Collect network traffic
and store it. In this
research we do not rely
on contents but rather on
the flow.
Aggregate
Group many rows
Reducing number of
rows into collective
figures (total/mean/...)
Embed & Visualize
Dimensionality reduction
Reducing number of
columns from higher
dimensions into 2D or 3D
coordinates in order to
visualize the data.
Netflows
● Like phone bill
○ Who contacted who
○ When
○ For how long
○ But NOT: what was said
How does Netflows file look like?
We have used the same file format as the data published in:
"An empirical comparison of botnet detection methods" Sebastian Garcia, Martin Grill, Honza
Stiborek and Alejandro Zunino. Computers and Security Journal, Elsevier. 2014. Vol 45, pp
100-123. http://dx.doi.org/10.1016/j.cose.2014.05.011
It’s 15 column CSV file that looks like this:
StartTime, Dur, Proto, SrcAddr, Sport, Dir, DstAddr, Dport, State,
sTos, dTos, TotPkts, TotBytes, SrcBytes, Label
2011/08/10 09:46:53.047277, 3550.182373, udp, 212.50.71.179, 39678,
<->, 147.32.84.229, 13363, CON, 0, 0, 12, 875, 413,
flow=Background-UDP-Established
Are Netflows Indicators Effective?
● Without looking at content:
○ periodic, small, fixed-sized,
short-duration, single destination
○ Single source, multiple
destinations, fixed-size
○ Non-periodic, variable sized,
variable duration, ..etc.
Aggregation of rows
● Taking the input netflows, and
picking some aggregation key like
source IP address then aggregate
some numbers like
○ Count of netflows
○ Average bytes
○ Summation of packets
○ ...etc
Choice of aggregations
● Choice of aggregation key
○ Should we include source port?
● Being robust and independent of
varying factors
○ Capture duration
○ Incremental or Real-time
Dimensionality reduction
● Given many dimensions for a single
data point
● We can study them by looking at only
two of those dimensions at a time
(ignoring others) which is called
projection
● Or use PCA or t-SNE to transform data
from high dimension to low dimension
(2D) preserving as much
Dimensionality reduction
● Principal component analysis (PCA)
○ Invented in 1901
○ Statistical method
○ orthogonal linear transformation
○ can be thought of as fitting an
n-dimensional ellipsoid to the data,
where each axis of the ellipsoid
represents a principal component.
Dimensionality reduction
● t-distributed Stochastic Neighbor
Embedding (t-SNE)
○ Modern 2008
○ Stochastic (depends on initial
random state)
○ Not only linear relation
○ Tries to preserve distances
Introduced methodology
Choice of aggregation: aggregation key
● Our aggregation key
○ the type of transport layer protocol (TCP/UDP/ICMP)
○ source address
○ destination address
○ destination port
● Source port is not included in our aggregation key but it’s not
ignored (later)
● Other papers used different keys like only IP Address or a fixed
moving window of 10 minutes traffic …etc. (see “Literature
review” section in our paper)
● Our choice to include destination port allow us to analyse
process behaviors not just hosts
Choice of aggregated fields: before normalization
● Number of Netflows
● Number of Distinct source ports
● Summation of total packets
● Summation of total bytes
● Summation of source bytes (bytes in forward direction)
● Summation of destination bytes (bytes in backward direction)
● Average bytes balance ratio (where bytes balance ratio is source bytes / total
bytes )
● Average wait time between flows
● Average flow duration
But they are not robust
● The scale along different dimensions is not the same
○ having 1 byte off in total bytes is not the same as
having 1 second off in daley between flows
○ IMPORTANT: t-SNE tries to preserve distances
● The longer the capture period, the more bytes we capture
and aggregate
○ Can’t be real-time
○ Can’t be incremental
Normalization: flows relative metrics
● Make numbers relative to number of flows, for example
○ Average number of packets per flow
○ Average total bytes per flow
● The ratio of number of distinct source ports over number of flows
○ A client that uses different ephemeral source port each time would have the ratio
to be 1.0
○ A client the uses the same port over and over would have a value that approaches
0.0
● No matter how long we capture or how many flows we have
aggregated
● The scale is not yet normalized
Normalization: scale and center
● Range normalized
○ x` = (x – xmin
) / (xmax
– xmin
)
○ Pros: scale [0 - 1.0]
○ Cons: not average centered
● Make metrics average centered
○ x` = (x – xavg
)
○ Pros: center is 0
○ Cons: Scale is (-∞, ∞)
● Make metrics average centered and
inverse max scale
○ x` = (x – xavg
) / xmax
○ Pros: center is 0
○ Cons: Scale
● Use log
○ x` = log(x+1)
○ Make 1,10,100,1000 into 0,1,2,3
● Standard Score
○ x`= ( x – xavg
) / σ
○ Pros: center is 0, step is σ
○ Cons: open-ended range (-∞, +∞)
● Sigmoid-family squashing
○ x`= tanh( x – xavg
)
○ Pros: center is 0, range is (-1.0, 1.0)
Evaluation of normalization
methods
TOP: PCA with no
normalization
(left) standard
score (right)
BOTTOM: range
normalized (left),
tanh-normalized
(right)
PCA applied to scenario
5 of CTU-13 dataset with
different normalization
methods
TOP: t-SNE with
no normalization
(left) standard
score (right)
BOTTOM: range
normalized (left),
tanh-normalized
(right)
t-SNE applied to scenario 5
of CTU-13 dataset
Evaluation of embedding
methods
Scenario 1 in the
CTU-13 dataset
PCA (top) vs. t-SNE
(bottom) applied on both
range normalized (left)
and tanh normalized
(right)
Include variance as a feature
Botnets tend to be periodic, being able to identify periodic
behaviors is desired.
Some papers try to calculate frequencies, but our paper use
“variance” as a metric or feature which would be 0 for
uniform values and increase when data is not uniform.
Final Normalized Aggregated Features
1. tanh(flowsi
- flowsavg
)
2. tanh(srcPortsi
- srcPortsavg
)
3. tanh(avgPacketsPerFlowi
- avgPacketsPerFlowavg
)
4. tanh(varPacketsPerFlowi
- varPacketsPerFlowavg
)
5. tanh(avgBytesLogPerFlowi
- avgBytesLogPerFlowavg
)
6. tanh(varBytesLogPerFlowi
-varBytesLogPerFlowavg
)
7. tanh(avgSrcBytesLogPerFlowi
- avgSrcBytesLogPerFlowavg
)
8. tanh(varSrcBytesLogPerFlowi
- varSrcBytesLogPerFlowavg
)
9. tanh(avgDstBytesLogPerFlowi
- avgDstBytesLogPerFlowavg
)
10. tanh(varDstBytesLogPerFlowi
- varDstBytesLogPerFlowavg
)
11. tanh(avgBalanceRatioi
- avgBalanceRatioavg
)
12. tanh(varBalanceRatioi
-varBalanceRatioavg
)
13. tanh(avgWaitTimei
- avgWaitTimeavg
)
14. tanh(varWaitTimei
- varWaitTimeavg
)
15. tanh(avgDurationi - avgDurationavg
)
16. tanh(varDurationi
- varDurationavg
)
Results
t-SNE applied to
scenario 1 of CTU-13
with colored ports
and protocols
With a glance at the
figure, this botnet
easily identified as a
spam botnet that
send mass number
of email possibly
based on IRC C2C
trigger. Beside IRC
this botnet also rely
heavily on DNS and it
rarely use ICMP.
t-SNE applied to
scenario 2 of CTU-13
with colored ports
and protocols
Like previous botnet
except that it does
not rely on DNS but
mainly on IRC
t-SNE applied to
scenario 4 of CTU-13
with colored ports
and protocols
Clearly this is a spam
botnet which uses
ICMP for its
operation
t-SNE applied to
scenario 5 of CTU-13
with colored ports
and protocols
Unlike previous
botnets, it’s not a
spam botnet, it
operates mainly over
IRC and it uses port
3128 (squid proxy
port) found bottom
left
Interactive tool published on
github
https://github.com/muayyad-alsadi/flowvis
Q & A

Más contenido relacionado

Similar a Visualizing botnets with t-SNE

Pregel: A System For Large Scale Graph Processing
Pregel: A System For Large Scale Graph ProcessingPregel: A System For Large Scale Graph Processing
Pregel: A System For Large Scale Graph Processing
Riyad Parvez
 
Database , 8 Query Optimization
Database , 8 Query OptimizationDatabase , 8 Query Optimization
Database , 8 Query Optimization
Ali Usman
 

Similar a Visualizing botnets with t-SNE (20)

Data Stream Management
Data Stream ManagementData Stream Management
Data Stream Management
 
Machine learning Investigative Reporting NorthBaySolutions.pdf
Machine learning Investigative Reporting NorthBaySolutions.pdfMachine learning Investigative Reporting NorthBaySolutions.pdf
Machine learning Investigative Reporting NorthBaySolutions.pdf
 
Forecasting time series powerful and simple
Forecasting time series powerful and simpleForecasting time series powerful and simple
Forecasting time series powerful and simple
 
Network-Wide Heavy-Hitter Detection with Commodity Switches
Network-Wide Heavy-Hitter Detection with Commodity SwitchesNetwork-Wide Heavy-Hitter Detection with Commodity Switches
Network-Wide Heavy-Hitter Detection with Commodity Switches
 
Data Structures - Lecture 1 [introduction]
Data Structures - Lecture 1 [introduction]Data Structures - Lecture 1 [introduction]
Data Structures - Lecture 1 [introduction]
 
MANET Routing Protocols , a case study
MANET Routing Protocols , a case studyMANET Routing Protocols , a case study
MANET Routing Protocols , a case study
 
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
 
Ledingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @LendingkartLedingkart Meetup #2: Scaling Search @Lendingkart
Ledingkart Meetup #2: Scaling Search @Lendingkart
 
Network Statistics for OpenFlow
Network Statistics for OpenFlowNetwork Statistics for OpenFlow
Network Statistics for OpenFlow
 
1.0 Descriptive statistics.pdf
1.0 Descriptive statistics.pdf1.0 Descriptive statistics.pdf
1.0 Descriptive statistics.pdf
 
Pregel: A System For Large Scale Graph Processing
Pregel: A System For Large Scale Graph ProcessingPregel: A System For Large Scale Graph Processing
Pregel: A System For Large Scale Graph Processing
 
Searching Algorithms
Searching AlgorithmsSearching Algorithms
Searching Algorithms
 
Setting up an A/B-testing framework
Setting up an A/B-testing frameworkSetting up an A/B-testing framework
Setting up an A/B-testing framework
 
Recurrent Neural Networks
Recurrent Neural NetworksRecurrent Neural Networks
Recurrent Neural Networks
 
Analysis of algorithms
Analysis of algorithmsAnalysis of algorithms
Analysis of algorithms
 
Network Bandwidth Allocation.ppt
Network Bandwidth Allocation.pptNetwork Bandwidth Allocation.ppt
Network Bandwidth Allocation.ppt
 
Data_structure using C-Adi.pdf
Data_structure using C-Adi.pdfData_structure using C-Adi.pdf
Data_structure using C-Adi.pdf
 
Database , 8 Query Optimization
Database , 8 Query OptimizationDatabase , 8 Query Optimization
Database , 8 Query Optimization
 
IDS for IoT.pptx
IDS for IoT.pptxIDS for IoT.pptx
IDS for IoT.pptx
 
Tracking the tracker: Time Series Analysis in Python from First Principles
Tracking the tracker: Time Series Analysis in Python from First PrinciplesTracking the tracker: Time Series Analysis in Python from First Principles
Tracking the tracker: Time Series Analysis in Python from First Principles
 

Más de muayyad alsadi (7)

Accelerating stochastic gradient descent using adaptive mini batch size3
Accelerating stochastic gradient descent using adaptive mini batch size3Accelerating stochastic gradient descent using adaptive mini batch size3
Accelerating stochastic gradient descent using adaptive mini batch size3
 
Taking your code to production
Taking your code to productionTaking your code to production
Taking your code to production
 
Introduction to Raft algorithm
Introduction to Raft algorithmIntroduction to Raft algorithm
Introduction to Raft algorithm
 
Techtalks: taking docker to production
Techtalks: taking docker to productionTechtalks: taking docker to production
Techtalks: taking docker to production
 
How to think like hardware hacker
How to think like hardware hackerHow to think like hardware hacker
How to think like hardware hacker
 
الاختيار بين التقنيات
الاختيار بين التقنياتالاختيار بين التقنيات
الاختيار بين التقنيات
 
ملتقى الصناع هيا نصنع أردوينو وندخل إلى خفاياه
ملتقى الصناع  هيا نصنع أردوينو وندخل إلى خفاياهملتقى الصناع  هيا نصنع أردوينو وندخل إلى خفاياه
ملتقى الصناع هيا نصنع أردوينو وندخل إلى خفاياه
 

Último

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Último (20)

How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 

Visualizing botnets with t-SNE

  • 1. Visualizing Clustered Botnet Traffic Using t-SNE on Aggregated NetFlows Muayyad Alsadi Dr. Ali Hadi
  • 2. Studying high volume flood of network traffic, droplet by droplet
  • 3. Objective Visualize network traffic by clustering similar behaviors together so that researchers can easily analyse and study interesting clusters. An ideal clustering would look like this figure: →
  • 4. Objective is to visualize not to detect Analysis of the visualization might help in detection
  • 6. Rule-based Detection ● Rule examples: ○ Blacklist of IPs ○ Blacklist of content checksums ● Require previous knowledge of malware ○ Known previous compromise ○ Can’t detect new, different or adaptive malware ● Botnets are dynamic, their IPs can change (as nodes join and leave) ● Malware payloads might include random data regions (resulting in changing checksums)
  • 7. Anomaly Detection ● Hard to define baseline in dynamic changing environment ● Require previous knowledge of environment or baseline
  • 8. High-level Methodology Collect Capture traffic Collect network traffic and store it. In this research we do not rely on contents but rather on the flow. Aggregate Group many rows Reducing number of rows into collective figures (total/mean/...) Embed & Visualize Dimensionality reduction Reducing number of columns from higher dimensions into 2D or 3D coordinates in order to visualize the data.
  • 9. Netflows ● Like phone bill ○ Who contacted who ○ When ○ For how long ○ But NOT: what was said
  • 10. How does Netflows file look like? We have used the same file format as the data published in: "An empirical comparison of botnet detection methods" Sebastian Garcia, Martin Grill, Honza Stiborek and Alejandro Zunino. Computers and Security Journal, Elsevier. 2014. Vol 45, pp 100-123. http://dx.doi.org/10.1016/j.cose.2014.05.011 It’s 15 column CSV file that looks like this: StartTime, Dur, Proto, SrcAddr, Sport, Dir, DstAddr, Dport, State, sTos, dTos, TotPkts, TotBytes, SrcBytes, Label 2011/08/10 09:46:53.047277, 3550.182373, udp, 212.50.71.179, 39678, <->, 147.32.84.229, 13363, CON, 0, 0, 12, 875, 413, flow=Background-UDP-Established
  • 11. Are Netflows Indicators Effective? ● Without looking at content: ○ periodic, small, fixed-sized, short-duration, single destination ○ Single source, multiple destinations, fixed-size ○ Non-periodic, variable sized, variable duration, ..etc.
  • 12. Aggregation of rows ● Taking the input netflows, and picking some aggregation key like source IP address then aggregate some numbers like ○ Count of netflows ○ Average bytes ○ Summation of packets ○ ...etc
  • 13. Choice of aggregations ● Choice of aggregation key ○ Should we include source port? ● Being robust and independent of varying factors ○ Capture duration ○ Incremental or Real-time
  • 14. Dimensionality reduction ● Given many dimensions for a single data point ● We can study them by looking at only two of those dimensions at a time (ignoring others) which is called projection ● Or use PCA or t-SNE to transform data from high dimension to low dimension (2D) preserving as much
  • 15. Dimensionality reduction ● Principal component analysis (PCA) ○ Invented in 1901 ○ Statistical method ○ orthogonal linear transformation ○ can be thought of as fitting an n-dimensional ellipsoid to the data, where each axis of the ellipsoid represents a principal component.
  • 16. Dimensionality reduction ● t-distributed Stochastic Neighbor Embedding (t-SNE) ○ Modern 2008 ○ Stochastic (depends on initial random state) ○ Not only linear relation ○ Tries to preserve distances
  • 18. Choice of aggregation: aggregation key ● Our aggregation key ○ the type of transport layer protocol (TCP/UDP/ICMP) ○ source address ○ destination address ○ destination port ● Source port is not included in our aggregation key but it’s not ignored (later) ● Other papers used different keys like only IP Address or a fixed moving window of 10 minutes traffic …etc. (see “Literature review” section in our paper) ● Our choice to include destination port allow us to analyse process behaviors not just hosts
  • 19. Choice of aggregated fields: before normalization ● Number of Netflows ● Number of Distinct source ports ● Summation of total packets ● Summation of total bytes ● Summation of source bytes (bytes in forward direction) ● Summation of destination bytes (bytes in backward direction) ● Average bytes balance ratio (where bytes balance ratio is source bytes / total bytes ) ● Average wait time between flows ● Average flow duration
  • 20. But they are not robust ● The scale along different dimensions is not the same ○ having 1 byte off in total bytes is not the same as having 1 second off in daley between flows ○ IMPORTANT: t-SNE tries to preserve distances ● The longer the capture period, the more bytes we capture and aggregate ○ Can’t be real-time ○ Can’t be incremental
  • 21. Normalization: flows relative metrics ● Make numbers relative to number of flows, for example ○ Average number of packets per flow ○ Average total bytes per flow ● The ratio of number of distinct source ports over number of flows ○ A client that uses different ephemeral source port each time would have the ratio to be 1.0 ○ A client the uses the same port over and over would have a value that approaches 0.0 ● No matter how long we capture or how many flows we have aggregated ● The scale is not yet normalized
  • 22. Normalization: scale and center ● Range normalized ○ x` = (x – xmin ) / (xmax – xmin ) ○ Pros: scale [0 - 1.0] ○ Cons: not average centered ● Make metrics average centered ○ x` = (x – xavg ) ○ Pros: center is 0 ○ Cons: Scale is (-∞, ∞) ● Make metrics average centered and inverse max scale ○ x` = (x – xavg ) / xmax ○ Pros: center is 0 ○ Cons: Scale ● Use log ○ x` = log(x+1) ○ Make 1,10,100,1000 into 0,1,2,3 ● Standard Score ○ x`= ( x – xavg ) / σ ○ Pros: center is 0, step is σ ○ Cons: open-ended range (-∞, +∞) ● Sigmoid-family squashing ○ x`= tanh( x – xavg ) ○ Pros: center is 0, range is (-1.0, 1.0)
  • 24. TOP: PCA with no normalization (left) standard score (right) BOTTOM: range normalized (left), tanh-normalized (right) PCA applied to scenario 5 of CTU-13 dataset with different normalization methods
  • 25. TOP: t-SNE with no normalization (left) standard score (right) BOTTOM: range normalized (left), tanh-normalized (right) t-SNE applied to scenario 5 of CTU-13 dataset
  • 27. Scenario 1 in the CTU-13 dataset PCA (top) vs. t-SNE (bottom) applied on both range normalized (left) and tanh normalized (right)
  • 28. Include variance as a feature Botnets tend to be periodic, being able to identify periodic behaviors is desired. Some papers try to calculate frequencies, but our paper use “variance” as a metric or feature which would be 0 for uniform values and increase when data is not uniform.
  • 29. Final Normalized Aggregated Features 1. tanh(flowsi - flowsavg ) 2. tanh(srcPortsi - srcPortsavg ) 3. tanh(avgPacketsPerFlowi - avgPacketsPerFlowavg ) 4. tanh(varPacketsPerFlowi - varPacketsPerFlowavg ) 5. tanh(avgBytesLogPerFlowi - avgBytesLogPerFlowavg ) 6. tanh(varBytesLogPerFlowi -varBytesLogPerFlowavg ) 7. tanh(avgSrcBytesLogPerFlowi - avgSrcBytesLogPerFlowavg ) 8. tanh(varSrcBytesLogPerFlowi - varSrcBytesLogPerFlowavg ) 9. tanh(avgDstBytesLogPerFlowi - avgDstBytesLogPerFlowavg ) 10. tanh(varDstBytesLogPerFlowi - varDstBytesLogPerFlowavg ) 11. tanh(avgBalanceRatioi - avgBalanceRatioavg ) 12. tanh(varBalanceRatioi -varBalanceRatioavg ) 13. tanh(avgWaitTimei - avgWaitTimeavg ) 14. tanh(varWaitTimei - varWaitTimeavg ) 15. tanh(avgDurationi - avgDurationavg ) 16. tanh(varDurationi - varDurationavg )
  • 31. t-SNE applied to scenario 1 of CTU-13 with colored ports and protocols With a glance at the figure, this botnet easily identified as a spam botnet that send mass number of email possibly based on IRC C2C trigger. Beside IRC this botnet also rely heavily on DNS and it rarely use ICMP.
  • 32. t-SNE applied to scenario 2 of CTU-13 with colored ports and protocols Like previous botnet except that it does not rely on DNS but mainly on IRC
  • 33. t-SNE applied to scenario 4 of CTU-13 with colored ports and protocols Clearly this is a spam botnet which uses ICMP for its operation
  • 34. t-SNE applied to scenario 5 of CTU-13 with colored ports and protocols Unlike previous botnets, it’s not a spam botnet, it operates mainly over IRC and it uses port 3128 (squid proxy port) found bottom left
  • 35. Interactive tool published on github https://github.com/muayyad-alsadi/flowvis
  • 36. Q & A