A WHIRLWIND TOUR OF ACADEMIC TECHNIQUES FOR REAL-WORLD SECURITY RESEARCHERS
Silvio Cesare, Deakin University
2. Introduction
Started off in industry (Qualys, now Volvent).
Have a Masters by Research.
About to receive a PhD from Deakin University.
Last 5 years in post-graduate University research.
Learnt some cool things along the way.
3. What did I do at University?
Malwise v1 (Masters): Malware variant detection system.
Malwise v2: Improved version.
Simseer: Binary comparison and visualization service.
Simseer Cluster: Binary clustering service.
Simseer Search: Further improved malware variant search service.
Clonewise: Automated detection of embedded libraries in source.
Bugalyze: Detection of bugs using data flow analysis.
5. An incomplete list of mathematical objects
Strings
Vectors
Sets
Sets of Objects
Trees
Graphs
6. Objects
Comparison cost depends on the kind of object.
Example
Comparing two vectors is fairly fast.
Exact matching of two strings is fairly fast.
Inexact matching of two strings is moderately slow.
Comparing two graphs is slow.
[Figure: sequence alignment of two strings, O(mn)]
7. Transforming one object to another
Problem
Comparing two 100kb strings using the edit distance is impractically slow.
ed(“hello”, “ggello”) = 2
Solution
Transform the strings into vectors.
Then, use a vector comparison – which is fast.
Examples
Comparing malware samples
Finding near duplicate web pages
Comparing E-Mails
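To make the cost concrete, here is a minimal sketch of the edit distance using the standard dynamic programming recurrence. The function name `edit_distance` is an illustrative choice, not taken from any of the tools above. The two nested loops are why two 100kb strings need on the order of 10^10 cell updates.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, O(len(a) * len(b)) time, O(len(b)) space."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match for free
            ))
        prev = curr
    return prev[-1]


print(edit_distance("hello", "ggello"))  # 2
```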
8. N-Grams
Extract all N-length substrings (N-Grams) from the original string.
From a training set of strings, choose the best N-Grams.
Each unique N-Gram is an index in a vector.
The value of the element is the number of times it occurs.
Example: the 4-Grams of W|IEH}R are W|IE, |IEH, IEH} and EH}R.
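The steps above can be sketched in a few lines. This is an illustration under assumptions: "best" is taken to mean "most frequent in the training set", and the helper names `ngrams`, `choose_vocabulary` and `ngram_vector` are hypothetical.

```python
from collections import Counter


def ngrams(s, n):
    """All overlapping substrings of length n, in order."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def choose_vocabulary(training, n, size):
    """Assumption: the 'best' N-Grams are the most frequent ones."""
    totals = Counter()
    for s in training:
        totals.update(ngrams(s, n))
    return [gram for gram, _ in totals.most_common(size)]


def ngram_vector(s, vocabulary, n):
    """One element per vocabulary N-Gram, holding its count in s."""
    counts = Counter(ngrams(s, n))
    return [counts[gram] for gram in vocabulary]


print(ngrams("W|IEH}R", 4))  # ['W|IE', '|IEH', 'IEH}', 'EH}R']
```

Two strings of any length now map to vectors of the same fixed length, so they can be compared with a cheap vector distance.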
10. A Graph problem
Graph problems like approximate similarity are slow to solve.
Decompose the graph into subgraphs of at most k nodes.
Canonicalize the small graphs, represent each by its adjacency matrix, and transform that to a string.
The graph is now a ‘Set of Strings’.
Optionally represent it as a vector of ‘important k-subgraphs’.
Use vector distance metrics to compare, index, and search.
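A rough sketch of the canonicalization step, assuming the subgraphs are small enough to try every node ordering (k of 5 or 6 at most). The brute-force approach and the name `canonical_string` are illustrative assumptions, not the method used in the tools above. The canonical form is the smallest adjacency matrix, flattened row by row into a string.

```python
from itertools import permutations


def canonical_string(nodes, edges):
    """Smallest row-major adjacency bit string over all node orderings.

    Isomorphic directed graphs get the same string regardless of labels.
    Cost is O(k! * k^2), so this only suits very small subgraphs.
    """
    edge_set = set(edges)
    best = None
    for order in permutations(list(nodes)):
        bits = "".join(
            "1" if (u, v) in edge_set else "0"
            for u in order
            for v in order
        )
        if best is None or bits < best:
            best = bits
    return best


# The same 3-node path under two different labelings.
path_a = canonical_string("abc", [("a", "b"), ("b", "c")])
path_b = canonical_string([1, 2, 3], [(3, 1), (1, 2)])
print(path_a == path_b)  # True
```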
12. Graphs – Case Study
Implemented in Malwise and Simseer
Take control flow graphs of programs.
Decompile into strings.
One: Consider the program as a vector of N-Grams of decompiled strings.
Two: Consider the program as a set of strings.
[Figure: control flow graph with nodes L_0 to L_7 and edges labelled true, shown with its decompiled form]
proc(){
  L_0:
  while (v1 || v2) {
    L_1:
    if (v3) {
      L_2:
    } else {
      L_4:
    }
    L_5:
  }
  L_7:
  return;
}
13. Final Remarks on Objects
Know how to represent your problem.
Look into how the representation can be approximated by transforming it into another object.
Vectors are often a good choice.
14. Comparing
Problem
Measure the similarity of (or distance between) two objects.
Solution
Represent objects mathematically.
Use one of the many available mathematical measures.
Examples
Malware similarity
Near duplicate web pages
15. Comparing Sets
A set is a collection of elements.
Given an equality function between elements, we
can measure set similarity.
Inexact matching
Dice coefficient: $s = \frac{2|A \cap B|}{|A| + |B|}$
Jaccard index: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$
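Both set measures are one-liners over Python sets. This is a minimal sketch. Returning 1.0 for two empty sets is a convention chosen here, not something the definitions dictate.

```python
def dice(a: set, b: set) -> float:
    """Dice coefficient: 2|A n B| / (|A| + |B|)."""
    if not a and not b:
        return 1.0  # assumed convention: two empty sets are identical
    return 2 * len(a & b) / (len(a) + len(b))


def jaccard(a: set, b: set) -> float:
    """Jaccard index: |A n B| / |A u B|."""
    if not a and not b:
        return 1.0  # assumed convention, as above
    return len(a & b) / len(a | b)


a, b = {1, 2, 3}, {2, 3, 4}
print(jaccard(a, b))  # 0.5
```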
16. Comparing Vectors – Ugh, math.
Euclidean Distance: $d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$
Manhattan Distance: $d(p, q) = \sum_{i=1}^{n} |q_i - p_i|$
Cosine Similarity: $\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \|B\|}$
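In code the three measures are short loops over paired elements. A minimal sketch, assuming both vectors have the same length and that cosine similarity is only asked for non-zero vectors.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))


def manhattan(p, q):
    """Sum of absolute per-dimension differences."""
    return sum(abs(qi - pi) for pi, qi in zip(p, q))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b; undefined for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(euclidean([0, 0], [3, 4]))  # 5.0
print(manhattan([0, 0], [3, 4]))  # 7
```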
17. Vector distance – a different look
A vector is an n-dimensional point in space.
E.g., a 2-d vector is <x,y>
18. Cosine similarity
Line from origin to n-dimensional point.
Given 2 lines, what’s the angle (theta) between them?
The smaller the angle, the more similar.
[Figure: lines from the origin to Point A and Point B, with the angle Theta between them]
19. Comparing Vectors – Case Study
Malwise v2
Feature vector of N-Grams of decompiled flowgraphs
Manhattan Distance
Simseer Search
Same feature vector
Euclidean Distance
20. Comparing Sets – Case Study
Malwise v1
An element is a graph invariant of the control flow
graph, represented as an integer.
A program is a set of integers.
Compare similarity between two programs using
Dice coefficient.
21. Malwise v1 - Comparing Sets
[Figure: control flow graph with nodes 1 to 4 and edges labelled T and F]
(1 -> 2), (1 -> 4)
(2 -> 3), ()
(), ()
(4 -> 3), ()
$s(A, B) = \frac{2 \sum_i w_i \times |A_i \cap B_i|}{\sum_i w_i \times |A_i| + \sum_i w_i \times |B_i|}$
22. Comparing Sets of Strings in Malwise v2 – Case Study
String is a decompiled flowgraph.
Program is a set of strings.
Edit distance between strings.
Construct a 1:1 mapping between elements of the sets, such that the sum of distances is minimized.
Solved using ‘combinatorial optimisation’: this is the Assignment Problem.
Solution by “graph matching”
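To illustrate what the mapping looks like, here is a brute-force sketch that tries every 1:1 pairing and keeps the cheapest. It is not how a production system would do it: trying all permutations is factorial in the set size, and a polynomial-time assignment algorithm (for example the Hungarian method) is the usual choice. The names `min_cost_matching` and `edit_distance` are hypothetical, and leaving surplus elements of the larger set unmatched is an assumption.

```python
from itertools import permutations


def edit_distance(a, b):
    """Levenshtein distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def min_cost_matching(strings_a, strings_b, distance=edit_distance):
    """Cheapest 1:1 pairing by exhaustive search (small inputs only).

    Assumes a symmetric distance. Extra elements of the larger
    collection are left unmatched.
    """
    small, large = list(strings_a), list(strings_b)
    if len(small) > len(large):
        small, large = large, small
    best_cost, best_pairs = None, []
    for candidate in permutations(large, len(small)):
        cost = sum(distance(x, y) for x, y in zip(small, candidate))
        if best_cost is None or cost < best_cost:
            best_cost, best_pairs = cost, list(zip(small, candidate))
    return best_cost, best_pairs


cost, pairs = min_cost_matching(["abc", "xyz"], ["xyzz", "abd"])
print(cost)   # 2
print(pairs)  # [('abc', 'abd'), ('xyz', 'xyzz')]
```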
24. Final Remarks on Comparing
Inexact matching is your friend.
Try to use known distance metrics.
They have useful properties and index better.
If it’s too slow to compare, transform the object.
25. Similarity Searching
Problem
Find all ‘similar’ objects to my query in a database
Example
Find all words in a dictionary with at most 3 differences to my query word.
This problem is known as a ‘similarity search’
Solution
Naive exhaustive search.
Better to use ‘Metric Trees’
26. Similarity Search Constraints
Variations
K-nearest neighbours – the k closest objects to the query.
All objects within a specific distance to the query.
Search based on using a ‘metric distance’.
Metric distances satisfy non-negativity, identity, symmetry, and the triangle inequality.
Examples
Euclidean Distance
Jaccard Distance
Cosine Distance is not metric
27. Searching – Case Study
Malwise v2
Distance metric is Manhattan Distance.
Use VP-Trees to index and search in stage 1.
Use DBM-Trees to index and search in stage 2.
Implemented using open source GBDI Arboretum
library.
[Figure: range search of radius r around query q, with d(p,q) the distance to a point p; panels: Query Benign, Query Malicious]
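For intuition about how a metric tree prunes the search, here is a small pure-Python vantage point tree with a range query. It is a sketch only and does not reflect the GBDI Arboretum API. All names (`VPNode`, `build_vp_tree`, `range_search`) are hypothetical. The pruning is valid only because the distance obeys the triangle inequality.

```python
import random


class VPNode:
    """A vantage point, its median radius, and the two subtrees."""

    def __init__(self, point, radius, inside, outside):
        self.point = point
        self.radius = radius
        self.inside = inside    # points with d(vantage, p) <= radius
        self.outside = outside  # points with d(vantage, p) > radius


def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


def build_vp_tree(points, distance, rng=None):
    if rng is None:
        rng = random.Random(0)  # fixed seed keeps the sketch repeatable
    points = list(points)
    if not points:
        return None
    vantage = points.pop(rng.randrange(len(points)))
    if not points:
        return VPNode(vantage, 0, None, None)
    dists = [distance(vantage, p) for p in points]
    radius = sorted(dists)[len(dists) // 2]  # median splits the rest
    inside = [p for p, d in zip(points, dists) if d <= radius]
    outside = [p for p, d in zip(points, dists) if d > radius]
    return VPNode(vantage, radius,
                  build_vp_tree(inside, distance, rng),
                  build_vp_tree(outside, distance, rng))


def range_search(node, query, r, distance, found=None):
    """Collect every indexed point p with distance(query, p) <= r."""
    if found is None:
        found = []
    if node is None:
        return found
    d = distance(query, node.point)
    if d <= r:
        found.append(node.point)
    if d - r <= node.radius:  # query ball can overlap the inside region
        range_search(node.inside, query, r, distance, found)
    if d + r > node.radius:   # query ball can reach the outside region
        range_search(node.outside, query, r, distance, found)
    return found


grid = [(x, y) for x in range(6) for y in range(6)]
tree = build_vp_tree(grid, manhattan)
print(sorted(range_search(tree, (2, 2), 1, manhattan)))
# [(1, 2), (2, 1), (2, 2), (2, 3), (3, 2)]
```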
28. Final Remarks on Searching
Searching for inexact matches is useful.
Use good distance metrics.
Use open source libraries.
29. Classification
The problem:
Given a set of N classes.
And a query object.
Assign one of the classes to the object.
[Figure: objects grouped into Class A and Class B]
Examples
Is this binary (malicious, not malicious)?
Is this Gmail email (primary, social, promotional)?
Is this web page (defaced, not defaced)?
30. Classification Methodology
Supervised Learning
Given a training set of objects labelled by their class.
Build a model.
Then use the model to classify unknown objects.
Unsupervised Learning
No labelled data exists.
“Cluster” objects into classes.
Use the clusters to train a model.
Then classify as normal.
31. Classification – What do I have to do?
Represent objects using “feature vectors”
A vector is an array.
Each element represents a “feature”.
The value of the element tends to be a count of
something, or a size.
Feature examples
The number of times a dictionary word such as “Hello” appears in an E-Mail.
The size of a binary.
The number of times LoadLibraryA is executed.
32. Classification – WEKA?
Put the feature vectors into the text-based ARFF file
format.
Plug into the WEKA machine learning toolkit.
Experiment with different classifiers.
Part of your labelled data can be used to evaluate
the accuracy.
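Writing the ARFF file is mostly string formatting. The sketch below assumes numeric features, a nominal class attribute, and attribute names without spaces or special characters (those would need quoting). The function name `to_arff` and the example feature names are made up for illustration.

```python
def to_arff(relation, feature_names, classes, rows):
    """Render labelled feature vectors as ARFF text.

    rows is a list of (feature_vector, class_label) pairs.
    """
    lines = [f"@relation {relation}", ""]
    for name in feature_names:
        lines.append(f"@attribute {name} numeric")
    lines.append("@attribute class {" + ",".join(classes) + "}")
    lines += ["", "@data"]
    for features, label in rows:
        lines.append(",".join(str(v) for v in features) + "," + label)
    return "\n".join(lines) + "\n"


arff_text = to_arff(
    "emails",
    ["hello_count", "size_bytes"],  # hypothetical features
    ["spam", "ham"],
    [([2, 1024], "spam"), ([0, 300], "ham")],
)
print(arff_text)
```

The resulting text can be saved with a `.arff` extension and opened in WEKA.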
35. Classification – Case Study
Clonewise
Feature vector is a set of features extracted from a pair of packages.
Classify – do these packages share code (yes, no)?
Classify – is the 1st package embedded in the 2nd package (yes, no)?
36. Final Remarks on Classification
Many problems can be framed as classification.
Learn how to use WEKA.
Vectors are very good representations.
37. Clustering
Problem
To group together “similar” objects under some notion of similarity.
Easy solution
Represent objects using “feature vectors”.
Plug into WEKA.
[Figure: Packages in Fedora Linux]
38. Clustering - Case Study
Simseer Cluster
Represent binaries using N-Grams of decompiled flowgraphs.
Use most frequent N-Grams as features.
Distance measure is cosine distance.
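To show where the cosine distance enters, here is a deliberately simple greedy "leader" clustering. This is a stand-in chosen for brevity, not the algorithm Simseer Cluster or WEKA uses. The threshold value and the names `cosine_distance` and `leader_cluster` are assumptions, and vectors are assumed to be non-zero.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def leader_cluster(vectors, threshold):
    """Assign each vector to the first cluster whose leader is close enough.

    The first member of a cluster acts as its leader. A vector that is
    farther than threshold from every leader starts a new cluster.
    """
    clusters = []
    for v in vectors:
        for cluster in clusters:
            if cosine_distance(v, cluster[0]) <= threshold:
                cluster.append(v)
                break
        else:
            clusters.append([v])
    return clusters


# Hypothetical N-Gram count vectors for four binaries.
samples = [[1, 0], [2, 0.1], [0, 1], [0.1, 3]]
for group in leader_cluster(samples, threshold=0.1):
    print(group)
```

The result depends on input order, which is one reason to prefer a proper clustering implementation for real work.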
39. Final Remarks on Clustering
A classic machine learning problem.
Again, learn to use WEKA.
40. Program Analysis
An incredibly large and deep field.
This section skims the surface.
Main approaches
Model Checking
Theorem Proving
Abstract Interpretation
Data Flow Analysis
41. Model Checking
Looks at program states generated by a program.
Some states indicate bugs.
Try BLAST, a model checker for small C programs.
Caveat: it’s pretty old now.
42. Theorem Proving - SMT
SMT – what is it?
An equation solver that covers the types of operations seen
in machine code.
Approach for Bug Detection
User input can generally be anything, so treat it as a “symbolic” variable.
The rest is concrete.
Simulate execution of the program, plugging all the machine code that is executed into the solver formulae.
Concolic execution
Combining symbolic execution with concrete execution.
43. Concolic Execution
At branches, can we have user input that forces us
to go down each path?
Use the SMT solver to tell us.
Launch execution down ‘feasible’ paths.
Use the solver to tell us if bugs are present.
What user input, if any, can make this pointer NULL?
45. Abstract Interpretation
Abstract the execution of the program.
Example
Only consider the sign of a variable, not the actual value.
Requires a transfer function
What an instruction does to the abstract data.
And a Join/Meet function
How data is combined when it meets from different control flow paths.
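The sign example can be written down directly. In this sketch the abstract values are negative, zero, positive, "unknown" (TOP) and "unreachable" (BOT). The transfer functions cover only addition and multiplication, and all names are illustrative.

```python
NEG, ZERO, POS, TOP, BOT = "-", "0", "+", "T", "_"


def abstract_mul(a, b):
    """Transfer function for x * y on signs."""
    if BOT in (a, b):
        return BOT
    if ZERO in (a, b):
        return ZERO
    if TOP in (a, b):
        return TOP
    return POS if a == b else NEG


def abstract_add(a, b):
    """Transfer function for x + y on signs."""
    if BOT in (a, b):
        return BOT
    if a == ZERO:
        return b
    if b == ZERO:
        return a
    if a == b and a != TOP:
        return a    # (+) + (+) stays +, (-) + (-) stays -
    return TOP      # mixed signs or unknown operands: sign is lost


def join(a, b):
    """Combine facts arriving from two control flow paths."""
    if a == BOT:
        return b
    if b == BOT:
        return a
    return a if a == b else TOP


# if (c) x = 5; else x = 7;  then  y = x * -2;
x = join(POS, POS)
y = abstract_mul(x, NEG)
print(x, y)  # + -
```

The analysis never learns that x is 5 or 7, only that it is positive, which is enough to conclude that y is negative on every path.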
46. Data Flow Analysis
Similar to abstract interpretation.
Uses a transfer function and a join.
Implement both using a monotone framework.
Data Flow analysis is used by compilers.
Classic data flow problems
Reaching definitions: how far a definition of, or assignment to, a variable reaches.
Liveness: knowing if a variable will be read again before being assigned a new value.
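As a concrete instance, here is a sketch of liveness analysis that iterates to a fixed point. The control flow graph encoding (per-block `use` and `def` sets plus a successor map) and the block names are assumptions made for the example. The join is set union, and the transfer function is `use | (out - def)`.

```python
def liveness(blocks, successors):
    """Backward data flow analysis for live variables.

    blocks: {name: (use, defs)}, where use holds variables read before
            any write in the block and defs holds variables written.
    successors: {name: [successor names]}.
    Returns (live_in, live_out) dictionaries of sets.
    """
    live_in = {name: set() for name in blocks}
    live_out = {name: set() for name in blocks}
    changed = True
    while changed:  # sets only ever grow, so this terminates
        changed = False
        for name, (use, defs) in blocks.items():
            # Join: a variable is live out if any successor needs it.
            out = set().union(*(live_in[s] for s in successors.get(name, [])))
            # Transfer: reads here, plus whatever passes through unwritten.
            new_in = use | (out - defs)
            if out != live_out[name] or new_in != live_in[name]:
                live_out[name], live_in[name] = out, new_in
                changed = True
    return live_in, live_out


# B1: x = input()   B2: y = x + 1   B3: x = 0   B4: print(y)
blocks = {
    "B1": (set(), {"x"}),
    "B2": ({"x"}, {"y"}),
    "B3": (set(), {"x"}),
    "B4": ({"y"}, set()),
}
successors = {"B1": ["B2"], "B2": ["B3"], "B3": ["B4"]}
live_in, live_out = liveness(blocks, successors)
print(live_out["B3"])  # {'y'}
```

Since x is not live after B3, the assignment there is never read again, which is the kind of fact a compiler or bug finder can act on.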
47. Data Flow Analysis – Case Study
Implemented in Bugalyze.
Example bug detection
In free(ptr), where is ptr used before it is reassigned, and is it used in a free?
Has found real bugs in Debian Linux.
Still a work-in-progress.
49. Final Remarks on Program Analysis
A wide and deep field.
Good to know the basic approaches.
Reversing is becoming more rigorous (think HexRays).
50. Conclusion
Academia has some useful techniques.
It’s good to know some of the basic methods.
They will improve industrial programs.
Any questions?