Mathieu Letourneau, Andrei Saygo, Eoin Ward, Microsoft
This talk presents our research project on clustering .NET files based on their basic blocks, and the parallel that can be drawn with DNA sequence variation analysis. We implemented a system that extracts the basic blocks of each file and creates clusters based on them. We also developed an IDA plugin that uses this data to speed up our analysis of .NET files.
Andrei Saygo, Eoin Ward and Mathieu Letourneau all work as Anti-Malware Security Engineers in the AM Scan team of Microsoft’s Product Release & Security Services group in Dublin, Ireland.
4. About us
We analyse files on a daily basis to determine whether they are malicious, including Windows 8 apps and Windows Phone apps.
For the past few years we have been involved in fields like bioinformatics, molecular biology and genetics, which has allowed us to take some of the ideas and algorithms used in the bio field and apply them to malware classification and detection.
About us
About DNA
.NET disassembler
Clustering
IDA plugin
6. About DNA
- DNA is made of four chemical building blocks called
nucleotides: adenine (A), thymine (T), cytosine (C) and
guanine (G).
- A three-nucleotide series (called a codon) in a DNA
sequence specifies a single amino acid.
- The DNA sequences are translated to amino acids that
produce proteins.
- Each DNA sequence that contains instructions to make a
protein is known as a gene.
moleculesoflife2010.wikispaces.com/Protein+Structure
7. About DNA sequence variation
The human genome comprises about 3 billion base pairs of DNA.
Due to various factors, mutations occur so the DNA sequence may change.
Single nucleotide polymorphisms, frequently called SNPs (pronounced “snips”), are
the most common type of genetic variation among people.
Each SNP represents a difference in a single DNA building block.
They can act as biological markers, helping scientists locate genes that are
associated with disease.
8. About GWAS
A genome-wide association study (GWAS) is an approach used in genetics research to
associate specific genetic variations with particular diseases.
The method involves scanning the genomes (around 1 million SNPs) of many different
people (healthy controls and carriers) and looking for genetic markers that can be used to
predict the presence of a disease.
The results of a GWAS are often displayed in a scatter plot (called a Manhattan plot),
in which the peaks indicate regions of the genome associated with that disease.
Manhattan plot showing the −log10 P values of 606,164 SNPs in a GWAS of 1,472 Japanese atopic dermatitis cases (atopic dermatitis, also known as atopic eczema, is a non-contagious itchy skin disorder) and 7,971 controls, plotted against their positions on the autosomes and the X chromosome.
www.nature.com/ng/journal/v44/n11/fig_tab/ng.2438_F1.html
9. The DNA code is read three letters at a time
(these DNA triplets are called codons).
Most codons correspond to a specific amino acid,
and several of the 64 codons can code for the
same one.
Three of the codons are used as 'stop' signals
(STOP codons) and another is the 'start' signal
(START codon).
This resembles the way a disassembler works:
the binary machine code is the DNA sequence
and the assembly instructions are the amino acids.
DNA sequence:       CCCTGTGGAGCCACACCCTAG
Split into codons:  CCC TGT GGA GCC ACA CCC TAG

Amino acids         CIL (MSIL) instructions
CCC - Proline       288B00000A  call
TGT - Cysteine      03          ldarg.1
GGA - Glycine       7D52000004  stfld
GCC - Alanine       02          ldarg.0
ACA - Threonine     04          ldarg.2
CCC - Proline       288B00000A  call
TAG - STOP          2A          ret
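To make the parallel concrete, here is a minimal Python sketch of the translation step, using only the codons that appear in the example above (a real codon table has 64 entries):

```python
# Minimal sketch: a DNA sequence is read three letters at a time,
# just as a disassembler reads a byte stream instruction by instruction.
# Codon table restricted to the codons in the example above.
CODON_TABLE = {
    "CCC": "Proline", "TGT": "Cysteine", "GGA": "Glycine",
    "GCC": "Alanine", "ACA": "Threonine", "TAG": "STOP",
}

def translate(dna):
    """Split a DNA string into codons and map each to its amino acid."""
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    return [CODON_TABLE[c] for c in codons]

print(translate("CCCTGTGGAGCCACACCCTAG"))
# ['Proline', 'Cysteine', 'Glycine', 'Alanine', 'Threonine', 'Proline', 'STOP']
```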
11. The CLR header can be reached from the IMAGE_DATA_DIRECTORY structure.
From there we have access to the offset of the MetaData header, which holds the
number of streams.
Immediately after come the headers for each stream contained in the file.
typedef struct CLR_HEADER
{
DWORD SizeOfStructure;
WORD MajorRuntimeVersion;
WORD MinorRuntimeVersion;
IMAGE_DATA_DIRECTORY MetaData;
…..
typedef struct METADATA_HEADER
{
…
WORD NumberOfStreams;
…..
typedef struct STREAM_HEADER
{
DWORD Offset;
DWORD Size;
unsigned char * Name;
…..
12. We are interested in #~ (the metadata stream) because it contains the
information about the methods.
- The #~ table header contains a QWORD bitmask that tells us which tables are
present in this stream (for example the TypeRef, TypeDef, MethodDef, Field,
etc. tables). Of these, we are interested in the MethodDef table because it
contains the RVAs of the method bodies.
- Following the #~ header we have a set of DWORDs specifying the number of
rows for each table that is present.
- After them we have the actual Metadata tables.
- The RVA within the MethodDef table tells us where the body of the method
can be found.
typedef struct TABLE_HEADER
{
DWORD Reserved;
WORD MajorVersion;
WORD MinorVersion;
…
QWORD ValidMask;
…..
typedef struct TABLE_METHODDEF
{
DWORD RVA;
WORD ImplFlags;
WORD Flags;
WORD NameIndex;
…..
13. For each method the RVA is the offset to the first instruction.
The Common Intermediate Language (CIL, formerly MSIL) instructions use a
variable-length encoding, in which 1 or 2 bytes represent the opcode.
We continue to disassemble from the first instruction until we reach RET (opcode
0x2A in CIL).
All the instructions are split into basic blocks, and from each instruction we
pick only the first operand (FOP).
A set of rules filters out garbage instructions.
We then compute a CRC over the list of FOPs and add it to the database.
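As an illustration (not the actual implementation), hashing a FOP sequence could look like the following sketch, which assumes each FOP is represented by its opcode byte and uses CRC32; the filtering rules for garbage instructions are omitted:

```python
import zlib

# Hypothetical sketch: hash the sequence of first operands (FOPs) of a
# basic block with CRC32 so it can be stored as a single ID in a database.
# The real system filters out garbage instructions first; those rules are
# not reproduced here.
def fop_crc(fops):
    """fops: list of opcode bytes (ints). Returns a CRC32 over the list."""
    return zlib.crc32(bytes(fops)) & 0xFFFFFFFF

# The basic block from the slide: call, ldarg.1, stfld, ldarg.0,
# ldarg.2, call, ret (first opcode byte of each instruction).
block = [0x28, 0x03, 0x7D, 0x02, 0x04, 0x28, 0x2A]
print(hex(fop_crc(block)))
```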
CIL(MSIL) FOPs
288B00000A call
03 ldarg.1
7D52000004 stfld
02 ldarg.0
04 ldarg.2
288B00000A call
2A ret
15. Clustering - basics
Feature set:
- CRCIDs representing the hashes of each FOP present in a given file
- Double[] file1 = [1, 32, 5673, 5674, 5675, 18001, …, 18607];
Distance measure:
- Jaccard index: size of the intersection divided by the size of the union of two sets.
- The variant we use: size of the smaller of the two sets divided by the size of the union.
- This gives a similarity value between 0 and 1; subtracting it from 1 gives us a distance measure.
16. Assume 0.01 s on average per distance computation.
A simplistic implementation would give a complexity of O(n²):
- computing the distance for every possible pair of files.
- For example, imagine having to cluster 1,500 files:
1500² × 0.01 = 22,500 s (6.25 hours)
This clearly doesn’t scale well.
17. Our mitigation techniques to improve speed:
Load all the files in memory and order them by the number of FOPs they contain.
Only compute the distance when the size ratio is within the threshold value; this
is possible thanks to the properties of our distance function.
Use prototypes for agglomerative clustering:
- In each cluster, the smallest file is elected as the “prototype” representing that
cluster.
- When clustering agglomeratively, new files are compared to the prototypes of
the existing clusters until we find a distance within the threshold; otherwise the
file starts a new cluster.
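The size-ratio check can be justified with a short bound: the union of two sets is at least as large as the bigger set, so the distance is at least 1 − |smaller| / |bigger|. If that lower bound already exceeds the threshold, the exact distance cannot be within it. A hypothetical sketch of the pruning test:

```python
# Size-ratio pruning: the distance is bounded below by
# 1 - len(smaller) / len(bigger), because the union is at least as large
# as the bigger set. If the bound exceeds the threshold, skip the pair.
def may_match(size_a, size_b, threshold):
    smaller, bigger = min(size_a, size_b), max(size_a, size_b)
    lower_bound = 1.0 - smaller / bigger
    return lower_bound <= threshold

print(may_match(90, 88, 0.30))  # True: 1 - 88/90 ≈ 0.022
print(may_match(90, 35, 0.30))  # False: 1 - 35/90 ≈ 0.611
```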
19.–37. Clustering animation – Threshold = 30%
[Animation slides: files with sizes 90, 35, 88, 87, 40 and 92 are ordered by size and clustered step by step; comparisons whose distance goes above the threshold (flagged “above threshold!”) leave those files in separate clusters.]
38. Clustering speed (threshold of 80%)
Number of files to cluster | Time taken to complete (seconds)
312  | 6.655
1000 | 81.644
1500 | 759.799
4604 | 945.557
7380 | 1941.852

Clustering speed (threshold of 20%)
Number of files to cluster | Time taken to complete (seconds)
840  | 3.058
1500 | 14
7380 | 35.475
39. The time taken to cluster the same 1,500 files from the previous example is now
drastically improved and scales with the threshold value:
- With the simplistic approach: 22,500 s
- With the mitigation techniques and a threshold of 80%: 760 s
- With the mitigation techniques and a threshold of 20%: 14 s
41. We need:
- a file from the database that we know is malicious (we’ve selected
Pameseg/ArchSMS)
- a loose cluster that the file is part of (we’ve selected a cluster of 399 files)
Algorithm:
- for each CRC present in the target file, extract the number of files in which
that CRC is present
- calculate the median and remove every CRC above it, based on the
assumption that the most prevalent CRCs are clean (they are also found in
clean files). After this step, 285 files remained.
- use the following formula to get the CRCs that are most probably malicious:
k – total number of CRCs
Nfi – number of files containing a specific CRC
p – the default p-value (0.05)
Di – distance of the specific CRC
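The scoring formula itself is not reproduced here, but the median-filtering step can be sketched as follows (the CRC values and counts are made up for illustration):

```python
from statistics import median

# Sketch of the median-filtering step: for each CRC present in the target
# file, count how many files in the database contain it, then drop every
# CRC whose prevalence is above the median, on the assumption that the
# most widespread basic blocks also occur in clean files. The scoring
# formula applied afterwards (using k, Nfi, p and Di) is not reproduced.
def filter_prevalent(crc_counts):
    """crc_counts: dict mapping CRC -> number of files containing it."""
    m = median(crc_counts.values())
    return {crc: n for crc, n in crc_counts.items() if n <= m}

counts = {0xAAAA: 2, 0xBBBB: 350, 0xCCCC: 5, 0xDDDD: 900, 0xEEEE: 40}
print(sorted(hex(c) for c in filter_prevalent(counts)))
# ['0xaaaa', '0xcccc', '0xeeee']
```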
42. Using the dataset from
gettinggeneticsdone.blogspot.com/2011/04/annotated-manhattan-plots-and-qq-plots.html
(200,000 SNPs) and applying the same approach, we get:
43. Applying the formula to our example dataset of 285 files (what was left after the
median filter), we got a result similar to the GWAS data.
We took the first two CRCs and ran a query for each one to see which files contain
them. The result was a set of 10 files, all of which were found to be malicious and
from the same family (Pameseg/ArchSMS).
51. Just as geneticists analyse genetic variants to identify their links to various
diseases, we have implemented a similar approach to help us automatically
identify malicious files.
52. The IDA plugin shows the areas of the code that
require more attention, which reduces the time needed
for manual analysis.
We can extend the clustering algorithm to other
features such as instructions, behaviour data, etc.
In the future we plan to extend the approach to other
types of files and other platforms.
53. Will this method be effective with packed files?
Will this method be effective with obfuscated .NET files?
Does the plugin improve analysis time?
Can the CRCs be used as part of generic detections / family
classification?
What is the effect of the speed mitigation strategies and of
using a derivative of the Jaccard index?
Other questions, thoughts, etc.
Speaker notes
First we have to talk a bit about how DNA is organised. As you may already know, DNA is made of four bases (A – adenine, T – thymine, C – cytosine and G – guanine). Three of these bases joined together form a codon, which specifies an amino acid.
There must be a precise order or sequence in the DNA, because this is used to make proteins that are responsible for most of the things that happen in our bodies.
The DNA sequence that codes for protein is known as a gene.
As a whole, the human genome comprises about 3 billion base pairs. I’ve said base pairs because DNA has two complementary strands; each nucleotide can pair up only with its complement (adenine only with thymine, and cytosine only with guanine).
Now, everything is fine when things are working properly, but due to various factors mutations appear and the DNA sequence changes – it’s dynamic.
Today we will discuss the single nucleotide polymorphism (aka SNPs). It’s the most common type of genetic variation and is represented by a change in a single nucleotide. For example Adenine changes to Cytosine.
These kinds of changes may actually be involved in various diseases, because the sequence that codes for a particular protein has changed and so the gene’s function is changed. Researchers use these SNPs as biological markers to locate genes that are associated with disease.
One way scientists search for biomarkers is the genome-wide association study (GWAS) approach. They scan hundreds of human genomes from both healthy people (the controls) and carriers, looking for SNPs that may predict the presence of a disease.
The results are then displayed in a scatter plot, called a Manhattan plot because it resembles Manhattan’s skyline (kind of).
The peaks indicate the SNPs that are most likely associated with a specific disease. For example, in this graph 600k SNPs were tested for more than 1,400 Japanese patients suffering from atopic dermatitis (an itchy skin disorder) and almost 8,000 controls. The peaks indicate SNPs that are associated with that disease.
Now we know why the DNA sequence is important and that any change in the nucleotides can modify the instructions that code for protein.
In a way it’s similar to having a sequence of opcodes where any change can modify some instruction and the result can be a different behaviour of a program.
Based on this similarity we built a .NET disassembler and developed a clustering algorithm that allows us to identify malicious code.
We are going to have a quick overview of the .NET file structure and see how we actually get to the code.
From the IMAGE_DATA_DIRECTORY of a .NET file we get to the CLR_HEADER that holds a similar type of structure for the MetaData. There we can get the number of streams and immediately after we have the structure for each stream. The STREAM_HEADER structure contains the name, offset and size for each stream.
Out of all streams we are interested in the metadata stream. This stream contains the Method Definition table that can get us to the code for each method.
Each Method Definition structure that follows the metadata stream contains a virtual address (RVA) which points to the first instruction for each method defined in the executable.
Using the Common Intermediate Language specs we developed the disassembler. It starts at the beginning of each method and keeps disassembling until it reaches the RET instruction. Afterwards we split everything into basic blocks (a sequence of instructions with a single entry point and a single exit point, so the instructions are executed exactly once, in order). From each instruction in a basic block we extract only the first operand, and once we have all of them we compute a CRC that is later added to a database.
There are two things to consider when we want to do clustering: the features to cluster on and the distance measure to use.
In our case, in line with what Andrei just explained, we used the CRCID of each FOP in our database.
Basically all the FOPs are linked to their CRC hash / length in our database and therefore all have a unique CRCID.
We build a list of CRCIDs for each file, so they can be represented as an array like we see here.
This intentionally keeps the feature representation generic so that we could use any other kind of feature to test different clustering approaches without having to change the clustering engine itself.
As for the distance measure, we used a derivative of the Jaccard index. Instead of dividing the size of the intersection by the size of the union of two sets, we divide the size of the smaller set by the size of the union of both sets.
This gives us a similarity value between 0 and 1; we then subtract it from 1 to get a distance measure.
You can see the basic function there with a simple implementation in C#.
The speed of the getDistance operation will of course vary depending on the amount of FOPs in each file, meaning the number of CRCIDs in each array.
Based on various tests, we can assume an average speed of 0.01s per distance computation.
A simplistic implementation of clustering, where we calculate the distance between every pair of files before running a clustering algorithm on this NxN matrix of distances, would give a complexity of n-squared.
For example, if we have to cluster 1,500 files we would get 1500² × 0.01 = 22,500 seconds, or 6.25 hours.
That clearly doesn’t scale well and as a result…
..we get some very bored analysts
Luckily, we don’t need to calculate the distance for each pair of files.
We implemented a few mitigation techniques that help us improve the speed considerably.
First, we load all the files in memory and order them by the amount of FOPs they contain. By loading the files, I mean loading the arrays of CRCIDs associated with each file.
Then, we compute the distance between two files only if their size ratio is within the threshold value. For example, if we want clusters with a maximum distance of 30% (0.30), the bigger of the two files can be at most 30% bigger than the smaller one for us to bother computing the distance. This is possible because the file size ratio feeds directly into the distance measure, so we know in advance that a size ratio beyond the threshold can never return a distance within the threshold.
Also, we support agglomerative clustering, meaning we can periodically add new files to the previously known clusters or create new ones if needed. This could be daily, hourly or every time a new file enters our system. To avoid having to reset the clusters each time, we use prototypes to represent each cluster and only compare the new file to the prototypes of the previously created clusters. We define the prototype of a cluster as the smallest file it contains (the one with the fewest FOPs/CRCIDs). You can see it as the lowest common denominator.
Here’s a basic overview of the clustering algorithm:
We first load all the file IDs and their respective CRCIDs in a dictionary.
We then sort that dictionary by the number of CRCIDs each file contains, smallest file first.
This operation is very quick and is the stepping stone for most of the time savings made later.
We take the first file (File1) in the dictionary and put it in a new cluster.
- We then loop over every subsequent file until the size ratio gets greater than the threshold value, calculating each file’s distance to the selected file.
- If we obtain a distance within the threshold value for a file (File2), we put it in the same cluster as File1 and remove it from the dictionary of files to cluster.
Once we reach the end of the possible matches for File1, we remove it from the dictionary and go back to the third step until all files have been checked.
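The steps above can be sketched as follows; this is a simplified illustration, not the production engine (the file IDs and CRCID sets are invented):

```python
# Sketch of the clustering loop described above: files (as sets of
# CRCIDs) are sorted smallest-first; each unclustered file seeds a new
# cluster and scans subsequent files, skipping pairs whose size ratio
# already rules out a match.
def cluster(files, threshold):
    """files: dict of file_id -> set of CRCIDs. Returns a list of clusters."""
    order = sorted(files, key=lambda f: len(files[f]))  # smallest first
    remaining = list(order)
    clusters = []
    while remaining:
        seed = remaining.pop(0)
        cluster_ids = [seed]
        kept = []
        for other in remaining:
            # Size-ratio pruning: the seed is the smallest remaining file,
            # so a ratio beyond the threshold can never yield a match.
            if 1.0 - len(files[seed]) / len(files[other]) > threshold:
                kept.append(other)
                continue
            union = files[seed] | files[other]
            dist = 1.0 - min(len(files[seed]), len(files[other])) / len(union)
            (cluster_ids if dist <= threshold else kept).append(other)
        clusters.append(cluster_ids)
        remaining = kept
    return clusters

demo = {"a": {1, 2, 3, 4}, "b": {1, 2, 3, 5}, "c": {10, 11, 12, 13, 14}}
print(cluster(demo, 0.30))  # [['a', 'b'], ['c']]
```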
Now here’s this algorithm explained with an animation.
Let’s say we want to cluster these 8 files with a distance threshold of 30% (or 0.30).
First step is to order them by size (or amount of Fops the file contain).
Then we can start clustering..
We ran a few tests to see the impact of the threshold value on the overall speed of clustering.
We can see here that with a “loose” threshold of 80%, the speed curve still looks exponential but is leaning towards a more linear form.
The same 1,500 files from the first example now take 760 seconds to cluster.
Then, with a more realistic threshold of 20%, we see a massive improvement in speed, which is now almost linear.
The set of 1,500 files now takes 14 seconds to cluster.
This slide demonstrates well the impact of the mitigation techniques we implemented to speed things up.
Just to recap the initial example: clustering the set of 1,500 files originally took 22,500 seconds with the simplistic n-squared approach.
With the mitigation techniques and a threshold value of 80% it takes 760 seconds to cluster, whereas it takes only 14 seconds with a threshold of 20%.
If you remember, in the first part of the presentation we talked about how scientists are using the Manhattan plot to identify the SNPs or bio markers used to predict a disease. Based on that approach and using the data that we’ve collected through our clustering algorithm, we’ve devised a similar technique.
Here is an example of how starting from a known malicious file that is part of a cluster of a few hundred files, we managed to identify only the ones that are malicious and even more, how we identified the code that is unique to the malicious files.
First we got the list of CRCs present in the target file and, for each one, counted the number of files that contain that CRC.
The second step was to calculate the median and remove everything above it, based on the assumption that the most prevalent CRCs are clean.
For each remaining CRC, using a formula similar to a chi-squared distribution with k degrees of freedom, we calculated the probability that the CRC is malicious.
But before we show you the results, here is an example using the same approach on a real set of genetic data, comprised of 200k SNPs. We can easily spot a few peaks that show that they are relevant to the disease.
In our set of 285 files we got a similar result. A few CRCs (which are actually basic blocks) were identified as possibly malicious. To verify our result, we queried the database for the files that contain those CRCs. The query returned another 9 files, and after analysis we saw that they are part of the same malware family.
We haven’t stopped here though. In the next slides we will see how we developed a plugin for IDA to help and speed up our analysis.
The idea behind the IDA Python plugin is to use the information generated by the clustering to assist analysis.
The plugin does this by calculating the basic blocks and then the FOPs of these basic blocks.
It then takes this information and polls the database for the determinations on these basic blocks.
It uses these determinations to colour the basic block and also adds a comment that gives the breakdown of the determination.
Finally, it appends to the comment the FOP length and FOP CRC; these can be used to get the MD5s of files containing this basic block by polling the database.
You can then use these files for comparative analysis.
These are the basic steps:
First pass -> we retrieve the basic blocks and FOPs, and calculate the length and CRCs of the FOPs.
We poll the database for the comments, and for each CRC we get the determination for all the files that contain it.
Second pass -> we recalculate the basic blocks and this time colour and comment each basic block.
If you have looked at the .NET Intermediate Language, you will constantly see pushes and pops to the stack.
This is the evaluation stack, and it is used for managing state information in .NET, similar to how x86 uses registers.
It is best to really think of the CLR as a stack machine.
In this vein, .NET uses the .maxstack directive to reserve space on the stack for its calculations.
The exact space reserved is not so relevant, just that there is enough of it.
For that reason the .maxstack directive is not that relevant to the program’s functionality, so IDA Pro does not disassemble it.
Our native C disassembler did parse it, so we had discrepancies between our IDA plugin output and our native C output.
After some investigation we discovered that there is little value in the .maxstack directives with regard to the functionality of a basic block.
So we ignore them, and we also ignore .NET NOPs for similar reasons.
The second major issue came with the time taken to query the database.
To improve performance we added a table that gives us a single query for each CRC.
We believe we can improve performance further with stored procedures and an ordering of this table, as well as improvements in the Python code itself.