3. Conclusions
Python tools for text classification can easily be
adapted for malware classification.
When using instruction ngrams, the choice of disassembler
and analysis passes strongly affects the results.
References: http://bit.ly/scipy-malware
25. Named Features
Section names, imports, imported functions.
Extracted these features with regular expressions.
Features were (awkwardly) selected in the same
step as instruction ngrams.
26. Named Features
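import re

# Each named feature has a regex to find it, a function that extracts the
# name from the match, and a predicate applied to the extracted name.
re_features = {
    # Imported library names
    "imports" : {
        "re" : re.compile(r"Imports from \w.+"),
        "extract" : lambda m : m.group().split()[-1],
        "filter" : lambda name : True
    },
    # Imported function names; skip auto-generated sub_ names
    "imported_functions" : {
        "re" : re.compile(r"__stdcall \w.+\("),
        "extract" : lambda m : m.group().split()[-1][:-1],
        "filter" : lambda name : not name.startswith("sub_")
    },
    # Section name: the prefix of the line up to the first colon
    "section_names" : {
        "re" : re.compile(r"^\S+?:"),
        "extract" : lambda m : m.group()[:-1],
        "filter" : lambda name : True
    }
}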
27. Named Features
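from toolz import pipe, unique
from toolz.curried import map, filter

def process_re_feature(lines, re_dict):
    # Lazily yields the unique feature names found in lines.
    return pipe(
        lines,
        map(re_dict["re"].search),         # match object or None per line
        filter(lambda m : m is not None),  # keep only lines that matched
        map(re_dict["extract"]),           # match object -> feature name
        filter(re_dict["filter"]),         # drop unwanted names
        unique                             # de-duplicate, keeping first-seen order
    )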
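The pipeline above can be sketched without toolz to show what each stage does. This is a stdlib-only stand-in, not the code from the talk: the helper name extract_named_feature and the sample lines are made up for illustration, and the lines only imitate the general style of a disassembly listing.

```python
import re

# Hypothetical listing lines, invented for this example.
sample_lines = [
    ".text:00401000 ; Imports from KERNEL32.dll",
    ".idata:00402000 ; BOOL __stdcall CloseHandle(HANDLE hObject)",
    ".idata:00402004 ; int __stdcall sub_401000(int)",
    ".text:00401005 push ebp",
]

def extract_named_feature(lines, pattern, extract, keep):
    """Search each line, extract a name from each match, filter it,
    and de-duplicate while keeping first-seen order."""
    seen = []
    for line in lines:
        m = pattern.search(line)
        if m is None:
            continue  # line does not contain this feature
        name = extract(m)
        if keep(name) and name not in seen:
            seen.append(name)
    return seen

imports = extract_named_feature(
    sample_lines,
    re.compile(r"Imports from \w.+"),
    lambda m: m.group().split()[-1],
    lambda name: True,
)
functions = extract_named_feature(
    sample_lines,
    re.compile(r"__stdcall \w.+\("),
    lambda m: m.group().split()[-1][:-1],      # drop the trailing "("
    lambda name: not name.startswith("sub_"),  # skip auto-generated names
)
sections = extract_named_feature(
    sample_lines,
    re.compile(r"^\S+?:"),
    lambda m: m.group()[:-1],                  # drop the trailing ":"
    lambda name: True,
)
print(imports, functions, sections)
```

The explicit loop trades the lazy, composable style of pipe for readability; for large listings the generator-based version avoids the quadratic membership check on the seen list.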
34. Winning Strategies
xgboost
malware as an image
compression ratio as a feature
other expanded feature sets
probability calibration
semi-supervised learning
usable in a product
specific to competitions
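As an illustration of "compression ratio as a feature", here is a minimal sketch. The slides do not say which compressor or which definition of the ratio was used, so zlib and compressed size divided by raw size are assumptions, and the function name compression_ratio is hypothetical.

```python
import os
import zlib

def compression_ratio(data: bytes) -> float:
    """Compressed size divided by raw size (assumed definition).
    Values near 0 mean highly repetitive data; values near or above 1
    mean the data is already dense, e.g. packed or encrypted."""
    if not data:
        return 0.0
    return len(zlib.compress(data, 9)) / len(data)

# Two synthetic inputs standing in for file contents.
repetitive = b"\x90" * 4096      # padding-like bytes compress very well
random_like = os.urandom(4096)   # random bytes barely compress at all

print(compression_ratio(repetitive), compression_ratio(random_like))
```

A single number like this is cheap to compute per file, which is presumably why it is attractive as an extra feature alongside ngram counts.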
35. Using Capstone
ida ******************************
CV Scores: [ 0.03800 0.02551 0.05283 0.03953 0.0350 ]
mean: 0.03817940685733493 std: 0.008799619405211161
capstone ******************************
CV Scores: [ 0.05065 0.0451 0.06953 0.05583 0.05089]
mean: 0.05441113231562615 std: 0.008283830117670508
from capstone import Cs, CS_ARCH_X86, CS_MODE_32

# Each line of the .bytes file is an address followed by hex byte pairs;
# drop the address, strip "?" placeholders, and decode the hex to raw bytes.
code = bytes(bytearray.fromhex("".join(map(
    lambda l : "".join(l.split()[1:]).replace("?", ""),
    open("data/sample/0A32eTdBKayjCWhZqDOQ.bytes", "r")
))))

# Single linear pass over the bytes; keep only mnemonics (t[2]), skipping int3.
md = Cs(CS_ARCH_X86, CS_MODE_32)
instructions = " ".join(
    [t[2] for t in md.disasm_lite(code, 0x1000) if t[2] != "int3"]
)
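The instructions string built above is a space-separated sequence of mnemonics, so it can be treated like a text document. A minimal sketch of counting instruction ngrams from such a string follows; it uses collections.Counter as a stdlib stand-in for a text-classification vectorizer, and the function name instruction_ngrams and the sample mnemonic sequence are made up for illustration.

```python
from collections import Counter

def instruction_ngrams(instructions: str, n: int = 2) -> Counter:
    """Count overlapping n-grams of whitespace-separated mnemonics."""
    tokens = instructions.split()
    return Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )

# Hypothetical mnemonic sequence, not taken from a real sample.
sample = "push mov sub mov push mov"
bigrams = instruction_ngrams(sample, 2)
print(bigrams.most_common(3))
```

Because the counts depend entirely on the mnemonic sequence, any difference in how a disassembler splits code from data changes the features directly.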
36. Disassemblers
IDA: not (easily) batch distributable
capstone: single pass produces suboptimal results
radare2: Python-scriptable reversing framework
vivisect: pure Python, largely undocumented disassembler and analysis project
37. Other Projects
pefile: extracts header information from executables
binglide: visualizations of entropy and byte ngrams
cuckoo: automated dynamic analysis
barf: binary analysis framework with code analysis
38. Conclusions
Python tools for text classification can easily be
adapted for malware classification.
When using instruction ngrams, the choice of disassembler
and analysis passes strongly affects the results.
References: http://bit.ly/scipy-malware