9. Pre-‐BinaryPig
-‐
Processing
Issues
• • No Data Locality.
• • Node failures were catastrophic.
• • Hard to add new analysis scripts.
10. Pre-‐Binary
Pig
-‐
Data
Explora=on
Issues
• • How can I share my findings for greater fame and glory?
• • Create a table schema for every analysis script?
• • RDBMS failure is worse than zombie apocalypse.
11. We
needed
a
system
that...
• • Scales to our historical data
• • Recovers from failures
• • Grows through scripting
• • Supports dynamic schemas
• • Searchable via the web
16. BinaryPig
-‐
Processing
• • Hadoop - robust, distributed, with data locality
• • Apache Pig - Extensible, simple
17. BinaryPig
-‐
Results
Explora=on
• • UI - turns your grandma into a data scientist
• • Elasticsearch - schemaless, replicated, awesome
18. Yet
Another
Framework?
• Malware
tools
didn't
scale
• Hadoop
does
not
play
well
with
small
binary
files
• Hadoop
did
not
integrate
exis5ng
malware
analysis
tools
22. BinaryPig
Loaders
• Converts
raw
data
to
a
tuple
• Execu5ng
Loader
o Executes
a
specific
script/program
on
a
file
wriIen
to
a
logical
path
o Example:
Hashing
• Daemon
Loader
o Writes
binaries
to
a
path,
and
provides
those
paths
to
an
already
running
analysis
process
o Example:
Clamd
23. Op=miza=ons
in
BinaryPig
• To
leverage
pre-‐exis5ng
tools,
we
had
to
write
malware
binaries
to
the
local
filesystem
on
the
worker
nodes
o Note:
local
copy,
not
network
copy
o We
op5mized
this
to
use
/dev/shm/
instead
• Quick
scripts
are
great
for
rapid
itera5on,
but...
o Interpreter
startup
5me
can
dominate
execu5on
5me
o Crea5ng
small,
long
running,
analy5c
daemons
provides
a
huge
speedup
for
frequently
used
tasks
o i.e.
the
clamscand
model
of
execu5on
25. strings.sh:
#!/bin/bash
strings "$@"
BinaryPig:
Scrip=ng
strings.pig:
define Loader com.endgame.binarypig.loaders.ExecutingTextLoader;
data = LOAD '$INPUT' USING Loader('strings.sh');
DUMP data;
26. BinaryPig
supports
non-‐PE32
files
• Handles more than just malware...
o Image analysis
o PDF data extraction
o APK extraction
o Any small binary files
31. Feature
Extrac=on
• Our
core
mo5va5on
was
to
dras5cally
improve
the
experience
of
valida5ng
research.
• Packer
iden5fica5on
o Overall
and
Sec5onal
Entropy
o Kolmogorov
Complexity
o Sec5on
and
resource
names
o Sec5on
flags
• Import
tables
• Func5on
Calls
• Resource
hashes
and
subfeatures
32. Feature
Depth
• PEHeaders
are
shallow
o Easy
to
manipulate
o Less
resolu5on
than
reverse
engineering
features
o File
metadata
is
also
low
resolu5on
• Headers
provide
excellent
fast
features
• Headers
are
oben
ignored
• Work
the
analysis
around
the
feature
resolu5on
o Ignore
5ght
clusters,
go
for
wide
ones
o Triage,
not
true
classifica5on
33. Clustering
Results
• Triage
for
dynamic
analysis
winnowing
• Largest
cluster:
377,882
samples
o Three
malware
families
contained
within
o Second
largest:
124,894
samples
• Valida5on
is
tricky
o Manual
valida5on
cannot
be
en5rely
avoided
o Cluster
meanings
change
with
feature
sets
o Cannot
just
go
off
of
AV
results
35. Icon
Features
• Pixel
based
features
o Brightness
o Color
values
o Pixel
density
• Cryptographic
and
fuzzy
hashing
o Perceptual
hashes
• Edge
detec5on
36. Icon
Results
• Icon
clustering
o Groups
do
not
just
include
family
lines
o Copycat
malware
is
shown
as
well
o Clear
indica5ons
of
malicious
intent
• Method
of
infec5on
can
be
extrapolated
o Phishing
o Obfuscated
executables
o Adware
(more
than
we
expected)
o False
posi5ves
-‐
popular
sobware
detec5on
37. Lessons
Learned
• Feature
Selec5on
o Over
500
features
in
PEheader
alone
o Abundance
of
features
requires
pruning
• Interpreta5on
and
Valida5on
of
Results
o Manual
valida5on
is
an
unfortunate
reality
o Care
has
to
be
taken
to
ensure
that
unsupervised
learning
provides
meaningful
results
40. • Rapid Iteration
• Feature extraction
• Clustering analysis for rapid malware triage
• Enables weekly AV scans with latest signatures over
previous malware.
• Created binary classifier to improve sample collection
and categorize new samples
So
What?
41. Future
work
• • Compatibility with Pig 0.10.* and 0.11.*
• • EC2 tutorial
• • More examples/starter scripts
o • Inclusion of some of our Mahout tasks
o • Open source process for that is moving forward
• • Better error logging and handling
o Messages should be stored in a separate DB
• • Easier deployments
o • Analytic daemons
o • Dependency libs
o • Fabric/Salt/Puppet/Chef
42. BinaryPig
is
Open
Source!
https://github.com/endgameinc/binarypig
Apache 2 License