Releases the blar.py tool, which creates a genomic encoding from text files. The encoding is a lossy, highly compressible representation of the original file that can be used for rapid anomaly detection and forensic analysis.
5. Basics
• Letters (nucleotides)
– 4 in DNA: A, G, C, T
• Codons
– Triplets of nucleotides e.g. GAA
• Genomes have coding regions (proteins)
& non-coding regions (other)
• One strand can be read forward, the other
in reverse
6. It’s all about the Codons
• The Genetic Code is a dictionary of
Codons
• 64 entries (4^3)
8. Analyzing Genomes
• Compare them to each other
– Alignments (e.g. Smith-Waterman)
– Distances
• Levenshtein (edit) distance (metric)
• Longest Common Subsequence distance (metric)
• Normalized Compression Distance (metric; sketched below)
– Optimal Grammars
• Pisa.c: Optimal sequence grammar search using
hyperstring encodings
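A minimal sketch of the Normalized Compression Distance listed above, using bzip2 as the compressor; the toy sequences are made up for illustration:

import bz2

def ncd(x, y):
    # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    # where C() is the compressed length in bytes
    cx, cy = len(bz2.compress(x)), len(bz2.compress(y))
    cxy = len(bz2.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

print(ncd(b"GATTACA" * 100, b"GATTACA" * 100))  # low: the sequences are identical
print(ncd(b"GATTACA" * 100, b"CCTAGGA" * 100))  # higher: the sequences share little structure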
9. Analyzing Genomes
• Look for interesting regions
– Information gain (Kullback-Leibler divergence; sketched below)
– Coding Costs (Kolmogorov Complexity)
– Decaying Coding Costs (Lossy Kolmogorov
Complexity)
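A minimal sketch of the information-gain idea: score a region by the Kullback-Leibler divergence of its letter frequencies against the whole genome. The unigram model and add-one smoothing are simplifying assumptions, not necessarily what blar.py does:

import math
from collections import Counter

def information_gain(region, genome, alphabet="ACGT"):
    # D(P || Q) = sum over letters of P(x) * log2(P(x) / Q(x)),
    # with add-one smoothing so unseen letters stay finite
    p, q = Counter(region), Counter(genome)
    tp, tq = len(region) + len(alphabet), len(genome) + len(alphabet)
    return sum(((p[a] + 1) / tp) * math.log2(((p[a] + 1) / tp) / ((q[a] + 1) / tq))
               for a in alphabet)

print(information_gain("GGGGGGGG", "ACGTACGTACGTACGT"))  # high: region looks nothing like the background
print(information_gain("ACGTACGT", "ACGTACGTACGTACGT"))  # ~0: region matches the background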
14. Don’t say that again
• Sections of DNA that do not repeat are the
most important
• Protein coding genes and RNA coding
genes are non-repetitive
• Higher-order organisms’ genomes are largely
repetitive
16. Putting the squeeze on
• Normal compressors achieve ~2-bit codes per base
• Special genetic compressors exist
• Compressibility equates to sequence predictability for the model in use (demo below)
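A quick demo of both claims, using bzip2 as a stand-in for a normal compressor (the sequences are made up): a random four-letter stream stays near the 2-bits-per-base floor, while a predictable one collapses far below it.

import bz2
import random

random.seed(1)
predictable = ("GATTACA" * 2000).encode()
unpredictable = "".join(random.choice("ACGT") for _ in range(14000)).encode()

for name, seq in [("predictable", predictable), ("random", unpredictable)]:
    # Bits per base after compression: near (a bit above) 2 for the random
    # stream, a small fraction of that for the predictable one
    print(name, round(8 * len(bz2.compress(seq)) / len(seq), 2))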
18. A Question
If we could convert sequences of logs,
packets, etc. to a genomic encoding, could
we use genomic analysis to dramatically
speed up & improve forensics, incident
response and anomaly detection?
20. How?
• Step 1: Convert events into alphabet
• Step 2: Convert stream into string of
letters
• Step 3: Money bath
21. A Naïve Solution
• Step 1: Hash each input, use hash value
as a letter
• Step 2: Create stream of hash values
• Step 3: #fail
Why?
22. Answer
• The alphabet is too big
• The stream will need at least
2^(2^<hash_key_size>) examples
• Stream is virtually unpredictable
24. WTF is a ‘blarp’?
• Let’s ask Google
• The sound a fat person makes being fat
• The sound of taking big fat data and
making it useful & efficient small data
• A cool little python tool for creating and
analyzing genomic encodings
• The last two will not be found on Google…yet
25. Idea
• We want similar events to be represented
by a single letter
• Hashes are random projections
• Let’s use geometry instead
26. Position in space
• To precisely locate something in a D-dimensional space, you need distances to n = D+1 reference points (e.g. in 2-D, three distances pin a point down)
• Key notion: to get something’s general area, you can use n << D+1 reference points
27. Locality-Sensitive Hashing
• Introduced in the late ’90s (Indyk & Motwani; related MinHash work at AltaVista)
• Used within indexing for text lookups on massive data sets
• Many hashes; data-type dependent
• Question: What if you thought about it as a
‘general area’ hash instead?
28. How it works
• Basic type: Random Projection
• Given a numeric vector (e.g. [1, 15, 3, 14.8]), calculate its dot product against a random vector
• If the result is positive, call it a ‘1’
• If negative, call it a ‘0’
• Repeat
• Concatenate the bits; the result is the LSH (toy example below)
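A toy run of the recipe above on the example vector; the random vectors and seed are arbitrary, for illustration only (blar.py keeps its own fixed set of comparison vectors):

import numpy as np

rng = np.random.default_rng(0)                 # arbitrary seed
v = np.array([1, 15, 3, 14.8])
random_vectors = rng.standard_normal((4, 4))   # four projections -> a 4-bit hash

# The sign of each dot product becomes one bit of the locality-sensitive hash;
# nearby vectors tend to fall on the same side of each random hyperplane
bits = [1 if np.dot(r, v) >= 0 else 0 for r in random_vectors]
print(bits)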
30. Vectorizing
• Idea: Count things that matter, take
measurements, etc. and create an array to
hold that information
• Where the rubber meets the road
• Lots of chances for domain expertise
31. Basic Vectorizing in Blar.py
• Basic model: character n-grams
• Also known as Markov chains or Bag of
Letters
• Counts up sliding windows of text
• E.g. 2-grams for ‘sassyfrassy’ (tallied in the sketch below):
sa: 1 as: 2 ss: 2 sy: 2 yf: 1 fr: 1 ra: 1
• For a 256^2-length array: (1, 0, …, 0, 2, 0, …, 0, 2, 0, …)
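A minimal sketch of that sliding-window tally with a plain Counter (blar.py itself goes straight to feature hashes, shown on the code slide below):

from collections import Counter

def char_ngrams(s, n=2):
    # Slide an n-character window across the string and count each gram
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

counts = char_ngrams("sassyfrassy")
# Matches the tally above: as/ss/sy twice each, sa/yf/fr/ra once each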
32. Let’s Vectorize Better
• Use feature hashing, otherwise known as the hashing trick
• For each model pattern, find its hash mod length and increment that counter
• Permits lossy counting with graceful
random collisions
• Blar.py uses length 64 by default and
xxHash
33. Blar.py code
import numpy
import xxhash

def feature_hash_string(s, window, dim):
    # Generate window-char Markov chains & create feature hashes
    chains = [xxhash.xxh32(s[i:i + window]).intdigest() % dim
              for i in range(len(s) - (window - 1))]
    # Initialize counter array
    counters = numpy.zeros(dim)
    # Count instances of feature hashes
    for chain in chains:
        counters[chain] += 1
    # Return feature hash count vector
    return counters
34. Now let’s find the LSH
# Use random projection for LSH and output a UTF char for the
# locality-sensitive hash
def locality_hash_vector(v, width):
    # COMP_VECTORS holds the precomputed random comparison vectors, one per bit
    bits = numpy.zeros(width, dtype=int)
    for x in range(width):
        projection = numpy.dot(COMP_VECTORS[x], v)
        if projection < 0:
            bits[x] = 0
        else:
            bits[x] = 1
    # Return unicode char equal to the LSH
    return chr(int(''.join(map(str, bits)), 2))
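One way to wire the two functions together end to end. COMP_VECTORS is the set of random comparison vectors the LSH projects against; the initialization, seed, and file name below are assumptions for illustration, not necessarily how blar.py sets them up:

import numpy

WINDOW, DIM, WIDTH = 4, 64, 4             # blar.py defaults: 4-char windows, 64-d features, 4-bit LSH
numpy.random.seed(242)                    # arbitrary fixed seed so letters stay reproducible
COMP_VECTORS = numpy.random.randn(WIDTH, DIM)

def encode_line(line):
    # One event line in, one genome letter out
    return locality_hash_vector(feature_hash_string(line, WINDOW, DIM), WIDTH)

genome = ''.join(encode_line(line) for line in open('events.log'))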
35. Blar.py analysis
• Analyzes 4-character sequences and assigns a decaying version of the optimal coding cost to each line (one possible sketch below)
• Tells you how interesting a certain event is relative to everything else in the genome, accounting for ordering
• Blar.py genomes are extremely compressible, especially with bzip2
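A minimal sketch of one way to realize a decaying coding cost: score each window by its optimal code length given the counts seen so far, then decay the score as the same window repeats. This illustrates the idea only; blar.py's actual scoring may differ:

import math
from collections import defaultdict

def decaying_coding_costs(genome, window=4, decay=0.9):
    counts = defaultdict(int)
    scores = []
    for i in range(len(genome) - window + 1):
        w = genome[i:i + window]
        seen = counts[w]                            # occurrences before this one
        cost = -math.log2((seen + 1) / (i + 2))     # add-one smoothed code length
        counts[w] += 1
        scores.append(cost * decay ** seen)         # repeats decay toward zero
    return scores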
36. Blar.py defaults (ATM)
• 4-character sliding windows
• 4-bit hashes
• 64-dimensional feature hashes
• Outputs a list of the most interesting
scores
• Outputs a few bad charts
37. Blar.py vs. Toy File
1. Mary had a little lamb whose fleece was white as snow.
2. Mary had a little lamb whose fleece was white as snow.
3. Mary had a little lamb whose fleece was white as snow.
4. Mary had a little lamb whose fleece was white as snow.
5. Mary had a little lamb whose fleece was white as snow.
6. Gary had a little hand whose hair was as white as blow.
7. some more strings
8. some more strings
9. some more strings
10. some more strings
11. some more strings
12. John McAfee was the keynote for Skytalks.
13. John McAfee was the keynote for Skytalks.
14. John McAfee was the keynote for Skytalks.
15. some more strings
16. some more strings
17. some more strings
18. John McAfee was the keynote for Skytalks.
19. John McAfee was the keynote for Skytalks.
20. FOO BAR BAS
39. Blar.py vs. Toy File
(Look Raffy, I’m using the completely inappropriate chart type)
40. Blar.py vs. BlueGene/L
• From the Usenix Computer Failure Data
Repository
• 1.2GB combined log file from 131,072 processors, covering six months
• 119MB compressed with gzip
• 9.4MB blar.py genome
• Blar.py ~1000 lines/sec
43. TL;DR
• Fast, accurate, free: the Blar.py genomic encoding tool provides very fast, low-noise anomaly detection
• Stop searching in a crisis: Great way to
quickly explore data for IR, forensics, etc.,
especially from unknown sources
• Want it? Follow me @conduit242 for the
GitHub posting announcement