This is the presentation I gave at VizSec 2014 on our information-theoretic method for anomaly detection. The conference was held in Paris in November 2014.
3. Motivation
• Anomaly detection is often hard, and context-sensitive
• We usually don’t have enough annotated training data, and annotation itself is uncertain
• Many different techniques exist
• The human ideally should be in the loop
• The visual analytics loop!
4. Aims
• To develop an anomaly detection method that
  • Is context-sensitive
  • Does not rely on supervised learning
  • Can be expanded and refined easily by the user when needed
  • Is not cost-prohibitive to run, and is linearly scalable
6. Information is Additive
• Notion: the amount of information is given by the number of possible answers
• Roll a fair die: 6 outcomes, equiprobable
• What if I roll it n times?
• Taking the logarithm of the number of outcomes makes information additive
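The equation from this slide did not survive in the transcript. As a worked sketch of the standard argument (the choice of base 2, giving bits, and the symbol I are assumptions made here): n rolls of a fair die have 6^n equally likely outcomes, and the logarithm turns that product into a sum.

```latex
\begin{align*}
I(\text{one roll}) &= \log_2 6 \approx 2.585 \text{ bits} \\
I(n \text{ rolls}) &= \log_2 6^n = n \log_2 6 = n \cdot I(\text{one roll})
\end{align*}
```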
7. But… few things are equiprobable!
• Most dice are biased
• Most coins, too
8. Let’s Play a Game
• 1/3 chance of getting the ball
• What, then, is the amount of information in the answer?
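A minimal sketch of how the question could be answered per outcome, assuming the game has a yes/no answer with P(ball) = 1/3 and that information is measured in bits. The function name `surprisal_bits` is hypothetical, not taken from the talk.

```python
import math


def surprisal_bits(p: float) -> float:
    """Information, in bits, gained on learning that an event of probability p occurred."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)


# Assumed setup: the answer is either "ball" (p = 1/3) or "no ball" (p = 2/3).
p_ball = 1 / 3
print(f"ball:    {surprisal_bits(p_ball):.3f} bits")
print(f"no ball: {surprisal_bits(1 - p_ball):.3f} bits")
```

The rarer answer carries more information than the common one, which is why a single number for the whole game needs an average over both.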
9. Defining the Total Information
• Average over all outcomes, i.e. weight each outcome's information by its probability
• More generally, this weighted average is the Shannon entropy of the distribution
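The slide's equations were lost in extraction. The probability-weighted average described above is the Shannon entropy, H = -Σ p_i log2 p_i. The sketch below computes it in bits; the function name `entropy_bits` and the example distributions are illustrative assumptions, not material from the talk.

```python
import math
from typing import Sequence


def entropy_bits(probs: Sequence[float]) -> float:
    """Shannon entropy in bits: each outcome's information weighted by its probability."""
    if any(p < 0 for p in probs) or not math.isclose(sum(probs), 1.0):
        raise ValueError("probs must be non-negative and sum to 1")
    # Outcomes with p == 0 contribute nothing (the limit of p * log p is 0).
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Assumed examples: the 1/3 vs 2/3 game and a fair six-sided die.
print(f"game:     {entropy_bits([1 / 3, 2 / 3]):.3f} bits")
print(f"fair die: {entropy_bits([1 / 6] * 6):.3f} bits")
```

For equiprobable outcomes this reduces to the logarithm of the number of outcomes, so the fair-die case agrees with the counting notion of information.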
16. Goals
• An effective UI for designing QCATs
• The visual analytics loop (right) is ideal for this
• Primarily this system would be used by the model designer
• A modified version for the analyst, with additional tool support
• A simplified visualisation (e.g. time-series) for the observers
[Figure: the visual analytics loop, with nodes Visualisation, Knowledge, Models (QCATs), and Data]
17. CMU-CERT Dataset
• http://www.cert.org/insider-threat/tools/index.cfm
• Contains known ground-truth for insider threat scenarios
• Each event linked to a user
• Email (20m): time, user, machine, to (inc. CC, BCC), from, size, number of attachments, content
• Web (3.5m): time, user, machine, url
• Device (1.24m): time, user, machine id, [insert/remove]
• Logon/off (2.6m): time, user, machine, [logon/logoff]
30. Discussion and Future Work
• Future work
• Understanding how mutual information can be represented
• Choice of information-theoretic measure still an issue
• Binning strategies and assisted bin design