In 2004, Bill Gates told a select group of participants in the World Economic Forum that “two years from now, the spam issue will be solved.” Eight years later, the spam problem is only getting worse, with no sign of relief. Big Data technologies such as Hadoop, MapReduce, Cassandra, and real-time stream processing can be leveraged to develop new approaches that fight spam, phishing, and other email-borne threats more effectively than before. This session will focus on the development of radical new “spam anomalytics” techniques, whereby billions of messages and message-related events are analyzed daily to find statistical norms, and to identify deviations from those norms, in order to better detect and defend against email threats as they emerge.
4. Spam Technology is better …
Spam detection effectiveness has vastly improved to ~99.5%.
5. Spam Volumes are Down …
[Chart: Overall Message Volume, August 2010 to August 2011 (monthly, Sep-10 through Aug-11)]
Spam volumes near 12 month low
6. But the game is changing …
Spam is still an annoyance, but companies are seeing
• An increase in sophisticated, malicious, targeted attacks
• More diverse organizations being hit with targeted, smaller-scale attacks
Traditional Security Architectures aren’t keeping pace with evolving malicious attacks
• Outdated detection and prevention technology
• Lack of innovation from large vendors
7. International Data Trafficking: Stolen Data is Now a Marketable Commodity
• In past 5 years a sophisticated market in stolen data has emerged
• Diverse buyers
• Sophisticated suppliers
• Vast impact of cybercrime, industrial espionage

Data Type                                               Street Value
Valid Email Address                                     $1 per 10,000
Username / Password / Emails from Compromised Website   $1 per 1,000
Credit Card #                                           $2-$90
Medical Record                                          $8-$20
Bank Credentials                                        $80+
Rent 100 Botnet Infected Machines                       $700+/month
Celebrity Medical Record                                $1000+
Admin Access to High-traffic Compromised Website        $3,000+
Company Financials, Intellectual Property               $10,000+ - ???

US intelligence agencies estimate cost of lost business due to theft of technology and business ideas: $100 - $250 billion/year
8. Result: An Epidemic of Breaches
A Small Sampling of Recent Breaches
• DOE Laboratories: April & July 2011
• International Monetary Fund: June 2011
• Epsilon: April 2011
• RSA: March 2011
• Securities and Exchange Commission: May 2011
• Human Services Agency of San Francisco: Feb 2011
• Austrian Police Agencies: August 2011
• Hyundai Capital (South Korea): April 2011
12. What are the Best of Breed Solutions?
Reputation
• IP, URL, Domain, Registrar, Sender, Receiver, …
• Local and Global automation
• Super fast, but blunt edged
Content
• Words, Phrases, Patterns, RegEx …
• Definitions built using machine-learning and Data Analysts
• Trainable by Language
Spam Traps
• Millions of messages
• Thousands of domains
Verification
• Recipient, DKIM, …
• Local integration
• Rejects lessen load on system and provide evidence
13. The nasty little secret:
You have to see it to defend against it
via: spam traps • honey points • honey pots • reported
14. The nasty little problem:
The time difference between an email hitting your server and it reaching your vendor’s spam trap introduces risk to your organization
15. Anomalytics
Current Solutions/Problems
Introducing Anomalytics
Details, details …
16. The new challenge:
Not a needle in a haystack . . . a needle in a needle stack
Organizations see things Vendors never see (extremely targeted attacks)
Organizations see new threats before Vendors can train on them
Organizations have amazingly rich data as a by-product of processing their email … that can be used to help stop new threats the first time they appear
17. We have to flip the model …
Existing techniques try to understand “Bad”
• Always a half-step behind
• Can be defeated by changing pattern
Anomalytics: Model “Good” to find the abnormal
• Find faster - anything outside normal is suspect
• Hard to defeat - Normal is both dynamic and variable
18. Surface level data from email messages …
Attribute           Value
spf_result          Fail
attachment_count    1
attachment_size     255
charset             WINDOWS-1252
country             ua
ip                  94.153.252.70
InsertionDate       2011-04-26 12:37:05
SmtpHelo            94-153-252-70-kh.ip.kyivstar.net
SmtpHostIp          94.153.252.70
msgsize             261
ip reputation       100
virus score         18
number recipients   1
adult score         81
bulk score          1
phish score         0
spam score          97
sender              <onewayhash>@domain.com
recipient           <onewayhash>@bestspecials.biz

Evidence gathered: "HELO_DYNAMIC_IPADDR2", "MISSING_HEADERS", "MISSING_SUBJECT", "PP_ATTACHMENT_TXT", "PP_FROM_NOANGLES", "PP_HAS_RCVD", "PP_IMG_COUNT_0", "PP_IP_COUNTRY_UA", "PP_IP_SCORE_100", "PP_MIME_PLAINTEXT_ONLY", "PP_NO_CTE", "PP_NO_CTYPE", "PP_NO_MSGID", "PP_NO_MUA", "PP_RCVD_FROM_HOME_ISP", "PP_TO_NOANGLES", "TO_CC_NONE"
19. Hadoop/Anomalytics enables …
Senders/Receivers: Understanding User Trends/Behavior
• Circle of Trust
• Sender Analytics
• Receiver Analytics
• …
Domain/Company: Understanding Domain Trends/Behavior
• Domain to sending IP mappings
• Domain % spam sent
• Domain forensics (SPF/DKIM/Headers/etc.)
• …
Infrastructure: Understanding Infrastructure Trends/Behavior
• % email spam from IP
• Average message size from IP (and message size distribution)
• Average number of recipients per message from IP
• …
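The infrastructure-level metrics above (% spam from IP, average message size, average recipients per message) can be sketched as a single aggregation pass. This is a minimal illustration, not the production job: the field names `spam_score` and `num_recipients` and the spam threshold of 50 are assumptions made for the example, and a real deployment would run the same logic as a MapReduce or Hive job rather than in memory.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class IpStats:
    """Running totals for one sending IP."""
    messages: int = 0
    spam: int = 0
    total_size: int = 0
    total_recipients: int = 0

    @property
    def spam_pct(self) -> float:
        return 100.0 * self.spam / self.messages if self.messages else 0.0

    @property
    def avg_size(self) -> float:
        return self.total_size / self.messages if self.messages else 0.0

    @property
    def avg_recipients(self) -> float:
        return self.total_recipients / self.messages if self.messages else 0.0


def aggregate_by_ip(events, spam_threshold=50):
    """Fold message events into per-IP statistics.

    Each event is a dict with the (assumed) keys
    'ip', 'spam_score', 'msgsize' and 'num_recipients'.
    """
    stats = defaultdict(IpStats)
    for event in events:
        s = stats[event["ip"]]
        s.messages += 1
        if event["spam_score"] >= spam_threshold:  # threshold is an assumption
            s.spam += 1
        s.total_size += event["msgsize"]
        s.total_recipients += event["num_recipients"]
    return dict(stats)


events = [
    {"ip": "94.153.252.70", "spam_score": 97, "msgsize": 261, "num_recipients": 1},
    {"ip": "94.153.252.70", "spam_score": 10, "msgsize": 739, "num_recipients": 3},
]
per_ip = aggregate_by_ip(events)
```

Keeping sums and counts rather than averages means partial results from different time windows can be merged by simple addition.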
20. Circle of Trust: Build a Ledger
Build explicit counts:
• (Sender, Receiver)
• (Sender, Group)
• (Sender, Company)
• (Receiver, Sender)
• (Receiver, Sender Domain)
• (Receiver Group, Sender)
• (Receiver Group, Sender Domain)
• (Receiver Domain, Sender)
• (Receiver Domain, Sender Domain)

Receiver   Sender User   Sender Group   Sender Domain
User       0/3           n/a            0/56
Group      0/21          n/a            0/127
Domain     1/79          n/a            4/9,215
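A ledger of this kind could be sketched as a counter keyed by receiver level and sender level. The class below is illustrative only: `TrustLedger`, `record_outbound`, `lookup`, and the `group_of` directory mapping are hypothetical names, and the sketch counts outbound messages only, so it does not reproduce the paired figures (such as 1/79) shown in the slide's table.

```python
from collections import Counter


class TrustLedger:
    """Counts how often internal users, groups and domains have mailed a sender."""

    def __init__(self, group_of):
        # group_of: lowercase internal address -> group name (assumed directory lookup)
        self.group_of = group_of
        self.counts = Counter()

    @staticmethod
    def _domain(addr):
        return addr.rsplit("@", 1)[-1]

    def _receiver_levels(self, user):
        return [("user", user),
                ("group", self.group_of.get(user)),
                ("domain", self._domain(user))]

    def _sender_levels(self, sender):
        # The sender's group is unknown to us, hence only user and domain.
        return [("user", sender), ("domain", self._domain(sender))]

    def record_outbound(self, internal_user, external_addr):
        """Record that internal_user sent a message to external_addr."""
        u, s = internal_user.lower(), external_addr.lower()
        for r_level, r_value in self._receiver_levels(u):
            if r_value is None:  # user has no known group
                continue
            for s_level, s_value in self._sender_levels(s):
                self.counts[(r_level, r_value, s_level, s_value)] += 1

    def lookup(self, receiver, sender):
        """Return prior-contact counts for an incoming message, one per ledger cell."""
        r, s = receiver.lower(), sender.lower()
        return {
            (r_level, s_level): self.counts[(r_level, r_value, s_level, s_value)]
            for r_level, r_value in self._receiver_levels(r)
            for s_level, s_value in self._sender_levels(s)
        }


ledger = TrustLedger({"alice@corp.example": "eng", "bob@corp.example": "eng"})
ledger.record_outbound("alice@corp.example", "carol@partner.example")
cells = ledger.lookup("bob@corp.example", "carol@partner.example")
```

Here Bob has never written to Carol himself, but his group and his company have, so the group-level and domain-level cells are non-zero while the user-level cells stay at zero.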
21. Circle of Trust: A Friend of a Friend
[Diagram: A -Trust- B -Trust- C -No Trust- D]
“A Trusts C: A friend of a friend is my friend.”
Extra Credit: Well known machine learning algorithms allow you to build a “friend of friends” solution
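The "friend of a friend" idea can be illustrated with a bounded graph search over trust edges. This is a deliberately simple stand-in, a breadth-first search limited to two hops, and not the machine-learning approach the slide alludes to; the function names and the two-hop limit are assumptions for the example.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Build a directed trust graph from (truster, trusted) pairs."""
    graph = defaultdict(set)
    for src, dst in edges:
        graph[src].add(dst)
    return graph


def trusts(graph, src, dst, max_hops=2):
    """Return True if dst is reachable from src within max_hops trust edges."""
    frontier = deque([(src, 0)])
    seen = {src}
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in graph.get(node, ()):
            if nxt == dst:
                return True
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, hops + 1))
    return False


# A trusts B and B trusts C; nobody trusts D.
graph = build_graph([("A", "B"), ("B", "C")])
a_trusts_c = trusts(graph, "A", "C")               # friend of a friend
a_trusts_d = trusts(graph, "A", "D")               # no path to D
direct_only = trusts(graph, "A", "C", max_hops=1)  # direct trust only
```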
22. Circle of Trust (cont.)
Whether you, your group, or your company has previously sent email to the sender of an incoming message is a strong indicator of whether that message is “normal” or not
23. Anomalytics: Looking for norms…
Use historical data to build a picture of what is normal for any feature at a specific time of day on a specific day of the week
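One way to express such a norm is a mean and standard deviation per (day of week, hour) bucket, with new observations scored by how many standard deviations they sit from the bucket mean. The sketch below assumes hourly counts as the feature and a z-score as the deviation measure; both are illustrative choices, and `min_std` is an assumed floor that keeps very flat histories from producing extreme scores.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev


def build_baseline(observations):
    """Group (timestamp, value) pairs by (weekday, hour) and summarize each bucket."""
    buckets = defaultdict(list)
    for ts, value in observations:
        buckets[(ts.weekday(), ts.hour)].append(value)
    return {key: (mean(vals), pstdev(vals)) for key, vals in buckets.items()}


def deviation(baseline, ts, value, min_std=1.0):
    """Return the z-score of value against its bucket, or None if no history exists."""
    key = (ts.weekday(), ts.hour)
    if key not in baseline:
        return None
    mu, sigma = baseline[key]
    return (value - mu) / max(sigma, min_std)


# Three earlier Tuesdays at noon, e.g. messages per hour from one sender domain.
history = [
    (datetime(2011, 4, 5, 12), 100),
    (datetime(2011, 4, 12, 12), 110),
    (datetime(2011, 4, 19, 12), 90),
]
baseline = build_baseline(history)
score = deviation(baseline, datetime(2011, 4, 26, 12, 37), 200)  # a Tuesday at noon
unknown = deviation(baseline, datetime(2011, 4, 27, 12), 200)    # a Wednesday: no history
```

A large positive score flags the Tuesday spike as far outside the norm for that slot, while the Wednesday lookup returns `None` because no baseline exists yet.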
24. Applying Big Data Analysis . . .
Spear phishing example:
Neither you, nor anyone on your team, nor anyone in your company has ever sent email to the sender or the sender’s domain
The sender’s IP has unknown reputation AND is associated with a suspicious registrar AND was published less than 24 hours ago
This sender has sent 5 emails in 5 minutes to your company, all to your group
The content contains a URL that has never been seen before and has an extremely low Alexa ranking
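The four observations in this example could be combined as independent indicators that are each derived from the collected data. The sketch below is a hypothetical rule, not a description of any product's scoring: the `MessageSignals` fields, the burst threshold of 5 messages, and the "all four must fire" policy are assumptions taken from the wording of the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MessageSignals:
    """Pre-computed facts about one incoming message (all fields are assumed inputs)."""
    prior_contact: bool           # anyone in the company ever mailed the sender or domain
    ip_reputation_known: bool
    registrar_suspicious: bool
    published_hours_ago: float    # age of the sending infrastructure's publication
    burst_count_5min: int         # messages from this sender to the company in 5 minutes
    burst_single_group: bool      # all of those went to one group
    url_seen_before: bool
    url_rank_extremely_low: bool  # e.g. derived from a site-ranking service


def spear_phish_indicators(sig: MessageSignals) -> List[str]:
    """Return the names of the indicators that fire for this message."""
    fired = []
    if not sig.prior_contact:
        fired.append("outside_circle_of_trust")
    if (not sig.ip_reputation_known and sig.registrar_suspicious
            and sig.published_hours_ago < 24):
        fired.append("fresh_suspicious_infrastructure")
    if sig.burst_count_5min >= 5 and sig.burst_single_group:
        fired.append("targeted_burst")
    if not sig.url_seen_before and sig.url_rank_extremely_low:
        fired.append("novel_low_rank_url")
    return fired


def looks_like_spear_phish(sig: MessageSignals) -> bool:
    """Flag only when every indicator fires (an assumed, conservative policy)."""
    return len(spear_phish_indicators(sig)) == 4


suspect = MessageSignals(False, False, True, 6.0, 5, True, False, True)
known_sender = MessageSignals(True, False, True, 6.0, 5, True, False, True)
```

Returning the list of fired indicators, rather than a single boolean, keeps the evidence available for later review.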
25. Behavioral Analysis using Big Data
Behavioral
• Cloud/Big Data solutions leveraged to catch evolving threats
• Can leverage any high level facet of a message to compute rates, norms, deviations, clusters
Reputation
• IP, URL, Domain, Registrar, Sender, Receiver, …
• Local and Global automation
• Super fast, but blunt edged
Content
• Words, Phrases, Patterns, RegEx …
• Definitions built using machine-learning and Data Analysts
• Trainable by Language
Spam Traps
• Millions of messages
• Thousands of domains
Verification
• Recipient, DKIM, …
• Local integration
• Rejects lessen load on system and provide evidence
26. Overview
Current Solutions/Problems
Introducing Anomalytics
Details, details …
27. Architectural Overview
[Architecture diagram. Labels: Customer Datacenters; Proofpoint Datacenters; Legacy Spam Filter Appliance; Hosted Spam Filter; Legacy Systems; Aggregator; Amazon AWS; Scorer; Collector; Hive + Hadoop MR; S3 (Model Repository, Event Repository). Flows: email traffic events, FN/FP events, scoring requests.]
• EC2 for compute
• S3 for long-term storage
• Servers and applications built on top of Proofpoint Platform
• HTTP-based APIs
• Deployments and application lifecycle managed via Galaxy
• Other AWS technologies: ELB, ElasticMapReduce (+ Hive), CloudFormation
28. Architectural Overview
[Pipeline diagram. Labels: Email traffic events and FN/FP events (Legacy XML); Normalizer; JSON; Event Collector; Local Spool; Snappy-compressed JSON; Staging Area (S3) for Email traffic and FN/FP; Combiner; Event Repository (S3).]
• Transform from legacy hierarchical XML format into JSON
• Canonicalize URLs, email addresses
• Annotate with additional features (ASN, nameserver for sender IP)
• Forward to generic event collection layer
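The transform and canonicalize steps can be sketched with the Python standard library. The legacy XML schema is not shown in the slides, so the element names in the sample (`event`, `smtp`, `sender`, `url`) are invented for illustration, as are the helper names; the annotation step (ASN, nameserver) is omitted because it needs external lookups.

```python
import json
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit, urlunsplit


def canonicalize_email(addr):
    """Strip angle brackets and lowercase the domain part."""
    addr = addr.strip().strip("<>")
    if "@" not in addr:
        return addr
    local, _, domain = addr.rpartition("@")
    return f"{local}@{domain.lower()}"


def canonicalize_url(url):
    """Lowercase scheme and host, default the path to '/', drop the fragment."""
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))


def flatten(elem, prefix=""):
    """Flatten a hierarchical element into dotted keys (repeated tags overwrite)."""
    key = f"{prefix}.{elem.tag}" if prefix else elem.tag
    children = list(elem)
    if not children:
        return {key: (elem.text or "").strip()}
    flat = {}
    for child in children:
        flat.update(flatten(child, key))
    return flat


def normalize_event(xml_text):
    """Convert one legacy XML event into a canonicalized JSON string."""
    flat = flatten(ET.fromstring(xml_text))
    for key, value in flat.items():
        leaf = key.rsplit(".", 1)[-1]
        if leaf in ("sender", "recipient"):   # assumed field names
            flat[key] = canonicalize_email(value)
        elif leaf == "url":                   # assumed field name
            flat[key] = canonicalize_url(value)
    return json.dumps(flat, sort_keys=True)


sample = (
    "<event><smtp><sender>&lt;Bob@Example.COM&gt;</sender>"
    "<ip>94.153.252.70</ip></smtp>"
    "<url>HTTP://Example.COM/a#x</url></event>"
)
normalized = json.loads(normalize_event(sample))
```

Canonicalizing before storage means later jobs can group by address or URL without each one repeating the cleanup.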
29. Data Collection and Storage
Tradeoff
• S3 files are immutable, write-once, and not available for reads until "complete"
• Ability to process new data as soon as possible requires writing small files
• … but Hadoop is more efficient at processing large files
Solution:
• Local spool in collectors (1 minute or 512 MB)
• Upload to staging area in S3
• Compressed using Snappy
– Framing format supports concatenated compressed files
– Pure Java implementation: https://github.com/dain/snappy
• Simulate "append" by repeatedly concatenating staged files into hourly buckets (or 512 MB)
• S3 multipart upload API with references to existing files in S3
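The local spool rule (flush after 1 minute or 512 MB, whichever comes first) can be sketched as a small buffer with two thresholds. This is an in-memory illustration under stated assumptions: the `Spool` class and its `upload` callback are hypothetical, Snappy compression and the actual S3 upload are left out, and the clock is injectable only so the behavior can be exercised without waiting.

```python
import time


class Spool:
    """Buffers records and hands them to `upload` when too old or too large."""

    def __init__(self, upload, max_age_s=60.0,
                 max_bytes=512 * 1024 * 1024, clock=time.monotonic):
        self._upload = upload        # callable taking one bytes object
        self._max_age_s = max_age_s
        self._max_bytes = max_bytes
        self._clock = clock
        self._chunks = []
        self._size = 0
        self._opened_at = None       # time the first buffered record arrived

    def append(self, record: bytes) -> None:
        if self._opened_at is None:
            self._opened_at = self._clock()
        self._chunks.append(record)
        self._size += len(record)
        self.flush_if_due()

    def flush_if_due(self) -> bool:
        """Flush when either threshold is reached; call periodically from a timer."""
        if self._opened_at is None:
            return False
        too_old = self._clock() - self._opened_at >= self._max_age_s
        too_big = self._size >= self._max_bytes
        if too_old or too_big:
            self.flush()
            return True
        return False

    def flush(self) -> None:
        if self._chunks:
            self._upload(b"".join(self._chunks))
        self._chunks, self._size, self._opened_at = [], 0, None


# Exercise the spool with a fake clock and a tiny size limit.
now = [0.0]
uploads = []
spool = Spool(uploads.append, max_age_s=60, max_bytes=10, clock=lambda: now[0])
spool.append(b"abc")          # below both thresholds, stays buffered
spool.append(b"defghijk")     # 11 bytes buffered, size threshold reached
spool.append(b"x")
now[0] = 61.0
flushed_by_age = spool.flush_if_due()
```

Because each upload is a complete file, a downstream combiner can concatenate staged files into larger hourly buckets without rewriting their contents.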
30. Processing
Elastic MapReduce
Custom MR jobs over S3 files
Hive jobs over external tables
• JSON Serde: https://github.com/proofpoint/hive-serde
Final output into S3
31. Building RESTful Services
Toolkit for building Java-based web services and applications
Mostly "glue" for common Java technologies: JAX-RS (Jersey), HTTP (Jetty), JSON (Jackson), JMX
Some abstractions to produce applications with uniform:
• service discovery
• configuration
• logging
• monitoring hooks
• event generation
• packaging and deployment
Applications deployable via Galaxy
• https://github.com/dain/galaxy-server
Support for Rails apps (via JRuby)
https://github.com/proofpoint/platform
1. Over the 2010 holidays, spammers seemed to take a vacation as spam levels dropped precipitously.
2. Botnets seem to have come back online shortly after the new year.
3. The large drop in observed spam volume in March coincides with the takedown of the “Rustock” botnet.
4. Spam levels continue to be bursty, and have started to climb again in the past quarter.
5. Most recently, Proofpoint has observed the same increase in infected attachments reported by some other vendors.
http://www.guardian.co.uk/technology/2011/sep/21/cybercrime-spam-phishing-viruses-malware?INTCMP=SRCH
“In 2009, the White House suggested that cybercrime and industrial espionage inflicted damage of around $1tn (£640bn) a year – almost 1.75% of global GDP. Can it be true? The answer is that, whatever anyone may say, nobody has the faintest idea. The $1tn could be a wildly exaggerated figure put out there by the cyber security industry in order to generate sales. Or it could be the result of some hyperactive algorithms. Or it could be true. But nobody can assert with any confidence which it is.” --- Misha Glenny
Ref: Misha Glenny, presentation to the RSA London, 9/15/11, “Dark Market: Cyber thieves, cyber cops and you”
http://www.thersa.org/events/audio-and-past-events/2011/dark-market-cyber-thieves,-cyber-cops-and-you
One of the most authoritative sources: 2011 Verizon Business Cybercrime Report: “A study conducted by the Verizon RISK Team with cooperation from the U.S. Secret Service and the Dutch High Tech Crime Unit.”, April 2011
http://www.verizonbusiness.com/resources/reports/rp_data-breach-investigations-report-2011_en_xg.pdf
Panda Security: "The Cyber-Crime Black Market: Uncovered", January 2011
http://press.pandasecurity.com/wp-content/uploads/2011/01/The-Cyber-Crime-Black-Market.pdf
* Coverage here:
http://www.creditcards.com/credit-card-news/credit-card-fraud-price-list-1282.php
http://krebsonsecurity.com/2011/02/eharmony-hacked/
http://www.ft.com/intl/cms/s/0/ba6c82c0-2e44-11e0-8733-00144feabdc0.html#axzz1Z69RPtcC
The DOE, IMF and
May 18, 2011. The Securities and Exchange Commission, Denver, Colorado. GOV DISC 4,000
On May 4, a contractor working for the Interior Department's National Business Center accidentally sent an unencrypted email. There was a security feature in the system software that was designed to prevent such mistakes, but it failed to stop the email from going through. Any information in the unencrypted email was vulnerable for about 60 seconds. The email contained agency employee Social Security numbers and other payroll information. Information Source: Databreaches.net
February 5, 2011. Human Services Agency of San Francisco, San Francisco, California. GOV INSD 2,400
A former city employee emailed the information of her caseload to her personal computer, two attorneys and two union representatives. The former employee wanted proof that she was fired for low performance because she had been given an unusually high number of cases. Certain MediCal recipients in San Francisco had their names, Social Security numbers and other personal information exposed. Information Source: PHIPrivacy.net. Records from this breach used in our total: 2,400
Austrian Police Agencies: August 2011. Personal data on 25,000 police officials published by Anonymous.
Hyundai Capital (South Korea): April 2011. Korea’s Financial Supervisory Service slaps Hyundai Capital with prohibition on buying stakes in other companies for 3 years after personal info on 1.8 million customers compromised.
OBJECTIVE: Mobile Devices are real risks for phishing attacks