Session presented at Mind The Sec on August 26, 2015.
This session takes a light-hearted technical look at open and commercial threat intelligence feeds, which the security market has been treating as the new panacea for the challenges of monitoring and incident response. Even though not all of the threat intelligence market can be reduced to indicator feeds, they have attracted enough attention to deserve a scientific, fact-based analysis so that decision makers can maximize the results obtained from the available data.
Over the last 18 months, Niddel has been collecting threat intelligence indicators from multiple sources in order to understand the ecosystem and develop efficiency and quality metrics for evaluating the different feeds. We will present factual, data-driven analyses of statistical bias, overlap, representativeness, indicator age and uniqueness across the different feeds. All the data used will be published, and the code that generates the charts is available in the open source projects Combine and TIQ-Test. These are the same techniques and analyses behind Niddel's contribution to the 2015 Verizon Data Breach Investigations Report (DBIR), one of the most respected information security reports in the world.
This presentation will also show aggregate usage data from threat intelligence sharing communities, in order to identify real usage patterns and the concerns and benefits that managers can expect from this kind of initiative.
Data-Driven Threat Intelligence: Metrics on Indicator Dissemination and Sharing
1. DATA-DRIVEN THREAT INTELLIGENCE: METRICS ON INDICATOR DISSEMINATION AND SHARING
Alexandre Sieira
Alex Pinto
2. Agenda
• Cyber War… Threat Intel – What is it good for?
• Combine and TIQ-Test
• Measuring indicators
• Threat Intelligence Sharing
• Future research directions (i.e. will work for data)
HT to @RCISCwendy
3. Presentation Metrics!!
50-ish Slides
3 Key Takeaways
2 Heartfelt and genuine defenses of Threat Intelligence Providers
1 Prediction on “The Future of Threat Intelligence Sharing”
5. What is TI good for anyway?
TY to @bfist for his work on http://sony.attributed.to
6. What is TI good for (2) – Cyber Maps!!
TY to @hrbrmstr for his work on https://github.com/hrbrmstr/pewpew
7. What is TI good for anyway?
• (3) How about actual defense?
• Strategic and tactical: planning
• Technical indicators: DFIR and monitoring
8. Affirming the Consequent Fallacy
1. If A, then B.
2. B.
3. Therefore, A.
1. Evil malware talks to 8.8.8.8.
2. I see traffic to 8.8.8.8.
3. ZOMG, APT!!!
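The fallacy above is really a base-rate problem, which a few lines of Bayes' rule make concrete. All the probabilities below are invented for illustration (8.8.8.8 is Google's public DNS resolver, so benign traffic to it is overwhelmingly common):

```python
# Base-rate sketch of why "traffic to 8.8.8.8" is weak evidence of malware.
# All numbers here are assumed for illustration, not measured values.

p_malware = 0.001                # prior: fraction of hosts that are infected
p_traffic_given_malware = 0.9    # infected hosts that talk to 8.8.8.8
p_traffic_given_clean = 0.5      # clean hosts that also talk to 8.8.8.8

# Bayes' theorem: P(malware | traffic) =
#   P(traffic | malware) * P(malware) / P(traffic)
p_traffic = (p_traffic_given_malware * p_malware
             + p_traffic_given_clean * (1 - p_malware))
posterior = p_traffic_given_malware * p_malware / p_traffic
print(f"P(malware | traffic to 8.8.8.8) = {posterior:.4f}")
```

Even with generous assumptions, seeing the indicator alone leaves the probability of infection well under 1%: observing B tells you very little about A.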
10. Combine and TIQ-Test
• Combine (https://github.com/mlsecproject/combine)
• Gathers TI data (ip/host) from Internet and local files
• Normalizes the data and enriches it (AS / Geo / pDNS)
• Can export to CSV, “tiq-test format” and CRITs
• Coming Soon™: CybOX / STIX / SiLK / ArcSight CEF
• TIQ-Test (https://github.com/mlsecproject/tiq-test)
• Runs statistical summaries and tests on TI feeds
• Generates charts based on the tests and summaries
• Written in R (because you should learn a stat language)
12. Using TIQ-TEST – Feeds Selected
• Dataset was separated into “inbound” and “outbound”
TY to @kafeine and John Bambenek for access to their feeds
13. Using TIQ-TEST – Data Prep
• Extract the “raw” information from indicator feeds
• Both IP addresses and hostnames were extracted
14. Using TIQ-TEST – Data Prep
• Convert the hostname data to IP addresses:
• Active IP addresses for the respective date (“A” query)
• Passive DNS from Farsight Security (DNSDB)
• For each IP record (including the ones from hostnames):
• Add asnumber and asname (from MaxMind ASN DB)
• Add country (from MaxMind GeoLite DB)
• Add rhost (again from DNSDB) – most popular “PTR”
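A minimal sketch of this enrichment step (not Combine's actual code): the dictionaries stand in for the MaxMind databases and Farsight DNSDB, and the `enrich` helper, field names and sample IP are all hypothetical:

```python
# Hypothetical stand-ins for the MaxMind ASN/GeoLite databases and the
# Farsight DNSDB pDNS source; the real pipeline queries those services.
ASN_DB = {"203.0.113.7": (64500, "EXAMPLE-AS")}     # hypothetical ASN data
GEO_DB = {"203.0.113.7": "BR"}                      # hypothetical GeoIP data
PDNS_PTR = {"203.0.113.7": "host7.example.net"}     # hypothetical pDNS "PTR"

def enrich(ip, date, feed):
    # Attach asnumber/asname, country and rhost to one IP indicator,
    # mirroring the enrichment fields listed in the slide above.
    asnumber, asname = ASN_DB.get(ip, (None, None))
    return {
        "entity": ip, "type": "IPv4", "date": date, "source": feed,
        "asnumber": asnumber, "asname": asname,
        "country": GEO_DB.get(ip), "rhost": PDNS_PTR.get(ip),
    }

row = enrich("203.0.113.7", "2015-05-31", "feed_a")
```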
21. Population Test
• Let us use the ASN and GeoIP databases that we used to enrich our data as a reference for the “true” population.
• But, but, human beings are unpredictable! We will never be able to forecast this!
24. Can we get a better look?
• Statistical inference-based comparison models (hypothesis testing)
• Exact binomial tests (when we have the “true” population)
• Chi-squared proportion tests (similar to independence tests)
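As a sketch of the first kind of test: the one-sided exact binomial p-value can be computed straight from the binomial distribution, with no stats package. The feed count (80 of 100 indicators) and the population share (60%) below are made-up numbers:

```python
from math import comb

def binom_test_upper(k, n, p0):
    # One-sided exact binomial test: P(X >= k) for X ~ Binomial(n, p0).
    # Low p-values mean the feed's proportion for a country/ASN is
    # significantly above its share of the reference "true" population.
    return sum(comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(k, n + 1))

# Hypothetical example: 80 of 100 indicators come from country X,
# but X holds only 60% of the reference IP population.
p_value = binom_test_upper(80, 100, 0.60)
```

A tiny p-value here says the feed's country mix is not a random draw from the population, which is exactly the kind of bias the slides are measuring.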
30. Uniqueness Test
• “Domain-based indicators are unique to one list between 96.16% and 97.37% of the time”
• “IP-based indicators are unique to one list between 82.46% and 95.24% of the time”
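The uniqueness metric itself is simple to reproduce: count how many indicators appear in exactly one feed. The feeds and indicators below are hypothetical:

```python
from collections import Counter

def uniqueness(feeds):
    # Fraction of all observed indicators that appear in exactly one feed.
    counts = Counter()
    for indicators in feeds.values():
        counts.update(set(indicators))  # count each feed at most once
    unique = sum(1 for c in counts.values() if c == 1)
    return unique / len(counts)

# Toy feeds with hypothetical indicators:
feeds = {
    "feed_a": ["evil.example", "bad.example", "shared.example"],
    "feed_b": ["shared.example", "worse.example"],
    "feed_c": ["lonely.example"],
}
ratio = uniqueness(feeds)  # 4 of 5 indicators appear in only one feed
```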
39. Herd Immunity…
… would imply that if others in your sharing community were immune to malware A, you wouldn’t get it even if you were still vulnerable to it.
40. Threat Intelligence Sharing
• How many indicators are being shared?
• How many members actually share, and how many just leech?
• Can we measure that? What a super-deeee-duper idea!
41. Threat Intelligence Sharing
We would like to thank the kind contribution of data from the fine folks at Facebook ThreatExchange and ThreatConnect…
… and also the sharing communities that chose to remain anonymous. You know who you are, and we ❤ you too.
42. Threat Intelligence Sharing – Data
For the period of 2015-03-01 to 2015-05-31:
- Number of indicators shared, per day and per member
- Not sharing the raw data – privacy concerns for the members and communities
54. More Takeaways (I lied)
• Analyze your data. Extract more value from it!
• If you ABSOLUTELY HAVE TO buy Threat Intelligence or data, evaluate it first.
• Try the sample data, replicate the experiments:
• https://github.com/mlsecproject/tiq-test-Summer2015
• http://rpubs.com/alexcpsec/tiq-test-Summer2015
• Share data with us. I’ll make sure it gets proper exercise!
55. Alex Pinto
Chief Data Scientist
MLSec Project
@alexcpsec
@MLSecProject
Alexandre Sieira
CTO
Niddel
@AlexandreSieira
@NiddelCorp
“The measure of intelligence is the ability to change.”
- Albert Einstein
THANK YOU!
Editor's Notes
During this presentation I’ll be making a quick introduction on some important concepts and our views on what threat intelligence is and the many useful things that can and cannot be done with it.
Then, Mr. Pinto will present a couple of open source tools made available at the MLSec Project GitHub repository and an analysis of the metrics we gathered from a set of publicly available and private threat intelligence feeds.
Based on what you read in the media and on some online marketing material, attribution is one of the sexiest parts of threat intelligence.
If you understand your adversaries, their intent and capabilities, there is a lot you can do within your risk management and even business decisions to better prepare yourself.
However, if you want to be realistic you’ll have to admit that as in the case demonstrated here, attribution is really hard to do.
This is reflected in the cost of getting good CTI data with attribution.
Also it is not reasonable to expect that any significant proportion of attacks out there will be attributed at all, given the sheer amount of smaller threat actors out there.
This is beautifully illustrated by the Sony breach controversy.
So data science to the rescue! The good humored sony.attributed.to website creates plausible attribution reports for the Sony hack.
It is based on data sampled from actual attack data in the VERIS and DBIR databases, consistent with how frequently threat actors, locations and methods are actually used.
One other really common application of threat intelligence is building a threat map.
After all, how else will upper management know that your team really has what it takes to prevent pandas, bears and even maybe capybaras from infiltrating your networks?
Who knows what those samba-dancing Brazilian hackers are up to, after all?
Fear not, now there’s an open source threat map that you can use called Pew pew.
And you don’t even need to stand up your own honeypots. Pew pew displays plausible attack patterns that are sampled randomly from public data made available by Arbor Networks.
So it is exactly as useful as the real thing.
But seriously, what about using the threat intelligence data to actually defend organizations?
The high level data is awesome to help with the high level decisions, of course.
However, how do you go about using the technical indicators?
It’s straightforward enough to pull a list of IP addresses, domain names and URLs into SIEMs or IDS rules.
As Gavin Reid mentioned in his talk yesterday, this can be a great way to reduce the time it takes to detect a novel threat and bypass the detection, rule writing and policy update cycle of your sensors.
However, this has to be done with great care.
Threat intelligence feeds are mostly providing indicators of things that malware does: for example IP addresses, domain names and URLs they communicate with.
Knowing this can be invaluable for all sorts of investigation and forensics activities.
But we really believe the holy grail is detection.
A breach detection example of the fallacy: some malware observed talking to a destination does NOT mean that all communication towards that destination is indicative of malware.
Also, you need to evaluate the quality and the applicability of the indicators you are consuming (and perhaps paying for) to decide which combination of sources is optimal for your organization.
Now Mr. Pinto will tell you a bit about two open source projects we develop to help you perform such an evaluation.
2014 was not a leap year
150k outbound
300k inbound
450k × 365 -> 164,250,000 (~164 million)
28 feeds
- For the hostname / domains feeds:
- We extracted the "active" IP addresses for those hostnames on the dates they were reported (using pDNS from Farsight Security)
- Passive DNS query of active "A" responses on the reported day (from 00:00 to 23:59)
- For this experiment, we got rid of "non-public IPs" (localhost, RFC1918)
- Then for each IP record (including the ones obtained above) enrich it with:
- asnumber and asname (from MaxMind ASN DB)
- country (from MaxMind GeoLite DB)
- rhost - the more popular reverse DNS entry ("PTR") from passive DNS on that date
- we are not playing around with this in this talk; the data is too sparse
NOVELTY:
Always request a trial of the data feed (15/30 days)
Measure addition and churn
OVERLAP
OVERLAP
OVERLAP
POPULATION:
We will use the ASN and GEO databases as our population
- They should cover all the existing IPv4 space, give or take a few anomalies
With this, we can simulate a draw from this population for the people who are going to attack us, because human beings are unpredictable and we will never be able to forecast where the attacks are coming from, right? :troll:
-----
COUNTRY / ASN
(graphs of country proportions on different feeds)
Inbound – Turkey WINS!
US is highly above the population average, and CN is slightly below
----
HYPOTHESIS TESTING OF PROPORTIONS
(explain exact binom test vs. Chi-squared)
----
HYPOTHESIS TESTING OF PROPORTIONS
(explain exact binom test vs. Chi-squared)
-> Explain the differences
- These differences describe a higher probability of specific actors targeting you from specific locations (GEO/ASN) in relation to a completely random actor.
- This was one of the first features I used for MLSec Project
Catch 100 fish in a pond
Tag them all and release them
Catch 100 more fish – how many are tagged? 5? -> the pond has ~20x more fish
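The fish note is the classic Lincoln-Petersen mark-recapture estimator, sketched here with the numbers from the note:

```python
def lincoln_petersen(marked, caught, recaptured):
    # Mark-recapture population estimate: N ≈ (marked * caught) / recaptured
    return marked * caught / recaptured

# Tag 100 fish, catch 100 more, find 5 tagged among them:
estimate = lincoln_petersen(100, 100, 5)  # -> 2000.0, i.e. ~20x the 100 tagged
```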
Intermission
LARGE is 36x bigger than SMALL
Could we be in a sharing community and not have paid feeds?