
Malicious Url Detection Using Machine Learning

Presented by Satyam Saxena at the SecurityXploded cyber security meet. Visit http://www.securitytrainings.net for more information.



  1. Beyond Blacklists: Malicious URL Detection Using Machine Learning
  2. Who am I?
     • Information security investigator at Cisco.
     • Completed an M.Tech at IIT Jodhpur in 2014.
     • Areas of interest include machine learning, computer vision, and AI.
     • Email: satyamiitj89@gmail.com
  3. Malicious websites. Phishing: which one is real?
  4. Visiting Malicious Websites
  5. What do we want?
  6. Problem in a Nutshell
     • URL features to identify malicious Web sites
     • No context, no content
     • Different classes of URLs
     • Benign, spam, phishing, exploits, scams...
     • For now, distinguish benign vs. malicious
     • Examples: facebook.com, fblight.com
  7. Information about new websites
  8. State of the Practice
     • Current approaches: blacklists [SORBS, URIBL, SURBL, Spamhaus]; learning on hand-tuned features [Garera et al., 2007]
     • Limitations: cannot predict unlisted sites; cannot account for new features
     • Arms race: a fast feedback cycle is critical
     • More automated approach?
  9. URL Classification System (diagram labels: Label, Example, Hypothesis)
  10. Data Sets
      • Malicious URLs: 5,000 from PhishTank (phishing); 15,000 from Spamscatter (spam, phishing, etc.)
      • Benign URLs: 15,000 from Yahoo Web directory; 15,000 from DMOZ directory
      • Malicious × Benign → 4 data sets
      • 30,000 to 55,000 features per data set
  11. Algorithms
      • Logistic regression with L1-norm regularization
      • Other models: Naive Bayes; support vector machines (linear, RBF kernels)
      • Implicit feature selection
      • Easier to interpret
  12. Feature vector construction
  13. Features to consider?
      1) Blacklists
      2) Simple heuristics
      3) Domain name registration
      4) Host properties
      5) Lexical
  14. (1) Blacklist Queries
      • List of known malicious sites
      • Providers: SORBS, URIBL, SURBL, Spamhaus
      • http://www.bfuduuioo1fp.mobi: in blacklist? Yes
      • http://fblight.com: in blacklist? No
      • Blacklist queries as features
  15. (2) Manually-Selected Features
      • Considered by previous studies
      • IP address in hostname?
      • Number of dots in URL
      • WHOIS (domain name) registration date
      • Examples: http://72.23.5.122/www.bankofamerica.com/ and http://www.bankofamerica.com.qytrpbcw.stopgap.cn/ (stopgap.cn registered 28 June 2009)
  16. (3) WHOIS Features
      • Domain name registration
      • Date of registration, update, expiration
      • Registrant: Who registered the domain?
      • Registrar: Who manages the registration?
      • Example: http://sleazysalmon.com, http://angryalbacore.com, http://mangymackerel.com, http://yammeringyellowtail.com, all registered on 29 June 2009 by SpamMedia
  17. (4) Host-Based Features
      • Blacklisted? (SORBS, URIBL, SURBL, Spamhaus)
      • WHOIS: registrar, registrant, dates
      • IP address: Which ASes/IP prefixes?
      • DNS: TTL? PTR record exists/resolves?
      • Geography-related: Locale? Connection speed?
      • Example prefixes: 75.102.60.0/22, 69.63.176.0/20
      • Example hosts: facebook.com, fblight.com
  18. (5) Lexical Features
      • Tokens in URL hostname + path
      • Length of URL
      • Entropy of the domain name
      • Example: http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll
  19. Which feature sets?
      • Feature sets compared: Blacklist, Manual, WHOIS, Host-based, Lexical, Full, w/o WHOIS/Blacklist
      • # Features (chart labels as extracted): 4,000; 13,000; 4; 3; 17,000; 30,000; 26,000
  20. Beyond Blacklists
      • Chart: Blacklist vs. Full features on the Yahoo-PhishTank data set
      • Higher detection rate for a given false positive rate
  21. Limitations
      • False positives: sites hosted by a disreputable ISP; guilt by association
      • False negatives: compromised sites; free hosting sites; sites hosted by a reputable ISP
      • Future work: Web page content
  22. Conclusion
      • Detect malicious URLs with high accuracy
      • Only using the URL
      • Diverse feature set helps: 86.5% with 18,000+ features
      • Proof of concept working in lab
      • Future work: scaling up for deployment
  23. References
      • Ma, Justin, et al. "Beyond blacklists: learning to detect malicious web sites from suspicious URLs." Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2009.
  24. Q & A
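
The lexical features on slide 18 and two of the manually selected heuristics on slide 15 can be computed from the URL string alone. The following is a minimal sketch, not the presenter's implementation: the function names, the feature names, and the tokenization rule (splitting on non-alphanumeric characters) are assumptions made for illustration. Blacklist, WHOIS, and other host-based features need network lookups and are left out.

```python
import ipaddress
import math
import re
from collections import Counter
from urllib.parse import urlparse


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the characters in text, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def host_is_ip(hostname: str) -> bool:
    """True if the hostname is a literal IP address (slide 15 heuristic)."""
    try:
        ipaddress.ip_address(hostname)
        return True
    except ValueError:
        return False


def tokenize(text: str) -> list:
    """Assumed tokenizer: lowercase runs of letters and digits."""
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]


def url_features(url: str) -> dict:
    """Build a sparse feature dictionary from the URL string only."""
    parsed = urlparse(url)
    hostname = parsed.hostname or ""
    features = {
        "url_length": len(url),                          # length of URL
        "num_dots": url.count("."),                      # number of dots in URL
        "ip_in_hostname": host_is_ip(hostname),          # IP address in hostname?
        "hostname_entropy": shannon_entropy(hostname),   # entropy of the domain name
    }
    # Bag-of-words tokens, kept separate for hostname and path.
    for token in tokenize(hostname):
        features["host_token=" + token] = 1
    for token in tokenize(parsed.path):
        features["path_token=" + token] = 1
    return features


if __name__ == "__main__":
    # Example URL taken from slide 18.
    for name, value in url_features(
        "http://www.bfuduuioo1fp.mobi/ws/ebayisapi.dll"
    ).items():
        print(name, value)
```

Each distinct token becomes its own binary feature, which is one plausible reason the feature counts on slides 10 and 19 run into the tens of thousands.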

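
Slide 11 names logistic regression with L1-norm regularization and notes that it gives implicit feature selection. The sketch below shows why, on toy data: the L1 penalty pushes the weight of an uninformative feature to exactly zero. The solver (proximal gradient descent with soft thresholding), the function names, the hyperparameters, and the four-row data set are all assumptions for illustration; the slides do not say which solver or settings were used.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value: float, threshold: float) -> float:
    """Proximal operator of the L1 norm: shrink toward zero, clip at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def predict_proba(weights, bias, x) -> float:
    """Estimated probability that feature vector x is malicious (label 1)."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def train_l1_logistic(rows, labels, l1=0.05, step=0.5, epochs=500):
    """Minimize mean log-loss + l1 * sum(|w|) by proximal gradient descent."""
    n = len(rows)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = predict_proba(weights, bias, x) - y
            grad_b += err / n
            for j, xi in enumerate(x):
                grad_w[j] += err * xi / n
        bias -= step * grad_b  # the bias is not penalized
        weights = [
            soft_threshold(w - step * g, step * l1)
            for w, g in zip(weights, grad_w)
        ]
    return weights, bias


# Hypothetical toy data with features encoded as +1 / -1.
# Feature 0 determines the label; feature 1 is unrelated to it.
TOY_ROWS = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
TOY_LABELS = [1, 1, 0, 0]

if __name__ == "__main__":
    weights, bias = train_l1_logistic(TOY_ROWS, TOY_LABELS)
    print("weights:", weights)
    print("bias:", bias)
```

With sparse token features like those described on slide 18, the same effect leaves most weights at zero, which is what makes the trained model easier to inspect.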