Going a Step Beyond the Black and White Lists for URL Accesses in the Enterprise by means of Categorical Classifiers 
Authors: 
Antonio Miguel Mora García 
Paloma de las Cuevas Delgado 
Juan Julián Merelo Guervós 
ECTA 2014, Rome, Italy
MUSES is an EU-funded research project 
1
Bring Your Own Device 
What happens to corporate assets in a BYOD 
environment? 
2
Structure of the MUSES server 
3
Underlying Problem 
Enterprise security applied to employees’ connections to the 
Internet (URL requests). 
● Proxies 
● Firewalls 
● Corporate Security Policies (CSP), which may 
include blacklists and whitelists 
4
What do Black and White lists cover? 
● Every URL on a blacklist is denied; any other URL is allowed. 
What if a URL is allowed by default but should not be? 
● Every URL on a whitelist is allowed; any other URL is denied. 
What if a URL is denied by default but should not be? 
Therefore, we want to go a step beyond. 
5
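The gap that both list types leave can be illustrated with a minimal sketch. It assumes exact-match lookups on the full URL string; the function names and the example URLs are hypothetical and are not taken from the MUSES system.

```python
def blacklist_decision(url, blacklist):
    """Blacklist semantics: deny listed URLs, allow everything else."""
    return "DENY" if url in blacklist else "ALLOW"


def whitelist_decision(url, whitelist):
    """Whitelist semantics: allow listed URLs, deny everything else."""
    return "ALLOW" if url in whitelist else "DENY"


# Hypothetical lists and request, for illustration only.
blacklist = {"http://bad.example.com"}
whitelist = {"http://intranet.example.com"}
unknown_url = "http://new.example.com"

# The same unlisted URL gets opposite default decisions.
print(blacklist_decision(unknown_url, blacklist))
print(whitelist_decision(unknown_url, whitelist))
```

Neither default looks at any property of the request itself, which is the limitation the rest of the talk addresses.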
Objectives 
● Objective → to obtain a tool that automatically makes an 
allow or deny decision for URLs that are 
not included in the black/whitelists. 
o This decision would be based on the one made for similar URL 
accesses (those with similar features). 
o The tool should consider other parameters of the request in 
addition to the URL string. 
6
Followed Schema 
[Figure: data mining workflow. Unlabelled requests → Labelling Process → Labelled requests → Machine Learning (classification methods) → Analysis of results (classification accuracies and rules).] 
7 
Working Scenario 
Employees requesting access to URLs (records from an actual 
Spanish company with around 100 employees) from 8 to 10 am. 
● Log file of 100k entries (patterns), in CSV format. 
● A set of rules (the security policies specified as 
if-then clauses). 
8 
Data description: Entries in the Log 
● An entry (unlabelled) has 7 categorical fields and 3 numerical fields. 
● This calls for classification methods that support both types: 
o Rule based classifiers 
o Tree based classifiers 
Example entry: 
http_reply_code          200 
http_method              GET 
duration_milliseconds    1114 
content_type             application/octet-stream 
server_or_cache_address  X.X.X.X 
time                     08:30:08 
squid_hierarchy          DEFAULT_PARENT 
bytes                    106961 
url                      http://www.one.example.com 
client_address           X.X.X.X 
9
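A minimal sketch of how such an entry might be loaded from the CSV log. The column order, the presence of a header row, and the choice of which fields to convert to numbers are assumptions; the slide only states that 3 of the 10 fields are numerical. Converting the time to milliseconds since midnight is also an assumption, chosen because it is consistent with thresholds such as `time <= 31532000` that appear in the rules later in the talk.

```python
import csv
import io

# Assumed: these two fields are plain integers in the log.
INT_FIELDS = ("duration_milliseconds", "bytes")


def time_to_ms(hhmmss):
    """Convert 'HH:MM:SS' to milliseconds since midnight (assumed encoding)."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return ((hours * 60 + minutes) * 60 + seconds) * 1000


def parse_log(text):
    """Parse CSV text with a header row into a list of typed entries."""
    entries = []
    for row in csv.DictReader(io.StringIO(text)):
        entry = dict(row)
        for name in INT_FIELDS:
            entry[name] = int(entry[name])
        entry["time"] = time_to_ms(entry["time"])
        entries.append(entry)  # all other fields stay categorical strings
    return entries


# Hypothetical CSV layout built from the example entry on the slide.
SAMPLE = (
    "http_reply_code,http_method,duration_milliseconds,content_type,"
    "server_or_cache_address,time,squid_hierarchy,bytes,url,client_address\n"
    "200,GET,1114,application/octet-stream,X.X.X.X,08:30:08,"
    "DEFAULT_PARENT,106961,http://www.one.example.com,X.X.X.X\n"
)
entries = parse_log(SAMPLE)
print(entries[0]["time"], entries[0]["bytes"])
```

Note that `http_reply_code` is left as a string here, on the assumption that it is treated as a category rather than a quantity.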
Data description: Policies and Rules 
● A Policy and a Rule 
“Video streams cannot be played” 
rule "policy-1 MP4" 
attributes 
when 
squid:Squid(dif_MCT=="video",bytes>1000000, 
content_type matches "*.application.*", 
url matches "*.p2p.*" ) 
then 
PolicyDecisionPoint.deny(); 
end 
● Each rule has a set of conditions and a decision (ALLOW/DENY). 
● Each condition has: Data Type, Relationship, Value. 
10
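The structure described above (conditions made of a data type, a relationship, and a value, plus a decision) could be modelled roughly as follows. This is a sketch, not the MUSES implementation: the class names, the set of supported relationships, and the sample entry are hypothetical, and the rule is a simplified version of the one on the slide.

```python
import operator
import re
from dataclasses import dataclass

# Assumed set of relationships; "matches" uses a Python regular expression.
RELATIONS = {
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
    "matches": lambda value, pattern: re.fullmatch(pattern, str(value)) is not None,
}


@dataclass(frozen=True)
class Condition:
    field_name: str   # the "Data Type" of the condition
    relation: str     # the "Relationship"
    value: object     # the "Value"

    def holds(self, entry):
        return RELATIONS[self.relation](entry[self.field_name], self.value)


@dataclass(frozen=True)
class Rule:
    name: str
    conditions: tuple
    decision: str     # "ALLOW" or "DENY"

    def matches(self, entry):
        # A rule applies only if every one of its conditions holds.
        return all(condition.holds(entry) for condition in self.conditions)


rule = Rule(
    name="policy-1 MP4",
    conditions=(
        Condition("dif_MCT", "==", "video"),
        Condition("bytes", ">", 1_000_000),
        Condition("url", "matches", r".*p2p.*"),
    ),
    decision="DENY",
)

# Hypothetical entry, for illustration only.
entry = {"dif_MCT": "video", "bytes": 2_500_000, "url": "http://p2p.example.com/a.mp4"}
print(rule.matches(entry), rule.decision)
```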
Labelling Process 
● The two data sets (log entries and rules) are compared during the labelling process. 
● The conditions of each rule are checked against each entry/request. 
● If an entry meets all the conditions of a rule, it is labelled with the 
corresponding decision of that rule. 
● Conflict resolution: when an entry meets the conditions of a rule that allows the request 
AND the conditions of a rule that denies the request, DENY is chosen. 
11
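The labelling step, including the DENY-over-ALLOW conflict resolution, can be sketched as below. Rules are represented here as (predicate, decision) pairs purely for brevity; that representation, the function name, and the sample rules are assumptions.

```python
def label_entry(entry, rules):
    """Return 'DENY', 'ALLOW', or None if no rule covers the entry.

    rules: iterable of (predicate, decision) pairs, where predicate(entry)
    is True when the entry meets all conditions of that rule.
    """
    decisions = {decision for predicate, decision in rules if predicate(entry)}
    if "DENY" in decisions:      # DENY wins whenever rules conflict
        return "DENY"
    if "ALLOW" in decisions:
        return "ALLOW"
    return None                  # entry stays unlabelled


# Hypothetical rules, for illustration only.
rules = [
    (lambda e: e["http_method"] == "GET", "ALLOW"),
    (lambda e: e["bytes"] > 1_000_000, "DENY"),
]

print(label_entry({"http_method": "GET", "bytes": 500}, rules))
print(label_entry({"http_method": "GET", "bytes": 5_000_000}, rules))
print(label_entry({"http_method": "POST", "bytes": 500}, rules))
```

Entries for which the function returns `None` correspond to the patterns that the rules do not cover.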
Data Summary 
● The CSV file, now containing only the patterns that could be labelled 
(the others were not covered by the rules), has 57502 
entries/patterns: 
o 38972 with an ALLOW label. 
o 18530 with a DENY label. 
That is roughly a 2:1 ratio. 
● Application of data balancing techniques: 
o Undersampling: random removal of patterns in the majority class. 
o Oversampling: duplication of each pattern in the minority class. 
12
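A minimal sketch of the two balancing techniques, assuming each pattern is a (features, label) pair. Reducing the majority class to exactly the size of the minority class is an assumption; the slide only says that majority patterns are removed at random. Duplicating each minority pattern once matches the roughly 2:1 ratio reported above.

```python
import random


def split_by_label(patterns, majority_label):
    majority = [p for p in patterns if p[1] == majority_label]
    minority = [p for p in patterns if p[1] != majority_label]
    return majority, minority


def undersample(patterns, majority_label, seed=0):
    """Randomly drop majority patterns until both classes have equal size."""
    majority, minority = split_by_label(patterns, majority_label)
    kept = random.Random(seed).sample(majority, len(minority))
    return kept + minority


def oversample(patterns, majority_label):
    """Duplicate each minority pattern once."""
    majority, minority = split_by_label(patterns, majority_label)
    return majority + minority + minority


# Toy data with a 2:1 ALLOW/DENY ratio, for illustration only.
data = [((i,), "ALLOW") for i in range(8)] + [((i,), "DENY") for i in range(4)]
print(len(undersample(data, "ALLOW")), len(oversample(data, "ALLOW")))
```

Undersampling discards information, while oversampling keeps every original pattern at the cost of exact repeats.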
Experimental Setup 
● First, the classifiers are tested with a 10-fold cross-validation 
process. 
o The top five classifiers by accuracy are chosen for the following 
experiments. 
o In addition, the Naïve Bayes classifier is taken as a reference. 
● Second, the initial (labelled) log file is divided 
into training and test files. 
● These training and test files are created with different ratios, 
taking the entries either randomly or sequentially. 
13
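The difference between a random and a sequential division can be sketched as follows. The function name and parameters are hypothetical; the sketch only assumes that a sequential division keeps the original log order and cuts it at the chosen ratio.

```python
import random


def split(entries, train_fraction, randomly, seed=0):
    """Divide entries into (training, test) lists.

    randomly=False keeps the log order, so the test file holds the
    latest requests; randomly=True shuffles before cutting.
    """
    items = list(entries)
    if randomly:
        random.Random(seed).shuffle(items)
    cut = round(len(items) * train_fraction)
    return items[:cut], items[cut:]


log = list(range(10))  # stand-in for ten log entries in time order
print(split(log, 0.8, randomly=False))
print(split(log, 0.9, randomly=True))
```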
Flow Diagram 
1) Initial labelling process. 
Experiments with unbalanced and balanced 
data. From those, divisions are made: 
● 80% training 20% testing 
● 90% training 10% testing 
Randomly, and sequentially. 
2) Removal of duplicated requests. 
Experiments with unbalanced data. From 
those, divisions are made: 
● 80% training 20% testing 
● 90% training 10% testing 
● 60% training 40% testing 
Randomly, and sequentially. 
3) Enhancing the creation of training and test files. 
Experiments with unbalanced data. From those, divisions 
are made, patterns randomly taken: 
● 80% training 20% testing 
● 90% training 10% testing 
● 60% training 40% testing 
4) Filtering the features of the URL. 
Experiments with unbalanced and balanced data. 
From those, divisions are made, patterns 
randomly taken: 
● 80% training 20% testing 
● 90% training 10% testing 
● 60% training 40% testing 
14
10-fold cross-validation experiments 
1) Initial labelling process. 
● The classifiers are tested, firstly, with a 10-fold cross-validation process 
over the balanced data. 
15
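For readers unfamiliar with k-fold cross-validation, the index bookkeeping can be sketched as below. This is a plain, unshuffled and unstratified variant written for illustration; the slides do not say how the folds were actually built.

```python
def k_fold_indices(n_patterns, k=10):
    """Yield (train_indices, test_indices) for each of the k folds.

    Every pattern is used for testing exactly once and for training
    in the remaining k - 1 rounds.
    """
    folds = [list(range(start, n_patterns, k)) for start in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(25, k=10))
print(len(splits), splits[0][1])
```

The reported accuracy of a classifier is then typically the average over the k rounds.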
Using separate training/test files 
1) Initial labelling process. 
● Naïve Bayes and top five classifiers are tested with training and test 
divisions, in order to avoid testing patterns being used for training and 
vice versa. 
16
Serendipity rocks 
1) Initial labelling process. 
Divisions made over unbalanced data 
17
Results continue falling 
1) Initial labelling process. 
Divisions made over balanced data (undersampling) 
18
Results continue falling 
1) Initial labelling process. 
Divisions made over balanced data (oversampling) 
19
Why are accuracies still high? 
2) Removal of duplicated requests. 
● We studied the field squid_hierarchy and saw that it had two possible 
values: DIRECT or DEFAULT_PARENT. 
Example entry: 
http_reply_code          200 
http_method              GET 
duration_milliseconds    1114 
content_type             application/octet-stream 
server_or_cache_address  X.X.X.X 
time                     08:30:08 
squid_hierarchy          DEFAULT_PARENT 
bytes                    106961 
url                      http://www.one.example.com 
client_address           X.X.X.X 
20
Repeated entries affect accuracies 
2) Removal of duplicated requests. 
● Connections are first made to the Squid proxy and then, if 
appropriate, the request continues to another server. 
o As a result, some of the entries were repeated, which may affect the 
results. 
21 
[Figure: “Some local IP” → 192.194.2.2 → “Some server IP” (www)]
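One way such repeated entries might be removed is sketched below. The slides do not state how duplicates were identified, so the choice of key fields is an assumption: two entries are treated as the same request when they share client, URL, time, and HTTP method, regardless of squid_hierarchy and server address.

```python
# Assumed key: fields that identify a request independently of the proxy hop.
KEY_FIELDS = ("client_address", "url", "time", "http_method")


def remove_duplicates(entries, key_fields=KEY_FIELDS):
    """Keep the first entry seen for each key, preserving log order."""
    seen = set()
    unique = []
    for entry in entries:
        key = tuple(entry[name] for name in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique


# Hypothetical entries, for illustration only.
log = [
    {"client_address": "10.0.0.1", "url": "http://a.example.com",
     "time": 30608000, "http_method": "GET", "squid_hierarchy": "DEFAULT_PARENT"},
    {"client_address": "10.0.0.1", "url": "http://a.example.com",
     "time": 30608000, "http_method": "GET", "squid_hierarchy": "DIRECT"},
    {"client_address": "10.0.0.1", "url": "http://b.example.com",
     "time": 30609000, "http_method": "GET", "squid_hierarchy": "DIRECT"},
]
print(len(remove_duplicates(log)))
```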
Serendipity rocks again 
2) Removal of duplicated requests. 
Divisions made over unbalanced data 
22
Where are the URL features going? 
3) Enhancing the creation of training and test files. 
● Repeated URL core domains could lead to misleading results. 
● During the division process, we ensured that requests with the same 
URL core domain went to the same file (either for training or for 
testing). 
23
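A sketch of a division that keeps all requests for one core domain in the same file. The core-domain extraction is deliberately naive (it takes the label just before the last one and assumes every URL has a scheme), and the ratio is applied to domains rather than to entries; both are assumptions, not the authors' procedure.

```python
import random
from urllib.parse import urlsplit


def core_domain(url):
    """Naive core domain: 'http://www.one.example.com' -> 'example'."""
    labels = (urlsplit(url).hostname or "").split(".")
    return labels[-2] if len(labels) >= 2 else labels[0]


def grouped_split(entries, train_fraction, seed=0):
    """Split so that no core domain appears in both training and test."""
    domains = sorted({core_domain(e["url"]) for e in entries})
    random.Random(seed).shuffle(domains)
    cut = round(len(domains) * train_fraction)
    train_domains = set(domains[:cut])
    train = [e for e in entries if core_domain(e["url"]) in train_domains]
    test = [e for e in entries if core_domain(e["url"]) not in train_domains]
    return train, test


# Hypothetical URLs, for illustration only.
log = [{"url": u} for u in (
    "http://www.dropbox.com/a", "http://dl.dropbox.com/b",
    "http://www.ubuntu.com/", "http://archive.ubuntu.com/x",
    "http://www.facebook.com/", "http://www.one.example.com",
)]
train, test = grouped_split(log, 0.5)
print(len(train), len(test))
```

With this kind of split a classifier cannot score well on the test file simply by memorising domains it saw during training.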
Accuracies fall down automatically 
3) Enhancing the creation of training and test files. 
24
Created Rules During Classification 
● In the experiments that included only the URL core domain as a 
classification feature, rules were too focused on that feature. 
PART decision list 
------------------ 
url = dropbox: deny (2999.0) 
url = ubuntu: allow (2165.0) 
url = facebook: deny (1808.0) 
url = valli: allow (1679.0) 
25
Created Rules During Classification 
● Other kinds of rules were found, but they always depended on 
the URL core domain. 
url = grooveshark AND 
http_method = POST: allow (733.0) 
url = googleapis AND 
content_type = text/javascript AND 
client_address = 192.168.4.4: allow (155.0/2.0) 
url = abc AND 
content_type_MCT = image AND 
time <= 31532000: allow (256.0) 
26
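A decision list like the ones shown above is applied in order, and the first rule whose conditions hold determines the outcome. The sketch below illustrates that with three of the rules from the slides; the list representation, the ordering, and the default decision are assumptions made for the example.

```python
# Each item: (condition on the entry, decision). Order matters.
DECISION_LIST = [
    (lambda e: e.get("url") == "grooveshark" and e.get("http_method") == "POST", "allow"),
    (lambda e: e.get("url") == "dropbox", "deny"),
    (lambda e: e.get("url") == "ubuntu", "allow"),
]


def classify(entry, decision_list, default="deny"):
    """Return the decision of the first matching rule, else the default.

    The default value is a hypothetical choice for entries that no rule covers.
    """
    for condition, decision in decision_list:
        if condition(entry):
            return decision
    return default


print(classify({"url": "grooveshark", "http_method": "POST"}, DECISION_LIST))
print(classify({"url": "dropbox", "http_method": "GET"}, DECISION_LIST))
print(classify({"url": "unseen", "http_method": "GET"}, DECISION_LIST))
```

The last call shows the weakness discussed on these slides: a request to a core domain that was never seen in training falls through every rule.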
Training with other URL features 
4) Filtering the features of the URL. 
● Rules created by the classifiers are too focused on the URL core domain 
feature. 
● We repeated the experiments with the original file, keeping as a 
URL feature only the Top Level Domain (TLD) and not the core domain. 
27
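Replacing the URL with its top level domain could look like the sketch below. Taking the last dot-separated label of the host name is a simplification (it does not handle multi-part suffixes or hosts given as IP addresses), and the function names are hypothetical.

```python
from urllib.parse import urlsplit


def tld(url):
    """Naive TLD: the last label of the host name, e.g. 'com' or 'es'."""
    host = urlsplit(url).hostname or ""
    return host.rsplit(".", 1)[-1]


def with_tld_feature(entry):
    """Return a copy of the entry with 'url' replaced by a 'tld' feature."""
    filtered = {name: value for name, value in entry.items() if name != "url"}
    filtered["tld"] = tld(entry["url"])
    return filtered


entry = {"url": "http://www.one.example.com", "http_method": "GET"}
print(with_tld_feature(entry))
```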
Random Forest defeats everyone 
4) Filtering the features of the URL. 
Divisions made over balanced data 
28
Created Rules During Classification 
● After including the URL top level domain as a classification feature, 
instead of URL core domain, rules classify mainly by server 
address. 
PART decision list 
------------------ 
server_or_cache_address = 173.194.34.248: allow (238.0/1.0) 
server_or_cache_address = 91.121.155.13: deny (235.0) 
server_or_cache_address = 90.84.53.48 AND 
client_address = 10.159.39.199 AND 
tld = es AND 
time <= 31533000: allow (138.0/1.0) 
29
Created Rules During Classification 
● The URL TLD appears, but now the rules are not always 
dependent on this feature. 
server_or_cache_address = 90.84.53.19 AND 
tld = com: deny (33.0/1.0) 
server_or_cache_address = 87.248.20.254 AND 
content_type_MCT = image AND 
duration_milliseconds > 21: deny (15.0) 
server_or_cache_address = 23.38.17.224 AND 
time > 30532000 AND 
http_reply_code = 200 AND 
content_type_MCT = image AND 
bytes <= 520 AND 
time <= 33677000: allow (40.0) 
30
Conclusions 
● In most cases, the Random Forest classifier is the one that yields 
the best results. 
● The loss of information when analysing a log of URL 
requests lowers the accuracy. This happens when: 
o Undersampling data (because we randomly remove data). 
o Keeping the sequence of the requests of the initial log file while 
making the division into training and test files. 
31
Conclusions 
● As seen in the rules obtained, it is possible to develop a tool 
that automatically makes an allowance or denial decision 
with respect to URLs, and that decision would depend on 
other features of a URL request and not only the URL. 
33
Future Work 
● Running experiments with bigger data sets (e.g. a whole 
workday). 
● Including more lexical features of a URL in the experiments 
(e.g. number of subdomains, number of arguments, or the 
path). 
● Considering sessions when classifying. 
o A session is defined as the set of requests made from a certain 
client during a certain time. 
● Finally, implementing a system and testing it with real 
data, in real time. 
34
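The lexical features mentioned above (number of subdomains, number of arguments, path) could be extracted roughly as follows. This is a sketch under simple assumptions: every label to the left of the last two host labels counts as a subdomain, and arguments are the key/value pairs of the query string. The feature names are hypothetical.

```python
from urllib.parse import parse_qsl, urlsplit


def lexical_features(url):
    """Extract a few lexical features from a URL string."""
    parts = urlsplit(url)
    labels = (parts.hostname or "").split(".")
    return {
        # Assumed definition: labels beyond '<core domain>.<tld>'.
        "num_subdomains": max(len(labels) - 2, 0),
        "num_arguments": len(parse_qsl(parts.query, keep_blank_values=True)),
        "path": parts.path,
    }


print(lexical_features("http://www.one.example.com/a/b?x=1&y=2"))
```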
Thank you for your attention 
Questions? 
amorag@geneura.ugr.es 
jmerelo@geneura.ugr.es 
paloma@geneura.ugr.es 
Twitter (@amoragar, @jjmerelo, 
@unintendedbear)

Going a Step Beyond the Black and White Lists for URL Accesses in the Enterprise by means of Categorical Classifiers

  • 1. Going a Step Beyond the Black and White Lists for URL Accesses in the Enterprise by means of Categorical Classifiers Authors: Antonio Miguel Mora García Paloma de las Cuevas Delgado Juan Julián Merelo Guervós ECTA 2014, Rome, Italy
  • 2. MUSES is an EU funded research project 1
  • 3. Bring Your Own Device What happens to corporate assets in a BYOD environment? 2
  • 4. Structure of the MUSES server 3
  • 5. Underlying Problem Enterprise Security applied to employees’ connections to the Internet (URL requests). 4 www ● Proxies ● Firewalls ● Corporate Security Policies (CSP) which may include Blacklists and Whitelists
  • 6. What do Black and White lists cover? ● Every URL inside a Blacklist is denied, if not, it is allowed. What if something is directly allowed but it should not be? ● Every URL inside a Whitelist is allowed, if not, it is denied. What if something is directly denied but it should not be? Therefore, we want to go a step beyond. 5
  • 7. Objectives
       ● Objective → to obtain a tool that automatically makes an allow or deny decision for URLs that are not included in the black/whitelists.
         o This decision would be based on the one made for similar URL accesses (those with similar features).
         o The tool should consider other parameters of the request in addition to the URL string.
  • 8. Followed Schema (diagram; labels only): Unlabelled requests, Labelling Process, Labelled requests, Classification methods, Classification accuracies and Rules, Analysis of results. Stage labels: Data Mining, Machine Learning.
  • 9. Working Scenario: Employees requesting access to URLs (records from an actual Spanish company, around 100 employees) from 8 to 10 am.
       ● Log file of 100k entries (patterns), in CSV format.
       ● A set of rules (the security policies specified as if-then clauses).
  • 10. Data description: Entries in the Log
       ● An entry (unlabelled):
           http_reply_code: 200
           http_method: GET
           duration_miliseconds: 1114
           content_type: application/octet-stream
           server_or_cache_address: X.X.X.X
           time: 08:30:08
           squid_hierarchy: DEFAULT_PARENT
           bytes: 106961
           url: http://www.one.example.com
           client_adress: X.X.X.X
       ● It has 7 categorical fields and 3 numerical fields.
       ● This calls for classifiers that support both types:
         o Rule-based classifiers
         o Tree-based classifiers
  • 11. Data description: Policies and Rules
       ● A policy and a rule:
           “Video streamings cannot be reproduced”
           rule "policy-1 MP4"
           attributes
           when
             squid:Squid(dif_MCT=="video", bytes>1000000, content_type matches "*.application.*", url matches "*.p2p.*")
           then
             PolicyDecisionPoint.deny();
           end
       ● A rule has a set of conditions and a decision (ALLOW/DENY).
       ● Each condition has: Data Type, Relationship, Value.
  • 12. Labelling Process
       ● The two data sets (log entries and rules) are compared during the labelling process.
       ● The conditions of each rule are checked against each entry/request.
       ● If an entry meets all conditions of a rule, it is labelled with that rule's decision.
       ● When an entry meets the conditions of both a rule that allows the request and a rule that denies it, DENY is chosen.
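The labelling step can be sketched as follows. This is a minimal illustration under stated assumptions: the `Condition` and `Rule` classes, the `RELATIONS` table and the use of regular expressions for `matches` are hypothetical stand-ins, since the slides only say that each condition has a data type, a relationship and a value, and that DENY wins on conflict.

```python
import operator
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical relation table; "matches" is modelled as a regex search.
RELATIONS = {
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
    "matches": lambda value, pattern: re.search(pattern, str(value)) is not None,
}


@dataclass(frozen=True)
class Condition:
    field: str
    relation: str
    value: object

    def holds(self, entry: dict) -> bool:
        # A condition on a field the entry lacks is treated as not met.
        if self.field not in entry:
            return False
        return RELATIONS[self.relation](entry[self.field], self.value)


@dataclass(frozen=True)
class Rule:
    conditions: tuple
    decision: str  # "ALLOW" or "DENY"

    def covers(self, entry: dict) -> bool:
        return all(c.holds(entry) for c in self.conditions)


def label(entry: dict, rules: list) -> Optional[str]:
    """Label an entry; DENY takes precedence, None means no rule covers it."""
    decisions = {r.decision for r in rules if r.covers(entry)}
    if "DENY" in decisions:
        return "DENY"
    if "ALLOW" in decisions:
        return "ALLOW"
    return None
```

Entries for which `label` returns `None` correspond to the patterns that could not be labelled and were left out of the labelled data set.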
  • 13. Data Summary
       ● The CSV file, now containing only the patterns that could be labelled (the others were not covered by the rules), has 57502 entries/patterns, roughly a 2:1 ratio:
         o 38972 with an ALLOW label.
         o 18530 with a DENY label.
       ● Application of data balancing techniques:
         o Undersampling: random removal of patterns in the majority class.
         o Oversampling: duplication of each pattern in the minority class.
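A minimal sketch of the two balancing techniques, assuming each pattern is a `(features, label)` pair. The function names and the fixed random seed are illustrative; the slides do not show the authors' implementation.

```python
import random
from collections import defaultdict


def split_by_label(patterns):
    """Group (features, label) pairs by their label."""
    groups = defaultdict(list)
    for features, lab in patterns:
        groups[lab].append((features, lab))
    return groups


def undersample(patterns, seed=0):
    """Randomly drop patterns until every class has the size of the smallest one."""
    rng = random.Random(seed)
    groups = split_by_label(patterns)
    target = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, target))
    return balanced


def oversample(patterns):
    """Duplicate each pattern of the minority class once."""
    groups = split_by_label(patterns)
    minority = min(groups, key=lambda lab: len(groups[lab]))
    return list(patterns) + list(groups[minority])
```

With the counts on this slide, duplicating each DENY pattern once gives 2 × 18530 = 37060 DENY patterns against 38972 ALLOW patterns, which is close to balanced.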
  • 14. Experimental Setup
       ● The classifiers are first tested with a 10-fold cross-validation process.
         o The top five classifiers by accuracy are chosen for the following experiments.
         o The Naïve Bayes classifier is also taken as a reference.
       ● Secondly, the initial (labelled) log file is divided into training and test files.
       ● These training and test files are created with different ratios, taking the entries either randomly or sequentially.
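The two ways of dividing the labelled log can be sketched like this. The function name `split` and its parameters are assumptions made for illustration.

```python
import random


def split(entries, train_ratio, randomize, seed=0):
    """Divide entries into (train, test); shuffle first only when randomize is True."""
    items = list(entries)
    if randomize:
        random.Random(seed).shuffle(items)
    cut = round(len(items) * train_ratio)
    return items[:cut], items[cut:]
```

A sequential split (`randomize=False`) keeps the log order, so the test file contains only the latest requests; a random split mixes early and late requests across both files.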
  • 15. Flow Diagram
       1) Initial labelling process. Experiments with unbalanced and balanced data. From those, divisions are made, randomly and sequentially:
          ● 80% training, 20% testing
          ● 90% training, 10% testing
       2) Removal of duplicated requests. Experiments with unbalanced data. From those, divisions are made, randomly and sequentially:
          ● 80% training, 20% testing
          ● 90% training, 10% testing
          ● 60% training, 40% testing
       3) Enhancing the creation of training and test files. Experiments with unbalanced data. From those, divisions are made, patterns randomly taken:
          ● 80% training, 20% testing
          ● 90% training, 10% testing
          ● 60% training, 40% testing
       4) Filtering the features of the URL. Experiments with unbalanced and balanced data. From those, divisions are made, patterns randomly taken:
          ● 80% training, 20% testing
          ● 90% training, 10% testing
          ● 60% training, 40% testing
  • 16. 10-fold cross-validation experiments. 1) Initial labelling process. ● The classifiers are first tested with a 10-fold cross-validation process over the balanced data.
  • 17. Using separate training/test files. 1) Initial labelling process. ● Naïve Bayes and the top five classifiers are tested with training and test divisions, in order to avoid testing patterns being used for training and vice versa.
  • 18. Serendipity rocks. 1) Initial labelling process. Divisions made over unbalanced data.
  • 19. Results continue falling. 1) Initial labelling process. Divisions made over balanced data (undersampling).
  • 20. Results continue falling. 1) Initial labelling process. Divisions made over balanced data (oversampling).
  • 21. Why are accuracies still high? 2) Removal of duplicated requests.
       ● We studied the field squid_hierarchy and saw that it had two possible values: DIRECT or DEFAULT_PARENT.
       Example entry:
           http_reply_code: 200
           http_method: GET
           duration_miliseconds: 1114
           content_type: application/octet-stream
           server_or_cache_address: X.X.X.X
           time: 08:30:08
           squid_hierarchy: DEFAULT_PARENT
           bytes: 106961
           url: http://www.one.example.com
           client_adress: X.X.X.X
  • 22. Repeated entries affect accuracies. 2) Removal of duplicated requests.
       ● Connections are made first to the Squid proxy and then, if appropriate, the request continues to another server.
         o As a result, some of the entries were repeated, and the results may be affected by that.
       (Diagram labels: www, “Some local IP”, 192.194.2.2, “Some server IP”)
  • 23. Serendipity rocks again. 2) Removal of duplicated requests. Divisions made over unbalanced data.
  • 24. Where are the URL features going? 3) Enhancing the creation of training and test files.
       ● Repeated URL core domains could lead to misleading results.
       ● During the division process, we ensured that requests with the same URL core domain went to the same file (either training or testing).
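A sketch of such a domain-aware division in Python. Two assumptions are made loudly here: the core domain is taken naively as the host label immediately left of the top level domain (so `example` in `www.one.example.com`, which is wrong for suffixes like `co.uk`), and whole domain groups are assigned greedily until the training file reaches the requested size. The slides do not describe how the authors did either step.

```python
import random
from collections import defaultdict
from urllib.parse import urlsplit


def core_domain(url):
    """Naive core domain: the host label just left of the TLD (assumption)."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    return labels[-2] if len(labels) >= 2 else host


def grouped_split(entries, train_ratio, seed=0):
    """Split so that all requests sharing a core domain land in the same file."""
    groups = defaultdict(list)
    for entry in entries:
        groups[core_domain(entry["url"])].append(entry)
    domains = sorted(groups)
    random.Random(seed).shuffle(domains)
    target = train_ratio * len(entries)
    train, test = [], []
    for domain in domains:
        # Whole groups go to training until it reaches the target size.
        (train if len(train) < target else test).extend(groups[domain])
    return train, test
```

Because groups are never broken up, the actual training share only approximates `train_ratio` when domain groups differ in size.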
  • 25. Accuracies fall down automatically. 3) Enhancing the creation of training and test files.
  • 26. Created Rules During Classification
       ● In the experiments that included only the URL core domain as a classification feature, rules were too focused on that feature.
       PART decision list
       ------------------
       url = dropbox: deny (2999.0)
       url = ubuntu: allow (2165.0)
       url = facebook: deny (1808.0)
       url = valli: allow (1679.0)
  • 27. Created Rules During Classification
       ● Other kinds of rules were found, but they were always dependent on the URL core domain.
       url = grooveshark AND http_method = POST: allow (733.0)
       url = googleapis AND content_type = text/javascript AND client_address = 192.168.4.4: allow (155.0/2.0)
       url = abc AND content_type_MCT = image AND time <= 31532000: allow (256.0)
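A decision list such as the PART output above is applied top to bottom: the first rule whose conditions all hold decides the class. The sketch below models only equality conditions, copies a few of the rules shown on slides 26 and 27, and uses an illustrative ordering, a dict-based rule format and a `default` fallback, all of which are assumptions.

```python
# A few rules from the slides; the order here is illustrative only.
DECISION_LIST = [
    ({"url": "dropbox"}, "deny"),
    ({"url": "ubuntu"}, "allow"),
    ({"url": "facebook"}, "deny"),
    ({"url": "grooveshark", "http_method": "POST"}, "allow"),
]


def classify(entry, decision_list=DECISION_LIST, default=None):
    """Return the decision of the first rule whose equality conditions all hold."""
    for conditions, decision in decision_list:
        if all(entry.get(field) == value for field, value in conditions.items()):
            return decision
    return default
```

The sketch makes the slide's point visible: every rule starts from the `url` feature, so a request to an unseen core domain falls through to the default.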
  • 28. Training with other URL features. 4) Filtering the features of the URL.
       ● Rules created by the classifiers are too focused on the URL core domain feature.
       ● We repeated the experiments with the original file, but including only the Top Level Domain of the URL as a feature, and not the core domain.
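Replacing the URL with its top level domain could look like the following sketch. The helper names and the `tld` key are assumptions (the key mirrors the `tld` attribute that appears in the rules on later slides), and the TLD is taken naively as the last dot-separated label of the host, which is not meaningful for hosts given as IP addresses.

```python
from urllib.parse import urlsplit


def top_level_domain(url):
    """Naive TLD: the last dot-separated label of the host (assumption)."""
    host = urlsplit(url).hostname or ""
    return host.rsplit(".", 1)[-1] if "." in host else ""


def with_tld_only(entry):
    """Return a copy of the entry where the url field is replaced by its TLD."""
    filtered = dict(entry)
    filtered["tld"] = top_level_domain(filtered.pop("url"))
    return filtered
```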
  • 29. Random Forest defeats everyone. 4) Filtering the features of the URL. Divisions made over balanced data.
  • 30. Created Rules During Classification
       ● After including the URL top level domain as a classification feature, instead of the URL core domain, rules classify mainly by server address.
       PART decision list
       ------------------
       server_or_cache_address = 173.194.34.248: allow (238.0/1.0)
       server_or_cache_address = 91.121.155.13: deny (235.0)
       server_or_cache_address = 90.84.53.48 AND client_address = 10.159.39.199 AND tld = es AND time <= 31533000: allow (138.0/1.0)
  • 31. Created Rules During Classification
       ● The URL TLD appears, but now the rules are not always dependent on this feature.
       server_or_cache_address = 90.84.53.19 AND tld = com: deny (33.0/1.0)
       server_or_cache_address = 87.248.20.254 AND content_type_MCT = image AND duration_milliseconds > 21: deny (15.0)
       server_or_cache_address = 23.38.17.224 AND time > 30532000 AND http_reply_code = 200 AND content_type_MCT = image AND bytes <= 520 AND time <= 33677000: allow (40.0)
  • 32. Conclusions
       ● In most cases, the Random Forest classifier is the one that yields the best results.
       ● Losing information when analysing a log of URL requests lowers the results. This happens when:
         o Undersampling data (because we randomly remove data).
         o Keeping the sequence of the requests of the initial log file while dividing it into training and test files.
  • 33. Conclusions
       ● As seen in the rules obtained, it is possible to develop a tool that automatically makes an allow or deny decision for URL requests, based on other features of the request and not only on the URL.
  • 34. Future Work
       ● Run experiments with bigger data sets (e.g. a whole workday).
       ● Include more lexical features of a URL in the experiments (e.g. number of subdomains, number of arguments, or the path).
       ● Consider sessions when classifying.
         o A session is defined as the set of requests made from a certain client during a certain time.
       ● Finally, implement a system and test it with real data, in real time.
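The lexical features named above could be extracted with the Python standard library as sketched below. The feature names and the convention of counting every host label left of the last two as a subdomain are assumptions, since the slides only name the features.

```python
from urllib.parse import parse_qsl, urlsplit


def lexical_features(url):
    """Extract the lexical URL features suggested as future work."""
    parts = urlsplit(url)
    labels = (parts.hostname or "").split(".")
    return {
        # Labels left of "<core domain>.<tld>" are counted as subdomains.
        "num_subdomains": max(len(labels) - 2, 0),
        "num_arguments": len(parse_qsl(parts.query, keep_blank_values=True)),
        "path": parts.path,
    }
```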
  • 35. Thank you for your attention. Questions? amorag@geneura.ugr.es, jmerelo@geneura.ugr.es, paloma@geneura.ugr.es. Twitter: @amoragar, @jjmerelo, @unintendedbear