Abstract
There has been a steady increase in the utilisation of mobile devices in recent years, due to
technological advances in mobile communications and the integrated features of
mobile units. Insider threat has always been a problematic issue faced by organisations and
with a large majority of users now equipped with mobile devices, it has now become an even
more prominent issue. Freedom of movement and ubiquitous accessibility to information make
it necessary for a solution to be available to address such potential threats. In this paper, a
unique profiling method is introduced using carefully selected database objects as well as data
concerning the location of the database requests, in order to produce a comprehensive intrusion
detection framework which is sensitive to the locations where database requests are initiated
from. Experiments implementing the system achieved promising detection rates with low rates
of false alarms.
Introduction
In recent times, advances in mobile computing and mobile communication have rapidly
intensified the use of mobile devices. This trend is mainly due to the convergence features of
the mobile units as well as the users’ ability to move around easily while using mobile devices.
Currently, storage capacity and processing power of the mobile devices are typically enough
to handle both personal and office work. However, if these mobile devices are lost or stolen,
the damage caused to the relevant organisation may be substantial, especially if sensitive
information is stored on such mobile devices. To overcome this issue, organisations may
create a database to store information centrally at a fixed server, with users just having to install
the necessary applications on their mobile units. In this way, users are still able to reap the
benefits of using mobile devices and enjoy the flexibility of movement while the risk of data
loss is simultaneously minimised.
Despite the said solution and the availability of several other techniques to protect data on
mobile devices (such as data encryption), there is still a type of risk that cannot be managed by
implementing these techniques, namely the risk of insiders trying to damage the database by
illegally modifying the data. These insiders would be the normal internal users within the
company who have the necessary access permissions to the data and who are aware of existing
policies and security mechanisms implemented by the organisation. In such a situation, none
of the encryption and access control techniques can prevent them from abusing their privilege.
Insider threats are growing in organisations1. Therefore, the deployment of an
intrusion detection system becomes necessary in order to alert administrators to
unauthorised data access by such insiders.
One of the methods commonly considered for detecting database attacks is behavioral analysis.
To implement this technique, specific profiles for the critical assets of the database need to be
generated. This process can be automated using data mining methods. Profiles can represent
the characteristics of a user or a group of users. However, the features may vary between
different organisations or departments. This is the reason why intrusion detection systems
should be adaptive to the specific requirements of an organisation.
Profiling behavioral activity is done through recording users’ access to the different objects or
the different levels of each object in a database. More detailed monitoring will produce more
accurate profiles. For example, a profile of a user's access to tables is normally less accurate
than one that records access to the attributes of the database tables. On the other hand, queries
that come in the form of a transaction also have their own effects on the database. Given that
different roles (representing a group of users or a department of the organisation) may display
similar behavior in the database through their transactions, it is important to analyse the
structure of the transactions and add them to the profile features.
As mentioned in our previous work2, a combination of two query profile structures which are
completely user-independent is introduced to form accurate profiles at transaction level. In this
study, these profiles are enhanced to be better suited to mobile database environments, in order
to produce a comprehensive intrusion detection framework which is sensitive to the locations
where database requests are initiated from. It is necessary to set out a definition of the term
“mobile database system” together with one core assumption made in this study, as follows:
Definition: In this study, a mobile database system is defined as a database system that receives
data access requests from mobile devices. In this case, the database records the location
information of each request.
Assumption: There is a relationship between access patterns and the geographical location of
the devices from which database requests are initiated.
Literature
This section aims to review a few issues that serve as background to this study. These
issues are intrusion detection systems (IDS) in general, location-aware IDS, and IDS at
database level. The gaps for each of the abovementioned areas are covered in this paper and
are further discussed at the end.
An IDS is defined as a mechanism to detect and report any suspicious activity occurring in the
system. The idea of such a mechanism has been introduced and increasingly
researched during the past decade. As defined in3, an IDS should be able to produce an alarm to
notify security administrators in case of any security threat. Two models of detection have
been defined by which decisions are made on the existence of threats; namely anomaly
detection and signature detection3-7. In anomaly detection, all activities are divided into two
categories, namely "normal" and "abnormal", with the "abnormal" behaviors regarded as
intrusions3,6,7.
Signature-based detection uses a predefined list of known intrusions to detect threats3,6-8. If a
behavior is close to the stored patterns, it will be considered as an intrusion.
IDS were first introduced to enhance the network security against newly emerged security risks
like Denial of Service (DoS) and against probing attacks9. However, the application of IDS has
now extended to cover various network domains, computer systems and even applications. An
example is the application of IDS at database level, as implemented in this paper.
Intrusion Detection Challenges: FP and FN
Although both signature-based and anomaly-based detection methods are widely used to detect
suspicious intruder activities, there are still major concerns regarding the accuracy of IDS
detection. For both detection methods, the rate of success or failure in detecting intrusions
would be the measure that determines whether a computer system is in a secure environment
or otherwise.
There are two common terminologies used when discussing IDS inaccuracy, being False
Positive (FP) rates and False Negative (FN) rates. They represent the accuracy level of the
intrusion detection system. FP represents mistaken detections for which the alarm is wrongly
generated, whereas FN means the failure to detect when a real attack is attempted. IDS usually
suffer from FPs and related alarms due to wrong detection results10. An IDS must be able to
fulfil reliability requirements in terms of precision in detecting attacks with minimum false
alarms11. Another challenging aspect of the performance of IDS relates to the amount of data
which an IDS can handle at any one time. This refers to the speed at which the system can
process incoming events in order to determine whether or not they are intrusions. When it
comes to database intrusion detection systems, which is the main topic of this study, the amount
of data requests that come through queries or transactions can determine the performance of
the system. Therefore, both accuracy and performance are factors which should be considered
when developing an IDS11. These two features have been carefully incorporated into the
proposed database IDS in this study.
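As an illustration, these accuracy measures can be computed from a confusion matrix of detection outcomes. The counts below are hypothetical and serve only to show how the rates relate; they are not results from this study:

```python
def detection_rates(tp, fp, tn, fn):
    """Compute common IDS accuracy measures from a confusion matrix."""
    fp_rate = fp / (fp + tn)          # benign events wrongly flagged
    fn_rate = fn / (fn + tp)          # attacks that slipped through
    detection_rate = tp / (tp + fn)   # complement of the FN rate
    return fp_rate, fn_rate, detection_rate

# Hypothetical evaluation: 95 attacks caught, 5 missed,
# 900 normal events passed, 20 wrongly alarmed.
fpr, fnr, dr = detection_rates(tp=95, fp=20, tn=900, fn=5)
```

Reducing one rate typically trades off against the other, which is why both must be reported when evaluating an IDS.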
Location-aware Intrusion Detection Systems
Advancement in mobile technology has made the usage of mobile devices more convenient
and popular. Factors such as long battery life, powerful processors, large storage capacities and
low weights have motivated organisations to migrate to mobile units rather than continue using
traditional computer systems. The most desirable benefit of using mobile units is the freedom
of movement.
On the other hand, data protection becomes a real concern since mobile devices are more prone
to be lost. Aside from passwords, several security measures can be implemented on mobile
devices such as fingerprint authentication12 and voice recognition13. However, data remains at
risk when a device is stolen. A portion of these data may contain confidential information that
can be used to access an organisations’ network, application and databases. In terms of database
access (which is the main focus of this study), there should be a solution in place to detect
anomalous requests, which would enable administrators to be informed as to whether the
request is sent by a legitimate user or an illegal one, given that the origin is the mobile unit for
which permissions have already been set for the registered account.
Research on mobility behavior of mobile phone users14 has shown that most of the observed
users usually use specific and limited paths and locations. In reaching this conclusion, the
mobile phones of 100,000 users were monitored. The outcome of the research indicates that the
location of mobile users is generally predictable. Therefore, this feature can be used to form
normal profiles of the mobile units in an organisation.
Network access patterns are used to build profiles15 in order to detect abnormal behavior of a
mobile user environment. Detection accuracy shows 90% success overall. Similarly,
information about users’ location is utilised for building smartphone user profiles16 with 81%
accuracy achieved. This research has been enhanced17 using two profiling
techniques, being an empirical cumulative probability measure and a trajectory model
using Markov properties. However, these valuable studies are limited to the standalone mobile
device and are not applicable for centralised database access.
Mobility and similarity detection have been widely used to detect anomalous activities in
mobile environment18-20. Expression-based algorithms are utilised to spot patterns in
mobility21. Mobility patterns are also utilised22 to address issues of social activity predictions.
Association rule mining23 helps to extract the relationships between user activities and their
locations. Besides association rule mining, other methods like k Nearest Neighbours (KNN),
Bayesian network, bipartite graphs, and neural networks are used to determine regularity and
detect malicious behaviors or malwares in mobile devices24-26. An intrusion detection system
for mobile devices was introduced by Sun et al.27 using a three-level Markov chain that is
independent of the time-to-location feature. It is effective only during phone calls when the
mobility speed is high. Another detection model, where predefined routes are used, is
proposed by Hall et al.28 with the focus being on mobility in public transportation.
However, all the reviewed techniques and models have a common feature which is dependence
on the user. This dependency may affect the overall performance of an IDS if it is implemented
centrally and is expected to deal with a large number of users. The focus of this paper is on
databases in mobile environments which users can access through their mobile devices.
Therefore, the computational task is performed in the system that hosts an instance of the
database, making it necessary to have a user-independent detection method at this point.
Database
Lu et al.29 proposed a method to detect malicious transactions by measuring transaction
violations in order to distinguish legal requests from illegal ones. Audit logs from the DBMS
were used to implement this idea. Similarly Fonseca et al.1 tried to identify abnormal activities
by defining and comparing normal patterns in the stored database logs. Although the technique
used by Lu et al.29 and Fonseca et al.1 is workable, it requires considerable human supervision,
and the detection of insider attacks is not automatic.
Networked Bayesian Network (NBN)30 is used to calculate the probability of an action being
an intrusion, in order to detect unauthorised insider attacks. This operation is done when a
number of critical objects are accessed in a transaction. The authors make the assumption that
half of the insider activities are to be regarded as intrusions. This is unsatisfactory since there
is no evidence or basis to support this assumption. Moreover, as mentioned by the authors, this
method is not applicable to authorised insider attacks.
A method which involves fingerprinting the transactions was used by Lee et al.31 to detect
unauthorised database access. They focused on Structured Query Language (SQL) statements
to create signatures for the intrusion detection system, which is considered as a signature-based
approach. This approach may not work efficiently for unknown access patterns.
Yaseen and Panda32 utilised a knowledge base to define the amount of information that insiders
have about data items, and subsequently proposed a threat prediction graph to detect and
prevent insider threats. Their work is a valuable contribution to this area of study since they
conducted in-depth analysis of the dependencies of database items. However, the work is based
on several unrealistic assumptions about insiders' access patterns, such as the assumption that
the probability of insiders misusing previously accessed data is equal to the probability of their
accessing new items.
Gaps
A majority of IDS focus on networks or hosts and therefore cannot be effective for databases.
A few detection systems have been proposed or implemented at database level33-35. Review of
current database IDS models and systems show that there is still a lot of work to be done in
terms of accuracy of detection engine and error rates in order to have a robust IDS at database
level. Most of the current models suffer from high FP34 or FN33 rates, which makes it necessary
to apply more accurate data preparation techniques before feeding data to the detection engine.
High FN becomes a limitation to the traditional signature-based detection method because an
effective expert system should only react to well-defined attacks. This causes reliability
problems in relation to new entries for which there is no pattern or signature in the system36.
High FP is also a challenge for the anomaly detection methods. Incomplete learning processes
lead to a high number of false alarms. Through this method, a statistical process is used to
detect undefined attacks based on user or system behavior. The performance of IDS that use
anomaly detection is highly dependent on the amount of correct behavioral statistics which it
gathers. System administrators usually suffer from having to handle a huge amount of FP alerts
daily37. Again, we can conclude that a more accurate repository of valid signatures and/or
behavioral statistics is required to overcome the challenges mentioned above.
Another issue that may affect the accuracy of database intrusion detection systems (dIDS) is
their handling of a mobile environment14,15. Although the detection engine is placed at database
level, it can still collect the mobility behavior of users, whose requests help to produce
more accurate patterns for detection purposes16. This hypothesis is supported by the
experiment conducted in the present study.
As stated above, despite many attempts to secure databases against intruders, a lot of effort is
still required to improve IDS in order to combat the phenomenon of insider attacks. Moreover,
none of the research works mentioned adapts its system to mobile environments where
requests are sent through mobile units. These gaps are covered in this study with the aim of
providing a robust detection framework which can effectively deal with insider threat in a
mobile environment.
To enhance the capability of database intrusion detection, a multi-layer profiling method is
introduced in this study, covering both context and result-set based approaches as well as query
structure. This model uses a unique structure incorporating carefully selected features which
help to increase accuracy and consequently reduce false alarms. The proposed profiling
happens at query level and then at transaction level. This helps to avoid dependencies on users
and user roles. The factor of mobility is also considered. A location abstraction model is
introduced to update profile structures in order to enhance the accuracy of detection engine
while processing queries in mobile database systems.
Mobile
The demand for mobile devices is constantly growing, with rapid innovation seen in terms of
technology, processing power, networking capabilities, and storage capacity. New features are
constantly being introduced and integrated in these devices, enabling more services to be
provided to users. Databases and data management services are no exceptions. More and more
mobile applications now offer connections to central databases, leading to a variety of
information being transferred into the mobile device. Accessing and exchanging data over
networks are activities which inevitably give rise to concerns regarding security and this issue
becomes even more serious when the access is ubiquitous. Therefore, more security features
should be introduced for mobile database systems, especially given the distributed nature of
mobile systems38,39.
A mobile database system is a system where applications can gain access to the database
through the use of mobile devices. In a mobile database system, either a portion of the
database or the entire database can be integrated into the mobile unit. However, because of
limitations such as battery power, capacity, and computational capabilities, the main database
usually remains stored in a fixed machine. In general, mobile databases are referred to as a
special form of distributed database system and can be defined in two forms. One is the
implementation of the whole database in fixed and wired distributed systems. The other is the
distribution of the database between wired and wireless systems, in which both fixed and mobile
machines perform the task of data management40. As shown in Figure 1, the scenario which
is selected for this study adopts the first-mentioned architecture, namely a mobile database
system where mobile units are connected to the database system (DBS) through mobile
networks. The database itself is also distributed among different locations. The fact that
mobile units can carry a part of the database does not have any effect on the current work,
since the main processing task for the proposed database intrusion detection system happens
at the main databases which are already synchronised.
Dataset
The main database is created in Microsoft SQL Server 2008 and is filled with about eight
million records. To follow a formal structure for the database, TPC-H standard is selected and
implemented according to the published datasheet41. Fields of each table are populated with
realistic sample data and transactions are created using dbgen and qgen tools respectively41.
Transactions contain one or more queries accessing several attributes of the database in order
to prepare the database log.
The log file is used as the main dataset for the purpose of data mining. The precise number of
queries is 1000, with 500 transactions being generated and launched randomly on the database.
These queries are used in the mining process and we consider the database log as an intrusion-
free dataset. Details of the mining procedure are explained in the following section.
Transactions contain a number of queries that help to gather enough data relationships to
produce an acceptable range of itemsets for the data mining process. 100 malicious transactions
are also generated to train the system and also to check the accuracy of detection. The malicious
transactions contain queries in which abnormal access patterns can be found, such as a drop
table command.
Mining Procedure
Several data mining tools are required for data analysis and pattern discovery that can help to
provide better strategies for business and scientific research areas. The existing gap between
raw data and useful information demonstrates the importance of data mining techniques to turn
plain data into valuable information. Therefore, some techniques are needed for knowledge
discovery that can automatically find logical relationships between data with minimum user
intervention.
Many issues may arise in terms of performance and efficiency when dealing with large volumes
of data in databases. It should be noted that the same mining method may not provide the
desired results when the objective of mining is altered. For IDS, there are a few frequently used
mining algorithms. Two common methods are association rule learning and sequence mining.
The effectiveness of rule mining is complemented with the application of PARAS, being an
approach to form a better representation of complex redundancy relationships42. Association
rules can be used to uncover relationships between seemingly unrelated data in relational databases or
any other kind of data repository. An association rule is divided into two parts, being “if” and
“then”. For example, “if a customer buys a dozen eggs, then he is 80% likely to purchase milk.”
In fact, association rules are constituted through analysing the data using “if/then” patterns and
using criteria such as “support” and “confidence” to identify the most important relationships.
"Support" determines the frequency of items appearing in the database, whilst "Confidence"
indicates how often the "if/then" relationship has been found to be true. Findings from other
studies 43,44 also support the use of apriori for dIDS. The first stage in the framework proposed
in this paper is to mine dataset with apriori algorithm and produce rules to be used later for
profiling. These rules will show the frequent itemsets including all data access which occur
frequently in the database.
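The support and confidence criteria from the shopping example above can be sketched directly. The transactions below are invented purely for illustration:

```python
# Toy transaction database: each transaction is a set of purchased items.
transactions = [
    {"eggs", "milk"},
    {"eggs", "milk", "bread"},
    {"eggs", "bread"},
    {"eggs", "milk"},
    {"milk"},
    {"eggs", "milk", "butter"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    """How often the consequent appears when the antecedent does."""
    return support(antecedent | consequent, db) / support(antecedent, db)

# "if a customer buys eggs, then they buy milk": 4 of the 5 egg
# transactions also contain milk, giving a confidence of 0.8.
conf = confidence({"eggs"}, {"milk"}, transactions)
```

The same two measures apply unchanged when the "items" are database accesses rather than purchases.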
With the help of association rule mining, data correlations can be highlighted. This
unsupervised learning process is the first phase of the proposed intrusion detection system. An
example of such an association is an update to the customer address of an order, which can be a
normal action. However, if the update also includes the customer name and the invoice price, this
may require more consideration. Similarly, if the address update task which is a “normal” office
job during office hours takes place at night, this can be considered as an intrusion. Here the
relationship between updating an item and time may be helpful. Several examples can be
discussed to demonstrate how data dependencies or correlations between items of an itemset
are able to help the IDS to produce more accurate decisions when processing new requests.
By using apriori algorithm, the number of data required to be processed to generate normal
patterns may be reduced given that itemsets that are not frequently accessed are removed from
the list. At the end, the most frequently accessed itemsets will remain and more accurate
profiles can thus be generated. This will minimise the size of data for intrusion detection
process. However, to what extent can we limit the item selection? This depends on two
factors, namely minimum support and minimum confidence43. Choosing a proper support
factor while searching the dataset can provide more accurate results. The purpose here is to
have frequent subsets in each frequent itemset.
A minimum threshold for the level of frequency is defined as minimum support. Having
different minimum supports may lead to having different numbers of rules. There is no specific
method for determining minimum support except by testing several numbers in order to find
the optimum one. In this experiment, the number is set at 30%. Minimum support is not an
absolute value and is relative to the dataset that is being used in the experiment. To identify a
suitable minimum support, several testing sessions were conducted using different numbers
and the results were measured in terms of the coverage of the rules on the dataset. It is observed
that for minimum supports above 30%, many of the rules for data dependencies
were missed. On the other hand, for lower values such as 25%, some rules that should not be
considered frequent itemsets were caught in the results. Finally, a midway value of 30%
was chosen as the minimum support. The same logic is applied for determining the value of
minimum confidence. Implementation of apriori is done through Weka45 libraries which are
used in the Java code presented in this work. The application, developed by the author, calls
the Weka mining engine to apply apriori to the dataset. Coding the application is necessary
in order to automate the mining procedures. After mining sequential patterns, apriori
generates read and write sequence sets. It then extracts dependency rules using minimum
confidence. This parameter is set at 80% in this experiment. At the end the output of apriori
will be dependency rules showing associations between data items in a dataset. Samples of
extracted rules for the utilised dataset of this work are presented in Table 1. Overall, 20 rules
are extracted by apriori which are then used in the profiling process.
This information helps us to gain better knowledge about normal behavior on database requests
in order to build more accurate user-independent profiles.
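The level-wise mining step can be illustrated with a minimal, hand-rolled sketch of the apriori idea using the study's thresholds of 30% minimum support and 80% minimum confidence. The access log, item names, and code are hypothetical stand-ins for the Weka-based Java implementation actually used:

```python
from itertools import combinations

MIN_SUPPORT = 0.30      # threshold chosen in this study
MIN_CONFIDENCE = 0.80   # threshold chosen in this study

# Hypothetical access log: each transaction is the set of
# (read/write, attribute) items it touched.
log = [
    {"R:c_address", "R:c_name"},
    {"R:c_address", "R:c_name", "W:o_total"},
    {"R:c_address", "R:c_name"},
    {"W:o_total"},
    {"R:c_address", "R:c_name", "R:c_phone"},
]

def frequent_itemsets(db, min_sup):
    """Level-wise search: keep k-itemsets meeting min support,
    then join survivors into (k+1)-itemset candidates."""
    items = {i for t in db for i in t}
    freq, k = {}, 1
    current = [frozenset([i]) for i in items]
    while current:
        level = {}
        for cand in current:
            sup = sum(cand <= t for t in db) / len(db)
            if sup >= min_sup:
                level[cand] = sup
        freq.update(level)
        keys = list(level)
        current = list({a | b for a in keys for b in keys if len(a | b) == k + 1})
        k += 1
    return freq

def extract_rules(freq, min_conf):
    """Split each frequent itemset into antecedent/consequent rules."""
    out = []
    for itemset, sup in freq.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for ante in map(frozenset, combinations(itemset, r)):
                conf = sup / freq[ante]
                if conf >= min_conf:
                    out.append((set(ante), set(itemset - ante), conf))
    return out

rules = extract_rules(frequent_itemsets(log, MIN_SUPPORT), MIN_CONFIDENCE)
```

In this toy log the rarely accessed `R:c_phone` item is pruned by the support threshold, and the surviving dependency rules link the address and name accesses, mirroring how infrequent itemsets are dropped before profiling.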
Transaction
Although the database log file contains all activities done by running queries, a proper
extraction and interpretation process is required in order to prepare a suitable query structure
for the data mining process. Profiling has recently been introduced and implemented in several
works as reviewed in Section 2. Profiling is used when the detection method is based on the
behavior of a user, a group of users (role), or queries. Therefore, the main issue that arises here
is the selection of suitable features of the log file to increase the accuracy level of detection and
to reduce errors accordingly.
Based on the association rules mentioned above, profiles are created for transactions that follow
the same dependency rules and their related itemsets. For example, if an itemset contains (), a
profile string will be created for the related transactions as "", together with the other items in
each transaction.
It should be noted that the order of items is important in the profile string and there should not
be any spaces. This is because a binary form of these profile strings is generated for the purpose
of comparison with incoming requests.
Profile
Profiling can be performed at the query level, context level or even result-set level. However,
for the purpose of this study, a combination of all the said levels is selected to form the profile
structure. The efficiency of this profile structure has been demonstrated in previous work2. In
addition, the location information of the database requests is also added to the profiles to meet
the requirements of a mobile database system. Therefore, the overall structure of the final
profile is as follows:{}. This profile represents the SQL command, attribute, table name, result
set, location and timestamp respectively.
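A minimal sketch of how such a profile might be serialised for comparison is shown below. The field values, and the fixed-order, no-separator binary encoding, are illustrative assumptions, since the exact profile string format is not reproduced here:

```python
from dataclasses import dataclass, astuple

@dataclass(frozen=True)
class Profile:
    """Profile fields named in the paper; values here are hypothetical."""
    command: str
    attribute: str
    table: str
    result_set: str
    location: str
    timestamp: str

def to_binary(p: Profile) -> bytes:
    # Fixed field order with no spaces or separators, so identical
    # profiles yield byte-identical strings for fast comparison.
    return "".join(astuple(p)).encode("utf-8")

stored = Profile("SELECT", "c_name", "CUSTOMER", "R1", "Z3", "T09")
incoming = Profile("SELECT", "c_name", "CUSTOMER", "R1", "Z3", "T09")
match = to_binary(stored) == to_binary(incoming)
```

Because the encoding is order-sensitive and separator-free, any deviation in any field, including location, produces a different byte string and flags the request for closer inspection.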
Feature
For complex queries in which multiple attributes are accessed, all attributes are selected with
the related command separately. For example, if the query41 is like:
“SELECT” is counted along with each set of attributes and tables. This means that if the said
query is assumed to be a normal query, highlighted pairs can be listed as {}. However, not all
sets would be considered in the final classification since certain data accessed would be ignored
through data mining procedure due to having low frequency. Therefore, only data which are
accessed frequently enough are listed in extracted rules from the applied mining algorithm.
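A simplified sketch of pairing the command with each accessed attribute and table might look as follows. The parsing logic and the sample query are illustrative assumptions; the actual feature extraction works from the database log rather than this toy regular expression:

```python
import re

def extract_pairs(sql: str):
    """List (command, attribute, table) triples from a simple
    single-table SELECT; illustrative only, not a full SQL parser."""
    m = re.match(r"\s*(SELECT)\s+(.+?)\s+FROM\s+(\w+)", sql, re.IGNORECASE)
    if not m:
        return []
    command, cols, table = m.group(1).upper(), m.group(2), m.group(3).upper()
    # The command is counted separately with each attribute it touches.
    return [(command, col.strip(), table) for col in cols.split(",")]

pairs = extract_pairs("SELECT c_name, c_address FROM customer WHERE c_id = 7")
```

Each resulting pair is then checked against the mined rules, so attributes whose access frequency fell below the support threshold are simply ignored at this stage.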
Classification
Classification is used to refine the training process in order to better distinguish between
intrusion transactions and non-intrusion transactions. For the supervised learning task, labeled
transactions need to be provided in order to teach the system about malicious queries. For this
purpose, several transactions are generated as intrusions consisting of malicious queries. These
transactions are then injected into the system, updating the classifier knowledge base. Through
this process, the proposed system is completely taught about normal and malicious profiles.
Therefore, the probability of an intrusion can be calculated depending on the level of the
system’s current knowledge about intrusions. In this study, a Naive Bayes classifier is used to
consider uncertainty and to find the probability of an intrusion occurrence. A brief explanation
of this classifier and its effects on the current study is presented below.
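To make the classification step concrete, the sketch below hand-rolls a Bernoulli Naive Bayes over hypothetical binary profile features. The feature names, the training samples, and the implementation itself are illustrative assumptions, not the Weka classifier used in the study:

```python
from collections import Counter
from math import log

# Toy labeled profiles: feature tuple (off_hours, bulk_update, foreign_location)
data = [
    ((0, 0, 0), "normal"), ((0, 0, 0), "normal"), ((0, 1, 0), "normal"),
    ((1, 1, 1), "intrusion"), ((1, 0, 1), "intrusion"), ((1, 1, 0), "intrusion"),
]

def train(samples):
    classes = Counter(lbl for _, lbl in samples)
    n_feat = len(samples[0][0])
    # Laplace-smoothed P(feature = 1 | class) for each feature.
    probs = {c: [(sum(x[i] for x, l in samples if l == c) + 1) /
                 (classes[c] + 2) for i in range(n_feat)] for c in classes}
    priors = {c: classes[c] / len(samples) for c in classes}
    return priors, probs

def predict(x, priors, probs):
    def loglik(c):
        return log(priors[c]) + sum(
            log(p if xi else 1 - p) for xi, p in zip(x, probs[c]))
    return max(priors, key=loglik)

priors, probs = train(data)
label = predict((1, 1, 1), priors, probs)
```

The probabilistic output is what allows the system to express uncertainty: rather than a hard rule match, each incoming profile receives a likelihood of being an intrusion given everything the classifier has seen so far.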
Implementation
As explained earlier, data generator tools are used to fill up the database tables with realistic
sample data. This means that if there are numeric, text and date fields in the tables, they are all
filled up with numbers, characters and dates. A total of eight million records is generated
in the database.
As mentioned earlier, transactions are generated based on recommendations41. However,
several transactions are also created manually in order to fill up the database log. All malicious
transactions are synthetic. They contain one or more queries with abnormal requests such as
deleting a table, updating an item out of normal time, updating an item out of normal location,
selecting all attributes, updating multiple unrelated attributes, etc. These queries are used in
intrusion tagged transactions.
The mining process is performed by Weka45. Weka is an open source project that provides both
data mining and classification features. It is used in this study to apply apriori algorithm and
Naive Bayes classifier on the dataset. Weka provides libraries that can be integrated into different
coding platforms and used within a programmer’s own code. In this way, a programmer can
call Weka to launch any of the available data mining algorithms on the predefined datasets. To
be able to use these libraries, Weka should be installed completely on the computer system.
Alternatively, a complete Weka package called weka.jar should be imported to the
programming platform45.
Generating location information is performed through simulation. For this purpose, Network
simulator 2 (NS-2)47 is used to simulate a mobile network with multiple devices
communicating with database server while moving around a predefined area. By using this
method, a complete location profile is collected and added to the profile structure mentioned
earlier. Abstracting location information into profiles follows a mathematical procedure which
is explained later in this paper. Adding locations to the profiles is necessary in order to convert
the queries into a location-aware query. A location-aware query refers to a query which
contains at least one location-related attribute or one location-related predicate48. This profile
item helps to increase the accuracy of the detection system for insiders who are using mobile
units to connect to the database. Simulation parameters are set out in Table 2, whilst a view of
the simulated area is shown in Figure 2.
To create location-based profiles, a standard model is required to represent the positions of
mobile units. There are currently two widely used models called the geometric model and the
symbolic model49. The geometric model defines locations with n-dimensional coordinates
including both longitude and latitude information. This can also be represented as a set of
coordinates showing a boundary or geographical area. In symbolic models, abstract symbols
are used to represent locations. This enables locations to be traced by name and consequently
makes the process of clustering and analysing easier, especially when the system requires
human decisions49. In this study, the symbolic location model is used to create location profiles.
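A minimal sketch of what a symbolic location model can look like in code, assuming hypothetical area names and bounding boxes (the paper's actual symbols and areas are not specified here): geometric coordinates are mapped onto named areas, so profiles can store human-readable labels instead of raw coordinates.

```java
// Symbolic location model sketch: map a geometric position onto a
// named area. Area names and bounds are illustrative assumptions.
public class SymbolicLocations {
    // Each area: name plus a bounding box [minX, minY, maxX, maxY].
    private static final Object[][] AREAS = {
        { "branch-north", 0.0, 50.0, 100.0, 100.0 },
        { "branch-south", 0.0,  0.0, 100.0,  50.0 },
    };

    public static String symbolFor(double x, double y) {
        for (Object[] a : AREAS) {
            if (x >= (double) a[1] && y >= (double) a[2]
                    && x <= (double) a[3] && y <= (double) a[4])
                return (String) a[0];
        }
        return "unknown";   // outside every named area
    }
}
```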
Statistics can prove immensely helpful in achieving this goal. Most database requests
come from mobile devices belonging to internal staff members, and such requests are directly
linked to the locations of their users. By exploiting this fact, it is possible to formulate a method to divide
geographical areas into the mobile units’ related locations. For example, in a police database,
most requests to access records relating to a specific city would come from police officers who
are working in the same city. Similarly, requests to access the medical records of a patient in a
multi-branch hospital would normally be made by staff of the same hospital branch as the
patient. Although there are exceptions to these examples, the majority would follow the same
logical structure. This assumption may help to enhance the accuracy of an IDS. For example,
when the record of a patient in a particular hospital branch is updated by a request that comes
from a different branch, or when an event related to a particular city is updated by a request
which comes from a mobile unit located in another city, these are not to be treated as normal
requests. The system should bring them to the administrators' attention through alarms.
In order to divide location information into logical areas, it is necessary to find the most
frequently repeated locations, set them as the centres of the distribution, and then calculate
and list their neighbours in the same group. Statistically, the mean value serves this purpose:
it is simply the average of a group of data, computed by adding up all the values and dividing
by their total number.
Once the mean has been determined, the degree to which each location deviates from the mean
value is calculated. This helps to identify a threshold for dividing location information into
separate geographical areas. Standard deviation is used for this purpose: it measures the
distance of each observation from the mean value and provides an average deviation.
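The mean and standard deviation steps described above can be sketched as follows; the one-dimensional coordinates and the one-standard-deviation grouping threshold are illustrative assumptions, not the paper's exact procedure:

```java
// Sketch of the grouping step: a location within one standard
// deviation of the mean is treated as part of the same logical area.
public class LocationGrouping {
    public static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    public static double stdDev(double[] xs) {
        double m = mean(xs), sq = 0;
        for (double x : xs) sq += (x - m) * (x - m);
        return Math.sqrt(sq / xs.length);   // population standard deviation
    }

    // True if the observation lies within the deviation threshold of
    // the mean position, i.e. belongs to the same geographical group.
    public static boolean sameArea(double[] xs, double x) {
        return Math.abs(x - mean(xs)) <= stdDev(xs);
    }
}
```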
Performance Metrics
Performance metrics are used to measure the accuracy and reliability of the proposed IDS. FP,
FN and accuracy are the metrics used in this experiment.
In general, a higher true positive rate increases the accuracy of detection engines. Similarly,
fewer detection errors result in lower FP and FN rates.
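Assuming the standard confusion-matrix definitions of these metrics (the paper's own equations are not reproduced here), they can be computed as:

```java
// Standard confusion-matrix metrics under the usual definitions:
// FP rate = FP / (FP + TN), FN rate = FN / (FN + TP),
// accuracy = (TP + TN) / all decisions.
public class Metrics {
    // FP rate: valid requests wrongly flagged as intrusions.
    public static double fpRate(int fp, int tn) { return (double) fp / (fp + tn); }

    // FN rate: intrusions the system failed to flag.
    public static double fnRate(int fn, int tp) { return (double) fn / (fn + tp); }

    // Accuracy: correct decisions over all decisions.
    public static double accuracy(int tp, int tn, int fp, int fn) {
        return (double) (tp + tn) / (tp + tn + fp + fn);
    }
}
```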
Data Validation
K-fold cross validation (CV) is used to evaluate the accuracy and validity of the dataset utilised
in this study. In k-fold CV, the dataset is divided into equal partitions called folds52. A model
is then defined and applied to each partition in order to calculate results based on
predetermined metrics. The detection framework proposed in this study is used as the model,
and detection accuracy is adopted as the metric to test and validate the dataset. Having defined
the k folds, one is selected to test the dataset whilst the remaining folds are used to train the
system. This is repeated for each fold until every fold has played the role of both trainer and
tester of the dataset. At the end, the average of the observed results represents the overall
effectiveness of the system.
In order to provide an appropriate representation of the dataset in each fold, data is classified
and arranged to ensure that each fold contains an equal share of the whole dataset. This means
that the number of instances in every fold should be the same even though the instances are
selected randomly. The effectiveness of k-fold CV has been demonstrated and recommended in 52-54.
Proposed Framework
The main components of the proposed framework are profile generation, transaction
processing, and probability checking. Each of these components is discussed in this section.
Application of this framework on the utilised dataset has shown interesting results which
demonstrate higher accuracy in terms of intrusion detection. The structure of the framework is
obtained from our previous work2. However, a more detailed and accurate profiling method has
been applied here, and the framework is enhanced for mobile database systems, as shown in Figure
4.
Incoming Transactions
A database transaction containing one or more queries is injected into the system through Java
code in order to determine whether or not it contains intrusions. Each query is converted into a
profile string so that it can be compared with the existing intrusion or valid patterns. The
conversion is straightforward: all the required elements are extracted from the query and listed
next to each other, separated by commas. Every word inside the transaction is compared with
a list of keywords; if it matches, it is added to the new string for further comparison.
Keywords are obtained from the items extracted by mining the database queries, and are listed
as frequently accessed itemsets or association rules in the output of the mining procedure.
The profile strings are then converted into binary in order to ease the comparison process.
Transactions are read from a text file and loaded into a temporary string variable, so there is
no constraint on loading and processing incoming transactions through the Java code. The
exclusive OR (XOR) operation is used to compare binary records, where an output of ‘0’
represents equality of the two binary strings. This process follows Algorithm 2, as explained
below.
Instructions in Algorithm 2 show how the system behaves with new transactions. In summary,
if the text string generated for an incoming transaction does not match any existing valid
pattern, it is sent to the classifier in order to check the probability of intrusion. The Naive
Bayes classifier used in this study helps to determine whether or not the new pattern is an
intrusion by measuring the level of similarity between the new pattern and existing ones. On
the other hand, if the pattern of the incoming transaction matches existing patterns, it is
considered a normal request and can pass through the system without further examination.
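The conversion to binary and the XOR comparison can be sketched as below; the keyword list and the profile string are hypothetical examples, not the paper's actual keyword set:

```java
// Sketch of the pattern comparison step: a profile string is turned
// into a fixed-order bit string over the keyword list, and XOR of two
// equal patterns yields all zeros.
public class PatternCompare {
    // One bit per keyword: '1' if the comma-separated profile contains it.
    public static String toBinary(String profile, String[] keywords) {
        java.util.Set<String> items =
            new java.util.HashSet<>(java.util.Arrays.asList(profile.split(",")));
        StringBuilder bits = new StringBuilder();
        for (String k : keywords) bits.append(items.contains(k) ? '1' : '0');
        return bits.toString();
    }

    // True when the XOR of the two bit strings is all zeros,
    // i.e. the incoming pattern matches a stored one.
    public static boolean matches(String a, String b) {
        if (a.length() != b.length()) return false;
        for (int i = 0; i < a.length(); i++)
            if ((a.charAt(i) ^ b.charAt(i)) != 0) return false;
        return true;
    }
}
```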
A further possibility is a profile string that contains no attributes. In such a case, the
incoming transaction is detected as an intrusion and an alarm is generated to end the
process. This is because SQL commands which do not contain target attributes most probably
fall into an intrusion category such as “” or SQL injection attacks like “ “ in which “1” is an
attribute used to bypass the condition.
Probability Checking
As stated earlier, if an incoming request does not match any existing valid pattern, it is
sent for checking by the Naive Bayes classifier. At this stage, the probabilities of the request
being an “intrusion” and a “non-intrusion” are both calculated by the system using Equation 4.
The greater probability is taken as the output of this procedure.
If it is concluded that the incoming entry is an intrusion, an alarm will be generated and the
intrusion repository will be updated accordingly. However, if it is concluded that the incoming
entry is not an intrusion, the valid pattern repository is updated with the new pattern.
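A minimal two-class Naive Bayes decision over binary pattern features can sketch this step; it is an illustrative stand-in for the comparison of the two posteriors, not a reproduction of the paper's Equation 4, and the training rows are hypothetical:

```java
// Minimal two-class Naive Bayes decision: the class with the larger
// posterior (prior times per-feature likelihoods) wins. Laplace
// smoothing keeps an unseen feature value from zeroing a probability.
public class BayesCheck {
    // trainIntrusion / trainValid: rows of 0/1 pattern features.
    public static boolean isIntrusion(int[][] trainIntrusion, int[][] trainValid,
                                      int[] pattern) {
        double pIntr  = score(trainIntrusion, pattern) * trainIntrusion.length;
        double pValid = score(trainValid, pattern) * trainValid.length;
        return pIntr > pValid;
    }

    private static double score(int[][] rows, int[] pattern) {
        double p = 1.0;
        for (int f = 0; f < pattern.length; f++) {
            int hits = 0;
            for (int[] r : rows) if (r[f] == pattern[f]) hits++;
            p *= (hits + 1.0) / (rows.length + 2.0);   // Laplace smoothing
        }
        return p;
    }
}
```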
Intrusion Repository
The intrusion repository is a list of all intrusions detected by the system. In the beginning this
repository is empty; it grows gradually as the system starts to detect intrusions. As shown
in Figure 4, having such a repository helps to detect repeated intrusions without having to send
them for examination. This reduces the detection time dramatically, leading to an overall higher
performance. Significant processing is required to compare incoming requests with existing
valid patterns, and this work can be skipped whenever a matching pattern is found in the
intrusion repository. The repository is updated whenever an alarm is generated due to an
intrusion. The code which updates the intrusion repository follows Algorithm 3, as presented
below.
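A hypothetical rendering of the repository's behaviour (Algorithm 3 itself is not reproduced here): known intrusion patterns are cached so repeated attacks can raise an alarm without re-running the classifier.

```java
// Sketch of the intrusion repository: a cache of previously flagged
// binary profile patterns. Names and structure are illustrative.
public class IntrusionRepository {
    private final java.util.Set<String> known = new java.util.HashSet<>();

    // True if the pattern was flagged before; the alarm can then be
    // raised immediately without re-running the Naive Bayes check.
    public boolean isKnownIntrusion(String pattern) {
        return known.contains(pattern);
    }

    // Called whenever the classifier raises an alarm for a new pattern.
    public void record(String pattern) {
        known.add(pattern);
    }

    public int size() { return known.size(); }
}
```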
Implementation
As noted earlier, the k-fold cross validation method is adopted to assess the validity of the data
used in both the training and testing processes. For this purpose, the dataset is divided into five
segments, called folds here. Each segment is stored in a separate file to be used as either
a training set or a test set. The roles of training set and test set rotate between the folds
so that every fold ends up performing both roles.
The code implementing k-fold CV is written in Java and connected to the Weka
application in order to apply the algorithms. There are 1000 normal patterns and 100 malicious
patterns, all of which are synthetic. The valid patterns are randomly divided into 5 groups, each
containing 200 patterns. The same division is applied to the intrusions, with 20 patterns in each
group. The valid groups and the intrusion groups are then merged to produce 5 folds
of 220 patterns each, containing both normal and malicious records. Since k is 5 here, the
data validation experiment is repeated 5 times so that each fold is used as both a training set
and a test set. Therefore, in each experiment there are 4 folds used to train
the system and 1 fold used to test it. The metric is accuracy, as defined in
Equation 9.
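The stratified split described above can be sketched as follows, using synthetic pattern identifiers; the shuffling seed and identifier format are illustrative assumptions:

```java
// Sketch of the stratified 5-fold split: 1000 valid and 100 intrusion
// patterns are shuffled and dealt round-robin into k folds, so each
// fold of 220 keeps the same valid-to-intrusion ratio (200 : 20).
public class FoldBuilder {
    public static java.util.List<java.util.List<String>> stratifiedFolds(
            java.util.List<String> valid, java.util.List<String> intrusions,
            int k, long seed) {
        java.util.List<java.util.List<String>> folds = new java.util.ArrayList<>();
        for (int i = 0; i < k; i++) folds.add(new java.util.ArrayList<>());
        java.util.Random rnd = new java.util.Random(seed);
        java.util.List<String> v = new java.util.ArrayList<>(valid);
        java.util.List<String> m = new java.util.ArrayList<>(intrusions);
        java.util.Collections.shuffle(v, rnd);
        java.util.Collections.shuffle(m, rnd);
        for (int i = 0; i < v.size(); i++) folds.get(i % k).add(v.get(i));
        for (int i = 0; i < m.size(); i++) folds.get(i % k).add(m.get(i));
        return folds;
    }
}
```

In each of the 5 runs, one fold is held out for testing and the other four train the system.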
Once the accuracy has been measured for all iterations of the experiment, the average of the
results is used to represent the final detection rate, which is 96.1%.
Results
The same detection methods, dataset, and framework used for general databases are applied to
check the effectiveness of the enhanced profile structure in mobile database intrusion
detection cases2. As expected, adding location profiles reduces errors in mobile environments:
the FN and FP rates are reduced by 14% and 0.84%, respectively. The results and a comparison
with previous works are shown in Table 4. It can be seen that better results are obtained for
the detection method proposed in this paper.
It is observed that the application of the proposed enhanced profiling method (which relates
location information to the valid patterns) is able to improve accuracy. Reduction in FPR is
also considerable and demonstrates how knowledge about locations can be exploited to reduce
errors in the system.
Discussion
As mentioned earlier, the raw data accessed are mined with association rules. The mining
algorithm extracts the most frequent itemsets for any related attributes. This means that the
output is a rule showing a relationship between two or more attributes, derived from the
fact that they are frequently accessed together. Consequently, data access patterns with low
frequency are removed from the list, so profiles created from these rules are more accurate
than profiles built from all data accessed. This method is the key to reducing FP rates.
On the other hand, by applying accurate and relative location information to the profiles, more
precise knowledge can be obtained on the normal behavior of data requests in a mobile
environment. This helps to reduce the overall FN rates. Therefore, the unique profiling
technique proposed here, together with the decision-making method used in this study, results
in higher detection accuracy and fewer false alarms.
Conclusion
An effective intrusion detection framework, optimised for mobile database environments, has
been introduced and discussed in this study. Findings from implementation of the proposed
detection method show a high level of accuracy and a low level of FP rates. This is due to the
unique profiling method and detection mechanism applied in the system.
The detection system is designed in such a way that it is sensitive to the locations from which
database requests are initiated. Based on the performed experiment, promising results were
obtained in terms of accuracy and FN rates, being 96.1% and 17% respectively. This shows
better outcomes as compared to previous works. An accurate profiling method, the addition of
location information, and a comprehensive detection procedure are the key features of this
experiment.
Feature selection is always a challenge while creating profiles. Other combinations of query
features can be tested to evaluate the efficiency of the system. However, adding more features
may affect the performance of detection systems. This should be considered when designing
detection solutions. Although higher performance is already achieved by applying an intrusion
repository in the proposed framework, more work can be done to enhance the processing speed,
for example by using multi-thread coding.
There is also a chance to increase the detection rate if the system can be combined and
customised with an organisation’s policy repository. Although this will make the system
specifically tailored to each organisation, it will increase the accuracy of the valid patterns.
Policies, in which sensitive attributes or requests can be defined, vary between organisations,
and such definitions help to highlight abnormal activities more accurately. This increases the
overall accuracy of the IDS.
Upgrading the proposed dIDS to a database intrusion prevention system (dIPS) and using a
cloud service to make the detection engine available at all times through the internet are two
enhancements that are currently trending topics. These features can perhaps be explored
in future work.


1639(pm proofreading)(tracked)

  • 1. Abstract There has been a steady increase in the utilisation of mobile devices recently, due to technology advancements attained in the area of mobile communications and the integrated features of mobile units. Insider threat has always been a problematic issue faced by organisations and with a large majority of users now equipped with mobile devices, it has now become an even more prominent issue. Freedom of movement and ubiquitous accessibility to information make it necessary for a solution to be available to address such potential threats. In this paper, a unique profiling method is introduced using carefully selected database objects as well as data concerning the location of the database requests, in order to produce a comprehensive intrusion detection framework which is sensitive to the locations where database requests are initiated from. The conduct of experiments implementing the system has resulted in promising detection rates achieved, with low rates of false alarms observed. Introduction In recent times, advances in mobile computing and mobile communication have rapidly intensified the use of mobile devices. This trend is mainly due to the convergence features of the mobile units as well as the users’ ability to move around easily while using mobile devices. Currently, storage capacity and processing power of the mobile devices are typically enough to handle both personal and office work. However, if these mobile devices are lost or stolen, the damage caused to the relevant organisation may be substantial especially if sensitive information are stored on such mobile devices. To overcome this issue, organisations may create a database to store information centrally at a fixed server, with users just having to install the necessary applications on their mobile units. In this way, users are still able to reap the
  • 2. benefits of using mobile devices and enjoy the flexibility of movement while the risk of data loss is simultaneously minimised. Despite the said solution and the availability of several other techniques to protect data on mobile devices (such as data encryption), there is still a type of risk that cannot be managed by implementing these techniques, namely the risk of insiders trying to damage the database by illegally modifying the data. These insiders would be the normal internal users within the company who have the necessary access permissions to the data and who are aware of existing policies and security mechanisms implemented by the organisation. In such a situation, none of the encryption and access control techniques can prevent them from abusing their privilege. Insider threats are increasingly growing in organisations1. Therefore, deployment of an intrusion detection system becomes necessary in order to alert administrators against unauthorised data access by such insiders. One of the methods commonly considered for detecting database attacks is behavioral analysis. To implement this technique, specific profiles for the critical assets of the database need to be generated. This process can be automated using data mining methods. Profiles can represent the characteristics of a user or a group of users. However, the features may vary between different organisations or departments. This is the reason why intrusion detection systems should be adaptive to the specific requirements of an organisation. Profiling behavioral activity is done through recording users’ access to the different objects or the different levels of each object in a database. More detailed monitoring will produce more accurate profiles. For example, a profile for user’s access to tables is normally less accurate as compared to the one with access attributes of the database tables. On the other hand, queries
that come in the form of a transaction also have their own effects on the database. Given that different roles (representing a group of users or a department of the organisation) may display similar behavior in the database through their transactions, it is important to analyse the structure of the transactions and add it to the profile features. As mentioned in our previous work2, a combination of two query profile structures which are completely user-independent was introduced to form accurate profiles at transaction level. In this study, these profiles are enhanced to be better suited to mobile database environments, in order to produce a comprehensive intrusion detection framework which is sensitive to the locations where database requests are initiated from. It is necessary to set out a definition of the term “mobile database system” together with one core assumption made in this study, as follows:

Definition: In this study, a mobile database system is defined as a database system that receives data access requests from mobile devices. In this case, the database records the location information of each request.

Assumption: There is a relationship between access patterns and the geographical location of the devices from which database requests are initiated.

Literature

This section reviews a few issues that serve as background to this study: intrusion detection systems (IDS) in general, location-aware IDS, and IDS at database level. The gaps in each of the abovementioned areas are covered in this paper and are further discussed at the end. An IDS is defined as a mechanism to detect and report any suspicious activity occurring in the system. The idea of such a mechanism has been introduced and has been increasingly
researched during the past decade. As defined in3, an IDS should be able to produce an alarm to notify security administrators in case of any security threat. Two models of detection have been defined by which decisions are made on the existence of threats, namely anomaly detection and signature detection3-7. In anomaly detection, all activities are divided into two categories, “normal” and “abnormal”, with the “abnormal” behaviors regarded as intrusions3,6,7. Signature-based detection uses a predefined list of known intrusions to detect threats3,6-8. If a behavior is close to the stored patterns, it is considered an intrusion. IDS were first introduced to enhance network security against newly emerged security risks such as Denial of Service (DoS) and probing attacks9. However, the application of IDS has now extended to cover various network domains, computer systems and even applications. An example is the application of IDS at database level, as implemented in this paper.

Intrusion Detection Challenges: FP and FN

Although both signature-based and anomaly-based detection methods are widely used to detect suspicious intruder activities, there are still major concerns regarding the accuracy of IDS detection. For both detection methods, the rate of success or failure in detecting intrusions is the measure that determines whether a computer system is in a secure environment or otherwise. There are two common terms used when discussing IDS inaccuracy: the False Positive (FP) rate and the False Negative (FN) rate. They represent the accuracy level of the intrusion detection system. FP represents mistaken detections for which the alarm is wrongly
generated, whereas FN means the failure to detect when a real attack is attempted. IDS usually suffer from FPs and the related alarms due to wrong detection results10. An IDS must be able to fulfil reliability requirements in terms of precision in detecting attacks with minimum false alarms11.

Another challenging aspect of the performance of an IDS relates to the amount of data which it can handle at any one time. This refers to the speed at which the system can process incoming events in order to determine whether or not they are intrusions. When it comes to database intrusion detection systems, which are the main topic of this study, the volume of data requests that come through queries or transactions can determine the performance of the system. Therefore, both accuracy and performance are factors which should be considered when developing an IDS11. These two features have been carefully incorporated into the proposed database IDS in this study.

Location-aware Intrusion Detection Systems

Advancement in mobile technology has made the usage of mobile devices more convenient and popular. Factors such as long battery life, powerful processors, large storage capacities and low weight have motivated organisations to migrate to mobile units rather than continue using traditional computer systems. The most desirable benefit of using mobile units is the freedom of movement. On the other hand, data protection becomes a real concern since mobile devices are more likely to be lost. Aside from passwords, several security measures can be implemented on mobile devices, such as fingerprint authentication12 and voice recognition13. However, data remains at risk when a device is stolen. A portion of these data may contain confidential information that can be used to access an organisation’s network, applications and databases. In terms of database access (which is the main focus of this study), there should be a solution in place to detect
anomalous requests, which would enable administrators to be informed as to whether a request is sent by a legitimate user or an illegitimate one, given that the origin is a mobile unit for which permissions have already been set for the registered account.

Research on the mobility behavior of mobile phone users14 has shown that most of the observed users usually follow specific and limited paths and locations. In reaching this conclusion, the mobile phones of 100,000 users were monitored. The outcome of the research indicates that the location of mobile users is generally predictable. Therefore, this feature can be used to form normal profiles of the mobile units in an organisation. Network access patterns have been used to build profiles15 in order to detect abnormal behavior in a mobile user environment, with detection accuracy showing 90% success overall. Similarly, information about users’ locations has been utilised to build smartphone user profiles16, with 81% accuracy achieved. Enhancements to this research have been made17 using two profiling techniques: an empirical cumulative probability measure and a model of trajectories using Markov properties. However, these valuable studies are limited to the standalone mobile device and are not applicable to centralised database access.

Mobility and similarity detection have been widely used to detect anomalous activities in mobile environments18-20. Expression-based algorithms have been utilised to spot patterns in mobility21. Mobility patterns have also been utilised22 to address issues of social activity prediction. Association rule mining23 helps to extract the relationships between user activities and their locations. Besides association rule mining, other methods such as k Nearest Neighbours (KNN), Bayesian networks, bipartite graphs, and neural networks are used to determine regularity and detect malicious behaviors or malware on mobile devices24-26. An intrusion detection system
for mobile devices was introduced by Sun et al.27 using a three-level Markov chain which is independent of the time-to-location feature. It is effective only during phone calls, when the mobility speed is high. Another detection model, in which predefined routes are used, was proposed by Hall et al.28, with the focus being on mobility in public transportation. However, all the reviewed techniques and models have a common feature, which is dependence on the user. This dependency may affect the overall performance of an IDS if it is implemented centrally and is expected to deal with a large number of users. The focus of this paper is on databases in mobile environments which users can access through their mobile devices. Therefore, the computational task is performed in the system that hosts an instance of the database, making it necessary to have a user-independent detection method at this point.

Database

Lu et al.29 proposed a method to detect malicious transactions by measuring transaction violations in order to distinguish legal requests from illegal ones. Audit logs from the DBMS were used to implement this idea. Similarly, Fonseca et al.1 tried to identify abnormal activities by defining and comparing normal patterns in the stored database logs. Although the technique used by Lu et al.29 and Fonseca et al.1 is workable, it needs a lot of human supervision and the detection of insider attacks is not automatic.

A Networked Bayesian Network (NBN)30 has been used to calculate the probability of an action being an intrusion, in order to detect unauthorised insider attacks. This operation is done when a number of critical objects are accessed in a transaction. The authors make the assumption that half of the insider activities are to be regarded as intrusions. This is unsatisfactory since there
is no evidence or basis to support this assumption. Moreover, as mentioned by the authors, this method is not applicable to authorised insider attacks. A method which involves fingerprinting transactions was used by Lee et al.31 to detect unauthorised database access. They focused on Structured Query Language (SQL) statements to create signatures for the intrusion detection system, which is considered a signature-based approach. This approach may not work efficiently for unknown access patterns. Yaseen and Panda32 utilised a knowledge base to define the amount of information that insiders have about data items, and subsequently proposed a threat prediction graph to detect and prevent insider threats. Their work is a valuable contribution to this area of study since they conducted an in-depth analysis of the dependencies of database items. However, the work is based on several unrealistic assumptions about insiders’ access patterns, such as the assumption that the probability of misusing data previously accessed by insiders is equal to the probability of accessing new items.

Gaps

A majority of IDS focus on networks or hosts and therefore cannot be effective for databases. A few detection systems have been proposed or implemented at database level33-35. A review of current database IDS models and systems shows that there is still a lot of work to be done on the accuracy of the detection engine and on error rates in order to have a robust IDS at database level. Most of the current models suffer from high FP34 or FN33 rates, which makes it necessary to apply more accurate data preparation techniques before feeding data into the detection engine. High FN rates are a limitation of the traditional signature-based detection method because an effective expert system should only react to well-defined attacks. This causes reliability problems in relation to new entries for which there is no pattern or signature in the system36.
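The FP and FN rates discussed throughout this section can be made concrete with a short sketch. The helper below is illustrative only (it is not part of the paper's implementation); it computes the FP rate, FN rate and overall accuracy from the four outcome counts of a labelled test run:

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute IDS accuracy measures from labelled test results.

    tp: real attacks correctly flagged    fp: normal requests wrongly flagged
    tn: normal requests correctly passed  fn: real attacks missed
    """
    fp_rate = fp / (fp + tn)                    # false alarm rate
    fn_rate = fn / (fn + tp)                    # missed detection rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)  # overall correctness
    return fp_rate, fn_rate, accuracy
```

A robust database IDS aims to drive both rates down simultaneously; lowering one at the expense of the other merely trades alert fatigue for missed attacks.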
High FP rates are also a challenge for anomaly detection methods, where incomplete learning processes lead to a high number of false alarms. In this method, a statistical process is used to detect undefined attacks based on user or system behavior. The performance of an IDS that uses anomaly detection is highly dependent on the amount of correct behavioral statistics it gathers. System administrators usually suffer from having to handle a huge number of FP alerts daily37. Again, we can conclude that a more accurate repository of valid signatures and/or behavioral statistics is required to overcome the challenges mentioned above.

Another issue that may affect the accuracy of database intrusion detection systems (dIDS) is their handling of a mobile environment14,15. Although detection engines are placed at database level, they can still collect the mobility behavior of users, whose requests can help to produce more accurate patterns for detection purposes16. This hypothesis is supported by the experiment conducted in the present study.

As stated above, despite many attempts to secure databases against intruders, a lot of effort is still required to improve IDS in order to combat the phenomenon of insider attacks. Moreover, none of the research works mentioned adapts its system to mobile environments where requests are sent through mobile units. These gaps are covered in this study with the aim of providing a robust detection framework which can effectively deal with insider threats in a mobile environment. To enhance the capability of database intrusion detection, a multi-layer profiling method is introduced in this study, covering both context-based and result-set-based approaches as well as query structure. This model uses a unique structure incorporating carefully selected features which help to increase accuracy and consequently reduce false alarms. The proposed profiling
happens first at query level and then at transaction level. This helps to avoid dependencies on users and user roles. The factor of mobility is also considered: a location abstraction model is introduced to update the profile structures in order to enhance the accuracy of the detection engine while processing queries in mobile database systems.

Mobile

The demand for mobile devices is constantly growing, with rapid innovation seen in terms of technology, processing power, networking capabilities, and storage capacity. New features are constantly being introduced and integrated in these devices, enabling more services to be provided to users. Databases and data management services are no exception. More and more mobile applications now offer connections to central databases, leading to a variety of information being transferred onto the mobile device. Accessing and exchanging data over networks inevitably gives rise to concerns regarding security, and this issue becomes even more serious when the access is ubiquitous. Therefore, more security features should be introduced for mobile database systems, especially given the distributed nature of mobile systems38,39.

A mobile database system is a system in which applications can gain access to the database through mobile devices. In a mobile database system, either a portion of the database or the entire database can be integrated into the mobile unit. However, because of limitations such as battery power, capacity, and computational capabilities, the main database usually remains stored on a fixed machine. In general, mobile databases are referred to as a special form of distributed database system and can be defined in two forms. One is the implementation of whole databases in fixed and wired distributed systems. The other distributes the database between wired and wireless systems, in which both fixed and mobile
machines perform the task of data management40. As shown in Figure 1, the scenario selected for this study adopts the first-mentioned architecture, namely a mobile database system where mobile units are connected to the database system (DBS) through mobile networks. The database itself is also distributed among different locations. The fact that mobile units can carry a part of the database does not have any effect on the current work, since the main processing task for the proposed database intrusion detection system happens at the main databases, which are already synchronised.

Dataset

The main database is created in Microsoft SQL Server 2008 and is filled with about eight million records. To follow a formal structure for the database, the TPC-H standard is selected and implemented according to the published datasheet41. Fields of each table are populated with realistic sample data, and transactions are created using the dbgen and qgen tools respectively41. Transactions contain one or more queries accessing several attributes of the database in order to prepare the database log. The log file is used as the main dataset for the purpose of data mining.

The precise number of queries is 1000, with 500 transactions being generated and launched randomly on the database. These queries are used in the mining process, and we consider the database log an intrusion-free dataset. Details of the mining procedure are explained in the following section. Transactions contain a number of queries that help to gather enough data relationships to produce an acceptable range of itemsets for the data mining process. 100 malicious transactions are also generated to train the system and to check the accuracy of detection. The malicious transactions contain queries in which abnormal access patterns can be found, such as a DROP TABLE command.
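As a rough illustration of how such abnormal access patterns can be tagged when labelling synthetic transactions, consider the sketch below. The pattern list and function name are hypothetical; the paper's actual queries and transactions are produced with the dbgen and qgen tools:

```python
# Hypothetical patterns treated as abnormal when labelling transactions.
ABNORMAL_PATTERNS = ("drop table", "truncate table", "select *")

def label_transaction(queries):
    """Label a transaction 'malicious' if any of its queries
    contains one of the abnormal access patterns."""
    for query in queries:
        text = query.lower()
        if any(pattern in text for pattern in ABNORMAL_PATTERNS):
            return "malicious"
    return "normal"
```

A labelled set produced this way can then serve both to train the classifier and to measure detection accuracy.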
Mining Procedure

Several data mining tools are required for data analysis and pattern discovery, which can help to provide better strategies for business and scientific research. The existing gap between raw data and useful information demonstrates the importance of data mining techniques to turn plain data into valuable information. Therefore, techniques are needed for knowledge discovery that can automatically find logical relationships between data with minimum user intervention. Many issues may arise in terms of performance and efficiency when dealing with large volumes of data in databases. It should be noted that the same mining method may not provide the desired results when the objective of mining is altered.

For IDS, there are a few frequently used mining algorithms. Two common methods are association rule learning and sequence mining. The effectiveness of rule mining is complemented by the application of PARAS, an approach that forms a better representation of complex redundancy relationships42. Association rules are used to uncover relationships between seemingly unrelated data in relational databases or any other kind of data repository. An association rule is divided into two parts: “if” and “then”. For example, “if a customer buys a dozen eggs, then he is 80% likely to purchase milk.” In fact, association rules are constituted by analysing the data for “if/then” patterns and using criteria such as “support” and “confidence” to identify the most important relationships. “Support” determines the frequency with which items appear in the database, whilst “confidence” indicates how often the “if/then” rule has been found to be true. Findings from other studies43,44 also support the use of apriori for dIDS. The first stage in the framework proposed in this paper is to mine the dataset with the apriori algorithm and produce rules to be used later for
profiling. These rules show the frequent itemsets, including all data accesses which occur frequently in the database. With the help of association rule mining, data correlations can be highlighted. This unsupervised learning process is the first phase of the proposed intrusion detection system. An example of such an association is updating the customer address in an order request, which can be a normal action. However, if the update also includes the customer name and the invoice price, this may require more consideration. Similarly, if the address update task, which is a “normal” office job during office hours, takes place at night, this can be considered an intrusion. Here the relationship between updating an item and time may be helpful. Several examples can be discussed to demonstrate how data dependencies or correlations between items of an itemset are able to help the IDS produce more accurate decisions when processing new requests.

By using the apriori algorithm, the amount of data required to be processed to generate normal patterns may be reduced, given that itemsets that are not frequently accessed are removed from the list. At the end, the most frequently accessed itemsets remain, and more accurate profiles can thus be generated. This minimises the size of the data for the intrusion detection process. However, to what extent can we limit the item selection? This depends on two factors, namely minimum support and minimum confidence43. Choosing a proper support factor while searching the dataset can provide more accurate results. The purpose here is to have frequent subsets in each frequent itemset. A minimum threshold for the level of frequency is defined as the minimum support. Different minimum supports may lead to different numbers of rules. There is no specific method for determining minimum support except by testing several values in order to find the optimum one. In this experiment, the value is set at 30%. Minimum support is not an
absolute value and is relative to the dataset that is being used in the experiment. To identify a suitable minimum support, several testing sessions were conducted using different values, and the results were measured in terms of the coverage of the rules on the dataset. It was observed that for minimum supports higher than 30%, many of the data dependency rules were missed. On the other hand, for lower values such as 25%, some rules that should not be considered frequent itemsets were caught in the results. Finally, a value of 30% was chosen as the minimum support. The same logic is applied for determining the value of minimum confidence.

Implementation of apriori is done through the Weka45 libraries, which are used in the Java code presented in this work. The application developed by the author calls the Weka mining engine to apply apriori on the dataset. It is necessary to code the application in order to automate the mining procedures. After mining sequential patterns, apriori generates read and write sequence sets. It then extracts dependency rules using the minimum confidence. This parameter is set at 80% in this experiment. In the end, the output of apriori is a set of dependency rules showing associations between data items in the dataset. Samples of the rules extracted from the dataset utilised in this work are presented in Table 1. Overall, 20 rules are extracted by apriori, which are then used in the profiling process. This information helps us to gain better knowledge about normal behavior in database requests in order to build more accurate user-independent profiles.

Transaction

Although the database log file contains all activities performed by running queries, a proper extraction and interpretation process is required in order to prepare a suitable query structure for the data mining process. Profiling has recently been introduced and implemented in several works, as reviewed in Section 2. Profiling is used when the detection method is based on the
behavior of a user, a group of users (role), or queries. Therefore, the main issue that arises here is the selection of suitable features of the log file to increase the accuracy level of detection and to reduce errors accordingly. Based on the association rules mentioned above, profiles are created for transactions that follow the same dependency rules and their related itemsets. For example, if an itemset contains () a profile string will be created for the related transactions as: “” as well as other items in each transaction. It should be noted that the order of items is important in the profile string and there should not be any spaces. This is because a binary form of these profile strings is generated for the purpose of comparison with incoming requests.

Profile

Profiling can be performed at the query level, context level or even result-set level. However, for the purpose of this study, a combination of all the said levels is selected to form the profile structure. The efficiency of this profile structure has been established in previous work2. In addition, the location information of the database requests is added to the profiles to meet the requirements of a mobile database system. Therefore, the overall structure of the final profile is as follows: {}. This profile represents the SQL command, attribute, table name, result set, location and timestamp respectively.

Feature
For complex queries in which multiple attributes are accessed, all attributes are selected with their related command separately. For example, if the query41 is like: “SELECT” is counted along with each set of attributes and tables. This means that if the said query is assumed to be a normal query, the highlighted pairs can be listed as {}. However, not all sets would be considered in the final classification, since certain data accesses would be ignored in the data mining procedure due to their low frequency. Therefore, only data which are accessed frequently enough are listed in the rules extracted by the applied mining algorithm.

Classification

Classification is used to refine the training process in order to better distinguish between intrusion transactions and non-intrusion transactions. For the supervised learning task, labeled transactions need to be provided in order to teach the system about malicious queries. For this purpose, several transactions are generated as intrusions consisting of malicious queries. These transactions are then injected into the system, updating the classifier knowledge base. Through this process, the proposed system is completely taught about normal and malicious profiles. Therefore, the probability of an intrusion can be calculated depending on the level of the system’s current knowledge about intrusions. In this study, a Naive Bayes classifier is used to handle uncertainty and to find the probability of an intrusion occurrence. A brief explanation of this classifier and its effects on the current study is presented below.

Implementation

As explained earlier, data generator tools are used to fill up the database tables with realistic sample data. This means that if there are numeric, text and date fields in the tables, they are all
filled with numbers, characters and dates. A total of eight million records is generated in the database. As mentioned earlier, transactions are generated based on the recommendations41. However, several transactions are also created manually in order to fill up the database log. All malicious transactions are synthetic. They contain one or more queries with abnormal requests, such as deleting a table, updating an item outside normal hours, updating an item outside the normal location, selecting all attributes, updating multiple unrelated attributes, etc. These queries are used in intrusion-tagged transactions.

The mining process is performed by Weka45. Weka is an open source project that provides both data mining and classification features. It is used in this study to apply the apriori algorithm and the Naive Bayes classifier to the dataset. Weka provides libraries that can be integrated into different coding platforms and used inside a programmer’s code. In this way, a programmer can call Weka to launch any of the available data mining algorithms on predefined datasets. To be able to use these libraries, Weka should be installed completely on the computer system. Alternatively, a complete Weka package called weka.jar can be imported into the programming platform45.

Generating location information is performed through simulation. For this purpose, Network Simulator 2 (NS-2)47 is used to simulate a mobile network with multiple devices communicating with the database server while moving around a predefined area. Using this method, a complete location profile is collected and added to the profile structure mentioned earlier. Abstracting location information into profiles follows a mathematical procedure which is explained later in this paper. Adding locations to the profiles is necessary in order to convert the queries into location-aware queries. A location-aware query refers to a query which
contains at least one location-related attribute or one location-related predicate48. This profile item helps to increase the accuracy of the detection system for insiders who are using mobile units to connect to the database. Simulation parameters are set out in Table 2, whilst a view of the simulated area is shown in Figure 2.

To create location-based profiles, a standard model is required to represent the positions of mobile units. There are currently two widely used models, called the geometric model and the symbolic model49. The geometric model defines locations with n-dimensional coordinates including both longitude and latitude information. This can also be represented as a set of coordinates showing a boundary or geographical area. In symbolic models, abstract symbols are used to represent locations. This enables locations to be referred to by name and consequently makes the process of clustering and analysis easier, especially when the system requires human decisions49. In this study, the symbolic location model is used to create location profiles.

Statistics can prove to be immensely helpful in achieving the set goal. Most database requests come from mobile devices belonging to internal staff members, and these requests are directly linked to the location of the user. By exploiting this fact, it is possible to formulate a method to divide geographical areas into the mobile units’ related locations. For example, in a police database, most requests to access records relating to a specific city would come from police officers who are working in the same city. Similarly, requests to access the medical records of a patient in a multi-branch hospital would normally be made by staff of the same hospital branch as the patient. Although there are exceptions to these examples, the majority would follow the same logical structure. This assumption may help to enhance the accuracy of an IDS.
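One simple statistical way to perform such a division is sketched below. It groups one-dimensional location observations around their mean and flags observations lying more than one standard deviation away; the coordinates and the one-sigma threshold are illustrative assumptions, not the paper's exact procedure:

```python
from statistics import mean, stdev

def split_locations(observations):
    """Split location observations into a usual area and outliers.

    Observations within one standard deviation of the mean form the
    mobile unit's usual area; the rest are flagged for inspection.
    """
    m, sd = mean(observations), stdev(observations)
    usual = [x for x in observations if abs(x - m) <= sd]
    outliers = [x for x in observations if abs(x - m) > sd]
    return usual, outliers
```

Requests arriving from positions in the outlier group would then warrant closer scrutiny by the detection engine.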
For example, when the record of a patient in a particular hospital branch is updated by a request that comes from a different branch, or when an event related to a particular city is updated by a request
which comes from a mobile unit located in another city, these are not to be treated as normal requests. The system should bring them to the administrators’ notice through alarms.

In order to divide location information into logical areas, it is necessary to find the most frequently repeated locations, set them as the centers of the distribution, and then calculate and list their neighbours in the same group. Statistically, the mean value can do this job: it is simply the average of a group of data, calculated by adding up all the values and dividing the sum by the total number of values. Once the mean has been determined, the degree of deviation of all locations from the mean value is calculated. This can help to identify a threshold for dividing location information into separate geographical areas. The standard deviation is used for this purpose: it measures the distance of each observation from the mean value and provides an average deviation.

Performance

Performance metrics are used to measure the accuracy and reliability of the proposed IDS. FP, FN and accuracy are the metrics used in this experiment. In general, a higher true positive rate increases the accuracy of the detection engine. Similarly, fewer detection errors result in lower FP and FN rates.

Data

K-fold cross validation (CV) is used to evaluate the accuracy and validity of the dataset utilised in this study. In k-fold CV, the dataset is divided into equal partitions called folds52. A model
should then be defined and applied to each partition in order to calculate the results based on predetermined metrics. The detection framework proposed in this study is used as the model, and detection accuracy is adopted as the metric to test and validate the dataset. Having defined the k folds, one is selected to test the dataset whilst the rest of the folds are used to train the system. This action is repeated for each fold until all folds have played the role of both trainer and tester of the dataset. At the end, an average of the observed results represents the overall effectiveness of the system. In order to provide an appropriate representation of the dataset in each fold, the data is properly classified and arranged to ensure that each fold contains an equal share of the whole dataset. This means that the number of instances in every fold should be the same even though they are selected randomly. The effectiveness of using k-fold CV has been demonstrated and recommended in52-54.

Proposed

The main components of the proposed framework are profile generation, transaction processing, and probability checking. Each of these components is discussed in this section. Application of this framework to the utilised dataset has shown interesting results which demonstrate higher accuracy in terms of intrusion detection. The structure of the framework is obtained from our previous work2. However, a more detailed and accurate profiling method has been applied, and the framework has been enhanced for mobile database systems, as shown in Figure 4.

Incoming
A database transaction containing one or more queries is injected into the system using Java code in order to determine whether or not it contains intrusions. Each query is converted into a profile string so that it can be compared with the existing intrusion or valid patterns. The conversion is straightforward: all the required elements are extracted from the query and listed next to each other, separated by commas. In effect, every word inside the transaction is compared with a list of keywords; if a word matches, it is added to the new string for further comparison. Keywords are obtained from the items extracted by mining the database queries. These words are listed as frequently accessed itemsets or association rules in the output of the mining procedure. The profile strings are then converted into binary in order to ease the comparison process.

Transactions are read from a text file and loaded into a temporary string variable, so there is no constraint on loading and processing incoming transactions through the Java code. An exclusive OR (XOR) operation is used to compare binary records, where an output of ‘0’ represents equality of the two binary strings. This process follows Algorithm 2, as explained below. The instructions in Algorithm 2 show how the system behaves with new transactions. In summary, if the text string generated for an incoming transaction does not match any existing valid pattern, it is sent to the classifier in order to check the probability of intrusion. The Naive Bayes classifier used in this study helps to determine whether or not the new pattern is an intrusion by measuring the level of similarity between the new pattern and existing ones. On the other hand, if the pattern of the incoming transaction matches an existing pattern, it is considered a normal request and can pass through the system without further examination. There is a further possibility: a profile string without attributes.
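The conversion and XOR comparison described above can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 2: the keyword list, the comma-joined profile format, and the integer bit encoding are assumptions made here for clarity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch of profile-string generation and XOR-based comparison.
// Keywords would come from the mined frequent itemsets / association rules.
public class ProfileMatcher {
    // Build a profile string: keep only the words that appear in the mined
    // keyword list, joined by commas in keyword order.
    static String toProfile(String query, List<String> keywords) {
        Set<String> words = new HashSet<>(Arrays.asList(query.toLowerCase().split("\\W+")));
        List<String> hits = new ArrayList<>();
        for (String kw : keywords) {
            if (words.contains(kw)) hits.add(kw);
        }
        return String.join(",", hits);
    }

    // Encode a profile string as bits over the keyword list: bit i is set
    // when keyword i is present in the profile.
    static int toBits(String profile, List<String> keywords) {
        Set<String> parts = new HashSet<>(Arrays.asList(profile.split(",")));
        int bits = 0;
        for (int i = 0; i < keywords.size(); i++) {
            if (parts.contains(keywords.get(i))) bits |= 1 << i;
        }
        return bits;
    }

    // XOR of two encodings is 0 exactly when the two profiles are equal.
    static boolean matches(int a, int b) {
        return (a ^ b) == 0;
    }
}
```

For example, with the hypothetical keyword list ("select", "salary", "employee"), the query "SELECT salary FROM employee WHERE id = 5" yields the profile string "select,salary,employee", and two transactions match only when the XOR of their encodings is zero.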
When the profile string contains no attributes, the incoming transaction is detected as an intrusion and an alarm will be generated to end the
process. This is because SQL commands that do not contain target attributes most probably fall into an intrusion category, such as “” or SQL injection attacks like “ ”, in which “1” is an attribute used to bypass the condition.

Probability Checking

As stated earlier, if an incoming request does not match any existing valid pattern, it is sent for checking by the Naive Bayes classifier. At this stage, the probabilities of the request being an “intrusion” and a “non-intrusion” are both calculated by the system using Equation 4, and the greater probability is taken as the output of this procedure. If it is concluded that the incoming entry is an intrusion, an alarm is generated and the intrusion repository is updated accordingly. However, if it is concluded that the incoming entry is not an intrusion, the valid pattern repository is updated with the new pattern.

Intrusion Repository

The intrusion repository is a list of all intrusions detected by the system. In the beginning this repository is empty; it grows gradually once the system starts to detect intrusions. As shown in Figure 4, having such a repository helps to detect repeated intrusions without having to send them for examination. This reduces the detection time dramatically, leading to higher overall performance. Substantial processing is required to compare incoming requests with the existing valid patterns, and this work is avoided whenever a matching pattern is already present in the intrusion repository. The repository is updated whenever an alarm is generated due to an intrusion. The code which updates the intrusion repository follows Algorithm 3, as presented below.
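Algorithms 2 and 3 themselves are not reproduced in this excerpt. As an illustration only, the overall dispatch flow described above (valid-pattern match, intrusion-repository lookup, then the probability check, with repository updates on alarms) might be sketched as follows; the class, method names, and the placeholder classifier standing in for Equation 4 are all hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the detection flow, not the paper's actual code:
// 1) an empty profile (no attributes) raises an alarm immediately;
// 2) a known valid pattern passes without further examination;
// 3) a pattern already in the intrusion repository raises an alarm without re-checking;
// 4) otherwise the Naive Bayes probability check (Equation 4) decides, and the
//    matching repository is updated with the new pattern.
public class DetectionDispatcher {
    private final Set<String> validPatterns = new HashSet<>();
    private final Set<String> intrusionRepository = new HashSet<>();

    public DetectionDispatcher(Set<String> knownValid) {
        validPatterns.addAll(knownValid);
    }

    // Returns true if an alarm is raised for this profile string.
    public boolean process(String profile, java.util.function.Predicate<String> isIntrusion) {
        if (profile.isEmpty()) return true;                     // no attributes: intrusion
        if (validPatterns.contains(profile)) return false;      // normal request, pass through
        if (intrusionRepository.contains(profile)) return true; // repeated intrusion, no re-check
        if (isIntrusion.test(profile)) {                        // probability check stands in for Equation 4
            intrusionRepository.add(profile);                   // update intrusion repository (alarm raised)
            return true;
        }
        validPatterns.add(profile);                             // learn the new valid pattern
        return false;
    }
}
```

Note how a repeated intrusion is caught by the repository lookup on its second appearance, skipping the classifier entirely; this is the source of the processing-time saving claimed above.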
Implementation

As noted earlier, the k-fold cross validation method is adopted to assess the validity of the data used in both the training and testing processes. For this purpose, the dataset is divided into five segments, each called a fold here. Each fold is stored in a separate file to be used as either a training set or a test set; these roles rotate among the folds so that every fold ends up performing both. The code implementing k-fold CV is written in Java and connected to the Weka application in order to apply the algorithms.

There are 1000 normal patterns and 100 malicious patterns, all of which are synthetic. The valid patterns are randomly divided into 5 groups of 200 patterns each, and the intrusions likewise into 5 groups of 20. The valid groups and the intrusion groups are then merged to produce 5 folds of 220 patterns each, containing both normal and malicious records. Since k is 5, the data validation experiment is repeated 5 times so that each fold is used as both training set and test set; in each run, 4 folds are used to train the system and 1 fold is used to test it. The metric is accuracy, as defined in Equation 9. Once the accuracy has been measured for every iteration of the experiment, the average of all results represents the final detection rate, which is 96.1%.

Results
The same detection methods, dataset, and framework used for general databases are applied to check the effectiveness of the enhanced profile structure in mobile database intrusion detection [2]. As expected, adding location profiles reduces errors in mobile environments: the FN and FP rates are reduced by 14% and 0.84% respectively. The results and a comparison with previous works are shown in Table 4, where it can be seen that better results are obtained for the detection method proposed in this paper. The application of the proposed enhanced profiling method, which relates location information to the valid patterns, improves accuracy. The reduction in FPR is also considerable and demonstrates how knowledge about locations can be exploited to reduce errors in the system.

Discussion

As mentioned earlier, the raw data accessed are mined with association rules. The mining algorithm extracts the most frequent itemsets for any related attributes, so the output is a rule showing a relationship between two or more attributes that are frequently accessed together. Consequently, data access patterns with low frequency are removed from the list, and the profiles created from these rules are more accurate than profiles built from all data accessed. This method is the key to reducing FP rates. On the other hand, by applying accurate and relative location information to the profiles, more precise knowledge can be obtained about the normal behaviour of data requests in a mobile environment, which helps to reduce the overall FN rates. Therefore, the unique proposed
profiling technique, together with the decision-making method used in this study, results in higher detection accuracy and fewer false alarms.

Conclusion

An effective intrusion detection framework, optimised for mobile database environments, has been introduced and discussed in this study. Findings from the implementation of the proposed detection method show a high level of accuracy and a low FP rate, owing to the unique profiling method and detection mechanism applied in the system. The detection system is designed to be sensitive to the locations from which database requests are initiated. In the experiment performed, promising results were obtained for accuracy and the FN rate, at 96.1% and 17% respectively, which improves on previous works. The accurate profiling method, the addition of location information, and the comprehensive detection procedure are the key features of this experiment.

Feature selection is always a challenge when creating profiles. Other combinations of query features can be tested to evaluate the efficiency of the system; however, adding more features may affect the performance of detection systems, and this should be considered when designing detection solutions. Although higher performance is already achieved by applying an intrusion repository in the proposed framework, more work can be done to enhance processing speed, for example by using multi-threaded code. There is also a chance to increase the detection rate if the system can be combined and customised with an organisation's policy repository. Although this will make the system
specifically tailored to each organisation, it will increase the accuracy of the valid patterns. Policies, in which sensitive attributes or requests can be defined, vary between organisations; such definitions help to highlight abnormal activities more accurately and thereby increase the overall accuracy of the IDS. Upgrading the proposed dIDS to a database intrusion prevention system (dIPS), and using a cloud service to make the detection engine available at all times over the internet, are two possible enhancements aligned with current trends that can be explored in future work.