Abstract
There has been a steady increase in the utilisation of mobile devices in recent years, due to
technological advances in mobile communications and the integrated features of
mobile units. Insider threat has always been a problematic issue faced by organisations and
with a large majority of users now equipped with mobile devices, it has now become an even
more prominent issue. Freedom of movement and ubiquitous accessibility to information make
it necessary for a solution to be available to address such potential threats. In this paper, a
unique profiling method is introduced using carefully selected database objects as well as data
concerning the location of the database requests, in order to produce a comprehensive intrusion
detection framework which is sensitive to the locations where database requests are initiated
from. Experiments implementing the system achieved promising detection rates with low rates
of false alarms.
Introduction
In recent times, advances in mobile computing and mobile communication have rapidly
intensified the use of mobile devices. This trend is mainly due to the convergence features of
the mobile units as well as the users’ ability to move around easily while using mobile devices.
Currently, storage capacity and processing power of the mobile devices are typically enough
to handle both personal and office work. However, if these mobile devices are lost or stolen,
the damage caused to the relevant organisation may be substantial, especially if sensitive
information is stored on such mobile devices. To overcome this issue, organisations may
create a database to store information centrally at a fixed server, with users just having to install
the necessary applications on their mobile units. In this way, users are still able to reap the
benefits of using mobile devices and enjoy the flexibility of movement while the risk of data
loss is simultaneously minimised.
Despite the said solution and the availability of several other techniques to protect data on
mobile devices (such as data encryption), there is still a type of risk that cannot be managed by
implementing these techniques, namely the risk of insiders trying to damage the database by
illegally modifying the data. These insiders would be the normal internal users within the
company who have the necessary access permissions to the data and who are aware of existing
policies and security mechanisms implemented by the organisation. In such a situation, none
of the encryption and access control techniques can prevent them from abusing their privilege.
Insider threats are growing in organisations1. Therefore, the deployment of an
intrusion detection system becomes necessary in order to alert administrators to
unauthorised data access by such insiders.
One of the methods commonly considered for detecting database attacks is behavioral analysis.
To implement this technique, specific profiles for the critical assets of the database need to be
generated. This process can be automated using data mining methods. Profiles can represent
the characteristics of a user or a group of users. However, the features may vary between
different organisations or departments. This is the reason why intrusion detection systems
should be adaptive to the specific requirements of an organisation.
Profiling behavioral activity is done through recording users’ access to the different objects or
the different levels of each object in a database. More detailed monitoring will produce more
accurate profiles. For example, a profile of a user's access to tables is normally less accurate
than one that records access to the attributes of the database tables. On the other hand, queries
that come in the form of a transaction also have their own effects on the database. Given that
different roles (representing a group of users or a department of the organisation) may display
similar behavior in the database through their transactions, it is important to analyse the
structure of the transactions and add them to the profile features.
As mentioned in our previous work2, a combination of two query profile structures which are
completely user-independent is introduced to form accurate profiles at transaction level. In this
study, these profiles are enhanced to be better suited to mobile database environments, in order
to produce a comprehensive intrusion detection framework which is sensitive to the locations
where database requests are initiated from. It is necessary to set out a definition of the term
“mobile database system” together with one core assumption made in this study, as follows:
Definition: In this study, a mobile database system is defined as a database system that receives
data access requests from mobile devices. In this case, the database records the location
information of each request.
Assumption: There is a relationship between access patterns and the geographical location of
the devices from which database requests are initiated.
Literature
This section aims to review a few issues that serve as background to this study. These
issues are intrusion detection systems (IDS) in general, location-aware IDS, and IDS at
database level. The gaps for each of the abovementioned areas are covered in this paper and
are further discussed at the end.
An IDS is defined as a mechanism to detect and report any suspicious activity occurring in the
system. The idea of such a mechanism has been introduced and increasingly
researched during the past decade. As defined in3, an IDS should be able to produce an alarm to
notify security administrators in case of any security threat. Two models of detection have
been defined by which decisions are made on the existence of threats; namely anomaly
detection and signature detection3-7. In anomaly detection, all activities are divided into two
categories, namely "normal" and "abnormal", with the "abnormal" behaviors regarded as
intrusions3,6,7.
Signature-based detection uses a predefined list of known intrusions to detect threats3,6-8. If a
behavior is close to the stored patterns, it will be considered as an intrusion.
IDS were first introduced to enhance the network security against newly emerged security risks
like Denial of Service (DoS) and against probing attacks9. However, the application of IDS has
now extended to cover various network domains, computer systems and even applications. An
example is the application of IDS at database level, as implemented in this paper.
Intrusion Detection Challenges: FP and FN
Although both signature-based and anomaly-based detection methods are widely used to detect
suspicious intruder activities, there are still major concerns regarding the accuracy of IDS
detection. For both detection methods, the rate of success or failure in detecting intrusions
would be the measure that determines whether a computer system is in a secure environment
or otherwise.
There are two common terminologies used when discussing IDS inaccuracy, being False
Positive (FP) rates and False Negative (FN) rates. They represent the accuracy level of the
intrusion detection system. FP represents mistaken detections for which the alarm is wrongly
generated, whereas FN means the failure to detect when a real attack is attempted. IDS usually
suffer from FPs and related alarms due to wrong detection results10. An IDS must be able to
fulfil reliability requirements in terms of precision in detecting attacks with minimum false
alarms11. Another challenging aspect of the performance of IDS relates to the amount of data
which an IDS can handle at any one time. This refers to the speed at which the system can
process incoming events in order to determine whether or not they are intrusions. When it
comes to database intrusion detection systems, which is the main topic of this study, the amount
of data requests that come through queries or transactions can determine the performance of
the system. Therefore, both accuracy and performance are factors which should be considered
when developing an IDS11. These two features have been carefully incorporated into the
proposed database IDS in this study.
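As an illustration, these accuracy measures can be computed from a confusion matrix of detection outcomes. The counts below are hypothetical and serve only to show how the rates relate; they are not results from this study:

```python
def detection_rates(tp, fp, tn, fn):
    """Compute common IDS accuracy measures from a confusion matrix."""
    fp_rate = fp / (fp + tn)          # benign events wrongly flagged
    fn_rate = fn / (fn + tp)          # attacks that slipped through
    detection_rate = tp / (tp + fn)   # complement of the FN rate
    return fp_rate, fn_rate, detection_rate

# Hypothetical evaluation: 95 attacks caught, 5 missed,
# 900 normal events passed, 20 wrongly alarmed.
fpr, fnr, dr = detection_rates(tp=95, fp=20, tn=900, fn=5)
```

Reducing one rate typically trades off against the other, which is why both must be reported when evaluating an IDS.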
Location-aware Intrusion Detection Systems
Advancement in mobile technology has made the usage of mobile devices more convenient
and popular. Factors such as long battery life, powerful processors, large storage capacities and
low weights have motivated organisations to migrate to mobile units rather than continue using
traditional computer systems. The most desirable benefit of using mobile units is the freedom
of movement.
On the other hand, data protection becomes a real concern since mobile devices are more prone
to be lost. Aside from passwords, several security measures can be implemented on mobile
devices such as fingerprint authentication12 and voice recognition13. However, data remains at
risk when a device is stolen. A portion of these data may contain confidential information that
can be used to access an organisations’ network, application and databases. In terms of database
access (which is the main focus of this study), there should be a solution in place to detect
anomalous requests, which would enable administrators to be informed as to whether the
request is sent by a legitimate user or an illegal one, given that the origin is the mobile unit for
which permissions have already been set for the registered account.
Research on mobility behavior of mobile phone users14 has shown that most of the observed
users usually use specific and limited paths and locations. In reaching this conclusion, the
mobile phones of 100,000 users were monitored. The outcome of the research indicates that the
location of mobile users is generally predictable. Therefore, this feature can be used to form
normal profiles of the mobile units in an organisation.
Network access patterns are used to build profiles15 in order to detect abnormal behavior of a
mobile user environment. Detection accuracy shows 90% success overall. Similarly,
information about users’ location is utilised for building smartphone user profiles16 with 81%
accuracy achieved. This research has been enhanced17 using two profiling
techniques, being an empirical cumulative probability measure and a trajectory model
using Markov properties. However, these valuable studies are limited to the standalone mobile
device and are not applicable for centralised database access.
Mobility and similarity detection have been widely used to detect anomalous activities in
mobile environment18-20. Expression-based algorithms are utilised to spot patterns in
mobility21. Mobility patterns are also utilised22 to address issues of social activity predictions.
Association rule mining23 helps to extract the relationships between user activities and their
locations. Besides association rule mining, other methods like k Nearest Neighbours (KNN),
Bayesian network, bipartite graphs, and neural networks are used to determine regularity and
detect malicious behaviors or malwares in mobile devices24-26. An intrusion detection system
for mobile devices was introduced by Sun et al.27 using a three-level Markov chain that is
independent of the time-to-location feature. It is effective only during phone calls when the
mobility speed is high. Another detection model, where predefined routes are used, is
proposed by Hall et al.28 with the focus being on mobility in public transportation.
However, all the reviewed techniques and models have a common feature which is dependence
on the user. This dependency may affect the overall performance of an IDS if it is implemented
centrally and is expected to deal with a large number of users. The focus of this paper is on
databases in mobile environments which users can access through their mobile devices.
Therefore, the computational task is performed in the system that hosts an instance of the
database, making it necessary to have a user-independent detection method at this point.
Database
Lu et al.29 proposed a method to detect malicious transactions by measuring transaction
violations in order to distinguish legal requests from illegal ones. Audit logs from the DBMS
were used to implement this idea. Similarly Fonseca et al.1 tried to identify abnormal activities
by defining and comparing normal patterns in the stored database logs. Although the technique
used by Lu et al.29 and Fonseca et al.1 is workable, it requires considerable human supervision,
and the detection of insider attacks is not automatic.
Networked Bayesian Network (NBN)30 is used to calculate the probability of an action being
an intrusion, in order to detect unauthorised insider attacks. This operation is done when a
number of critical objects are accessed in a transaction. The authors make the assumption that
half of the insider activities are to be regarded as intrusions. This is unsatisfactory since there
is no evidence or basis to support this assumption. Moreover, as mentioned by the authors, this
method is not applicable to authorised insider attacks.
A method which involves fingerprinting the transactions was used by Lee et al.31 to detect
unauthorised database access. They focused on Structured Query Language (SQL) statements
to create signatures for the intrusion detection system, which is considered as a signature-based
approach. This approach may not work efficiently for unknown access patterns.
Yaseen and Panda32 utilised a knowledge base to define the amount of information that insiders
have about data items, and subsequently proposed a threat prediction graph to detect and
prevent insider threats. Their work is a valuable contribution to this area of study since they
conducted in-depth analysis of the dependencies of database items. However, the work is based
on several unrealistic assumptions about insiders' access patterns, such as the assumption that
the probability of insiders misusing previously accessed data is equal to the probability of their
accessing new items.
Gaps
A majority of IDS focus on networks or hosts and therefore cannot be effective for databases.
A few detection systems have been proposed or implemented at database level33-35. Review of
current database IDS models and systems show that there is still a lot of work to be done in
terms of accuracy of detection engine and error rates in order to have a robust IDS at database
level. Most of the current models suffer from high FP34 or FN33 rates, which makes it necessary
to apply more accurate data preparation techniques before feeding data to the detection engine.
High FN becomes a limitation to the traditional signature-based detection method because an
effective expert system should only react to well-defined attacks. This causes reliability
problems in relation to new entries for which there is no pattern or signature in the system36.
High FP is also a challenge for the anomaly detection methods. Incomplete learning processes
lead to a high number of false alarms. Through this method, a statistical process is used to
detect undefined attacks based on user or system behavior. The performance of IDS that use
anomaly detection is highly dependent on the amount of correct behavioral statistics which it
gathers. System administrators usually suffer from having to handle a huge amount of FP alerts
daily37. Again, we can conclude that a more accurate repository of valid signatures and/or
behavioral statistics is required to overcome the challenges mentioned above.
Another issue that may affect the accuracy of database intrusion detection systems (dIDS) is
their handling of a mobile environment14,15. Although the detection engine is placed at database
level, it can still collect the mobility behavior of users, whose requests help to produce
more accurate patterns for detection purposes16. This hypothesis is supported by the
experiment conducted in the present study.
As stated above, despite many attempts to secure databases against intruders, a lot of effort is
still required to improve IDS in order to combat the phenomenon of insider attacks. Moreover,
none of the research works mentioned adapts its system to mobile environments where
requests are sent through mobile units. These gaps are covered in this study with the aim of
providing a robust detection framework which can effectively deal with insider threat in a
mobile environment.
To enhance the capability of database intrusion detection, a multi-layer profiling method is
introduced in this study, covering both context and result-set based approaches as well as query
structure. This model uses a unique structure incorporating carefully selected features which
help to increase accuracy and consequently reduce false alarms. The proposed profiling
happens at query level and then at transaction level. This helps to avoid dependencies on users
and user roles. The factor of mobility is also considered. A location abstraction model is
introduced to update profile structures in order to enhance the accuracy of detection engine
while processing queries in mobile database systems.
Mobile
The demand for mobile devices is constantly growing, with rapid innovation seen in terms of
technology, processing power, networking capabilities, and storage capacity. New features are
constantly being introduced and integrated in these devices, enabling more services to be
provided to users. Databases and data management services are no exceptions. More and more
mobile applications now offer connections to central databases, leading to a variety of
information being transferred into the mobile device. Accessing and exchanging data over
networks are activities which inevitably give rise to concerns regarding security and this issue
becomes even more serious when the access is ubiquitous. Therefore, more security features
should be introduced for mobile database systems, especially given the distributed nature of
mobile systems38,39.
A mobile database system is a system where applications can gain access to the database
through the use of mobile devices. In a mobile database system, either a portion of the
database or the entire database can be integrated into the mobile unit. However, because of
limitations such as battery power, capacity, and computational capabilities, the main database
usually remains stored in a fixed machine. In general, mobile databases are referred to as a
special form of distributed database system and can be defined in two forms. One is the
implementation of the whole database in fixed and wired distributed systems. The other is the
distribution of the database between wired and wireless systems, in which both fixed and mobile
machines perform the task of data management40. As shown in Figure 1, the scenario which
is selected for this study adopts the first-mentioned architecture, namely a mobile database
system where mobile units are connected to the database system (DBS) through mobile
networks. The database itself is also distributed among different locations. The fact that
mobile units can carry a part of the database does not have any effect on the current work,
since the main processing task for the proposed database intrusion detection system happens
at the main databases which are already synchronised.
Dataset
The main database is created in Microsoft SQL Server 2008 and is filled with about eight
million records. To follow a formal structure for the database, TPC-H standard is selected and
implemented according to the published datasheet41. Fields of each table are populated with
realistic sample data and transactions are created using dbgen and qgen tools respectively41.
Transactions contain one or more queries accessing several attributes of the database in order
to prepare the database log.
The log file is used as the main dataset for the purpose of data mining. The precise number of
queries is 1000, with 500 transactions being generated and launched randomly on the database.
These queries are used in the mining process and we consider the database log as an intrusion-
free dataset. Details of the mining procedure are explained in the following section.
Transactions contain a number of queries that help to gather enough data relationships to
produce an acceptable range of itemsets for the data mining process. 100 malicious transactions
are also generated to train the system and also to check the accuracy of detection. The malicious
transactions contain queries in which abnormal access patterns can be found, such as a drop
table command.
Mining Procedure
Several data mining tools are required for data analysis and pattern discovery that can help to
provide better strategies for business and scientific research areas. The existing gap between
raw data and useful information demonstrates the importance of data mining techniques to turn
plain data into valuable information. Therefore, some techniques are needed for knowledge
discovery that can automatically find logical relationships between data with minimum user
intervention.
Many issues may arise in terms of performance and efficiency when dealing with large volumes
of data in databases. It should be noted that the same mining method may not provide the
desired results when the objective of mining is altered. For IDS, there are a few frequently used
mining algorithms. Two common methods are association rule learning and sequence mining.
The effectiveness of rule mining is complemented with the application of PARAS, being an
approach to form a better representation of complex redundancy relationships42. Association
rules can be used to uncover relationships between seemingly unrelated data in relational databases or
any other kind of data repository. An association rule is divided into two parts, being “if” and
“then”. For example, “if a customer buys a dozen eggs, then he is 80% likely to purchase milk.”
In fact, association rules are constituted through analysing the data using “if/then” patterns and
using criteria such as “support” and “confidence” to identify the most important relationships.
"Support" determines the frequency of items appearing in the database, whilst "Confidence"
indicates how often the "if/then" relationship has been found to be true. Findings from other
studies 43,44 also support the use of apriori for dIDS. The first stage in the framework proposed
in this paper is to mine dataset with apriori algorithm and produce rules to be used later for
profiling. These rules will show the frequent itemsets including all data access which occur
frequently in the database.
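The support and confidence criteria from the shopping example above can be sketched directly. The transactions below are invented purely for illustration:

```python
# Toy transaction database: each transaction is a set of purchased items.
transactions = [
    {"eggs", "milk"},
    {"eggs", "milk", "bread"},
    {"eggs", "bread"},
    {"eggs", "milk"},
    {"milk"},
    {"eggs", "milk", "butter"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    """How often the consequent appears when the antecedent does."""
    return support(antecedent | consequent, db) / support(antecedent, db)

# "if a customer buys eggs, then they buy milk": 4 of the 5 egg
# transactions also contain milk, giving a confidence of 0.8.
conf = confidence({"eggs"}, {"milk"}, transactions)
```

The same two measures apply unchanged when the "items" are database accesses rather than purchases.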
With the help of association rule mining, data correlations can be highlighted. This
unsupervised learning process is the first phase of the proposed intrusion detection system. An
example of such an association is an update to the customer address of an order, which can be a
normal action. However, if the update also includes the customer name and the invoice price, this
may require more consideration. Similarly, if the address update task which is a “normal” office
job during office hours takes place at night, this can be considered as an intrusion. Here the
relationship between updating an item and time may be helpful. Several examples can be
discussed to demonstrate how data dependencies or correlations between items of an itemset
are able to help the IDS to produce more accurate decisions when processing new requests.
By using apriori algorithm, the number of data required to be processed to generate normal
patterns may be reduced given that itemsets that are not frequently accessed are removed from
the list. At the end, the most frequently accessed itemsets will remain and more accurate
profiles can thus be generated. This will minimise the size of data for intrusion detection
process. However, to what extent can we limit the item selection? This depends on two
factors, namely minimum support and minimum confidence43. Choosing a proper support
factor while searching the dataset can provide more accurate results. The purpose here is to
have frequent subsets in each frequent itemset.
A minimum threshold for the level of frequency is defined as minimum support. Having
different minimum supports may lead to having different numbers of rules. There is no specific
method for determining minimum support except by testing several numbers in order to find
the optimum one. In this experiment, the number is set at 30%. Minimum support is not an
absolute value and is relative to the dataset that is being used in the experiment. To identify a
suitable minimum support, several testing sessions were conducted using different numbers
and the results were measured in terms of the coverage of the rules on the dataset. It is observed
that for minimum supports above 30%, many of the rules for data dependencies
were missed. On the other hand, for lower values such as 25%, some rules that should not be
considered frequent itemsets were caught in the results. Finally, a midway value of 30%
was chosen as the minimum support. The same logic is applied for determining the value of
minimum confidence. Implementation of apriori is done through Weka45 libraries which are
used in the Java code presented in this work. The application, developed by the author, calls
the Weka mining engine to apply apriori to the dataset. Coding the application is necessary
in order to automate the mining procedures. After mining sequential patterns, apriori
generates read and write sequence sets. It then extracts dependency rules using minimum
confidence. This parameter is set at 80% in this experiment. At the end the output of apriori
will be dependency rules showing associations between data items in a dataset. Samples of
extracted rules for the utilised dataset of this work are presented in Table 1. Overall, 20 rules
are extracted by apriori which are then used in the profiling process.
This information helps us to gain better knowledge about normal behavior on database requests
in order to build more accurate user-independent profiles.
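The level-wise mining step can be illustrated with a minimal, hand-rolled sketch of the apriori idea using the study's thresholds of 30% minimum support and 80% minimum confidence. The access log, item names, and code are hypothetical stand-ins for the Weka-based Java implementation actually used:

```python
from itertools import combinations

MIN_SUPPORT = 0.30      # threshold chosen in this study
MIN_CONFIDENCE = 0.80   # threshold chosen in this study

# Hypothetical access log: each transaction is the set of
# (read/write, attribute) items it touched.
log = [
    {"R:c_address", "R:c_name"},
    {"R:c_address", "R:c_name", "W:o_total"},
    {"R:c_address", "R:c_name"},
    {"W:o_total"},
    {"R:c_address", "R:c_name", "R:c_phone"},
]

def frequent_itemsets(db, min_sup):
    """Level-wise search: keep k-itemsets meeting min support,
    then join survivors into (k+1)-itemset candidates."""
    items = {i for t in db for i in t}
    freq, k = {}, 1
    current = [frozenset([i]) for i in items]
    while current:
        level = {}
        for cand in current:
            sup = sum(cand <= t for t in db) / len(db)
            if sup >= min_sup:
                level[cand] = sup
        freq.update(level)
        keys = list(level)
        current = list({a | b for a in keys for b in keys if len(a | b) == k + 1})
        k += 1
    return freq

def extract_rules(freq, min_conf):
    """Split each frequent itemset into antecedent/consequent rules."""
    out = []
    for itemset, sup in freq.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for ante in map(frozenset, combinations(itemset, r)):
                conf = sup / freq[ante]
                if conf >= min_conf:
                    out.append((set(ante), set(itemset - ante), conf))
    return out

rules = extract_rules(frequent_itemsets(log, MIN_SUPPORT), MIN_CONFIDENCE)
```

In this toy log the rarely accessed `R:c_phone` item is pruned by the support threshold, and the surviving dependency rules link the address and name accesses, mirroring how infrequent itemsets are dropped before profiling.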
Transaction
Although the database log file contains all activities done by running queries, a proper
extraction and interpretation process is required in order to prepare a suitable query structure
for the data mining process. Profiling has recently been introduced and implemented in several
works as reviewed in Section 2. Profiling is used when the detection method is based on the
behavior of a user, a group of users (role), or queries. Therefore, the main issue that arises here
is the selection of suitable features of the log file to increase the accuracy level of detection and
to reduce errors accordingly.
Based on the association rules mentioned above, profiles are created for transactions that follow
the same dependency rules and their related itemsets. For example, if an itemset contains (), a
profile string will be created for the related transactions as "", together with the other items in
each transaction.
It should be noted that the order of items is important in the profile string and there should not
be any spaces. This is because a binary form of these profile strings is generated for the purpose
of comparison with incoming requests.
Profile
Profiling can be performed at the query level, context level or even result-set level. However,
for the purpose of this study, a combination of all the said levels is selected to form the profile
structure. The efficiency of this profile structure has been demonstrated in previous work2. In
addition, the location information of the database requests is also added to the profiles to meet
the requirements of a mobile database system. Therefore, the overall structure of the final
profile is as follows:{}. This profile represents the SQL command, attribute, table name, result
set, location and timestamp respectively.
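A minimal sketch of how such a profile might be serialised for comparison is shown below. The field values, and the fixed-order, no-separator binary encoding, are illustrative assumptions, since the exact profile string format is not reproduced here:

```python
from dataclasses import dataclass, astuple

@dataclass(frozen=True)
class Profile:
    """Profile fields named in the paper; values here are hypothetical."""
    command: str
    attribute: str
    table: str
    result_set: str
    location: str
    timestamp: str

def to_binary(p: Profile) -> bytes:
    # Fixed field order with no spaces or separators, so identical
    # profiles yield byte-identical strings for fast comparison.
    return "".join(astuple(p)).encode("utf-8")

stored = Profile("SELECT", "c_name", "CUSTOMER", "R1", "Z3", "T09")
incoming = Profile("SELECT", "c_name", "CUSTOMER", "R1", "Z3", "T09")
match = to_binary(stored) == to_binary(incoming)
```

Because the encoding is order-sensitive and separator-free, any deviation in any field, including location, produces a different byte string and flags the request for closer inspection.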
Feature
For complex queries in which multiple attributes are accessed, all attributes are selected with
the related command separately. For example, if the query41 is like:
“SELECT” is counted along with each set of attributes and tables. This means that if the said
query is assumed to be a normal query, highlighted pairs can be listed as {}. However, not all
sets would be considered in the final classification since certain data accessed would be ignored
through data mining procedure due to having low frequency. Therefore, only data which are
accessed frequently enough are listed in extracted rules from the applied mining algorithm.
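A simplified sketch of pairing the command with each accessed attribute and table might look as follows. The parsing logic and the sample query are illustrative assumptions; the actual feature extraction works from the database log rather than this toy regular expression:

```python
import re

def extract_pairs(sql: str):
    """List (command, attribute, table) triples from a simple
    single-table SELECT; illustrative only, not a full SQL parser."""
    m = re.match(r"\s*(SELECT)\s+(.+?)\s+FROM\s+(\w+)", sql, re.IGNORECASE)
    if not m:
        return []
    command, cols, table = m.group(1).upper(), m.group(2), m.group(3).upper()
    # The command is counted separately with each attribute it touches.
    return [(command, col.strip(), table) for col in cols.split(",")]

pairs = extract_pairs("SELECT c_name, c_address FROM customer WHERE c_id = 7")
```

Each resulting pair is then checked against the mined rules, so attributes whose access frequency fell below the support threshold are simply ignored at this stage.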
Classification
Classification is used to refine the training process in order to better distinguish between
intrusion transactions and non-intrusion transactions. For the supervised learning task, labeled
transactions need to be provided in order to teach the system about malicious queries. For this
purpose, several transactions are generated as intrusions consisting of malicious queries. These
transactions are then injected into the system, updating the classifier knowledge base. Through
this process, the proposed system is completely taught about normal and malicious profiles.
Therefore, the probability of an intrusion can be calculated depending on the level of the
system’s current knowledge about intrusions. In this study, a Naive Bayes classifier is used to
consider uncertainty and to find the probability of an intrusion occurrence. A brief explanation
of this classifier and its effects on the current study is presented below.
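To make the classification step concrete, the sketch below hand-rolls a Bernoulli Naive Bayes over hypothetical binary profile features. The feature names, the training samples, and the implementation itself are illustrative assumptions, not the Weka classifier used in the study:

```python
from collections import Counter
from math import log

# Toy labeled profiles: feature tuple (off_hours, bulk_update, foreign_location)
data = [
    ((0, 0, 0), "normal"), ((0, 0, 0), "normal"), ((0, 1, 0), "normal"),
    ((1, 1, 1), "intrusion"), ((1, 0, 1), "intrusion"), ((1, 1, 0), "intrusion"),
]

def train(samples):
    classes = Counter(lbl for _, lbl in samples)
    n_feat = len(samples[0][0])
    # Laplace-smoothed P(feature = 1 | class) for each feature.
    probs = {c: [(sum(x[i] for x, l in samples if l == c) + 1) /
                 (classes[c] + 2) for i in range(n_feat)] for c in classes}
    priors = {c: classes[c] / len(samples) for c in classes}
    return priors, probs

def predict(x, priors, probs):
    def loglik(c):
        return log(priors[c]) + sum(
            log(p if xi else 1 - p) for xi, p in zip(x, probs[c]))
    return max(priors, key=loglik)

priors, probs = train(data)
label = predict((1, 1, 1), priors, probs)
```

The probabilistic output is what allows the system to express uncertainty: rather than a hard rule match, each incoming profile receives a likelihood of being an intrusion given everything the classifier has seen so far.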
Implementation
As explained earlier, data generator tools are used to fill up the database tables with realistic
sample data. This means that if there are numeric, text and date fields in the tables, they are all
filled up with numbers, characters and dates. A total of eight million records is generated
in the database.
As mentioned earlier, transactions are generated based on recommendations41. However,
several transactions are also created manually in order to fill up the database log. All malicious
transactions are synthetic. They contain one or more queries with abnormal requests such as
deleting a table, updating an item out of normal time, updating an item out of normal location,
selecting all attributes, updating multiple unrelated attributes, etc. These queries are used in
intrusion tagged transactions.
The mining process is performed by Weka45. Weka is an open source project that provides both
data mining and classification features. It is used in this study to apply apriori algorithm and
Naive Bayes classifier on the dataset. Weka provides libraries that can be integrated into different
coding platforms and used within a programmer’s own code. In this way, a programmer can
call Weka to launch any of the available data mining algorithms on the predefined datasets. To
be able to use these libraries, Weka should be installed completely on the computer system.
Alternatively, a complete Weka package called weka.jar should be imported to the
programming platform45.
Generating location information is performed through simulation. For this purpose, Network
simulator 2 (NS-2)47 is used to simulate a mobile network with multiple devices
communicating with database server while moving around a predefined area. By using this
method, a complete location profile is collected and added to the profile structure mentioned
earlier. Abstracting location information into profiles follows a mathematical procedure which
is explained later in this paper. Adding locations to the profiles is necessary in order to convert
the queries into a location-aware query. A location-aware query refers to a query which
contains at least one location-related attribute or one location-related predicate48. This profile
item helps to increase the accuracy of the detection system for insiders who are using mobile
units to connect to the database. Simulation parameters are set out in Table 2, whilst a view of
the simulated area is shown in Figure 2.
To create location-based profiles, a standard model is required to represent the positions of
mobile units. There are currently two widely used models called the geometric model and the
symbolic model49. The geometric model defines locations with n-dimensional coordinates
including both longitude and latitude information. This can also be represented as a set of
coordinates showing a boundary or geographical area. In symbolic models, abstract symbols
are used to represent locations. This enables locations to be traced by name and consequently
makes the process of clustering and analysing easier, especially when the system requires
human decisions49. In this study, the symbolic location model is used to create location profiles.
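A minimal sketch of what a symbolic location model can look like in code, assuming hypothetical area names and bounding boxes (the paper's actual symbols and areas are not specified here): geometric coordinates are mapped onto named areas, so profiles can store human-readable labels instead of raw coordinates.

```java
// Symbolic location model sketch: map a geometric position onto a
// named area. Area names and bounds are illustrative assumptions.
public class SymbolicLocations {
    // Each area: name plus a bounding box [minX, minY, maxX, maxY].
    private static final Object[][] AREAS = {
        { "branch-north", 0.0, 50.0, 100.0, 100.0 },
        { "branch-south", 0.0,  0.0, 100.0,  50.0 },
    };

    public static String symbolFor(double x, double y) {
        for (Object[] a : AREAS) {
            if (x >= (double) a[1] && y >= (double) a[2]
                    && x <= (double) a[3] && y <= (double) a[4])
                return (String) a[0];
        }
        return "unknown";   // outside every named area
    }
}
```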
Statistics can prove immensely helpful in achieving this goal. Most database requests
come from mobile devices belonging to internal staff members, and such requests are directly
linked to the locations of their users. By exploiting this fact, it is possible to formulate a method to divide
geographical areas into the mobile units’ related locations. For example, in a police database,
most requests to access records relating to a specific city would come from police officers who
are working in the same city. Similarly, requests to access the medical records of a patient in a
multi-branch hospital would normally be made by staff of the same hospital branch as the
patient. Although there are exceptions to these examples, the majority would follow the same
logical structure. This assumption may help to enhance the accuracy of an IDS. For example,
when the record of a patient in a particular hospital branch is updated by a request that comes
from a different branch, or when an event related to a particular city is updated by a request
which comes from a mobile unit located in another city, these are not to be treated as normal
requests. The system should bring them to the administrators' attention through alarms.
In order to divide location information into logical areas, it is necessary to find the most
frequently repeated locations, set them as the centres of the distribution, and then calculate
and list their neighbours in the same group. Statistically, the mean value serves this purpose:
it is simply the average of a group of data, computed by adding up all the values and dividing
by their total number.
Once the mean has been determined, the degree to which each location deviates from the mean
value is calculated. This helps to identify a threshold for dividing location information into
separate geographical areas. Standard deviation is used for this purpose: it measures the
distance of each observation from the mean value and provides an average deviation.
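The mean and standard deviation steps described above can be sketched as follows; the one-dimensional coordinates and the one-standard-deviation grouping threshold are illustrative assumptions, not the paper's exact procedure:

```java
// Sketch of the grouping step: a location within one standard
// deviation of the mean is treated as part of the same logical area.
public class LocationGrouping {
    public static double mean(double[] xs) {
        double sum = 0;
        for (double x : xs) sum += x;
        return sum / xs.length;
    }

    public static double stdDev(double[] xs) {
        double m = mean(xs), sq = 0;
        for (double x : xs) sq += (x - m) * (x - m);
        return Math.sqrt(sq / xs.length);   // population standard deviation
    }

    // True if the observation lies within the deviation threshold of
    // the mean position, i.e. belongs to the same geographical group.
    public static boolean sameArea(double[] xs, double x) {
        return Math.abs(x - mean(xs)) <= stdDev(xs);
    }
}
```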
Performance Metrics
Performance metrics are used to measure the accuracy and reliability of the proposed IDS. FP,
FN and accuracy are the metrics used in this experiment.
In general, a higher true positive rate increases the accuracy of detection engines. Similarly,
fewer detection errors result in lower FP and FN rates.
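Assuming the standard confusion-matrix definitions of these metrics (the paper's own equations are not reproduced here), they can be computed as:

```java
// Standard confusion-matrix metrics under the usual definitions:
// FP rate = FP / (FP + TN), FN rate = FN / (FN + TP),
// accuracy = (TP + TN) / all decisions.
public class Metrics {
    // FP rate: valid requests wrongly flagged as intrusions.
    public static double fpRate(int fp, int tn) { return (double) fp / (fp + tn); }

    // FN rate: intrusions the system failed to flag.
    public static double fnRate(int fn, int tp) { return (double) fn / (fn + tp); }

    // Accuracy: correct decisions over all decisions.
    public static double accuracy(int tp, int tn, int fp, int fn) {
        return (double) (tp + tn) / (tp + tn + fp + fn);
    }
}
```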
Data Validation
K-fold cross validation (CV) is used to evaluate the accuracy and validity of the dataset utilised
in this study. In k-fold CV, the dataset is divided into equal partitions called folds52. A model
is then defined and applied to each partition in order to calculate results based on
predetermined metrics. The detection framework proposed in this study is used as the model,
and detection accuracy is adopted as the metric to test and validate the dataset. Having defined
the k folds, one is selected to test the dataset whilst the remaining folds are used to train the
system. This is repeated for each fold until every fold has played the role of both trainer and
tester of the dataset. At the end, the average of the observed results represents the overall
effectiveness of the system.
In order to provide an appropriate representation of the dataset in each fold, data is classified
and arranged to ensure that each fold contains an equal share of the whole dataset. This means
that the number of instances in every fold should be the same even though the instances are
selected randomly. The effectiveness of k-fold CV has been demonstrated and recommended in 52-54.
Proposed Framework
The main components of the proposed framework are profile generation, transaction
processing, and probability checking. Each of these components is discussed in this section.
Application of this framework on the utilised dataset has shown interesting results which
demonstrate higher accuracy in terms of intrusion detection. The structure of the framework is
obtained from our previous work2. However, a more detailed and accurate profiling method has
been applied here, and the framework is enhanced for mobile database systems, as shown in Figure
4.
Incoming Transactions
A database transaction containing one or more queries is injected into the system through Java
code in order to determine whether or not it contains intrusions. Each query is converted into a
profile string so that it can be compared with the existing intrusion or valid patterns. The
conversion is straightforward: all the required elements are extracted from the query and listed
next to each other, separated by commas. Every word inside the transaction is compared with
a list of keywords; if it matches, it is added to the new string for further comparison.
Keywords are obtained from the items extracted by mining the database queries, and are listed
as frequently accessed itemsets or association rules in the output of the mining procedure.
The profile strings are then converted into binary in order to ease the comparison process.
Transactions are read from a text file and loaded into a temporary string variable, so there is
no constraint on loading and processing incoming transactions through the Java code. The
exclusive OR (XOR) operation is used to compare binary records, where an output of ‘0’
represents equality of the two binary strings. This process follows Algorithm 2, as explained
below.
Instructions in Algorithm 2 show how the system behaves with new transactions. In summary,
if the text string generated for an incoming transaction does not match any existing valid
pattern, it is sent to the classifier in order to check the probability of intrusion. The Naive
Bayes classifier used in this study helps to determine whether or not the new pattern is an
intrusion by measuring the level of similarity between the new pattern and existing ones. On
the other hand, if the pattern of the incoming transaction matches existing patterns, it is
considered a normal request and can pass through the system without further examination.
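The conversion to binary and the XOR comparison can be sketched as below; the keyword list and the profile string are hypothetical examples, not the paper's actual keyword set:

```java
// Sketch of the pattern comparison step: a profile string is turned
// into a fixed-order bit string over the keyword list, and XOR of two
// equal patterns yields all zeros.
public class PatternCompare {
    // One bit per keyword: '1' if the comma-separated profile contains it.
    public static String toBinary(String profile, String[] keywords) {
        java.util.Set<String> items =
            new java.util.HashSet<>(java.util.Arrays.asList(profile.split(",")));
        StringBuilder bits = new StringBuilder();
        for (String k : keywords) bits.append(items.contains(k) ? '1' : '0');
        return bits.toString();
    }

    // True when the XOR of the two bit strings is all zeros,
    // i.e. the incoming pattern matches a stored one.
    public static boolean matches(String a, String b) {
        if (a.length() != b.length()) return false;
        for (int i = 0; i < a.length(); i++)
            if ((a.charAt(i) ^ b.charAt(i)) != 0) return false;
        return true;
    }
}
```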
A further possibility is a profile string that contains no attributes. In such a case, the
incoming transaction is detected as an intrusion and an alarm is generated to end the
process. This is because SQL commands which do not contain target attributes most probably
fall into an intrusion category such as “” or SQL injection attacks like “ “ in which “1” is an
attribute used to bypass the condition.
Probability Checking
As stated earlier, if an incoming request does not match any existing valid pattern, it is
sent for checking by the Naive Bayes classifier. At this stage, the probabilities of the request
being an “intrusion” and a “non-intrusion” are both calculated by the system using Equation 4.
The greater probability is taken as the output of this procedure.
If it is concluded that the incoming entry is an intrusion, an alarm will be generated and the
intrusion repository will be updated accordingly. However, if it is concluded that the incoming
entry is not an intrusion, the valid pattern repository is updated with the new pattern.
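A minimal two-class Naive Bayes decision over binary pattern features can sketch this step; it is an illustrative stand-in for the comparison of the two posteriors, not a reproduction of the paper's Equation 4, and the training rows are hypothetical:

```java
// Minimal two-class Naive Bayes decision: the class with the larger
// posterior (prior times per-feature likelihoods) wins. Laplace
// smoothing keeps an unseen feature value from zeroing a probability.
public class BayesCheck {
    // trainIntrusion / trainValid: rows of 0/1 pattern features.
    public static boolean isIntrusion(int[][] trainIntrusion, int[][] trainValid,
                                      int[] pattern) {
        double pIntr  = score(trainIntrusion, pattern) * trainIntrusion.length;
        double pValid = score(trainValid, pattern) * trainValid.length;
        return pIntr > pValid;
    }

    private static double score(int[][] rows, int[] pattern) {
        double p = 1.0;
        for (int f = 0; f < pattern.length; f++) {
            int hits = 0;
            for (int[] r : rows) if (r[f] == pattern[f]) hits++;
            p *= (hits + 1.0) / (rows.length + 2.0);   // Laplace smoothing
        }
        return p;
    }
}
```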
Intrusion Repository
The intrusion repository is a list of all intrusions detected by the system. In the beginning this
repository is empty; it grows gradually as the system starts to detect intrusions. As shown
in Figure 4, having such a repository helps to detect repeated intrusions without having to send
them for examination. This reduces the detection time dramatically, leading to an overall higher
performance. Significant processing is required to compare incoming requests with existing
valid patterns, and this work can be skipped whenever a matching pattern is found in the
intrusion repository. The repository is updated whenever an alarm is generated due to an
intrusion. The code which updates the intrusion repository follows Algorithm 3, as presented
below.
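A hypothetical rendering of the repository's behaviour (Algorithm 3 itself is not reproduced here): known intrusion patterns are cached so repeated attacks can raise an alarm without re-running the classifier.

```java
// Sketch of the intrusion repository: a cache of previously flagged
// binary profile patterns. Names and structure are illustrative.
public class IntrusionRepository {
    private final java.util.Set<String> known = new java.util.HashSet<>();

    // True if the pattern was flagged before; the alarm can then be
    // raised immediately without re-running the Naive Bayes check.
    public boolean isKnownIntrusion(String pattern) {
        return known.contains(pattern);
    }

    // Called whenever the classifier raises an alarm for a new pattern.
    public void record(String pattern) {
        known.add(pattern);
    }

    public int size() { return known.size(); }
}
```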
Implementation
As noted earlier, the k-fold cross validation method is adopted to assess the validity of the data
used in both the training and testing processes. For this purpose, the dataset is divided into five
segments, called folds here. Each segment is stored in a separate file to be used as either
a training set or a test set. The roles of training set and test set rotate between the folds
so that every fold ends up performing both roles.
The code implementing k-fold CV is written in Java and connected to the Weka
application in order to apply the algorithms. There are 1000 normal patterns and 100 malicious
patterns, all of which are synthetic. The valid patterns are randomly divided into 5 groups, each
containing 200 patterns. The same division is applied to the intrusions, with 20 patterns in each
group. The valid groups and the intrusion groups are then merged to produce 5 folds
of 220 patterns each, containing both normal and malicious records. Since k is 5 here, the
data validation experiment is repeated 5 times so that each fold is used as both a training set
and a test set. Therefore, in each experiment there are 4 folds used to train
the system and 1 fold used to test it. The metric is accuracy, as defined in
Equation 9.
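The stratified split described above can be sketched as follows, using synthetic pattern identifiers; the shuffling seed and identifier format are illustrative assumptions:

```java
// Sketch of the stratified 5-fold split: 1000 valid and 100 intrusion
// patterns are shuffled and dealt round-robin into k folds, so each
// fold of 220 keeps the same valid-to-intrusion ratio (200 : 20).
public class FoldBuilder {
    public static java.util.List<java.util.List<String>> stratifiedFolds(
            java.util.List<String> valid, java.util.List<String> intrusions,
            int k, long seed) {
        java.util.List<java.util.List<String>> folds = new java.util.ArrayList<>();
        for (int i = 0; i < k; i++) folds.add(new java.util.ArrayList<>());
        java.util.Random rnd = new java.util.Random(seed);
        java.util.List<String> v = new java.util.ArrayList<>(valid);
        java.util.List<String> m = new java.util.ArrayList<>(intrusions);
        java.util.Collections.shuffle(v, rnd);
        java.util.Collections.shuffle(m, rnd);
        for (int i = 0; i < v.size(); i++) folds.get(i % k).add(v.get(i));
        for (int i = 0; i < m.size(); i++) folds.get(i % k).add(m.get(i));
        return folds;
    }
}
```

In each of the 5 runs, one fold is held out for testing and the other four train the system.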
Once the accuracy has been measured for all iterations of the experiment, the average of the
results is used to represent the final detection rate, which is 96.1%.
Results
The same detection methods, dataset, and framework used for general databases are applied to
check the effectiveness of the enhanced profile structure in mobile database intrusion
detection cases2. As expected, adding location profiles reduces errors in mobile environments:
the FN and FP rates are reduced by 14% and 0.84%, respectively. The results and a comparison
with previous works are shown in Table 4. It can be seen that better results are obtained for
the detection method proposed in this paper.
It is observed that the application of the proposed enhanced profiling method (which relates
location information to the valid patterns) is able to improve accuracy. Reduction in FPR is
also considerable and demonstrates how knowledge about locations can be exploited to reduce
errors in the system.
Discussion
As mentioned earlier, the raw data accessed are mined with association rules. The mining
algorithm extracts the most frequent itemsets for any related attributes. This means that the
output is a rule showing a relationship between two or more attributes, derived from the
fact that they are frequently accessed together. Consequently, data access patterns with low
frequency are removed from the list, so profiles created from these rules are more accurate
than profiles built from all data accessed. This method is the key to reducing FP rates.
On the other hand, by applying accurate and relative location information to the profiles, more
precise knowledge can be obtained on the normal behavior of data requests in a mobile
environment. This helps to reduce the overall FN rates. Therefore, the unique profiling
technique proposed here, together with the decision-making method used in this study, results
in higher detection accuracy and fewer false alarms.
Conclusion
An effective intrusion detection framework, optimised for mobile database environments, has
been introduced and discussed in this study. Findings from implementation of the proposed
detection method show a high level of accuracy and a low level of FP rates. This is due to the
unique profiling method and detection mechanism applied in the system.
The detection system is designed in such a way that it is sensitive to the locations from which
database requests are initiated. Based on the performed experiment, promising results were
obtained in terms of accuracy and FN rates, being 96.1% and 17% respectively. This shows
better outcomes as compared to previous works. An accurate profiling method, the addition of
location information, and a comprehensive detection procedure are the key features of this
experiment.
Feature selection is always a challenge while creating profiles. Other combinations of query
features can be tested to evaluate the efficiency of the system. However, adding more features
may affect the performance of detection systems. This should be considered when designing
detection solutions. Although higher performance is already achieved by applying an intrusion
repository in the proposed framework, more work can be done to enhance the processing speed,
for example by using multi-thread coding.
There is also a chance to increase the detection rate if the system can be combined and
customised with an organisation’s policy repository. Although this will make the system
specifically tailored to each organisation, it will increase the accuracy of the valid patterns.
Policies, in which sensitive attributes or requests can be defined, vary between organisations,
and such definitions help to highlight abnormal activities more accurately. This increases the
overall accuracy of the IDS.
Upgrading the proposed dIDS to a database intrusion prevention system (dIPS) and using a
cloud service to make the detection engine available at all times through the internet are two
enhancements that are currently trending topics. These features can perhaps be explored
in future work.


1639(pm proofreading)(tracked)

  • 1. Abstract There has been a steady increase in the utilisation of mobile devices recently, due to technology advancements attained in the area of mobile communications and the integrated features of mobile units. Insider threat has always been a problematic issue faced by organisations and with a large majority of users now equipped with mobile devices, it has now become an even more prominent issue. Freedom of movement and ubiquitous accessibility to information make it necessary for a solution to be available to address such potential threats. In this paper, a unique profiling method is introduced using carefully selected database objects as well as data concerning the location of the database requests, in order to produce a comprehensive intrusion detection framework which is sensitive to the locations where database requests are initiated from. The conduct of experiments implementing the system has resulted in promising detection rates achieved, with low rates of false alarms observed. Introduction In recent times, advances in mobile computing and mobile communication have rapidly intensified the use of mobile devices. This trend is mainly due to the convergence features of the mobile units as well as the users’ ability to move around easily while using mobile devices. Currently, storage capacity and processing power of the mobile devices are typically enough to handle both personal and office work. However, if these mobile devices are lost or stolen, the damage caused to the relevant organisation may be substantial especially if sensitive information are stored on such mobile devices. To overcome this issue, organisations may create a database to store information centrally at a fixed server, with users just having to install the necessary applications on their mobile units. In this way, users are still able to reap the
  • 2. benefits of using mobile devices and enjoy the flexibility of movement while the risk of data loss is simultaneously minimised. Despite the said solution and the availability of several other techniques to protect data on mobile devices (such as data encryption), there is still a type of risk that cannot be managed by implementing these techniques, namely the risk of insiders trying to damage the database by illegally modifying the data. These insiders would be the normal internal users within the company who have the necessary access permissions to the data and who are aware of existing policies and security mechanisms implemented by the organisation. In such a situation, none of the encryption and access control techniques can prevent them from abusing their privilege. Insider threats are increasingly growing in organisations1. Therefore, deployment of an intrusion detection system becomes necessary in order to alert administrators against unauthorised data access by such insiders. One of the methods commonly considered for detecting database attacks is behavioral analysis. To implement this technique, specific profiles for the critical assets of the database need to be generated. This process can be automated using data mining methods. Profiles can represent the characteristics of a user or a group of users. However, the features may vary between different organisations or departments. This is the reason why intrusion detection systems should be adaptive to the specific requirements of an organisation. Profiling behavioral activity is done through recording users’ access to the different objects or the different levels of each object in a database. More detailed monitoring will produce more accurate profiles. For example, a profile for user’s access to tables is normally less accurate as compared to the one with access attributes of the database tables. On the other hand, queries
that come in the form of a transaction also have their own effects on the database. Given that different roles (representing a group of users or a department of the organisation) may display similar behavior in the database through their transactions, it is important to analyse the structure of the transactions and add it to the profile features. As mentioned in our previous work2, a combination of two query profile structures which are completely user-independent was introduced to form accurate profiles at transaction level. In this study, these profiles are enhanced to be better suited to mobile database environments, in order to produce a comprehensive intrusion detection framework which is sensitive to the locations where database requests are initiated from. It is necessary to set out a definition of the term “mobile database system” together with one core assumption made in this study, as follows:

Definition: In this study, a mobile database system is defined as a database system that receives data access requests from mobile devices. In this case, the database records the location information of each request.

Assumption: There is a relationship between access patterns and the geographical location of the devices from which database requests are initiated.

Literature

This section reviews a few issues that serve as background to this study: intrusion detection systems (IDS) in general, location-aware IDS, and IDS at database level. The gaps in each of the abovementioned areas are covered in this paper and are further discussed at the end. An IDS is defined as a mechanism to detect and report any suspicious activity occurring in the system. The idea of such a mechanism has been introduced and has been increasingly
researched during the past decade. As defined in3, an IDS should be able to produce an alarm to notify security administrators in case of any security threat. Two models of detection have been defined by which decisions are made on the existence of threats, namely anomaly detection and signature detection3-7. In anomaly detection, all activities are divided into two categories, “normal” and “abnormal”, with the “abnormal” behaviors regarded as intrusions3,6,7. Signature-based detection uses a predefined list of known intrusions to detect threats3,6-8. If a behavior is close to the stored patterns, it is considered an intrusion. IDS were first introduced to enhance network security against newly emerged security risks such as Denial of Service (DoS) and probing attacks9. However, the application of IDS has now extended to cover various network domains, computer systems and even applications. An example is the application of IDS at database level, as implemented in this paper.

Intrusion Detection Challenges: FP and FN

Although both signature-based and anomaly-based detection methods are widely used to detect suspicious intruder activities, there are still major concerns regarding the accuracy of IDS detection. For both detection methods, the rate of success or failure in detecting intrusions is the measure that determines whether a computer system is in a secure environment or otherwise. There are two common terms used when discussing IDS inaccuracy: the False Positive (FP) rate and the False Negative (FN) rate. They represent the accuracy level of the intrusion detection system. FP represents mistaken detections for which the alarm is wrongly
generated, whereas FN means the failure to detect when a real attack is attempted. IDS usually suffer from FPs and the related alarms due to wrong detection results10. An IDS must be able to fulfil reliability requirements in terms of precision in detecting attacks with minimum false alarms11.

Another challenging aspect of the performance of an IDS relates to the amount of data which it can handle at any one time. This refers to the speed at which the system can process incoming events in order to determine whether or not they are intrusions. When it comes to database intrusion detection systems, which are the main topic of this study, the volume of data requests that come through queries or transactions can determine the performance of the system. Therefore, both accuracy and performance are factors which should be considered when developing an IDS11. These two features have been carefully incorporated into the proposed database IDS in this study.

Location-aware Intrusion Detection Systems

Advancement in mobile technology has made the usage of mobile devices more convenient and popular. Factors such as long battery life, powerful processors, large storage capacities and low weight have motivated organisations to migrate to mobile units rather than continue using traditional computer systems. The most desirable benefit of using mobile units is the freedom of movement. On the other hand, data protection becomes a real concern since mobile devices are more likely to be lost. Aside from passwords, several security measures can be implemented on mobile devices, such as fingerprint authentication12 and voice recognition13. However, data remains at risk when a device is stolen. A portion of these data may contain confidential information that can be used to access an organisation’s network, applications and databases. In terms of database access (which is the main focus of this study), there should be a solution in place to detect
anomalous requests, which would enable administrators to be informed as to whether a request is sent by a legitimate user or an illegitimate one, given that the origin is a mobile unit for which permissions have already been set for the registered account.

Research on the mobility behavior of mobile phone users14 has shown that most of the observed users usually follow specific and limited paths and locations. In reaching this conclusion, the mobile phones of 100,000 users were monitored. The outcome of the research indicates that the location of mobile users is generally predictable. Therefore, this feature can be used to form normal profiles of the mobile units in an organisation. Network access patterns have been used to build profiles15 in order to detect abnormal behavior in a mobile user environment, with detection accuracy showing 90% success overall. Similarly, information about users’ locations has been utilised to build smartphone user profiles16, with 81% accuracy achieved. Enhancements to this research have been made17 using two profiling techniques: an empirical cumulative probability measure and a model of trajectories using Markov properties. However, these valuable studies are limited to the standalone mobile device and are not applicable to centralised database access.

Mobility and similarity detection have been widely used to detect anomalous activities in mobile environments18-20. Expression-based algorithms have been utilised to spot patterns in mobility21. Mobility patterns have also been utilised22 to address issues of social activity prediction. Association rule mining23 helps to extract the relationships between user activities and their locations. Besides association rule mining, other methods such as k Nearest Neighbours (KNN), Bayesian networks, bipartite graphs, and neural networks are used to determine regularity and detect malicious behaviors or malware on mobile devices24-26. An intrusion detection system
for mobile devices was introduced by Sun et al.27 using a three-level Markov chain which is independent of the time-to-location feature. It is effective only during phone calls, when the mobility speed is high. Another detection model, in which predefined routes are used, was proposed by Hall et al.28, with the focus being on mobility in public transportation. However, all the reviewed techniques and models have a common feature, which is dependence on the user. This dependency may affect the overall performance of an IDS if it is implemented centrally and is expected to deal with a large number of users. The focus of this paper is on databases in mobile environments which users can access through their mobile devices. Therefore, the computational task is performed in the system that hosts an instance of the database, making it necessary to have a user-independent detection method at this point.

Database

Lu et al.29 proposed a method to detect malicious transactions by measuring transaction violations in order to distinguish legal requests from illegal ones. Audit logs from the DBMS were used to implement this idea. Similarly, Fonseca et al.1 tried to identify abnormal activities by defining and comparing normal patterns in the stored database logs. Although the technique used by Lu et al.29 and Fonseca et al.1 is workable, it needs a lot of human supervision and the detection of insider attacks is not automatic.

A Networked Bayesian Network (NBN)30 has been used to calculate the probability of an action being an intrusion, in order to detect unauthorised insider attacks. This operation is done when a number of critical objects are accessed in a transaction. The authors make the assumption that half of the insider activities are to be regarded as intrusions. This is unsatisfactory since there
is no evidence or basis to support this assumption. Moreover, as mentioned by the authors, this method is not applicable to authorised insider attacks. A method which involves fingerprinting transactions was used by Lee et al.31 to detect unauthorised database access. They focused on Structured Query Language (SQL) statements to create signatures for the intrusion detection system, which is considered a signature-based approach. This approach may not work efficiently for unknown access patterns. Yaseen and Panda32 utilised a knowledge base to define the amount of information that insiders have about data items, and subsequently proposed a threat prediction graph to detect and prevent insider threats. Their work is a valuable contribution to this area of study since they conducted an in-depth analysis of the dependencies of database items. However, the work is based on several unrealistic assumptions about insiders’ access patterns, such as the assumption that the probability of misusing data previously accessed by insiders is equal to the probability of accessing new items.

Gaps

A majority of IDS focus on networks or hosts and therefore cannot be effective for databases. A few detection systems have been proposed or implemented at database level33-35. A review of current database IDS models and systems shows that there is still a lot of work to be done on the accuracy of the detection engine and on error rates in order to have a robust IDS at database level. Most of the current models suffer from high FP34 or FN33 rates, which makes it necessary to apply more accurate data preparation techniques before feeding data into the detection engine. High FN rates are a limitation of the traditional signature-based detection method because an effective expert system should only react to well-defined attacks. This causes reliability problems in relation to new entries for which there is no pattern or signature in the system36.
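The FP and FN rates discussed throughout this section can be made concrete with a short sketch. The helper below is illustrative only (it is not part of the paper's implementation); it computes the FP rate, FN rate and overall accuracy from the four outcome counts of a labelled test run:

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute IDS accuracy measures from labelled test results.

    tp: real attacks correctly flagged    fp: normal requests wrongly flagged
    tn: normal requests correctly passed  fn: real attacks missed
    """
    fp_rate = fp / (fp + tn)                    # false alarm rate
    fn_rate = fn / (fn + tp)                    # missed detection rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)  # overall correctness
    return fp_rate, fn_rate, accuracy
```

A robust database IDS aims to drive both rates down simultaneously; lowering one at the expense of the other merely trades alert fatigue for missed attacks.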
High FP rates are also a challenge for anomaly detection methods, where incomplete learning processes lead to a high number of false alarms. In this method, a statistical process is used to detect undefined attacks based on user or system behavior. The performance of an IDS that uses anomaly detection is highly dependent on the amount of correct behavioral statistics it gathers. System administrators usually suffer from having to handle a huge number of FP alerts daily37. Again, we can conclude that a more accurate repository of valid signatures and/or behavioral statistics is required to overcome the challenges mentioned above.

Another issue that may affect the accuracy of database intrusion detection systems (dIDS) is their handling of a mobile environment14,15. Although detection engines are placed at database level, they can still collect the mobility behavior of users, whose requests can help to produce more accurate patterns for detection purposes16. This hypothesis is supported by the experiment conducted in the present study.

As stated above, despite many attempts to secure databases against intruders, a lot of effort is still required to improve IDS in order to combat the phenomenon of insider attacks. Moreover, none of the research works mentioned adapts its system to mobile environments where requests are sent through mobile units. These gaps are covered in this study with the aim of providing a robust detection framework which can effectively deal with insider threats in a mobile environment. To enhance the capability of database intrusion detection, a multi-layer profiling method is introduced in this study, covering both context-based and result-set-based approaches as well as query structure. This model uses a unique structure incorporating carefully selected features which help to increase accuracy and consequently reduce false alarms. The proposed profiling
happens first at query level and then at transaction level. This helps to avoid dependencies on users and user roles. The factor of mobility is also considered: a location abstraction model is introduced to update the profile structures in order to enhance the accuracy of the detection engine while processing queries in mobile database systems.

Mobile

The demand for mobile devices is constantly growing, with rapid innovation seen in terms of technology, processing power, networking capabilities, and storage capacity. New features are constantly being introduced and integrated in these devices, enabling more services to be provided to users. Databases and data management services are no exception. More and more mobile applications now offer connections to central databases, leading to a variety of information being transferred onto the mobile device. Accessing and exchanging data over networks inevitably gives rise to concerns regarding security, and this issue becomes even more serious when the access is ubiquitous. Therefore, more security features should be introduced for mobile database systems, especially given the distributed nature of mobile systems38,39.

A mobile database system is a system in which applications can gain access to the database through mobile devices. In a mobile database system, either a portion of the database or the entire database can be integrated into the mobile unit. However, because of limitations such as battery power, capacity, and computational capabilities, the main database usually remains stored on a fixed machine. In general, mobile databases are referred to as a special form of distributed database system and can be defined in two forms. One is the implementation of whole databases in fixed and wired distributed systems. The other distributes the database between wired and wireless systems, in which both fixed and mobile
machines perform the task of data management40. As shown in Figure 1, the scenario selected for this study adopts the first-mentioned architecture, namely a mobile database system where mobile units are connected to the database system (DBS) through mobile networks. The database itself is also distributed among different locations. The fact that mobile units can carry a part of the database does not have any effect on the current work, since the main processing task for the proposed database intrusion detection system happens at the main databases, which are already synchronised.

Dataset

The main database is created in Microsoft SQL Server 2008 and is filled with about eight million records. To follow a formal structure for the database, the TPC-H standard is selected and implemented according to the published datasheet41. Fields of each table are populated with realistic sample data, and transactions are created using the dbgen and qgen tools respectively41. Transactions contain one or more queries accessing several attributes of the database in order to prepare the database log. The log file is used as the main dataset for the purpose of data mining.

The precise number of queries is 1000, with 500 transactions being generated and launched randomly on the database. These queries are used in the mining process, and we consider the database log an intrusion-free dataset. Details of the mining procedure are explained in the following section. Transactions contain a number of queries that help to gather enough data relationships to produce an acceptable range of itemsets for the data mining process. 100 malicious transactions are also generated to train the system and to check the accuracy of detection. The malicious transactions contain queries in which abnormal access patterns can be found, such as a DROP TABLE command.
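As a rough illustration of how such abnormal access patterns can be tagged when labelling synthetic transactions, consider the sketch below. The pattern list and function name are hypothetical; the paper's actual queries and transactions are produced with the dbgen and qgen tools:

```python
# Hypothetical patterns treated as abnormal when labelling transactions.
ABNORMAL_PATTERNS = ("drop table", "truncate table", "select *")

def label_transaction(queries):
    """Label a transaction 'malicious' if any of its queries
    contains one of the abnormal access patterns."""
    for query in queries:
        text = query.lower()
        if any(pattern in text for pattern in ABNORMAL_PATTERNS):
            return "malicious"
    return "normal"
```

A labelled set produced this way can then serve both to train the classifier and to measure detection accuracy.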
Mining Procedure

Several data mining tools are required for data analysis and pattern discovery, which can help to provide better strategies for business and scientific research. The existing gap between raw data and useful information demonstrates the importance of data mining techniques to turn plain data into valuable information. Therefore, techniques are needed for knowledge discovery that can automatically find logical relationships between data with minimum user intervention. Many issues may arise in terms of performance and efficiency when dealing with large volumes of data in databases. It should be noted that the same mining method may not provide the desired results when the objective of mining is altered.

For IDS, there are a few frequently used mining algorithms. Two common methods are association rule learning and sequence mining. The effectiveness of rule mining is complemented by the application of PARAS, an approach that forms a better representation of complex redundancy relationships42. Association rules are used to uncover relationships between seemingly unrelated data in relational databases or any other kind of data repository. An association rule is divided into two parts: “if” and “then”. For example, “if a customer buys a dozen eggs, then he is 80% likely to purchase milk.” In fact, association rules are constituted by analysing the data for “if/then” patterns and using criteria such as “support” and “confidence” to identify the most important relationships. “Support” determines the frequency with which items appear in the database, whilst “confidence” indicates how often the “if/then” rule has been found to be true. Findings from other studies43,44 also support the use of apriori for dIDS. The first stage in the framework proposed in this paper is to mine the dataset with the apriori algorithm and produce rules to be used later for
profiling. These rules show the frequent itemsets, including all data accesses which occur frequently in the database. With the help of association rule mining, data correlations can be highlighted. This unsupervised learning process is the first phase of the proposed intrusion detection system. An example of such an association is updating the customer address in an order request, which can be a normal action. However, if the update also includes the customer name and the invoice price, this may require more consideration. Similarly, if the address update task, which is a “normal” office job during office hours, takes place at night, this can be considered an intrusion. Here the relationship between updating an item and time may be helpful. Several examples can be discussed to demonstrate how data dependencies or correlations between items of an itemset are able to help the IDS produce more accurate decisions when processing new requests.

By using the apriori algorithm, the amount of data required to be processed to generate normal patterns may be reduced, given that itemsets that are not frequently accessed are removed from the list. At the end, the most frequently accessed itemsets remain, and more accurate profiles can thus be generated. This minimises the size of the data for the intrusion detection process. However, to what extent can we limit the item selection? This depends on two factors, namely minimum support and minimum confidence43. Choosing a proper support factor while searching the dataset can provide more accurate results. The purpose here is to have frequent subsets in each frequent itemset. A minimum threshold for the level of frequency is defined as the minimum support. Different minimum supports may lead to different numbers of rules. There is no specific method for determining minimum support except by testing several values in order to find the optimum one. In this experiment, the value is set at 30%. Minimum support is not an
absolute value and is relative to the dataset that is being used in the experiment. To identify a suitable minimum support, several testing sessions were conducted using different values, and the results were measured in terms of the coverage of the rules on the dataset. It was observed that for minimum supports higher than 30%, many of the data dependency rules were missed. On the other hand, for lower values such as 25%, some rules that should not be considered frequent itemsets were caught in the results. Finally, a value of 30% was chosen as the minimum support. The same logic is applied for determining the value of minimum confidence.

Implementation of apriori is done through the Weka45 libraries, which are used in the Java code presented in this work. The application developed by the author calls the Weka mining engine to apply apriori on the dataset. It is necessary to code the application in order to automate the mining procedures. After mining sequential patterns, apriori generates read and write sequence sets. It then extracts dependency rules using the minimum confidence. This parameter is set at 80% in this experiment. In the end, the output of apriori is a set of dependency rules showing associations between data items in the dataset. Samples of the rules extracted from the dataset utilised in this work are presented in Table 1. Overall, 20 rules are extracted by apriori, which are then used in the profiling process. This information helps us to gain better knowledge about normal behavior in database requests in order to build more accurate user-independent profiles.

Transaction

Although the database log file contains all activities performed by running queries, a proper extraction and interpretation process is required in order to prepare a suitable query structure for the data mining process. Profiling has recently been introduced and implemented in several works, as reviewed in Section 2. Profiling is used when the detection method is based on the
behavior of a user, a group of users (role), or queries. Therefore, the main issue that arises here is the selection of suitable features of the log file to increase the accuracy level of detection and to reduce errors accordingly. Based on the association rules mentioned above, profiles are created for transactions that follow the same dependency rules and their related itemsets. For example, if an itemset contains () a profile string will be created for the related transactions as: “” as well as other items in each transaction. It should be noted that the order of items is important in the profile string and there should not be any spaces. This is because a binary form of these profile strings is generated for the purpose of comparison with incoming requests.

Profile

Profiling can be performed at the query level, context level or even result-set level. However, for the purpose of this study, a combination of all the said levels is selected to form the profile structure. The efficiency of this profile structure has been established in previous work2. In addition, the location information of the database requests is added to the profiles to meet the requirements of a mobile database system. Therefore, the overall structure of the final profile is as follows: {}. This profile represents the SQL command, attribute, table name, result set, location and timestamp respectively.

Feature
For complex queries in which multiple attributes are accessed, all attributes are selected with their related command separately. For example, if the query41 is like: “SELECT” is counted along with each set of attributes and tables. This means that if the said query is assumed to be a normal query, the highlighted pairs can be listed as {}. However, not all sets would be considered in the final classification, since certain data accesses would be ignored in the data mining procedure due to their low frequency. Therefore, only data which are accessed frequently enough are listed in the rules extracted by the applied mining algorithm.

Classification

Classification is used to refine the training process in order to better distinguish between intrusion transactions and non-intrusion transactions. For the supervised learning task, labeled transactions need to be provided in order to teach the system about malicious queries. For this purpose, several transactions are generated as intrusions consisting of malicious queries. These transactions are then injected into the system, updating the classifier knowledge base. Through this process, the proposed system is completely taught about normal and malicious profiles. Therefore, the probability of an intrusion can be calculated depending on the level of the system’s current knowledge about intrusions. In this study, a Naive Bayes classifier is used to handle uncertainty and to find the probability of an intrusion occurrence. A brief explanation of this classifier and its effects on the current study is presented below.

Implementation

As explained earlier, data generator tools are used to fill up the database tables with realistic sample data. This means that if there are numeric, text and date fields in the tables, they are all
filled with numbers, characters and dates. A total of eight million records is generated in the database. As mentioned earlier, transactions are generated based on the recommendations41. However, several transactions are also created manually in order to fill up the database log. All malicious transactions are synthetic. They contain one or more queries with abnormal requests, such as deleting a table, updating an item outside normal hours, updating an item outside the normal location, selecting all attributes, updating multiple unrelated attributes, etc. These queries are used in intrusion-tagged transactions.

The mining process is performed by Weka45. Weka is an open source project that provides both data mining and classification features. It is used in this study to apply the apriori algorithm and the Naive Bayes classifier to the dataset. Weka provides libraries that can be integrated into different coding platforms and used inside a programmer’s code. In this way, a programmer can call Weka to launch any of the available data mining algorithms on predefined datasets. To be able to use these libraries, Weka should be installed completely on the computer system. Alternatively, a complete Weka package called weka.jar can be imported into the programming platform45.

Generating location information is performed through simulation. For this purpose, Network Simulator 2 (NS-2)47 is used to simulate a mobile network with multiple devices communicating with the database server while moving around a predefined area. Using this method, a complete location profile is collected and added to the profile structure mentioned earlier. Abstracting location information into profiles follows a mathematical procedure which is explained later in this paper. Adding locations to the profiles is necessary in order to convert the queries into location-aware queries. A location-aware query refers to a query which
contains at least one location-related attribute or one location-related predicate48. This profile item helps to increase the accuracy of the detection system for insiders who are using mobile units to connect to the database. Simulation parameters are set out in Table 2, whilst a view of the simulated area is shown in Figure 2.

To create location-based profiles, a standard model is required to represent the positions of mobile units. There are currently two widely used models, called the geometric model and the symbolic model49. The geometric model defines locations with n-dimensional coordinates including both longitude and latitude information. This can also be represented as a set of coordinates showing a boundary or geographical area. In symbolic models, abstract symbols are used to represent locations. This enables locations to be referred to by name and consequently makes the process of clustering and analysis easier, especially when the system requires human decisions49. In this study, the symbolic location model is used to create location profiles.

Statistics can prove to be immensely helpful in achieving the set goal. Most database requests come from mobile devices belonging to internal staff members, and these requests are directly linked to the location of the user. By exploiting this fact, it is possible to formulate a method to divide geographical areas into the mobile units’ related locations. For example, in a police database, most requests to access records relating to a specific city would come from police officers who are working in the same city. Similarly, requests to access the medical records of a patient in a multi-branch hospital would normally be made by staff of the same hospital branch as the patient. Although there are exceptions to these examples, the majority would follow the same logical structure. This assumption may help to enhance the accuracy of an IDS.
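One simple statistical way to perform such a division is sketched below. It groups one-dimensional location observations around their mean and flags observations lying more than one standard deviation away; the coordinates and the one-sigma threshold are illustrative assumptions, not the paper's exact procedure:

```python
from statistics import mean, stdev

def split_locations(observations):
    """Split location observations into a usual area and outliers.

    Observations within one standard deviation of the mean form the
    mobile unit's usual area; the rest are flagged for inspection.
    """
    m, sd = mean(observations), stdev(observations)
    usual = [x for x in observations if abs(x - m) <= sd]
    outliers = [x for x in observations if abs(x - m) > sd]
    return usual, outliers
```

Requests arriving from positions in the outlier group would then warrant closer scrutiny by the detection engine.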
For example, when the record of a patient in a particular hospital branch is updated by a request that comes from a different branch, or when an event related to a particular city is updated by a request
which comes from a mobile unit located in another city, these are not to be treated as normal requests. The system should bring them to the administrators’ notice through alarms.

In order to divide location information into logical areas, it is necessary to find the most frequently repeated locations, set them as the centers of the distribution, and then calculate and list their neighbours in the same group. Statistically, the mean value can do this job: it is simply the average of a group of data, calculated by adding up all the values and dividing the sum by the total number of values. Once the mean has been determined, the degree of deviation of all locations from the mean value is calculated. This can help to identify a threshold for dividing location information into separate geographical areas. The standard deviation is used for this purpose: it measures the distance of each observation from the mean value and provides an average deviation.

Performance

Performance metrics are used to measure the accuracy and reliability of the proposed IDS. FP, FN and accuracy are the metrics used in this experiment. In general, a higher true positive rate increases the accuracy of the detection engine. Similarly, fewer detection errors result in lower FP and FN rates.

Data

K-fold cross validation (CV) is used to evaluate the accuracy and validity of the dataset utilised in this study. In k-fold CV, the dataset is divided into equal partitions called folds52. A model
should then be defined and applied to each partition in order to calculate the results based on predetermined metrics. The detection framework proposed in this study is used as the model, and detection accuracy is adopted as the metric to test and validate the dataset. Having defined the k folds, one is selected to test the dataset whilst the rest of the folds are used to train the system. This action is repeated for each fold until all folds have played the role of both trainer and tester of the dataset. At the end, an average of the observed results represents the overall effectiveness of the system. In order to provide an appropriate representation of the dataset in each fold, the data is properly classified and arranged to ensure that each fold contains an equal share of the whole dataset. This means that the number of instances in every fold should be the same even though they are selected randomly. The effectiveness of using k-fold CV has been demonstrated and recommended in52-54.

Proposed

The main components of the proposed framework are profile generation, transaction processing, and probability checking. Each of these components is discussed in this section. Application of this framework to the utilised dataset has shown interesting results which demonstrate higher accuracy in terms of intrusion detection. The structure of the framework is obtained from our previous work2. However, a more detailed and accurate profiling method has been applied, and the framework has been enhanced for mobile database systems, as shown in Figure 4.

Incoming
A database transaction containing one or more queries is injected into the system using Java code in order to determine whether or not it contains intrusions. Each query is converted into a profile string so that it can be compared with the existing intrusion or valid patterns. The conversion is straightforward: all the required elements are extracted from the query and listed next to each other, separated by commas. In effect, every word inside the transaction is compared with a list of keywords; if a word matches, it is added to the new string for further comparison. Keywords are obtained from the items extracted by mining the database queries. These words are listed as frequently accessed itemsets or association rules in the output of the mining procedure. The profile strings are then converted into binary in order to ease the comparison process.

Transactions are read from a text file and loaded into a temporary string variable, so there is no constraint on loading and processing incoming transactions through the Java code. An exclusive OR (XOR) operation is used to compare binary records, where an output of ‘0’ represents equality of the two binary strings. This process follows Algorithm 2, as explained below. The instructions in Algorithm 2 show how the system behaves with new transactions. In summary, if the text string generated for an incoming transaction does not match any existing valid pattern, it is sent to the classifier in order to check the probability of intrusion. The Naive Bayes classifier used in this study helps to determine whether or not the new pattern is an intrusion by measuring the level of similarity between the new pattern and existing ones. On the other hand, if the pattern of the incoming transaction matches an existing pattern, it is considered a normal request and can pass through the system without further examination. There is a further possibility: a profile string without attributes.
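The conversion and XOR comparison described above can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 2: the keyword list, the comma-joined profile format, and the integer bit encoding are assumptions made here for clarity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch of profile-string generation and XOR-based comparison.
// Keywords would come from the mined frequent itemsets / association rules.
public class ProfileMatcher {
    // Build a profile string: keep only the words that appear in the mined
    // keyword list, joined by commas in keyword order.
    static String toProfile(String query, List<String> keywords) {
        Set<String> words = new HashSet<>(Arrays.asList(query.toLowerCase().split("\\W+")));
        List<String> hits = new ArrayList<>();
        for (String kw : keywords) {
            if (words.contains(kw)) hits.add(kw);
        }
        return String.join(",", hits);
    }

    // Encode a profile string as bits over the keyword list: bit i is set
    // when keyword i is present in the profile.
    static int toBits(String profile, List<String> keywords) {
        Set<String> parts = new HashSet<>(Arrays.asList(profile.split(",")));
        int bits = 0;
        for (int i = 0; i < keywords.size(); i++) {
            if (parts.contains(keywords.get(i))) bits |= 1 << i;
        }
        return bits;
    }

    // XOR of two encodings is 0 exactly when the two profiles are equal.
    static boolean matches(int a, int b) {
        return (a ^ b) == 0;
    }
}
```

For example, with the hypothetical keyword list ("select", "salary", "employee"), the query "SELECT salary FROM employee WHERE id = 5" yields the profile string "select,salary,employee", and two transactions match only when the XOR of their encodings is zero.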
When the profile string contains no attributes, the incoming transaction is detected as an intrusion and an alarm will be generated to end the
process. This is because SQL commands that do not contain target attributes most probably fall into an intrusion category, such as “” or SQL injection attacks like “ ”, in which “1” is an attribute used to bypass the condition.

Probability Checking

As stated earlier, if an incoming request does not match any existing valid pattern, it is sent for checking by the Naive Bayes classifier. At this stage, the probabilities of the request being an “intrusion” and a “non-intrusion” are both calculated by the system using Equation 4, and the greater probability is taken as the output of this procedure. If it is concluded that the incoming entry is an intrusion, an alarm is generated and the intrusion repository is updated accordingly. However, if it is concluded that the incoming entry is not an intrusion, the valid pattern repository is updated with the new pattern.

Intrusion Repository

The intrusion repository is a list of all intrusions detected by the system. In the beginning this repository is empty; it grows gradually once the system starts to detect intrusions. As shown in Figure 4, having such a repository helps to detect repeated intrusions without having to send them for examination. This reduces the detection time dramatically, leading to higher overall performance. Substantial processing is required to compare incoming requests with the existing valid patterns, and this work is avoided whenever a matching pattern is already present in the intrusion repository. The repository is updated whenever an alarm is generated due to an intrusion. The code which updates the intrusion repository follows Algorithm 3, as presented below.
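Algorithms 2 and 3 themselves are not reproduced in this excerpt. As an illustration only, the overall dispatch flow described above (valid-pattern match, intrusion-repository lookup, then the probability check, with repository updates on alarms) might be sketched as follows; the class, method names, and the placeholder classifier standing in for Equation 4 are all hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the detection flow, not the paper's actual code:
// 1) an empty profile (no attributes) raises an alarm immediately;
// 2) a known valid pattern passes without further examination;
// 3) a pattern already in the intrusion repository raises an alarm without re-checking;
// 4) otherwise the Naive Bayes probability check (Equation 4) decides, and the
//    matching repository is updated with the new pattern.
public class DetectionDispatcher {
    private final Set<String> validPatterns = new HashSet<>();
    private final Set<String> intrusionRepository = new HashSet<>();

    public DetectionDispatcher(Set<String> knownValid) {
        validPatterns.addAll(knownValid);
    }

    // Returns true if an alarm is raised for this profile string.
    public boolean process(String profile, java.util.function.Predicate<String> isIntrusion) {
        if (profile.isEmpty()) return true;                     // no attributes: intrusion
        if (validPatterns.contains(profile)) return false;      // normal request, pass through
        if (intrusionRepository.contains(profile)) return true; // repeated intrusion, no re-check
        if (isIntrusion.test(profile)) {                        // probability check stands in for Equation 4
            intrusionRepository.add(profile);                   // update intrusion repository (alarm raised)
            return true;
        }
        validPatterns.add(profile);                             // learn the new valid pattern
        return false;
    }
}
```

Note how a repeated intrusion is caught by the repository lookup on its second appearance, skipping the classifier entirely; this is the source of the processing-time saving claimed above.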
Implementation

As noted earlier, the k-fold cross validation method is adopted to assess the validity of the data used in both the training and testing processes. For this purpose, the dataset is divided into five segments, each called a fold here. Each fold is stored in a separate file to be used as either a training set or a test set; these roles rotate among the folds so that every fold ends up performing both. The code implementing k-fold CV is written in Java and connected to the Weka application in order to apply the algorithms.

There are 1000 normal patterns and 100 malicious patterns, all of which are synthetic. The valid patterns are randomly divided into 5 groups of 200 patterns each, and the intrusions likewise into 5 groups of 20. The valid groups and the intrusion groups are then merged to produce 5 folds of 220 patterns each, containing both normal and malicious records. Since k is 5, the data validation experiment is repeated 5 times so that each fold is used as both training set and test set; in each run, 4 folds are used to train the system and 1 fold is used to test it. The metric is accuracy, as defined in Equation 9. Once the accuracy has been measured for every iteration of the experiment, the average of all results represents the final detection rate, which is 96.1%.

Results
The same detection methods, dataset, and framework used for general databases are applied to check the effectiveness of the enhanced profile structure in mobile database intrusion detection [2]. As expected, adding location profiles reduces errors in mobile environments: the FN and FP rates are reduced by 14% and 0.84% respectively. The results and a comparison with previous works are shown in Table 4, where it can be seen that better results are obtained for the detection method proposed in this paper. The application of the proposed enhanced profiling method, which relates location information to the valid patterns, improves accuracy. The reduction in FPR is also considerable and demonstrates how knowledge about locations can be exploited to reduce errors in the system.

Discussion

As mentioned earlier, the raw data accessed are mined with association rules. The mining algorithm extracts the most frequent itemsets for any related attributes, so the output is a rule showing a relationship between two or more attributes that are frequently accessed together. Consequently, data access patterns with low frequency are removed from the list, and the profiles created from these rules are more accurate than profiles built from all data accessed. This method is the key to reducing FP rates. On the other hand, by applying accurate and relative location information to the profiles, more precise knowledge can be obtained about the normal behaviour of data requests in a mobile environment, which helps to reduce the overall FN rates. Therefore, the unique proposed
profiling technique, together with the decision-making method used in this study, results in higher detection accuracy and fewer false alarms.

Conclusion

An effective intrusion detection framework, optimised for mobile database environments, has been introduced and discussed in this study. Findings from the implementation of the proposed detection method show a high level of accuracy and a low FP rate, owing to the unique profiling method and detection mechanism applied in the system. The detection system is designed to be sensitive to the locations from which database requests are initiated. In the experiment performed, promising results were obtained for accuracy and the FN rate, at 96.1% and 17% respectively, which improves on previous works. The accurate profiling method, the addition of location information, and the comprehensive detection procedure are the key features of this experiment.

Feature selection is always a challenge when creating profiles. Other combinations of query features can be tested to evaluate the efficiency of the system; however, adding more features may affect the performance of detection systems, and this should be considered when designing detection solutions. Although higher performance is already achieved by applying an intrusion repository in the proposed framework, more work can be done to enhance processing speed, for example by using multi-threaded code. There is also a chance to increase the detection rate if the system can be combined and customised with an organisation's policy repository. Although this will make the system
specifically tailored to each organisation, it will increase the accuracy of the valid patterns. Policies, in which sensitive attributes or requests can be defined, vary between organisations; such definitions help to highlight abnormal activities more accurately and thereby increase the overall accuracy of the IDS. Upgrading the proposed dIDS to a database intrusion prevention system (dIPS), and using a cloud service to make the detection engine available at all times over the internet, are two possible enhancements aligned with current trends that can be explored in future work.