In the name of Allah, the Most Gracious, the Most Merciful
A Novel Web Search Engine Model Based on Index-
Query Bit-Level Compression
Prepared By
Saif Mahmood Saab
Supervised By
Dr. Hussein Al-Bahadili
Dissertation
Submitted In Partial Fulfillment
of the Requirements for the Degree of Doctor of Philosophy
in Computer Information Systems
Faculty of Information Systems and Technology
University of Banking and Financial Sciences
Amman - Jordan
(May - 2011)
Authorization
I, the undersigned Saif Mahmood Saab authorize the Arab
Academy for Banking and Financial Sciences to provide copies of
this Dissertation to Libraries, Institutions, Agencies, and any
Parties upon their request.
Name: Saif Mahmood Saab
Signature:
Date: 30/05/2011
Acknowledgments
First and foremost, I thank Allah (Subhana Wa Taala) for endowing me
with health, patience, and knowledge to complete this work.
I am thankful to anyone who supported me during my study. I would like
to thank my honorific supervisor, Dr. Hussein Al-Bahadili, who accepted
me as his Ph.D. student without any hesitation and offered me so much
advice, patiently supervising me, and always guiding me in the right
direction.
Last but not least, I would like to thank my parents for their support over
the years, my wife for her understanding and continued encouragement,
and my friends, especially Mahmoud Alsiksek and Ali AlKhaledi.
It will not be enough to express my gratitude in words to all those people
who helped me; I would still like to give my many, many thanks to all
these people.
List of Figures
Figure Title Page
1.1 Architecture and main components of standard search engine model. 10
3.1 Architecture and main components of the CIQ Web search engine model. 41
3.2 Lists of IDs for each type of character sets assuming m=6. 48
3.3-a Locations of data and parity bits in 7-bit codeword. 54
3.3-b An uncompressed binary sequence of 21-bit length divided into 3 blocks of 7-bit length, where b1 and b3 are valid blocks, and b2 is a non-valid block. 54
3.3-c The compressed binary sequence (18-bit length). 54
3.4 The main steps of the HCDC compressor. 55
3.5 The main steps of the HCDC decompressor. 56
3.6 Variation of Cmin and Cmax with p. 58
3.7 Variation of r1 with p. 59
3.8 Variations of C with respect to r for various values of p. 60
3.9 The compressed file header of the HCDC scheme. 65
4.1 The compression ratio (C) for different sizes index files. 75
4.2 The reduction factor (Rs) for different sizes index files. 76
4.3 Variation of C and average Sf for different sizes index files. 89
4.4 Variation of Rs and Rt for different sizes index files. 89
4.5 The CIQ performance triangle. 90
5.1 The CIQ performance triangle. 92
List of Tables
Table Title Page
1.1 Document ID and its contents. 8
1.2 A record and word level inverted indexes for documents in Table (1.1). 8
3.1 List of most popular stopwords (117 stop-words). 47
3.2 Type of character sets and equivalent maximum number of IDs. 47
3.4 Variation of Cmin, Cmax, and r1 with number of parity bits (p). 58
3.6 Variations of C with respect to r for various values of p. 59
3.7 Valid 7-bit codewords. 61
3.8 The HCDC algorithm compressed file header. 64
4.1 List of visited Websites. 71
4.2 The sizes of the generated indexes. 72
4.3 Type and number of characters in each generated inverted index file. 73
4.4 Type and frequency of characters in each generated inverted index file. 74
4.5 Values of C and Rs for different sizes index files. 75
4.6 Performance analysis and implementation validation. 77
4.7 List of keywords. 78
4.8 Values of No, Nc, To, Tc, Sf and Rt for 1000 index file. 79
4.9 Values of No, Nc, To, Tc, Sf and Rt for 10000 index file. 80
4.10 Values of No, Nc, To, Tc, Sf and Rt for 25000 index file. 81
4.11 Values of No, Nc, To, Tc, Sf and Rt for 50000 index file. 82
4.12 Values of No, Nc, To, Tc, Sf and Rt for 75000 index file. 83
4.13 Variation of Sf for different index sizes and keywords. 85
4.14 Variation of No and Nc for different index sizes and keywords. 86
4.15 Variation of To and Tc for different index sizes and keywords. 87
4.16 Values of C, Rs, average Sf, and average Rt for different sizes index files. 88
Abbreviations
ACW Adaptive Character Wordlength
API Application Programming Interface
ASCII American Standard Code for Information Interchange
ASF Apache Software Foundation
BWT Burrows-Wheeler block sorting transform
CIQ compressed index-query
CPU Central Processing Unit
DBA Database Administrator
FLH Fixed-Length Hamming
GFS Google File System
GZIP GNU zip
HCDC Hamming Code Data Compression
HTML Hypertext Mark-up Language
ID3 A metadata container used in conjunction with the MP3 audio file
format
JSON JavaScript Object Notation
LAN Local Area Networks
LANMAN Microsoft LAN Manager
LDPC Low-Density Parity Check
LZW Lempel-Ziv-Welch
MP3 A patented digital audio encoding format
NTLM Windows NT LAN Manager
PDF Portable Document Format
RLE Run Length Encoding
RSS Really Simple Syndication
RTF Rich Text Format
SAN Storage Area Networks
SASE Shrink And Search Engine
SP4 Windows Service Pack 4
UNIX UNiplexed Information and Computing Service
URL Uniform Resource Locator
XML Extensible Markup Language
ZIP A data compression and archive format; the name zip (meaning speed) reflects its fast operation
Table of Contents
Authorization - ii -
Dedications - iii -
Acknowledgments - iv -
List of Figures -v-
List of Tables - vi -
Abbreviations - vii -
Table of Contents - viii -
Abstract -x-
Chapter One -1-
Introduction -1-
1.1 Web Search Engine Model -3-
1.1.1 Web crawler -3-
1.1.2 Document analyzer and indexer -4-
1.1.3 Searching process -9-
1.2 Challenges to Web Search Engines - 10 -
1.3 Data Compression Techniques - 12 -
1.3.1 Definition of data compression - 12 -
1.3.2 Data compression models - 12 -
1.3.3 Classification of data compression algorithms - 14 -
1.3.4 Performance evaluation parameters - 17 -
1.4 Current Trends in Building High-Performance Web Search Engine - 20 -
1.5 Statement of the Problem - 20 -
1.6 Objectives of this Thesis - 21 -
1.7 Organization of this Thesis - 21 -
Chapter Two - 23 -
Literature Review - 23 -
2.1 Trends Towards High-Performance Web Search Engine - 23 -
2.1.1 Succinct data structure - 23 -
2.1.2 Compressed full-text self-index - 24 -
2.1.3 Query optimization - 24 -
2.1.4 Efficient architectural design - 25 -
2.1.5 Scalability - 25 -
2.1.6 Semantic search engine - 26 -
2.1.7 Using Social Networks - 26 -
2.1.8 Caching - 27 -
2.2 Recent Research on Web Search Engine - 27 -
2.3 Recent Research on Bit-Level Data Compression Algorithms - 33 -
Chapter Three - 39 -
The Novel CIQ Web Search Engine Model - 39 -
3.1 The CIQ Web Search Engine Model - 40 -
3.2 Implementation of the CIQ Model: CIQ-based Test Tool (CIQTT) - 42 -
3.2.1 COLCOR: Collects the testing corpus (documents) - 42 -
3.2.2 PROCOR: Processing and analyzing testing corpus (documents) - 46 -
3.2.3 INVINX: Building the inverted index and start indexing. - 46 -
3.2.4 COMINX: Compressing the inverted index - 50 -
3.2.5 SRHINX: Searching index (inverted or inverted/compressed index) - 51 -
3.2.6 COMRES: Comparing the outcomes of different search processes
performed by SRHINX procedure. - 52 -
3.3 The Bit-Level Data Compression Algorithm - 52 -
3.3.1 The HCDC algorithm - 52 -
3.3.2 Derivation and analysis of HCDC algorithm compression ratio - 56 -
3.3.3 The Compressed File Header - 63 -
3.4 Implementation of the HCDC algorithm in CIQTT - 65 -
3.5 Performance Measures - 66 -
Chapter Four - 68 -
Results and Discussions - 68 -
4.1 Test Procedures - 69 -
4.2 Determination of the Compression Ratio (C) & the Storage Reduction Factor (Rs) - 70 -
4.2.1 Step 1: Collect the testing corpus using COLCOR procedure - 70 -
4.2.2 Step 2: Process and analyze the corpus to build the inverted index file using
PROCOR and INVINX procedures - 72 -
4.2.3 Step 3: Compress the inverted index file using the INXCOM procedure - 72 -
4.3 Determination of the Speedup Factor (Sf) and the Time Reduction Factor (Rt) - 77 -
4.3.1 Choose a list of keywords - 77 -
4.3.2 Perform the search processes - 78 -
4.3.3 Determine Sf and Rt. - 84 -
4.4 Validation of the Accuracy of the CIQ Web Search Model - 88 -
4.5 Summary of Results - 88 -
Chapter Five - 91 -
Conclusions and Recommendations for Future Work - 91 -
5.1 Conclusions - 91 -
5.2 Recommendations for Future Work - 93 -
References - 94 -
Appendix I - 105 -
Appendix II - 108 -
Appendix III - 112 -
Appendix IV - 115 -
Abstract
A Web search engine is an information retrieval system designed to help find
information stored on the Web. A standard Web search engine consists of three main
components: a Web crawler, a document analyzer and indexer, and a search processor.
Due to the rapid growth in the size of the Web, Web search engines are facing
enormous performance challenges in terms of storage capacity, data retrieval rate,
query processing time, and communication overhead. Large search engines, in
particular, have to be able to process tens of thousands of queries per second over
tens of billions of documents, making query throughput a critical issue. To satisfy
this heavy workload, search engines use a variety of performance optimizations,
including succinct data structures, compressed text indexing, query optimization,
high-speed processing and communication systems, and efficient search engine
architectural design. However, the performance of current Web search engine models
still falls short of meeting user and application needs.
In this work we develop a novel Web search engine model based on index-query
compression; it is therefore referred to as the compressed index-query (CIQ) model.
The model incorporates two compression layers, both implemented at the back-end
processor (server) side: one layer resides after the indexer, acting as a second
compression layer to generate a double-compressed index, while the second resides
after the query parser, compressing the query to enable compressed index-query
search. The data compression algorithm used is the novel Hamming code data
compression (HCDC) algorithm.
The different components of the CIQ model are implemented in a number of
procedures forming what is referred to as the CIQ test tool (CIQTT), which is used as
a test bench to validate the accuracy and integrity of the retrieved data and to
evaluate the performance of the CIQ model. The results obtained demonstrate that the
new CIQ model attains excellent performance compared to the current uncompressed
model, achieving 100% agreement with the results of the uncompressed model.
The new model demands less disk space, as the HCDC algorithm achieves a
compression ratio of over 1.3 with a compression efficiency of more than 95%, which
implies a reduction in storage requirements of over 24%. The new CIQ model also
performs faster than the current model, achieving a speedup factor of over 1.3 and a
corresponding reduction in processing time of over 24%.
Chapter One
Introduction
A search engine is an information retrieval system designed to help find files stored
on a computer, for example, on a public server on the World Wide Web (or simply the
Web), on a server in a private network of computers, or on a stand-alone computer
[Bri 98]. The search engine allows us to search the storage media for content in the
form of text meeting specific criteria (typically text containing a given word or
phrase) and to retrieve a list of files that match those criteria. In this work, we are
concerned with the type of search engine designed to help find files stored on the
Web (the Web search engine).
Webmasters and content providers began optimizing sites for Web search engines in the
mid-1990s, as the first search engines were cataloging the early Web. Initially, all a
webmaster needed to do was submit the address of a page, or uniform resource
locator (URL), to the various engines, which would send a spider to crawl that page,
extract links to other pages from it, and return the information found on the page to be
indexed [Bri 98]. The process involves a search engine crawler downloading a page and
storing it on the search engine's own server, where a second program, known as an
indexer, extracts various information about the page, such as the words it contains and
where their locations are, any weight given to specific words, and all links the page
contains, which are then placed into a scheduler for crawling at a later date [Web 4].
A standard search engine consists of the following main components: a Web crawler, a
document analyzer and indexer, and a searching process [Bah 10d]. The main purpose of
using a particular data structure for searching is to construct an index that allows
focusing the search for a given keyword (query). The improvement in query performance
is paid for by the additional space necessary to store the index. Therefore, most of
the research in this field has been directed at designing data structures which offer
a good trade-off between query and update time on one hand and space usage on the
other.
For this reason compression always appears an attractive choice, if not a mandatory
one. However, space overhead is not the only resource to be optimized when managing
large data collections; in fact, data turn out to be useful only when properly indexed
to support search operations that efficiently extract the user-requested information.
Approaches that combine compression and indexing techniques are nowadays receiving
more and more attention. A first step towards the design of a compressed full-text
index is achieving guaranteed performance and lossless data [Fer 01].
In light of the significant increase in CPU speed, it is now more economical to store
data in compressed form than uncompressed. Storing data in compressed form may
introduce significant improvements in space occupancy and also in processing time,
because space optimization is closely related to time optimization in disk memory
[Fer 01].
A number of trends have been identified in the literature for building high-performance
search engines, such as succinct data structures, compressed full-text self-indexes,
query optimization, and high-speed processing and communication systems. Starting
from these promising trends, many researchers have tried to combine text compression
with indexing techniques and searching algorithms. They have mainly investigated and
analyzed the compressed matching problem under various compression schemes [Fer 01].
Due to the rapid growth in the size of the Web, Web search engines are facing enormous
performance challenges in terms of (i) storage capacity, (ii) data retrieval rate,
(iii) query processing time, and (iv) communication overhead. The large engines, in
particular, have to be able to process tens of thousands of queries per second over
tens of billions of documents, making query throughput a critical issue. To satisfy
this heavy workload, search engines use a variety of performance optimizations,
including index compression. With the tremendous increase in user and application
needs, we believe that current search engine models require higher retrieval
performance, and that more compact and cost-effective systems are still needed.
In this work we develop a novel Web search engine model that is based on index-query
bit-level compression. The model incorporates two bit-level data compression layers,
both implemented at the back-end processor side: one after the indexer, acting as a
second compression layer to generate a double-compressed index, and the other after
the query parser, compressing the query to enable bit-level compressed index-query
search. As a result, less disk space is required to store the compressed index file,
disk I/O overhead is reduced, and consequently a higher retrieval rate or performance
is achieved.
An important feature required of any bit-level technique used to perform the search
process at the compressed index-query level is that it must generate the same
compressed binary sequence for the same character in both the search queries and the
index files. The data compression technique that satisfies this important feature is
the HCDC algorithm [Bah 07b, Bah 08a]; therefore, it is used in this work. Recent
investigations on using this algorithm for text compression have demonstrated
excellent performance in comparison with many widely used and well-known data
compression algorithms and state-of-the-art tools [Bah 07b, Bah 08a].
1.1 Web Search Engine Model
A Web search engine is an information retrieval system designed to help find files stored
on a public server on the Web [Bri 98, Mel 00]. A standard Web search engine consists of
the following main components:
• Web crawler
• Document analyzer and indexer
• Searching process
In what follows we provide a brief description for each of the above components.
1.1.1 Web crawler
A Web crawler is a computer program that browses the Web in a methodical, automated
manner. Other terms for Web crawlers are ants, automatic indexers, bots, worms, Web
spider and Web robot. Unfortunately, each spider has its own personal agenda as it
indexes a site. Some search engines use META tag; others may use the META description
3
15. of a page, and some use the first sentence or paragraph on the sites. That is mean; a page
that ranks higher on one web search engine may not rank as well on another. Given a set
of “URLs” unified resource locations, the crawler repeatedly removes one URL from the
set, downloads the targeted page, extracts all the URLs contained in it, and adds all
previously unknown URLs to the set [Bri 98, Jun 00].
Web search engines work by storing information about many Web pages, which they
retrieve from the Web itself. These pages are retrieved by a spider, a sophisticated
Web browser which follows every link extracted or stored in its database. The contents
of each page are then analyzed to determine how it should be indexed; for example,
words are extracted from the titles, headings, or special fields called meta tags.
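The crawler loop described above (remove a URL from the frontier, download the page, extract its links, enqueue unseen URLs) can be sketched as follows. This is a minimal sketch: the `fetch` function is an assumed stand-in for real HTTP retrieval and link extraction, and a production crawler would add politeness delays, robots.txt handling, and error recovery.

```python
from collections import deque

def crawl(seed_urls, fetch, max_pages=100):
    """Minimal crawler loop. `fetch(url)` is assumed to return a pair
    (page_text, list_of_links); in a real crawler it would perform an
    HTTP request and parse the HTML for links."""
    frontier = deque(seed_urls)   # URLs waiting to be visited
    seen = set(seed_urls)         # all URLs ever added to the frontier
    pages = {}                    # url -> page text, for the indexer
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        text, links = fetch(url)
        pages[url] = text
        for link in links:
            if link not in seen:  # add only previously unknown URLs
                seen.add(link)
                frontier.append(link)
    return pages

# Usage with a tiny in-memory link graph instead of the live Web:
graph = {"a": ("page A", ["b", "c"]), "b": ("page B", ["a"]), "c": ("page C", [])}
pages = crawl(["a"], lambda u: graph[u])
```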
1.1.2 Document analyzer and indexer
Indexing is the process of creating an index, a specialized file containing a
compiled version of the documents retrieved by the spider [Bah 10d]. The indexing
process collects, parses, and stores data to facilitate fast and accurate information
retrieval. Index design incorporates interdisciplinary concepts from linguistics,
mathematics, informatics, physics, and computer science [Web 5].
The purpose of storing an index is to optimize the speed and performance of finding
relevant documents for a search query. Without an index, the search engine would have
to scan every (possible) document on the Internet, which would require considerable
time and computing power (impossible at the current size of the Internet). For
example, while an index of 10,000 documents can be queried within milliseconds, a
sequential scan of every word in the documents could take hours. The additional
computer storage required to store the index, as well as the considerable increase in
the time required for an update to take place, are traded off for the time saved
during information retrieval [Web 5].
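The trade-off just described can be made concrete with a small sketch: a sequential scan touches every word of every document on each query, while a pre-built index pays a one-time construction cost and then answers each lookup in roughly constant time. The two documents below are illustrative placeholders.

```python
documents = {1: "aqaba is a hot city", 2: "amman is a cold city"}

def scan_search(word):
    """Sequential scan: inspects every document on every query."""
    return [doc_id for doc_id, text in documents.items() if word in text.split()]

# One-time indexing cost: build word -> set of document IDs.
index = {}
for doc_id, text in documents.items():
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

def index_search(word):
    """Indexed lookup: a single hash-table access per query."""
    return sorted(index.get(word, ()))

# Both return the same answer; only the cost model differs.
```

The extra storage for `index` and the work of keeping it current when documents change are exactly the costs the text says are traded for retrieval speed.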
Index design factors
Major factors should be carefully considered when designing a search engine; these
include [Bri 98, Web 5]:
• Merge factors: How data enters the index, or how words or subject features are
added to the index during text corpus traversal, and whether multiple indexers can
work asynchronously. The indexer must first check whether it is updating old
content or adding new content. Traversal typically correlates to the data collection
policy. Search engine index merging is similar in concept to the SQL Merge
command and other merge algorithms.
• Storage techniques: How to store the index data, that is, whether information
should be data compressed or filtered.
• Index size: How much computer storage is required to support the index.
• Lookup speed: How quickly a word can be found in the index. The speed of
finding an entry in a data structure, compared with how quickly it can be updated
or removed, is a central focus of computer science.
• Maintenance: How the index is maintained over time.
• Fault tolerance: How important it is for the service to be robust. Issues include
dealing with index corruption, determining whether bad data can be treated in
isolation, dealing with bad hardware, partitioning, and schemes such as hash-based
or composite partitioning, as well as replication.
Index data structures
Search engine architectures vary in the way indexing is performed and in the methods
of index storage used to meet the various design factors. There are many architectures
for indexes; the most widely used is the inverted index, which saves a list of
occurrences of every keyword, typically in the form of a hash table or binary tree
[Bah 10c].
Several processes take place during indexing; here, only the processes related to our
work are discussed. Whether a given process is used depends on the search engine
configuration [Bah 10d].
• Extract URLs. A process of extracting all URLs from the document being
indexed; it is used to guide crawling of the website, do link checking, build a site
map, and build a table of internal and external links from the page.
• Code stripping. A process of removing hyper-text markup language (HTML) tags,
scripts, and styles, and decoding HTML character references and entities used to
embed special characters.
• Language recognition. A process by which a computer program attempts to
automatically identify, or categorize, the language or languages in which a
document is written.
• Document tokenization. A process of detecting the encoding used for the page;
determining the language of the content (some pages use multiple languages);
finding word, sentence, and paragraph boundaries; combining multiple adjacent
words into one phrase; and changing the case of the text.
• Document parsing or syntactic analysis. The process of analyzing a sequence of
tokens (for example, words) to determine their grammatical structure with respect
to a given (more or less) formal grammar.
• Lemmatization/stemming. The process of reducing inflected (or sometimes
derived) words to their stem, base, or root form, generally a written word form;
this stage can be done in the indexing and/or searching stage. The stem need not
be identical to the morphological root of the word; it is usually sufficient that
related words map to the same stem, even if this stem is not in itself a valid root.
The process is useful in search engines for query expansion or indexing and for
other natural language processing problems.
• Normalization. The process by which text is transformed to make it consistent
in a way it might not have been before. Text normalization is often performed
before text is processed in some way, such as generating synthesized speech,
automated language translation, storage in a database, or comparison.
Inverted Index
The inverted index structure is widely used in modern super-fast Web search engines
such as Google, Yahoo, Lucene, and other major search engines. An inverted index (also
referred to as a postings file or inverted file) is an index data structure storing a
mapping from content, such as words or numbers, to its locations in a database file,
or in a document or a set of documents. The main purpose of using an inverted index is
to allow fast full-text searches, at a cost of increased processing when a document is
added to the index; it is one of the most heavily used data structures in information
retrieval systems [Bri 98, Nag 02, Web 4].
There are two main variants of inverted indexes [Bae 99]:
(1) A record-level inverted index (or inverted file index, or just inverted file)
contains a list of references to documents for each word; we use this simple
type in our search engine.
(2) A word-level inverted index (or full inverted index, or inverted list)
additionally contains the positions of each word within a document; these
positions can be used to rank the results according to document relevancy to
the query.
The latter form offers more functionality (such as phrase searches), but needs more
time and space to be created. To simplify the understanding of these two inverted
indexes, let us consider the following example.
Example
Let us consider a case in which six documents contain the text shown in Table (1.1).
The contents of the record and word level indexes are shown in Table (1.2).
Table (1.1)
Document ID and its contents.
Document ID Text
1 Aqaba is a hot city
2 Amman is a cold city
3 Aqaba is a port
4 Amman is a modern city
5 Aqaba in the south
6 Amman in Jordan
Table (1.2)
A record and word level inverted indexes for documents in Table (1.1).
Record level inverted index Word level inverted index
Text Documents Text Documents: Location
Aqaba 1, 3, 5 Aqaba 1:1 , 3:1 , 5:1
is 1, 2, 3, 4 is 1:2 , 2:2 , 3:2 , 4:2
a 1, 2, 3, 4 a 1:3 , 2:3 , 3:3 , 4:3
hot 1 hot 1:4
city 1, 2, 4 city 1:5 , 2:5 , 4:5
Amman 2, 4, 6 Amman 2:1 , 4:1 , 6:1
cold 2 cold 2:4
the 5 the 5:3
modern 4 modern 4:4
south 5 south 5:4
in 5, 6 in 5:2 , 6:2
Jordan 6 Jordan 6:3
When we search for the word “Amman”, we get three results, namely documents 2, 4, and
6 if a record level inverted index is used, and 2:1, 4:1, 6:1 if a word level inverted
index is used. In this work, the record level inverted index is used for its
simplicity and because we do not need to rank our results.
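Both variants can be built in a few lines from the documents of Table (1.1); the sketch below reproduces the “Amman” lookup just described.

```python
documents = {
    1: "Aqaba is a hot city",
    2: "Amman is a cold city",
    3: "Aqaba is a port",
    4: "Amman is a modern city",
    5: "Aqaba in the south",
    6: "Amman in Jordan",
}

record_index = {}   # record level: word -> set of document IDs
word_index = {}     # word level: word -> list of (document ID, 1-based position)

for doc_id, text in documents.items():
    for pos, word in enumerate(text.split(), start=1):
        record_index.setdefault(word, set()).add(doc_id)
        word_index.setdefault(word, []).append((doc_id, pos))

print(sorted(record_index["Amman"]))  # [2, 4, 6]
print(word_index["Amman"])            # [(2, 1), (4, 1), (6, 1)]
```

The word-level index stores one (document, position) pair per occurrence, which is exactly the extra space the text says buys phrase search and relevancy ranking.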
1.1.3 Searching process
When the index is ready, searching can be performed through the query interface: a
user enters a query into the search engine (typically by using keywords), and the
engine examines its index and provides a listing of the best-matching Web pages
according to its criteria, usually with a short summary containing the document's
title and sometimes parts of the text [Bah 10d].
In this stage the results are ranked, where ranking is a relationship between a set of
items such that, for any two items, the first is either “ranked higher than”, “ranked
lower than”, or “ranked equal to” the second. In mathematics, this is known as a weak
order or total pre-order of objects. It is not necessarily a total order of documents,
because two different documents can have the same ranking. Ranking is done according
to document relevancy to the query, freshness, and popularity [Bri 98]. Figure (1.1)
outlines the architecture and main components of the standard search engine model.
Figure (1.1). Architecture and main components of standard search engine model.
1.2 Challenges to Web Search Engines
Building and operating a large-scale Web search engine used by hundreds of millions of
people around the world presents a number of interesting challenges [Hen 03, Hui 09,
Ois 10, Ami 05]. Designing such systems requires making complex design trade-offs in a
number of dimensions. The main challenges to designing an efficient, effective, and
reliable Web search engine are:
• The Web is growing much faster than any present-technology search engine
can possibly index.
• The cost of index storage, which includes the data storage cost, electricity,
and cooling of the data center.
• The real-time Web, which is updated continuously, requires a fast and reliable
crawler that indexes new content quickly to make it searchable.
• Many Web pages are updated frequently, which forces the search engine to
revisit them periodically.
• Query time (latency): the need to keep up with the increase in index size and
to perform the query and show the results in less time.
• Most search engines use keywords for searching, which limits the results to
text pages only.
• Dynamically generated sites, which may be slow or difficult to index, or may
result in excessive results from a single site.
• Many dynamically generated sites are not indexable by search engines; this
phenomenon is known as the invisible Web.
• Several content types, such as multimedia and Flash content, are not
crawlable and indexable by search engines.
• Some sites use tricks to manipulate the search engine into displaying them as
the first result returned for some keywords; this is known as spamming. It
can lead to some search results being polluted, with more relevant links
being pushed down in the result list.
• Duplicate hosts: Web search engines try to avoid having duplicate and near-
duplicate pages in their collection, since such pages increase the time it takes
to add useful content to the collection.
• Web graph modeling: the open problem is to come up with a random graph
model that models the behavior of the Web graph at the page and host level.
• Scalability: search engine technology should scale dramatically to keep up
with the growth of the Web.
• Reliability: a search engine requires reliable technology to support its 24-hour
operation and meet users' needs.
1.3 Data Compression Techniques
This section presents the definition, models, classification methodologies and
classes, and performance evaluation measures of data compression algorithms. Further
details on data compression can be found in [Say 00].
1.3.1 Definition of data compression
Data compression algorithms are designed to reduce the size of data so that it
requires less disk space for storage and less memory [Say 00]. Data compression is
usually achieved by substituting a shorter symbol for an original symbol in the source
data, one containing the same information but with a smaller representation in
length. The symbols may be characters, words, phrases, or any other unit that may be
stored in a dictionary of symbols and processed by a computing system.
A data compression algorithm usually utilizes an efficient algorithmic transformation
of the data representation to produce a more compact representation. Such an algorithm
is also known as an encoding algorithm. Since it is important to be able to restore
the original data, either in an exact or an approximate form, a data decompression
algorithm, also known as a decoding algorithm, is required.
1.3.2 Data compression models
There are a number of data compression algorithms that have been developed throughout
the years. These algorithms can be categorized into four major categories of data
compression models [Rab 08, Hay 08, Say 00]:
1. Substitution data compression model
2. Statistical data compression model
3. Dictionary based data compression model
4. Bit-level data compression model
In substitution compression techniques, a shorter representation is used to replace a
sequence of repeating characters. Examples of substitution data compression techniques
include null suppression [Pan 00], run-length encoding [Smi 97], bit mapping, and
half-byte packing [Pan 00].
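Run-length encoding, the simplest of these substitution techniques, can be sketched as follows: each run of repeated characters is replaced by a (character, run length) pair.

```python
def rle_encode(data):
    """Run-length encoding: replace each run of repeated characters
    with a (character, run length) pair."""
    out = []
    i = 0
    while i < len(data):
        j = i
        while j < len(data) and data[j] == data[i]:
            j += 1                 # extend the current run
        out.append((data[i], j - i))
        i = j
    return out

def rle_decode(pairs):
    """Inverse transform: expand each pair back into its run."""
    return "".join(ch * n for ch, n in pairs)

encoded = rle_encode("aaabccccd")
# encoded: [('a', 3), ('b', 1), ('c', 4), ('d', 1)]
```

The gain clearly depends on the data: long runs compress well, while runs of length one actually expand, which is why RLE is usually combined with other techniques.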
In statistical techniques, the characters in the source file are converted to binary
codes, where the most common characters in the file have the shortest binary codes and
the least common have the longest; the binary codes are generated based on the
estimated probability of each character within the file. Then, the binary coded file
is compressed using an 8-bit character wordlength, or by applying the adaptive
character wordlength (ACW) algorithm [Bah 08b, Bah 10a] or its variation, the ACW(n,s)
scheme [Bah 10a]. Examples of statistical data compression techniques include
Shannon-Fano coding [Rue 06], static/adaptive/semi-adaptive Huffman coding [Huf 52,
Knu 85, Vit 89], and arithmetic coding [How 94, Wit 87].
Dictionary-based data compression techniques involve the substitution of sub-strings
of text by indices or pointer codes, relative to a dictionary of the sub-strings, such
as Lempel-Ziv-Welch (LZW) [Ziv 78, Ziv 77, Nel 89]. Many compression algorithms use a
combination of different data compression techniques to improve compression ratios.
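The dictionary-based idea can be sketched with the compression side of LZW: the dictionary starts with all single-byte strings and grows as the input is scanned, so progressively longer sub-strings are replaced by single indices. (Decompression, omitted here, rebuilds the same dictionary from the code stream.)

```python
def lzw_compress(text):
    """LZW compression sketch: substitute sub-strings by dictionary
    indices, growing the dictionary as the input is scanned."""
    dictionary = {chr(i): i for i in range(256)}  # all single-byte strings
    next_code = 256
    current = ""
    codes = []
    for ch in text:
        candidate = current + ch
        if candidate in dictionary:
            current = candidate            # keep extending the match
        else:
            codes.append(dictionary[current])
            dictionary[candidate] = next_code  # learn the new sub-string
            next_code += 1
            current = ch
    if current:
        codes.append(dictionary[current])
    return codes

codes = lzw_compress("ababab")
# codes: [97, 98, 256, 256] -- 'a', 'b', then 'ab' twice via index 256
```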
Finally, since data files can be represented in binary digits, bit-level processing
can be performed to reduce the size of data. A data file can be represented in binary
digits by concatenating the binary sequences of the characters within the file using a
specific mapping or coding format, such as ASCII codes, Huffman codes, adaptive codes,
etc. The coding format has a huge influence on the entropy of the generated binary
sequence and consequently on the compression ratio (C) or the coding rate (Cr) that
can be achieved. The entropy is a measure of the information content of a message and
of the smallest number of bits per character needed, on average, to represent the
message. Therefore, the entropy of a complete message is the sum of the individual
characters' entropies, where the entropy of a character (symbol) is the negative
logarithm of its probability, expressed using base two.
When the probability of each symbol of the alphabet is constant, the entropy is
calculated as [Bel 89, Bel 90]:

E = − Σ (i = 1 to n) p_i log2 p_i        (1.1)

Where E is the entropy in bits,
p_i is the estimated probability of occurrence of character (symbol) i, and
n is the number of characters.
In bit-level processing, n is equal to 2 as we have only two characters (0 and 1).
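Eqn (1.1) can be evaluated directly; the following Python sketch (ours) computes the zero-order entropy of a symbol sequence:

```python
import math

def entropy(seq):
    """Zero-order entropy E = -sum p_i log2 p_i of the symbols in seq,
    in bits per symbol (Eqn 1.1)."""
    n = len(seq)
    return -sum(
        (seq.count(s) / n) * math.log2(seq.count(s) / n)
        for s in set(seq)
    )

# A fully biased binary sequence carries no information, while a
# balanced one carries one bit per symbol:
# entropy("00000000") == 0.0, entropy("01010101") == 1.0
```

For a binary sequence the result always lies between 0 and 1 bit per symbol, which is why the coding format used to generate the sequence matters so much for the achievable compression ratio.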
In bit-level data compression algorithms, the binary sequence is usually divided into
groups of bits that are called minterms, blocks, subsequences, etc. In this work we shall
use the term blocks to refer to these groups of bits. These blocks might be considered as
representing a Boolean function.
Then, algebraic simplifications are performed on these Boolean functions to reduce the
size or the number of blocks, and hence, the number of bits representing the output
(compressed) data is reduced as well. Examples of such algorithms include: the
Hamming code data compression (HCDC) algorithm [Bah 07b, Bah 08a], the adaptive
HCDC(k) scheme [Bah 07a, Bah 10b, Rab 08], the adaptive character wordlength (ACW)
algorithm [Bah 08b, Bah 10a], the ACW(n,s) scheme [Bah 10a], the Boolean functions
algebraic simplifications algorithm [Nof 07], the fixed length Hamming (FLH) algorithm
[Sha 04], and the neural network based algorithm [Mah 00].
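The HCDC algorithm itself is described in Chapter 3; as background for the Hamming-code machinery its name refers to, the following Python sketch (ours, illustrative only, not part of the cited algorithm) computes the syndrome of a 7-bit block under the standard Hamming(7,4) code:

```python
def hamming7_syndrome(block):
    """Syndrome of a 7-bit block under the standard Hamming(7,4) code.

    Bit positions run 1..7 (parity bits sit at positions 1, 2 and 4);
    the syndrome is the XOR of the positions of all set bits, and a
    zero syndrome means the block is a valid codeword.
    """
    assert len(block) == 7
    s = 0
    for pos, bit in enumerate(block, start=1):
        if bit == "1":
            s ^= pos
    return s

# "1110000" (positions 1, 2, 3 set; 1 ^ 2 ^ 3 == 0) is a valid codeword.
```

A bit-level scheme can exploit this test by dividing the input binary sequence into 7-bit blocks and treating valid and invalid blocks differently.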
1.3.3 Classification of data compression algorithms
Data compression algorithms are categorized by several characteristics, such as:
• Data compression fidelity
• Length of data compression symbols
• Data compression symbol table
• Data compression processing time
In what follows a brief definition is given for each of the above classification criteria.
Data compression fidelity
Data compression can be classified into two fundamentally different styles, depending
on the fidelity of the restored data; these are:
(1) Lossless data compression algorithms
In a lossless data compression, a transformation of the representation of the original
data set is performed such that it is possible to reproduce exactly the original data
set by performing a decompression transformation. This type of compression is
usually used in compressing text files, executable codes, word processing files,
database files, tabulation files, and whenever the original needs to be exactly
restored from the compressed data.
Many popular data compression applications have been developed utilizing lossless
compression algorithms; for example, lossless compression algorithms are used in
the popular ZIP file format and in the UNIX tool gzip. Lossless compression is mainly
used for text and executable files, since in such files the data must be exactly retrieved,
otherwise it is useless. It is also used as a component within lossy data compression
technologies. It can usually achieve a compression ratio in the range of 2:1 to 8:1.
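As an illustration (ours) of lossless behaviour, Python's zlib (the same DEFLATE method used by the ZIP format and gzip) restores the original bytes exactly:

```python
import zlib

text = ("To be, or not to be, that is the question. " * 50).encode("utf-8")
packed = zlib.compress(text, 9)          # DEFLATE, maximum effort

assert zlib.decompress(packed) == text   # restored exactly: lossless
ratio = len(text) / len(packed)          # compression ratio C = So / Sc
```

On highly repetitive text such as this, the measured ratio comfortably exceeds the typical 2:1 figure quoted above.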
(2) Lossy data compression algorithms
In a lossy data compression, a transformation of the representation of the original
data set is performed such that an exact representation of the original data set cannot
be reproduced; instead, an approximate representation is reproduced by performing
a decompression transformation.
A lossy data compression is used in applications wherein an exact representation of
the original data is not necessary, such as in streaming multimedia on the Internet,
telephony and voice applications, and some image file formats. Lossy
compression can provide higher compression ratios of 100:1 to 200:1, depending
on the type of information being compressed. In addition, a higher compression
ratio can be achieved if more errors are allowed to be introduced into the original
data [Lel 87].
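A toy illustration (ours, unrelated to any specific codec) of the lossy principle, using scalar quantization of sample values:

```python
def quantize(samples, step):
    """Lossy 'compression': keep only the index of the nearest multiple
    of `step`; the rounding error is discarded."""
    return [round(x / step) for x in samples]

def dequantize(codes, step):
    """Approximate reconstruction; the discarded error is unrecoverable."""
    return [c * step for c in codes]

original = [0.12, 0.49, 0.51, 0.88]
restored = dequantize(quantize(original, 0.25), 0.25)
# restored == [0.0, 0.5, 0.5, 1.0]: close to, but not exactly, the original
```

Coarser steps give smaller codes but larger errors, which is exactly the trade-off behind the 100:1 to 200:1 ratios quoted above.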
Length of data compression symbols
Data compression algorithms can be classified, depending on the length of the symbols
the algorithm can process, into fixed and variable length; regardless of whether the
algorithm uses variable length symbols in the original data or in the compressed data, or
both.
For example, the run-length encoding (RLE) uses fixed length symbols in both the
original and the compressed data. Huffman encoding uses variable length compressed
symbols to represent fixed-length original symbols. Other methods compress variable-
length original symbols into fixed-length or variable-length compressed data.
Data compression symbol table
Data compression algorithms can be classified as either static, adaptive, or semi-adaptive
data compression algorithms [Rue 06, Pla 06, Smi 97]. In static compression algorithms,
the encoding process is fixed regardless of the data content; while in adaptive algorithms,
the encoding process is data dependent. In semi-adaptive algorithms, the data to be
compressed are first analyzed in their entirety, an appropriate model is then built, and
afterwards the data is encoded. The model is stored as part of the compressed data, as it is
required by the decompressor to reverse the compression.
Data compression/decompression processing time
Data compression algorithms can be classified according to the compression/
decompression processing time as symmetric or asymmetric algorithms. In symmetric
algorithms, the compression and decompression processing times are almost the same,
while for asymmetric algorithms the compression time is normally much longer than the
decompression processing time [Pla 06].
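The distinction can be observed directly; the following Python sketch (ours) times zlib, which behaves asymmetrically at its highest compression level:

```python
import time
import zlib

data = bytes(range(256)) * 4000          # ~1 MB of mildly repetitive data

t0 = time.perf_counter()
packed = zlib.compress(data, 9)          # high effort: slow compression
t1 = time.perf_counter()
zlib.decompress(packed)                  # decompression effort is fixed
t2 = time.perf_counter()

compress_time, decompress_time = t1 - t0, t2 - t1
# At level 9 compression typically takes noticeably longer than
# decompression, i.e. the algorithm behaves asymmetrically.
```

This is why such codecs suit data storage applications: the expensive compression runs once, while the cheap decompression runs on every retrieval.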
1.3.4 Performance evaluation parameters
In order to compare the efficiency of the different compression techniques reliably,
without allowing extreme cases to cloud or bias a technique unfairly, certain issues need
to be considered.
The most important issues that need to be taken into account in evaluating the
performance of the various algorithms include [Say 00]:
(1) Measuring the amount of compression
(2) Compression/decompression time (algorithm complexity)
These issues need to be carefully considered in the context for which the compression
algorithm is used. Practically, things like finite memory, error control, type of data, and
compression style (adaptive/dynamic, semi-adaptive or static) are also factors that should
be considered in comparing the different data compression algorithms.
(1) Measuring the amount of compression
Several parameters are used to measure the amount of compression that can be achieved
by a particular data compression algorithm, such as:
(i) Compression ratio (C)
(ii) Reduction ratio (Rs)
(iii) Coding rate (Cr)
Definitions of these parameters are given below.
(i) Compression ratio (C)
The compression ratio (C) is defined as the ratio between the size of the data before
compression and the size of the data after compression. It is expressed as:

C = So / Sc        (1.1)

Where So is the size of the original (uncompressed) data, and
Sc is the size of the compressed data.
(ii) Reduction ratio (Rs)
The reduction ratio (Rs) represents the ratio of the difference between the size of the
original data (So) and the size of the compressed data (Sc) to the size of the original data.
It is usually given as a percentage, and it is mathematically expressed as:

Rs = ((So − Sc) / So) × 100%        (1.2)

which, in terms of the compression ratio C, is equivalent to:

Rs = (1 − 1/C) × 100%        (1.3)
(iii) Coding rate (Cr)
The coding rate (Cr) expresses the same concept as the compression ratio, but relates
the ratio to a more tangible quantity. For example, for a text file the coding rate may be
expressed in bits/character (bpc), where an uncompressed text file uses a coding rate of 7
or 8 bpc. Similarly, the coding rate of an audio stream may be expressed in bits/analogue
sample, and for still image compression the coding rate is expressed in bits/pixel.
In general, the coding rate can be expressed mathematically as:
Cr = (q · Sc) / So        (1.4)

Where q is the number of bits representing each symbol in the uncompressed file. The
relationship between the coding rate (Cr) and the compression ratio (C), for example, for
text file originally using 7 bpc, can be given by:
Cr = 7 / C        (1.5)

It can be clearly seen from Eqn. (1.5) that a lower coding rate indicates a higher
compression ratio.
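As a concrete illustration (ours), the parameters above can be computed directly from the file sizes:

```python
def compression_metrics(so, sc, q=7):
    """Compute compression ratio C, reduction ratio Rs (in percent) and
    coding rate Cr (bits per symbol) from the original size `so` and
    compressed size `sc`, with q bits per uncompressed symbol."""
    c = so / sc                  # Eqn (1.1): C = So / Sc
    rs = (so - sc) / so * 100.0  # reduction ratio as a percentage
    cr = q * sc / so             # Eqn (1.4): Cr = q * Sc / So
    return c, rs, cr

c, rs, cr = compression_metrics(so=10_000, sc=2_500, q=7)
# c = 4.0, rs = 75.0, cr = 1.75; note that cr == q / c, as in Eqn (1.5)
```

A 10,000-symbol file compressed to 2,500 symbols thus yields a 4:1 ratio, a 75% size reduction, and 1.75 bpc.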
(2) Compression/decompression time (algorithm complexity)
The compression/decompression time (which is an indication of the algorithm
complexity) is defined as the processing time required to compress or decompress the
data. The compression and decompression times have to be evaluated separately. As
discussed in Section 1.3.3, data compression algorithms are classified according
to the compression/decompression time into either symmetric or asymmetric algorithms.
In this context, data storage applications are mainly concerned with the amount of
compression that can be achieved and the decompression processing time required to
retrieve the data back (asymmetric algorithms), since in such applications the
compression is performed only once or infrequently.
Data transmission applications focus predominantly on reducing the amount of data to be
transmitted over communication channels, and both compression and decompression
processing times are the same at the respective junctions or nodes (symmetric algorithms)
[Liu 05].
For a fair comparison between the different available algorithms, it is important to
consider both the amount of compression and the processing time. Therefore, it would be
useful to be able to parameterize the algorithm such that the compression ratio and
processing time could be optimized for a particular application.
There are extreme cases where data compression works very well and other conditions
where it is inefficient; the type of data the original file contains and the upper limits on
the processing time have an appreciable effect on the efficiency of the technique
selected. Therefore, it is important to select the most appropriate technique for a
particular data profile in terms of both data compression and processing time [Rue 06].
1.4 Current Trends in Building High-Performance Web Search
Engine
There are several major trends that can be identified in the literature for building a high-
performance Web search engine. A list of these trends is given below, and further
discussion is given in Chapter 2; these trends include:
(1) Succinct data structure
(2) Compressed full-text self-index
(3) Query optimization
(4) Efficient architectural design
(5) Scalability
(6) Semantic Search Engine
(7) Using Social Network
(8) Caching
1.5 Statement of the Problem
Due to the rapid growth in the size of the Web, Web search engines are facing enormous
performance challenges, in terms of storage capacity, data retrieval rate, query processing
time, and communication overhead. Large search engines, in particular, have to be able to
process tens of thousands of queries per second on tens of billions of documents, making
query throughput a critical issue. To satisfy this heavy workload, search engines use a
variety of performance optimization techniques, including index compression; some
obvious solutions to these issues are to develop more succinct data structures, compressed
indexes, query optimization, and higher-speed processing and communication systems.
We believe that the current search engine model cannot meet users' and applications'
needs, and that higher retrieval performance and more compact and cost-effective systems
are still required. The main contribution of this thesis is to develop a novel Web search
engine model that is based on index-query compression; therefore, it is referred to as the
CIQ Web search engine model, or simply the CIQ model. The model incorporates two
bit-level compression layers, both implemented at the back-end processor side: one after
the indexer, acting as a second compression layer to generate a double compressed index,
and the other after the query parser, for query compression, to enable bit-level compressed
index-query search. As a result, less disk space is required to store the index file, disk I/O
overheads are reduced, and consequently a higher retrieval rate or performance is achieved.
1.6 Objectives of this Thesis
The main objectives of this thesis can be summarized as follows:
• Develop a new Web search engine model that is as accurate as the current Web
search engine model, requires less disk space for storing index files, performs
search process faster than current models, reduces disk I/O overheads, and
consequently provides higher retrieval rate or performance.
• Modify the HCDC algorithm to meet the requirements of the new CIQ model.
• Study and optimize the statistics of the inverted index files to achieve maximum
possible performance (compression ratio and minimum searching time).
• Validate the searching accuracy of the new CIQ Web search engine model.
• Evaluate and compare the performance of the new Web search engine model in
terms of disk space requirement and query processing time (searching time) for
different search scenarios.
1.7 Organization of this Thesis
This thesis is organized into five chapters. Chapter 1 provides an introduction to the
general domain of this thesis. The rest of this thesis is organized as follows: Chapter 2
presents a literature review and also summarizes some of the previous work related
to Web search engines, in particular, work related to enhancing the performance of
the Web search engine through data compression at different levels.
Chapter 3 describes the concept, methodology, and implementation of the novel CIQ Web
search engine model. It also includes a detailed description of the HCDC algorithm and
the modifications implemented to meet the new application's needs.
Chapter 4 presents a description of a number of scenarios simulated to evaluate the
performance of the new Web search engine model. The effect of index file size on the
performance of the new model is investigated and discussed. Finally, in Chapter 5, based
on the results obtained from the different simulations, conclusions are drawn and
recommendations for future work are pointed out.
Chapter Two
Literature Review
This work is concerned with the development of a novel high-performance Web search
engine model that is based on compressing the index files and search queries using a bit-
level data compression technique, namely, the novel Hamming codes based data
compression (HCDC) algorithm [Bah 07b, Bah 08a]. In this model the search process is
performed at a compressed index-query level. The model produces a double compressed
index file, which consequently requires less disk space to store the index files and
reduces communication time; compressing the search query, on the other hand, reduces
I/O overheads and increases the retrieval rate.
This chapter presents a literature review, which is divided into three sections. Section 2.1
presents a brief definition of the current trends towards enhancing the performance of
Web search engines. Then, in Sections 2.2 and 2.3, we present a review of some of the
most recent and related work on Web search engines and bit-level data compression
algorithms, respectively.
2.1 Trends Towards High-Performance Web Search Engine
Chapter 1 listed several major trends that can be identified in the literature for building
high-performance Web search engines. In what follows, we provide a brief definition for
each of these trends.
2.1.1 Succinct data structure
Recent years have witnessed an increasing interest in succinct data structures. Their aim
is to represent the data using as little space as possible, yet efficiently answer queries
on the represented data. Several results exist on the representation of sequences [Fer 07,
Ram 02], trees [Far 05], graphs [Mun 97], permutations and functions [Mun 03], and
texts [Far 05, Nav 04].
One of the most basic structures, which lies at the heart of the representation of more
complex ones, is the binary sequence with rank and select queries. Given a binary
sequence S = s1 s2 … sn, Rankc(S; q) denotes the number of times the bit c appears in
S[1; q] = s1 s2 … sq, and Selectc(S; q) denotes the position in S of the q-th occurrence of
bit c. The best results answer those queries in constant time, retrieve any sq in constant
time, and occupy nH0(S)+O(n) bits of storage, where H0(S) is the zero-order empirical
entropy of S. This space bound includes that for representing S itself, so the binary
sequence is being represented in compressed form yet allowing those queries to be
answered optimally [Ram 02].
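The two queries can be stated as naive O(n) Python (ours, for illustration only; the point of succinct structures is to answer them in constant time within nH0(S)+O(n) bits):

```python
def rank(S, c, q):
    """Number of occurrences of bit c in S[1..q] (1-based, as in the text)."""
    return S[:q].count(c)

def select(S, c, q):
    """1-based position in S of the q-th occurrence of bit c, or -1 if
    there are fewer than q occurrences."""
    seen = 0
    for i, bit in enumerate(S, start=1):
        if bit == c:
            seen += 1
            if seen == q:
                return i
    return -1

# For S = "0110100": rank(S, "1", 4) == 2 and select(S, "1", 3) == 5.
```

Note the duality: if select(S, c, q) == i, then rank(S, c, i) == q.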
For the general case of sequences over an arbitrary alphabet of size r, the only known
result is the one in [Gro 03], which still achieves nH0(S)+O(n) space occupancy. The data
structure in [Gro 03] is the elegant wavelet tree; it takes O(log r) time to answer Rankc(S;
q) and Selectc(S; q) queries, and to retrieve any character sq.
2.1.2 Compressed full-text self-index
A compressed full-text self-index [Nav 07] represents a text in a compressed form and
still answers queries efficiently. This represents a significant advancement over the full-
text indexing techniques of the previous decade, whose indexes required several times the
size of the text.
Although it is relatively new, this algorithmic technology has matured to a point where
theoretical research is giving way to practical developments. Nonetheless, this requires
significant programming skills, a deep engineering effort, and a strong algorithmic
background to dig into the research results. To date, only isolated implementations and
focused comparisons of compressed indexes have been reported, and they lacked a
common API, which prevented their re-use or deployment within other applications.
2.1.3 Query optimization
Query optimization is an important skill for search engine developers and database
administrators (DBAs). In order to improve the performance of search queries,
developers and DBAs need to understand the query optimizer and the techniques it uses
to select an access path and prepare a query execution plan. Query tuning involves
knowledge of techniques such as cost-based and heuristic-based optimizers, plus the
tools a search platform provides for explaining a query execution plan [Che 01].
2.1.4 Efficient architectural design
Answering a large number of queries per second over a huge collection of data requires
the equivalent of a small supercomputer, and all current major engines are based on large
clusters of servers connected by high-speed local area networks (LANs) or storage area
networks (SANs).
There are two basic ways to partition an inverted index structure over the nodes:
• A local index organization where each node builds a complete index on its own
subset of documents (used by AltaVista and Inktomi)
• A global index organization where each node contains complete inverted lists for
a subset of the words.
Each scheme has advantages and disadvantages that we do not have space to discuss
here; further discussion can be found in [Bad 02, Mel 00].
2.1.5 Scalability
Search engine technology must scale dramatically to keep up with the growth of the
Web [Bri 98]. In 1994, one of the first Web search engines, the World Wide Web
Worm (WWWW), had an index of 110,000 pages [Mcb 94]. At the end of 1997, the top
search engines claimed to index from 2 million (WebCrawler) to 100 million Web
documents [Bri 98]. In 2005, Google claimed to index 1.2 billion pages (as shown on
the Google home page), and in July 2008 Google announced that it had hit a new
milestone: 1 trillion (as in 1,000,000,000,000) unique URLs on the Web at once [Web 2].
At the same time, the number of queries search engines handle has grown rapidly too. In
March and April 1994, the WWWW received an average of about 1,500 queries per day.
In November 1997, AltaVista claimed it handled roughly 20 million queries per day. With
the increasing number of users on the Web, and automated systems which query search
engines, Google handled hundreds of millions of queries per day in 2000 and about 3
billion queries per day in 2009, while Twitter handled about 635 million queries per day
[Web 1].
Creating a Web search engine which scales even to today's Web presents many
challenges. Fast crawling technology is needed to gather the Web documents and keep
them up to date. Storage space must be used efficiently to store indexes and, optionally,
the documents themselves as cached pages. The indexing system must process hundreds
of gigabytes of data efficiently. Queries must be handled quickly, at a rate of hundreds to
thousands per second.
2.1.6 Semantic search engine
The semantic Web is an extension of the current Web in which information is given well-
defined meaning, better enabling computers and people to work together in cooperation
[Guh 03]. It is the idea of having data on the Web defined and linked in a way that it can
be used for more effective discovery, automation, integration, and reuse across various
applications.
In particular, the semantic Web will contain resources corresponding not just to media
objects (such as Web pages, images, audio clips, etc.) as the current Web does, but also
to objects such as people, places, organizations and events. Further, the semantic Web
will contain not just a single kind of relation (the hyperlink) between resources, but many
different kinds of relations between the different types of resources mentioned above
[Guh 03].
Semantic search attempts to augment and improve traditional search results (based on
information retrieval technology) by using data from the semantic Web, and to produce
precise answers to user queries. This can be done easily by taking advantage of the
availability of explicit semantics of information in the context of the semantic Web
search engine [Lei 06].
2.1.7 Using Social Networks
There is an increasing interest in social networks. In general, recent studies suggest
that a person's social network has a significant impact on his or her information
acquisition [Kir 08]. It is an ongoing trend that people increasingly reveal very personal
information on social network sites in particular, and on the Web in general.
As this information becomes more and more publicly available from these various social
network sites and the Web in general, the social relationships between people can be
identified. This in turn enables the automatic extraction of social networks. This trend is
furthermore driven and enforced by recent initiatives such as Facebook's Connect,
MySpace's Data Availability, and Google's FriendConnect, which make their social
network data available to anyone [Kir 08].
Combining social network data with search engine technology, to improve the relevancy
of results to users and to increase the sociality of the results, is one of the trends currently
pursued by search engines like Google and Bing. Microsoft and Facebook have
announced a new partnership that brings Facebook data and profile search to Bing. The
deal marks a big leap forward in social search and also represents a new advantage for
Bing [Web 3].
2.1.8 Caching
Popular Web search engines receive around a hundred million queries per day, and for
each search query, return one or more result pages to the user who submitted the query.
The user may request additional result pages for the same query, submit a new query, or
quit the search process altogether. An efficient scheme for caching query result pages
may enable search engines to lower their response time and reduce their hardware
requirements [Lem 04].
Studies have shown that a small set of popular queries accounts for a significant fraction
of the query stream. These statistical properties of the query stream seem to call for the
caching of search results [Sar 01].
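A minimal sketch (ours, not from the cited works) of such a query-result cache, using a least-recently-used eviction policy so that the popular queries mentioned above tend to stay cached:

```python
from collections import OrderedDict

class QueryResultCache:
    """Tiny LRU cache mapping query strings to result pages."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()

    def get(self, query):
        """Return the cached results, or None on a cache miss."""
        if query not in self.store:
            return None
        self.store.move_to_end(query)   # mark as most recently used
        return self.store[query]

    def put(self, query, results):
        """Cache the results, evicting the least recently used entry."""
        self.store[query] = results
        self.store.move_to_end(query)
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)
```

In a real engine the capacity, the eviction policy, and cache invalidation on index updates all require careful tuning.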
2.2 Recent Research on Web Search Engine
E. Moura et al. [Mou 97] presented a technique to build an index based on suffix arrays
for compressed texts. They developed a compression scheme for textual databases based
on words that generates a compression code that preserves the lexicographical ordering of
the text words. As a consequence, it permits the sorting of the compressed strings to
generate the suffix array without decompressing. Their results demonstrated that, as the
compressed text is under 30% of the size of the original text, they were able to build the suffix
array twice as fast on the compressed text. The compressed text plus index is 55-60% of
the size of the original text plus index and search times were reduced to approximately
half the time. They presented analytical and experimental results for different variations
of the word-oriented compression paradigm.
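The structure in question can be sketched naively in Python (ours, an O(n² log n) illustration, not the construction used by Moura et al.); their contribution is that an order-preserving compression code lets this same sorting be done directly on compressed text:

```python
def suffix_array(text):
    """Naive suffix array: start positions of all suffixes of `text`,
    listed in lexicographic order of the suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])

# Example: the suffixes of "banana" in sorted order are
# a, ana, anana, banana, na, nana -> start positions [5, 3, 1, 0, 4, 2]
```

With the array in hand, any pattern can be located by binary search over the sorted suffixes.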
S. Varadarajan and T. Chiueh [Var 97] described a text search engine called shrink and
search engine (SASE), which operates in the compressed domain. It provides an exact
search mechanism using an inverted index and an approximate search mechanism using a
vantage point tree. SASE allows a flexible trade-off between search time and storage
space required to maintain the search indexes. The experimental results showed that the
compression efficiency is within 7-17% of GZIP, which is one of the best lossless
compression utilities. The sum of the compressed file size and the inverted indexes is
only between 55-76% of the original database, while the search performance is
comparable to a fully inverted index.
S. Brin and L. Page [Bri 98] presented the Google search engine, a prototype of a large-
scale search engine which makes heavy use of the structure present in hypertext. Google
is designed to crawl and index the Web efficiently and produce much more satisfying
search results than existing systems. They provided an in-depth description of the large-
scale web search engine. Apart from the problems of scaling traditional search techniques
to data of large magnitude, there are many other technical challenges, such as the use of
the additional information present in hypertext to produce better search results. In their
work they addressed the question of how to build a practical large-scale system that can
exploit the additional information present in hypertext.
E. Moura et al. [Mou 00] presented a fast compression and decompression technique for
natural language texts. The novelties are that (i) decompression of arbitrary portions of
the text can be done very efficiently, (ii) exact search for words and phrases can be done
on the compressed text directly by using any known sequential pattern matching
algorithm, and (iii) word-based approximate and extended search can be done efficiently
without any decoding. The compression scheme uses a semi-static word-based model and
a Huffman code where the coding alphabet is byte-oriented rather than bit-oriented.
N. Fuhr and N. Govert [Fuh 02] investigated two different approaches for reducing index
space of inverted files for XML documents. First, they considered methods for
compressing index entries. Second, they developed the new XS tree data structure which
contains the structural description of a document in a rather compact form, such that
these descriptions can be kept in main memory. Experimental results on two large XML
document collections show that very high compression rates for indexes can be achieved,
but any compression increases retrieval time.
A. Nagarajarao et al. [Nag 02] implemented an inverted index as a part of a mass
collaboration system. It provides the facility to search for documents that satisfy a given
query. It also supports incremental updates whereby documents can be added without re-
indexing. The index can be queried even when updates are being done to it. Further,
querying can be done in two modes. A normal mode that can be used when an immediate
response is required and a batched mode that can provide better throughput at the cost of
increased response time for some requests. The batched mode may be useful in an alert
system where some of the queries can be scheduled. They implemented generators to
generate large data sets that they used as benchmarks. They tested their inverted index
with data sets on the order of gigabytes to ensure scalability.
R. Grossi et al. [Gro 03] presented a novel implementation of compressed suffix arrays
exhibiting new tradeoffs between search time and space occupancy for a given text (or
sequence) of n symbols over an alphabet Σ, where each symbol is encoded by log|Σ|
bits. They showed that compressed suffix arrays use just nHh + O(n log log n / log|Σ| n)
bits, while retaining full-text indexing functionalities, such as searching any pattern
sequence of length m in O(m log|Σ| + polylog(n)) time. The term Hh < log|Σ| denotes the
h-th-order empirical entropy of the text, which means that the index is nearly optimal in
space apart from lower-order terms, achieving asymptotically the empirical entropy of the
text (with a multiplicative constant 1). If the text is highly compressible, so that Hh = O(1)
and the alphabet size is small, they obtained a text index with O(m) search time that
requires only O(n) bits.
X. Long and T. Suel [Lon 03] studied pruning techniques that can significantly improve
query throughput and response times for query execution in large engines, in the case
where there is a global ranking of pages, as provided by PageRank or any other method,
in addition to the standard term-based approach. They described pruning schemes for this
case and evaluated their efficiency on an experimental cluster-based search engine with
millions of Web pages. Their results showed that there is significant potential benefit in
such techniques.
V. N. Anh and A. Moffat [Anh 04] described a scheme for compressing lists of integers
as sequences of fixed binary codewords that had the twin benefits of being both effective
and efficient. Because Web search engines index large quantities of text the static costs
associated with storing the index can be traded against dynamic costs associated with
using it during query evaluation. Typically, index representations that are effective and
obtain good compression tend not to be efficient, in that they require more operations
during query processing. The approach described by Anh and Moffat results in a
reduction in index storage costs compared to their previous word-aligned version, with no
cost in terms of query throughput.
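For comparison with such schemes, the following Python sketch (ours) shows variable-byte coding, a widely used baseline for compressing the integer lists of an inverted index; it is illustrative only and is not the word-aligned scheme of Anh and Moffat:

```python
def vbyte_encode(numbers):
    """Variable-byte encode a list of non-negative integers.

    Each value is split into 7-bit groups, emitted high-order group
    first; the high bit of a byte marks the final byte of a value."""
    out = bytearray()
    for n in numbers:
        groups = []
        while True:
            groups.append(n & 0x7F)
            n >>= 7
            if n == 0:
                break
        for g in reversed(groups[1:]):
            out.append(g)             # continuation bytes: high bit clear
        out.append(groups[0] | 0x80)  # terminating byte: high bit set
    return bytes(out)

def vbyte_decode(data):
    """Invert vbyte_encode."""
    numbers, n = [], 0
    for b in data:
        if b & 0x80:                  # final byte of this value
            numbers.append((n << 7) | (b & 0x7F))
            n = 0
        else:
            n = (n << 7) | b
    return numbers

# Inverted lists usually store gaps between sorted docIDs, which are
# small and therefore cheap to code this way:
postings = [3, 7, 15, 1000]
gaps = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
# gaps == [3, 4, 8, 985]
```

Byte-aligned codes like this decode quickly but waste bits; the bit-aligned and word-aligned codes discussed above trade decoding speed against compression effectiveness.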
Udayan Khurana and Anirudh Koul [Khu 05] presented a new compression scheme for
text. The scheme is efficient in giving high compression ratios and enables very fast
searching within the compressed text. Typical compression ratios of 70-80% and a
reduction in search time of 80-85% are the features of this paper. Until then, a trade-off
between high compression ratios and searchability within compressed text had been
observed. In this paper, they showed that the greater the compression, the faster the
search.
Stefan Buttcher and Charles L. A. Clarke [But 06] examined index compression tech-
niques for schema-independent inverted files used in text retrieval systems. Schema-inde-
pendent inverted files contain full positional information for all index terms and allow the
structural unit of retrieval to be specified dynamically at query time, rather than statically
during index construction. Schema-independent indexes have different characteristics
30
42. than document-oriented indexes, and this difference can affect the effectiveness of index
compression algorithms greatly. There experimental results show that unaligned binary
codes that take into account the special properties of schema-independent indexes
achieve better compression rates than methods designed for compressing document
indexes, and that they can reduce the size of the index by around 15% compared to byte-
aligned index compression.
P. Ferragina et al. [Fer 07] proposed two new compressed representations for general
sequences, which produce an index that improves over the one in [Gro 03] by removing
from the query times the dependence on the alphabet size and the polylogarithmic terms.
R. Gonzalez and G. Navarro [Gon 07a] introduced a new compression scheme for suffix
arrays which permits locating the occurrences extremely fast, while still being much
smaller than classical indexes. In addition, their index permits a very efficient secondary
memory implementation, where compression permits reducing the amount of I/O needed
to answer queries. Compressed text self-indexes have matured to a point where they
can replace a text by a data structure that requires less space and, in addition to giving
indexes are competitive with traditional text indexes (which are very large) for counting
the number of occurrences of a pattern in the text. Yet, they are still hundreds to
thousands of times slower when it comes to locating those occurrences in the text.
R. Gonzalez and G. Navarro [Gon 07b] introduced a disk-based compressed text index
that, when the text is compressible, takes little more than the plain text size (and replaces
it). It provides very good I/O times for searching, which in particular improve when the
text is compressible. In this aspect the index is unique, as compressed indexes have been
slower than their classical counterparts on secondary memory. They analyzed their index
and showed experimentally that it is extremely competitive on compressible texts.
A. Moffat and J. S. Culpepper [Mof 07] showed that a relatively simple combination of
techniques allows fast calculation of Boolean conjunctions within a surprisingly small
amount of data transferred. This approach exploits the observation that queries tend to
contain common words, and that representing common words via a bitvector allows
random access testing of candidates and, if necessary, fast intersection operations prior to
the list of candidates being developed. Using bitvectors for a very small number of
terms that occur frequently (in both documents and in queries), and byte-coded inverted
lists for the balance, can reduce both querying time and query-time data-transfer volumes.
The techniques described in [Mof 07] are not applicable to other powerful forms of
querying. For example, index structures that support phrase and proximity queries have a
much more complex structure, and are not amenable to storage (in their full form) using
bitvectors. Nevertheless, there may be scope for evaluation regimes that make use of
preliminary conjunctive filtering before a more detailed index is consulted, in which case
the structures described in [Mof 07] would still be relevant.
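The hybrid idea described above can be sketched in a few lines. The following illustrative Python (names and data are invented for the example, not taken from [Mof 07]) stores frequent terms as bitvectors packed into integers, intersects them with a bitwise AND, and then tests each candidate from a rare term's postings list by random access:

```python
# Illustrative sketch of the hybrid conjunction idea in [Mof 07]: frequent
# terms are stored as bitvectors over document IDs, rare terms as sorted
# ID lists.  All names and data here are invented.

def make_bitvector(postings):
    """Pack a postings list into an integer bitmask (bit i set = doc i present)."""
    bv = 0
    for doc_id in postings:
        bv |= 1 << doc_id
    return bv

def conjunction(rare_postings, frequent_bitvectors):
    """AND the frequent-term bitvectors, then test each rare candidate."""
    mask = ~0
    for bv in frequent_bitvectors:
        mask &= bv                       # fast bitwise intersection
    # random-access test of each candidate against the combined mask
    return [d for d in rare_postings if (mask >> d) & 1]

freq1 = make_bitvector([0, 2, 3, 5, 7, 8, 11, 13])
freq2 = make_bitvector([2, 3, 4, 7, 9, 11, 12])
print(conjunction([3, 7, 10, 11], [freq1, freq2]))   # -> [3, 7, 11]
```

Packing the bitvector into a single integer is a convenience of the sketch; a production index would use a fixed-width bit array over the document collection.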
Due to the rapid growth in the size of the web, web search engines are facing enormous
performance challenges. The larger engines in particular have to be able to process tens
of thousands of queries per second on tens of billions of documents, making query
throughput a critical issue. To satisfy this heavy workload, search engines use a variety of
performance optimizations including index compression, caching, and early termination.
J. Zhang et al. [Zha 08] focused on two techniques, inverted index compression and index
caching, which play a crucial role in Web search engines as well as other high-
performance information retrieval systems. They performed a comparison and evaluation
of several inverted list compression algorithms, including new variants of existing
algorithms that had not been studied before. They then evaluated different inverted list
caching policies on large query traces, and finally studied the possible performance
benefits of combining compression and caching. The overall goal of their paper was to
provide an updated discussion and evaluation of these two techniques, and to show how
to select the best set of approaches and settings depending on parameters such as disk
speed and main memory cache size.
P. Ferragina et al. [Fer 09] presented an article to fill the gap between implementations
and focused comparisons of compressed indexes. They presented the existing
implementations of compressed indexes from a practitioner's point of view; introduced
the Pizza&Chili site, which offers tuned implementations and a standardized API for the
most successful compressed full-text self-indexes, together with effective test beds and
scripts for their automatic validation and testing; and, finally, showed the results of
extensive experiments on a number of codes with the aim of demonstrating the practical
relevance of this novel algorithmic technology.
H. Yan et al. [Yan 09] studied index compression and query processing techniques for
reordered document indexes. Previous work had focused on determining the best possible
ordering of documents; in contrast, they assumed that such an ordering is already given,
and focused on how to optimize compression methods and query processing for this case.
They performed an extensive study of compression techniques for document IDs and
presented new optimizations of existing techniques which can achieve significant
improvements in both compression and decompression performance. They also proposed and
evaluated techniques for compressing frequency values for this case. Finally, they studied
the effect of this approach on query processing performance. Their experiments showed
very significant improvements in index size and query processing speed on the TREC
GOV2 collection of 25.2 million Web pages.
2.3 Recent Research on Bit-Level Data Compression Algorithms
This section presents a review of some of the most recent research on developing
efficient bit-level data compression algorithms, as the algorithm used in this thesis is a
bit-level technique.
A. Jaradat and M. Irshid [Jar 01] proposed a very simple and efficient binary run-length
compression technique. The technique is based on mapping the non-binary information
source into an equivalent binary source using a new fixed-length code instead of the
ASCII code. The codes are chosen such that the probability of one of the two binary
symbols, say zero, at the output of the mapper is made as small as possible. Moreover,
the "all ones" code is excluded from the code-assignment table to ensure the presence of
at least one "zero" in each of the output codewords.
Compression is achieved by encoding the number of "ones" between two consecutive
"zeros" using either a fixed-length code or a variable-length code. When applying this
simple encoding technique to English text files, they achieved 5.44 bpc (bits per
character) and 4.6 bpc for the fixed-length code and the variable-length (Huffman) code,
respectively.
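The run-length stage can be sketched as follows. This is an illustrative reconstruction, not the authors' code, and it assumes the mapped stream ends with a "zero" (which the excluded "all ones" codeword helps guarantee):

```python
# Sketch of the run-length stage in [Jar 01]: after mapping the source to
# binary so that "0" is rare, the stream is described by the lengths of the
# runs of "1"s between successive "0"s.  Illustrative code only.

def runs_of_ones(bits):
    """Return the lengths of the 1-runs between consecutive 0s."""
    runs, count = [], 0
    for b in bits:
        if b == '1':
            count += 1
        else:                 # a "0" terminates the current run of ones
            runs.append(count)
            count = 0
    return runs

def rebuild(runs):
    """Inverse mapping: each run length r becomes r ones followed by a zero."""
    return ''.join('1' * r + '0' for r in runs)

bits = '1110110111110'
runs = runs_of_ones(bits)
print(runs)                   # -> [3, 2, 5]
assert rebuild(runs) == bits  # the mapping is lossless
```

The actual compression gain then comes from encoding the run lengths with the fixed-length or Huffman code mentioned above, which this sketch omits.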
Caire et al [Cai 04] presented a new approach to universal noiseless compression based
on error correcting codes. The scheme was based on the concatenation of the Burrows-
Wheeler block sorting transform (BWT) with the syndrome former of a low-density
parity-check (LDPC) code. Their scheme has linear encoding and decoding times and
uses a new closed-loop iterative doping algorithm that works in conjunction with belief-
propagation decoding. Unlike the leading data compression methods, their method is
resilient against errors, and lends itself to joint source-channel encoding/decoding;
furthermore their method offers very competitive data compression performance.
A. A. Sharieh [Sha 04] introduced a fixed-length Hamming (FLH) algorithm as
enhancement to Huffman coding (HU) to compress text and multimedia files. He
investigated and tested these algorithms on different text and multimedia files. His results
indicated that the HU-FLH and FLH-HU enhanced the compression ratio.
K. Barr and K. Asanović [Bar 06] presented a study of the energy savings possible by
losslessly compressing data prior to transmission. Because wireless transmission of a
single bit can require over 1000 times more energy than a single 32-bit computation, it
can be beneficial to perform additional computation to reduce the number of bits
transmitted.
If the energy required to compress data is less than the energy required to send it, there is
a net energy saving and an increase in battery life for portable computers. This work
demonstrated that, with several typical compression algorithms, there was actually a net
energy increase when compression was applied before transmission. Reasons for this
increase were explained and suggestions were made to avoid it. One such energy-aware
suggestion was asymmetric compression, the use of one compression algorithm on the
transmit side and a different algorithm for the receive path. By choosing the lowest-
energy compressor and decompressor on the test platform, overall energy to send and
receive data can be reduced by 11% compared with a well-chosen symmetric pair, or up
to 57% over the default symmetric scheme.
The value of this research is not merely to show that one can optimize a given algorithm
to achieve a certain reduction in energy, but to show that the choice of how and whether
to compress is not obvious. It is dependent on hardware factors such as relative energy of
the central processing unit (CPU), memory, and network, as well as software factors
including compression ratio and memory access patterns. These factors can change, so
techniques for lossless compression prior to transmission/reception of data must be re-
evaluated with each new generation of hardware and software.
A. Jaradat et al. [Jar 06] proposed a file-splitting technique for the reduction of the nth-
order entropy of text files. The technique is based on mapping the original text file into a
non-ASCII binary file using a new codeword assignment method; the resulting binary
file is then split into several sub-files, each containing one or more bits from each
codeword of the mapped binary file. The statistical properties of the sub-files were
studied, and it was found that they reflect the statistical properties of the original text file,
which was not the case when the ASCII code was used as a mapper.
The nth-order entropy of these sub-files was determined, and it was found that the sum of
their entropies was less than that of the original text file for the same extension order.
These interesting statistical properties of the resulting sub-files can be used to achieve
better compression ratios when conventional compression techniques are applied to the
sub-files individually and on a bit-wise rather than a character-wise basis.
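The column-wise splitting step can be illustrated with a short sketch (the codewords below are invented; the real scheme uses the authors' codeword assignment method):

```python
# Sketch of the bit-splitting idea in [Jar 06]: each fixed-length codeword
# is split column-wise, so sub-file i collects bit i of every codeword.
# Codeword width and input are illustrative only.

def split_bit_columns(codewords):
    """codewords: equal-length bit strings; returns one sub-file per bit position."""
    width = len(codewords[0])
    return [''.join(cw[i] for cw in codewords) for i in range(width)]

codewords = ['010', '011', '110', '000']    # four 3-bit codewords
print(split_bit_columns(codewords))         # -> ['0010', '1110', '0100']
```

Each returned string is one sub-file; compressing these bit-wise streams separately is where the entropy reduction reported above is exploited.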
H. Al-Bahadili [Bah 07b, Bah 08a] developed a lossless binary data compression scheme
that is based on the error-correcting Hamming codes, referred to as the HCDC
algorithm. In this algorithm, the binary sequence to be compressed is divided into blocks
of n bits. To utilize the Hamming codes, each block is considered as a Hamming
codeword that consists of p parity bits and d data bits (n=d+p).
Each block is then tested to determine whether or not it is a valid Hamming codeword.
For a valid block, only the d data bits preceded by 1 are written to the compressed file, while
for a non-valid block all n bits preceded by 0 are written to the compressed file. These
additional 1 and 0 bits are used to distinguish the valid and the non-valid blocks during
the decompression process.
An analytical formula was derived for computing the compression ratio as a function of
block size, and fraction of valid data blocks in the sequence. The performance of the
HCDC algorithm was analyzed, and the results obtained were presented in tables and
graphs. The author concluded that the maximum compression ratio that can be achieved
by this algorithm is n/(d+1), if all blocks are valid Hamming codewords.
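As a concrete illustration, the following sketch implements the HCDC block test for n=7 (d=4, p=3). It is a simplified reconstruction, not the published implementation, and it omits the compressed-file header:

```python
# Minimal sketch of the HCDC idea [Bah 07b, Bah 08a] for n=7 (d=4, p=3):
# each 7-bit block is tested as a Hamming (7,4) codeword; a valid block is
# stored as "1" + its 4 data bits, a non-valid block as "0" + all 7 bits.

def syndrome(block):
    """Hamming (7,4) parity checks; positions 1, 2 and 4 hold parity bits."""
    b = [int(x) for x in block]          # b[0] is codeword position 1
    s1 = b[0] ^ b[2] ^ b[4] ^ b[6]
    s2 = b[1] ^ b[2] ^ b[5] ^ b[6]
    s4 = b[3] ^ b[4] ^ b[5] ^ b[6]
    return (s1, s2, s4)

def hcdc_compress(bits):
    """Compress a bit string whose length is a multiple of 7."""
    out = []
    for i in range(0, len(bits), 7):
        block = bits[i:i + 7]
        if syndrome(block) == (0, 0, 0):               # valid codeword
            out.append('1' + block[2] + block[4] + block[5] + block[6])
        else:                                          # non-valid codeword
            out.append('0' + block)
    return ''.join(out)

# '0110011' is a valid (7,4) codeword (5 bits out); '0110010' is not (8 bits out).
print(hcdc_compress('01100110110010'))   # -> '1101100110010' (13 bits vs 14)
```

A valid block thus costs d+1 = 5 bits instead of 7, which is exactly where the n/(d+1) upper bound on the compression ratio comes from.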
S. Nofal [Nof 07] proposed a bit-level file compression algorithm. In this algorithm, the
binary sequence is divided into a set of groups of bits, which are considered as minterms
representing Boolean functions. Applying algebraic simplifications to these functions
reduces, in turn, the number of minterms, and hence the number of bits in the file. To
make decompression possible, one should solve the problem of dropped Boolean
variables in the simplified functions. He investigated one possible solution, and his
evaluation shows that future work should find other solutions to render this technique
useful, as the maximum compression ratio achieved was not more than 10%.
H. Al-Bahadili and S. Hussain [Bah 08b] proposed and investigated the performance of a
bit-level data compression algorithm, in which the binary sequence is divided into blocks
each of n-bit length. This gives each block a possible decimal value between 0 and
2^n - 1. If the number of different decimal values (d) is equal to or less than 256, then the
binary sequence can be compressed using the n-bit character wordlength; thus, a
compression ratio of approximately n/8 can be achieved. They referred to this algorithm
as the adaptive character wordlength (ACW) algorithm; since the compression ratio of
the algorithm is a function of n, it is also referred to as the ACW(n) algorithm.
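A minimal sketch of the ACW(n) block test follows; the header and codebook handling of the real algorithm are omitted, and the input sequence is invented:

```python
# Sketch of the ACW(n) idea [Bah 08b]: divide the binary sequence into
# n-bit blocks; if the number of distinct block values d <= 256, each block
# can be replaced by a single byte, giving a ratio of roughly n/8.
# Illustrative only; storing the codebook is not shown.

def acw_compressible(bits, n):
    """Return (d, codebook) if the n-bit blocks fit in 8-bit codes, else None."""
    blocks = [bits[i:i + n] for i in range(0, len(bits), n)]
    distinct = sorted(set(blocks))
    d = len(distinct)
    if d > 256:                     # cannot map each block to a single byte
        return None
    codebook = {blk: i for i, blk in enumerate(distinct)}
    return d, codebook

# 64 bits made of only two distinct 16-bit blocks, so d = 2 and the
# sequence compresses at roughly n/8 = 2.
bits = '1010101010101010' * 3 + '0110011001100110'
d, codebook = acw_compressible(bits, 16)
print(d)   # -> 2
```

The `None` branch corresponds exactly to drawback (i) discussed next: when d exceeds 256, the sequence cannot be compressed with an n-bit wordlength.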
Implementation of the ACW(n) algorithm highlights a number of issues that may degrade
its performance and need to be carefully resolved, such as: (i) if d is greater than 256,
then the binary sequence cannot be compressed using n-bit character wordlength, (ii) the
probability of being able to compress a binary sequence using n-bit character wordlength
is inversely proportional to n, and (iii) finding the optimum value of n that provides
maximum compression ratio is a time consuming process, especially for large binary
sequences. In addition, for text compression, converting text to binary using the
equivalent ASCII code of the characters gives a high entropy binary sequence, thus only a
small compression ratio or sometimes no compression can be achieved.
To overcome the drawbacks of the ACW(n) algorithm, Al-Bahadili and Hussain
[Bah 10a] developed an efficient implementation scheme to enhance its performance. In
this scheme, the binary sequence is divided into a number of subsequences (s), each of
which satisfies the condition that d is less than 256; it is therefore referred to as the
ACW(n,s) scheme. The scheme achieved compression ratios of more than 2 on most text
files from the most widely used corpora.
H. Al-Bahadili and A. Rababa'a [Bah 07a, Rab 08, Bah 10b] developed a new scheme
consisting of six steps, some of which are applied repetitively to enhance the compression
ratio of the HCDC algorithm [Bah 07b, Bah 08a]; the new scheme is therefore referred
to as the HCDC(k) scheme, where k refers to the number of repetition loops. The
repetition loops continue until inflation is detected. The overall (accumulated)
compression ratio is the product of the compression ratios of the individual loops.
The results obtained for the HCDC(k) scheme demonstrated that the scheme has a higher
compression ratio than most well-known text compression algorithms, and also exhibits
competitive performance with respect to many widely used state-of-the-art software
tools. The HCDC algorithm and the HCDC(k) scheme will be discussed in detail in the
next chapter.
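The repetition logic of HCDC(k) can be sketched generically. Here `compress_pass` is a stand-in for one HCDC pass, and the toy pass in the example is invented purely to show how the accumulated ratio is formed:

```python
# Sketch of the HCDC(k) repetition idea [Bah 10b]: apply a compression
# pass repeatedly until a pass inflates the data; the overall ratio is the
# product of the per-pass ratios.  `compress_pass` stands in for one HCDC
# pass; any length-reducing function demonstrates the control flow.

def hcdc_k(data, compress_pass):
    """Return (final data, accumulated ratio C = C1*C2*...*Ck, loop count k)."""
    overall_ratio, k = 1.0, 0
    while True:
        out = compress_pass(data)
        if len(out) >= len(data):               # inflation detected: stop
            return data, overall_ratio, k
        overall_ratio *= len(data) / len(out)   # accumulate this loop's ratio
        data, k = out, k + 1

# Toy pass that halves the data while it is longer than 4 symbols.
final, ratio, k = hcdc_k('a' * 16, lambda s: s[:len(s) // 2] if len(s) > 4 else s)
print(k, ratio)   # -> 2 4.0
```

Two halving passes give per-loop ratios of 2 each, so the accumulated ratio is 2 * 2 = 4, matching the multiplicative rule stated above.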
S. Ogg and B. Al-Hashimi [Ogg 06] proposed a simple yet effective real-time
compression technique that reduces the number of bits sent over serial links. The
proposed technique reduces the number of bits and the number of transitions when compared to the
original uncompressed data. Results of compression on two MPEG-1 coded picture data
sets showed average bit reductions of approximately 17% to 47% and average transition
reductions of approximately 15% to 24% over a serial link. The technique can be
employed with network-on-chip (NoC) technology to alleviate the bandwidth bottleneck.
Fixed and dynamic block sizing was considered and general guidelines for determining a
suitable fixed block length and an algorithm for dynamic block sizing were shown. The
technique exploits the fact that unused significant bits do not need to be transmitted. Also,
the authors outlined a possible implementation of the proposed compression technique,
and the area overhead costs and potential power and bandwidth savings within a NoC
environment were presented.
J. Zhang and X. Ni [Zha 10] presented a new implementation of bit-level arithmetic
coding using integer additions and shifts. The algorithm has lower computational
complexity and more flexibility, and thus is very suitable for hardware design. They
showed that their implementation has the lowest complexity and the highest speed among
Zhao's algorithm [Zha 98], the Rissanen-Mohiuddin (RM) algorithm [Ris 89], the
Langdon-Rissanen (LR) algorithm [Lan 82], and the basic arithmetic coding algorithm,
and sometimes a higher compression rate than the basic arithmetic coding algorithm. It
therefore provides an excellent compromise between good performance and low
complexity.
Chapter Three
The Novel CIQ Web Search Engine Model
This chapter presents a description of the proposed Web search engine model. The model
incorporates two bit-level data compression layers, both installed at the back-end
processor, one for index compression (index compressor), and one for query compression
(query or keyword compressor), so that the search process can be performed at the
compressed index-query level and avoid any decompression activities during the
searching process. Therefore, it is referred to as the compressed index-query (CIQ)
model. In order to be able to perform the search process at the compressed index-query
level, it is important to have a data compression technique that is capable of producing
the same pattern for the same character from both the query and the index.
The algorithm that meets the above main requirements is the novel Hamming code data
compression (HCDC) algorithm [Bah 07b, Bah 08a]. The HCDC algorithm creates a
compressed file header (compression header) to store some parameters that are relevant
to compression process, which mainly include the character-binary coding pattern. This
header should be stored separately to be accessed by the query compressor and the index
decompressor. Introducing the new compression layers should reduce disk space for
storing index files; increase query throughput and consequently retrieval rate. On the
other hand, compressing the search query reduces I/O overheads and query processing
time as well as the system response time.
The rest of this chapter is organized as follows: a detailed description of the new CIQ
Web search engine model is given in Section 3.1. Section 3.2 presents the
implementation of the new model and its main procedures. The data compression
algorithm, namely the HCDC algorithm, is described in Section 3.3, together with a
derivation and analysis of its compression ratio. The performance measures that are used
to evaluate and compare the performance of the new model are introduced in Section 3.4.
3.1 The CIQ Web Search Engine Model
In this section, a description of the proposed Web search engine model is presented. The
new model incorporates two bit-level data compression layers, both installed at the back-
end processor, one for index compression (index compressor) and one for query
compression (query compressor or keyword compressor), so that the search process can
be performed at the compressed index-query level and avoid any decompression
activities, therefore, we refer to it as the compressed index-query (CIQ) Web search
engine model or simply the CIQ model.
In order to be able to perform the search process at the CIQ level, it is important to have a
data compression technique that is capable of producing the same pattern for the same
character from both the index and the query. The HCDC algorithm [Bah 07b, Bah 08a]
which will be described in the next section, satisfies this important feature, and it will be
used at the compression layers in the new model. Figure (3.1) outlines the main
components of the new CIQ model and where the compression layers are located.
It is believed that introducing the new compression layers reduces the disk space needed
for storing index files, and increases query throughput and consequently the retrieval rate. On the other
hand, compressing the search query reduces I/O overheads and query processing time as
well as the system response time.
The CIQ model works as follows. At the back-end processor, after the indexer generates
the index and before sending it to the index storage device, the index is kept in temporary
memory, where lossless bit-level compression is applied using the HCDC algorithm; the
compressed index file is then sent to the storage device. It therefore requires less disk
space, enabling more documents to be indexed and accessed in comparatively less CPU time.
The HCDC algorithm creates a compressed-file header (compression header) to store
some parameters that are relevant to compression process, which mainly include the
character-to-binary coding pattern. This header should be stored separately to be accessed
by the query compression layer (query compressor).
On the other hand, the query parser, instead of passing the query directly to the index
file, passes it to the query compressor before accessing the index file. In order to produce
the same binary pattern for the same compressed characters from the index and the query,
the character-to-binary codes used in converting the index file are passed to the query
compressor. If a match is found, the retrieved data is decompressed, using the index
decompressor, and passed through the ranker and the search engine interface to the end-
user.
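The matching principle can be illustrated with a toy sketch: both the index terms and the query are encoded with the same character-to-binary table (the table stored in the compression header), so lookup happens entirely on compressed patterns. The table and terms below are invented for the example:

```python
# Toy sketch of the CIQ principle: index terms and the query are encoded
# with the SAME character-to-binary table (from the compression header),
# so matching is done on the compressed forms with no decompression
# during the search.  The table and terms here are invented.

header_codes = {'a': '01', 'b': '10', 'c': '110', 'd': '111'}  # toy table

def encode(term, codes):
    """Map a term to its compressed bit pattern using the shared table."""
    return ''.join(codes[ch] for ch in term)

# Build a tiny "compressed index": compressed pattern -> original term.
index = {encode(t, header_codes): t for t in ['ab', 'cab', 'bad']}

def search(query):
    key = encode(query, header_codes)   # compress the query, not the index
    return index.get(key)               # exact match at the compressed level

print(search('cab'))   # -> 'cab'
```

The sketch deliberately keeps the original term as the dictionary value to stand in for the decompression step that follows a successful match in the real model.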
Figure (3.1). Architecture and main components of the CIQ Web search engine model.
3.2 Implementation of the CIQ Model: The CIQ-based Test Tool (CIQTT)
This section describes the implementation of a CIQ-based test tool (CIQTT), which is
developed to:
(1) Validate the accuracy and integrity of the retrieved data, ensuring that the same
data sets can be retrieved using the new CIQ model.
(2) Evaluate the performance of the CIQ model, estimating the reduction in the index
file storage requirement and in processing (search) time.
The CIQTT consists of six main procedures; these are:
(1) COLCOR: Collecting the testing corpus (documents).
(2) PROCOR: Processing and analyzing the testing corpus (documents).
(3) INVINX: Building the inverted index and start indexing.
(4) COMINX: Compressing the inverted index.
(5) SRHINX: Searching the index file (inverted or inverted/compressed index).
(6) COMRES: Comparing the outcomes of different search processes performed by
SRHINX procedure.
In what follows, we shall provide a brief description for each of the above procedures.
3.2.1 COLCOR: Collects the testing corpus (documents)
In this procedure, the Nutch crawler [Web 6] is used to collect the targeted corpus
(documents). Nutch is an open-source search technology initially developed by Doug
Cutting, an advocate and creator of open-source search technology. He originated Lucene
and, with Mike Cafarella, Nutch, both open-source search projects now managed through
the Apache Software Foundation (ASF). Nutch
builds on Lucene [Web 7] and Solr [Web 8].