DBMS stores data on hard disks
• This means that data needs to be
– read from the hard disk into memory (RAM)
– written from memory onto the hard disk
• Because disk I/O operations are slow, query
performance depends on how data is stored
on hard disks
• The lowest component of the DBMS performs
storage management activities
• Other DBMS components need not know how
these low level activities are performed
Basics of Data Storage on Hard Disks
• A disk is organized into a number of
blocks or pages
• A page is the unit of exchange between
the disk and the main memory
• A collection of pages is known as a file
• DBMS stores data in one or more files
on the hard disk
• File organization – the physical arrangement of data in a file into records and
pages on the disk
• File organization determines the set of access methods for
– Storing and retrieving records from a file
• We study three types of file organization
– Unordered or Heap files
– Ordered or sequential files
– Hash files
• We examine each of them in terms of the operations we
perform on the database
– Insert a new record
– Search for a record (or update a record)
– Delete a record
Organization of Records in Files
• Heap – a record can be placed anywhere in the file where there is space
• Sequential – store records in sequential order, based on the
value of the search key of each record.
• Hashing – a hash function is computed on some attribute of each record;
the result specifies in which block of the file the record is placed.
The term hash suggests splitting the key into pieces to compute the function.
Records of each relation may be stored in a separate file.
Unordered Or Heap File
• Records are stored in the same order in which they are inserted
• Insert operation
– Fast – because the incoming record is written at the end of
the last page of the file
• Search (or update) operation
– Slow – because a linear search is performed over the pages
• Delete Operation
– Slow – because the record to be deleted must first be searched for
– Deleting the record creates a hole in the page
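The heap-file behaviour above can be sketched in a few lines of Python (an illustrative in-memory model, not a real DBMS storage manager; the page capacity and record layout are assumptions):

```python
PAGE_CAPACITY = 4  # records per page (assumed small for illustration)

class HeapFile:
    def __init__(self):
        self.pages = [[]]  # list of pages, each a list of records

    def insert(self, record):
        # Fast: append to the last page, allocating a new page if full.
        if len(self.pages[-1]) == PAGE_CAPACITY:
            self.pages.append([])
        self.pages[-1].append(record)

    def search(self, key, field):
        # Slow: linear scan over every page.
        for page in self.pages:
            for record in page:
                if record[field] == key:
                    return record
        return None

    def delete(self, key, field):
        # Slow: linear search first; removing the record leaves a hole
        # in its page (here modelled by simply dropping the record).
        for page in self.pages:
            for i, record in enumerate(page):
                if record[field] == key:
                    del page[i]
                    return True
        return False
```

Note how insertion never touches any page but the last one, while search and delete must visit pages in order until a match is found.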
Ordered or Sequential File
• Records are sorted on the values of one or more fields
– Ordering field – the field on which the records are sorted
• Search (or update) Operation
– Fast – because binary search is performed on sorted records
• Delete Operation
– Fast – because searching the record is fast
• Insert Operation
– Poor – because if we insert the new record in its correct position,
– we need to shift, on average, half of the subsequent records in the file
– Alternatively, an ‘overflow file’ is created which contains all
the new records as a heap
– Periodically the overflow file is merged with the main file
Sequential Access vs Random Access
• Sequential access means that a group of elements is accessed in a predetermined, ordered sequence
• Random-access files may be split into pieces and stored wherever free space is available
• A sequential file may therefore load faster, while random-access files may take more time
Hash File
• A hash file is an array of buckets
– Given a record with key k, a hash function h(k) computes the index
of the bucket in which the record belongs
– h uses one or more fields of the record, called hash fields
– Hash key – the key of the file when it is used by the hash function
– Example: h(K) = K mod M, where M is the number of buckets
• Example hash function
– Assume that the staff last name is used as the hash field
– Assume also that the hash file size is 26 buckets, each
bucket corresponding to one of the letters of the alphabet
– Then a hash function can be defined which computes the
bucket address (index) based on the first letter of the last name
A bucket is a unit of storage containing one or more records
(a bucket is typically a disk block).
The hash function is used to locate records for access, insertion,
and deletion.
Hashing is an effective technique to calculate the direct location
of a data record on the disk without using an index structure.
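The two hash functions mentioned above can be sketched as follows (the bucket counts are assumptions for illustration; real systems use more uniform functions than the first-letter example):

```python
M = 13  # number of buckets for the numeric example (assumed)

def h_numeric(k):
    # The modulo hash function h(K) = K mod M from the text.
    return k % M

def h_last_name(last_name):
    # 26-bucket example: bucket 0 for 'A'/'a', ..., 25 for 'Z'/'z',
    # computed from the first letter of the staff last name.
    return ord(last_name[0].lower()) - ord('a')
```

For example, `h_numeric(27)` places key 27 in bucket 1, and `h_last_name("Smith")` maps to bucket 18 (the letter S).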
• Insert Operation
– Fast – because the hash function computes the
index of the bucket to which the record belongs
• If that bucket is full, the record goes into the next free bucket
• Search Operation
– Fast – because the hash function computes the
index of the bucket
• Delete Operation
– Fast – once again because the hash function can locate
the record quickly
– Open addressing: proceeding from the occupied position specified by the hash
address, the program checks the subsequent positions in order until an unused
(empty) position is found
– Chaining: various overflow locations are kept, usually by extending the array
with a number of overflow positions; a pointer field is added to each record location
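Open addressing with linear probing, as described above, can be sketched like this (the table size and keys are assumptions):

```python
TABLE_SIZE = 7  # number of bucket positions (assumed)

def h(k):
    return k % TABLE_SIZE

def insert(table, key):
    # table is a list of length TABLE_SIZE; None marks an empty position.
    for step in range(TABLE_SIZE):
        i = (h(key) + step) % TABLE_SIZE
        if table[i] is None:
            table[i] = key       # first free position after the home slot
            return i
    raise RuntimeError("hash table full")

def search(table, key):
    for step in range(TABLE_SIZE):
        i = (h(key) + step) % TABLE_SIZE
        if table[i] is None:
            return None          # reached an empty position: key absent
        if table[i] == key:
            return i
    return None
```

Keys 7 and 14 both hash to position 0, so the second one is probed forward into position 1.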
– Hashing for disk files is called external hashing
– The goal of a good hash function is to distribute the records
uniformly over the address space so as to minimize collisions
• Problem with static hashing: it does not
expand or shrink dynamically as the size of
the database grows or shrinks
Dynamic hashing provides a
mechanism in which data buckets are
added and removed dynamically and on demand
Overflow chaining: when a bucket is
full, a new bucket is allocated for the
same hash result and is linked after the
full one.
This mechanism is called closed hashing.
Linear probing: when the hash function
generates an address at which data is
already stored, the next free bucket is
allocated to it.
This mechanism is called open hashing.
Hash file organization of account file, using branch_name as key
For a string search key, the binary representations of all the characters in the
string could be added, and the sum modulo the number of buckets could be
used as the hash value.
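That string hash can be sketched directly (the bucket count is an assumption):

```python
NUM_BUCKETS = 10  # assumed bucket count

def string_hash(key, num_buckets=NUM_BUCKETS):
    # Add the character codes of the search key and take the sum
    # modulo the number of buckets, as described above.
    return sum(ord(c) for c in key) % num_buckets
```

With 10 buckets, the branch name "Perryridge" hashes to bucket 3 (the character codes sum to 1053).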
Use of Extendable Hash Structure: Example
Initial Hash structure, bucket size = 2
• Index file (same idea as a textbook index): an auxiliary structure designed to
speed up access to desired data.
• Indexing field: field on which the index file is defined.
• The index file stores each value of the indexing field along with pointer(s)
(e.g., page numbers) to the block(s) that contain record(s) with that field value,
or a pointer to the record itself: <Indexing Field, Pointer>
• To find a record in the data file based on a selection criterion on an
indexing field, we first access the index file, which then allows access
to the record in the data file.
• The index file is much smaller than the data file, so searching it is fast.
• Indexing is important for both file systems and DBMSs.
Choosing Indexing Technique
• Five factors are involved when choosing an
indexing technique:
• access time
• insertion time
• deletion time
• space overhead
Two Types of Indices
• Ordered index (primary or clustering
index) – used to access data sorted by the
order of the search-key values.
• Hash index (secondary or non-clustering
index) – used to access data that is distributed
uniformly across a range of buckets.
Types of Indexes
• Indexes on ordered vs. unordered files
• Dense vs. non-dense (i.e. sparse) indexes
- Dense: An entry in the index file for each record of the data file.
- Sparse: only some of the data records are represented in the index, often
one index entry per block of the data file.
• Primary indexes vs. secondary indexes
• Ordered indexes vs. hash indexes
- Ordered Indexes: indexing fields stored in sorted order.
- Hash indexes: indexing fields stored using a hash function.
• Single-level vs. multi-level
– single-level index is an ordered file and is searched using binary search.
– multi-level ones are tree-structured that improve the search and require a
more elaborate search algorithm.
• Index on a single indexing field vs. index on multiple indexing fields
(i.e., a composite index).
• Primary index – built on the ordering key field of a file
• Clustering index – built on an ordering non-key field of a file
• Secondary index – built on any non-ordering field of a file
Single-Level Ordered Index : Primary Index
A primary index file is an index that is constructed using the
sorting attribute of the main file.
• Physical records may be kept ordered on the primary key.
• The index is ordered, with only one entry for each block of the data file.
• Each index entry has the value of the primary key field for
the first record (or the last record) in a block and a pointer to
that block.
• To search, first perform a binary search on the primary index file to find the
address of the corresponding data block.
Performance: Very fast!
Problem: the primary index works only if the main file is kept sorted.
New records are inserted into an unordered (heap) overflow file for the
table. Periodically, the ordered and overflow files are merged; at this time,
the main file is sorted again, and the primary index file is updated accordingly.
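The primary-index search described above can be sketched as follows, assuming one index entry per block holding the block's first key (records are reduced to bare keys for illustration):

```python
import bisect

def build_primary_index(blocks):
    # One <first key, block number> entry per block (a sparse index).
    return [(block[0], i) for i, block in enumerate(blocks)]

def search(blocks, index, key):
    keys = [entry[0] for entry in index]
    # Binary search the index for the last block whose first key <= key.
    i = bisect.bisect_right(keys, key) - 1
    if i < 0:
        return None          # key smaller than every block's first key
    # Then scan only that one block of the data file.
    block = blocks[index[i][1]]
    return key if key in block else None
```

Only the small index is binary-searched; exactly one data block is then read, which is why the primary index is fast.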
Dense and Sparse Indices
There are two types of ordered indices:
• Dense index – an index record appears for every search-key value in the file.
Each index record contains the search-key value and a pointer to the actual record.
• Sparse index – index records are created only for some of the search-key values.
To locate a record, we start at the record pointed to by the closest preceding
index entry and proceed along the pointers in the file (that is, sequentially)
until we find the desired record.
Figures 1 and 2 show dense and sparse indices for the deposit file.
Figure 1: Dense index.
• Notice how we would find records for the Perryridge branch using both methods.
Figure 2: Sparse index.
• A dense index requires more space overhead and more
maintenance on insertions and deletions
• Data can be accessed in a shorter time using a dense index
• It is preferable to use a dense index when the file is accessed
through a secondary index, or when the index file is
small compared to the size of the memory
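The dense/sparse trade-off can be illustrated with a small sketch (the keys and block size are assumptions; records are reduced to bare keys):

```python
BLOCK = 3  # records per block (assumed)

records = [5, 10, 15, 20, 25, 30, 35]   # data file, sorted on the key

# Dense: one index entry per record. Sparse: one entry per block.
dense_index = {k: pos for pos, k in enumerate(records)}
sparse_index = [(records[i], i) for i in range(0, len(records), BLOCK)]

def sparse_lookup(key):
    # Find the last sparse entry whose key <= key, then scan forward
    # sequentially until the key is found or passed.
    start = 0
    for k, pos in sparse_index:
        if k <= key:
            start = pos
        else:
            break
    for pos in range(start, len(records)):
        if records[pos] == key:
            return pos
        if records[pos] > key:
            return None
    return None
```

Here the dense index holds 7 entries against the sparse index's 3: the dense index answers a lookup in one step, while the sparse index trades a short sequential scan for the smaller size.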
Single-Level Ordered Index: Clustering Index
• Records are physically ordered by a non-key field
• Same general structure as ordered file index
– <Clustering field, Block pointer>
• One entry in the index for each distinct value of the clustering field with
a pointer to the first block in the data file that has a record with that value
for its clustering field.
– Possibly many records for one index entry (non-dense)
• Sometimes entire blocks reserved for each distinct clustering field value
Single-Level Ordered Index: Secondary Index
• A secondary index must contain pointers to all the records.
• A pointer does not point directly to the file but to a
bucket that contains pointers to the file.
• Secondary indices must be dense, with an index entry for
every search-key value, and a pointer to every record in
the file. Secondary indices improve the performance of
queries on non-primary keys.
Choosing Multi-Level Index
• In some cases an index may be too large to be searched efficiently.
• In that case use multi-level indexing.
• In multi-level indexing, the primary index is treated as a
sequence file and sparse index is created on it.
• The outer index is a sparse index of the primary index whereas
the inner index is the primary index.
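The two-level scheme can be sketched as follows (the fan-out and keys are assumptions; a real multi-level index would be built over disk blocks):

```python
import bisect

FANOUT = 3  # first-level index entries per second-level entry (assumed)

def build_levels(block_first_keys):
    # Level 1 (inner, primary index): one (key, block_no) entry per data block.
    level1 = [(k, i) for i, k in enumerate(block_first_keys)]
    # Level 2 (outer): sparse index over level 1, one entry per FANOUT entries.
    level2 = [(level1[i][0], i) for i in range(0, len(level1), FANOUT)]
    return level1, level2

def find_block(level1, level2, key):
    # Search the small outer index first, then only a slice of level 1.
    l2_keys = [k for k, _ in level2]
    j = max(bisect.bisect_right(l2_keys, key) - 1, 0)
    start = level2[j][1]
    segment = level1[start:start + FANOUT]
    seg_keys = [k for k, _ in segment]
    i = bisect.bisect_right(seg_keys, key) - 1
    return segment[i][1] if i >= 0 else None
```

Each level narrows the search to a fixed-size portion of the level below, which is the idea a B-tree generalizes.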
• The B-tree is the most commonly used data
structure for indexing.
• It is fully dynamic, that is, it can grow
and shrink as records are inserted and deleted.
Three Types of B-Tree Nodes
• Root node – contains pointers to branch nodes.
• Branch node – contains pointers to leaf
nodes or other branch nodes.
• Leaf node – contains index items and
horizontal pointers to other leaf nodes.
Dynamic Multilevel Indexes
– Retain the benefits of using multilevel indexing while reducing the problems
of index insertion and deletion
– Dynamic multilevel indexes are implemented as B-trees, and often as B+-trees
• B-tree: an indexing field value appears only once at some level in the tree;
a pointer to the data is stored at each node
• B+-tree: pointers to data are stored only at the leaf nodes of the tree
– Leaf nodes have an entry for every indexing field value
– The leaf nodes are usually linked together to provide ordered access on the
indexing field to the records
• All the leaf nodes of the tree are at the same depth, so retrieval of any record
takes the same time
In a B-tree, search keys and data are stored in internal or leaf nodes,
but in a B+-tree data are stored only in leaf nodes.
Searching in a B+-tree is simple because all data are found in leaf nodes,
whereas in a B-tree data may be found in a leaf or a non-leaf node.
Because a B-tree may hold data in non-leaf nodes, deletion from a non-leaf node
is very complicated; in a B+-tree data are always in leaf nodes, so deletion
is easier.
Insertion in a B-tree is also more complicated than in a B+-tree.
A B+-tree stores redundant search keys, while a B-tree has no redundant values.
In a B+-tree the leaf nodes are ordered in a sequential linked list, but in a
B-tree the leaf nodes cannot be chained this way. Many database system
implementers prefer the structural simplicity of a B+-tree.
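A hand-built two-level B+-tree illustrates why every search ends in a leaf and why range scans can follow the leaf chain (the keys and node layout are assumptions, not a full dynamic implementation):

```python
class Leaf:
    def __init__(self, keys):
        self.keys = keys
        self.next = None  # link to the right sibling leaf

# Leaves hold all keys, in order, and are chained together.
leaf1, leaf2, leaf3 = Leaf([5, 10]), Leaf([20, 25]), Leaf([30, 40])
leaf1.next, leaf2.next = leaf2, leaf3

# The root holds separator keys only (no data), as in a B+-tree:
# children[i] covers keys < seps[i]; the last child covers the rest.
root = {"seps": [20, 30], "children": [leaf1, leaf2, leaf3]}

def search(key):
    # Descend from the root to exactly one leaf.
    i = 0
    while i < len(root["seps"]) and key >= root["seps"][i]:
        i += 1
    return key in root["children"][i].keys

def range_scan(lo, hi):
    # Find the starting leaf, then follow the leaf chain rightwards.
    i = 0
    while i < len(root["seps"]) and lo >= root["seps"][i]:
        i += 1
    leaf, out = root["children"][i], []
    while leaf is not None:
        for k in leaf.keys:
            if lo <= k <= hi:
                out.append(k)
            elif k > hi:
                return out
        leaf = leaf.next
    return out
```

The separator key 20 appears both in the root and in a leaf: this is the redundant search key a B+-tree accepts in exchange for keeping all data (and the sequential linked list) at the leaf level.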