Normalization
1. Normalization 1
Introduction
In this exercise we look at how to optimise a
data structure. The example system we use as a
model is a database that keeps track of the
employees of an organisation working on
different projects.
Objectives
By the end of the exercise you should be able to:
Show understanding of why we normalize
data
Give formal definitions of 1NF, 2NF & 3NF
Apply the process of normalization to your
own work
2. Normalization 2
The data we would want to store could be
expressed as:
Project No  Project Name            Employee No  Employee Name    Rate Category  Rate
1203        Madagascar travel site  11           Jessica Brookes  A              £90
                                    12           Andy Evans       B              £80
                                    16           Max Fat          C              £70
1506        Online estate agency    11           Jessica Brookes  A              £90
                                    17           Alex Branton     B              £80
3. Normalization 3
Three problems become apparent with our
current model:
Tables in an RDBMS use a simple grid structure
Each project has a set of employees so we can’t
even use this format to enter data into a table.
How would you construct a query to find the
employees working on each project?
All tables in an RDBMS need a key
Each record in an RDBMS must have a unique
identity. Which field should be the primary key?
Data entry should be kept to a minimum
Our main problem is that each project contains
repeating groups, which lead to redundancy and
inconsistency.
4. Normalization 4
We could place the data into a table called:
tblProjects_Employees
Project No  Project Name            Employee No  Employee Name    Rate Category  Rate
1203        Madagascar travel site  11           Jessica Brookes  A              £90
1203        Madagascar travel site  12           Andy Evans       B              £80
1203        Madagascat travel site  16           Max Fat          C              £70
1506        Online estate agency    11           Jessica Brookes  A              £90
1506        Online estate agency    17           Alex Branton     B              £70
5. Normalization 5
Addressing our three problems:
Tables in an RDBMS use a simple grid structure
We can find the members of each project using a
simple SQL or QBE search on either Project
Number or Project Name.
All tables in an RDBMS need a key
We CAN uniquely identify each record. Although
no single field can serve as the primary key, we
can combine two or more fields to create a
composite key.
Data entry should be kept to a minimum
Our main problem, that each project contains
repeating groups, still remains. To create an
RDBMS we have to eliminate these groups or
sets.
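The first two points can be tried out with a minimal sketch using Python's built-in sqlite3 module. The column names (project_no, employee_no and so on) and the helper members_of are assumptions made for this illustration, and rates are stored as whole pounds.

```python
import sqlite3

# Rows of the flat table as keyed in on the previous slide (column names are assumed).
ROWS = [
    (1203, "Madagascar travel site", 11, "Jessica Brookes", "A", 90),
    (1203, "Madagascar travel site", 12, "Andy Evans", "B", 80),
    (1203, "Madagascat travel site", 16, "Max Fat", "C", 70),
    (1506, "Online estate agency", 11, "Jessica Brookes", "A", 90),
    (1506, "Online estate agency", 17, "Alex Branton", "B", 70),
]

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tblProjects_Employees (
        project_no    INTEGER,
        project_name  TEXT,
        employee_no   INTEGER,
        employee_name TEXT,
        rate_category TEXT,
        rate          INTEGER,
        PRIMARY KEY (project_no, employee_no)  -- composite key
    )""")
conn.executemany(
    "INSERT INTO tblProjects_Employees VALUES (?, ?, ?, ?, ?, ?)", ROWS)


def members_of(project_no):
    """Return the names of the employees on one project, ordered by employee number."""
    cur = conn.execute(
        "SELECT employee_name FROM tblProjects_Employees "
        "WHERE project_no = ? ORDER BY employee_no",
        (project_no,))
    return [name for (name,) in cur]


print(members_of(1203))
```

Searching on Project Number is the safer choice here: a search on Project Name would miss the misspelled third record.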
6. Normalization 6
Did you notice that Madagascar was misspelled
in the 3rd record? Imagine trying to spot this error
in thousands of records. By using this structure
(flat filing) we create:
Redundant data
Duplicate copies of data: we would have to key
in Madagascar travel site 3 times. Not only do we
waste storage space, we also risk creating:
Inconsistent data
The more often we have to key in data, the more
likely we are to make mistakes (see IT01 notes
on the importance of accurate data).
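Spotting this kind of error by eye does not scale, but it can be automated. The sketch below is one possible check, not part of the original exercise: it groups the keyed-in project names by project number and reports any project with more than one spelling. The function name inconsistent_names is made up for this example.

```python
from collections import defaultdict

# (project number, project name) pairs as keyed into the flat table,
# including the misspelled third record.
records = [
    (1203, "Madagascar travel site"),
    (1203, "Madagascar travel site"),
    (1203, "Madagascat travel site"),
    (1506, "Online estate agency"),
    (1506, "Online estate agency"),
]


def inconsistent_names(pairs):
    """Return {project_no: sorted names} for projects stored under more than one name."""
    seen = defaultdict(set)
    for project_no, name in pairs:
        seen[project_no].add(name)
    return {no: sorted(names) for no, names in seen.items() if len(names) > 1}


print(inconsistent_names(records))
```

A check like this only finds the damage after the event. The rest of the exercise is about a structure in which the name is keyed in once, so the error cannot be repeated.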
7. Normalization 7
The solution is simply to take out the duplication.
We do this by:
Identifying a key
In this case we can use the project no and
employee no to uniquely identify each row
Project No  Employee No  Unique Identifier
1203        11           120311
1203        12           120312
1203        16           120316
Note: Project 1506 is not shown for reasons of space
8. Normalization 8
We look for partial dependencies
We look for fields that depend on only part of the
key and not the entire key.
Field          Depends on Project No  Depends on Employee No
Project Name   Yes                    No
Employee Name  No                     Yes
Rate Category  No                     Yes
Rate           No                     Yes
We remove partial dependencies
The fields listed are only dependent on part of
the key so we remove them from the table.
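The search for partial dependencies can be sketched in code. The helper names determines and partial_dependencies, and the column names, are assumptions for this illustration. Note that sample data can only show that a dependency does not hold; whether it really holds is a rule of the business, so treat the output as a list of candidates. The sample uses the flat table with the spelling of Madagascar corrected.

```python
COLUMNS = ("project_no", "project_name", "employee_no",
           "employee_name", "rate_category", "rate")

# Flat table rows, with the misspelled project name corrected.
DATA = [
    (1203, "Madagascar travel site", 11, "Jessica Brookes", "A", 90),
    (1203, "Madagascar travel site", 12, "Andy Evans", "B", 80),
    (1203, "Madagascar travel site", 16, "Max Fat", "C", 70),
    (1506, "Online estate agency", 11, "Jessica Brookes", "A", 90),
    (1506, "Online estate agency", 17, "Alex Branton", "B", 70),
]
rows = [dict(zip(COLUMNS, values)) for values in DATA]


def determines(rows, lhs, rhs):
    """True if, in these rows, every combination of lhs values maps to a single rhs value."""
    seen = {}
    for row in rows:
        key = tuple(row[column] for column in lhs)
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def partial_dependencies(rows, key, non_key):
    """List (field, key part) pairs where one part of the composite key alone determines the field."""
    return [(field, part)
            for field in non_key
            for part in key
            if determines(rows, [part], field)]


for field, part in partial_dependencies(
        rows,
        key=["project_no", "employee_no"],
        non_key=["project_name", "employee_name", "rate_category", "rate"]):
    print(f"{field} depends only on {part}")
```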
9. Normalization 9
We create new tables
Clearly we can’t take the data out and leave it out
of our database. We put it into a new table
consisting of the field that has the partial
dependency and the field it is dependent on.
Looking at our example we will need to create
two new tables:
Dependent On  Partially Dependent
Project No    Project Name

Dependent On  Partially Dependent
Employee No   Employee Name
              Rate Category
              Rate
10. Normalization 10
We now have 3 tables:

tblProjects_Employees
Project No  Employee No
1203        11
1203        12
1203        16
1506        11
1506        17

tblProjects
Project No  Project Name
1203        Madagascar travel site
1506        Online estate agency

tblEmployees
Employee No  Employee Name    Rate Category  Rate
11           Jessica Brookes  A              £90
12           Andy Evans       B              £80
16           Max Fat          C              £70
17           Alex Branton     A              £80
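The split into three tables can be sketched as a projection: keep a subset of the columns and drop the duplicate rows. The function names project and rejoin are made up for this illustration, and the flat rows use the values shown in the three tables above. Joining the tables back together on their keys should give exactly the rows we started with, which shows that no information was lost.

```python
# Flat rows: (project no, project name, employee no, employee name, rate category, rate)
flat = [
    (1203, "Madagascar travel site", 11, "Jessica Brookes", "A", 90),
    (1203, "Madagascar travel site", 12, "Andy Evans", "B", 80),
    (1203, "Madagascar travel site", 16, "Max Fat", "C", 70),
    (1506, "Online estate agency", 11, "Jessica Brookes", "A", 90),
    (1506, "Online estate agency", 17, "Alex Branton", "A", 80),
]


def project(rows, columns):
    """Keep only the given column positions and drop duplicate rows, preserving order."""
    return list(dict.fromkeys(tuple(row[i] for i in columns) for row in rows))


tbl_projects = project(flat, (0, 1))            # key: project no
tbl_employees = project(flat, (2, 3, 4, 5))     # key: employee no
tbl_projects_employees = project(flat, (0, 2))  # composite key


def rejoin(links, projects, employees):
    """Rebuild the flat rows by following the keys in the link table."""
    names = {no: name for no, name in projects}
    staff = {no: rest for no, *rest in employees}
    return [(p, names[p], e, *staff[e]) for p, e in links]


print(len(tbl_projects), len(tbl_employees), len(tbl_projects_employees))
print(rejoin(tbl_projects_employees, tbl_projects, tbl_employees) == flat)
```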
11. Normalization 11
Looking at the project table, note the reduction in:
Redundant data
The text “Madagascar travel site” is stored once
only, not for each occurrence of an employee
working on the project.
Inconsistent data
Because we only store the project name once, we
are less likely to enter “Madagascat”.
The link is made through the key, Project No.
There is no way to remove this duplication
without losing the relation altogether, but storing
a short number repeatedly is far more efficient
than storing a large chunk of text.
12. Normalization 12
Our model has improved but is still far from
perfect. There is still room for inconsistency.
Employee No  Employee Name    Rate Category  Rate
11           Jessica Brookes  A              £90
12           Andy Evans       B              £80
16           Max Fat          C              £70
17           Alex Branton     A              £80

Alex Branton is being paid £80 while Jessica
Brookes gets £90, but they’re in the same rate
category!
Again, we have stored redundant data: the
relationship between rate category and hourly
rate is stored in full for every employee, i.e. we
have to key in both the rate category AND the
hourly rate.
13. Normalization 13
The solution, as before, is to remove this excess
data to another table. We do this by:
Looking for Transitive Relationships
Relationships where a non-key attribute is
dependent on another non-key attribute. Here,
hourly rate should depend on rate category, BUT
rate category is not a key.
Removing Transitive Relationships
As before we remove the redundant data and
place it in a separate table. In this case we create
a new table tblRates and add the fields rate
category and hourly rate. We then delete hourly
rate from the employees table.
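Both steps can be sketched in a few lines. The names conflicting_rates, tbl_rates and hourly_rate are assumptions made for this illustration. The first function finds rate categories that have been keyed in with more than one hourly rate. The split then stores each rate once, using the rates given for categories A, B and C in the finished tables (A is £90).

```python
# Employees table before the split: (employee no, name, rate category, hourly rate)
employees = [
    (11, "Jessica Brookes", "A", 90),
    (12, "Andy Evans", "B", 80),
    (16, "Max Fat", "C", 70),
    (17, "Alex Branton", "A", 80),
]


def conflicting_rates(rows):
    """Return {rate category: sorted rates} for categories keyed in with more than one rate."""
    rates = {}
    for _no, _name, category, rate in rows:
        rates.setdefault(category, set()).add(rate)
    return {c: sorted(r) for c, r in rates.items() if len(r) > 1}


print(conflicting_rates(employees))

# After the split: the rate lives in its own table, keyed on rate category,
# and the employees table keeps only the category.
tbl_rates = {"A": 90, "B": 80, "C": 70}
tbl_employees = [(no, name, category) for no, name, category, _rate in employees]


def hourly_rate(employee_no):
    """Look up an employee's rate through their rate category."""
    for no, _name, category in tbl_employees:
        if no == employee_no:
            return tbl_rates[category]
    raise KeyError(employee_no)


print(hourly_rate(11), hourly_rate(17))
```

Once the rate is looked up through the category, two employees in the same category can no longer be given different rates.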
14. Normalization 14
We now have 4 tables:

tblProjects_Employees
Project No  Employee No
1203        11
1203        12
1203        16
1506        11
1506        17

tblProjects
Project No  Project Name
1203        Madagascar travel site
1506        Online estate agency

tblEmployees
Employee No  Employee Name    Rate Category
11           Jessica Brookes  A
12           Andy Evans       B
16           Max Fat          C
17           Alex Branton     A

tblRates
Rate Category  Rate
A              £90
B              £80
C              £70
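As a sketch of how these four tables might be declared, the example below uses Python's built-in sqlite3 module. The column names, the NOT NULL constraints and the helper project_report are assumptions for this illustration, not part of the original design. The foreign keys make the links between the tables explicit, and a join brings the data back together when it is needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript("""
    CREATE TABLE tblRates (
        rate_category TEXT PRIMARY KEY,
        rate          INTEGER NOT NULL);
    CREATE TABLE tblEmployees (
        employee_no   INTEGER PRIMARY KEY,
        employee_name TEXT NOT NULL,
        rate_category TEXT NOT NULL REFERENCES tblRates(rate_category));
    CREATE TABLE tblProjects (
        project_no    INTEGER PRIMARY KEY,
        project_name  TEXT NOT NULL);
    CREATE TABLE tblProjects_Employees (
        project_no    INTEGER NOT NULL REFERENCES tblProjects(project_no),
        employee_no   INTEGER NOT NULL REFERENCES tblEmployees(employee_no),
        PRIMARY KEY (project_no, employee_no));
""")
conn.executemany("INSERT INTO tblRates VALUES (?, ?)",
                 [("A", 90), ("B", 80), ("C", 70)])
conn.executemany("INSERT INTO tblEmployees VALUES (?, ?, ?)",
                 [(11, "Jessica Brookes", "A"), (12, "Andy Evans", "B"),
                  (16, "Max Fat", "C"), (17, "Alex Branton", "A")])
conn.executemany("INSERT INTO tblProjects VALUES (?, ?)",
                 [(1203, "Madagascar travel site"), (1506, "Online estate agency")])
conn.executemany("INSERT INTO tblProjects_Employees VALUES (?, ?)",
                 [(1203, 11), (1203, 12), (1203, 16), (1506, 11), (1506, 17)])


def project_report(project_no):
    """Return (employee name, rate category, hourly rate) for everyone on a project."""
    return conn.execute("""
        SELECT e.employee_name, e.rate_category, r.rate
        FROM tblProjects_Employees AS pe
        JOIN tblEmployees AS e ON e.employee_no = pe.employee_no
        JOIN tblRates AS r ON r.rate_category = e.rate_category
        WHERE pe.project_no = ?
        ORDER BY e.employee_no""", (project_no,)).fetchall()


print(project_report(1506))
```

With the foreign keys switched on, an attempt to give an employee a rate category that is not in tblRates is rejected by the database rather than stored.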
15. Normalization 15
Again, we have cut down on redundancy and it is
now impossible to assume Rate category A is
associated with anything but £90.
Our model is now in its most efficient format
with:
Minimum REDUNDANCY
Minimum INCONSISTENCY
16. Normalization 16
What we have formally done is NORMALIZE the
database:
At the beginning we had a data structure:
Project No
Project Name
Employee No (1n)
Employee name (1n)
Rate Category (1n)
Hourly Rate (1n)
(1n indicates there are many occurrences of the
field – it is a repeating group).
To begin the normalization process we start by
moving from zero normal form to 1st normal form.
17. Normalization 17
The definition of 1st normal form
There are no repeating groups
All the key attributes are defined
All attributes are dependent on the primary key
So far, we have no keys, and there are repeating
groups. So we remove the repeating groups and
define the keys and are left with:
Employee Project table
Project number – part of key
Project name
Employee number – part of key
Employee name
Rate category
Hourly rate
This table is in first normal form (1NF)
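The move from the unnormalized structure to 1NF can be sketched as follows. The nested layout and the function names to_first_normal_form and key_is_unique are assumptions for this illustration. Each project starts with a repeating group of employees, and flattening produces one row per project and employee, identified by the composite key of project number and employee number.

```python
# Unnormalized form: each project carries a repeating group of employees,
# each one (employee no, employee name, rate category, hourly rate).
projects = [
    {"project_no": 1203, "project_name": "Madagascar travel site",
     "employees": [(11, "Jessica Brookes", "A", 90),
                   (12, "Andy Evans", "B", 80),
                   (16, "Max Fat", "C", 70)]},
    {"project_no": 1506, "project_name": "Online estate agency",
     "employees": [(11, "Jessica Brookes", "A", 90),
                   (17, "Alex Branton", "B", 80)]},
]


def to_first_normal_form(projects):
    """Remove the repeating group: emit one flat row per (project, employee) pair."""
    rows = []
    for p in projects:
        for employee_no, name, category, rate in p["employees"]:
            rows.append((p["project_no"], p["project_name"],
                         employee_no, name, category, rate))
    return rows


def key_is_unique(rows):
    """Check that (project no, employee no) identifies every row."""
    keys = [(row[0], row[2]) for row in rows]
    return len(keys) == len(set(keys))


rows = to_first_normal_form(projects)
print(len(rows), key_is_unique(rows))
```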
18. Normalization 18
A table is in 2nd normal form if
It’s already in first normal form
It includes no partial dependencies (where an
attribute is dependent on only part of the key)
We look through the fields:
Project name is dependent only on project
number
Employee name, rate category and hourly rate
are dependent only on employee number.
So we remove them, and place these fields in a
separate table, with the key being that part of the
original key they are dependent on. We are left
with the following three tables:
19. Normalization 19
Employee Project table
Project number – part of key
Employee number – part of key
Employee table
Employee number - primary key
Employee name
Rate category
Hourly rate
Project table
Project number - primary key
Project name
The tables are now in 2nd normal form (2NF). Are
they in 3rd normal form?
20. Normalization 20
A table is in 3rd normal form if
It’s already in second normal form
It includes no transitive dependencies (where a
non-key attribute is dependent on another non-
key attribute)
We can narrow our search down to the Employee
table, which is the only one with more than one
non-key attribute. Employee name is not
dependent on either Rate category or Hourly
rate, and Rate category is not dependent on
either of the other two. Hourly rate, however, is
dependent on Rate category. So, as before, we
remove it, placing it in its own table, with the
attribute it was dependent on as key, as follows:
21. Normalization 21
Employee project table
Project number – part of key
Employee number – part of key
Employee table
Employee number - primary key
Employee name
Rate Category
Rate table
Rate category - primary key
Hourly rate
Project table
Project number - primary key
Project name
These tables are all now in 3rd normal form, and
ready to be implemented.
22. Normalization 22
There are other normal forms - Boyce-Codd
normal form, and 4th normal form, but these are
very rarely used for business applications. In
most cases, tables in 3rd normal form are already
in these normal forms anyway.
Before you start normalizing everything, a word
of warning. No process is better than common
sense. Take a look at this example.
Customer table
Customer Number - primary key
Name
Address
Postcode
Town
23. Normalization 23
What normal form is this table in? Giving it a
quick glance, we see:
no repeating groups, and a primary key defined,
so it's at least in 1st normal form.
The key is a single field, so there can be no
partial dependencies, and it's at least in 2nd
normal form.
How about transitive dependencies? Well, it
looks like Town might be determined by
Postcode. And in most parts of the world that's
usually the case.
So we should remove Town, and place it in a
separate table, with Postcode as the key?
24. Normalization 24
No! Although this table is not technically in 3rd
normal form, removing this information is not
worth it. Creating more tables increases the load
slightly, slowing processing down. This is often
counteracted by the reduction in table sizes, and
redundant data. But in this case, where the town
would almost always be referenced as part of the
address, it isn't worth it. Perhaps a company that
uses the data to produce regular mailing lists of
thousands of customers should normalize fully.
It always comes down to how the data is going to
be used. Normalization is just a helpful process
that usually results in the most efficient table
structure, and not a rule for database design.
25. Normalization 25
Further Reading:
Paper
Heathcote, pages 110-114
De Watteville et al, pages 299-300
Mott et al, pages 106-123
Web
http://phoenix.ucr.edu/mis/mgt230/Lecture5/sld001.html
http://www.wamoz.com/rood/normalis.htm
(read “A concise dictionary of normal forms”)
http://www.problemsolving.com/codecorn/norm.htm
http://www.acm.org/classics/nov95/s1p4.html