Description
Database normalization is the process of organizing the attributes and tables of a relational database to minimize data redundancy.
Normalization involves re-factoring a table into smaller (and less redundant) tables but without losing information; defining foreign keys in the old table referencing the primary keys of the new ones. The objective is to isolate data so that additions, deletions, and modifications of an attribute can be made in just one table and then propagated through the rest of the database using the defined foreign keys.
Normalization can be described as:
- It is a technique for designing relational database tables to minimize duplication of information.
- Normalization is a practice to safeguard the database against logical and structural anomalies.
- Normalization is also termed as Canonical synthesis by the experts.
- It is used to keep data consistent and check that no loss of data as well as integrity is there.
- Its complexity may lead to higher degree of join operations which sometimes lead to the degraded throughput times.
Normalization Objective
- To free the collection of relations from undesirable insertion, update and deletion dependencies;
- To reduce the need for restructuring the collection of relations, as new types of data are introduced, and thus increase the life span of application programs;
- To make the relational model more informative to users;
- To make the collection of relations neutral to the query statistics, where these statistics are liable to change as time goes by.
Database Modification Anomalies
When an attempt is made to modify (update, insert into, or delete from) a table, undesired side-effects may follow. Not all tables can suffer from these side-effects; rather, the side-effects can only arise in tables that have not been sufficiently normalized. An insufficiently normalized table might have one or more of the following characteristics:
Update Anomaly:
The same information can be expressed on multiple rows; therefore updates to the table may result in logical inconsistencies. For example, each record in an "Employees' Skills" table might contain an Employee ID, Employee Address, and Skill; thus a change of address for a particular employee will potentially need to be applied to multiple records (one for each skill). If the update is not carried through successfully—if, that is, the employee's address is updated on some records but not others—then the table is left in an inconsistent state. Specifically, the table provides conflicting answers to the question of what this particular employee's address is. This phenomenon is known as an update anomaly.
An update anomaly. Employee 519 is shown as having different addresses on different records.
Insertion Anomaly:
There are circumstances in which certain facts cannot be recorded at all. For example, each record in a "Faculty and Their Courses" table might contain a Faculty ID, Faculty Name, Faculty Hire Date, and Course Code—thus we can record the details of any faculty member who teaches at least one course, but we cannot record the details of a newly hired faculty member who has not yet been assigned to teach any courses except by setting the Course Code to null. This phenomenon is known as an insertion anomaly.
An insertion anomaly. Until the new faculty member, Dr. Newsome, is assigned to teach at least one course, his details cannot be recorded.
Deletion Anomaly:
Under certain circumstances, deletion of data representing certain facts necessitates deletion of data representing completely different facts. The "Faculty and Their Courses" table described in the previous example suffers from this type of anomaly, for if a faculty member temporarily ceases to be assigned to any courses, we must delete the last of the records on which that faculty member appears, effectively also deleting the faculty member, unless we set the Course Code to null in the record itself. This phenomenon is known as a deletion anomaly.
A deletion anomaly. All information about Dr. Giddens is lost if he temporarily ceases to be assigned to any courses.
First Normal Form (1NF)
As per First Normal Form, no two Rows of data must contain repeating group of
information i.e each set of column must have a unique value, such that multiple
columns cannot be used to fetch the same row. Each table should be organized
into rows, and each row should have a primary key that distinguishes it as
unique.
The Primary key is usually a single column, but sometimes more than
one column can be combined to create a single primary key. For example consider
a table which is not in First normal form
Student Table :
Student | Age | Subject |
---|---|---|
Adam | 15 | Biology, Maths |
Alex | 14 | Maths |
Stuart | 17 | Maths |
In First Normal Form, any row must not have a column in which more than one
value is saved, like separated with commas. Rather than that, we must separate
such data into multiple rows.
Student Table following 1NF will be:
Student | Age | Subject |
---|---|---|
Adam | 15 | Biology |
Adam | 15 | Maths |
Alex | 14 | Maths |
Stuart | 17 | Maths |
Using the First Normal Form, data redundancy increases, as there will be many
columns with same data in multiple rows but each row as a whole will be unique.
Second Normal Form (2NF)
As per the Second Normal Form there must not be any partial dependency of any
column on primary key. It means that for a table that has concatenated primary
key, each column in the table that is not part of the primary key must depend
upon the entire concatenated key for its existence. If any column depends only
on one part of the concatenated key, then the table fails Second normal
form.
In example of First Normal Form there are two rows for Adam, to include
multiple subjects that he has opted for. While this is searchable, and follows
First normal form, it is an inefficient use of space. Also in the above Table in
First Normal Form, while the candidate key is {Student, Subject},
Age of Student only depends on Student column, which is incorrect as per
Second Normal Form. To achieve second normal form, it would be helpful to split
out the subjects into an independent table, and match them up using the student
names as foreign keys
New Student Table following 2NF will be :
Student | Age |
---|---|
Adam | 15 |
Alex | 14 |
Stuart | 17 |
In Student Table the candidate key will be Student column, because all
other column i.e Age is dependent on it.
New Subject Table introduced for 2NF will be :
Student | Subject |
---|---|
Adam | Biology |
Adam | Maths |
Alex | Maths |
Stuart | Maths |
In Subject Table the candidate key will be {Student, Subject}
column. Now, both the above tables qualifies for Second Normal Form and will
never suffer from Update Anomalies. Although there are a few complex cases in
which table in Second Normal Form suffers Update Anomalies, and to handle those
scenarios Third Normal Form is there.
Third Normal Form (3NF)
Third Normal form applies that every non-prime attribute of table must
be dependent on primary key. The transitive functional dependency should
be removed from the table. The table must be in Second Normal form. For
example, consider a table with following fields.
Student_Detail Table:
Student_id | Student_name | DOB | Street | city | State | Zip |
---|
In this table Student_id is Primary key, but street, city and state depends
upon Zip. The dependency between zip and other fields is called transitive
dependency. Hence to apply 3NF, we need to move the street, city and
state to new table, with Zip as primary key.
New Student_Detail Table :
Student_id | Student_name | DOB | Zip |
---|
Address Table:
Zip | Street | city | state |
---|
The advantage of removing transtive dependency is,
Boyce and Codd Normal Form (BCNF)
- Amount of data duplication is reduced.
- Data integrity achieved.
Boyce and Codd Normal Form (BCNF)
Boyce and Codd Normal Form is a higher version of the Third Normal
form. This form deals with certain type of anamoly that is not handled by 3NF. A
3NF table which does not have multiple overlapping candidate keys is said to be
in BCNF.
Normalizing an Example Table
These steps demonstrate the process of normalizing a fictitious student table.- Unnormalized table:
Student# Advisor Adv-Room Class1 Class2 Class3 1022 Jones 412 101-07 143-01 159-02 4123 Smith 216 201-01 211-02 214-01 - First Normal Form: No Repeating GroupsTables should have only two dimensions. Since one student has several classes, these classes should be listed in a separate table. Fields Class1, Class2, and Class3 in the above records are indications of design trouble.Spreadsheets often use the third dimension, but tables should not. Another way to look at this problem is with a one-to-many relationship, do not put the one side and the many side in the same table. Instead, create another table in first normal form by eliminating the repeating group (Class#), as shown below:
Student# Advisor Adv-Room Class# 1022 Jones 412 101-07 1022 Jones 412 143-01 1022 Jones 412 159-02 4123 Smith 216 201-01 4123 Smith 216 211-02 4123 Smith 216 214-01 - Second Normal Form: Eliminate Redundant DataNote the multiple Class# values for each Student# value in the above table. Class# is not functionally dependent on Student# (primary key), so this relationship is not in second normal form.The following two tables demonstrate second normal form:Students:
Student# Advisor Adv-Room 1022 Jones 412 4123 Smith 216
Registration:Student# Class# 1022 101-07 1022 143-01 1022 159-02 4123 201-01 4123 211-02 4123 214-01 - Third Normal Form: Eliminate Data Not Dependent On Key
In the last example, Adv-Room (the advisor's office number) is functionally dependent on the Advisor attribute. The solution is to move that attribute from the Students table to the Faculty table, as shown below:
Students:Student# Advisor 1022 Jones 4123 Smith
Faculty:Name Room Dept Jones 412 42 Smith 216 42
0 comments:
Post a Comment