Normal Forms

There is an easy mnemonic that describes our usual goal of "third normal form". Here it is:

Data should depend on the key, the whole key, and nothing but the key! ... Rule #3.

What does this mean?

First Normal Form

If a relation contains no repeating groups, then it is in first normal form. Conversely, if we find that for a single primary key, we have two data items that we need to 'squeeze' into one column, then the data cannot be in first normal form. Let's say that we've decided to slap together a database containing information about patients and the (medicinal) drugs they are taking. So we create a table with the following columns, including the primary key "Patient ID Number":

Patient ID Number Patient Name Drug Trade Name Formulation Size Dose Frequency Side Effect

Let's try filling in some information..

Patient ID Number Patient Name Drug Trade Name Formulation Size Dose Frequency Side Effect
6009315011076 Soap,Joe Carbimazole
6003325011074 Green,Anne Carbimazole Neomercazole Tab 10mg 10mg OD nil

Woops. At the intersection of some rows and columns, we've had to fill in two different data items. It's also obvious that there is considerable redundancy in the table - the drug Carbimazole (with all the associated information) is represented twice.

Fixing to First Normal Form

It should be clear that we can fix things by simply introducing a "composite primary key" - we can group the drug and the patient ID together as a composite primary key, so now each row has a unique primary key, and at no intersection of row and column do we have to put in two data items:

Patient ID Number Drug Patient Name Trade Name Formulation Size Dose Frequency Side Effect
6009315011076 Carbimazole Soap, Joe Neomercazole Tab 10mg 30mgOD Agranulocytosis
6009315011076 Carbamazepine Soap, Joe Tegretol Caps 200mg 200mgTDS nil
6003325011074 Carbimazole Green,Anne Neomercazole Tab 10mg 10mgOD nil

Equally clearly, this is a clumsy fix, and really hasn't changed much. We still have immense data redundancy. We need to go further to..

Second Normal Form

Note that, although the above table is now in 'first normal form', nothing much has changed. Look at the table carefully, and you'll see that there are what we call partial dependencies. In other words, the columns Trade Name, Formulation, and Size depend only on the Drug component of the key, and not on the other part of the key, Patient ID Number. Likewise, the Patient Name depends only on the Patient ID Number, and is not dependent at all on the Drug column! To get rid of this problem, we create two new tables, a Patient Table:

Patient ID Number Patient Name
6009315011076 Soap, Joe
6003325011074 Green,Anne

and a Drug Table:

Drug Trade Name Formulation Size
Carbimazole Neomercazole Tab 10mg
Carbamazepine Tegretol Caps 200mg

Finally, we relate the two in a diminished form of our original table, thus:

Patient ID Number Drug Dose Frequency Side Effect
6009315011076 Carbimazole 30mgOD Agranulocytosis
6009315011076 Carbamazepine 200mgTDS nil
6003325011074 Carbimazole 10mgOD nil

Third Normal Form

.. but still we're not finished. Look at the last table carefully. There's an intimate relationship between the the dose and the frequency of giving a drug. Here's where knowledge of your subject comes in. Many people would be happy to stop at this point. There is no absolutely fixed rule that says you have to give your carbamazepine at a dose of 200mg TDS - in fact, if you're religiously doing so, you're probably mistreating a lot of your epileptics. But let's say the whole point of our drug survey is to relate different drug regimens and associated side effect! In this context, we have a strong association between the dose and frequency of a drug. We call such a relationship between two non-key attributes a transitive dependency.

Third normal form is about removing transitive dependencies. How could we do this? Well, we could label each dose + frequency as a "regimen", and proceed to break up the PatientRx table into two, thus:

Patient ID Number Drug Regimen Side Effect
6009315011076 Carbimazole R1 Agranulocytosis
6009315011076 Carbamazepine R3 nil
6003325011074 Carbimazole R2 nil

with a new DrugRegimen table:

Regimen Dose Frequency
R1 30mgOD
R2 10mgOD
R3 200mgTDS

Common Sense Intervenes?

A moment's consideration will suggest that we could have done things several different ways!
  1. Firstly, what was to prevent us from lumping the drug, dose and frequency together in the DrugRegimen table? In many ways, this makes more sense. One's never going to give carbamazepine in a dose of 10mg daily, or (heaven forbid!) carbimazole 200mg TDS, so it's silly to have these sort of redundant options enshrined within our database! You can see the importance of knowing your subject, and not just blindly putting tables together. If we move drug name into the DrugRegimen table..

    Patient ID Number Regimen Side Effect
    6009315011076 R1 Agranulocytosis
    6009315011076 R3 nil
    6003325011074 R2 nil

    with a new DrugRegimen table:

    Regimen Drug Dose Frequency
    R1 Carbimazole 30mgOD
    R2 Carbimazole 10mgOD
    R3 Carbamazepine 200mgTDS

    then we can immediately see that:

    Regimen Drug Dosing
    R1 Carbimazole D1
    R2 Carbimazole D2
    R3 Carbamazepine D3
    Dosing Dose Frequency
    D1 30mgOD
    D2 10mgOD
    D3 200mgTDS

    Personally, I wouldn't bother to do this last step, unless I was dealing with a whole lot of drugs that had similar dosing schedules. You can argue it both ways! This shows how, although there are a lot of hard rules in database design, a lot of what you do depends on an in-depth appreciation of the subjects you are dealing with, and what you want to get out of the database!

  2. Secondly, one could argue that in the PatientRx table, "Side Effect depends only on Regimen, so the table violates second normal form". This is very much a personal opinion, but my contention would be that such an assertion is completely false, as there is an intimate relationship between the patient (and his/her unique metabolism), and the likelihood of a side effect occurring. Tricky, isn't it?

  3. Thirdly, one should note that there is likely to be more than one side effect, in certain cases. So, just when we thought that we were home and dry, we find that our PatientRx table isn't even in first normal form, all because we left an 's' off the title of this column, and didn't think hard enough!

A Summary of Normalisation

We can see the usefulness of our mnemonic Rule #3 - A datum should depend on a key, individual data items should depend on the whole key (not just part of the key, which would be a partial dependency), and nothing but the key (to avoid transitive dependencies).

What are the merits of such extensive normalisation (besides giving us a lot of little tables??). The merit of first normal form should be intuitively obvious, but what about second and third? Think about our example of first normal form. We could not insert a new drug into our database (say aspirin) until at least one of our patients was on aspirin. This insertion anomaly, and a host of other consequences besides the obvious redundancy, make second normal form a good idea. Likewise (after we've accepted that defining a drug regimen is a good idea, and decided that a regimen constitutes a particular dose and frequency), we cannot add a new drug regimen until we've normalised our tables to third normal form. Note that a lot depends on how we are actually using the data!

There are other 'higher' forms of normalisation, for even third normal form doesn't guarantee that all 'anomalies' have been removed. Generally, we don't often go beyond 3NF (third normal form), but here are a few 'higher' normal forms, just to whet your appetite!

Next - Starting SQL Home

Date of First Draft: 2001/11/11 Date of Last Update: 2001/-/- Web page author: