How do you recognise someone's face when you see them? We're still not sure, but you seem to do two things:

  1. You extract key information from the person's facial features when you look at them;
  2. You integrate this information, and compare it with stored information about "faces you've seen before", eventually associating the information with someone you know.

The above is an extremely simplistic summary of a wonderfully complex process that we cannot reliably duplicate with current technology. But we can think about this example and extract certain useful principles. Firstly, we need appropriate information, and secondly we need to compare this information with previously stored information about faces. This is the essence of what we should be doing with a database - not only do we need to store information, but the information must be integrated so that when we come to interrogate the database, we can get something valuable from it. Although in many respects our most subtle and advance database is no match for the human brain, our goal should be to design useful databases based on the principles of:

  1. Appropriate representation and integration of information (data)
  2. Ready access to that information in a 'user-friendly' format!

Integration of information

Data is always an approximation of what is happening in 'real life'. Errors creep in, but even without these errors, we can never be 100% certain that our symbolic representation of circumstances are 'true' - even if we're sure that we have 101 widgets in our store-room, or 53 patients in ward 567, we cannot be certain that since we made our last check, one of the widgets disappeared into the pocket of an employee, or one of the patients quietly slipped away! Widget number seventeen may be defective - does our coding system support this? And so on..

Our goal should be to provide not only an accurate representation of real life, but also a useful representation. Often there are trade-offs. However, there are certain good rules that have been worked out over the last forty years or so, and one ignores these rules at one's peril. Possibly the most important rule is:

Avoid redundancy and repetition (heh)!

It should be intuitively obvious that if the same information is stored at several points in a database (a more-or-less integrated collection of useful information) then we have a problem. The problem is this:

What happens when we change the information?

It is clear that in this disorganised scenario, every identical piece of information (stored all over the place) needs to be ferreted out and updated simultaneously! This is clearly excessively silly (and laborious) so the mantra of designing databases is clearly

The consequences of not enforcing Rule #1 are clearly:

  • Inconsistencies creep in (the same data may receive two different labels - synonyms; or even worse, the same label may be applied to two quite different data items - a homonym);
  • Valuable information becomes lost or hidden;
  • Updating (maintaining) information becomes progressively more difficult;
  • Standardisation of data capture and retrieval becomes impossible;
  • If one alters the structure of the database, there is the ever present danger that one might lose information, further duplicate information, or incorrectly associate certain items!

Disadvantages

Every approach (to anything!) has its downside. The major downside of a tightly integrated database with no redundancy, is that an error anywhere (absolutely anywhere) will result in loss of information. This loss may be catastrophic. So hand-in-hand with database design we should always consider maintenance of multiple copies of the database, to avoid data loss.

The other major downside is that, because each database necessarily provides a unique representation of data, that not everyone who uses the database will agree on this representation. It is therefore vital that (as far as possible) participants all at least nominally agree on the way data is represented, organised, and made available within the database.

Components

Clearly in order to have a useable database we need:
  1. Data capture - with people to perform this vital task, usually considered serfs and villeins;
  2. The actual database program that handles the data ("DBMS" or database management system);
  3. The data itself, and its logical representation (including clarification as to what each label refers to, or 'definitions', something which is often sorely inadequate)!
  4. A method for users to extract data from the database (prayer is not generally considered sufficient);
  5. Users (both flunkeys and pointy-haired managers);
  6. People who maintain the database (and talk directly to god).
Clearly you know exactly where you want to be in the hierarchy!

Dataspeak

As with any field (Medicine is a prime example) database people have developed their own arcane, ambiguous and super-redundant neologisms. We will try and cut through the jargon, but here (just to titillate you) is some of the terminology you may encounter:

  • entity - something (object or event) which you want to describe (accumulate information about) in your database,
  • entity class (entity set, entity type) - a collection of similar entities;
  • attribute - one of the properties of an entity that we, in our wisdom, have chosen to record;
  • identifier - something that (within the entity class) uniquely identifies a particular entity. Joe Soap has a six inch long bright red nose, or whatever; the identifier may be 'composite' - Joe Soap and Anne Green may both have big red noses, but Joe Soap may be perpetually drunk, while Anne Green is teetotal (with rhinophyma)!
  • An association relates two pieces of information - for example, Joe Soap may be a patient of Anne Green.
  • A data item is, as you might expect, a unitary 'piece of data'. It's often convenient to clump data items together as "data aggregates" (but be careful of doing this willy nilly)!

Armed with the above definitions (which are very convenient labels), we rush on to look at how databases are put together. A database is usually made up of tables, which are made up of data items. Within a table, it is convenient to group together rows of related data items. For example we might have a table called people, and this may contain information about, say, Anne Green and Joe Soap - their first names, surnames, ages, (ahem) drinking habits, and so on. Each person could be allocated one row of information (A row is often referred to as a record, which is simply a synonym for row).

It's clearly a good idea to be able to uniquely identify each record within a database. Now we could use any arbitrary characteristic of the row, for example, Joe's nose, .. but wait a bit, this isn't unique, and most other people have fairly similar noses! It's a much better idea to choose a row characteristic that is truly unique, and that will always remain so. For example, with people we're looking at some sort of national identity number, or other unique identifying code - within an organisation, staff members will probably have a unique staff code.

We call such an identifier (that uniquely identifies a row) a primary key. Note that in some cases, one may need to lump several items together in order to uniquely identify a row, although we'll try and avoid this, as it's messy! Alert readers will by now be asking themselves "Why the word 'primary'? What's a 'secondary' key?" Good question, but one that we're not going to answer right away!

Further terminology to confuse you

It is common in database terminology to refer to a table as a "file". We avoid this practice, as the word 'file' is commonly used in a much more generic way in computer programming.

Associations

We already know what an association is, but this knowledge is only the start! Most database books at this point wax eloquent, and draw lots of confusing arrows. We'll limit our exuberance to pointing out that if we have a single data item and several other data items, there may be no meaningful association between the first item and the others, there may be an association with just one of the other items, or the single item may be logically associated with several other items. Based on this simple observation, volumes have been written, and sorting out confusion in databases often depends on re-organising things to eliminate complex and confusing associations between data items.

Different Views

From the above it should be clear that there are different ways of looking at a database. The three common ones are:
  1. The logical structure of the database;
  2. The way the database is actually implemented;
  3. The view of the end-user.

In the ideal database, these should mesh seamlessly (dream on)! (You may occasionally come across references to this simple three-pronged approach as, for example, "three schema architecture", or "ANSI/SPARC model". Don't let such terminology put you off).


Planning a Database

If you strip out the fancy terminology from our previous discussion, it would seem relatively easy to sit down and create a database. In fact, to anyone who has a rudimentary knowledge of say, Windows, this is the case - it's excessively easy to sit down with a program like Microsoft's Access and throw together a quick database, often with catastrophic long-term consequences!

What's the problem? Well, we've already talked about Rule #1, and, frankly, this rule is a bit of a blighter. It turns out that sticking to good old rule #1 is more difficult than it would first appear. The problem is something we term normalisation - if you're not excessively careful in your database design, then subtle associations creep into your data, and before you know where you are, you have redundancy and repetition! We come to our second rule:

We cannot stress Rule #2 too much. Most databases that turn out to be total stuffups (often after great initial promise) fail because Rule #2 has been flagrantly ignored. A quick fix often turns into expensive long-term dependency! McFadden and Hoffer outline a stepwise approach to design that is still relevant over ten years down the line (although one needn't be too anally retentive about sequentially moving from one step to the next). They talk about:

  1. Getting committment from management (the most onerous task of all)
  2. Define goals and objectives
  3. Say who is going to administer the database
  4. Analyse the 'business' (define what business 'functions' or activities the business actually does; how decisions are actually made ['business processes']; what actually happens ['business activities']; and business 'entities' - we've discussed entities above, but what we mean here is relevant people, things and events).
  5. Model what is going on - this information model is central to the whole process
  6. Decide how data will be distributed to those who need it!
  7. Plan the implementation of the database
  8. Review the whole plan, and when ready, get going!

You can see such planning is a non-trivial task, and considerable thought is required. Simply sitting down and knocking together a database is not on! (But, having said this, we later on give you a practical example of how to sit down and knock together a database. Do as we say, not as we do)!


Modelling a Database - The Relational Approach

Although a variety of different 'database models' have been used in the past, most current databases are designed around what is called the relational model. Although this sounds intimidating, we've already described key components of a relational database:

relational database is made up of two-dimensional tables (relations), which in turn are made up of rows of related data items. The data are grouped as follows:

  • Each row is distinct;
  • The order of rows doesn't matter;
  • Data in rows are organised so that there are distinct columns each of which contains a single piece of information about an attribute of the row (In our fanciful example from above, one column might have the title "nose", and the entries for Joe Soap and Anne Green might both read "red", while other chaps might have the value "normal". Other columns might be called "Given Name" and "Surname", and so on..)
  • The order of columns also doesn't matter!

Relational Dataspeak

There is a wealth of confusing dataspeak associated with relational databases. Here's a sample of terms:
  • You might think that we would be content with talking about a row. You would be wrong! What we sometimes do is count up the number of data items in the row (say four) and refer to this as a 4-tuple! (Not content with this we might even talk about the "cardinality" of a relation - in our 4-tuple, the cardinality would be four).
  • We also talk about the domain of an attribute. Let's say we limit our description of noses to either "red" or "normal". The domain for "noses" is then simply the list ("red", "normal"). The domain for "Age in years" might be (0, .. , 137), but be careful - someone might dredge up a hundred-and-fifty year old Russian, for example!
  • We already (heh) know what a primary key is. (An "alternate key" is a possible candidate for a primary key that didn't quite make it past the selection stage!) Note that a primary key might be simple (represented by one data item) or composite (made up by amalgamating several data items, and referred to previously in our discussion). A synonym for a composite key is a concatenated key.

Think

A little thought will let you deduce that you can only relate two tables (relations) in a relational database if the two tables share common attributes. There is a deep relationship between what we do to relations, and set theory. We won't explore this relationship, we'll merely give you the bottom line! The bottom line is that we can logically deduce a set of rules that dictate how we create a database. If we stick to these rules (collectively called normalisation) then the database will predictably have the magic, desirable properties of easy maintenance and minimal redundancy!

You can also deduce certain properties of relational databases. Let's say that we have a table containing a column stuffed with numbers which refer to a second table. We go to the second table, look at the column containing the primary key of this second table, and note that the numbers in this column correspond to the column we examined in the first table. But wait .. woops .. one of the primary key values we expected to be there (because we saw it in the relevant column of the first table) isn't actually there - the database has lost referential integrity.

An example might dispel the gloom .. Assume we have a table called Customer with the person's ID number as the primary key, and another table called Banking, with the Customer's banking details, including their bank account number as the primary key in that table. Assume that there is a column in the Customer table called "BankNum", which contains a reference to the primary key of the Banking table. Let's now say we go to Joe Soap's entry in the Customer table, and find a reference in the "BankNum" column that says his bank number is "012399994". If we cannot find this number in the primary key column of the "Banking" table, then we've lost referential integrity.

There are a few more terms we should probably introduce here:

  • In the above example, the Customer table has a primary key, but it also refers to another table using a cross-reference key, a pretty self-explanatory term.
  • Let's assume we have a table with a primary key. Now we go to another table, and use our primary key in this table as part of a composite primary key. We then refer to this usage as a foreign key.
  • A well-defined relational algebra, and a relational calculus exist for manipulating relational database. Basically a relational algebra allows you to take one or two relations and manipulate them to make a new relation; relational calculus is (effectively) a way of combining many commands from relational algebra into one terse statement!

The relational model - the downside!

Although the relational model generally works extremely well (if properly implemented), there is a down side. Firstly, extensive maintenance may be needed when we change the value of a key. Secondly, a poorly designed relational database management system (RDBMS) may be lethargic and inefficient. Thirdly, relational integrity violations may cause catastrophe!


Next - Normal Form Home

Date of First Draft: 2001/11/11 Date of Last Update: 2001/-/- Web page author: jo@anaesthetist.com