Database Essentials

A discussion of basic database concepts.

The Holy Grail Of Productivity

A wise man once said that information is power. And in the Internet age, that statement has been proved correct more than once.

As more and more companies move their businesses online, a vast sea of digital data comes into being, bits and bytes that control both where businesses have come from, and where they will go in the future. Everything a company knows about its business, its customers and its partners is slowly being converted into ones and zeroes, turning paper-based offices into cyberspace-based virtual workplaces, increasing productivity and streamlining business processes.

Of course, there's a flip side to this as well. As the volume of data grows, it becomes harder to track and manage it effectively. And so some bright spark came up with the idea of organizing all that data into databases - essentially, data storage containers that impose a structure on the data they contain, so as to simplify the task of managing and using all that information.

This is a fundamentally good idea, since it first makes it possible to organize large amounts of information, and then search through this information for specific items of data. It also offers benefits from the point of view of portability and compatibility (once the data is organized and stored in a database, it can be extracted and displayed in any manner you choose), provides a centralized storage location for important information, and makes it easier to identify relationships between different data segments.

Over the next few pages, I'm going to give you a crash course in some of the basic concepts of database theory, in the hope that it will offer a starting point for your own exploration of this field of study. If you're a novice when it comes to databases, some of the concepts explained here should help you put things in perspective, provide you with an explanation of some of the terms used by database engineers, and also offer you some insight into the capabilities of today's most popular database engines.

The Customer Is King

First, though, let's start with the basics. What exactly is a database?

A database, quite simply, is a collection of data. Every database is composed of one or more "tables" - these tables, which structure data into rows and columns, are what lend organization to the data.

Here's an example of what a typical table looks like:

# "names" table
+------------+-----------+----------+
| CustomerID | FirstName | LastName |
+------------+-----------+----------+
|     234673 | John      | Doe      |
|     734736 | Julius    | Caesar   |
|       1243 | Daffy     | Duck     |
+------------+-----------+----------+

As you can see, a table divides data into rows, with a new entry (or "record") on every row. If you think of a table as a filing cabinet, you'll see that every file in the cabinet corresponds to one row in the table.

The data in each row is further broken down into cells (or "fields"), each of which contains a value for a particular attribute of the data. For example, if you consider the record for the user "John Doe", you'll see that the record is clearly divided into separate fields for customer ID, and first and last names.

The rows within a table are not arranged in any particular order - they can be sorted alphabetically, by id, by member name, or by any other criteria you choose to specify. Therefore, it becomes necessary to have some method of identifying a specific record in a table. In the example above, each record is identified by a "CustomerID", which is a number unique to each row or record; this unique field is referred to as the "primary key" for that table. A table may or may not have a primary key - it's not essential - but a primary key does make it easier to locate records in a table.

Relationships

Many of today's databases are "relational database management systems", referred to by the acronym RDBMS. A relational database is typically composed of multiple tables, which contain inter-related pieces of information. SQL, or Structured Query Language, allows you to combine the data from these tables in different ways, thereby allowing you to create and analyze new relationships.

The example you just saw consists of a single table. While this is fine by itself, it's when you add more tables, and relate the information between them, that you truly start to see the power inherent in this system. Here's an example:

# "names" table
+------------+-----------+----------+
| CustomerID | FirstName | LastName |
+------------+-----------+----------+
|     234673 | John      | Doe      |
|     734736 | Julius    | Caesar   |
|       1243 | Daffy     | Duck     |
+------------+-----------+----------+

# "addresses" table
+------------+------------------------------+---------------+-------+--------+---------+
| CustomerID | StreetAddress                | City          | State | Zip    | Country |
+------------+------------------------------+---------------+-------+--------+---------+
|     234673 | 12, Some Street              | Nowheresville | ZZ    | 748202 |    56 |
|     734736 | The Palace Behind The Lake   | Ancient Rome  | NA    |      0 |     0 |
|       1243 | Next Door To The Rabbit Hole | Toontown      | TT    |  98765 |     0 |
+------------+------------------------------+---------------+-------+--------+---------+

# "balance" table
+------------+----------------+
| CustomerID | AccountBalance |
+------------+----------------+
|     734736 |     6546867.67 |
|     234673 |        3275.99 |
|       1243 |           3.00 |
+------------+----------------+

Each of these three tables is an independent entity; however, they can be linked to each other by means of the CustomerID field common to all of them. For example, you could easily use the CustomerID 1243 to find out that user "Daffy Duck" lives in Toontown, and has a balance of exactly $3.00 in his account. Or you could go about it the other way: among all the customers with a balance less than $10.00 in their accounts, CustomerID 1243, aka "Daffy Duck", is the only one living in Toontown, TT.

These relationships are the fundamental basis of a relational database system. By making it possible to easily relate records in different tables to one another, an RDBMS makes it possible to analyze date in different ways, while simultaneously keeping it organized in a systematic fashion, with minimal redundancy.

Invasion Of The Foreign Keys

You will notice that it is the CustomerID field, which appears in all three of the tables, that makes it possible to link different tables together. In such a case, the CustomerID field is also referred to as a "foreign key", since it is used to refer to records in different tables. In the example above, the name of the foreign key field is the same for all tables; this is not necessary at all times.

When each and every foreign key value is related to a field in another table, this relationship being unique, the system is said to be in a state of "referential integrity". In other words, if the CustomerID field is present in all the tables once and only once, and if a change to the CustomerID field in any single table is reflected in all other tables, referential integrity is said to exist.

This concept of referential integrity is a basic one, and one that becomes very important when designing a database with more than one table. When foreign keys are used to link one table to another, referential integrity, by its very nature, imposes constraints on inserting new records and updating existing records. For example, if a table only accepts certain types of values for a particular field, and other tables use that field as their foreign key, this automatically imposes certain constraints on the dependent tables. Similarly, referential integrity demands that a change in the field used as a foreign key - a deletion or new insertion - must immediately be reflected in all dependent tables.

Many of today's databases take care of this automatically - if you've worked with Microsoft Access, for example, you'll have seen this in action - but some don't. In the case of the latter, the task of maintaining referential integrity falls to the application developer, by manually updating all dependent tables to reflect changes in foreign key values.

Looking Up The Index

If your table contains a large number of records, you can often speed up queries by "indexing" certain columns. Similar in concept to the index you find at the end of a book, a database index is a list of sorted field values, used to simplify the task of locating specific records in response to queries.

Typically, it's a good idea to index fields on which you carry out a large number of searches. For example, since most of the queries on the tables above take place via the CustomerID field, it's a good idea to index this field. Normally, if you had a query like this,

sql> SELECT * FROM names WHERE CustomerID = 30000;

the database server would have to scan each and every record in the table to locate the one with CustomerID 30000. By indexing this field, like this,

sql> CREATE INDEX ids ON names (CustomerID);

the database server can use the index to locate ID 30000, and then use the information stored in the index to quickly locate the relevant record on the disk drive.

The downside of using indices is that they usually occupy some amount of disk space, and can affect the time taken to alter (insert or update) records, since the index needs to be refreshed each time the data changes.

Joined At The Hip

Most databases also allow you to query two or more tables at a time, and display a combined set of results. This is technically referred to as a "join", since it involves "joining" different tables at specific points to create new views of the data.

Here's an example of a simple join:

sql> SELECT names.CustomerID, FirstName, LastName, StreetAddress, City, State, Zip, Country FROM names, addresses WHERE names.CustomerID = addresses.CustomerID;

+------------+-----------+----------+------------------------------+---------------+-------+--------+---------+
| CustomerID | FirstName | LastName | StreetAddress                | City   | State | Zip    | Country |
+------------+-----------+----------+------------------------------+---------------+-------+--------+---------+
|     234673 | John      | Doe      | 12, Some Street              | Nowheresville | ZZ    | 748202 |      56 |
|     734736 | Julius    | Caesar   | The Palace Behind The Lake   | Ancient Rome  | NA    |      0 |       0 |
|       1243 | Daffy     | Duck     | Next Door To The Rabbit Hole | Toontown   | TT    |  98765 |       0 |
+------------+-----------+----------+------------------------------+---------------+-------+--------+---------+

In this case, the "names" and "addresses" tables have been joined together through the common column "CustomerID".

And you can also nest one query within another, such that the result of the inner query provides data for the outer query. Such a query is referred to as a "subquery", and it allows a great deal of flexibility when formulating long and complex queries.

Let's suppose you wanted to find out the names and addresses of all customers with an account balance greater than $1000.00

sql> SELECT FirstName, LastName, StreetAddress, City, State, Zip, Country FROM names, addresses WHERE names.CustomerID = addresses.CustomerID  AND names.CustomerID IN (SELECT CustomerID FROM balance WHERE AccountBalance >= 1000);

In this case, SQL will first execute the inner query

sql> SELECT CustomerID FROM balance WHERE AccountBalance >= 1000;
+------------+
| CustomerID |
+------------+
|     734736 |
|     234673 |
+------------+

and then assign the return value(s) to the outer query, which will display the result.

sql>SELECT FirstName, LastName, StreetAddress, City, State, Zip, Country FROM names, addresses WHERE names.CustomerID = addresses.CustomerID  AND names.CustomerID IN ('734736', '234673');

+-----------+----------+----------------------------+---------------+-------+--------+---------+
| FirstName | LastName | StreetAddress              | City          | State | Zip    | Country |
+-----------+----------+----------------------------+---------------+-------+--------+---------+
| John      | Doe      | 12, Some Street            | Nowheresville | ZZ    | 748202 |      56 |
| Julius    | Caesar   | The Palace Behind The Lake | Ancient Rome  | NA    |   0 |       0 |
+-----------+----------+----------------------------+---------------+-------+--------+---------+

There is a limit on the number of sub-queries you can use in a single SQL statement, but it's usually quite a comfortable number.

Room With A View

Closely related to joins and subqueries is the concept of "views", which offers yet another way of displaying specific data slices.

Unlike joins and subqueries, which are based on actual database tables, a view is a table derived from the content present in other tables. Think of a view as a virtual table, one which contains values from other tables (these values selected on the basis of user-defined rules), and which can be manipulated in exactly the same way as a regular table.

Since a view is based on the data present in "real" tables, the data you see in the view changes as the data in the base tables changes. A view also provides a simple way to restrict access to sensitive fields in a database table, allowing you to exclude certain fields from display, and can be used in combination with joins to discover new relationships between different data items.

For example, the following command creates a view named "highRollers", which only contains a list of accounts with a balance in excess of $1000.00

sql> CREATE VIEW highRollers AS SELECT CustomerID FROM balance WHERE AccountBalance >=1000.00;

And now, when you display the contents of the "highRollers" view, this is what you'll see:

sql> SELECT * FROM highRollers;

+------------+
| CustomerID |
+------------+
|     734736 |
|     234673 |
+------------+

Any changes made to the "customers" table will be immediately reflected in the view; similarly, changes made to the data in the view will be executed on the underlying table.

Pulling The Trigger

If your database supports it, you can also use "triggers" to automate certain database actions. A "trigger", as the name suggests, is a built-in database notification of specific events - inserting a new record, updating an entry, deleting a set of entries - and you can use this notification to automatically perform pre-defined actions.

To illustrate this, let's go back to our example database for a minute. In this database, each time a customer closes an account, the entry is deleted from the database, never to be recreated. If, instead, you'd like to set things up so that the entry is also backed up to another archive table, you could define a trigger that would run on all DELETE events, and copy the relevant data to another table prior to deleting it.

Here's an example of what an Oracle trigger might look like, in the situation just described:

sql> CREATE TRIGGER backupdata BEFORE delete ON balance FOR EACH ROW
BEGIN
INSERT INTO backup VALUES (CustomerID, AccountBalance);
END;

Typically, triggers can be customized to a great extent - you can decide whether the action is to be taken before or after the event, once or multiple times, for specific records or all records, and even specify additional criteria to further focus the effect of the trigger.

For complex Web applications (typically those with mission-critical data), triggers come in very useful, since they make it possible to export simple rules to the database layer, thereby reducing the dependency on the application layer and sometimes even improving performance. I say "sometimes", because triggers can also increase database load significantly if used incorrectly - take the situation above, and extrapolate to imagine 5,000 records being deleted every second, and you'll immediately see why this could significantly hamper database performance.

So That's Where All My Money Went...

Database servers like Oracle also come with the ability to create "stored procedures" at the database layer - essentially, blocks of program code that can be executed at will, similar to the user-defined functions you can write in Perl or PHP.

The advantage of creating such stored program blocks at the database level is very simple, and very powerful - all applications which interact with the database can use these stored procedures, thereby reducing the time and effort required for application development. For example, if your application uses both Perl and PHP to communicate with a database, you would need to write procedures to accomplish the same tasks in both languages. By writing the procedure once and moving it to the common point of intersection - the database - application code becomes easier to write and maintain, and your time is used more effectively.

For example, here's a simple procedure, again from Oracle's PL/SQL language, which would reduce a specified customer's balance by 10%:

sql> CREATE PROCEDURE reducebalance (ID IN int, Balance IN int) AS
BEGIN
UPDATE balance SET AccountBalance=(Balance-Balance/10) WHERE CustomerID=ID
END;

This procedure would accept a CustomerID and AccountBalance as parameters, and use this information to update a database entry in the table. As you can see, this is similar to a Perl or PHP function, which accepts values as parameters and performs actions based on those parameters (incidentally, you can also have the stored procedure return a value to you, based on the results of code execution - just like a regular function)

And that's about it. I hope you found this information useful, and that it served as a good starting point for your journey into the world of databases. If you're interested in learning more, you should also take a look at the "Speaking SQL" tutorials at http://devshed.com/Server_Side/MySQL/Speak/ , and at the very interesting article on database normalization at http://devshed.com/Server_Side/MySQL/Normal/ . If you want to get right into building data-driven Web applications, take a look at PHP at http://www.php.net/, or check out the PHP 101 series of tutorials at http://devshed.com/Server_Side/PHP/ . See you soon!

Note: Some code snippets in this article are specific to particular database servers, and may not work correctly on all servers. All examples are illustrative only. YMMV!

This article was first published on23 Feb 2001.