Relational data theory describes how data should be modeled and organized in a database system. E.F. Codd introduced the relational data model in 1970 while working at IBM. Prior to this, data was stored in files, often in a hierarchical structure, and accessing it was challenging and required custom programming.
Relational data theory revolutionized how data could be stored, retrieved, and manipulated. With the creation of the SQL language, it became relatively easy to query data, as well as to insert, update, and delete it. From its creation, relational data theory also included a number of advanced concepts supported by mathematical principles from set theory and first-order logic.
Functional dependencies, join dependencies, Rissanen’s Theorem, Heath’s Axiom, Armstrong’s Axioms, and Closure are the concepts that I will be reviewing in this article. I think that each of these will help you to understand relational data theory in more depth. My hope is that you will also appreciate the richness of relational data theory. Let’s get started!
Functional Dependencies in Relational Data Theory
Functional dependencies (FDs) are a fundamental concept in relational data theory that describe the relationships between attributes (columns) within a relation (table). They are essential for understanding data integrity, normalization, and ensuring a database is structured efficiently without unnecessary redundancy.
A functional dependency occurs when one attribute (or a set of attributes) in a relation uniquely determines another attribute or set of attributes. The left-hand side can be considered the determinant, while the right-hand side can be considered the dependent.
For example, a student ID uniquely determines the student’s SSN, and vice versa. You might think that the student’s name would uniquely determine the student’s SSN, but that would be incorrect, since more than one student can have the same name. The SSN, however, would likely uniquely determine the student’s email address.
The way to write this in mathematical notation is A→AB, then AB→B. Or: student’s SSN → student’s SSN + student’s email address, then student’s SSN + student’s email address → student’s email address. This example also introduces trivial dependencies, A→A and B→B, which can be eliminated.
The non-trivial dependencies would then be A→B, as well as B→A. Seems logical, so why bother? One, to account for all dependencies in the data. In a formal theory (which this is) you have to account for all dependencies. Two, you will often come across database systems (poorly) designed with trivial dependencies incorrectly introduced as non-trivial dependencies. Three, to introduce two of the five types of functional dependencies. The others are full functional dependencies, partial dependencies, and transitive dependencies.
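To make this concrete, here is a minimal sketch in Python (the column names and rows are invented for illustration) that tests whether a candidate dependency is consistent with sample data:

```python
# Minimal sketch: test whether an FD is consistent with sample rows.
# Column names and data are hypothetical, invented for illustration.

def fd_holds(rows, lhs, rhs):
    """True if every combination of lhs values maps to exactly one
    combination of rhs values in the given rows."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        val = tuple(row[c] for c in rhs)
        if seen.setdefault(key, val) != val:
            return False  # same determinant, two different dependents
    return True

students = [
    {"ssn": "111-22-3333", "name": "Pat Lee", "email": "pat1@example.edu"},
    {"ssn": "444-55-6666", "name": "Pat Lee", "email": "pat2@example.edu"},
]

print(fd_holds(students, ["ssn"], ["email"]))  # True: SSN -> email
print(fd_holds(students, ["email"], ["ssn"]))  # True: email -> SSN
print(fd_holds(students, ["name"], ["ssn"]))   # False: names repeat
```

Note that sample data can only refute a dependency: an FD is a claim about every possible state of the data, so a True result means the rows are consistent with the dependency, not that it is proven.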
Trivial Dependencies
The goal with trivial dependencies is to recognize them and then eliminate them. This is a key step in normalization. Another way to think of it: if the right-hand side of the dependency is a subset of the left-hand side, the dependency is trivial.
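In code, recognizing a trivial dependency is nothing more than a subset check. A quick sketch, with hypothetical attribute sets:

```python
# An FD lhs -> rhs is trivial exactly when rhs is a subset of lhs.
def is_trivial(lhs, rhs):
    return set(rhs) <= set(lhs)

print(is_trivial({"ssn", "email"}, {"email"}))  # True: trivial
print(is_trivial({"ssn"}, {"email"}))           # False: non-trivial
```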
Non-Trivial Dependencies
As shown above, A→B and B→A are non-trivial dependencies. These represent the genuine integrity constraints of the data. A student’s email address would likely depend on their SSN, and the SSN would in turn depend on the email address. A non-trivial dependency ensures that B is not simply redundant information already contained in A.
Full Functional Dependencies
Full functional dependencies are dependencies where removing any attribute from the determinant means the dependency no longer holds. Another way to express this is that all attributes in A are necessary: no proper subset of A can determine B. This is similar to non-trivial dependencies, but non-trivial dependencies focus on the right-hand side of the dependency, while full functional dependencies focus on the left-hand side. In other words, non-trivial dependencies ensure that B is not simply a subset of A, while full functional dependencies ensure that every attribute in A is necessary to determine B.
For example, a student’s ID, course & grade is dependent on student’s ID & course. This is a full functional dependency. However, it still includes the trivial dependency of student’s ID & course dependent on student’s ID & course. The non-trivial dependency is student’s ID & course → grade.
Partial Dependencies
Partial dependencies can arise when a composite set of attributes determines another attribute. We can see this in the example above, where the grade is dependent on the student’s ID and course, and on nothing else. If we added the student’s address to the determinant (student’s ID + course + address → grade), we would create a partial dependency: the grade does not depend on the address, so we can eliminate it. In Second Normal Form (2NF), we strive to eliminate partial dependencies by ensuring all non-prime attributes are fully dependent on the primary key of the database table.
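Here is a sketch covering both the full dependency above and the partial case (all names and rows are invented; fd_holds is the same helper as in the first sketch):

```python
from itertools import combinations

def fd_holds(rows, lhs, rhs):  # same helper as in the first sketch
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        val = tuple(row[c] for c in rhs)
        if seen.setdefault(key, val) != val:
            return False
    return True

def is_full(rows, lhs, rhs):
    """Full: the FD holds and no proper subset of lhs determines rhs."""
    return fd_holds(rows, lhs, rhs) and not any(
        fd_holds(rows, list(sub), rhs)
        for r in range(1, len(lhs))
        for sub in combinations(lhs, r)
    )

grades = [
    {"id": 1, "course": "CS101", "address": "12 Elm St", "grade": "A"},
    {"id": 1, "course": "MA201", "address": "12 Elm St", "grade": "B"},
    {"id": 2, "course": "CS101", "address": "9 Oak Ave", "grade": "C"},
]

print(is_full(grades, ["id", "course"], ["grade"]))             # True
print(is_full(grades, ["id", "course", "address"], ["grade"]))  # False
```

The larger determinant still "works" on this data, but it is not a full dependency, which is the signal that the address attribute belongs elsewhere.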
Transitive Dependencies
If A→B and B→C, then A→C. As an example, if the student’s name and address depend on the student’s ID, and the student’s email address depends on the name and address, then the student’s email address also depends on the student’s ID.
This often occurs when a table has multiple candidate keys. Transitive dependencies are not always incorrect, but a table containing them may not be in proper normal form, thereby allowing redundancy and anomalies in the data. It is important to have a critical eye for all transitive dependencies and ask whether they can be eliminated. In Third Normal Form (3NF), we strive to remove transitive dependencies by ensuring that all non-prime attributes depend only on the primary key.
Join Dependencies in Relational Data Theory
A join dependency (JD) is a constraint that specifies how a relation (table) can be decomposed into multiple smaller relations (subtables) and then reconstructed (or “joined”) without losing any data. It is a generalization of functional dependencies (FDs) and plays a crucial role in database design, particularly in 5th Normal Form (5NF).
I think this concept can be confusing. Basically, in attempting to eliminate duplication of data, relations that can be decomposed into two smaller tables are fairly straightforward to identify. Some tables cannot be decomposed into two tables but can be decomposed into three.
However, care must be taken to ensure that the decomposed tables can be rejoined without any loss of data. A join dependency allows a table to be decomposed into multiple tables without losing information, ensuring that the original table can be reconstructed by joining the decomposed tables.
For example, if we had a table with the attributes student ID, class, and class time, we could decompose it into one table with student ID and class, another with student ID and class time, and a third with class and class time. With these three tables, we can reconstruct the original table, provided the join dependency actually holds.
If, instead, we had decomposed the original table into only a table with student ID and class and another with student ID and class time, we would not be able to reconstruct the original table, since we would lose the relationship between class and class time.
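Here is a sketch of this example, with relations as sets of rows and a hand-rolled natural join (the data is invented, and it is constructed so that the three-way join dependency holds):

```python
from itertools import product

def project(rows, cols):
    """Project dict rows onto cols; rows become sorted (column, value) tuples."""
    return {tuple(sorted((c, r[c]) for c in cols)) for r in rows}

def natural_join(r1, r2):
    """Natural join of two projections on their shared columns."""
    out = set()
    for a, b in product(r1, r2):
        da, db = dict(a), dict(b)
        if all(da[c] == db[c] for c in set(da) & set(db)):
            out.add(tuple(sorted({**da, **db}.items())))
    return out

enrollments = [
    {"student": "s1", "class": "CS101", "time": "9am"},
    {"student": "s1", "class": "MA201", "time": "1pm"},
    {"student": "s2", "class": "CS101", "time": "9am"},
]

sc = project(enrollments, ["student", "class"])
st = project(enrollments, ["student", "time"])
ct = project(enrollments, ["class", "time"])
original = project(enrollments, ["student", "class", "time"])

print(natural_join(natural_join(sc, st), ct) == original)  # True: lossless
print(natural_join(sc, st) == original)                    # False: spurious rows
```

The two-way decomposition invents enrollments that never existed, such as student s1 taking CS101 at 1pm; the third projection is what filters those spurious rows out.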
Rissanen’s Theorem
Rissanen’s Theorem is a foundational result that Jorma Rissanen proposed in 1977. It refines, broadens, and simplifies earlier work by Heath (read on to learn more about that) by offering a more general theorem than Heath’s Axiom. Rissanen also originated the minimum description length (MDL) principle in information theory, which the rest of this section draws on.
It provides a framework for understanding the relationship between the complexity of a model, the data it describes, and the trade-offs involved in encoding the data efficiently.
The basic concept behind Rissanen’s Theorem is minimizing how much data to store while preserving information. An example of this is normalization. Normalization reduces redundant data by breaking it into smaller, related tables and categorizing it into normal forms (1st, 2nd, 3rd, Boyce-Codd, 4th, and 5th).
Mathematically, Rissanen’s Theorem says to minimize:
L(M) + L(D | M)
where L(M) is the length of the description of the data model and L(D | M) is the length of the data encoded using that model. For example, a fully normalized database has a longer data model than a denormalized one, but it stores the data more efficiently because it eliminates redundancies. Overall, the total description length will be shorter.
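As a back-of-the-envelope illustration, here is a sketch with invented sizes and row counts (none of these numbers come from a real system):

```python
# Hypothetical sizes, invented for illustration only.
students, courses, enrollments = 1_000, 100, 10_000
id_bytes, name_bytes, title_bytes, grade_bytes = 4, 30, 40, 1

# Denormalized: one wide table repeats the name and title on every row.
denorm_model = 50   # rough cost of describing one wide table
denorm_data = enrollments * (id_bytes + name_bytes + id_bytes
                             + title_bytes + grade_bytes)

# Normalized: three tables; a longer model, but nothing is repeated.
norm_model = 150    # rough cost of describing three tables plus keys
norm_data = (students * (id_bytes + name_bytes)
             + courses * (id_bytes + title_bytes)
             + enrollments * (2 * id_bytes + grade_bytes))

print(denorm_model + denorm_data)  # 790050
print(norm_model + norm_data)      # 128550
```

The normalized design pays a slightly larger model cost and wins it back many times over in data cost, which is the MDL trade-off in miniature.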
Relational data theory applies Rissanen’s Theorem to optimize queries as well, by choosing execution plans that minimize computation and I/O costs without bloating storage. It also compresses data through columnar storage or dictionary encoding to reduce data size.
Researchers also use Rissanen’s Theorem and the minimum description length (MDL) principle to determine the minimum data size required to create a model that avoids overfitting and minimizes error.
Heath’s Axiom for Relational Data Theory
Heath’s Axiom is a fundamental principle in relational data theory. Ian J. Heath introduced it in 1971 as part of the early research on database normalization and decomposition. There were no relational database systems at that time. The first one developed was System R (fitting for me, since I love the R programming language, though there is no relation). IBM developed it between 1974 and 1977, and it introduced SQL. It was not a commercial product; that came a few years later, in 1979, with the release of Oracle V2.
Heath’s Axiom aided relational data theory by helping to identify and resolve redundancy issues. It did this by guiding the decomposition of relations (tables) into projections (smaller tables). This eliminates unnecessary duplication of data while ensuring that the decomposition is lossless.
For example, if we have a relation consisting of SSN, email address, and class, we can decompose it into the projections (SSN, email address) and (SSN, class). This decomposition is lossless because SSN determines email address. In other words, R(A, B, C) can be decomposed into R1(A, B) and R2(A, C) whenever A→B (B is functionally dependent on A). Even though C is not functionally dependent on A, the decomposition is still lossless.
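Here is a sketch of Heath’s Axiom on tiny invented rows, using the same projection and natural-join helpers as the join-dependency sketch. When SSN→email holds, the binary decomposition rejoins losslessly; when the FD is violated, the rejoin on this sample gains spurious rows:

```python
from itertools import product

def project(rows, cols):  # same helpers as in the join-dependency sketch
    return {tuple(sorted((c, r[c]) for c in cols)) for r in rows}

def natural_join(r1, r2):
    out = set()
    for a, b in product(r1, r2):
        da, db = dict(a), dict(b)
        if all(da[c] == db[c] for c in set(da) & set(db)):
            out.add(tuple(sorted({**da, **db}.items())))
    return out

rows = [
    {"ssn": "111", "email": "a@x.edu", "class": "CS101"},
    {"ssn": "111", "email": "a@x.edu", "class": "MA201"},
    {"ssn": "222", "email": "b@x.edu", "class": "CS101"},
]

# SSN -> email holds, so R1(ssn, email) and R2(ssn, class) rejoin losslessly.
r1, r2 = project(rows, ["ssn", "email"]), project(rows, ["ssn", "class"])
print(natural_join(r1, r2) == project(rows, ["ssn", "email", "class"]))  # True

# Violate the FD (same SSN, two emails): the rejoin now gains spurious rows.
rows[1]["email"] = "alt@x.edu"
r1, r2 = project(rows, ["ssn", "email"]), project(rows, ["ssn", "class"])
print(natural_join(r1, r2) == project(rows, ["ssn", "email", "class"]))  # False
```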
Armstrong’s Axioms
William W. Armstrong introduced these axioms in 1974. They are the foundation of functional dependency theory. They are used to infer all of the functional dependencies (FDs) implied by a given set of FDs. I’m going to go through these briefly.
There are three Armstrong’s Axioms:
- Reflexivity: If Y is a subset of X, then X determines Y.
- Augmentation: If X determines Y, then XZ determines YZ for any set of attributes Z.
- Transitivity: If X determines Y and Y determines Z, then X determines Z.
Armstrong’s Axioms are particularly useful when working toward Third Normal Form (3NF) and Boyce-Codd Normal Form (BCNF). They offer a systematic method for computing the closure of a set of functional dependencies, which is written F⁺.
Relational Data Theory Closure
In the context of relational data theory, closure refers to the set of attributes or functional dependencies that can be inferred from a given set of attributes or functional dependencies, based on a given set of rules. Closures are critical when analyzing and working with functional dependencies and normal forms.
That’s a mouthful! Basically, closures are the tools you use to verify that a schema is properly normalized: once every implied dependency has been accounted for, you have achieved closure (congratulations!). There are two types of closure. Attribute closure (X⁺) helps to identify superkeys and candidate keys. Functional dependency closure (F⁺) helps determine all implied dependencies in a database schema.
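Here is a sketch of attribute closure: the standard algorithm simply applies the given FDs repeatedly until nothing new can be derived, which amounts to running Armstrong’s axioms in a loop. The attributes and FDs below come from the earlier transitive dependency example and are hypothetical:

```python
# Attribute closure X+: every attribute derivable from X using the FDs.

def closure(attrs, fds):
    """fds is a list of (lhs, rhs) pairs of attribute sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

fds = [
    ({"id"}, {"name", "address"}),     # id -> name, address
    ({"name", "address"}, {"email"}),  # name, address -> email
]
all_attrs = {"id", "name", "address", "email"}

print(closure({"id"}, fds))                 # all four attributes (order varies)
print(closure({"id"}, fds) == all_attrs)    # True: id is a superkey
print(closure({"name"}, fds) == all_attrs)  # False: name alone is not
```

Computing F⁺ directly can blow up exponentially, so in practice questions about F⁺ are usually answered by computing attribute closures like this instead.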
Summary
Yay! You made it to the end! I hope you have a newfound appreciation of the hard work that went into the creation of relational data theory and how deep, complex, and well thought out it is.
I wrote this article to emphasize that creating a relational database involves more than just modeling a set of tables that seem to fit. There are several theories and axioms that support the optimal relational structure, and these can be broken down mathematically.
So next time you have the chance to create a relational database (or review one), consider reflecting on all this work, so you can build a better system and perhaps communicate possible issues more clearly when modeling the data.