Wednesday, May 14, 2014

When you should pick a relational database

In this post I'd like to underline the reasons why you should choose to use a relational database for your next application, and, in contrast, the reasons why you shouldn't.

Not many years ago, applications were fundamentally supporting businesses for boring things like accounting, warehouse management and so on. In the 80's relational databases were still hot (and expensive) technology, but in the 90's their usage was a given. Any application that needed to process business data stored the data in relational form.
In the 90's we saw the unstoppable growth of object oriented languages. OO languages allow you to create data models in the application programming language, independently from databases. This started the trend that came next. 
Before OO languages the tools we had to build applications were specifically tailored for relational data processing. I could make some examples: former PowerSoft PowerBuilder, Oracle Forms, Gupta SQLWindows, Borland Delphi, Microsoft Visual Basic. What these tools had in common was an approach where data processing was delegated to the RDBMS engine, often in the form of stored procedures, while the application was simply displaying the data through a graphical user interface. The so called client/server architecture.

With the advent of OO languages programmers felt it easier to model data in their programming language than leaving the processing of the data to the database.
The reasons for this are still relevant today:
  • by implementing the data model in the application you can write code that is not dependent on the specific RDBMS vendor and your code can be easily ported from one RDBMS to another;
  • application code is far easier to debug than stored procedures;
  • you need not learn the intricacies of SQL...
In order to move data from application memory to the database, ORM tools, like Hibernate, were implemented (actually I wrote one myself). This encouraged the perception of simplicity that is prevalent today: using an ORM makes the database transparent and allows you to concentrate on coding the business logic.
It is now clear that things are not that simple. ORMs come with a lot of complexity which is hidden when you start coding simple transactions but becomes relevant when your task is to perform non trivial data processing on significant data volumes or you need to scale to a large user population.
I'd like to talk about the impedance mismatch between OOP and the relational model in another post since it is a huge and generally misunderstood topic. Instead here I want to suggest a different perspective on when relational databases matter and why OO data models in application programs are often a bad choice.

The fundamental problem I see with the OO approach to data modelling is what I would call the "survival presumption". I have been developing business software for thirty years. I started using relational databases in 1984 when Oracle version 5 came out on an IBM PC AT which had 10MB of hard disk drive and 640KB of main memory. You had to add an expensive memory expansion board in order to reach the 1,64 MB of RAM that Oracle required to run.
Since then I have been using relational databases and, guess what I have been able to save from one application generation to the next: the data and the database schema. I have been developing for dumb terminals, Unix workstations, MS-DOS PCs, Windows, Linux and the browser based applications in the original and Web 2.0 styles and can tell you it is far more probable that in a few years you will be replacing your programming language than the database your business is running on. SQL is the only know-how that I have been able to carry on from 1984. For this reason learning to use it well is the best skill you can invest on for your future.
The problem with modelling data in the application programming language is that it is short sighted. Programmers with little experience are selfish and think databases are needed in order to "persist" their beautifully designed objects. So, no wonder they find relational databases slow and inflexible for their needs. This explains in part the success of document oriented databases. They allow programmers to persist their objects with basically zero overhead and minimal code. And retrieving the same objects is simple and amazingly fast.
If this is what you need, then you should definitely use document databases, or even flat files for that matter. Obviously document databases offer capabilities that file systems don't have.

But relational databases have a far greater scope than persisting objects. 
First, relational databases model data independently from application programs and the programming languages used. This explains in part why it's harder from a programmer perspective to interact with a relational database. But programmer's convenience is not a greater priority than data independence (I think programmer's convenience can be obtained in another manner, I will talk about this in the future). It is far more important to be free to interact with the database from any programming language and be able to migrate to new technology in a seamless manner. 
Data and metadata never becomes irrelevant because data can be moved and reprocessed differently. On the other hand, programs have a well-defined life cycle. Once they are technologically obsolete they have to be replaced.
It is also far more important that multiple applications be able to share access to the same data concurrently and with different access patterns. If data is designed from the perspective of one application, it will not be able to fit well to the requirements of another application you design tomorrow that needs the same data but in a different "shape". Relational databases are very capable of accommodating new requirements and new access mechanisms to existing data. Other kind of databases are far less capable of doing the same.
At the same time it is far more important for users to be able to use general purpose tools to work with their data outside the application boundary. If data consists of "serialized objects", the risk is users will not be able to understand them and use them outside the application that saved them.
Data must be designed for users, to survive programs and be adaptable.
Another reason relational databases are not persistent object managers is constraints. A relational database consists of relations and constraints. Bad data is of no good use because when you have bad data you can't trust the result of your queries. So if you base your decisions on bad data, you can take the wrong decisions.
Since relational databases are meant to be used to share data access across applications, constraints should be implemented in the database itself to ensure that, no matter which client is performing a transaction, data will be consistently validated and correctly updated. If we leave this to application programs inconsistencies can occur from code duplication. There is also another valid reason to process constraints in the database: it is very difficult to guarantee the correct transaction isolation semantics in application programs. Database transactions should be written by knowledgeable people once and for all. It is easier to guarantee correctness of transactions in presence of high concurrency when constraints are implemented in the database. And it should be clear that this approach also improves performance and scalability dramatically.
Yet today this approach is not common. 
I personally think it is a cultural problem coming mainly from the dominant education on OOP and imperative programming.
But is also a problem that comes from the lack of good relational database development tools.
This is the focus of my current research and development activity and I will talk about this in future posts.

No comments:

Post a Comment