In the beginning was the flat file – a simple list of entries (records) where one followed another. Here is an example:
Fred, 35, Male
Joe, 26, Male
Helen, 45, Female
…
Each collection of details (the record – for example, Fred's details) was stored as a separate entry in the file, and each new entry was typically appended to the end of the file. If you wanted to retrieve a record you had to search through the records until the one required was found – a process that became very slow as the number of records increased. Various structures were added to make retrieval faster, typically in the form of some sort of index. While this sped things up, it soon became clear that a business might have a dozen, a hundred or even thousands of different files holding related information. A Customer file would be related to an Orders file, which in turn might be related to a Sales file – and so on.
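To make the difference concrete, here is a minimal Python sketch (not from the original text) contrasting the two approaches: a sequential scan through every record versus a simple in-memory index. The file name and the name/age/sex layout are invented for illustration.

def find_record(path, name):
    # Linear scan: read every "name, age, sex" record until the right one turns up.
    with open(path) as f:
        for line in f:
            fields = [field.strip() for field in line.split(",")]
            if fields and fields[0] == name:
                return fields
    return None

def build_index(path):
    # One pass over the file builds a lookup table keyed on name, so later
    # retrievals are a single dictionary access instead of a full scan.
    index = {}
    with open(path) as f:
        for line in f:
            fields = [field.strip() for field in line.split(",")]
            if fields and fields[0]:
                index[fields[0]] = fields
    return index

# index = build_index("people.txt")     # hypothetical file of records as above
# index.get("Fred")                     # ['Fred', '35', 'Male']

A real index (a B-tree held on disk, for example) is more elaborate than a dictionary, but the principle is the same: do a little extra work on every insert so that retrieval no longer means reading the whole file.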
One of the first attempts to improve matters came in the form of the CODASYL (Conference on Data Systems Languages) database, which made an early appearance in the 1960s. These databases employed various indexing techniques and allowed all information to be grouped together in a single entity – the database. It was also possible to relate records in different files within the database via parent-child relationships. But we won't dwell on them here – they were fairly inflexible and required a high level of technical skill to use.
The big breakthrough in database technology came when researchers – notably E. F. Codd at IBM in 1970 – created the relational model for data. As the name suggests, it allows records in different files (tables) within the database to be related to each other: Customers are related to Orders. The technology was far easier to use, and its widespread introduction in the 1980s established the relational database as the dominant database technology – right up to the present day. In parallel came the development of Structured Query Language (SQL) – an easy-to-use language that allows data to be stored and retrieved, and databases to be created and modified. This has served us very well, but in the very largest applications cracks are starting to show. Relational databases do a lot of work ensuring that all the pieces hang together, and in some applications this just makes them too slow. And so we have recently seen the emergence of Big Data and NoSQL databases. In many ways these strip away the relational structures and make tying things together the programmer's job. They also support massive distribution of data and processing across multiple computers, so the database can be scaled as much as required.
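As a small, hedged illustration of the relational idea and of SQL, the sketch below uses Python's built-in sqlite3 module; the Customers and Orders tables and their columns are invented for this example.

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE Customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES Customers(id),  -- relates each order to a customer
        item TEXT
    );
    INSERT INTO Customers VALUES (1, 'Fred'), (2, 'Helen');
    INSERT INTO Orders VALUES (1, 1, 'Widget'), (2, 1, 'Gadget'), (3, 2, 'Sprocket');
""")

# SQL expresses the Customers-to-Orders relationship as a join.
for name, item in db.execute(
        "SELECT Customers.name, Orders.item "
        "FROM Customers JOIN Orders ON Orders.customer_id = Customers.id"):
    print(name, item)

The REFERENCES clause is the relationship described above: every order carries the id of the customer it belongs to, and the database itself keeps the two tables consistent – exactly the bookkeeping that NoSQL systems hand back to the programmer.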
For most of us the relational database works just fine. From Microsoft Access for small to moderate-sized databases, up to Oracle, SQL Server or free offerings such as MySQL, the relational database is here to stay. But for the largest database applications, Big Data provides a solution that relational databases often cannot. Big Data is in its infancy, which means there will be problems and rapid evolution. But it is interesting to note that the most widely used Big Data technology is Hadoop and its associated utilities – an open-source offering. So perhaps there will be some level of standardization – which is always a good thing.