Database: Making The Case for Keeping Data
Every so often, data that is still useful gets erased systematically and somewhat arbitrarily. The reasons vary, but they are rarely good ones. My situation was this: a table that records tracking numbers was cleared with each update, every six hours. The table's only role was to transfer information from one system to another. The problem occurred when the system that retrieves the information was unable to perform its import before the next update (due to a server failure, a code error, an exception, etc.). Because the data was cleared at the next update, it was never transferred.
WHY KEEP YOUR INFORMATION INSTEAD OF ERASING IT?
- Hard disk space is inexpensive.
- It makes problems easier to track down.
- It keeps things visible when a bug or error occurs.
HOW DO YOU OVERCOME A PRODUCTION SYSTEM THAT ERASES USEFUL INFORMATION?
If you have control over the code that deletes the data, remove or modify it. If not, use a cron job that copies the information into a permanent table. Create a backup copy of the table that holds the information, and make sure it has the following columns (a schema sketch follows the list):
- id - INT
- created_at - TIMESTAMP
- updated_at - TIMESTAMP
- status - INT
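Here is a minimal sketch of what that backup table and the cron-driven copy might look like, in MySQL-style SQL. The table and column names (tracking_numbers, tracking_numbers_archive, tracking_number) are assumptions for illustration, not the names from my actual system.

```sql
-- Hypothetical backup table for the tracking-number data.
-- Names and column sizes are illustrative assumptions.
CREATE TABLE tracking_numbers_archive (
    id INT NOT NULL AUTO_INCREMENT,
    tracking_number VARCHAR(64) NOT NULL,
    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
        ON UPDATE CURRENT_TIMESTAMP,
    status INT NOT NULL DEFAULT 0,  -- 0 = NEW_DATA (see the status values below)
    PRIMARY KEY (id)
);

-- Run from a cron job: copy anything in the source table that has not
-- been archived yet (an anti-join on the tracking number avoids duplicates).
INSERT INTO tracking_numbers_archive (tracking_number)
SELECT t.tracking_number
FROM tracking_numbers t
LEFT JOIN tracking_numbers_archive a
       ON a.tracking_number = t.tracking_number
WHERE a.id IS NULL;
```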
The dates make it possible to place each row on a timeline, so we can verify that the information was created or modified at the right time. The status column tracks the state of a row using an integer. Depending on the operations you want to perform on the information, you can assign an integer that identifies the row's state (e.g. 0 = NEW_DATA, 1 = DATA_DONE, 2 = BAD_DATA, 3 = REMOVE_DATA, etc.).
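As a sketch of how those status values might be used, assuming the hypothetical tracking_numbers_archive table above, the importer can flag the rows it has consumed:

```sql
-- Hypothetical status values, matching the example mapping above:
-- 0 = NEW_DATA, 1 = DATA_DONE, 2 = BAD_DATA, 3 = REMOVE_DATA.

-- After a successful import, mark the rows it consumed as done.
UPDATE tracking_numbers_archive
SET status = 1          -- DATA_DONE
WHERE status = 0;       -- only rows that were still NEW_DATA
```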
You can still erase the information as it is processed, if you wish. A cron job that erases data with a certain status after a certain duration can do the job as well. But by letting the data live longer and establishing a real process for deletions, you keep a history that can be very useful should problems arise.
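A sketch of such a cleanup job, again assuming the table and status values from the examples above; the 30-day retention window is an arbitrary assumption to tune for your own situation:

```sql
-- Cron-driven cleanup: remove rows that were imported (DATA_DONE) or
-- flagged for removal (REMOVE_DATA) and have not changed in 30 days.
DELETE FROM tracking_numbers_archive
WHERE status IN (1, 3)                      -- DATA_DONE, REMOVE_DATA
  AND updated_at < NOW() - INTERVAL 30 DAY; -- MySQL-style interval syntax
```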
Some disadvantages to this are that there is now a duplication of data (two tables storing the tracking number, in my case), which leads to a bigger database and more space used on the server.
However, if we follow my example, the data itself only needs to live until the next import. So in this case the duplication is minimal: we only duplicate data between one update and the next import. At the next update, the existing cleanup script deletes the original data. And, as mentioned above, at every import we can clean up the duplicated data (i.e. delete imported rows we no longer need), or have a cron job delete them.
Since we’re not duplicating vast amounts of data, or keeping the duplicates around for long periods of time, the impact on disk space is also minimal.