Cloning and Cleansing: Master or Disaster

Copying data seems to have become an almost daily requirement. And, I’ll admit it, when asked to automate this, I’m happy to help. Automating data and database copies for development, test and acceptance environments is a daily occurrence and often has a valid business case.

The simplicity and power of an automated procedure immediately present a major pitfall. Storage is expensive, especially flash storage in an enterprise computing environment such as hyperconverged infrastructure (HCI). Our data usage keeps growing and sometimes seems to become unmanageable. In this blog Tino Dudink, DBA Consultant and Senior Database Reliability Engineer, shares his view on data usage.

Tino Dudink

DBA Consultant and Senior Database Reliability Engineer

Current demand

I often compare it with reports: it has become an unconscious habit to carry them over one-to-one during migrations and upgrades, when it would make more sense to ask whether each report still meets a current demand. The same goes for cloning and copying data. Copying only a selection of data (tables, sources) to a staging area or data lake is somewhat more efficient in terms of data usage and storage requirements, but more costly to maintain when changes are made, despite all the innovations available.

Data virtualization has started a trend in which data is no longer stored in multiple places. The disadvantage: these solutions come with a price tag and require a full implementation project.

After cloning comes cleansing

Choices, choices. And we haven’t even talked about cleansing yet. Because after all that saving and copying, there comes a time when cleaning up becomes necessary: technically, operationally, but also legally and financially. Legally, in the context of the GDPR (known in the Netherlands as the AVG). Financially, because scaling enterprise storage eventually becomes quite expensive.

From a technical management perspective (contingency, backup) and from an operational perspective (lead times of daily backups and maintenance windows) it is also necessary to keep a close eye on the size of databases, storage volumes and data. I still come across a lot of financial, operational and other business software that keeps writing data to databases without any provision for archiving and cleaning up that data in due course.

Magic word

‘Cleansing by design’ could and should be the magic word, just as ‘privacy by design’ is today. Otherwise, the databases of logistical and financial data will continue to grow every year. I would argue that this theme is not old-fashioned but, because of ongoing digitalization, more topical than ever. It would be a good idea if every developer who builds a procedure that stores data also designed and implemented an archiving and deletion procedure.

Yet this is often not considered up front and only gets priority when the need is greatest. Whether the subject is partitioning, archiving or cleansing, it usually takes a data breach, a mission-critical system failure or some other escalation before it is prioritized and funded.
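To make the idea of an archiving and deletion procedure concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. The table names (orders, orders_archive), the created_at column and the cutoff date are assumptions made up for this illustration; a production version would depend on your own schema, database platform and retention policy.

```python
import sqlite3


def archive_and_purge(conn, cutoff_date):
    """Move rows older than cutoff_date to the archive table, then delete them.

    The orders / orders_archive tables and the created_at column are
    hypothetical; dates are stored as ISO strings so they compare correctly.
    """
    with conn:  # one transaction: both steps are committed, or neither is
        conn.execute(
            "INSERT INTO orders_archive (id, created_at, amount) "
            "SELECT id, created_at, amount FROM orders WHERE created_at < ?",
            (cutoff_date,),
        )
        deleted = conn.execute(
            "DELETE FROM orders WHERE created_at < ?", (cutoff_date,)
        ).rowcount
    return deleted


# Demo on an in-memory database with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, created_at TEXT, amount REAL)")
conn.execute("CREATE TABLE orders_archive (id INTEGER PRIMARY KEY, created_at TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders (created_at, amount) VALUES (?, ?)",
    [("2015-03-01", 10.0), ("2018-07-15", 20.0), ("2024-01-10", 30.0)],
)
conn.commit()

moved = archive_and_purge(conn, "2020-01-01")
```

Running the archive and the delete in one transaction means a failure halfway through cannot leave rows that are neither in the live table nor in the archive. Whether archived rows are later deleted for good is a retention decision, not a technical one.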

Data anonymization

That brings me to the final piece that connects it all: anonymizing data. These days, data cloning must always be accompanied by anonymization of customer data and other confidential data. This is, of course, related to the GDPR (AVG), but there can also be business reasons. The potential reputational damage from incidents in which data was not handled properly is considerable.

Good packages are available to arrange this, but you can also opt for customization: control, manageability and future-proofing are good arguments for wanting to keep it in your own hands.
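As an impression of what a small custom solution might look like, the sketch below replaces the name and e-mail address in a customer row with a keyed hash (HMAC-SHA256) during cloning. The field names, the secret key and the example row are assumptions for this illustration only. Strictly speaking this technique is pseudonymization rather than full anonymization: whoever holds the key can reproduce the mapping, so the key must never travel with the clone.

```python
import hashlib
import hmac

# Hypothetical placeholder; in practice the key is stored outside the cloned environment.
SECRET_KEY = b"replace-with-a-secret-kept-outside-the-clone"


def pseudonym(value: str, prefix: str) -> str:
    """Return a deterministic replacement, so equal inputs stay equal across tables."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}_{digest[:12]}"


def anonymize_customer(row: dict) -> dict:
    """Copy a customer row and replace the directly identifying fields."""
    masked = dict(row)  # leave the source row untouched
    masked["name"] = pseudonym(row["name"], "name")
    masked["email"] = pseudonym(row["email"], "user") + "@example.invalid"
    return masked


# Made-up example row.
source_row = {"id": 42, "name": "Jan Jansen", "email": "jan@example.com", "country": "NL"}
clone_row = anonymize_customer(source_row)
```

Because the replacement is deterministic, the same customer gets the same pseudonym in every table, which keeps joins in the test environment working. Fields that identify people indirectly, such as addresses or free-text notes, need their own treatment.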

Data-efficient approach

Such a solution can range from very simple to more complex procedures, as part of data integration and data copy procedures. Alongside the proper use of least privilege in IT security, a data-efficient approach is also appropriate here. Just as good practice says a SELECT query should not use * or include unnecessary columns, the same applies even more strongly to data integration and data movement solutions.

Each duplication of personal data doubles the risk and thus the effort to keep this risk manageable. Nothing new under the sun, but drawing extra attention to something as important as this can do no harm.
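Applied to a copy job, the rule about not selecting * could look like the sketch below: only an explicit list of columns ever leaves the source. The customers table, its columns and the allow-list are assumptions for this illustration, and sqlite3 stands in for whatever database platform is actually in use.

```python
import sqlite3

# Hypothetical allow-list: only the columns the target environment really needs.
COLUMNS_TO_COPY = ("id", "country", "order_total")


def copy_selected_columns(source, target, table, columns):
    """Copy only the named columns of `table` from source to target.

    `table` and `columns` are interpolated into the SQL text, so they must
    come from trusted configuration, never from user input.
    """
    column_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    rows = source.execute(f"SELECT {column_list} FROM {table}").fetchall()
    with target:  # commit on success, roll back on error
        target.executemany(
            f"INSERT INTO {table} ({column_list}) VALUES ({placeholders})", rows
        )
    return len(rows)


# Demo with in-memory databases; the customers table is made up for illustration.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE customers (id INTEGER, name TEXT, email TEXT, country TEXT, order_total REAL)"
)
source.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?, ?)",
    [(1, "Jan Jansen", "jan@example.com", "NL", 120.5),
     (2, "An Peeters", "an@example.com", "BE", 80.0)],
)

target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE customers (id INTEGER, country TEXT, order_total REAL)")

copied = copy_selected_columns(source, target, "customers", COLUMNS_TO_COPY)
```

Names and e-mail addresses never reach the target in this setup, so there is nothing to anonymize or clean up there later. Personal data that is not copied cannot leak from the copy.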

Data governance, the forgotten systems

Finally, I want to draw attention to forgotten systems. In mergers and acquisitions, IT infrastructures are integrated and consolidated. But what should be done with the obsolete systems that are labelled legacy and whose original administrators and owners have left? These are servers and databases that remain operational for years(!) without patches, because the underlying operating systems and database management systems are no longer updated, or can no longer be. I encounter this regularly, and it requires attention, time and effort.


At OptimaData, a team of driven specialists works on these and other topics. We are happy to assist organizations in designing, implementing and transitioning to a future-proof and robust (cloud) solution that always pays appropriate attention to databases and data. Please feel free to contact us without any obligation.