Do you recognize this? You look back at your work and see that you've spent far too long on repetitive, boring, manual tasks that are not sustainable. Worse, their number only increases as the environment grows. For a Product Owner or IT Manager, this is hard to see. You only see that a team does not get around to producing deliverables in the development sprints. You don’t see them drowning in putting out fires and making manual database changes. How beautiful the world would be if the DBA could lean forward in DevOps teams, speak the same language as the software and system engineers, empower team members and act as a trusted advisor. Everyone wants a DBA like that! In this blog, we go one step further in talking about useful redundancy and how you can create a stable and secure work environment with the shift to Database Reliability Engineering.
What is Database Reliability Engineering?
There are several blogs, articles, keynote talks and even a complete book dedicated to Database Reliability Engineering. Yet it has not really landed in the Netherlands. In the US (think Google, Pinterest, Twitter and more of those tech giants) it has become a standard role in DevOps teams. To understand the vision behind Database Reliability Engineering, Laine Campbell and Charity Majors’ book Database Reliability Engineering is a must-read. Hamish Watson’s blogs (The Hybrid DBA) are also inspiring reads. But what exactly is it? A short, unambiguous definition of Database Reliability Engineering (DBRE) is not easy to give, but I think it is best described this way:
Database Reliability Engineering creates and facilitates a stable, secure and scalable data platform that is integral to the infrastructure and application landscape within your organization, and it is the connecting bridge between software development and operations.
You could describe the basic principles of DBRE as follows:
Securing the data with a flawless backup and recovery strategy and a secure operation. If this works well and there is one hundred percent trust, this also creates a safe working environment where mistakes can be made and people need not fear that something will be broken. In no time, a new server is spun up and a backup is restored, even automatically. No worries. This also speeds up development because there is much more room for experimentation.
Elimination of toil
By efficiently applying standardization and automation, boring, unchallenging, error-prone, repetitive manual operations are eliminated, freeing up time for expert input, advice and optimization. Behold the transition from reactive to proactive.
Based on his knowledge and expertise, the DBRE establishes a set of guidelines that helps software and system engineers do a lot independently without fear of something going wrong. The gatekeeper becomes a co-thinker.
Self-service for scale
We no longer build databases, but we prepare deployment scripts, including configuration, best practices and even health-check parameters, so that any software engineer can spin up a database instance of any flavor (MySQL, PostgreSQL, MongoDB). A platform where different datastores can be used, with a lifecycle of days or even hours if necessary. From release to deployment.
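As an illustration of what such a self-service request could look like, here is a minimal Python sketch. It is not a real provisioning tool: the names `InstanceRequest`, `render_spec` and `ENGINE_DEFAULTS` are hypothetical, and the health-check values are placeholders. Only the port numbers are the engines' well-known defaults. The idea is that the DBRE maintains the defaults once, and a software engineer only supplies a name, an engine and a lifetime.

```python
from dataclasses import dataclass

# Hypothetical team defaults maintained by the DBRE. Ports are the engines'
# well-known defaults; the health checks are illustrative placeholders.
ENGINE_DEFAULTS = {
    "mysql": {"port": 3306, "health_check": "SELECT 1"},
    "postgresql": {"port": 5432, "health_check": "SELECT 1"},
    "mongodb": {"port": 27017, "health_check": "ping"},
}


@dataclass
class InstanceRequest:
    """What a software engineer fills in to request a database instance."""
    name: str
    engine: str
    ttl_hours: int = 24          # instances are short-lived unless stated otherwise
    backups_enabled: bool = True  # backups are on by default


def render_spec(request: InstanceRequest) -> dict:
    """Combine the request with the team defaults into one deployment spec."""
    if request.engine not in ENGINE_DEFAULTS:
        raise ValueError(f"unsupported engine: {request.engine!r}")
    if request.ttl_hours <= 0:
        raise ValueError("ttl_hours must be positive")
    defaults = ENGINE_DEFAULTS[request.engine]
    return {
        "name": request.name,
        "engine": request.engine,
        "port": defaults["port"],
        "health_check": defaults["health_check"],
        "ttl_hours": request.ttl_hours,
        "backups_enabled": request.backups_enabled,
    }


if __name__ == "__main__":
    print(render_spec(InstanceRequest(name="orders-test", engine="postgresql", ttl_hours=4)))
```

In practice the resulting spec would be handed to whatever deployment tooling the team already uses. The sketch only shows how best practices can travel with the request instead of living in the DBA's head.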
Databases are no special snowflakes
We traditional DBAs used to think they were, though. Compare the well-known metaphor of pets versus cattle. There used to be DBAs who named their servers, and in the small hours even talked to them and pampered them. In modern data-driven cloud platforms, databases are like cattle: each one is a number, and when it is ill it is isolated and even replaced.
Eliminate the barriers between software and operations
Nothing is someone else’s problem anymore. We do it together. Modern DBAs are engineers, not administrators; we build things and create things. We readily say goodbye to things we have built, because the goal is the final result. Learn to code and embrace the chosen tooling. Don't come up with self-developed tools like many independent DBAs do; that would again encourage compartmentalization.
One blog does not provide enough space to give a broad and comprehensive view of Database Reliability Engineering. To make a start, in this blog we highlight 'Elimination of toil', as a follow-up to our previous blog on useful redundancy.
Where did we come from?
The traditional DBA worked best in a silo, a pillar. Thus, the DBA fit seamlessly into the DTAP pipeline (development, test, acceptance, production) and the waterfall method that was still the standard in software development at the time. This integrated beautifully, and so the DBA could function primarily as a gatekeeper and guardian of the data and the stability of the database. It was his domain and island, and most DBAs knew how to keep everyone at a distance. We all know the jokes. But how do you bring down that database silo and break the stigma? How do you integrate databases more into development processes, so that, for example, manual database changes also become deployments?
Where to start with DBRE?
Creating useful redundancy or ‘elimination of toil’ – eliminating difficult, boring and error-prone work – starts with:
- Standardization and documentation
- Automation
- Time management
These are three good initial steps towards Database Reliability Engineering.
1. Standardization and documentation
We provide Managed Services for a large number of different clients, with 24/7 support. We insist on standardization and solid documentation, and we do so from experience. You don’t want to spend unnecessary time at 2 am looking for that one logfile or config.sys. It happens all too often, and we too are sometimes confronted with it when dealing with new or non-Managed Services customers: spending the first hour after the call searching. Searching for the right config files, for which server exactly is active, and for which application connects to or communicates with which database server. Let alone logging in. And the documentation? Oh yeah, that’s what that outgoing DBA was supposed to do before he left...
There are as many standards as there are people. What makes sense to some may be gibberish to others. With agreed standards, it is quick and easy to understand the setup and what has already been done. You can then quickly see where a system deviates from the standard and resolve the issue. The same applies to choosing the right tool. Don't automatically choose the newest and coolest tool; choose together the tool that is best for the task. In documentation, too, we all have our own way of phrasing things. Therefore, have your documentation reviewed and updated by another colleague or team member. Test your documentation by having an uninvolved colleague log in and search for X.
Another point of attention regarding tools: as much as possible, use the tools that the customer and its developers use. Embrace that language, that code or that tool. That way you build a bridge. That is the first step towards integration. By bringing in your own toolsets, or even ones you created yourself (independent database professionals have a habit of doing this), you reinforce the compartmentalization and create a silo. Stop this as soon as possible!
Doing so will already save you a lot of time. No more time lost searching: you are almost immediately where you need to be. You can skip the homework.
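To make "quickly see what the deviations are" concrete, here is a small Python sketch of a drift check. It assumes you can read both the agreed standard and a server's actual settings into plain dictionaries. The function name `find_deviations` and the example settings are hypothetical and chosen only for illustration.

```python
# Marker used when a setting is absent on one side of the comparison.
NOT_SET = "<not set>"


def find_deviations(standard: dict, actual: dict) -> dict:
    """Return {setting: (expected, found)} for every setting that differs.

    Settings missing from either side are reported with NOT_SET, so both
    forgotten settings and undocumented extras show up.
    """
    deviations = {}
    for key in sorted(standard.keys() | actual.keys()):
        expected = standard.get(key, NOT_SET)
        found = actual.get(key, NOT_SET)
        if expected != found:
            deviations[key] = (expected, found)
    return deviations


if __name__ == "__main__":
    # Illustrative values only, not recommendations.
    standard = {"max_connections": 200, "log_directory": "/var/log/db", "ssl": "on"}
    actual = {"max_connections": 500, "log_directory": "/var/log/db", "tmp_dir": "/scratch"}
    for setting, (expected, found) in find_deviations(standard, actual).items():
        print(f"{setting}: expected {expected}, found {found}")
```

A report like this is only as good as the standard behind it, which is exactly why the standard itself deserves a review by a colleague.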
2. Automation
There are many tasks you can automate. For example, with Terraform or Ansible you can write deployment scripts that deploy a complete HA clustered DBMS instance, including best practices, health-check triggers, and a full backup and recovery setup. And you can do it in a few minutes, whereas before you'd be toiling away for half a day with a high risk of errors and the fear of having to do it again tomorrow because it broke down. But also consider:
- High Availability (with automatic failover)
- Server Discovery (knowing where your servers are and which server is ‘in charge’)
- Observability (or monitoring what happens)
- Disaster recovery (backup and recovery strategy or plan)
By automating and testing this regularly, you push repetitive work into the background and can rely on certain processes to run automatically, which will save you a lot of time here as well. But also think about calm and routine. How much more peace and confidence do you create when your team knows that the backup is always working? Try it: turn off your server and restore your backup. Stressful? But really nice when it works just the way you thought it would.
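A restore test like this can itself be automated. The Python sketch below shows the idea on a deliberately small scale, with SQLite from the standard library standing in for your real DBMS: take a backup, open the restored copy, and compare row counts per table. The function names are our own, and a real drill would of course use your platform's backup tooling and richer checks than row counts.

```python
import sqlite3


def backup_database(source: sqlite3.Connection) -> sqlite3.Connection:
    """Copy the source into a fresh in-memory database (stand-in for backup + restore)."""
    target = sqlite3.connect(":memory:")
    source.backup(target)  # sqlite3's built-in online backup
    return target


def table_counts(conn: sqlite3.Connection) -> dict:
    """Return {table_name: row_count} for every table in the database."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    return {t: conn.execute(f'SELECT COUNT(*) FROM "{t}"').fetchone()[0] for t in tables}


def restore_drill(source: sqlite3.Connection) -> bool:
    """Back up, 'restore', and verify that the copy matches the source."""
    restored = backup_database(source)
    try:
        return table_counts(source) == table_counts(restored)
    finally:
        restored.close()


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
    db.executemany("INSERT INTO orders (total) VALUES (?)", [(9.5,), (20.0,)])
    db.commit()
    print("restore drill passed" if restore_drill(db) else "restore drill FAILED")
```

Scheduling a drill like this regularly turns "we think the backup works" into something the team can actually rely on.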
3. Time Management
We are quite capable of multitasking, but there is a limit. As a rule of thumb, you can manage about two side tasks alongside one primary task, and then it stops. Beyond that you lose focus, while focus is key as a DBRE. Attention to detail is one of the strongest qualities and added values of the database professional, and it can be lost when you multitask too much. How do you prioritize, control the issues of the day and show what you are doing? A tool like Kanban can be of great help here. You can’t really plan operations, and Agile and incident management are not necessarily the best match. The trick is to get Agile and ITIL to fit together. For example, out of a 40-hour week, set aside 16 hours for changes; these are available for the Kanban board. The other 24 hours are for incident management, acting as a trusted advisor, expanding and optimizing automation scripts, training developers on new database engines, and helping and advising on query optimization, application connection and data modeling.
The primary task of the DBRE is to make sure the platform runs well and can take a hit. But you also need to be sure that failover works, that backups do what they are supposed to do, and that if something breaks down, the platform itself will designate a new master and spin up a new slave. This gives you the opportunity to link up with other DevOps teams, to brainstorm about query optimization, how the application can best communicate with the platform and which database engine fits the data model or the application best. Listen to and contribute expertise, up front. Managing after the fact is time consuming and slows things down.
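To give a feel for what "the platform itself will designate a new master" involves, here is a deliberately simplified Python sketch. It assumes the platform already knows, per replica, whether it is healthy and how far it lags behind. The `Node` type and `choose_new_primary` function are hypothetical. Real failover tooling also has to deal with quorum, fencing of the old master and much more, none of which is shown here.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Node:
    """A replica as seen by the (hypothetical) platform."""
    name: str
    healthy: bool
    replication_lag_s: float  # seconds behind the failed master


def choose_new_primary(replicas: Sequence[Node]) -> Optional[Node]:
    """Pick the healthy replica with the least replication lag.

    Ties are broken by name so the choice is deterministic.
    Returns None when no healthy replica is available.
    """
    candidates = [node for node in replicas if node.healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda node: (node.replication_lag_s, node.name))


if __name__ == "__main__":
    replicas = [
        Node("db-2", healthy=True, replication_lag_s=4.0),
        Node("db-3", healthy=False, replication_lag_s=0.0),
        Node("db-4", healthy=True, replication_lag_s=0.5),
    ]
    chosen = choose_new_primary(replicas)
    print(chosen.name if chosen else "no healthy replica available")
```

The point of the sketch is the decision rule being explicit and testable: once it is written down as code, the team can review it up front instead of improvising during an incident.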
Do you also want to apply Database Reliability Engineering?
Do you see the advantages and want to adopt this way of working? Do you want to build bridges and break down silos in your organization? Come and talk to us. We can advise you and help you with the first steps, or even take over your database management. Feel free to contact us, with no strings attached.