What is a resilient infrastructure?
Before we define what a resilient infrastructure is, I would like to give a glimpse into what we mean by the word infrastructure. In the software world, an infrastructure is usually the platform that supports the applications built by programmers, software developers and/or engineers. It usually consists of the components needed for the smooth running of an application that supports a business. There is no rule of thumb for how to design or set up your infrastructure; it usually depends on the demands of the applications the infrastructure has to support. Some of the components that make up an application's infrastructure are servers, the network model, storage, application monitoring tools, runtime interpreters, application logging tools (which log every detail of what happens within the application and can be used to detect issues) as well as a security framework.
Having gained an insight into what infrastructure means, let us talk about what a resilient infrastructure is. You would call an infrastructure resilient when the entire system that makes it up has the ability to handle, and eventually recover from, unexpected conditions.
The rising need for Resilience in modern infrastructures
In a world where businesses are increasingly powered by software solutions, and with the rise in the adoption of software applications by users, the need for resilient infrastructure cannot be overemphasized. At this point, I will list what I call the measures of resilience in infrastructures:
i. Availability
ii. Disaster recovery
iii. Fault tolerance
Availability, just as the name implies, is a measure of how available your software system is to your users or customers. The target usually varies from company to company, depending on how critical the service they render is. For example, one company might prioritize shipping features faster over having a high percentage of availability, because hitting the market first with features is more important to them, and if customers experience hiccups on first release, they just fix them on the go. Another company, a Netflix or an Amazon for example, cannot afford to be down at any time, hence they target 99.999% or 99.9999% availability, while the company referenced earlier is okay with 99% availability. So, availability targets differ from business to business depending on their priorities.
Availability = MTTF / (MTTF + MTTR)
where MTTF = Mean Time to Failure and MTTR = Mean Time to Recover
The idea is to have an MTTF that is much larger than the MTTR, so that the ratio tends to one and, when multiplied by 100, gives the 2, 3, 4 or 5 nines of availability as a percentage (i.e. 99%, 99.9%, 99.99% or 99.999%).
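To make the formula concrete, here is a small sketch in Python. The function names and the sample figures (999 hours of uptime for every 1 hour of recovery) are my own assumptions for illustration, not measurements from any real system.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Return availability as a fraction: MTTF / (MTTF + MTTR)."""
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_minutes_per_year(avail: float) -> float:
    """Translate an availability fraction into expected downtime per year."""
    return (1 - avail) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Hypothetical system: fails on average every 999 hours, recovers in 1 hour.
    a = availability(mttf_hours=999, mttr_hours=1)
    print(f"availability: {a * 100:.3f}%")
    print(f"downtime per year: {downtime_minutes_per_year(a):.1f} minutes")
```

Notice that you can improve the result from either side: by failing less often (a larger MTTF) or by recovering faster (a smaller MTTR).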
Disaster recovery is the ability to respond to severe events that affect the smooth running of an infrastructure. To be able to respond to such events, there has to be a plan, which can be termed a disaster recovery plan. This plan helps you roll back to normalcy whenever a disaster capable of shutting down your infrastructure occurs. One of the ways this is ensured in a DevOps process is to automate the recovery processes and set up adequate backups that run in the background on specified schedules. Your infrastructure always needs to be equipped with processes that bring it back from a fatal failure.
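As a rough illustration of automated backup and recovery, the sketch below copies a file into a backup folder, keeps only the most recent copies, and restores the latest one on demand. Everything in it (the function names, the `.bak` naming scheme, the `orders.db` file) is hypothetical, and I am assuming the scheduling itself is handled by something like cron or your CI/CD tooling.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List, Tuple


def list_backups(backup_dir: Path, name: str) -> List[Path]:
    """Return existing backups of `name`, oldest first."""
    return sorted(backup_dir.glob(f"{name}.*.bak"))


def create_backup(source: Path, backup_dir: Path, keep: int = 3) -> Path:
    """Copy `source` into `backup_dir` and keep only the newest `keep` copies."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    backup_dir.mkdir(parents=True, exist_ok=True)
    existing = list_backups(backup_dir, source.name)
    # Zero-padded sequence numbers keep the backups sortable by name.
    next_index = int(existing[-1].name.split(".")[-2]) + 1 if existing else 1
    target = backup_dir / f"{source.name}.{next_index:06d}.bak"
    shutil.copy2(source, target)
    for old in list_backups(backup_dir, source.name)[:-keep]:
        old.unlink()
    return target


def restore_latest(name: str, backup_dir: Path, destination: Path) -> Path:
    """Overwrite `destination` with the most recent backup of `name`."""
    existing = list_backups(backup_dir, name)
    if not existing:
        raise FileNotFoundError(f"no backups found for {name}")
    shutil.copy2(existing[-1], destination)
    return existing[-1]


def demo() -> Tuple[str, int]:
    """Back up four versions, simulate a disaster, then recover."""
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "orders.db"
        backups = Path(tmp) / "backups"
        for version in ("v1", "v2", "v3", "v4"):
            data.write_text(version)
            create_backup(data, backups, keep=2)
        data.write_text("corrupted")  # the "disaster"
        restore_latest(data.name, backups, data)
        return data.read_text(), len(list(backups.iterdir()))


if __name__ == "__main__":
    print(demo())
```

A real plan would also store the backups away from the primary system and rehearse the restore step regularly, since a backup you have never restored is only a hope.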
Fault tolerance is the property that enables a system to keep responding when a hardware or software component fails. Fault tolerance can be built into a system to remove single points of failure. This is achieved by designing the system in such a way that when one small part fails, the entire system doesn't go down as a result.
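One simple way to picture this is graceful degradation: each part of a page is produced independently, and a failure in one part is replaced by a placeholder instead of taking the whole page down. The component names below are invented for illustration only.

```python
from typing import Callable, Dict


def render_page(
    components: Dict[str, Callable[[], str]],
    fallback: str = "[temporarily unavailable]",
) -> Dict[str, str]:
    """Render every component, isolating failures to the component that raised."""
    page = {}
    for name, render in components.items():
        try:
            page[name] = render()
        except Exception:
            # The failing component degrades; the rest of the page still renders.
            page[name] = fallback
    return page


def broken_recommendations() -> str:
    """Hypothetical component whose backing service is down."""
    raise TimeoutError("recommendation service did not respond")


if __name__ == "__main__":
    page = render_page(
        {
            "catalog": lambda: "10 products",
            "recommendations": broken_recommendations,
            "cart": lambda: "2 items",
        }
    )
    print(page)
```

In a real system you would also log the swallowed error so that the failure is visible to whoever is on call.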
What it takes to have a resilient infrastructure
Harness the power of Diversified Infrastructure
It is now more common for businesses to go cloud-first. This is very good; however, relying on a single cloud or CDN provider might not give you a resilient setup. Hence, a diversified infrastructure approach is fast becoming one of the strategies for building a resilient system. It gives the business flexibility and allows you to share your workloads among different cloud providers while also minimizing cost. What is the advantage of this? Once you have set up failover between two providers, downtime at one provider has minimal impact on your users and the business, because the other provider keeps serving your business needs.
Redundancy can be built in at the software development/deployment level. This means we can have one instance of the software application on cloud infrastructure from a provider and another on in-house infrastructure, or what we call on-premise infrastructure. An automated pipeline keeps the state of the two instances the same (more like a system of data synchronization). So when the cloud service fails, the on-premise instance keeps the business online.
I have a previous article where I talked about microservices. You might want to check that out. How does this help in building resilience into our infrastructure? The microservices architecture by default works as a distributed system. This in turn means that, at the design stage, redundancy can be built into the individual components of the application. Ultimately, when one aspect of the application comes under stress at scale, only that part is affected and it can be scaled individually. This is evident resilience.
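The failover idea can be sketched as follows. This is only an illustration under my own assumptions: the two handlers stand in for a cloud instance and an on-premise instance of the same application, and in practice failover is often done at the DNS or load balancer level rather than in application code.

```python
from typing import Callable, List, Sequence, Tuple


class AllProvidersFailed(Exception):
    """Raised when no provider could serve the request."""


def call_with_failover(
    providers: Sequence[Tuple[str, Callable[[str], str]]],
    request: str,
) -> Tuple[str, str]:
    """Try each provider in priority order; return (provider_name, response)."""
    errors: List[str] = []
    for name, handler in providers:
        try:
            return name, handler(request)
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


def cloud_instance(request: str) -> str:
    """Hypothetical cloud deployment that is currently down."""
    raise ConnectionError("cloud region outage")


def on_premise_instance(request: str) -> str:
    """Hypothetical on-premise deployment kept in sync with the cloud one."""
    return f"handled {request}"


if __name__ == "__main__":
    providers = [("cloud", cloud_instance), ("on-premise", on_premise_instance)]
    print(call_with_failover(providers, "GET /orders"))
```

The order of the list expresses priority, so the on-premise instance only takes traffic when the cloud instance cannot.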
Introduce Chaos Engineering
Chaos engineering is a practice that intentionally introduces problems to identify points of failure in systems. Intentionally introducing chaos into controlled production environments can help identify weaknesses in our infrastructure and, in turn, help engineers proactively mitigate problems before they actually occur. It can also provide insight into how much resilience needs to be put in place for a system.
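A toy version of a chaos experiment looks like the sketch below. It wraps a dependency so that it fails at a chosen rate, then measures how many requests still succeed when the caller retries. The names, the 20% failure rate and the retry policy are all assumptions made for the example; dedicated chaos tools inject faults at the infrastructure level rather than inside a function.

```python
import random
from typing import Callable


def with_chaos(
    func: Callable[[], str], failure_rate: float, rng: random.Random
) -> Callable[[], str]:
    """Wrap `func` so that it raises an injected fault at `failure_rate`."""
    def wrapper() -> str:
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")
        return func()
    return wrapper


def call_with_retries(func: Callable[[], str], attempts: int) -> str:
    """Call `func` up to `attempts` times, re-raising the last fault."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for remaining in range(attempts - 1, -1, -1):
        try:
            return func()
        except ConnectionError:
            if remaining == 0:
                raise
    raise AssertionError("unreachable")


def run_experiment(
    requests: int = 1000, failure_rate: float = 0.2, attempts: int = 3, seed: int = 42
) -> float:
    """Return the fraction of requests that succeed under injected faults."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    flaky_dependency = with_chaos(lambda: "ok", failure_rate, rng)
    successes = 0
    for _ in range(requests):
        try:
            call_with_retries(flaky_dependency, attempts)
            successes += 1
        except ConnectionError:
            pass
    return successes / requests


if __name__ == "__main__":
    print("success rate without retries:", run_experiment(attempts=1))
    print("success rate with 3 attempts:", run_experiment(attempts=3))
```

Comparing the two numbers shows the kind of insight an experiment gives you: how much a given safeguard actually buys you under failure.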
Craft SLAs and Monitor Performance Regularly
With a well-defined Service Level Agreement (SLA) between your business and your users, you know exactly what you have promised and can avoid falling short of it. Your Service Level Objectives (SLOs), which are the high-standard internal metrics you intend to abide by while delivering value to your customers, inform your SLA, which is usually set lower (less strict) than your SLO. That way, you breach your SLO internally first, before you breach your SLA, which might cost you a lot. With the advent of the cloud, there are many tools you can use to monitor the performance of your systems, from processor speed to memory spikes to failed requests or slow endpoints. This is the kind of information these tools give you in real time, and it can help you and your business stay on top of your game by responding to these reports promptly.
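The relationship between the SLO and the SLA can be checked with a few lines of arithmetic. In this sketch the targets (a 99.9% SLO and a 99.5% SLA) and the request counts are made-up figures, and "error budget" is simply the number of failures the SLO allows over the period.

```python
from typing import Dict, Union


def service_level_report(
    total_requests: int, failed_requests: int, slo: float, sla: float
) -> Dict[str, Union[float, bool]]:
    """Compare the measured success rate with the SLO and the looser SLA."""
    if not 0 < sla <= slo < 1:
        raise ValueError("expected 0 < sla <= slo < 1")
    if total_requests <= 0 or not 0 <= failed_requests <= total_requests:
        raise ValueError("invalid request counts")
    success_rate = 1 - failed_requests / total_requests
    allowed_failures = (1 - slo) * total_requests  # the error budget
    return {
        "success_rate": success_rate,
        "budget_consumed": failed_requests / allowed_failures,
        "slo_met": success_rate >= slo,
        "sla_met": success_rate >= sla,
    }


if __name__ == "__main__":
    # Hypothetical month: 100,000 requests, 200 of them failed.
    report = service_level_report(100_000, 200, slo=0.999, sla=0.995)
    for key, value in report.items():
        print(f"{key}: {value}")
```

In this made-up month the SLO is missed while the SLA still holds, which is exactly the early warning the gap between the two is meant to give you.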
There are different factors that determine which methods or practices to adopt. You have to consider how mission-critical your business is and how fast it scales. You also have to consider the cost implications and try to minimize cost. However, in all these considerations, prioritize resilience.
Thanks for reading. If there is any practice you think is important and I missed in this article, please drop it in the comments or send me a DM @oyhetola on Twitter. Cheers!