How state machines can affect your Agility
On a Monday morning I went into work to find one of our systems down, and it reminded me of one of the rules I developed while working on the 'Ops' side of DevSecOps.
A lot of people gain their Operations experience managing applications that are fairly limited in scope (e.g. UI -> Middleware -> Database). This kind of system is typically event driven, so after making a change you can fire a few calls against the UI and quickly exercise the entire chain of components. This lets you confirm immediately that the change was successful.
However, as systems become more complex, a pattern I have seen in several open source and proprietary systems is the distributed state machine. There are multiple services, each of which only picks up data whose fields hold specific values; the service performs an action and, once complete, updates the data object so the next service can claim it. Once you have a sequence of services operating in this manner, you have a state machine, as the sketch below illustrates.
This directly impacts your ability to confirm a change, because you need to see data travel through the whole chain from the point you changed. In my experience this can take anywhere from 6 to 96 hours.
If your team works a standard week (Monday-Friday, 9-5), there is usually a rule preventing changes from being performed on a Friday. For a user-triggered, event-driven service, one day is generally enough time to confirm a change was successful, but there is always the risk that a Friday deployment fails after the person you need to fix it has gone home, or worse, fails over the weekend and you are brought in on call or face a very stressful Monday morning.
The problem with a state machine is that confirming a change can take much longer than 24 hours. If each service is triggered every 6 hours and you have 3 services, the worst case is roughly 18 hours for data to flow through the chain, so a 24-hour window still lets you confirm the system is working as expected, as the quick calculation below shows; it doesn't take many more services, or longer trigger intervals, before it no longer fits.
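A quick back-of-the-envelope check of that example (illustrative numbers only):

```python
# Worst case: a record just misses each service's scheduled run and waits a
# full trigger interval at every stage before being picked up.
services = 3
trigger_interval_hours = 6

worst_case_hours = services * trigger_interval_hours
print(worst_case_hours)  # 18 -> a 24-hour window is still enough to confirm
```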
But what if you run Apache Cassandra data centres and you want to change the tombstone rules? A 'data centre' is a cluster of Cassandra nodes, and Cassandra keeps multiple copies of each piece of data. A repair process compares the copies to confirm each one is correct. To delete data, Cassandra adds a 'tombstone' flag to it; that flag is itself a change to the data, so it is replicated to every copy. Once all copies carry the flag, the data is purged from the data centre.
Cassandra runs this as a series of services, each with its own state machine. Each service runs on its own schedule and only covers part of the data on each pass. Depending on your cluster and data size, going from a flagged data object to a purged one can take 6-72 hours.
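As an example of the kind of change involved, here is a hedged sketch using the DataStax Python driver. The setting isn't named above; the most common tombstone-related knob is a table's gc_grace_seconds (how long tombstones are kept before compaction is allowed to purge them), and the contact point, keyspace and table names below are made up.

```python
from cassandra.cluster import Cluster   # pip install cassandra-driver

cluster = Cluster(["10.0.0.1"])         # hypothetical contact point
session = cluster.connect()

# Reduce the tombstone grace period on one table to 24 hours (86400 seconds).
session.execute(
    "ALTER TABLE my_keyspace.my_table WITH gc_grace_seconds = 86400"
)
cluster.shutdown()
```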
If the tombstone process takes 72 hours and you want to change it, you need to plan the change so the whole process completes before Friday (giving you time to fix or roll back). That means the latest you can perform the change is Tuesday morning.
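A trivial way to sanity-check that window (the dates are illustrative):

```python
from datetime import datetime, timedelta

friday_cutoff = datetime(2024, 6, 14, 9, 0)   # a hypothetical Friday, 09:00
process_duration = timedelta(hours=72)        # end-to-end tombstone process

latest_start = friday_cutoff - process_duration
print(latest_start)  # 2024-06-11 09:00 -> Tuesday morning
```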
Factoring this into your deployment process is very important. If changes are limited to Monday and Tuesday, it directly impacts your ability to deliver and your team's ability to react to issues.
Once you acknowledge the impact this places on your team, you can prioritise addressing it. Quite often, when I have focussed on this kind of problem, I have been surprised by how a number of small changes can have a drastic effect on this long tail.

