On OpsDev

Don’t build software. Except when you can’t not.


I’ve heard it said that working in insurance is more boring than banking and without the pressure. But from an IT perspective there’s so much that’s done with handshakes and informal nods that creating and maintaining systems to handle all the exception cases is … tricky. So underwriting systems tend to be very custom and very particular to a company’s own way of doing business.

To complicate things, a few years ago the strategic decision was taken at the site I was working at to Not Build Software. Rather, the firm would buy in whatever they could and run their business on packaged software. Unfortunately, for the more specialized bits of the business — underwriting, for instance — the software they needed was equally specialized and so quite tricky to get right. So in practice they were really outsourcing their development to a third party, who sold them a “package that can be customized,” which really meant a fixed-price, fixed-term contract.

The particular package they bought is quite clever and can probably be made to do pretty much anything an insurance firm needs. Technically it’s a big ball o’ mud, written in Java with a single back-end database. Customisation can be done through database scripts or by bits of Java code slapped around the central core.

In theory, then, the package should have been really easy to deploy. Take a JBoss application server, shove in the EAR, point it at a database, job’s done. But like most enterprise software the package wasn’t designed with maintenance and deployment in mind: too much Dev, not enough Ops.
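To be fair, that happy path really is only a few lines of shell. Here’s a minimal sketch of it, assuming a JBoss AS 7-style layout; the EAR name, datasource, driver and credentials are all made up for illustration:

```bash
#!/bin/bash
# The happy-path deployment as sold. All names here are hypothetical.
set -euo pipefail

JBOSS_HOME=/opt/jboss
EAR=underwriting-app.ear   # the vendor-supplied package

# Point the application at its single back-end database.
# Assumes a SQL Server JDBC driver is already registered as 'sqlserver'.
"$JBOSS_HOME/bin/jboss-cli.sh" --connect \
  --command="data-source add --name=UnderwritingDS \
    --jndi-name=java:jboss/datasources/UnderwritingDS \
    --driver-name=sqlserver \
    --connection-url=\"jdbc:sqlserver://dbhost:1433;databaseName=underwriting\" \
    --user-name=app --password=secret"

# 'Shove in the EAR': the deployment scanner hot-deploys anything copied here.
cp "$EAR" "$JBOSS_HOME/standalone/deployments/"
```

Everything the sketch leaves out, of course, is where the trouble starts.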

On paper this suited the insurance company. “We’ll take the package as it is. A little bit of customisation and everything will be fine. The operations guys can deploy the software, heck, we can have the support guys do it.”

But in practice, two years down the line, there were four parallel development streams and over 30 test environments. Production deployments went from one every two weeks to ‘big bang’ releases every three months, with a succession of small patches in between. The time to deploy went up from fifteen minutes to three hours. Quality went down, and so did the rate of delivery of new features.

What went wrong?

No plan survives contact with the enemy. Or a contract lawyer

So there’s a reason why the term is ‘DevOps’.

Sometimes I’ll see job ads for ‘DevOps Engineers’ which need 4+ years’ experience with shell scripting in Bash and a bit of Puppet knowledge. At a rate rather less than half that of a competent developer. Which makes me wonder what’s happened to the ‘Dev’ part of ‘DevOps’: this is really a job description for an ‘operations team member who knows a bit of scripting’. A reasonable place to start, but those aren’t DevOps engineers.

The insurance company had plenty of network engineers, server engineers and DBAs. There was a whole department who could make Active Directory do their bidding. Marvellous, knowledgeable people all. But scripting? ‘No, that’s development. Don’t know how to do that.’

Some of the build engineers put together some neat scripts in Puppet to automatically create IaaS virtual machines in Azure and deploy clustered SQL Servers onto them. Clever stuff. Unfortunately nobody thought to check with the teams who’d use those servers, so nobody found out that the three or four hours it would take to create a test environment would be – suboptimal. “Well, then, the database should be made smaller.”

Er, no – this is packaged software; we have to play the hand we’re dealt and put up with the package we have. It might be better to split customer data from standing data, or to break the application into microservices. But without development effort that won’t happen.

Similarly, requests to move off Windows to Linux, or to update the JVM, or to re-code the presentation layer so it wasn’t stateful, or to rework the persistence layer to handle retries in case of failure… there are no developers to do that work. You’d have to ask the business analysts to prioritize those demands against underwriting requirements – and that’s a tough sell.

The anti-pattern: Ops with no Dev

So in a nutshell: non-functional requirements aren’t just response times, resilience, and database sizes: they also cover operability.

That can cover “can we monitor the system”, “can we control it easily” and “can we deploy updates easily”. It can also cover more subtle requirements like “can we have two streams of development working in parallel”, “can we automate our acceptance testing” and “can we quickly deploy small changes”.

It’s those last requirements that got lost at the insurance company. There was no ‘Dev’ in the ‘Ops’ – so totally reasonable operational NFRs like 100% infrastructure automation were prioritized above development lifecycle requirements like rapid deployment.

This approach was endemic. Some other examples:

  • Everything was deployed to fresh application servers and database servers on each release. The application servers were considered ‘cattle’, so they had random names. If an application server failed there was no quick way to find out what its role or function was – that took a query against the Puppet database or an API call to Azure (see the sketch after this list). Operability was sacrificed for ideological purity.
  • The logical unit of deployment was a set of Azure resource groups, comprising network infrastructure, load balancers, virtual machines, SQL databases, storage, a full database backup, and the application. Everything apart from the application was effectively static, changing perhaps once or twice a month – yet the whole infrastructure had to be re-deployed every time a single code change was made.
  • The tools chosen for deployment – principally Puppet – were really hard to debug and control. Even getting hold of log files (and figuring out which server they might be on) was a challenge. Debugging with anything more than print statements wasn’t an option.
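To make the first of those bullets concrete, here is roughly what identifying a failed box looked like. A sketch using the Azure CLI; the VM name and the output fields are hypothetical examples, and it assumes someone remembered to tag the machines in the first place:

```bash
#!/bin/bash
# What 'which server just died?' turned into. All names are hypothetical.
# A hostname like app-server-03 would have answered this at a glance.

FAILED_VM=vm-x7k2q

# Round-trip to the Azure API to discover the box's resource group and tags.
az vm list \
  --query "[?name=='$FAILED_VM'].{group:resourceGroup, tags:tags}" \
  --output json
```

The alternative was the Puppet database query. Either way it’s a network round-trip, quite possibly at 2 a.m., to learn something a sensible hostname would have told you for free.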

Sure, ‘buy, don’t build’ – but remember the NFRs

Nobody would buy a program that doesn’t run on the computer they use, right? Well, the firm I moved to is doing just that – they have zero Oracle knowledge and no Solaris expertise, but a key part of their authentication system is going to use a package based on that platform…

That’s an extreme example, but the point is clear: involve your operations and systems administration teams in the selection and vetting of your packaged applications.

Well, that was interesting: on changing jobs.

Both of my long-term readers know that I work as a contractor. That basically means I stick around in a job for a little while until either I get bored or burnt out, or my client gets bored or burnt out.

It’s a precarious existence, and not for everyone, but I do get to work with some interesting people and see some interesting things.

I’ve just changed jobs so it’s about time I put together a few posts on things I’ve learned in my last couple of roles. And, perhaps, some things to remember when I’m next interviewing.

More stuff about .NET

In my travels using .NET and other technology I find that I’m often asked the same questions. That, and I forget stuff myself, which is probably associated with the grey bits appearing on my head.

So this blog is really my notepad for stuff I shouldn’t forget. An electronic version of my battered old physical notepad, if you like.