Saturday, March 21, 2015

The curious case of #RPSFail

The Register screamed 'Another GDS cockup: Rural Payments Agency cans £154m IT system', a figure which, it then cried, had 'escalated to £177m', and so the media told us of a lamentable Government IT failure.

These things happen; the private sector has a terrible record of IT project failures but, fortunately, a big carpet to sweep them under. The NAO is bound to investigate.

The Register told us how GDS championed "the new agile, digital approach by the Rural Payments Agency" whilst apparently "The TFA has always been opposed to the 'digital-by-default' dogma"

The Register made it plain: it "understands that the Government Digital Service was responsible for throwing out a small number of suppliers working on RPA instead and went for a 40-plus suppliers approach - focusing too much attention on the front end, and little attention to integration between front and back."

The finger of blame pointed firmly at GDS. So, what actually happened? How did GDS get this so wrong? Well, we won't know until the NAO investigates or GDS publishes a post-mortem. Everything else is speculation, and El Reg loves to try to cause a bit of outrage. But since the media is in the speculation game, I thought I'd read up more and do a little bit of digging.

The first thing I noticed in the mix was Mark Ballard's article. It started with the line "Mark Grimshaw, chief executive of the Rural Payments Agency is either an imbecile or a charlatan, if Farmers Weekly is anything to go by."

What? I thought this was GDS' fault?

"He's been telling the agricultural press that his agency's prototype mapping tool is a failure. That's like saying a recipe is duff because your soufflé collapsed on your first trial run."

Prototype? A £177 million prototype?

"Farmers were apparently unhappy the prototype was not working as well as a production-quality system. So Grimshaw called a press conference yesterday and announced that the sky was falling down."

Hmmm. Something doesn't seem right here.

"The odd thing was that the mapping tool had only just been released as a public beta prototype. A date hadn't even been scheduled for a live roll-out."

Hence, I decided to check on the Rural Payment System (RPS) that was being built for the Rural Payments Agency (RPA). Sure enough, the RPS had only recently been released as a beta. Also, the cost is only (cough) £73.4 million. Still an awful lot, but how did it get up to £177m? I checked El Reg: it was apparently a "source".

One eye opener when checking on the Government site was that RPA would "help prevent fines (‘disallowance’) for making payments that don’t comply with CAP rules (~£600m since 2005)"

Really? How come we've been paying through the nose for compliance failures? A quick search and I stumble upon the fact that the RPS is the second incarnation of a system. The first incarnation (SPS) was so shockingly poor that in 2009 the NAO urged the DEFRA agency to replace the £350m system even though it was only 4 years old.

£350m? … Oh but it gets worse.

According to the NAO, by 2009, along with the £350m system, we had incurred an additional £304m in administration costs, £280m in disallowance and penalties, and £43m in irrecoverable overpayments. The cost per transaction was £1,743 and rising (22% over 4 years). This compared to £285 per claim under the simpler Scottish system. And this was 2009. Heaven knows the cost to date.

What marvels of genius had created this 'complex software that is expensive and reliant on contractors to maintain'? Why, expensive consultants. In fact, 100 contractors from the system's main suppliers at an average cost to taxpayers of £200,000 in 2008/2009. The SPS is so complex and "cumbersome" because of "customisation which includes changes to Oracle's source code".

Now, £73m is a bad loss but nowhere near as bad as £650m (system + administration) for the previous system with two main contractors. El Reg's wisdom of keeping it down to a few suppliers has just gone up in smoke. However, just because the past was a debacle doesn't mean the future has to be. This was one of UK Gov's exemplars and GDS had been pretty upbeat about it.

A bit more digging and I come across Bryan Glick's article. Of note:

'The Government Digital Service (GDS) introduced new controls over IT projects, designed to avoid big, costly developments depending on contracts with large suppliers. When Defra/RPA went to GDS with its initial proposal – a 300-page business case, according to one source – it was quickly knocked back.'

'Instead of a few big suppliers – , for example – RPA would be agile and user-led, with multiple small- and medium-sized suppliers.'

This all sounds very sensible. But still it went wrong … even if far less money was lost. Now, the project does sound complex: according to the article there are "multiple products involved that need integrating – more than 100, according to one insider".

Two suppliers seemed to be called out: Kainos in the article and Abaco in the comments, the latter along with the line 'this Italian company have not delivered most of the work on time and is a factor in the whole project being delayed'.

Interesting ... Who are they? Are they really involved? What is the role of any delays? So, more digging.

Kainos is the sole delivery partner for the original mapping prototype which was comprehensively praised by Defra. It's an agile development house, regarded by the Sunday Times as one of the top 100 places to work and seems to have a pretty strong pedigree.

Abaco provide SITI AGRI. The first thing I noticed, which caused my heart to sink, was 'Oracle'. Don't tell me we're customising again? It mentions "open" but I can find no record of their involvement in the open source world. It talks about SOA and even provides a high level diagram (see picture) and attractive web shots.

But are they involved or is this a red herring? I dig and discover a DEFRA document from 29/09/2014 which does in fact talk about SITI AGRI's spatial rules engine and its use in the RPA.

So why do I mention this? Well, Bryan had also noted:

'even with very few users, back-end servers would quickly reach 100% utilisation and “fall over”'

'core of the problem was identified as the interface between the mapping portal and the back-end rules engine software'

OK, so we can now take a guess that part of the RPA solution involved the graphical Kainos mapping solution providing the front end, with a connection to the services layer of the Abaco 'Oracle' based spatial rules engine system. This set alarm bells ringing.


Well, Abaco had a mapping tool and it claims to be web based - it even provides good looking screen shots. If this is the case, why not use it?

I'm clutching at straws here and this is wild speculation based upon past experience, but being able to provide a system through a browser doesn't mean it's designed to scale for web use, especially not for 100,000+ farmers. Systems can easily be designed for internal consumption only. There is a specific scenario called 'Lipstick on a Pig' where someone tries to add a digital front end to a back end not designed to cope with scale. It's usually a horror story.

Could this be what happened? A digital front end designed for scale attached to a back end that wasn't? That might explain the 'interface' problem and the 'utilisation' rates and, given an Oracle back end, I could easily believe the license fees and costs would be high. However, it doesn't ring true, and that's the problem with such speculation: context.

GDS is full of highly experienced engineers. I can't see them commissioning a back end system that doesn't scale and trying to simply bolt on a digital front end. They would know that internal web based systems are rarely capable of scaling to public usage. This would have had red flags all over it.

Something else must have happened. As to what, we'll have to wait to find out.

Bryan also noted "Contrary to some reports, the £155m RPA system has not been scrapped entirely. According to the RPA 80% of farmers have already registered using existing online". This raises the question of what can be saved for future use? What has actually been lost? What is the cost?

At best we know that there was an original monumental IT disaster caused by customising an Oracle system with expensive consultants. It cost over £650m and had mounting fees. Certainly some lessons seem to have been learned by breaking the project into small components and using a broad supply base.

An approach of prototyping and testing early with feedback seems to have been used, but there is an issue about how this has been handled by RPA. If Ballard is correct then it might be worth explaining to Chief Execs of departments that if a prototype is still undergoing development then a) don't invite everyone, b) don't run around claiming the sky has fallen on a prototype and c) do communicate clearly.

That said, there is no denying that something has obviously gone wrong with the current approach, though it's unclear how much of the £73m has been lost. There are some questions I'd like to see asked.

1) What is the actual cost spent and where did this go - license fees, consultants, other?
2) How much of what has been developed can be recovered for future or other use including other projects?
3) Was this system broken down into small components?
4) Was the interface between mapping system and the rules engine the cause of failure?
5) Is SITI AGRI the rules engine? 
** a) Were there delays as claimed?
** b) Was the rules engine designed for web scale?  
** c) If it wasn't designed for web scale then why did GDS commission the system for use on the web?
6) Did they adopt an open source route? Was there an open source alternative?
7) Who pulled the plug and why? 
** a) Should the plug have been pulled earlier?
8) Was this a case of 'Lipstick on a Pig'?

Of course, this is all speculation based upon a little digging and reading my own past experience of disasters into the current affair. What the actual story is we will have to wait to find out.

Since I'm in the mood for wild speculation, I'll also guess that, despite its headline-grabbing bylines, the Register will turn out to have got pretty much everything wrong: the cost, the cause (number of suppliers) and its tirade.