Europeana, numbers and scalable architectures
I just got around to reading the press release issued after the collapse of Europeana (previously the more easily pronounced 'European Digital Library') following its launch a couple of weeks ago. If you go to the site now, you are greeted with the following message:
The Europeana site is temporarily not accessible due to overwhelming interest after its launch (10 million hits per hour). We are doing our utmost to reopen Europeana in a more robust version as soon as possible. We will be back by mid-December.
(my emphases) The press release explains what happened. Or rather, it explains whose fault it is that the site couldn't cope with the traffic it received. The blame is laid squarely on those pesky 'experts' who predicted a peak demand of 5 million hits per hour, and on the public who disregarded this and whose demand reached a peak of 10 million hits per hour. Or 13 million hits. Or nearly 20 million hits. Each of these figures is claimed in the press release or on the website. We'll come back to these numbers in a moment.
Piecing together the limited information provided by the website, the press release, and a recording of the press conference held after the site was taken down, one arrives at the following sequence of events:
- Europeana is launched, following a good deal of publicity
- Peak usage approaches 8/10/13/nearly-20 million hits. Site begins to behave unpredictably and to become unresponsive
- Hardware capacity doubled from 3 servers to 6 servers
- Decision made to take site down to 'ease pressure' on it.
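To put these figures in more familiar engineering terms, here is a rough back-of-envelope conversion from hits per hour to requests per second, and then to requests per second per server. It is only a sketch: it assumes the load is spread evenly across the hour and across the servers (which real traffic never is), and the function names are my own. The hit counts and the server counts (3 and 6) are the ones quoted above.

```python
SECONDS_PER_HOUR = 3600


def hits_per_second(hits_per_hour: float) -> float:
    """Average request rate, assuming hits are spread evenly over the hour."""
    return hits_per_hour / SECONDS_PER_HOUR


def hits_per_second_per_server(hits_per_hour: float, servers: int) -> float:
    """Average rate each server sees, assuming perfectly even load balancing."""
    return hits_per_second(hits_per_hour) / servers


if __name__ == "__main__":
    # Figures quoted in the press release and on the website.
    scenarios = [
        ("predicted peak", 5_000_000),
        ("reported peak", 10_000_000),
    ]
    for label, hits in scenarios:
        for servers in (3, 6):
            rate = hits_per_second_per_server(hits, servers)
            print(f"{label}: {hits:,} hits/hour over {servers} servers "
                  f"is about {rate:.0f} requests/second each")
```

On these generous assumptions the reported peak of 10 million hits per hour averages a little under 2,800 requests per second in aggregate. Bear in mind that a 'hit' usually counts every image, stylesheet and script as well as the page itself, so the number of actual page views or searches would be lower still.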
In a breath-taking display of 'spin', this rather faltering start to Europeana's fortunes is being hailed as a very positive development. According to spokesman Martin Selmayr, in a recorded press conference, it demonstrates unequivocally that there is a huge demand for the service. Rather intriguingly, we are told that the service fell over because thousands of people were searching for the query term 'Mona Lisa' at exactly the same time. When one of the journalists points out that this seems a little suspicious, the spokesman tells him that this is because the press specifically used the 'Mona Lisa' as its example when discussing the impending launch of the service - so the crash is partly the press's fault as well! When another journalist suggests that thousands of concurrent requests for the same resource have some of the characteristics of a distributed denial-of-service attack, the response is to claim that actually there was a wide range of content searched for. I sensed that the assembled press corps were becoming a little puzzled by this point. Yet another journalist asked why the site would be down until mid-December. Apparently, this is to 'take pressure off the system', which doesn't make a whole lot of sense.
So, what now? It is possible to speculate based on what is hinted at in the press release. Clearly, Europeana did not scale to cope with 10 million hits an hour - double what was predicted. One might suppose that Europeana would experience a peak load at launch, given the publicity, which might then ease off a little. The Europeana team have already tried to respond by 'scaling out' - adding more hardware. In fact they have doubled the hardware, to no avail. If scaling out were going to work, then why not double again? Surely the cost is not the issue? I suggest that someone has realised that scaling out will not work, and that some deeper adjustments to the system's architecture will need to be made.
In which case, I wonder whether it can be achieved by mid-December. And why did they think that scaling out would work in the first place?
Europeana seems to be driven by numbers. This seems like an increasingly anachronistic approach to the design of a web service and to the measurement of its success. A memo about the service states:
The objective of the European Commission is that in 2010, the number of digitised works available online through Europeana should reach 10 million.
Numbers again... No matter what estimates have been calculated, or plans laid, this number can't be much more than arbitrary. Why 10 million? Why not 20? Or 5? These metrics are just not helpful or interesting. If Europeana hadn't crashed it would be making 2 million objects available now, apparently. I have no way of appreciating the benefit of 5 times as many objects. A better statement would perhaps have talked about growth in terms of responding to user feedback. (Incidentally, in this memo there is also a table of percentage contributions from EU 'member states' to Europeana - France has contributed 52% of the content so far, compared with 10% from the UK.)
When you're facing a global scale of usage you need a global-scale architecture to cope. Global isn't a number. Global implies that you might need to keep growing. Continuously. If your service is an instant hit, you might even need to grow rapidly. Much has been learned and developed about this in recent years, and new architectures have evolved to meet huge levels of demand. Another sensible strategy which has emerged is the 'soft launch'. Launch the application, and allow word to spread; give yourself some breathing space to tweak the application, fix bugs, and grow.
So, while the Europeana engineers are working away to try to meet their mid-December deadline, I urge them to consider two things:
- forget the numbers you've been saddled with - get a global-scale architecture in place, something which allows you to scale out at will, regardless of the numbers. If this means sacrificing functionality then do it nonetheless.
- don't be persuaded into a big launch with fanfares - do a soft launch and allow some time for the system to shake down and for its reputation to grow. If it's a good service, the users will come.
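To illustrate the first of these points, here is a toy model of why doubling the servers might make no difference at all. I should stress that this is not a diagnosis of Europeana, whose internals I know nothing about: the shared back end, the idea of partitioning it into shards, and every number in the code are invented purely for illustration. The model simply says that a system can serve no more than its narrowest tier allows.

```python
def capacity_shared_backend(web_servers: int,
                            rps_per_web_server: float,
                            backend_rps: float) -> float:
    """Throughput when all web servers funnel into one shared back end.

    Adding web servers helps only until the shared tier saturates.
    """
    return min(web_servers * rps_per_web_server, backend_rps)


def capacity_partitioned_backend(nodes: int,
                                 rps_per_web_server: float,
                                 rps_per_shard: float) -> float:
    """Throughput when each added node brings its own slice of the back end.

    Every node is limited by its own slower half, but the total grows
    in step with the number of nodes.
    """
    return nodes * min(rps_per_web_server, rps_per_shard)


if __name__ == "__main__":
    # All figures below are hypothetical, chosen only to show the shape
    # of the problem; none of them come from Europeana.
    WEB_RPS = 500       # requests/second one web server can handle
    SHARED_RPS = 1200   # requests/second the single shared back end can handle
    SHARD_RPS = 400     # requests/second one back-end shard can handle

    for n in (3, 6, 12):
        shared = capacity_shared_backend(n, WEB_RPS, SHARED_RPS)
        sharded = capacity_partitioned_backend(n, WEB_RPS, SHARD_RPS)
        print(f"{n:>2} nodes: shared back end {shared:.0f} rps, "
              f"partitioned back end {sharded:.0f} rps")
```

With these made-up numbers, going from 3 to 6 servers in front of a shared back end leaves capacity stuck at the back end's limit, whereas the partitioned design doubles it. That, in miniature, is the difference between adding hardware and having an architecture that lets you scale out at will.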
I hope Europeana is allowed to launch when it's good and ready - and I hope that it does launch. And I hope that, in time, we can learn from the mistakes of this ambitious project, rather than be fed marketing claims and face-saving spin.