Inside the Struggle to Preserve the World's Data

When masked gunmen smashed their way into the offices of the Crimean Centre for Investigative Journalism, the staff knew what was coming. For years, the centre had been a lone crusading voice exposing deep-rooted corruption in Ukrainian government circles, defying threats of violence and attempts to close it down. Now it was payback time as the thugs trashed the premises, seizing computers, hard drives and confidential files. Their objective was clear: erase the centre's "institutional memory" by taking its website, containing some 16,000 pages of dedicated investigative reporting, off the internet for good.

With no time to lose, a hasty distress call was made to another office 10,000 km away in San Francisco, a converted church that is home to the Internet Archive and its vast digital library. Within minutes, archive technicians began to "harvest" the website, methodically copying and storing the contents to add to its Ukraine Conflict collection.

By the time the centre's site was finally shut down in March, albeit to be put back up soon afterwards, everything it contained, including more than 5,000 videos, had been saved, preserved and made accessible online. It was another coup for the Internet Archive's visionary founder, Brewster Kahle, who is also behind the hugely popular Wayback Machine, best described as an enormous online museum that enables users to see what a particular website looked like at any time since 1996. That was the year Kahle set up his first commercial web crawl company, whose revenues were used to create the Internet Archive. Kahle's operation now costs around $10m a year, funded by web crawling services, grants, donations and a foundation established by Kahle and his wife.

Kahle is a history-maker: on the front line with other archivists around the world deciding what digital artefacts should be gleaned from the universe of information on the world wide web and stored safely for future generations to access.

"Websites disappear into the digital equivalent of a black hole all the time ... their average online life in the UK is just 75 days," says Helen Hockx-Yu, who leads the British Library's web archiving operation, rated by experts in the field as among the best in the business.

"A lot of material, including sites dealing with the 7/7 bombings in London in 2005 and the 2008 financial crisis, has been lost because we weren't able to capture it in time. That could have happened to the website for Antony Gormley's One and Other project for the fourth plinth in Trafalgar Square. He couldn't afford to maintain it properly, so he offered it to us and I'm proud that the public can now access the material through our archives."

It is only 25 years since the British scientist and visionary Sir Tim Berners-Lee invented the World Wide Web. There are now more than 600 million websites in existence and some 4,000 domain names are registered every hour. So great is the volume of data being generated and stored that a new unit may soon be required to quantify what is being held, beyond the current largest, the "yottabyte", which is a 1 followed by 24 zeros in bytes.

Yet whether as a result of hostile governments closing down domains or organisations no longer being able to pay internet service providers to maintain them, the drain of valuable material is relentless.

We met in Hockx-Yu's office at the beautiful British Library complex designed by the late Sir Colin St John Wilson, a short step from the neo-Gothic splendour of St Pancras station. Brisk, articulate and impeccably qualified, smiling indulgently when I produced my stone-age technology in the shape of notebook and tape recorder, Hockx-Yu is now in the front line of the most ambitious expansion of the British Library's archiving capability for more than 300 years. At the stroke of midnight on April 5th 2013, legislation known as the Legal Deposit regulations came into force, charging the Library with capturing the contents of the entire UK web domain (every site carrying the .uk suffix), preserving the material and making it publicly accessible.

"The most exciting thing about web archiving is that it allows us to capture living history as it's being made and to store essentially ephemeral material for posterity," she says.

By way of illustration, she offers the contrasting topics of the upcoming referendum on Scottish independence and the 2014 elections to the European parliament. "In both instances, we were archiving all kinds of material while the debates were still swirling around us."

Together with the National Libraries of Scotland and Wales, the Bodleian in Oxford, the Cambridge University Library and Trinity College's library in Dublin, the British Library is now empowered to receive a copy of every UK electronic publication. It is a logical enough extension in the digital age of its ancient right to receive and store all books, newspapers, magazines and other printed matter, but still a challenging task. Eventually, Hockx-Yu's team could also be responsible for collecting copies of every Tweet or Facebook page in the UK web domain, providing a unique snapshot of the way we live today.

The first full-scale internet "crawl" (digital shorthand for browse and copy) was launched from the Library's West Yorkshire computer centre shortly after the law took effect. Covering 4.8 million UK sites, it took three months to complete, with another two months required to process the 1 billion captured web pages. The expectation is that the Library will collect in a single year about the same amount of material as its famous newspaper and periodicals archive has amassed over the course of three centuries (a costly programme to digitise some 40 million of its 750 million printed pages is now under way).

As chief executive Roly Keating pointed out when the initial crawl began, the ambitious archiving project represented a reassertion of what it means to be a library in the 21st century. "Ten years ago, there was a very real danger of a black hole opening up and swallowing our digital heritage," Keating noted. "Millions of web pages, e-publications and other non-print items were falling through the cracks of a system devised primarily to capture ink and paper."

Professor Niels Brügger, the head of the Centre for Internet Studies at Denmark's Aarhus University, could not agree more: he tips his hat to Hockx-Yu for establishing the British Library archiving project. "More and more of our societal, cultural and political activities now take place either on the web or are closely related to it," he says. "Since the mid-1990s, you simply couldn't be a university, a company or a political party without having a website. If we want to document our present or study our past on the web, get it into an archive before it disappears.

"Whenever I'm asked why web archiving matters, I think of the Bob Dylan line from The Times They Are A-Changin': 'the present now will later be past'. Material is disappearing before our eyes at an unprecedented rate and with it goes precious source material for the future historian who will be trying to shed light on the present. Capturing the past for posterity through web archiving matters just as much as preserving other aspects of our cultural heritage, whether it's kitchen utensils, buildings, warships or collections of newspapers. Studies suggest that 40% of what's on [the web] at any given moment is deleted a year later, while another 40% has been altered, leaving just 20% of the original content."

Almost every major national library in Europe now undertakes web archiving, though the scale and cost of such operations vary widely according to their individual remit. The British Library's project cost some £3m to set up, the money coming entirely from its grant from the Department for Culture, Media and Sport. It's not possible to separate out the project's annual running costs from the Library's overall budget.

But while swift action by the Crimean journalists and Internet Archive saved their precious source material, there was no such option for staff at a human rights organisation in El Salvador. Last November, heavily armed men stormed their office, intent on destroying the only reliable records of more than a thousand children who went missing during the civil war that ravaged the tiny Central American nation in the early 1980s. After overcoming guards, the intruders removed the group's computers and methodically torched its files: an eyewitness said that they were directed by a man receiving instructions over a two-way radio.

As much as 80% of the organisation's research material was destroyed during the raid, widely attributed to extreme right-wing factions opposed to the investigation of horrific crimes committed during the war. There is an urgent lesson to be learned from that incident, said a veteran US campaigner: "The human rights community needs to think about systematic strategies to protect these archives." The subtext: the technology is there, and we ignore it at our own risk.

A certain wariness is apparent when Hockx-Yu explains the British Library's approach to the challenge of harvesting social media sites, especially the increasingly sensitive issue of privacy. The UK pressure group Big Brother Watch has argued that plans to harvest the mountains of material posted on Facebook and other platforms represent a step too far. Its then director, Nick Pickles, warned that many people using such sites would be unaware anything they upload to the web could be preserved indefinitely. "The danger of unintended consequences is magnified by how wide the Library has cast their net," he told the Financial Times.

So what would happen if somebody used Facebook to post a vitriolic diatribe in the wake of a bitter divorce? "We're very conscious of the privacy issues that might arise in the process of web archiving," Hockx-Yu insisted. "The law incorporates a 'take-down' clause, activated when individuals are able to establish that something in a site has caused them damage or distress."

The Library is already committed to appointing an independent arbitrator to deal with such cases, she added, "but I do think people also need to become more aware of the possible consequences of putting something out on the web."

And if she were ordered to remove files by police or government? "In that event, we would still follow our normal notice and takedown policy. Otherwise, court injunctions and legal requirements are valid reasons for withdrawing or removing material."

Hockx-Yu confided rather sheepishly that despite all the hard work she had put into preparing for the Legal Deposit regime, she was not present when colleagues staged an impromptu countdown as the seconds ticked away for the regulations to come into effect. Why on earth not? "I must confess that I was on a pre-arranged holiday in China."