Friday 3 April 2015

Breakout Session B: Digital Preservation: We Know What it Means Today, But What Does Tomorrow Bring?

This was a presentation made by Randy S Kiefer of the CLOCKSS Archive, looking at the preservation of digital content.

Long-term preservation refers to the processes and procedures required to ensure content remains accessible well into the future. It is an attempt to replicate the situation with paper journals. There is market demand from libraries that want to be assured there is independent third-party preservation of electronic content. There is also centrally managed preservation of national collections, held on national soil for safe-keeping. Publishers want to be good stewards of their content and want people to be confident that nothing will be lost.

Preservationists become "keepers" of the content in case a Trigger Event occurs (publisher failure, discontinuation, disaster). It is an "insurance policy" for e-resources.

Commercial hosting, journal hosting platforms (HighWire, Metapress, Ingenta etc) and aggregators are not preservation archives! They have the right to remove discontinued content (and Metapress actually went under last week).

There are two types of digital preservation archives: global (e.g. CLOCKSS, LOCKSS, Portico) and regional (e.g. BL and the Dutch KB).

The CLOCKSS archive began in 2006 as a collaboration between top research libraries and scholarly publishers to create a dark archive. This means the content is preserved but there is no access to it. 12 sites around the world hold CLOCKSS servers.

The principles:
  • Community governed with responsibility shared. There are 12 governing libraries including OCLC and Edinburgh/EDINA. Both publishers and libraries are on the CLOCKSS Board.
  • A global approach with decentralised preservation, proven open-source technology via LOCKSS (Lots Of Copies Keep Stuff Safe) which is both the award-winning software and the network. (In CLOCKSS, C is for Controlled, as in dark).
  • There is a commitment to open access.
  • Content can be activated via Trigger Events, which include: the publisher no longer being in business; the title no longer being offered; back issues no longer being available; or a catastrophic failure of the server.
  • There is a vote to trigger content. There has to be agreement of at least 75% of the board, with no more than 2 not agreeing. 16 journals have been triggered to date. In each case the publisher itself has come to CLOCKSS.
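
The voting rule in the last bullet can be sketched as a small check. This is only an illustration of the rule as reported in these notes; the function name, the board size used in the example, and the treatment of abstentions as "not agreeing" are all assumptions, not CLOCKSS policy.

```python
def trigger_approved(votes_for: int, board_size: int, max_not_agreeing: int = 2) -> bool:
    """Check a trigger vote against the rule as reported in the session notes.

    Assumptions (not from CLOCKSS documentation): every board member who does
    not vote in favour, including anyone abstaining, counts as "not agreeing".
    """
    if not 0 <= votes_for <= board_size:
        raise ValueError("votes_for must be between 0 and board_size")
    not_agreeing = board_size - votes_for
    # Integer arithmetic avoids floating-point rounding on the 75% threshold.
    reaches_threshold = votes_for * 100 >= 75 * board_size
    return reaches_threshold and not_agreeing <= max_not_agreeing


# Hypothetical 12-member board: 10 in favour passes, 9 in favour does not,
# because 3 members not agreeing exceeds the limit of 2.
print(trigger_approved(10, 12))
print(trigger_approved(9, 12))
```

On a larger board the two conditions diverge: the "no more than 2" limit becomes the stricter one, which is why both are checked.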
The CLOCKSS Community is in three parts:
  • Scholars/students & readers of content;
  • Libraries who purchase and manage content;
  • Publishers of above content.
Services provided:
CLOCKSS is a charitable organisation providing a dark archive, open access delivery of triggered content (hosted not by CLOCKSS itself but by the Universities of Stanford and Edinburgh), content insurance for libraries and peace of mind for publishers. The triggered material has a CC non-commercial license.

Brazil is at the moment completing its application to be the 13th node (the other 12 nodes are spread all over the world, including Scotland). The CLOCKSS Board has authorized 15 archive nodes in total, so there is an opportunity for another European node.

CLOCKSS is a trusted digital repository. It is the only digital repository to have scored a perfect 5 in technology and security.

So, where are they going in the future?

Here are the biggest challenges:
  • Formats - discussion of HTML5 and AJAX, with publishers less enthusiastic about the latter than expected.
Funding from the Mellon Foundation has been awarded for CLOCKSS to look at formats. Content is captured as it is ("just-in-time" translation). The problem is with presentation, not content. This needs to be improved.
  • What to do with databases, datasets and supplementary materials
- The key issue is that of space. Everything has to be x12 (i.e. the number of servers). There have been partnerships with Figshare, Reveal Digital, etc., but CLOCKSS has stepped back from these to re-examine them.
- Must look at the value of CLOCKSS to the community - talking to various organisations about what the value is and identify what is of most value to keep. Avoid duplication.
- Open vs Closed databases. Open databases, such as Facebook, cannot be captured fully. Only a snapshot (a picture in time) can be taken. You take an initial picture, things change and you take another picture. The first picture is then thrown away. However, this means that anything that was in the first picture but is not in the second will be lost. You can also pick up updates, but these have to be really well tagged.
  • Funding issues: in particular, underwriting small independent publishers who are most at risk. The funding has not changed since CLOCKSS started back in 2008. It is now being discussed.
  • OA and library support - where do they stand in the priority of preservation? It is wrong to assume that OA sites will stay forever! They have trouble getting funding too. Supporting the library base is also being looked at further.
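
The snapshot problem described above for open databases can be made concrete with a toy example. This is a minimal sketch, assuming a snapshot can be modelled as a simple mapping from record ID to content; the record names and the function are hypothetical and do not describe how CLOCKSS actually captures data.

```python
def lost_when_replaced(first: dict[str, str], second: dict[str, str]):
    """Report what disappears if the first snapshot is discarded for the second.

    Returns two dicts: records removed entirely, and earlier versions of
    records that were overwritten by a later edit.
    """
    removed = {k: v for k, v in first.items() if k not in second}
    overwritten = {k: v for k, v in first.items() if k in second and second[k] != v}
    return removed, overwritten


# Hypothetical snapshots of an open database taken at two points in time.
first = {"post-1": "hello", "post-2": "draft", "post-3": "photo"}
second = {"post-1": "hello (edited)", "post-3": "photo", "post-4": "new"}

removed, overwritten = lost_when_replaced(first, second)
print(removed)      # records that exist only in the discarded snapshot
print(overwritten)  # earlier versions replaced by edits
```

Picking up updates instead of replacing whole snapshots would mean keeping the entries reported here, which is only possible if each record can be identified reliably, hence the need for good tagging.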
Randy took questions from the crowd:

More clarification on preserving different formats was given. Any format can be preserved, including video. The issues are space, cost and presentation (especially if the format is no longer in use or supported). Preservation is built into browsers, e.g. plug-ins to play MP3 music from the 1980s. This helps CLOCKSS.

Storage drive costs had been going down until a fire in 2010 that affected a major disk drive manufacturer. The prices have not recovered since.

All the universities who agreed to host CLOCKSS servers have agreed to US law (and copyright); legally, this can't happen on a cloud-based system. Also, there is no legal precedent with a cloud-based preservation system, and no protection with regards to security. Who would lead the preservation initiative if it was cloud-based?

Discussion of differences between CLOCKSS and Portico.
In the CLOCKSS system the 12 boxes talk to each other all the time to check if there are any problems, e.g. missing data. If there are any differences then the majority wins. This happened when there was the natural disaster in Japan. Portico is run using different technology. It has two sites and has a file structure system. It also has a different strategy. It is an archive and post-cancellation service, whereas CLOCKSS is solely an archive. (LOCKSS is used in conjunction to provide post-cancellation access.) It is good to have the two different services, as then there isn't a monopoly and both feed off each other.
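
The "majority wins" idea can be illustrated with a short sketch. It is a deliberately simplified model, assuming each box holds one copy of an item and that copies are compared by SHA-256 hash; it is not the actual LOCKSS polling protocol, and the box names and function are hypothetical.

```python
import hashlib
from collections import Counter


def majority_repair(replicas: dict[str, bytes]) -> tuple[dict[str, bytes], list[str]]:
    """Repair disagreeing copies by adopting the version held by a strict majority.

    Simplified illustration only: the real LOCKSS protocol is more involved.
    Returns the repaired replicas and the names of the boxes that were fixed.
    """
    digests = {name: hashlib.sha256(data).hexdigest() for name, data in replicas.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes * 2 <= len(replicas):
        raise RuntimeError("no strict majority; cannot decide which copy is good")
    good_copy = next(data for name, data in replicas.items() if digests[name] == winner)
    repaired = [name for name in replicas if digests[name] != winner]
    return {name: good_copy for name in replicas}, repaired


# Twelve hypothetical boxes, two of which hold damaged copies.
boxes = {f"box{i}": b"article" for i in range(12)}
boxes["box3"] = b"artic"
boxes["box7"] = b""

fixed, repaired = majority_repair(boxes)
print(repaired)
```

Requiring a strict majority means the sketch refuses to guess when the boxes are evenly split, rather than silently overwriting good copies.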

All triggered content in CLOCKSS is released in OA, whereas in the case of Portico only its members get the triggered content. No one server or organisation can preserve all materials; it will be a shared global initiative to keep going with this.

A publisher has to agree to be in LOCKSS, but it doesn't cost the publisher anything to participate.

David Rosenthal created the LOCKSS software. His blog on preservation is recommended. It contains both technical and pragmatic discussion.
