Persistent Identifiers
Expert level
1. Introduction

This module aims to provide research administrators and technical staff with a thorough understanding of the issues involved in setting up a persistent identifier infrastructure.
1.1 Identifier services

Software interacts with identifiers through identifier services, which are delivered by an identifier management system. Identifier services may offer:
A useful description of the kinds of services used to interact with identifiers, and how they form workflows, is given in the PILIN Project Service Usage Model[1]. Additionally, it is possible that in the future a persistent identifier service could provide information as to whether an object has changed since its identifier was created.

1.1.1 Identifier Standards

Several persistent identifier standards provide the basis for online identifier services. These include:
There are also many informal standards for content-based persistent identification. ANDS Persistent Identifiers are based on The Handle System. 1.1.2 Creating identifiersCreating identifiers involves an association of a label and a name to produce an identifier. A name is an association of a label (a symbol) with a context that the label is in. Contexts define how the label is to be made sense of. There is only one instance of a given label in any one context. For instance, an early episode of the original Star Trek television series was called (labelled) ‘The Enemy Within’. Within the context of this series, this label is unique and is a name. ‘The Enemy Within’ was also the label for an episode of Stargate SG-1 and for a comic book series related to the Terminator movies. The label ‘The Enemy Within’ is only helpful in determining precisely what the object is when we know its context. The process of creating identifiers can be broken down into the following steps:
While these three identifier creation actions often happen simultaneously, they may be separated to accommodate constraints on the workflows around the data itself. For example, an organisation may choose to register a block of names in advance. Or a data provider may embed an identifier in a data object when creating it (which is good practice for persistence), and only later provide an association (e.g. a URL) to register with the identifier when the object is actually published. In that case, the provider would need to register the name first; create the object with the embedded name; publish the object online to get a URL; and only then register the URL against the name, creating the identifier proper.

1.1.3 Updating identifiers

The information registered with an identifier can continue to be updated after the identifier is made available; identifiers relying on indirection (discussed below) periodically update the URL associated with the identifier, without updating the identifier name. Keeping the information about the identifier up to date is a key part of ‘managing the identifier’.

Some updates are constrained by policy: the identity of who created the identifier cannot be changed without falsifying the record, and the label of an identifier cannot be changed without causing confusion. Even if the URL for an identifier changes, a persistent identifier should not end up associated with a completely different object: this defeats the purpose of having the identifier be persistent.

1.1.4 Publishing identifiers

As with data objects, publishing an identifier means that the identifier is made available to users who are not already managing it. Publishing identifiers involves crossing the curation boundary. The notion of a curation boundary is explained in the diagram below[7] and is useful in understanding how identifiers interact with end users: it defines what it means to publish data.
The curation boundary model defines publication to be when the data is exposed to people not involved in managing the data, i.e. when it crosses a curation boundary. For example, a research collaboration can have geographically widespread access to a resource, and edit it frequently as part of their work. Once a copy of the resource is available outside that group, to users with read-only access, it is expected to be reasonably stable, and straightforward to locate. As part of its stability and to establish trust, any modifications to the resource should be documented, as change metadata. The resource is then published. (This compares to the distinction between alpha releases of software and official software distributions.) A persistent identifier makes more sense for data which has crossed the curation boundary (is publicly cited by a large number of people, is stable, and will change network location only as a well-defined object). It may be less urgent for data still in flux and only accessed by a small number of active users. Consequently, the responsibilities of the identifier manager are greater once the object crosses the curation boundary. This may not mean the identifier is available to the general public; any read-only access to an identifier counts as publishing it, even if it is subject to authorisation.
Since an identifier is a complex object, different aspects of an identifier could be published at different times. A user may know what the name of the identifier is, but not be authorised to do anything with that name online; a user may be able to find out when the identifier was created, but not authorised to find out who created it.
Publishing an identifier is normally synchronised with publishing the data it identifies. However, there are circumstances when they are published separately. For example, if the data object is under embargo, it cannot be retrieved through the identifier; but the identifier name can be made public in advance, e.g. in a paper under review (so that it does not have to be changed or added in when the object is published). In that case, the identifier may not yet be allowed to resolve (the URL it would redirect to is not public), and the identifier resolution has not yet been published. Alternatively, it may resolve to limited information about the object, as opposed to downloading the object itself. 1.1.5 Using identifiers onlineWays in which identifiers can be used online are discussed below. If the identifier is persistent, then its manager needs to keep the information registered with the identifier up to date; that keeps its online use accurate. 1.1.6 Archiving identifiersIt may become impossible or impractical to update an identifier. In that case, an identifier can be archived. This can mean freezing the identifier information, so there is no longer any expectation that it will be updated; users should also be warned that the identifier may be out of date. Alternatively, aspects of the identifier may be withdrawn from public access, particularly the ability to resolve it. (Note that the name of the identifier cannot be withdrawn from public access, because users already know that the name exists.) Archiving an identifier may not synchronise with archiving (or deleting) the thing it identifies. Persistent identifiers are expected to outlive the objects they identify, for historical use. Even after a data object has been deleted, scholarly literature or the Web may continue to point to the object through an identifier; the identifier can continue to be useful by giving information on what the object used to be. 
An identifier may also need to be archived if the object continues to exist, but the identifier can no longer be kept up to date, e.g. if the object is managed by some other party, it may not be possible to keep the object’s identifier up to date. 1.1.7 Deleting an identifierAn identifier can be deleted, by removing the record of the identifier from the identifier system. This does not 'destroy' the identifier: anyone who knows that name used to identify something has a mental record of the identifier. But deleting it does let the identifier name be reused to identify something else on the same system. If an identifier is persistent, it is only ever expected to identify one thing, so it cannot be reused. This means persistent identifiers should not be deleted from identifier systems and identifiers should never be reused, even if they will no longer be available publicly. 1.2 Resolution and retrievalAn identifier can be used to name a thing. In order for the audience to be able to understand the identifier, they must already have a shared understanding of what is being named. 1.2.1 ResolutionIf there is no shared understanding of the identifier with its audience, a resolution service is required, which maps identifiers onto things. In the broadest sense, resolving an identifier is getting information about the thing identified, to help distinguish it from other things. For example, a person’s name could resolve to a listing of various identifying characteristics (like date of birth), which can be used to distinguish that person from all others. A name for a publication could resolve to its bibliographic citation, which can be used to distinguish that publication from all others. 1.2.2 Offline and online resolutionResolution does not require that the identifier be a digital object itself, so offline identifiers (such as personal names) can be resolved too, such as by consulting a reference book. 
However, when the identifier is a digital object online, resolution generally means an online service which returns metadata about the object named. This metadata distinguishes the object from all others.

Similarly, the resolution of an identifier can take place online while the thing being identified is offline: digital identifiers are not restricted to identifying online content. For instance, ‘http://person.example.com/johnston/fred’ can be resolved to a page listing Fred’s contact details and description. The identifier is an online URL, and the page accessed is online metadata about Fred, but Fred himself is not an online object: Fred is not his website. However, the most common case is that the identifier, the resolution and the thing being identified are all online.

Uniform Resource Locators (URLs), which are usually web page addresses, are one type of Uniform Resource Identifier (URI). Using URLs to identify offline objects is problematic. We have brought up the concern that ‘Fred is not his website’: an online identifier for a person (or any offline entity) should not be confused with its online representation. This confusion has been longstanding with URLs, because they traditionally conflated resolution and retrieval, so that ‘http://person.example.com/johnston/fred’ could only identify a web accessible resource, not Fred himself.

To deal with this, a Uniform Resource Identifier (URI) is now allowed to resolve to an online resource, without that resource being exactly what the URI identifies. The URI is an abstract identifier, resolving to metadata, rather than a locator. So ‘http://person.example.com/johnston/fred’ can be made to identify Fred, though it downloads a web page about Fred. The distinction is made with different HTTP status codes, or through attaching #-fragments after URIs[8].

1.2.3 Retrieval

In general, we expect that clicking a URL will let us download the object itself.
However, two distinct actions are taking place: getting distinguishing information about a thing is resolution, whereas getting to the thing itself (or a representation of it) is retrieval. The two activities are typically bundled together when you click on a URL in a browser, which is useful; but they can be logically separated out if necessary. For example, consider a URL: ‘http://www.example.com/paper.pdf’. When requested, the website could resolve the identifier to locate the paper being requested, and immediately download it to the user as a PDF, combining resolution and retrieval. Or it could resolve it to a splash page (created in PDF) about the paper, which provides bibliographic data as well as a link to the paper for download. Both options are valid and widespread on the web at present. One benefit of separating resolution and retrieval is that if the object goes offline, clicking the URL can instead show information about what the object used to be. For example, the URL ‘http://arxiv.org/abs/gr-qc/0609101’ provides resolution of a scientific paper. It no longer provides a link to the paper for retrieval however, as the paper was later found to have plagiarised other work. More information:
1.2.4 Multiple resolutionOnce retrieval and resolution are decoupled, multiple resolution becomes possible: an identifier can resolve to a page with several kinds of links, such as downloads from multiple locations or in multiple formats, as well as links to further information or services, such as purchasing the item in hard format. An intelligent resolver can use information about the user and their context to assist this process, offering the right download for the user’s operating system or browser, for instance. This is already common practice for open source software and shareware. An intelligent resolver can use information about the user and their context to automatically select a mirror copy to deliver; that is an Appropriate Copy service[9]. A resolver can go further and provide an Appropriate Version service, selecting a download based on language, file format, or accessibility format. 1.3 Value-added servicesIn addition to the core services listed above, value-added services can be introduced to satisfy the requirements of particular communities. (Arguably multiple resolution is also a core service, but intelligent resolvers providing appropriate copy and appropriate version resolution are certainly value-added.) 1.3.1 Guaranteeing persistenceTo guarantee the persistence of identifiers, two types of services to verify identifiers are required. A link rot check service verifies that any URLs resolved to are still live. An association check service ensures that those URLs point to the correct resources, and that the resources at those network locations have not been replaced by something else. These services are important for identifier managers, but also to end users, to establish trust in the identifier system. 1.3.2 Archival resolutionAny online resource can end up no longer being actively maintained. Since an identifier is an online resource, it too can end up no longer being kept up to date. 
This is typically due to institutional changes, where the maintainer of the item is somehow not in communication with the identifier manager. As a result, the identifier manager can no longer find out the current URL of the resource, for the identifier to resolve to. One way of dealing with this is by having the identifier resolve to the last known URL of the resource. Another is to provide contact details for the resource’s current manager, or the last known identifier manager, so that users interested in accessing the resource can contact them directly; this could apply if the identifier manager is no longer active, and no one else has taken the identifiers over. These are all instances of archival resolution services: the identifier is no longer being maintained, so it can be considered archived. 1.3.3 Relationship resolutionCapturing the relationships between various entities can also be treated as a persistent identifier service, mapping between the identifiers of the related entities. This is especially useful if the relation is between abstract entities where citing a URL associated with a specific copy of the resource would be misleading. For example, given the identifier of some intellectual property, a derivative work service can return the identifiers of works known to be derived from it. The relationship is not merely between particular instances of those works; it involves any file containing that intellectual property, as represented through abstract identifiers rather than concrete file locations. Relationship services can also include versioning services, which return a particular version of a file given an identifier encompassing all versions of the file. 1.3.4 Annotation serviceAn annotation service can attach metadata to a resource or parts of a resource, wherever it happens to be stored, through its persistent identifier. 
This allows annotations to be attached reliably by a third party, and to be accessible in context over the long term, without depending on updates about where the annotated resource has since moved. For example, the W3C Annotea service[10] uses XML to attach annotations to targets identified by URIs. The persistence of those annotations is more secure if the targets are themselves identified through persistent identifiers — especially since the party creating the annotation may have no control over how the location of the target resource may change. 1.3.5 Information hierarchiesRelationship services can also be used to navigate through information models for a domain, which include abstract entities. The FRBR model[11], for instance, is used in libraries to relate copies, formats, editions, versions and adaptations of literary works; an entire such structure can be navigated through identifiers for the various levels of abstraction, until a concrete entity (a physical book or file) is reached. 1.3.6 Citation trackingFinally, if the persistent identifier is used to cite a research output, citation tracking can be treated as an identifier service, scanning for instances where the identifier has been mentioned. Cross-Ref’s Cited-By Linking service[12], which relies on tracking DOI persistent identifiers, is an example of such a service. Using persistent identifiers has the advantage of avoiding specific file locations, so use of a research output can be tracked through its lifecycle, wherever it happens to be stored. On the other hand, there can always be more than one identifier used to cite a resource (including the current local URL, if it is exposed to users); so a citation tracking service is not guaranteed to pick up all existing citations. 2. The Handle System2.1 IntroductionHandle[13] technology, developed by the Corporation for National Research Initiatives[14], has been widely deployed in the repository community. 
This section gives an overview of the Handle system and how it is used by the ANDS Identify My Data product. For further details, please refer to the Handle website[15].
More information:

2.2 Handles and namespaces

A handle consists of two parts: a naming authority and a label unique within that naming authority. The two parts are separated by a slash (‘/’). Naming authorities themselves can consist of different parts, separated by dots (‘.’); unlike DNS, this does not imply a hierarchical structure of authorities. Any other UTF-8 characters are technically permitted in both the names of naming authorities and local names, but in practice, naming authorities tend to be numeric. For example, ‘10.1045/january99-bearman’ is a handle under the ‘10.1045’ naming authority. Handle technology allows specific handle names to be requested and allocated.

People sometimes cite a URL such as ‘http://hdl.handle.net/102.100.100/12’ and call it a ‘handle’. The handle here is only ‘102.100.100/12’. Note also that hdl.handle.net is not, strictly speaking, a ‘handle server’ but rather a ‘handle proxy server’. See §2.5 ‘Handle Proxy Server’ for details.
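The distinction between a handle and a proxy URL that carries it can be made concrete with a short sketch. The function names `extract_handle` and `split_handle` are hypothetical, not part of any Handle software, and the sketch assumes that the naming authority ends at the first slash and that the only proxy of interest is hdl.handle.net.

```python
from typing import Tuple
from urllib.parse import urlparse

# The public proxy named in the text; any other host is rejected in this sketch.
PROXY_HOST = "hdl.handle.net"


def extract_handle(citation: str) -> str:
    """Return the bare handle from either a handle or a proxy URL (hypothetical helper)."""
    if citation.startswith(("http://", "https://")):
        parsed = urlparse(citation)
        if parsed.netloc != PROXY_HOST:
            raise ValueError(f"not a {PROXY_HOST} URL: {citation}")
        # The handle is the URL path without its leading slash.
        return parsed.path.lstrip("/")
    return citation


def split_handle(handle: str) -> Tuple[str, str]:
    """Split a handle into (naming authority, label) at the first slash (assumption)."""
    authority, sep, label = handle.partition("/")
    if not sep or not authority or not label:
        raise ValueError(f"not a handle: {handle}")
    return authority, label


print(extract_handle("http://hdl.handle.net/102.100.100/12"))
print(split_handle("10.1045/january99-bearman"))
```

Keeping the two steps separate mirrors the text: the URL is one way of citing the handle, while the handle itself is only the authority and label pair.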
More information:
2.2.1 Handles and namespaces

The ANDS Handle namespace is 102.100.100. This is made up of ‘102’ (Australia) dot ‘100’ (e-research) dot ‘100’ (ANDS). Handles allocated by ANDS are numerical values in sequence within this namespace. ANDS PIDS handles therefore look like ‘102.100.100/12’. A resolvable URL for an ANDS PIDS handle looks like ‘http://hdl.handle.net/102.100.100/12’.

2.3 Handle Server

A handle server simply associates metadata with a handle, and returns that metadata when requested by a call to the Handle Service. The kinds of metadata associated include URLs, text descriptions and ownership information.

The handle server listens (usually on port 2641) for requests made using the ‘Handle System Protocol’[16]. These requests include behaviour such as handle administration (creating, updating and deleting handles), handle queries (returning metadata associated with a handle), and authentication. It is not directly usable by end users.
More information:
2.4 Handle Client

Handle Servers do not come with a web interface, so a specialised application is required to enable users to interact with them. Two such applications (one command-line, one Java GUI) are included with the Handle Server software, and libraries exist to allow other applications to be written.
2.5 Handle Proxy Server

As a convenience to users, the Handle System comes with a simple web server, called the Handle Proxy Server (also known as ‘the HTTP interface’ or ‘the resolver’). This is a separate piece of software from the handle server itself, but as it provides a user interface to the handle server, the two are frequently confused.

The proxy server provides a resolution service: it takes a handle and issues an HTTP redirect if there is a URL stored in the metadata record for that handle.

For example, suppose there is a handle server running at hdl.ands.org.au, which manages handles under the ANDS PIDS handle naming authority ‘102.100.100’. Further suppose that the ANDS PIDS handle 102.100.100/12 includes the URL ‘http://tardis.edu.au/experiment/view/10’ in its metadata record. The server hdl.handle.net is running a handle proxy server, responding to requests for URLs that begin with ‘http://hdl.handle.net/’.
This is the behaviour if the associated metadata record contains exactly one URL. If it contains no URLs, the proxy server does not redirect; instead it serves an HTML page displaying the contents of all associated metadata records. If the associated metadata record contains more than one URL, the proxy server serves a redirect to the first URL encountered in the handle record.
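The three cases above (one URL, no URLs, several URLs) can be summarised as a small decision function. This is a minimal sketch of the behaviour as described, not the proxy server's actual code: the `HandleValue` class and `proxy_response` function are hypothetical, the 302 status code is an assumption (the text says only ‘HTTP redirect’), and a plain-text listing stands in for the HTML page.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class HandleValue:
    """One metadata value in a handle record (hypothetical structure)."""
    type: str  # e.g. "URL" or "DESC"
    data: str


def proxy_response(values: List[HandleValue]) -> Tuple[int, str]:
    """Return (status, redirect target or page body) for a resolved handle record."""
    urls = [v.data for v in values if v.type == "URL"]
    if urls:
        # One URL or several: redirect to the first URL encountered.
        return 302, urls[0]
    # No URL: show every metadata value instead of redirecting.
    listing = "\n".join(f"{v.type}: {v.data}" for v in values)
    return 200, listing


record = [
    HandleValue("DESC", "Experiment 10"),
    HandleValue("URL", "http://tardis.edu.au/experiment/view/10"),
]
print(proxy_response(record))
```

Because the several-URL case falls back to the first URL encountered, the order in which values are stored in the record determines where users are sent.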
More information:
2.6 Global Handle RegistryIn addition to individual Handle servers and Handle proxy servers, there is a global registry of Handle servers, known as the Global Handle Registry (GHR). The registry maps namespaces (such as 123.456) to IP addresses, so that any user can use any Handle server or Handle Proxy Server to look up any Handle. Every Handle server thus knows how to contact the GHR, which tells it how to contact the Handle server corresponding to a given namespace authority. This registry is hosted by CNRI, who charge an annual fee for each Handle server registered. The CNRI also manages a central GHR in Virginia, USA and two GHR mirrors located internationally, with plans for further expansion. More information: 2.7 ANDS Identify My DataThe ANDS Identify My Data product provides a Handle server and an identifier administration service called the Persistent Identifier Service (PIDS). PIDS is a machine-to-machine service (an ANDS Web Service), intended for integration into existing data management workflows. Technical documentation on this web service is available from:
The PIDS administration service is a subset of that offered by Handle. In particular, only two types of metadata values are supported: Free text (DESC) and URL. ANDS offers a user interface to PIDS (an ANDS Online Service) called ANDS Self-Service Identifiers. ANDS Self-Service Identifiers allows users to authenticate using Shibboleth. Access to the Self-Service Identifiers is intended only for individual users, through an Australian Access Federation (AAF) identity. Access to manage a single handle is not given to multiple users. Identify My Data (self service) is found at: Institutional users should use the machine-to-machine PIDS service, integrating it into their existing infrastructure. This will help safeguard the persistence of identifiers given changes to data location; manual updates to identifiers risk losing this synchronisation. Identify My Data provides curation services to create, update, and list handles. Because ANDS is only providing core cross-disciplinary infrastructure, it is not currently deploying any of the value-added services described in §1.3. Value-added services are worth considering by projects, which can deploy their own services on top of the Handle infrastructure. Given enough demonstrated demand, ANDS may deploy some such services in the future. 3. PolicyIn order for persistence to be realised for the various aspects of an identifier, a robust policy infrastructure needs to be in place. Identifiers are created in different ways for different purposes and interact with their environment in complex ways. Consequently policies can be applied to a range of possible identifier uses. This section provides guidance on a range of possible policy considerations, with some recommended practices for realising persistence. The policy considerations are broken up into questions on:
all of which need to be worked out before considering how identifiers can best be persisted. Policy considerations are discussed in more detail by the PILIN project[17].

Increasingly, the long-term maintenance of research data is governed explicitly by data management plans, which express a negotiated understanding between the researcher and the institution maintaining the data and publishing it online. Persistent identifiers are essential to the long-term accessibility of resources, which are not restricted to appearing at only one network location or institution. For that reason, the PILIN project has recommended that persistent identifiers should be incorporated in data management planning[18].

3.1 Label Policy

3.1.1 Introduction

This section explains several issues in label policy, such as whether labels should be meaningful and the implications of format choices.

3.1.2 Meaningfulness of labels

One of the first policy choices that managers of persistent identifiers are confronted with is whether identifier labels should be meaningful. If the label is meaningful (that is, if a user can infer things about what is being identified from the label), then the identifier may be easier for people to remember, to enter without error, and to communicate to others.

However, meaningful labels are usually based on attributes of the things identified that are less likely to persist than the thing itself. The network address for a resource is one meaningful label to identify the resource by, and URLs exploit that meaning to do resolution. But resources move, so their network addresses change. Other attributes, such as the title, institutional owner, or subject matter of the resource, are also subject to change.

Updating the label to match the current semantics of the object (i.e. renaming the object) is certainly possible, but results immediately in a broken link or its equivalent.
And because the identifier can end up cited by anyone once it is published, it is impossible to update (‘patch’) all instances of the identifier found online. Some organisations, notably standards bodies, decide to freeze the label anyway in that case, but this produces the undesirable result of a ‘meaningful’ label with an obsolete meaning.

Consequently, widespread practice is to use an arbitrary label, either generated randomly, or by using an attribute which cannot change and is not particularly revealing, such as the item’s creation timestamp. That way, any changes in the thing or its status do not affect the persistence of the label.

It is possible to preserve meaning in the label, but to obfuscate that meaning. This middle-ground approach may assist in error recovery, while avoiding the pitfalls of transparent meaning. For example, a timestamp can be encrypted or coded as an alphanumeric string.

Unless the attributes used to give meaningful labels can be strongly guaranteed never to change, meaningful labels generally pose an unacceptable risk to persistence, and arbitrary labels are commonplace for persistent identifiers. Common approaches are sequential numbers and timestamps; both are still somewhat meaningful, but the meaning is not usually revealing, and can in any case be obfuscated.
3.1.3 Form of labels

3.1.3.1 URL safety

Labels used within identifiers need to be URL-safe, since identifiers will almost always end up used in URLs. They should therefore not contain characters which need encoding to be embedded safely in URLs, such as ‘&’ or space: such conversion can confuse users as to whether the encoded or the unencoded label is the ‘real’ label. For example, ‘a&b’, when URL-encoded, becomes ‘a%26b’. URL normalisation and URL encoding[19] are intended to deal with such issues over HTTP, but they do not apply to all contexts in which URIs appear and are still traps for the unwary. Handle identifiers can present a risk, as a wide range of characters is permitted.
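One way to screen candidate labels for URL safety is to percent-encode them and check whether anything changed. The sketch below uses Python's standard `urllib.parse.quote`; the helper name `is_url_safe` is hypothetical. Passing `safe=""` is a deliberately strict choice made for this example: it also flags ‘/’, which is legal in a URL path but would be ambiguous inside a label.

```python
from urllib.parse import quote


def is_url_safe(label: str) -> bool:
    """True if percent-encoding leaves the label unchanged (hypothetical helper)."""
    # safe="" means only unreserved characters (letters, digits, "-", ".", "_", "~")
    # pass through unencoded.
    return quote(label, safe="") == label


print(quote("a&b", safe=""))   # a%26b
print(is_url_safe("a&b"))      # False
print(is_url_safe("712-4"))    # True
```

A check like this could run when labels are minted, so that the encoded and unencoded forms of a label are always identical.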
3.1.3.2 Variant forms

More generally, labels with multiple possible variant forms should be avoided, as users (or systems) risk assuming that the variants are distinct after all. For example, the ARK identifier system[20] considers the labels ‘712-4’ and ‘7124’ to be equivalent, since it strips out hyphens[21]. However, other URL-based services will treat them as distinct, and human users will typically do likewise. As a result, citation tracking of the identifier might fail; assertions made using the two forms of the identifier might not be applied to the same thing; migrating identifiers to different identifier systems might artificially differentiate the identifiers; and indexing may duplicate entries for the resource.

Conversely, case sensitivity should be avoided, as should visually confusable characters (1 I l, 0 O), as humans risk failing to distinguish them.
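The mismatch between systems that fold variants together and systems that compare labels literally can be illustrated as follows. This is a sketch only: `canonical_form` applies the hyphen stripping described above plus case folding, and is not the actual ARK normalisation algorithm; all function names are hypothetical.

```python
# Characters the text singles out as visually confusable.
CONFUSABLE = set("1Il0O")


def canonical_form(label: str) -> str:
    """Fold variants together: drop hyphens and ignore case (illustrative rule)."""
    return label.replace("-", "").lower()


def same_label(a: str, b: str) -> bool:
    """Compare two labels the way a variant-folding resolver would."""
    return canonical_form(a) == canonical_form(b)


def confusable_characters(label: str) -> list:
    """List the visually confusable characters that occur in a label."""
    return sorted(set(label) & CONFUSABLE)


print(same_label("712-4", "7124"))       # equivalent after folding
print("712-4" == "7124")                 # distinct to a literal string comparison
print(confusable_characters("O0l"))
```

The first two printed results disagree, which is exactly the risk described: any service that compares the raw strings will treat one resource as two.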
3.1.3.3 Punctuation

Labels will likely be delimited by punctuation, both when cited in running text, and when embedded within URLs or other identifiers. For that reason, punctuation should be avoided in labels, if there is a risk of confusion about where the label ends. A label with a trailing comma, such as ‘fred,’, can be confusing when cited in text (readers will assume the label excludes the comma). Likewise, a trailing slash in a label risks being mistaken for a delimiter, if that label is embedded in a URL. For example, the following two URLs would normally be considered equivalent: ‘http://www.example.com/resolve/992’ and ‘http://www.example.com/resolve/992/’.
3.1.3.4 Label length and format

If there is any prospect that identifiers will often be entered into a system manually, then labels should be short enough for human users to hold in short-term memory (7±2 ‘chunks’ of information, e.g. 7 characters or words[22]); they should certainly be short enough to write down or type (20 chunks or fewer).

On the other hand, the maximum label length should be large enough that label possibilities are not exhausted in the foreseeable future. So if millions of labels will be assigned for a context, the label size should not be restricted to just four characters.

If labels are arbitrarily generated, they should if possible be of uniform length, in order to simplify error checking. The label generation algorithm used should track previously used labels to avoid one name being used to identify two different things.

An effective operating practice is to use arbitrary, uncased alphanumeric labels, avoiding I and O, with a fixed length of between four and nine characters, depending on how many identifiers will ever need to be assigned. Such labels are already widely used in systems like TinyURL[23].
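The operating practice just described can be sketched as a small label minter. Everything here is illustrative: the names `ALPHABET`, `label_space` and `mint_label` are hypothetical, the in-memory `used` set stands in for whatever persistent registry a real system would consult, and uppercase is an arbitrary choice for the single case.

```python
import secrets
import string

# Uncased alphanumerics without I and O: 24 letters + 10 digits = 34 symbols.
ALPHABET = "".join(c for c in string.ascii_uppercase + string.digits if c not in "IO")


def label_space(length: int) -> int:
    """Number of distinct fixed-length labels available."""
    return len(ALPHABET) ** length


def mint_label(length: int, used: set) -> str:
    """Generate a random fixed-length label that has never been issued before."""
    if not 4 <= length <= 9:
        raise ValueError("length should be between four and nine characters")
    if len(used) >= label_space(length):
        raise RuntimeError("label space exhausted")
    while True:
        label = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if label not in used:
            used.add(label)  # track issued labels so no name identifies two things
            return label


issued = set()
print(mint_label(6, issued))
print(label_space(4))  # 1336336
```

With this alphabet a four-character label allows 34^4 = 1,336,336 possibilities, which is why four characters is too short once millions of labels are expected; each additional character multiplies the space by 34.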
3.2 Identifier management3.2.1 Ownership of identifier systemsManagement of identifiers can be separated from management of data. It is important to make decisions on identifier policy based on an understanding of the consequences of this separation. 3.2.1.1 Updating identifier resolutionThe identifier manager undertakes to the end user to maintain the persistent identifier. The identifier manager publishes the identifier, and so has institutional responsibility for it. The identifier manager is often the same as the identifier provider, who provides the services for managing and accessing the identifier — so the identifier manager sets up the identifier management system, and also issues updates to the system. In ANDS’ case, they are distinct: for the ANDS Identify My Data product, ANDS takes on the identifier provider role and the product consumers assume the identifier manager role. To maintain the identifier, the identifier manager has to coordinate with the data manager, who is responsible for keeping the resource identified online. The data manager in turn is publishing data on behalf of the data provider, who is typically the researcher. If the data manager moves the resource to a new address, or takes the object off-line, the identifier manager has to be aware of this and update the identifier accordingly. The identifier manager and the data manager are also not necessarily the same person: the identifier may be managed by a different party from the data. For example, a department has one contact point for issuing updates to ANDS, but the updates originate in several separate labs in the same department: the labs have their own data managers, who are coordinating with the department’s identifier manager, to communicate all the needed updates to the identifier provider. Ensuring that the identifier is updated smoothly requires coordination between the identifier manager and the data manager. 
The more separated the identifier manager is from the data manager, especially if they belong to different institutional structures, the harder such coordination is to realise. This has been called the ‘Our Stuff vs. Their Stuff’ problem: it is harder to persist identifiers for data that is under some other institution’s control (‘Their Stuff’), whereas managers working under a single authority (‘Our Stuff’) can more easily coordinate with colleagues and put the necessary procedures in place. The data provider is also involved in working out the best policy structure for identifiers. The data provider has the best notion of how the identifier will be used by the user community, of how long the data will be useful, and of which parts of the data identifiers should point to. (This involves information modelling: see the section on policies below.) The identifier manager is responsible to the data provider for keeping their data accessible, as much as to the end users. Under this two-tiered resolution arrangement the identifier resolution service must be updated promptly when the URL for the data changes. If the data manager is also the identifier manager, this is straightforward. However, when identifiers are managed externally (by ANDS, for example), it is impractical for identifier managers to detect and respond to such changes in the data identified. ANDS will not know that you have moved your data. It is your responsibility as a data manager to update the resolution for your identifiers promptly.
3.2.1.2 External identifier policy
By using an external identifier service, you are also constrained by that service’s identifier policies and you have less scope to set your own policies. If the identifier service dictates a certain label format or amount of authority metadata, for instance, you cannot set your own policy contradicting that.
The ANDS Identify My Data product, for example, generates all identifier labels, so the data manager does not need to create or implement any policies on identifier labels. You will also have less control over the kinds of identifier services you can provide, because those services rely on the information provided by the external service. (These constraints also hold if you host your own identifiers, but still share your system with other institutions.) Furthermore, you cannot brand identifiers as your own, a useful restriction for ensuring persistence despite ownership changes. On the other hand, if you run your own identifier infrastructure, you are burdened with the commitment of setting explicit policies, maintaining the system for reliability and performance, and building up local expertise.
3.2.1.3 Benefits
A benefit of using a common identifier service is the isolation of identifier management from changes in data ownership. This means that if control of the data identified passes from one institution to another, the identifier managed by a third party is not affected: the new data manager can establish the same relationship with the identifier system as the old data manager did. However, identifier managers still do not have universal scope: a common ANDS identifier may deal with data moving from Victoria to Western Australia better than it will deal with data moving from Victoria to Germany, as the formation of a Handle prefix includes a country code.
3.2.1.4 ANDS identifier management policies
Although ANDS does not set restrictive policies about use of ANDS identifiers, these identifiers themselves must comply with the requirements of the Handle protocol. Additionally, users of the ANDS Identify My Data product can (and should) set some of their own policies. Identifiers do not become persistent simply by minting them. More information:
3.2.2 Context management
When an organisation manages its own identifiers, it can also organise its identifiers into a hierarchy of contexts, akin to DNS subdomains. For example, a university’s identifiers could be divided into library identifiers and researcher identifiers. Context hierarchies allow identifier management to be delegated, and identifier policies to be profiled for different domains. But a subcontext should still conform to the embedding context (a university library identifier is still the university’s identifier), so the identifiers still conform to a central, core policy profile.
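One way to picture a context hierarchy with a core policy profile is the following sketch. It is an assumption-laden illustration, not a description of any ANDS or Handle feature: the `Context` class and the policy keys (`label_length`, `contact_by_role`, `resolve_to`) are hypothetical. The design choice shown is that a subcontext may add its own settings, but settings inherited from the embedding context always win.

```python
class Context:
    """A naming context that may be nested inside an embedding context."""

    def __init__(self, name, parent=None, policy=None):
        self.name = name
        self.parent = parent
        self.policy = dict(policy or {})  # settings declared at this level

    def qualified_name(self):
        """Root-first path of context names, e.g. 'university/library'."""
        if self.parent is None:
            return self.name
        return self.parent.qualified_name() + "/" + self.name

    def effective_policy(self):
        """Local settings, overridden by whatever the embedding context fixes."""
        merged = dict(self.policy)
        if self.parent is not None:
            merged.update(self.parent.effective_policy())
        return merged
```

For example, if a hypothetical `university` context fixes `label_length` at 6 and a `library` subcontext declares `label_length` 4 and `resolve_to` "splash page", the library's effective policy keeps the university's label length and adds only its own resolution setting.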
3.2.3 Authority metadata
For users to trust claims of identifier persistence, mechanisms are needed to make those claims defensible. Users should be able to determine who is claiming the identifier is persistent, who is acting to keep it persistent (and who has done so in the past), how they are doing it, and how long they intend to keep the identifier persistent. At a minimum, identifier users should be able to recover, as publicly available metadata, when the identifier was last updated, and current contact details for the identifier manager. For ANDS PIDs, that means the party using the ANDS update services, rather than ANDS itself. If the identifier stops working (that is, the resolution becomes out of date), users can contact the identifier manager to alert them to the error, or to get more information on the current status of the resource. Contact information should itself be reasonably persistent; e.g. the maintainer should be identified by role and not as an individual. Further authority metadata, such as who created the identifier and when, what type of thing is being identified, and who has managed the identifier in the past, can also be included; this can extend as far as maintaining logs on identifier operations. Because authority metadata is used when things go wrong, its availability should not be reliant on external systems: failure to access an external system may be why things have gone wrong to begin with. Contact data should therefore be stored directly in the identifier record, rather than linked through some external database.
3.3 Identifier services
3.3.1 Resolution guidelines
There is a longstanding conflation of resolution and retrieval in URIs, leading users to expect that they can perform one (and only one) kind of action online with identifiers (resolution). To address this expectation, online identifiers should at least provide resolution behaviour as a default. For example, handle records should include a URL field.
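Pulling together the URL field mentioned above and the minimum authority metadata from section 3.2.3, an identifier record might be sketched as follows. The class and field names are hypothetical and are not the field types of any actual Handle server; the contact address and URLs used with it are placeholders.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


def _now():
    return datetime.now(timezone.utc)


@dataclass
class IdentifierRecord:
    name: str                 # the identifier name, e.g. "102.100.100/12"
    url: str                  # default resolution target
    manager_contact: str      # role-based contact, held in the record itself
    last_updated: datetime = field(default_factory=_now)
    history: list = field(default_factory=list)  # optional log of operations

    def update_url(self, new_url, actor):
        """Change the resolution target and log who changed it, and when."""
        now = _now()
        self.history.append((now, actor, self.url, new_url))
        self.url = new_url
        self.last_updated = now


def missing_minimum_metadata(record):
    """Names of the minimum public fields that are empty in this record."""
    required = ("url", "manager_contact", "last_updated")
    return [name for name in required if not getattr(record, name)]
```

Because the contact and the update log live in the record, a user who finds the resolution out of date can still discover whom to ask, even when the external system holding the resource is the thing that has failed.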
The meaning of ‘resolving’ an identifier depends on the context. The meaning of a usable representation of the thing identified, to be delivered through retrieval, also depends on context. For that reason, the resolution behaviour of an identifier should be sensitive to context, or at least rich enough not to rule out certain contexts. Different resolution services need to be exposed explicitly, if the user is to realise they are available.
3.3.1.1 Abstract and concrete resolution
Identifiers can exist for abstract entities, such as the concept of an academic work, rather than physical copies of the work. Identifiers for abstract entities can be resolved in a number of ways. An abstract resolution provides a description of the entity, such as a bibliographical citation. A concrete resolution provides a representation of the content (in effect, a retrieval). Abstract resolution is more correct for abstract entities: the identifier for an abstract document identifies all versions of the document, not just the latest PDF version, or a browser preview. However, as people typically access identifiers to obtain viewable content, abstract resolution on its own is less useful. Which form of resolution you should provide depends on what users expect to do with the identifier. The common resolution practice in institutional repositories reflects this: identifiers resolve not directly to document or data retrieval, but instead to a splash page containing metadata about the resource. This is a kind of abstract resolution: the splash page provides hyperlinks, clearly labelled as such, so that users can choose the most appropriate representation or service to access the resource.
3.3.1.2 Resolver persistence
If persistent identifiers are cited with an associated resolver service (that is, ‘http://hdl.example.com/123.456’ rather than just ‘hdl:123.456’), users will reasonably expect that the resolver is as persistent as the identifier.
That is, even if the underlying identifier record (‘123.456’) is persistent, users will perceive it as broken if the URL they have stops working. Because the identifier manager is not always responsible for the identifier resolver, a resolver service should be selected with care, to ensure it is trustworthy. If an identifier service is to be called with additional parameters in a URL query, e.g. to specify the particular representation to be retrieved, it is important to delimit the identifier proper from the parameters, to prevent users assuming the parameters are part of the identifier. Identifiers can be presented in different encoding schemes. URL encoding, for example, allows a URI to be embedded safely within another URI.
3.3.2 Citation of Handles
There are different ways persistent identifiers can be cited; when publishing persistent identifiers you will need to make a trade-off between persistence and usefulness. A Handle, such as an ANDS persistent identifier, can be presented as just a name (e.g. a URN), which improves persistence, or as a resolvable URI, which is more useful in a digital environment. If the name is used, the context of the name (that is, the identifier system used) should be made explicit (for example ‘Persistent Handle: 102.100.100/12’). While Handle does not have its own recognised URN prefix, ‘hdl:’ can be used informally (for example ‘hdl:102.100.100/12’). The only standard way of presenting a Handle as a name is in the Info-URI space, as ‘info:hdl/’ (for example ‘info:hdl/102.100.100/12’). To present the identifier as resolvable, it is important to choose a resolver that users are confident will remain available for the long haul (for example ‘http://hdl.handle.net/102.100.100/12’).
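The name and resolvable forms above, together with the earlier advice on delimiting parameters and on URL encoding, can be sketched with Python's standard `urllib.parse` module. The function names and the `format` parameter are hypothetical illustrations; nothing here describes which query parameters a particular resolver actually accepts.

```python
from urllib.parse import quote, urlencode

RESOLVER = "http://hdl.handle.net/"  # the resolver used in the example above


def as_name(handle):
    """Informal name form: persistent, but not directly resolvable."""
    return "hdl:" + handle


def as_resolvable_url(handle, resolver=RESOLVER):
    """Resolvable form: the resolver address followed by the Handle."""
    return resolver + quote(handle, safe="/")


def with_parameters(handle, params, resolver=RESOLVER):
    """Keep the identifier in the path and the parameters after '?',
    so the parameters are not mistaken for part of the identifier."""
    return as_resolvable_url(handle, resolver) + "?" + urlencode(params)


def embeddable(uri):
    """Percent-encode a whole URI so it can sit inside another URI."""
    return quote(uri, safe="")
```

For the Handle in the text, `as_resolvable_url("102.100.100/12")` gives `http://hdl.handle.net/102.100.100/12`, and `with_parameters("102.100.100/12", {"format": "pdf"})` appends `?format=pdf` after it, leaving the identifier itself untouched.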
3.4 Identifier association
Keeping identifiers persistent consumes resources, and should not be undertaken lightly. Data managers need to prioritise what to identify persistently in their domain. Those decisions depend in turn on an information model of the domain of objects that may potentially be identified: persistent identifiers will only be assigned to a subset of those objects. Drawing up such an information model can help anticipate how identifiers are likely to be used, and adjusting the information model can capture explicitly how those expectations change.
3.4.1 Information modelling
The information model should not be restricted to research data and research outputs, but should track all the entities that help contextualise and make sense of the data. For example, research data is organised by its subject matter, so the different subjects and samples that the data is about may themselves need to be identifiable in the future. The same holds for the experimenters and institutions involved in the research, and the instruments, simulations, and workflows used to obtain the data. The possibilities for persistent identification are not restricted to concrete instances of documents and files: information modelling can identify more abstract levels of resource description. These abstractions are more liberal in what counts as an acceptable representation for retrieval; so they are potentially better candidates for persistent identification, because they are not dependent on the ongoing availability of a specific representation. Recurring abstractions that occur independently of domain include:
Identifying different levels of abstraction requires metadata to distinguish them and to represent the relations between them. For example, if different versions of a file are identified, users of the identifiers for the versions should be able to recover the files corresponding to those particular versions, and to find out how they are related. More information:
3.4.2 Prioritising Persistence
An analysis of business processes is used to select which things to identify persistently. Although it can be difficult to anticipate what use an identifier will be put to once published, domain knowledge can provide some informed guesses. Such decisions depend on a variety of considerations, discussed below.
3.4.2.1 Will the thing identified be persisted itself?
If something identified is destroyed, it may be important to keep identifying it for archival purposes; but there is a higher priority on persisting identifiers for things that are still online. For example, it may be more important to identify versions of a file persistently if those versions are still accessible than if the old versions of the file have been overwritten. Whether deleted objects should have persistent identifiers at all depends on whether they are externally cited (see below).
3.4.2.2 Will it be published?
Will the thing identified be available outside the curation boundary? A persistent identifier makes more sense for data which has crossed the curation boundary (is publicly cited by a large number of people, is stable, and will change network location only as a well-defined object). The responsibilities of the identifier manager are greater once the object crosses the curation boundary. If various drafts of an object are created internally, but only the final version of the object is released, then there is less need for the previous drafts to be persistently identified: unreleased drafts moving location may be less disruptive than a released version moving location. It is also possible to publish an identifier without making the resource itself publicly available, but that is less typical. There is an expectation that if the identifier is public, at some stage the thing identified will be public as well.
3.4.2.3 Will it move servers as part of its normal workflow?
Even if it is not yet published, research data often moves between servers, e.g. from the lab to a collaboration server, or to a researcher’s private space. If data needs to be accessed consistently throughout such moves, it may make sense to identify it persistently for the duration, and to maintain resolution using that identifier throughout. This is still usually lower priority than having persistent identifiers in place for data that is already published.
3.4.2.4 Will it be cited externally?
As with the curation boundary, if a third party links to the thing identified, it is important to make sure the link keeps working. Having the link break will be very disruptive: it is hard to predict who will link to the thing, and impractical to warn everyone that its location will change. Conversely, if the thing will only be linked to internally, then persistent identifiers may not be required: when changes occur, all concerned parties can be alerted directly.
3.4.2.5 Does the information model matter?
If there are business processes that depend on specific versions of a file, then those different versions should be identified differently. But if the version is irrelevant to any real business processes using the file, or if only the latest version is ever called on, then there is no business motivation to identify them separately. Likewise, aggregations and disaggregations of objects are open-ended in number, but it is only worth identifying a particular aggregation if there is a business process that will actually make use of it. That means that the thing persistently identified should make conceptual sense to the people and processes interacting with it. The granularity at which objects are identified also depends on how users will interact with the objects. In turn, this means that the thing identified should be easily described through metadata: if it is difficult to describe the thing, it probably does not make enough sense as a concept to identify persistently.
3.4.2.6 Is the thing identified stable?
Persistently identifying something raises the expectation that the thing itself is not only accessible over a long period, but is a well-managed and well-defined object. This means that any changes to the thing should, where practical, be well-documented and accountable: if what is being identified is ‘the same thing’ over time, users should be able to work out why the thing does not look identical to what they may have accessed last month.
3.4.2.7 Is the thing under the control of the identifier authority?
Persistence implies accountability for any changes in the thing being identified, notably its online location. Such accountability is much easier to realise if the same authority manages both the identifier and the resource identified, because that authority can easily put automatic update procedures in place. Of course, keeping the identifier up to date is not impossible if the identifier and the resource are managed by different parties: the ANDS Identify My Data product depends on that scenario. But this does require that the data manager independently ensures that the resource information associated with the identifier is kept up to date.
3.4.3 Timing of persistent identifiers
Because the identifier and the resource identified are discrete digital objects, the timing of the creation and public release of the two has to be coordinated. A digital object should not be published before its persistent identifier is published. Otherwise, the digital object’s non-persistent URL could end up cited instead of its persistent identifier: once third parties start using a particular identifier, it is difficult to achieve a switch to another identifier. An effective operating practice is to associate the name with the thing in the identifier record before either is published.
You can publish the name before the thing is accessible, with the understanding that it will fail to retrieve the thing in the short term.
3.5 Identifier persistence
3.5.1 Planning for identifier persistence
To minimise disruption of persistent identifiers, a persistent identifier should be coupled only loosely to the current technologies used in data management.
3.5.1.1 How to use URIs
Care is needed when using current network address URLs to identify objects. A URL is tightly coupled to the current technology used to deliver the object, and becomes obsolete if the object is migrated to another server. This does not mean that HTTP URIs should be avoided; rather it means that care should be taken when using HTTP URIs. A file-location-specific URL (e.g. ‘http://.../~jsmith/pubs/paper-jan09.pdf’) should be avoided as much as possible, in favour of an abstract URI such as ‘http://.../items/a42p5’. Several identifier schemes rely on a two-tiered system of indirection to achieve persistence: the identifier is resolved to the resource’s current URL (as a ‘locator’), which is then used to retrieve the resource. This is conceptually possible because the current URL of a resource is distinguishing metadata about the resource, so an identifier can resolve to a URL. Under this arrangement, the current URL may change as the object moves, but the identifier itself does not have to, so long as the URL resolution is kept up to date. Users can keep using the original identifier to access the resource, and the identifier is managed to keep its resolution up to date. The two-tiered system introduces the possibility of using identifiers that are not URLs. A URL can be persistent by having it aliased to whatever the current URL of the resource is, and updating that alias; that is the model that HTTP redirection is based on.
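The aliasing model just described can be sketched as a small lookup table. This is a minimal illustration, not a web server: `AliasTable` and its methods are hypothetical names, the path `/items/a42p5` follows the abstract URI in the text, and the `example.org` addresses are placeholders. In a real deployment the value returned by `redirect_target` would be sent to the client as an HTTP redirect.

```python
class AliasTable:
    """Maps a persistent path to the resource's current URL."""

    def __init__(self):
        self._targets = {}

    def publish(self, persistent_path, current_url):
        """Create the alias; an existing alias must not be reassigned."""
        if persistent_path in self._targets:
            raise ValueError("path already identifies a resource")
        self._targets[persistent_path] = current_url

    def update(self, persistent_path, new_url):
        """Record that the same resource now lives at a new address."""
        if persistent_path not in self._targets:
            raise KeyError(persistent_path)
        self._targets[persistent_path] = new_url

    def redirect_target(self, persistent_path):
        """First tier: persistent path to current URL. The second tier is
        the client retrieving the resource from that URL."""
        return self._targets[persistent_path]


aliases = AliasTable()
aliases.publish("/items/a42p5", "http://old.example.org/~jsmith/pubs/paper-jan09.pdf")
# The file moves to another server; only the alias changes, not the cited path.
aliases.update("/items/a42p5", "http://repository.example.org/files/0042.pdf")
```

Users who cited `/items/a42p5` before the move keep using the same address afterwards; persistence depends entirely on `update` being called whenever the resource moves.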
But a resolver (a service performing identifier resolution) can also take an identifier which is not a URL, and map it to the current URL of the resource. So long as the resolver is in a known place, the identifier itself does not need to be a URL. In fact, some communities prefer identifiers which are not URLs, to avoid confusion between the persistent identifier and the current URL. This defines how technology can keep an identifier persistent; but it only tells part of the story. The real challenge is in the policy and the expectation of trust that surrounds a persistent identifier. This approach also includes processes dealing with the object internally. The more a non-persistent identifier is used, the higher the risk that it will leak into the public domain (where it may be harmful), and the more dependency is introduced on a transitory identifier. ANDS recommends using a two-tiered resolution arrangement (accessing the actual network address of a resource indirectly), in order to avoid excessive coupling of processes to the current file location. However, any processes that will affect the way the persistent identifier resolves need to be tightly integrated with processes updating the identifier resolution. For example, the process for moving a resource to a new directory or server should be engineered to simultaneously update the identifier record. This should eventually lead to the identifiers being an added layer of information infrastructure, leveraged to provide a more generic, less technology-dependent way of managing data.
3.5.1.2 Updating a URI
When moving identified items, identifier updates should happen without significantly disrupting user access through the identifier:
3.5.2 Persistence time span
No identifier will persist forever. However, identifier authorities can help identifier users plan for change usefully, by issuing an undertaking to support persistence for a fixed time period. That undertaking should be discoverable by identifier users. Some communities may need the identifier to outlive the resource for different lengths of time, so that historical citations of the identifier are still usable, while others do not. The length of persistence also depends on the technical and governance constraints imposed by the identifier’s own infrastructure. Critically, this includes planning for the future management of the identifiers, when the current manager has moved on.
3.5.3 Persistence guidelines for change in names
In addition to other forms of persistence disruption discussed above, disruption can occur when a thing starts being identified by a different name, and the original name is no longer associated with it. Usually, the mere existence of two names is inconvenient or confusing, rather than disruptive: it disappoints the presumption of a Universal Identifier (that you need only search for one name to gather all mentions of the thing identified), but that in itself does not disrupt the use of an identifier to identify things. However, if the old name ceases to function, disruption occurs. The old name may be discontinued because:
Eliminating the old name means it is no longer a persistent identifier, which compromises users who relied on it persisting. To address this problem, consider:
Whenever a published URL ends up broken, it is very likely someone will be inconvenienced. Not all potential users can be notified, and not all references (such as paper ones!) can be updated. Strategies to anticipate and mitigate this problem include:
3.5.4 Persistence guidelines for change in resolution
Persistence can also be disrupted if the same name being used starts to identify a different thing. This is an even more pernicious failure: users will not see broken links, and will reasonably assume everything is still working. However, what they are resolving to is different, so the expectation of persistence of association (the identifier keeps identifying the same thing) has been compromised. This kind of error can occur because the update procedures for either the identifier or the thing identified have failed, or because of human error. It may also happen because the information model for the thing identified has been misunderstood, so it has not been updated correctly. For example, someone could update a file resolved to by an identifier, when the identifier was intended to reference a specific, frozen version. To avoid erroneous updates of identifiers, the following measures can be taken:
3.5.5 Persistence under changed management
Using an identifier update service presupposes that the data manager remains in contact with the identifier manager, and stays authorised to update identifier data promptly. However, any plan for identifier persistence must include a plan for the contingency when that arrangement is disrupted, and the data management system can no longer communicate updates to the distinct identifier management system, even though the thing identified is still online.
3.5.5.1 Update disruption
For an identifier update to work, the update information must reach the identifier system, and the updated identifier must be returned to data management. The process of identifier updating can be disrupted or discontinued for several reasons:
To deal with these contingencies, consider the following procedures:
4. Glossary
Footnotes
[1] http://resolver.net.au/hdl/102.100.272/L8ZDW6PQH
[2] http://www.cdlib.org/inside/diglib/ark/
[3] http://purl.oclc.org/docs/index.html
[4] http://tools.ietf.org/html/rfc2141
[7] From http://www.valaconf.org.au/vala2008/papers2008/111_Treloar_Final.pdf
[8] See http://www.w3.org/TR/cooluris for the current Semantic Web approaches to this issue.
[9] http://www.doi.org/doi_proxy/appropriate_copy.html
[10] http://www.w3.org/2001/Annotea/
[11] See http://en.wikipedia.org/wiki/FRBR.
[12] http://www.crossref.org/citedby.html
[14] http://www.cnri.reston.va.us/
[16] http://www.handle.net/rfc/rfc3652.html
[17] http://www.dlib.org/dlib/january09/nicholas/01nicholas.html, http://linkaffiliates.net.au/pilin2/outputs/outputs_guidelines.html, http://linkaffiliates.net.au/pilin2/outputs/outputs_policy.html.
[18] http://linkaffiliates.net.au/pilin2/outputs/outputs_dmp.html
[19] See http://tools.ietf.org/html/rfc3986.
[20] http://www.cdlib.org/inside/diglib/ark/
[21] http://www.cdlib.org/inside/diglib/ark/arkspec.html
[22] George A. Miller. The Magical Number Seven, Plus or Minus Two. The Psychological Review, 1956, vol. 63, Issue 2, pp.81-97. http://psychclassics.yorku.ca/Miller/
[25] http://en.wikipedia.org/wiki/FRBR
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 2.5 Australia License.