Example Bag
Throughout this document we’ll be working with an exemplar bag whose structure is listed below. It is a simple, but real, example of how a published resource may be transmitted and described using BAGEND. Those familiar with BagIt will recognize this layout:
This document doesn’t explain BagIt itself (there are plenty of existing resources online for that), but we will go over where BAGEND uses or extends BagIt for its purposes.
Bag Declaration
This is an example bagit.txt, which is a standard file included in all BagIt bags. The BAGEND profile accepts version 0.97 or 1.0.
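As a rough illustration of the version rule above, a consumer might check the declaration along these lines. This is only a sketch: the function names are made up for this example, and the sample text simply follows the standard two-line bagit.txt layout.

```python
# Versions the BAGEND profile is said to accept.
ACCEPTED_VERSIONS = {"0.97", "1.0"}


def parse_bag_declaration(text: str) -> dict:
    """Parse the 'Label: value' lines of a bagit.txt into a dict."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        label, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"malformed line: {line!r}")
        fields[label.strip()] = value.strip()
    return fields


def is_accepted_version(text: str) -> bool:
    """True when the declared BagIt-Version is one BAGEND accepts."""
    return parse_bag_declaration(text).get("BagIt-Version") in ACCEPTED_VERSIONS


# Sample declaration in the standard bagit.txt layout.
sample = "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n"
print(is_accepted_version(sample))
```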
Bag Metadata
Below is an example bag-info.txt file, which contains metadata about the bag. The BAGEND specification requires this file be present.
BAGEND bags must include a metadata tag BagIt-Profile-Identifier with the value of http://bagend.io/bagit-profile/0.1/.
Since most, if not all, BAGEND bags contain a Submission, two additional tags should be present. These tags allow processors to locate the Submission resource.
BagEnd-Submission-File
BagEnd-Submission-Resource
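The tag rules above could be checked with something like the following sketch. The parser handles the usual "Label: value" lines and indented continuation lines of bag-info.txt. The sample values for the two submission tags (the file name and the instance:Submission1 identifier) are hypothetical, as are the function names.

```python
PROFILE_ID = "http://bagend.io/bagit-profile/0.1/"
SUBMISSION_TAGS = ("BagEnd-Submission-File", "BagEnd-Submission-Resource")


def parse_bag_info(text: str) -> dict:
    """Parse bag-info.txt into {label: [values]}; labels may repeat."""
    tags = {}
    last = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] in " \t" and last is not None:
            # Indented lines continue the previous value.
            tags[last][-1] += " " + line.strip()
            continue
        label, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"malformed line: {line!r}")
        last = label.strip()
        tags.setdefault(last, []).append(value.strip())
    return tags


def check_bag_info(tags: dict) -> list:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    if PROFILE_ID not in tags.get("BagIt-Profile-Identifier", []):
        problems.append("error: BagIt-Profile-Identifier missing or unexpected")
    for tag in SUBMISSION_TAGS:
        if tag not in tags:
            # Only a warning: bags without a Submission are permitted.
            problems.append(f"warning: {tag} not present")
    return problems


# Hypothetical sample; the two BagEnd-* values are invented for illustration.
sample = (
    "BagIt-Profile-Identifier: http://bagend.io/bagit-profile/0.1/\n"
    "BagEnd-Submission-File: resources/bagend/submission.json\n"
    "BagEnd-Submission-Resource: instance:Submission1\n"
)
print(check_bag_info(parse_bag_info(sample)))
```

Missing submission tags are reported as warnings rather than errors because a bag is not required to carry a Submission.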
Another option for referencing and locating resources within a bag was proffered by the Data Conservancy Packaging specification. BAGEND did not go so far as to define or reuse the bag:// URI scheme, but future iterations of BAGEND may (re-)introduce it.
A couple of things to note:
- The additional tags (Source-Organization, Organization-Address, etc.) are recommended, but not required.
- A decision was made by the creator of the bag to use the correlation identifier of the Submission resource as the External-Identifier in the bag metadata. The correlation identifier will be discussed in more detail later.
- BAGEND bags are not required to have a Submission. This allows for a use case where metadata in bag-info.txt are interpreted or mapped to BAGEND resource model elements. We do not anticipate this to be the primary mode of BAGEND resource representation.
BAGEND resources walkthrough
The resources file in a BAGEND bag is located under the /resources/bagend directory. It can be named whatever you wish; its location is specified by the BagEnd-Submission-File element in bag-info.txt. All of the BAGEND resources exist in the same file, and form a graph rooted in the Submission. The identity of the Submission resource is specified by the BagEnd-Submission-Resource element in bag-info.txt. A BAGEND bag doesn’t have to contain a Submission, but if it does, it must contain only one.
A BAGEND bag lacking a Submission would still allow for the metadata elements in bag-info.txt to be interpreted according to BAGEND semantics.
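To make the lookup concrete, here is a minimal sketch of how a plain-JSON processor might find the Submission once it has read BagEnd-Submission-Resource from bag-info.txt. The document layout, the key names other than the identifier and type keys, and the identifier value are assumptions for illustration. Because this walkthrough mentions both @id and @identifier, the sketch accepts either.

```python
import json

# Assumption: a resource's identifier may appear under either key.
ID_KEYS = ("@id", "@identifier")


def node_id(node: dict):
    """Return the identifier of a JSON object, or None if it has none."""
    for key in ID_KEYS:
        value = node.get(key)
        if isinstance(value, str):
            return value
    return None


def find_resource(node, identifier: str):
    """Depth-first search for the first object carrying the identifier."""
    if isinstance(node, dict):
        if node_id(node) == identifier:
            return node
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_resource(child, identifier)
        if found is not None:
            return found
    return None


# Hypothetical resources file content; real files will carry more detail.
doc = json.loads("""
{
  "@id": "instance:Submission1",
  "@type": "Submission",
  "submitter": {"@id": "instance:Person1", "@type": "Person"}
}
""")

# The identifier would normally come from BagEnd-Submission-Resource.
submission = find_resource(doc, "instance:Submission1")
print(submission["@type"])
```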
Recall that BAGEND resources are represented as JSON-LD. If you aren’t familiar with JSON-LD, that should be OK. JSON-LD elements will be prefixed with the @ character; otherwise the resources should read easily. For example, a JSON object containing the element "@type": "Submission" and "@id": "foo" can be understood to carry the “type” of “Submission” and identifier “foo”. The JSON-LD context (@context) carries special significance, as it provides JSON-LD aware processors with the information necessary to interpret the JSON as an instance of an RDF data model.
A link to the example resources file is here, and the JSON-LD context can be found here.
Resource identifiers
The @id field contains the resource identifier, and is how resources in BAGEND refer to each other. Resource identifiers must be URIs, but other than that, there are no restrictions.
Referencing an existing resource within the model is done by using its identifier, denoted by @identifier.
Resource identifiers in the example bag come in a few different forms, and it is worth asking if some forms should be preferred over others.
- Some resources are identified using URIs from a different context (for example, URIs prefixed with https://pass.jhu.edu or http://www.jhu.edu) that resolve to a representation of the resource in that context.
- Other resources have a “made up” identifier that looks like it is from a different context, but does not resolve (e.g. the identifier for the Agreement in this example is http://pass.jhu.edu#agreement, which looks like a URL but doesn’t actually resolve).
- Still other resources use the instance prefix for compact IRIs, e.g. instance:Article1, which expands to http://bagend.io/instance/0.1#Article1. These URIs do not resolve.
The JSON-LD prefix instance is provided by the JSON-LD context, and may be used to create IRIs for resources.
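A processor that treats the resources as plain JSON may still need to expand compact IRIs itself. The sketch below shows the idea using the single instance mapping described above. A JSON-LD aware library would read the mapping from the context instead of hard-coding it, and the function name here is made up.

```python
# Prefix mapping taken from the expansion described in the text; a real
# processor would read this from the JSON-LD context.
PREFIXES = {"instance": "http://bagend.io/instance/0.1#"}


def expand_iri(value: str, prefixes: dict = PREFIXES) -> str:
    """Expand a compact IRI such as 'instance:Article1'.

    Values whose prefix is unknown, or whose suffix starts with '//'
    (i.e. ordinary absolute URLs), are returned unchanged.
    """
    prefix, sep, suffix = value.partition(":")
    if sep and prefix in prefixes and not suffix.startswith("//"):
        return prefixes[prefix] + suffix
    return value


print(expand_iri("instance:Article1"))
print(expand_iri("https://pass.jhu.edu"))
```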
In addition to identifying and linking resources, the model provides for reconciling or mapping between the BAGEND resource and other contexts. For example, the Article supports identifiers for popular contexts such as CrossRef, PubMed, PMC, and the publisher’s item identifier. BAGEND cannot account for all existing identifier schemes and contexts, and it cannot anticipate new schemes or contexts. It supports popular contexts and schemes based on stakeholder feedback, and more may be added in the future.
The generic identifiers field, which is present on every resource, can be used to capture those identifiers whose scheme or context is not yet represented in the BAGEND resource model. This includes identifiers relevant to the producer of the bag that they wish the consumer of the bag to preserve or capture.
Outline of a BAGEND resource model
Here is an overview of the major elements contained in an example resource file, rooted in the Submission. We’ll consider each of these resources in turn.
Submission
The Submission can be thought of as the resource that maintains the bookkeeping related to the transfer of the custodial content of the bag. It accounts for the persons or agents who contributed to the creation of the bag and its content, captures any licenses, terms of service, or other agreements encountered (or anticipated) in the submission process, and provides a detailed description of the published article and any data contained in the bag payload directory.
Submitter, Custodial Contact, and Infrastructure Contact
BAGEND provides the ability to record three different roles related to the creation of a bag. Each of these resources is technically optional, but the receiver of a bag may use them to facilitate processing of the bag.
- The person who is actually submitting the bag
- The person who is responsible for the technical infrastructure creating the bag
- The person who is responsible for the content of the bag payload
Each of these roles will be represented by a Person.
You can see that a Person encapsulates expected attributes like name, email, etc., but it also accommodates their affiliation (an Organization). If you have a second Person in the resource model that shares the same affiliation, you don’t need to repeat the affiliation; you can reference it by its @identifier.
JSON processors will need to be prepared to handle objects or object references. The affiliation of the custodial-contact below refers to the Organization object defined above in the submitter.
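One way a plain-JSON processor might cope with values that are either objects or references is to index every identified object first, then resolve string references against that index. The following is a sketch under assumptions: references are taken to be bare identifier strings, identifiers may sit under @id or @identifier, and the sample names and values are invented.

```python
# Assumption: a resource's identifier may appear under either key.
ID_KEYS = ("@id", "@identifier")


def node_id(node: dict):
    """Return the identifier of a JSON object, or None if it has none."""
    for key in ID_KEYS:
        value = node.get(key)
        if isinstance(value, str):
            return value
    return None


def build_index(node, index=None) -> dict:
    """Map each identifier to the first object that defines it."""
    if index is None:
        index = {}
    if isinstance(node, dict):
        identifier = node_id(node)
        if identifier is not None:
            index.setdefault(identifier, node)
        for value in node.values():
            build_index(value, index)
    elif isinstance(node, list):
        for item in node:
            build_index(item, index)
    return index


def resolve(value, index: dict) -> dict:
    """Return the object for a value given by value or by reference."""
    if isinstance(value, dict):
        return value  # already an embedded object
    if isinstance(value, str):
        return index[value]  # KeyError signals a dangling reference
    raise TypeError(f"cannot resolve {value!r}")


# Hypothetical fragment: the submitter embeds the Organization and the
# custodial contact refers to it by identifier.
submission = {
    "submitter": {
        "@identifier": "instance:Person1",
        "affiliation": {
            "@identifier": "http://www.jhu.edu",
            "@type": "Organization",
            "name": "Example University",
        },
    },
    "custodial-contact": {
        "@identifier": "instance:Person2",
        "affiliation": "http://www.jhu.edu",
    },
}

index = build_index(submission)
org = resolve(submission["custodial-contact"]["affiliation"], index)
print(org["@type"])
```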
Agreements
Agreements link a signatory (an instance of Person) to a Contract on a given date. In our example, there is one agreement that was signed by the submitter (the contract-text is elided for brevity), whereby a license is granted by the submitter of materials to the repository for the purpose of archiving and dissemination. At this point, any agreements are associated with the entire Submission, including the Article and all files.
Article
The Article represents the intellectual content of a BAGEND payload. An overview of the Article is below, showing the linkages to other resources.
The Article resource provides links or identifiers to its representation in other contexts. The example links to the CrossRef works identifier and also provides the DOI of the published article. Note that the DOI is not encoded as a URL (e.g. https://dx.doi.org/10.1145/2756406.2756952). If a processor wishes to resolve a DOI, it is responsible for employing the necessary logic to do so.
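The resolution logic can be quite small. The sketch below turns a bare DOI into a resolvable URL using the doi.org resolver. The function name is made up, and the optional "doi:" prefix handling is an assumption about inputs a processor might meet, not something BAGEND specifies.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Build a resolver URL from a bare DOI such as '10.1145/2756406.2756952'."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:]  # tolerate an optional 'doi:' prefix (assumption)
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode characters that are unsafe in a URL path, keeping '/'.
    return DOI_RESOLVER + quote(doi, safe="/")


print(doi_to_url("10.1145/2756406.2756952"))
```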
The awards represent the funding used to produce any of the research in the article or its associated files. The publications are the representation of the BAGEND article in a published journal (print or online, or both). The authors link to the people who authored the article, and the files link to the article’s representation along with any associated data.
Awards
Awards identify the funding sources for the research present in the article or its data files.
Note the use of the identifiers field. In this example, the producer of the bag includes local identifiers for the award (johnshopkins.edu:grant:116920) and the awarding organization (johnshopkins.edu:funder:302749). This implies that this Award has some identity in a context that is opaque to anyone other than the creator of this particular identifier. The rationale for including this identifier in the BAGEND resource model is that it may be maintained and even indexed by the consumer. This would allow the producer of the bag to search for and find the Award (or linked resources like the Submission or Article) with that identifier.
Different organizations often have different identifiers for the same concept, and may not maintain (or even be aware of) a mapping between their respective identifiers. If BAGEND consumers commit to maintaining and indexing the opaque identifiers for BAGEND resources, this increases their discoverability by the institutions producing those resources.
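As an illustration of what indexing the opaque identifiers could look like on the consumer side, the sketch below builds a lookup from each opaque identifier to the resources that carry it. It assumes that identifiers is a list of strings and that resources are identified under @id or @identifier. The award and funder identifiers and the funder key are hypothetical.

```python
from collections import defaultdict


def index_identifiers(node, index=None):
    """Map each opaque identifier to the ids of resources that list it."""
    if index is None:
        index = defaultdict(list)
    if isinstance(node, dict):
        resource_id = node.get("@id") or node.get("@identifier")
        identifiers = node.get("identifiers", [])
        if resource_id is not None and isinstance(identifiers, list):
            for ident in identifiers:
                if isinstance(ident, str):
                    index[ident].append(resource_id)
        for value in node.values():
            index_identifiers(value, index)
    elif isinstance(node, list):
        for item in node:
            index_identifiers(item, index)
    return index


# Hypothetical Award fragment; only the two opaque identifiers come from
# the walkthrough.
award = {
    "@id": "instance:Award1",
    "@type": "Award",
    "identifiers": ["johnshopkins.edu:grant:116920"],
    "funder": {
        "@id": "instance:Funder1",
        "identifiers": ["johnshopkins.edu:funder:302749"],
    },
}

lookup = index_identifiers(award)
print(lookup["johnshopkins.edu:grant:116920"])
```

A producer could then ask the consumer for everything associated with its local grant number without the consumer needing to understand that number.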
Publications
A Publication represents the BAGEND Article in the context of an online or print publication. Typically the Publication will include an embedded Journal, as is shown here.
Note the use of the instance prefix; the producer of this bag used it to generate the resource IRIs for the Publication and Journal.
Authors
The authors of the article are captured by the authors key of the Article. Each author is a Person resource.
The interesting thing about this example is that there are three authors: one of them is referenced using its @identifier, while the other two authors are provided as JSON objects. It so happens that the custodial contact of this submission is also one of the authors. A Person with an @identifier of https://pass.jhu.edu/fcrepo/rest/users/00222680 was defined earlier as the custodial-contact of the Submission. Rather than re-keying all the information of the author, the producer can simply reference that Person.
JSON processors will need to be able to handle the situation where an object may be provided by value or by reference.
An interesting consideration is what happens when a Person is defined elsewhere, but you wish to add an additional attribute. What if the custodial-contact was defined but their ORCID was not included:
And when listing the authors, you wanted to include the individual’s ORCID? There are two options, one of which is preferred by BAGEND.
- Option 1: Add the ORCID to the Person instance defined by the custodial-contact (this is the approach taken by the example)
- Option 2: Embed an instance of Person as the author, use the same @identifier as the custodial-contact, and just add their ORCID as a property
Example of Option 2:
From a JSON-LD perspective, either option will produce the same RDF. But if you are processing the resources as plain JSON, the processing required to support option 2 is more sophisticated. For this reason, we prefer that an instance of a resource be fully defined in a single place, rather than adding additional properties to the instance throughout.
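To give a sense of the extra work Option 2 pushes onto plain-JSON processors, the sketch below gathers every object that shares an identifier and merges their properties into one view. It is an illustration only: the article, name, and orcid keys are assumed, the ORCID value is a placeholder, and conflicting values are resolved naively by keeping the first one seen.

```python
# Assumption: a resource's identifier may appear under either key.
ID_KEYS = ("@id", "@identifier")


def merge_by_identifier(node, merged=None) -> dict:
    """Merge the properties of all objects that share an identifier.

    Returns {identifier: combined properties}. When the same property
    appears twice, the first value encountered wins.
    """
    if merged is None:
        merged = {}
    if isinstance(node, dict):
        identifier = next(
            (node[k] for k in ID_KEYS if isinstance(node.get(k), str)), None
        )
        if identifier is not None:
            target = merged.setdefault(identifier, {})
            for key, value in node.items():
                target.setdefault(key, value)
        for value in node.values():
            merge_by_identifier(value, merged)
    elif isinstance(node, list):
        for item in node:
            merge_by_identifier(item, merged)
    return merged


PERSON = "https://pass.jhu.edu/fcrepo/rest/users/00222680"

# Hypothetical Option 2 layout: the author entry repeats the identifier
# and adds only the ORCID.
doc = {
    "custodial-contact": {
        "@identifier": PERSON,
        "@type": "Person",
        "name": "Example Person",
    },
    "article": {
        "authors": [{"@identifier": PERSON, "orcid": "0000-0002-1825-0097"}],
    },
}

person = merge_by_identifier(doc)[PERSON]
print(sorted(person))
```

With Option 1 none of this merging is needed, since each resource is fully defined in one place and other mentions are simple references.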
Files
Finally, we come to the files associated with the Submission. The Submission may comprise any number of files; typically there would be at least one file containing the content of the Article. In addition to the text, an Article may have accompanying data or figures, which would be linked here as a File.
The attributes contain the technical metadata of the file. Note that the location attribute is a path relative to the base of the bag, not a URI. The file-name represents a logical name, and may differ from the name in the location. The logical filename may contain characters that cannot be represented as a physical filename due to filesystem incompatibilities. Alternatively, the bag producer could choose to locate files using names derived from their checksum, and use the logical file-name to carry the human-readable version.
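Because location is a path relative to the bag base, a consumer will typically join it to wherever the bag was unpacked. A cautious processor may also want to confirm the result stays inside the bag. The following sketch shows one way to do that; the function name, bag root, and sample location are hypothetical.

```python
from pathlib import Path


def resolve_location(bag_root: Path, location: str) -> Path:
    """Resolve a File's location against the bag base directory.

    Raises ValueError if the resulting path would fall outside the bag,
    for example because of '..' segments or an absolute path.
    """
    root = bag_root.resolve()
    candidate = (root / location).resolve()
    try:
        candidate.relative_to(root)
    except ValueError:
        raise ValueError(f"location escapes the bag: {location!r}") from None
    return candidate


# Hypothetical bag root and location; the paths need not exist to resolve.
path = resolve_location(Path("/tmp/example-bag"), "data/article.pdf")
print(path.name)
```

The logical file-name is not used here at all: it is descriptive metadata, while location is what identifies the bytes on disk.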
At this juncture, the file-roles are not a defined enumeration, but we would expect values such as “manuscript”, “table”, “figure”, “supplement”, etc.
The bag payload may contain additional files that are not enumerated here; that is fine, but both producers and consumers should consider the consequences of that decision. It may be better to revise the BAGEND resource model than work around it.