Beacon Street Diary blog

Launching a Digital Collection: The Making of a Metadata Model for Digital Resources

by Zachary Bodnar, Archivist

Work continues apace on moving our digital New England’s Hidden Histories (NEHH) collections into our new Quartex digital asset management (DAM) system. We are now nearly finished with phase two of the project, which keeps us on schedule for our initial launch in the fall of 2021. A fun fact: we have now uploaded almost 60,000 images into Quartex across 960 individual items. I cannot properly describe how excited we are for our eventual launch. Instead, I can talk about metadata! Our day-to-day work with Quartex involves creating new metadata for each digital object. So today I want to talk a bit about how the Congregational Library & Archives (CLA) creates metadata for digital objects and, more specifically, how the CLA built a custom metadata schema for digital objects from a plethora of outside sources and the customizable tools provided by Quartex.

Metadata is intrinsically connected to the work of librarians and archivists. But what exactly is metadata? The dictionary definition for metadata is “data that provides information about other data” or put simply “data about data.” But even that definition feels lacking for just how important metadata is in the information science fields. For librarians and archivists, metadata is information about a resource, be it a book, manuscript, or collection, that describes and contextualizes the resource so that people may discover, find, and know about the resource. If you have ever used a library catalog, whether it be an online catalog or card catalog, you have firsthand experience using metadata to search and browse. Though the rules and structure of metadata have changed over the years, especially with the proliferation of electronic systems, metadata has ever been a part of the work of both librarians and archivists.

Metadata broadly falls into three large categories. Descriptive metadata is information that describes the “facts” about a resource, such as the title of the resource or the creator of the resource. Administrative metadata is information related to the management of a resource and covers things such as use permissions and copyright information. Finally, structural metadata is information about how a resource is put together and can describe such facets as the order of pages within a resource or the type of relationship between two different resources. A fully realized metadata model must take into account all three of these metadata types, as each is vitally important for a user to find a resource, understand what they are looking at, and know how they may use and access it.
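To make the three categories a little more concrete, here is a minimal sketch in Python of how one resource's metadata might be grouped. The class name, the field names, and every value are assumptions made up for illustration; none of this comes from Quartex or from our actual records.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceMetadata:
    """Groups one resource's metadata by the three broad categories."""
    descriptive: dict = field(default_factory=dict)     # "facts" about the resource
    administrative: dict = field(default_factory=dict)  # management, rights, permissions
    structural: dict = field(default_factory=dict)      # how the resource is put together

    def missing_categories(self) -> list:
        """Return the names of any categories that have no values yet."""
        return [name for name, values in vars(self).items() if not values]


# Placeholder values for illustration only, not a real catalog record.
record = ResourceMetadata(
    descriptive={"title": "Example parish records", "creator": "Example Church"},
    administrative={"rights": "Example rights statement"},
)
print(record.missing_categories())  # ['structural']
```

A check like this is only a reminder that a complete model touches all three categories; it says nothing about whether the values themselves are any good.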

Fortunately for librarians and archivists, much of the work related to the form and content of metadata has been done for us in the form of widely accepted metadata standards; these standards, at least within the USA, are often maintained by national institutions, such as the Library of Congress, or national organizations, such as the Society of American Archivists and the American Library Association. In the United States, librarians use the MARC (Machine-Readable Cataloging) standards to determine how information is formatted and presented, and RDA (Resource Description and Access) to determine the content of a catalog record. Likewise, archivists in the United States have DACS (Describing Archives: A Content Standard), which governs how archivists create nearly every aspect of finding aids. These metadata standards are important because they standardize metadata between otherwise unconnected organizations and creators; this makes the metadata interoperable between systems and ensures a uniform set of experiences and expectations for all users.

But what about digital resources? Digital resources, by their very nature, require a whole new set of metadata standards. DACS has been a great boon for archivists, but it does not exactly help a digital archivist who finds they need a metadata field for describing the digital object’s file type or the differences between multiple versions of the same digital object. The good news is that additional and emerging content standards have been created for digital resources. Unfortunately, there is not a single “all encompassing” standard for digital resources that might be the equivalent of DACS or MARC. Instead, the people who manage digital resources have a plethora of imperfect choices to make. And the result is often an unfortunate combination of worry, analysis paralysis, and confusion. After all, while librarians have a standard in MARC that is decades old, the managers of digital resources still exist on a sort of new frontier.

Four major content standards exist for digital objects. (There are many more than four metadata standards for digital resources, but outside these four the remaining standards are usually niche and designed for a single type of digital resource, such as scientific data sets.) Dublin Core (DC) is probably the most common standard, in large part because it is incredibly flexible. DC, which tends to focus on descriptive metadata, has very few rules governing the form of information and does not make any of its fields mandatory; in essence, DC is the ultimate pick-and-choose metadata standard. Metadata Object Description Schema (MODS) is the other major content standard focused on descriptive metadata. MODS has significantly more rules governing form and content, which makes it more difficult to implement, but it covers many important descriptive avenues that DC does not necessarily cover. Metadata Encoding and Transmission Standard (METS) is focused almost entirely on structural metadata, such as how digital files are organized, while Preservation Metadata: Implementation Strategies (PREMIS) is almost entirely focused on administrative metadata, specifically metadata related to every single facet of the creation, maintenance, and preservation of a digital file.
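For readers who have never seen one, a Dublin Core record can be sketched with nothing more than Python's standard library. The element names (title, type, coverage) and the namespace URI belong to the Dublin Core element set; the `record` wrapper element, the helper function name, and all of the values are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC_NS)


def build_dc_record(fields: dict) -> ET.Element:
    """Build a simple XML record from whichever Dublin Core elements are supplied."""
    record = ET.Element("record")  # hypothetical wrapper element, not part of DC
    for name, value in fields.items():
        element = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        element.text = value
    return record


# Placeholder values; only the element names come from Dublin Core.
record = build_dc_record({
    "title": "Example church records",
    "type": "Text",
    "coverage": "Example Town, Massachusetts",
})
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Because DC makes nothing mandatory, the function simply writes out whatever elements it is handed, which is the pick-and-choose quality described above.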

When the CLA first began to work with the Quartex system, the very first thing we needed to do was to create a list of metadata fields. Quartex, being an incredibly flexible system on the backend, does not have any prescribed fields outside of a mandatory title field, so we certainly had options. We could make our metadata as complex, or as simple, as we wanted. We could also port a metadata standard, such as DC, wholesale into Quartex and call it a day. Instead, we took an exceptionally long and hard look at the above four metadata standards, took what we felt were the best parts of those standards, and created our very own schema that works for us.

We first determined what the goal of our metadata schema was going to be. Because Quartex is primarily used as a public access point for all our digital content, we determined that our metadata had to focus primarily on the needs of our external users. Descriptive metadata, and metadata related to the creation and distribution of the digital resource, were deemed to be the most important types of metadata for helping users find and understand our digital content. That meant that, as important and useful as PREMIS and METS can be, we leaned on those standards significantly less, as METS and PREMIS metadata is most useful for internal preservation purposes. That left us with DC and MODS as our primary go-to models. Each has its strengths and weaknesses. While DC has a field for geographic metadata, MODS does not, and while MODS has a field for a genre/form term, DC does not. So, we determined what we felt were the descriptive strengths of these two models and combined them.

The result was a schema of 29 metadata fields, which covers everything from a title field to a field devoted to an item’s provenance. We made sure that every metadata field we created was documented extensively. Part of that documentation was ensuring that each metadata field which had an equivalent field in a different metadata model was enumerated and linked; this will ensure that, in the future, our metadata can be made more interoperable with external systems. We further enumerated what standards, such as which international standard for language terms, we would use for fields that required such standards. And for fields which necessitated strict vocabularies, such as the “type” field which describes the primary content of a digital object, we listed out each of the vocabulary terms that could be used within that field. We then went through the list of metadata fields to determine which would be required fields. To ease cataloging, we determined that few fields should be required, since information for any given field might be difficult, if not outright impossible, to determine. Still, while most fields are not required, we have strongly encouraged providing as much descriptive information as possible for each asset. Next, we determined which metadata fields would be free text fields and which would be controlled vocabulary fields. Quartex allows for the linking of shared metadata terms if the data is stored as a controlled vocabulary. Any field, such as the subject, name, creator, and camera model fields, which might share metadata between otherwise unconnected resources was made into a controlled vocabulary field to allow for easy linked data; this allows users to instantly search for “similar” materials with a single click of a mouse.
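The decisions described above (required or optional, free text or controlled vocabulary, and equivalents in other standards) can be written down as data. What follows is a rough sketch in Python, not our actual schema and not anything Quartex provides: the class, the function names, the four sample fields, and the vocabulary terms are all hypothetical. The DC and MODS element names in the crosswalks are real elements from those standards.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FieldDefinition:
    name: str
    required: bool = False
    vocabulary: Optional[frozenset] = None          # None means a free text field
    crosswalk: dict = field(default_factory=dict)   # equivalent fields in other standards


# Hypothetical four-field excerpt; the vocabulary terms are placeholders.
SCHEMA = [
    FieldDefinition("title", required=True,
                    crosswalk={"DC": "title", "MODS": "titleInfo/title"}),
    FieldDefinition("type", vocabulary=frozenset({"Text", "Still Image"}),
                    crosswalk={"DC": "type", "MODS": "typeOfResource"}),
    FieldDefinition("genre", crosswalk={"MODS": "genre"}),
    FieldDefinition("provenance"),
]


def validate(record: dict, schema=SCHEMA) -> list:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    known = {definition.name for definition in schema}
    for name in record:
        if name not in known:
            problems.append(f"unknown field: {name}")
    for definition in schema:
        value = record.get(definition.name)
        if value is None or value == "":
            if definition.required:
                problems.append(f"missing required field: {definition.name}")
            continue
        if definition.vocabulary is not None and value not in definition.vocabulary:
            problems.append(
                f"{definition.name}: '{value}' is not in the controlled vocabulary"
            )
    return problems


def to_standard(record: dict, standard: str, schema=SCHEMA) -> dict:
    """Rename local fields to their equivalents in another standard, dropping unmapped ones."""
    mapping = {d.name: d.crosswalk[standard] for d in schema if standard in d.crosswalk}
    return {mapping[name]: value for name, value in record.items() if name in mapping}


print(validate({"type": "Manuscript"}))
print(to_standard({"title": "Example records", "genre": "Sermons"}, "DC"))
```

Keeping the crosswalk next to each field definition is one way to make sure the "equivalent field" documentation never drifts away from the schema itself, and flipping a field from required to optional becomes a one-word change.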

It can certainly feel daunting to create a metadata schema from scratch. The flexibility in metadata creation Quartex offers is amazing, but when you are just starting with only a title field, it can be easy to wish for a more prescribed schema. Add in the fact that there are numerous metadata schemas for digital content, and you have the formula for confusion and doubt as you move forward. But, as I hope this blog post has helped illustrate, going through the process of figuring out a schema that works locally and is focused on the users of the system will pay dividends. Perhaps most important of all is simply to document these decisions. Quartex’s flexibility ensures that we are not permanently locked into a decision we might have made too hastily. We used a pilot period to test an early version of our metadata model and determined that numerous changes needed to be made. For us, many of those changes were related to fields which we had thought should be required cataloging fields; through our pilot, though, we realized that some of those initially required fields, in certain circumstances, could not be meaningfully filled out, necessitating a reversal of their required status. Metadata is the backbone of our work as librarians and archivists, and that has never been truer than now with digital records.

Since our initial pilot and now through the end of phase two of our migration, we have been using the metadata model to create new metadata for every digital object within NEHH. Due to the limitations of our previous web-based NEHH browsing solution, some of which I have talked about before, most of our digital NEHH collections, let alone the individual items, lacked a lot of the metadata that you might expect from a digital archive. This is no one’s fault; NEHH as a project is older than some of the metadata models for digital resources I have listed above! But it has meant doing a lot of catch-up work; all told, by the end of this migration, I will have been working almost exclusively on metadata creation for about 10 months. But the result will be so worth it. Where the 1735-1822 parish records for First Parish in Brunswick, Maine, simply had a title, date, and short description listed on the CLA’s website, the Quartex record for the same item lists so much more information, from subject and geographic coverage fields, to fields describing how many images comprise the whole object, to a rights statement that links out to the appropriate boilerplate, to a field letting you know the exact make and model of camera used to photograph the original object. The result of all this work is a wealth of metadata which we hope will make these already amazing and useful digital resources even more accessible, easier to navigate, and far more descriptive and precise about what exactly each digital object is. There is still a lot of work to be done between now and our soft launch in the early fall, but we are so excited and energized by our work because we know you too will be energized and excited when you see the Quartex site launch!

The whole of our metadata model, as well as some of the particulars of our Quartex configuration, has been extensively and continuously documented. For those interested, you can view the current version of our metadata model here. If you have questions about our metadata model, or are yourself looking to create a metadata model for your digital resources, please feel free to reach out to me.