Redis's rich set of data types makes it easy and fast to experiment with data-driven algorithms and approaches. In my own experience with Redis, this ability to quickly model and try out solutions rests on the characteristics of Redis's different data structures and on the flexibility in defining the structure and syntax of keys. I was impressed and excited to be able to name a chunk of malleable data and to relate that name to other keys through the naming semantics of the key. This feature of Redis is sometimes underappreciated, given how powerful and useful it can be for developing and understanding your data.
I first started experimenting with Redis in 2011 as a metadata and systems librarian at Colorado College, at the base of Pikes Peak in Colorado. Most libraries around the world store and structure their bibliographic data in a surprisingly durable binary format called MAchine-Readable Cataloging (MARC), substantially developed in the late 1960s by Henriette Avram of the United States Library of Congress. The current version, MARC 21, is officially supported by the Library of Congress (which is, however, in the process of replacing MARC with a new RDF-based linked data vocabulary called BIBFRAME). MARC 21 initially encoded information about the books on a library's shelves and has since been extended to support e-books available for checkout; video, music, and audio in physical formats such as CDs and Blu-ray discs; and online streaming formats. For academic libraries, in fact, an increasingly large percentage of the budget is devoted to purchasing journal articles through online publishers and electronic-content vendors.
The MARC format is made up of both fixed-length and variable-length fields numbered in the three-digit range of 001–999, each of which holds either character data or subfields with data. In addition, each field can have up to two indicators that modify the meaning of the field. Two of the most common and important MARC fields are the 100 Main Entry – Personal Name field and the 245 Title Statement field. Here is an example from David Foster Wallace's book Infinite Jest:
To use this MARC data in Redis, each MARC record was modeled as a hash with the key marc:{counter}, where the counter is a global incremental counter. Each MARC field is a hash whose key is modeled as marc:{counter}:{field}. Because some MARC fields are repeatable with different information, the hash key can include an additional counter, as in marc:{counter}:{field}:{field-counter}. Simply storing these two fields would result in the following six Redis commands:
This key structure in Redis looks like the following:
Storing MARC data in Redis can be accomplished with just a single Redis data type, the hash, along with a consistent key syntax. To improve the usability of this bibliographic data, other Redis data types such as lists or sorted sets can support a very common use case: retrieving library data as a list of records sorted alphanumerically by title and by author name (in library parlance, two access points).
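The key syntax described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names marc_key and hset_commands are hypothetical, the subfield values are simplified stand-ins rather than a full catalog record, and the output is plain command text, so no Redis server is needed to run it.

```python
def marc_key(record_id, tag=None, occurrence=None):
    """Build a key following the marc:{counter}:{field}:{field-counter} syntax."""
    parts = ["marc", str(record_id)]
    if tag is not None:
        parts.append(str(tag))
    if occurrence is not None:
        parts.append(str(occurrence))
    return ":".join(parts)


def hset_commands(record_id, fields):
    """Return one HSET command string per subfield of each MARC field.

    Values are quoted naively; a value containing a double quote would
    need escaping before being pasted into redis-cli.
    """
    commands = []
    for tag, subfields in fields:
        key = marc_key(record_id, tag)
        for code, value in subfields.items():
            commands.append(f'HSET {key} {code} "{value}"')
    return commands


# Illustrative values only (assumption): indicators and cataloging
# punctuation are left out to keep the example short.
infinite_jest = [
    ("100", {"a": "Wallace, David Foster"}),
    ("245", {"a": "Infinite jest", "c": "David Foster Wallace"}),
]

commands = hset_commands(1, infinite_jest)
for command in commands:
    print(command)
```

Keeping key construction in one helper means that the record counter, field tag, and optional field counter always appear in the same order, which is what lets related keys be found by name.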
Representing MARC fields and subfields in Redis by using hashes and lists was informative. Next, I wanted to see whether Redis could handle other metadata models for books and other materials that were being put forward as replacements for MARC. The Functional Requirements for Bibliographic Records, or FRBR, was a document that proposed an alternative to MARC based on entity-relationship (ER) models. The FRBR ER model contains groups of properties categorized by level of abstraction. The most abstract is the Work class, which represents the most general properties needed to uniquely identify a creative artifact, such as titles, authors, and subjects. The Expression class is made up of properties such as edition and translation, with a defined relationship to the parent Work. Manifestation and Item are the final two FRBR classes, capturing more specific data: an Item is a physical object that is a specific instance of a more general Manifestation.
With few actual systems or technologies implementing an FRBR model for library data, Redis offers a way to test such a model with real data. Using existing mappings of MARC data to FRBR's Work, Expression, Manifestation, and Item classes, the MARC 100 and 245 fields from the example above would be mapped to an FRBR Work in Redis, as shown by these examples that use the Redis command-line tool, redis-cli, to connect to a Redis instance:
This new work, frbr:work:1, can be associated with the remaining classes with the following Redis keys and hashes:
In the previous example for Expression, a specific date is captured along with a relationship back to frbr:work:1 through the "realization of" property. Similarly, the frbr:manifestation:1 hash has two fields: a publisher and "physical embodiment of". The value of the "physical embodiment of" field is the frbr:expression:1 key, which links the Manifestation back to the Expression. Finally, the frbr:item:1 hash has a barcode identifier property and a relationship key back to the frbr:manifestation:1 hash.
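The chain of relationships between the four FRBR hashes can be sketched in Python with nested dictionaries standing in for Redis hashes, so the example runs without a server. The field names here are assumptions chosen for illustration (in particular exemplar_of for the Item-to-Manifestation link), and the barcode is a placeholder, not real data.

```python
# Each top-level key mimics a Redis key; each inner dict mimics a hash.
store = {
    "frbr:work:1": {
        "title": "Infinite Jest",
        "creator": "Wallace, David Foster",
    },
    "frbr:expression:1": {
        "date": "1996",
        "realization_of": "frbr:work:1",
    },
    "frbr:manifestation:1": {
        "publisher": "Little, Brown and Company",
        "physical_embodiment_of": "frbr:expression:1",
    },
    "frbr:item:1": {
        "barcode": "PLACEHOLDER-BARCODE",  # hypothetical value
        "exemplar_of": "frbr:manifestation:1",  # assumed field name
    },
}

# Fields whose value is the key of a more abstract entity.
LINK_FIELDS = ("exemplar_of", "physical_embodiment_of", "realization_of")


def resolve_chain(store, key):
    """Follow link fields upward from any entity until none remain."""
    chain = [key]
    while True:
        entity = store[key]
        parent = next(
            (entity[field] for field in LINK_FIELDS if field in entity), None
        )
        if parent is None:
            return chain
        chain.append(parent)
        key = parent


chain = resolve_chain(store, "frbr:item:1")
work = store[chain[-1]]
print(" -> ".join(chain))
print(work["title"])
```

Against a real Redis instance, each step of the loop would correspond to reading one hash field and using its value as the next key, which is the practical payoff of storing keys as relationship values.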
In both the MARC and FRBR experiments, the Redis hash data structure provided the base representation for the entity. This strategy starts to fail when a specific property can have more than one value, such as when representing multiple authors of a work. The first attempt to solve this problem for properties with multiple values was to create a counter for each MARC field, as outlined above. For example, the MARC 856 field – Electronic Location and Access – stores the URL for e-books or other material that has a network-resolvable URL. If we want to add two URLs to the preceding MARC example, such as a link to the book in Google Books and a wiki on the book, the Redis commands would be as follows:
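As a rough illustration of the per-field counter, the following Python sketch uses plain dictionaries as an in-memory stand-in for Redis's INCR and HSET commands, so it runs without a server. The counter key name, the add_repeatable_field helper, and the example.org URLs are hypothetical placeholders, not the commands from the original experiment.

```python
counters = {}  # stands in for Redis string keys used with INCR
hashes = {}    # stands in for Redis hashes


def incr(key):
    """Mimic INCR: create the counter at 0 if missing, then add one."""
    counters[key] = counters.get(key, 0) + 1
    return counters[key]


def add_repeatable_field(record_id, tag, subfields):
    """Store one occurrence of a repeatable field under its own hash key."""
    # Assumed counter key; one counter per record and field tag.
    occurrence = incr(f"marc:{record_id}:{tag}:counter")
    key = f"marc:{record_id}:{tag}:{occurrence}"
    hashes.setdefault(key, {}).update(subfields)
    return key


# Subfield u of the 856 field holds the URI; both URLs are placeholders.
first_key = add_repeatable_field(1, "856", {"u": "https://example.org/google-books"})
second_key = add_repeatable_field(1, "856", {"u": "https://example.org/wiki"})

print(first_key, hashes[first_key])
print(second_key, hashes[second_key])
```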
This naming approach for the MARC keys meets the requirement for repeating MARC fields, but how can we support the edge case in which a single MARC field has multiple, repeating subfields? A first pass at this problem might be to store a string with some delimiter between each subfield as the value for a particular field in the MARC hash. This would require additional parsing on the client side to extract the different subfields, and we would lose any advantages that Redis could provide if these multiple subfields were stored directly in Redis. The second approach to handling multiple subfields in a MARC field is to further expand the Redis key syntax and use a list or some other data structure as the value for each subfield key. Expanding the MARC 856 example, if we wanted to add a second e-book URL, maybe a URL to the Amazon Kindle version, it would look like the following in Redis:
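The list-based approach can be sketched in Python with a small in-memory stand-in for RPUSH and LRANGE, so it runs without a Redis server. The expanded key marc:1:856:1:u is one assumed way of extending the key syntax with the subfield code, and the example.org URLs are placeholders.

```python
lists = {}  # stands in for Redis list keys


def rpush(key, *values):
    """Mimic RPUSH: append values and return the new list length."""
    lists.setdefault(key, []).extend(values)
    return len(lists[key])


def lrange(key, start, stop):
    """Mimic LRANGE: stop is inclusive and negative indexes count from the end."""
    items = lists.get(key, [])
    size = len(items)
    if start < 0:
        start = max(size + start, 0)
    if stop < 0:
        stop = size + stop
    return items[start:stop + 1]


# Assumed key layout: record 1, field 856, first occurrence, subfield u.
url_key = "marc:1:856:1:u"
rpush(url_key, "https://example.org/ebook")
rpush(url_key, "https://example.org/kindle")
rpush(url_key, "https://example.org/ebook")  # a list keeps this duplicate

urls = lrange(url_key, 0, -1)
print(urls)
```

The list preserves insertion order but happily stores the same URL twice, which is the limitation that motivates trying other data types.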
Storing multiple subfields in a Redis list works well, but what if I don't want any duplicate values in a MARC field's subfields? This is easily solved with Redis's set data type, which, by definition, contains only unique values. Using sets for the subfield values seems like a good solution, but it fails if we need to preserve the order of the values in the subfield.
Fortunately, Redis's sorted set data type fits our use case admirably: it ensures a collection of unique subfield values with no duplicates while also maintaining the subfield ordering. The resulting Redis commands for storing the URLs of a book in the MARC 856 field would look like the following:
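To make the difference concrete, here is a minimal Python sketch of the sorted set behavior, using a dictionary as an in-memory stand-in so it runs without a Redis server. It assumes the insertion position is used as the score and mimics ZADD with the NX option, so that re-adding an existing URL leaves its position unchanged. The helper names and example.org URLs are hypothetical.

```python
sorted_sets = {}  # key -> {member: score}, stands in for Redis sorted sets


def zadd_nx(key, score, member):
    """Mimic ZADD key NX score member: add only if the member is new."""
    zset = sorted_sets.setdefault(key, {})
    if member in zset:
        return 0  # existing member keeps its original score
    zset[member] = score
    return 1


def zrange_all(key):
    """Mimic ZRANGE key 0 -1: order by score, then by member for ties."""
    zset = sorted_sets.get(key, {})
    ordered = sorted(zset.items(), key=lambda item: (item[1], item[0]))
    return [member for member, _score in ordered]


url_key = "marc:1:856:1:u"  # assumed key layout
incoming = [
    "https://example.org/google-books",
    "https://example.org/wiki",
    "https://example.org/kindle",
    "https://example.org/google-books",  # duplicate, should be ignored
]

# Score each URL by the position in which it arrived.
added = [
    zadd_nx(url_key, position, url)
    for position, url in enumerate(incoming, start=1)
]
ordered_urls = zrange_all(url_key)
print(added)
print(ordered_urls)
```

The duplicate is rejected as it would be in a set, while the scores keep the URLs in arrival order rather than alphabetical order, which a plain set cannot guarantee.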
In this example, we examined how to represent a legacy format for library data called MARC and how MARC's field and subfield data can be stored in Redis by using hashes. We also saw how the storage of subfields changes as more requirements are met, moving from Redis lists to sets and finally to the sorted set data type. This iterative experimentation hopefully illustrates an important reason for using Redis: the ability to quickly test different methods of storing data, and to see how the characteristics of Redis data types such as hashes, lists, sets, and sorted sets can represent both the data and some of the requirements for storing and accessing it.