In January 2010, the administration of outgoing Governor Tim Kaine transferred to the Library of Virginia approximately 1.5 million email messages from more than 200 email accounts. By law, gubernatorial records transferred to the Library “shall be made accessible to the public, once cataloging has been completed” (Va. Code § 2.2-126). The Library has long had procedures in place for accessioning and processing paper records, but figuring out what to do with 167 gigabytes of email required a new set of tools, including an additional staff member, a few key pieces of software, and a whole lot of trial and error.
Project Background
The Library of Virginia has a long and impressive history of preserving and providing public access to the records of state and local government. In 2005, the Library accessioned its first true transfer of born-electronic gubernatorial records. Since then, we have developed policies in support of the creation, transfer, and management of this content. The shift from purely paper-based to increasingly electronic records creation has posed challenges not only to the Library of Virginia, but also to archives across the United States and around the world. We understand and take seriously our responsibility to ensure secure and stable management of this material, as well as to provide open and free public access to the archival records of our government regardless of format.
Budget challenges, staff vacancies, and the absence of definitive professional best practices hindered the Library’s ability to move forward as quickly as we would have preferred. However, Tim Kaine’s April 2011 announcement of his candidacy for the United States Senate and the potential inquiries regarding his administration’s records gave us the opportunity to reassess and reconsider our priorities around these records, especially the born-electronic materials.
With our senior leadership’s support, a workgroup of archivists and IT staff undertook the challenge of making the Kaine administration’s emails accessible to the public in time for the 2012 election cycle. Due to the sheer volume of emails, coupled with technical challenges and limited resources, we soon admitted that perhaps we’d bitten off more than we could chew. Undeterred, and accepting of the fact that we would not meet the self-imposed 2012 deadline, we buckled down and continued to move forward.
Archival Processing
The Library wrestled with a number of issues during this project, but first and foremost was the question of how the processed emails should be served to the public. Given a choice of limiting access to dedicated computer terminals in the Library’s reading room or allowing anyone with an internet connection to view the emails through the Library’s online digital asset management system, the Library chose to provide online public access to the processed materials. This seemingly simple decision had a significant impact on processing the collection.
We promote open access to government records. However, open access must be balanced with the various laws that restrict access to certain types of records. Personal information (such as Social Security numbers or medical, education, and personnel records), attorney-client privileged records, and materials related to clemency and restoration of rights are just some examples of restricted material. In order to ensure that privacy-protected records were excluded from the collection, state records archivists decided that the electronic equivalent of item-level processing was warranted. Thus, using the appropriate retention schedules, archivists reviewed every email in each email account and segregated the emails that did not qualify as public records or were otherwise restricted from public access. Processed copies of the email PST files were then passed on to the Library’s information technology department for the technical phase of the project.
Despite our best efforts, we knew it was possible for restricted information to slip through our manual dragnet. Thus we also decided to create a virtual “reading room agreement” by creating a secure gateway to the online collection. In order to view the Kaine emails, users must log in using a generic account that the Library created for this collection. By logging in, users are acknowledging their researcher responsibilities regarding protected materials.
Technical Processing
Anyone who ever cheered on MacGyver as he repaired a radiator using only water and an egg white will appreciate our toolkit for this project: two determined archivists embedded in the Library’s IT department and a few hundred bucks for processing software. When the decision was made to put the Kaine emails online, we began looking for a program that would export the emails from the processed PST files into full-text PDFs. Our goal was to serve static copies of the emails in a format that was keyword searchable. We also wanted to convert any files attached to the original emails to full-text PDFs, thereby eliminating the need for users to have additional software installed to view the attachments.
Two software candidates quickly emerged, though both had limitations. Our top choice, PSTViewer Pro by Encryptomatic ($69.99 for one license), handled Microsoft Word, Excel, and PowerPoint attachments like a champion, but the most it could do with MSG files (emails attached to emails) was to include them in the exported PDF as an attachment in their native format. The runner-up, Total Outlook Converter Pro by CoolUtils ($99.90 for one license), easily converted all manner of Microsoft attachments, including MSG files, but the program routinely crashed when we tried to convert large PST files. In the end, we decided to buy both programs and combine their functionality to meet as many of our needs as possible. Thus, for all emails without MSG attachments, we used PSTViewer Pro. For the smaller number of emails with MSG attachments, we used Total Outlook Converter Pro.
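The split described above can be pictured as a simple routing rule. The sketch below is illustrative only and is not the procedure the Library followed: the Email record, its fields, and the assumption that an attached email can be recognized by a ".msg" filename are all hypothetical, and neither converter is actually invoked.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Email:
    """Hypothetical stand-in for one message read from a processed PST file."""
    subject: str
    attachments: List[str] = field(default_factory=list)  # attachment filenames


def has_msg_attachment(email: Email) -> bool:
    """Assumption: an email attached to an email shows up with a .msg filename."""
    return any(name.lower().endswith(".msg") for name in email.attachments)


def route_by_converter(emails: List[Email]) -> Dict[str, List[Email]]:
    """Group emails under the name of the program the text assigns them to."""
    routed: Dict[str, List[Email]] = {
        "PSTViewer Pro": [],
        "Total Outlook Converter Pro": [],
    }
    for email in emails:
        if has_msg_attachment(email):
            routed["Total Outlook Converter Pro"].append(email)
        else:
            routed["PSTViewer Pro"].append(email)
    return routed
```

Sorting the messages before conversion keeps the crash-prone program working only on the smaller set that actually needs it.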
Once the emails had been exported to PDF format, we created CSV files for the corresponding email metadata in order to facilitate bulk ingests into DigiTool, the Library’s digital asset management system. This phase of the project was fairly straightforward once we figured out that Swedish date formatting was the trick to preventing Excel from automatically converting our YYYY-MM-DD dates (needed so that results sets can be sorted by date) into the MM-DD-YYYY format. Through trial, error, and a fair amount of heartburn, we also discovered just how bulky we could get with our ingests into DigiTool: anything more than 3,000 emails at one time strained the program to the breaking point and required intervention from the software developer (Ex Libris Group) to right the system. When the ingests were finally complete, we turned our attention to tweaking the resource discovery interface to facilitate use of the new collection.
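For readers facing a similar ingest, the two lessons in this paragraph (keep dates as YYYY-MM-DD text, and cap each batch at 3,000 emails) can be sketched in a short script. This is a sketch under stated assumptions, not the workflow we used: the column names and row layout are hypothetical, and the metadata format DigiTool actually expects is not described here. Writing the CSV from a script sidesteps Excel on the way out, though opening the file in Excel afterward could still reformat the dates.

```python
import csv
from datetime import datetime

BATCH_LIMIT = 3000  # ceiling reported in the text; larger ingests strained DigiTool


def iso_date(sent: datetime) -> str:
    """Format a date as YYYY-MM-DD so that text sorting matches date order."""
    return sent.strftime("%Y-%m-%d")


def batches(rows, size=BATCH_LIMIT):
    """Yield consecutive slices of rows, each holding at most `size` items."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def write_metadata_csv(rows, handle):
    """Write one batch of (filename, subject, sent) rows to an open text file.

    The three column names are hypothetical placeholders.
    """
    writer = csv.writer(handle)
    writer.writerow(["filename", "subject", "date"])
    for filename, subject, sent in rows:
        writer.writerow([filename, subject, iso_date(sent)])
```

Each slice produced by batches() would be passed to write_metadata_csv() with its own output file (opened with newline="" as the csv module recommends), giving one CSV per ingest.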
It was at this point that we confronted head-on the inherent limitations in our current digital asset management system. In a perfect world, our resource discovery tool would let users browse, search, and retrieve gubernatorial email records as effortlessly as if they were using Gmail. But in the real world, our system is geared towards delivering more traditional digital content, such as digitized maps, manuscripts, and photographs. Though the emails are full-text searchable, the results set for a given keyword search can easily number in the thousands given the sheer volume of total records in the collection. We thus tried a number of approaches to help users navigate the collection. For example, to better approximate the “inbox” environment of email, we created sub-collections for each individual in the Kaine administration and populated those collections with exported emails from the corresponding PST files. This allows users to step into the shoes of specific administration officials and approximate what they saw when they logged into their email accounts. We also worked within the limits of the software to tweak certain display options when viewing results sets. Finally, we created tip sheets to help users understand how to get the most out of the DigiTool search environment.
Credits
This project would not have been possible without funding provided by Congress for the Library Services and Technology Act.
The Library of Virginia would like to thank a number of individuals whose combined efforts helped bring this project to fruition. First, our hats go off to the following members of the Kaine administration for their assistance in coordinating the archival project and answering the Library’s questions long after they had moved on to the next phase of their lives: Amber Amato, Sherrie Harrington, Margaret Hughson, Timothy M. Kaine, William H. Leighty, Larry Roberts, Mark Rubin, and Wayne Turnage. We also applaud the work of the Kaine administration records officers in managing these emails, as well as the work of David Allen of Northrop Grumman in facilitating the transfer of the email records to the Library. Additionally, we want to thank our colleagues, both former and current, for the teamwork underpinning this project. In the Library of Virginia alumni camp, we thank Ariel Billmeier, Ben Bromley, Don Chalfant, Conley Edwards, Siri Roma, and Anita Vannucci. We also give virtual high fives to those still clocking in every day to safeguard the archives of our Commonwealth: Roger Christman, Kathy Jordan, Rebecca Morgan, Paige Neal, Susan Gray Page, and Jason Roma. Finally, we say thanks to the Library of Virginia’s Executive Management Team, which was crazy enough to support our idea of taking risks and trying something that might fail miserably.