The National Archives and Records Administration is at a turning point — within a few short years it will be taking in more digital records than traditional formats. Leslie Johnston, NARA's director of digital preservation, sat down with Federal Times Editor Aaron Boyd to talk about the challenges in preserving digital records — including constantly changing formats and the exponential increase in incoming data — and how tools like open source and cloud can help.
How has the digital revolution changed the archives and your role at NARA?
I came to the archives two years ago as the first director of digital preservation. Even though the archives has actually been receiving electronic records since the 1960s, we only recently put in place a real infrastructure and recognized the need for large-scale, planned, long-term digital preservation of electronic records. That doesn't mean we weren't preserving them before. But we have so much more coming in now, and there is such a wider variety of what we get, that we needed a bigger infrastructure and plans to more proactively handle the sorts of materials we get.
As an example, people think of the archives as a place where we get stacks and stacks and shelves worth of paper. Well, of course we're still getting paper; we will never not get paper. But we're increasingly getting born digital, electronic-only documents. But it's not just documents, it's video, it's audio, it's GIS, it's photographs, it's maps — it's every possible format you can imagine.
We used to think, "Oh we get about five or six formats," and that's what our official guidance was. In 2014 we extended that to over 60 formats that we accept. And of course, if an agency produces something, we take it.
That means we need to be able to recognize and process and characterize everything that comes in and then have a plan in place to take care of it forever.
We really have two goals at the archives. One is to preserve everything that comes to us. The other is to make everything accessible to the public for use. That means not only do we need to be able to recognize it, we need to be able to process it and then put it out in a format that everybody can use and take advantage of, because a lot of what we have is public information.
How much digital content comes in compared to paper and other traditional formats?
What we get right now in terms of paper versus digital is in some ways pretty equal, if you think about it. The way that federal records management works is that agencies have records — temporary and permanent. What comes to the archives is permanent.
Those records are scheduled for delivery. They could be scheduled five years in advance, seven years in advance, 15 years in advance, 50 years in advance. So these records may be coming to us from just a few years ago or they could be coming to us from several decades ago. There's paper at agencies right now that will still be coming to us 10, 20, 50 years from now. It's the same with the electronic records.
What we will see over time is a graph where the paper goes down some and the electronic goes way up. But there's always going to be a balance. Right now, I think it’s about 50-50.
When did the scales start to tip toward more digital records and less paper?
The watershed moment is actually happening right now. I'm looking at what we have scheduled to come in to us and what we have scheduled in our record centers out in the field to be transferred over and it's pretty equal right now. So this is really the time to be talking about the type of planning we're doing for paper and its continued permanent preservation and what we're doing with digital that's coming in right now.
Does your work include digitizing traditional formats?
My work in digital preservation and the infrastructure for preservation is about the born-digital. That said, I do actually also work with the group that does the digitization at the archives, as well as talk to agencies that are digitizing their paper records.
While the paper is still considered the formal official record, [agencies] are increasingly digitizing their paper records for use alongside their current born-digital records. And if they want to transfer the digital to us, then we'll absolutely talk to them about taking that.
And we're digitizing our own holdings because there's an increasing demand for online access to the older paper, video, audio, film materials online alongside the things that are born digital. So we have some very large partnerships to digitize and those files need to be managed in the exact same way as anything that comes to us from an agency that is already digital. They represent the agency's holdings; they represent the records of the work of the federal government; they all have to be managed in the same way; they need to be in consistent similar formats; and we have the same requirements to keep those around permanently.
Which format is easier to archive: paper or digital?
We get a lot of questions about what's easier or harder to work with, the analog collections or the born-digital collections and they each have different challenges.
The physical, analog collections — the paper collections, the film collections, the map collections — they have their challenges because even though in some ways they're often more stable in terms of the type of media they're printed on, they aren't always stable and they have to be cared for very carefully, stored in very careful conditions, monitored — we need trained professionals to be able to work with those materials to keep them safe and cared for over time.
And then people say, "Well obviously your digital work must be completely different," and it's not actually completely different. It's just working with a different type of collection.
We have formats that go back to the 1960s that require just as much intricate care as, say, nitrate film from the 1920s and '30s. We need to understand the file formats — know what they are, have them documented, be able to transform them over time — so they're actively being cared for and remain in a format that's open and usable. We need to know what software can be used to work with and open and present those files to the researchers that want to use them.
It's a completely different challenge, but it's the same type of professional with very similar skills. You need to know a lot about what the records are, how they were created, what form and format they are in, what sort of care they need over time, and how you continue to preserve them and make them accessible forever.
So while it's a slightly different skill set working with digital files, working with print and paper and film materials is much the same job.
What different kinds of digital records are sent to the archives; which are most common?
People often assume that email is the largest type of material that we get and that's not wrong. But there are other forms of electronic records that we get that are almost as large as email, in some cases larger.
Photographs. Photographs absolutely are and can be records. Photographs are increasingly taken with digital cameras, and so we get large, large files that are digital photographs. I think about, say, what we get in a White House transition, what we get from the agencies that actually do aerial photography, the sorts of materials we get from NASA.
Data sets. Data sets are and can be records because they represent research work actually performed by the federal government: These are federal records, and they also come to us.
We're actually processing so many types of electronic records.
Email is one of the biggest challenges because that is certainly the format that is growing the most, and because of the scale of even one agency's email from one area of its operations for one year. It's very difficult to pre-accession or review or filter or process that to determine which are the really important records and which are the lunch orders.
I come from sort of an old-school collection development background, in which I'm a save-all-the-things person. We are increasingly developing new tools to better automate the indexing, the searching and the filtering of the corpuses of material that we get in, so that we can automate more and more of that. That's really the challenge for us: If we're going to take in a billion email messages — and that could be just one agency in one transfer — then how do we know which of those are the relevant, important records and which are, "Please go over to the corner and get me a sandwich."
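As a rough illustration of that kind of triage, here is a minimal sketch in Python. The keyword rules and sample messages are hypothetical; real appraisal tooling relies on much richer indexing, search and machine learning.

```python
# A toy triage pass over an email corpus: flag likely non-records by
# subject-line keywords so archivists can focus review on the rest.
from email import message_from_string

# Hypothetical "lunch order" signals; real appraisal rules are richer.
ROUTINE_SUBJECTS = ("lunch", "out of office", "parking", "happy hour")

def looks_routine(raw_message: str) -> bool:
    """Return True if a message is probably routine, non-record email."""
    msg = message_from_string(raw_message)
    subject = (msg.get("Subject") or "").lower()
    return any(term in subject for term in ROUTINE_SUBJECTS)

corpus = [
    "Subject: Lunch order for Friday\n\nSandwiches again?",
    "Subject: Final rule on emissions standards\n\nFor the record.",
]
records = [m for m in corpus if not looks_routine(m)]
print(f"{len(records)} of {len(corpus)} messages kept for review")
```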
What kind of tools and processes are used in digital preservation?
The work with the electronic records requires a lot of different types of software for the different stages of what we call the lifecycle: the acquisition and ingest of the records all the way through to making them publicly available — and then keeping them available.
Some of the types of tools we use — and these are tools that are available and developed through the larger archival community internationally — are the sorts of tools that are capable of looking at the files as they're coming in and saying, "You, you're a WordPerfect 3.1 file; you, you're a DOS file; you're a data set; you're a PDF file." They're capable of doing what we call characterizing the files, because one of the most important aspects of digital preservation is that we know what we have — exactly what file formats they are — and then, through the other data stores we have, know what type of software we need to work with that material. That's where it all starts for us: during our ingest process, characterizing everything we've got.
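To make "characterizing" concrete, here is a minimal sketch of signature-based identification. The magic-byte table is a toy, and nothing here is NARA's actual tooling; production tools such as DROID (backed by the PRONOM registry), Siegfried or Apache Tika carry far larger signature databases.

```python
# A minimal sketch of signature-based format characterization at
# ingest: identify each incoming file by its leading magic bytes.
from pathlib import Path

SIGNATURES = {
    b"%PDF-": "PDF document",
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"\xffWPC": "WordPerfect document",
    b"PK\x03\x04": "ZIP container (OOXML, ODF, ...)",
}

def characterize(path: Path) -> str:
    """Identify a file by the signature at the start of its bytes."""
    with path.open("rb") as fh:
        header = fh.read(16)
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return "unknown, flag for manual review"

# Hypothetical ingest directory.
for f in sorted(Path("ingest").iterdir()):
    if f.is_file():
        print(f.name, "->", characterize(f))
```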
During the processing of these files — the next stage of the lifecycle — say we've got something that's a WordPerfect for DOS file. Who can still open a WordPerfect for DOS file? That WordPerfect for DOS file is the original, authentic record, and we want to keep it. But most people cannot actually open it on their machines right now to work with it. So we have a lot of tools — some of them purchased, some of them developed, some of them open source — that can take a WordPerfect for DOS file and transform it into a PDF file that anybody can read and work with and use online.
We have to plan for and do a lot of mass transformations. If we got a lot of very old formats from the ‘70s, we could actually transform them into something that is publicly accessible but also usable.
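A minimal sketch of what such a mass transformation pass can look like, assuming LibreOffice is installed (its headless converter reads legacy WordPerfect files via libwpd); the directory layout is hypothetical, and the originals are kept untouched next to the new access copies.

```python
# Batch-migrate legacy WordPerfect files to PDF access copies using
# LibreOffice's headless converter. Originals are never modified.
import subprocess
from pathlib import Path

ORIGINALS = Path("holdings/wordperfect")   # hypothetical layout
ACCESS = Path("holdings/access-pdf")
ACCESS.mkdir(parents=True, exist_ok=True)

for original in ORIGINALS.glob("*.wpd"):
    subprocess.run(
        ["soffice", "--headless", "--convert-to", "pdf",
         "--outdir", str(ACCESS), str(original)],
        check=True,
    )
    print("migrated", original.name)
```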
We do that a lot with data sets, for example. We get a lot of data sets and they need to be transformed into a format that we can not only make available through our catalog, but put them out on Data.gov, put computing APIs on top of them so people can actually query them and work with them as data.
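As one hedged example of "putting a computing API on top" of an archived data set, the sketch below serves a CSV file through a small Flask endpoint that filters rows by query parameter; the file name and field values are invented for illustration.

```python
# A minimal query API over an archived tabular data set.
import csv
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical data set delivered as CSV.
with open("dataset.csv", newline="") as fh:
    ROWS = list(csv.DictReader(fh))

@app.route("/records")
def records():
    # Filter rows on any column passed as a query parameter,
    # e.g. /records?state=VA&year=1975
    result = [row for row in ROWS
              if all(row.get(k) == v for k, v in request.args.items())]
    return jsonify(result)

if __name__ == "__main__":
    app.run()
```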
While much of the big data community is working toward standardization, the Archives doesn't have that luxury. How do you deal with processing non-standard data?
Our work with data sets is an interesting one because it's not really our role to normalize the data that we get from agencies. It's our role to preserve the data as we received it.
We hope that they have given us a code book for the data set we have received. A code book describes the structure, or the schema, of the data, so that we can make sure the data actually matches it. It's supposed to match, and if not, we go back to the agency and negotiate with them a little bit. But our goal is to deliver it as it was actually created and existed. So in that sense we don't make anything meet any new standards.
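A minimal sketch of that match-the-code-book check, assuming the code book has been boiled down to a simple schema of column names and expected types; the schema and file here are hypothetical.

```python
# Validate a received CSV data set against its code book's schema.
import csv

CODEBOOK = {"respondent_id": int, "state": str, "year": int}

def validate(path: str) -> list[str]:
    """Return a list of mismatches between the data and the code book."""
    problems = []
    with open(path, newline="") as fh:
        reader = csv.DictReader(fh)
        if set(reader.fieldnames or []) != set(CODEBOOK):
            problems.append(f"columns {reader.fieldnames} != code book")
        for lineno, row in enumerate(reader, start=2):
            for field, typ in CODEBOOK.items():
                try:
                    typ(row.get(field, ""))
                except ValueError:
                    problems.append(f"line {lineno}: {field} not {typ.__name__}")
    return problems

print(validate("dataset.csv") or "matches the code book")
```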
What we do is we take it and we actually map it at least to the type of descriptive metadata that we have in our catalog so it is discoverable. So whoever wants to work with it and do research using that data and combine it with other sources can get the data exactly as it was created, get the code book that comes along with it and describes how it's structured and then can work with it and map it along to other standards for other data that they're working with.
We don't have the ability or really the mission to update everything because then it's not the authentic record anymore.
Accessibility is a difficult challenge with old records. Which is more difficult to make accessible: paper or digital?
In terms of difficulty they each have their challenges.
Paper potentially has the challenge of condition. Over time, there's fading of handwriting if you're talking about something like a ledger book or observational data. Many years ago I worked at the University of Virginia, where the National Radio Astronomy Observatory is. We were working with them to digitize the observational log books from the observatory there. It's fascinating. But you don't change it. You have to figure out how to transcribe it, how to capture it, how to make the paper accessible in a way that lets you keep using that paper over time.
For digital, it's really about the same thing. It's about preserving the form that it's in as closely as possible and making it available in a form that’s as usable by as many people as possible. That's part of our work and our mission.
Our definition of public use copies and those formats will also change over time. We're going to need to continue to proactively look at which formats are currently usable, when they go into obsolescence and when software to open those file formats isn't available anymore.
PDF is a good access format; it's a very forgiving format. But it's not a preservation format. Thirty years from now we may say, "PDF? Who uses PDF anymore? There's a new format that's more accessible, captures more of the look and feel, captures more of the interactivity." We're going to need to change everything over and migrate to a new format 10, 20, 30 years from now.
That's where we get into the need for large storage and compute infrastructures, and that's the other part of digital preservation. It's not just knowing what the files are and that they're forensically authentic — there's a lot of forensic software that was developed in the legal community, in the defense community and in the archives community, and some great open source software out there that we can use in our work. But these are very large collections of files, and they require a lot of storage, a lot of memory and a lot of compute power to work with things at that scale.
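Fixity checking is the bread-and-butter version of that forensic authenticity work; here is a minimal sketch, assuming a simple manifest of SHA-256 digests computed at ingest and re-verified later.

```python
# Hash every file at ingest, keep the digests, and re-hash on a
# schedule to catch silent corruption ("fixity" in archival terms).
import hashlib
from pathlib import Path

def sha256(path: Path) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Hypothetical manifest recorded at ingest: {path: digest}.
manifest = {str(p): sha256(p)
            for p in Path("holdings").rglob("*") if p.is_file()}

# Later audit pass: any file whose digest has drifted needs repair
# from a replica.
drifted = [p for p, d in manifest.items() if sha256(Path(p)) != d]
print("fixity failures:", drifted or "none")
```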
The other part of our work and planning right now is building out a larger and larger and scalable infrastructure to work with these files and records over time.
Why bother retaining digital records in obsolete formats?
There are a lot of conversations in the archival community and in the digital preservation community about whether you save the original file or just migrate it to a new format and chuck the original out. One of the things that I have found in my work, personally, is that the tools to transform file formats get better and better over time.
I was in graduate school in the 1980s. I have all of my files from when I was in graduate school writing my master's thesis, working on my PhD. Those were in MacWrite 1.0, the very first version of the Macintosh MacWrite software. When I went to transform those files in the '90s into a format that was usable in a modern piece of software, the tools of the day did a very poor job. There are tools available now that can do a much better job of transforming them into current formats. I did that transformation in the '90s; I want to do it again to get a better copy, and I couldn't if I had thrown out the MacWrite 1.0 files from 1988.
A), it's authentic. But B), I will always need to create a new format for public use over time and if I'm working from a copy of a copy then there is the chance to have a lot more corruption, to lose a lot of the authentic experience and to lose really the fidelity of the original record if I'm not working against the original file.
How do you manage the exponential increase in incoming data?
This gets back to some of what I've talked about a little earlier about what federal records management is: There are temporary records and there are permanent records. Not everything that every agency captures as observational data — like NOAA or NASA or the EPA — is necessarily a permanent record that comes to NARA for permanent holding and permanent preservation.
But we have actually developed quite an elaborate infrastructure for this. We have a tool that we call our Electronic Records Archive, or ERA — it went into production in 2009 — which is how we track all of our electronic records holdings. We have three different versions of that software: one for federal records, one for presidential records and one for congressional and legislative records, because they each have different rules around them — around what comes to us, how they're managed and what access rules apply to them. And separately we also have the classified environment because, of course, not everything that comes to us is unclassified.
Take Census data. Census data is very sensitive data because there's a lot of personally identifiable information in it. So we track that separately from everything else because of those rules.
So we ingest all of these records into systems that track what they are, who they came from.
I say that it's as simple as, "What it is, where it is and who it belongs to," are the most basic pieces of information we track about it. But it's a lot more elaborate than that because we're really looking at: What are the file format characterizations? When did we get it? What do we know about it? What record series does it come from? Which agency does it come from? What are the access rules? Is it classified or unclassified? So we have a large infrastructure that we've had in place now for close to a decade.
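A minimal sketch of that kind of tracking record as a data structure; the field names below are illustrative, not ERA's actual schema.

```python
# "What it is, where it is and who it belongs to," plus the richer
# context tracked around each electronic records accession.
from dataclasses import dataclass, field

@dataclass
class ElectronicRecordAccession:
    accession_id: str          # what it is
    storage_location: str      # where it is
    originating_agency: str    # who it belongs to
    record_series: str
    file_format: str           # from characterization at ingest
    received: str              # ISO date of transfer
    classified: bool = False
    access_rules: list[str] = field(default_factory=list)

acc = ElectronicRecordAccession(
    accession_id="2016-0001",
    storage_location="s3://holdings/2016-0001",  # hypothetical
    originating_agency="EPA",
    record_series="Emissions rulemaking",
    file_format="PDF",
    received="2016-08-01",
    access_rules=["unclassified", "public"],
)
print(acc.originating_agency, acc.file_format)
```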
We actually started tracking it in some ways back in the 1950s when the first electronic files really started coming in to NARA. I have a system that actually knows what the accessions are for electronic records going back to the 1950s because we keep migrating that data into new and more updated systems over time, as well.
We're right now in the process of refactoring our electronic records archive software to move at least all the unclassified material into the cloud. Part of the scale for us is we need that flexibility, that elasticity to be able to work with these materials and the agencies are now creating them and storing them more often than not in the cloud, depending on their sensitivity and their classification.
For us it's obviously a goal to be able to take it directly into the cloud. Manage it there, work with it there, process it there, have the archivists work in virtual environments in the cloud and then bring them over — still in the cloud — to our public access catalog because the National Archives catalog is already operating and runs in the cloud. So for us this is about having a seamless process potentially from the agency working in the cloud to us taking control, processing it and making it available in the cloud — all through one seamless, elastic environment.
That's our goal right now and that's what we're working on right now.
What's the timeline for that move to the cloud?
The goal right now is to go into production in 2018; we started the development in late 2014, so we are about halfway through. Next year we're going to start the process to get it approved to go into production in early 2018 for the unclassified, and then we start working with the classified materials.
It's a really exciting time to be doing this.
How much data does NARA take in per day and how much is stored within the archives?
When I think about the size of our digital collections, across all of the collections that I know off the top of my head, it's about half a petabyte — a petabyte being a thousand terabytes, a terabyte a thousand gigabytes, a gigabyte a thousand megabytes. To most people that's a large number. I can't say how much we take in every day because it varies depending on which agencies are transferring files to us; that's not a number we track.
We recently did some forecasting about where we're going and by 2020 we expect to have 50 petabytes of data. That will include both the born-digital records but also things that are digitized. So we see a huge uptick in what we're going to be storing and managing over the next few years and that's why we're working on the infrastructure that we're working on now that’s scalable and can not only work with many more files but many more types of files.
"Cloud" means different things to different people. What kind of cloud environment is NARA building toward?
Right now, we’re working with a vendor to develop something in public cloud storage. That obviously has to go through quite a few security controls and has actually already been approved for use by federal agencies for unclassified documents that are below a certain security threshold. So we are working with a commercial public cloud right now.
When it comes to classified, that will be a very different experience for us.
But we are also an increasingly virtualized environment, ourselves. We think of ourselves as working in a hybrid cloud environment.
In some ways I'm not a fan of the word cloud, because cloud makes it seem like somewhere off in the ether there's this place where all your files are living. It's a data center that somebody else just happens to manage. Some of my colleagues have been working in this realm since the '60s or '70s, and we all remember this as timesharing. This is just use of someone else's environment. This is what in the '80s and '90s we called hosted services. This is really nothing new. This is partnering with another organization — whether it's cloud, whether it's a commercial hosting situation, whether it's a vendor whose software just happens to live only in the cloud and we use it — these are all collaborations and partnerships with other agencies and with commercial organizations.
What's the most interesting preservation effort or initiative you've worked on?
Many, many, many decades ago I started out life thinking that I was going to be a museum curator and that I was going to be sitting in a small room with little pieces of pottery writing very detailed discourses about ceramic glazes.
But I had the opportunity while I was still in graduate school to go to work for a museum — the Fowler Museum at UCLA. They were building a new building and needed to move all of their material and collections into it, which meant boxing and inventorying everything, so they bought a computer system.
This was in 1986. They bought a computer the size of a small refrigerator, and I was the only person on staff, as a graduate student, who had ever used a computer before. I had an original 1984 Macintosh that I bought when I was an undergraduate, and so that made me the computer person. And this has often been my role in many jobs, in many agencies over the decades: I'm the computer person.
So I got to start out many, many decades ago organizing information, looking at taxonomies, looking at controlled vocabulary and we started digitizing things in 1986. We were capturing via video cameras onto one-inch tape, mastering it onto laser discs and hooking them up to a computer program with a terminal.
That was where I actually started out with this. The very first thing that I saw in 1986, in 1987 was that we can capture this, we can control this, we can digitize this and we can make these incredibly fragile archaeological materials broadly accessible to the entire world. That started me on my entire career.
And I'm still excited doing that now because that is what I still do 35 years later. I work with those digital materials — they just happen to be born digital now — make sure that they're adequately and consistently described, that those files live someplace safe and that anybody that needs them can get to them. And they're going to be able to get to them, their children are going to be able to get to them, their grandchildren are going to be able to get to them. That is why I do what I do.
Aaron Boyd is an award-winning journalist currently serving as editor of Federal Times — a Washington, D.C. institution covering federal workforce and contracting for more than 50 years — and Fifth Domain — a news and information hub focused on cybersecurity and cyberwar from a civilian, military and international perspective.