Jaime Longoria: At the beginning of the project, how did you determine which data was going to be required to tell this story, and what did you have to do to get your hands on it?
Frank Matt: That took some doing. One of our reporters, who had done this story out of South Carolina, first gave us a hint of just how big a national story this could be. This person had filed an exploratory FOIA request for data to see what would come back, you know, without specifying fields. They just knew the data existed. So that gave us an early idea. My first FOIA request was for the data dictionary of this claims database from the Department of Labor's Office of Workers' Compensation Programs, which adjudicates claims for, among other things, former nuclear weapons workers. When I got this data dictionary back, it was a list of 6,000 fields. They were all coded, and the department claimed not to have any comprehensive set of definitions for these fields, so we were left to decipher what we thought was important. We ended up getting into a long back and forth with the Department of Labor about what was even in their database. It took us months before we were able to craft our first FOIA request for the data itself in a comprehensive way. We ended up requesting a few hundred fields of data that we determined would be the most important. One of the things that made this a challenging process from the start was that this was data collected for a very specific purpose, which is adjudicating claims. It's not really meant for learning about workers, and that was what we were trying to do — to learn more about these workers' experiences during and after their employment, when they got sick and tried to get compensation from the government for it. It was incredibly complicated data, and even to get to the point of filing a request for all the data we thought we needed, we went through a months-long process of trying to figure out what was in this database. We were met with a lot of obfuscation, and also that claim from the Department of Labor from the start: that they don't even maintain a list of definitions for the fields in their database.
Longoria: Can you walk us through the role the editors and reporters played in knowing exactly what to request and how to process it all? What was the decision-making process?
Matt: I was doing a lot of this early work myself, trying to figure out what was in the database. It was a lot of back-and-forth communication with the Department of Labor: finding a few fields I thought meant one thing and asking them to clarify what they meant, but also trying to decode the responses. For instance, for months the Department of Labor insisted they didn't track cause of death, which was one of the things we were interested in. We wanted to know whether a person's occupational illness contributed to their death. It took a few rounds of this back and forth before I realized what was going on. Every time the Department of Labor responded to me saying they didn't track cause of death, they would put "cause of death" in quotes. That led me to believe they were calling it something different. We learned that, in fact, in a subset of cases they did track this, and we learned which fields were important to it. It just took a lot of persistence on our part. I had a great editor who backed me up on this and got involved when he needed to, when it took some of what he calls "big-footing" the Department of Labor: going over people's heads, calling the bosses if necessary, to get a response when we needed one.
Longoria: So how many negotiation processes did you go through?
Matt: I don't know if I could even count them. There was a long back and forth with the Department of Labor through the entire process — from the beginning, learning what was in the data set; then, once we had the data set, discovering problems with it, things in our request they didn't comply with, what certain data meant, the formatting of how they gave it to us. There was also the fact that, in order to give us this database, they had to take their large relational database and cram it into a few giant spreadsheets. My task as the data reporter on the project was to take those spreadsheets and reverse-engineer them back into relational data, which took a substantial cleaning effort. Throughout that process, too, there were always little things I discovered that I had to go back to the Department of Labor about, to make sure I had the relationships in the data right as I was piecing it back together.
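To give a sense of what that reverse engineering can involve, here is a minimal sketch in pandas of un-flattening one-row-per-case spreadsheet data in which several employment records are packed into comma-separated cells. The column names and values are invented for illustration; Matt does not describe his actual tooling or schema.

```python
import pandas as pd

# Hypothetical flattened export: one row per case, with multiple
# employment records crammed into comma-separated cells.
cases = pd.DataFrame({
    "case_id": ["C1", "C2"],
    "work_sites": ["Hanford,Savannah River", "Rocky Flats"],
    "work_periods": ["1952-1961,1963-1970", "1975-1989"],
})

# Split the packed cells into lists, then explode so each employment
# record becomes its own row in a child table keyed on case_id.
employment = (
    cases.assign(
        site=cases["work_sites"].str.split(","),
        period=cases["work_periods"].str.split(","),
    )
    .explode(["site", "period"])
    [["case_id", "site", "period"]]
)

print(employment)
```

The child table can then be joined back to the case table on `case_id`, which is the basic move in rebuilding a relational structure from a flattened one.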
Longoria: Can you talk us through that process a little bit, of cleaning up the data?
Matt: Like I said, the biggest issue was that this data was never meant to be displayed in the spreadsheet form it was given to us in. For instance, in the largest spreadsheet they gave us, every row was a case. A worker might have several claims on a case, and the dates of when they worked might sit in a single cell on a spreadsheet; if a person worked during different periods, at different sites, those things would be separated by commas within a single cell. Also, all of these data fields were used for very specific purposes in the claims adjudication process. They weren't necessarily the same as what we were trying to use them for. There was a lot of combining fields and manipulating the data. For instance, date of death was spread across five different fields. I had to amalgamate those, take the best information, and combine them into a single field. There are five different fields because there are five different forms a claimant could have reported the date on, and whichever form they used determined which field it landed in within this claims adjudication database. Also, a lot of the information in here was self-reported by claimants, which makes it extremely difficult for analysis. For instance, job titles were all self-reported, so they were wildly inconsistent. Things were called different things. They were full of spelling errors. So I had to generate analysis fields for things like that, where analysis was impossible given how varied they were. In some of those self-reported fields, I was able to clean things up enough with tools like OpenRefine to make them uniform enough that we could crunch them, but the initial format of the data was incredibly messy. And it was a long process to turn that into a relational database that we could then query for analysis.
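Two of the cleaning steps Matt mentions could be sketched like this: coalescing a date spread across form-specific fields, and collapsing trivially different spellings with a fingerprint key in the spirit of OpenRefine's key-collision clustering. The field names are invented, and the fingerprint function is a simplified stand-in rather than a record of Matt's actual workflow.

```python
import pandas as pd

# Hypothetical export: the same date reported on different claim forms
# lands in different fields, and job titles are self-reported free text.
df = pd.DataFrame({
    "dod_form_a": [None, "1998-03-02"],
    "dod_form_b": ["1994-07-15", None],
    "dod_form_c": [None, None],
    "job_title": ["Senior Engineer", "engineer, senior"],
})

# Coalesce: take the first non-null value across the form-specific fields.
date_cols = ["dod_form_a", "dod_form_b", "dod_form_c"]
df["date_of_death"] = df[date_cols].bfill(axis=1).iloc[:, 0]

def fingerprint(title: str) -> str:
    """Simplified OpenRefine-style key: lowercase, strip punctuation,
    sort unique tokens, so near-duplicate spellings collide."""
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in title.lower())
    return " ".join(sorted(set(cleaned.split())))

# Both spellings above collapse to the same key, "engineer senior".
df["job_title_key"] = df["job_title"].map(fingerprint)

print(df[["date_of_death", "job_title_key"]])
```

Fingerprinting only catches near-duplicates; abbreviations and genuinely different wordings still need the kind of manual review Matt describes later.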
Longoria: Why was it important to show the reader the sheer number of employees affected by their work at nuclear plants? What effect does this, or should this, have on readers and their perception of the written narrative?
Matt: This piece is equal parts history corrective — a lot of people are taught that the Cold War was a war without casualties — and a demonstration that this problem is not a thing of the past. The whole second half of our project is about the current nuclear weapons complex. After we spent 12 billion dollars compensating people for their occupational illnesses, there are still accidents, and health benefits are being slashed at these plants. It was also a very timely project in that the United States is just about to embark on a giant and costly nuclear weapons modernization program, so a lot of these sites will be more active than they have been in years. We thought it was prime time to address the real human toll of nuclear weapons development. Because the data we were given was anonymous, and necessarily so because it involved people's health, every person in the database was given a numerical identifier, and from the start what we wanted to do was somehow humanize this anonymous database. We didn't want to just bombard people with charts and graphs and facts and figures, which can be dehumanizing in a way, taking a 10,000-foot view of the situation. We wanted to convey a sense of the scale while at the same time grounding the data in the real human lives affected.
Longoria: And how exactly was this accomplished?
Matt: A big part of that was the design concept and the algorithm. When you look at the story, there are these person icons that are the literal backdrop of the story. Each one of them represents a worker. There are 107,000 of them. When you click on one, it pops up a micro-story that was generated by the algorithm. We wanted to display the data on every single person, but we didn't want it to be just bullet points. We thought a way to humanize the workers was to write an algorithm. Our developer, Danny Dougherty, wrote an algorithm that queried the database I had built and generated a story about each person from the data set, stringing it all together. When it's woven together and readers are confronted with the scale of it — with how many people have been affected — the effect is really striking. Writing that algorithm was quite an effort because it took a lot of iterations, a lot of trial and error. There were so many variations to account for, and there were also issues with the data that made it extremely difficult. For example, we didn't know the gender of a person, which makes pronoun choices and sentence structure very difficult to handle. Danny would write the algorithm, and I would go through and pore over the output. We'd find the cases that weren't accounted for: if a claim was filed by a survivor instead of the employee, for example, or if a story read oddly when a person had more than one job title. So we just kept trying and trying again, and I think the algorithm we ended up with accounts for most, if not all, of the different variations.
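The interview doesn't show Danny Dougherty's actual code, but a toy version of this kind of template-driven story generation, handling the survivor-filed branch and sidestepping the missing-gender problem with singular "they," might look like the following sketch. All field names and phrasings here are invented.

```python
def micro_story(worker: dict) -> str:
    """Assemble one worker's micro-story from templated fragments."""
    sentences = []

    # Survivor-filed claims need different phrasing from worker-filed ones.
    if worker.get("filed_by_survivor"):
        sentences.append("A survivor filed this claim on the worker's behalf.")
    else:
        sentences.append("This worker filed a claim for compensation.")

    # The data carries no gender field, so the templates avoid gendered
    # pronouns entirely and use singular "they" instead.
    titles = worker.get("job_titles", [])
    sites = worker.get("sites", [])
    if titles and sites:
        sentences.append(
            "They worked as a " + ", then a ".join(titles)
            + " at " + " and ".join(sites) + "."
        )

    if worker.get("approved"):
        sentences.append("The claim was approved.")

    return " ".join(sentences)

print(micro_story({
    "filed_by_survivor": True,
    "job_titles": ["pipefitting trainee", "production foreman"],
    "sites": ["Hanford"],
    "approved": True,
}))
```

The iteration Matt describes amounts to adding branches like these, one for each variation the output review turned up.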
Longoria: What would you say was the most difficult part of the entire project?
Matt: There were a lot of things that were difficult in different ways. Conceptually, what we were doing was very difficult — taking a large dataset that was meant for a very specific purpose and using it for something completely different. That database wasn't designed by researchers trying to learn about the lives of nuclear workers, which is what we wanted to do. It was meant for adjudicating claims. Most of what made this project challenging arose from that. A lot of what made it challenging also came from the standards we set for ourselves in telling these stories. There were much easier ways we could have presented 107,000 stories. Our approach was that we wanted each of those stories to read well, and to tell each one the best we could. I'll give you an example of something that was very time consuming. Job titles were taken from the analysis fields I generated, which are clean and uniform but not very descriptive. So when the algorithm was first churning out these stories, it was saying a person's job title was "labor," "management," or even "miscellaneous." What we decided to do when we saw those early iterations was to go back to the self-reported field for job titles. The problem there was that, because they were self-reported, they were incredibly messy. In fact, they were so messy there isn't really a tool that could help you clean them quickly. I spent about a week, more or less around the clock, cleaning 188,000 rows of job title data by hand, editing each one individually. The result was that, rather than having someone's job description be "labor" or "management," you got that the person was a senior engineer on a task force, a graduate intern at one of the labs, a pipefitting trainee who eventually became a production foreman. I'd like to think we gave all 107,000 of these stories the care they deserved. That's a lot of stories to tell, and it was difficult to devote time and attention to each of them. I should also say, going through those self-reported job fields line by line led me to stories I would never have found any other way. I found that a former lieutenant governor of Alaska got sick from nuclear weapons testing in Alaska. I found out that 25 CEOs of contractor companies had gotten sick. I found a person who developed one of the first hand-held Geiger counters, an early safety advance. Going through those fields line by line, I found gems and leads to other stories.
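Once that week of hand editing produced a crosswalk from raw self-reported titles to cleaned ones, applying it back to the data is the easy part. A minimal sketch, assuming hypothetical file and column names:

```python
import pandas as pd

# Hypothetical crosswalk built up during the manual cleaning pass:
# one row per raw self-reported title, mapped to a cleaned title.
crosswalk = pd.read_csv("job_title_crosswalk.csv")   # columns: raw_title, clean_title

claims = pd.read_csv("claims.csv")                   # includes job_title_raw
claims = claims.merge(
    crosswalk, how="left", left_on="job_title_raw", right_on="raw_title"
)

# Fall back to the coarse analysis category ("labor", "management", ...)
# only where no hand-edited title exists.
claims["display_title"] = claims["clean_title"].fillna(claims["job_title_category"])
```

Keeping the corrections in a separate lookup table, rather than overwriting the source column, preserves the original self-reported values for exactly the kind of serendipitous finds Matt describes.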