When statistical agencies like the U.S. Census Bureau publish data, they have two competing mandates: to provide useful information to policymakers and to ensure the privacy of survey respondents. Managing both is a difficult task, but a new paper from the Terry College of Business, forthcoming in American Economic Review, explains how to strike the right balance.
Better policies come from richer information, but data that is too accurate can reveal private information, such as family income or political preference — things many Americans would rather not have openly published.
For example, the Department of Education allocates federal funding to needy schools under the Elementary and Secondary Education Act of 1965 (commonly called Title I). The Census Bureau helps policymakers decide how to distribute the Title I money by publishing data on how many children qualify for assistance. If the bureau published all of its data, the Department of Education could perfectly allocate funds to the neediest districts, but doing so would expose information about the annual incomes of many families. If the bureau published no data, the Department of Education would allocate either too much or too little money to schools, but the privacy of American families would be protected.
The best way is to find a middle ground that allows the Education Department to fund the right schools without compromising data on individual families, said associate professor of economics Ian Schmutte, who co-authored the research.
“We are confronted with the social choice problem: how much privacy and how much accuracy do we want?” he said. “In the paper, we take a classic model of public goods that says the optimal choice is going to be one where the marginal cost of losing a bit of privacy is exactly equal to the price we would be willing to pay to increase accuracy. So if we could go around and collect everyone’s attitude about how much they value privacy, we could discover where exactly the optimal choice lies.”
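The balancing condition Schmutte describes can be illustrated with a toy calculation. The sketch below is not the paper's model: the benefit and cost curves, the parameters `a` and `c`, and all function names are hypothetical, chosen only to show that the best choice falls where the marginal benefit of accuracy equals the marginal cost of lost privacy.

```python
import math

def accuracy_benefit(privacy_loss, a=4.0):
    """Hypothetical social benefit of more accurate statistics.
    Concave: each extra unit of privacy loss buys less accuracy."""
    return a * math.log(1.0 + privacy_loss)

def privacy_cost(privacy_loss, c=1.0):
    """Hypothetical social cost of giving up privacy (linear here)."""
    return c * privacy_loss

def optimal_privacy_loss(step=0.001, upper=10.0):
    """Grid-search the privacy loss that maximizes benefit minus cost."""
    n = int(round(upper / step))
    grid = [i * step for i in range(n + 1)]
    return max(grid, key=lambda x: accuracy_benefit(x) - privacy_cost(x))

def marginal(f, x, h=1e-6):
    """Central-difference estimate of the slope of f at x."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

best = optimal_privacy_loss()
marginal_benefit = marginal(accuracy_benefit, best)
marginal_cost = marginal(privacy_cost, best)
print(f"optimal privacy loss: {best:.3f}")
print(f"marginal benefit: {marginal_benefit:.3f}, marginal cost: {marginal_cost:.3f}")
```

With these assumed curves, the two marginal values printed at the end come out approximately equal, which is the condition described in the quote. In practice the hard part is the one Schmutte points to: learning how much people actually value privacy and accuracy.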
Modern privacy concerns come from two sources: reconstruction attacks and re-identification attacks. In a reconstruction attack, attackers attempt to rebuild a confidential database using only the statistics published from it. In a re-identification attack, individuals in a data set are identified by combining published data with other available records. Good cybersecurity practice calls for statistical agencies to be vigilant against both, Schmutte said.
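A reconstruction attack can be sketched with a small, entirely made-up example. Suppose an agency publishes a few summary statistics about the ages of three residents on one block. The code below, in which the function name, the age cap and all of the numbers are assumptions for illustration, lists every set of ages consistent with what was published.

```python
from itertools import combinations_with_replacement

MAX_AGE = 115  # assumed upper bound on a plausible age

def consistent_ages(count, mean=None, median=None, oldest=None):
    """List every sorted tuple of ages that matches the published statistics.
    Assumes an odd count (so the median is the middle value) and a mean
    for which mean * count is a whole number."""
    matches = []
    # combinations_with_replacement yields non-decreasing tuples,
    # so each possible set of ages is checked exactly once.
    for ages in combinations_with_replacement(range(MAX_AGE + 1), count):
        if mean is not None and sum(ages) != mean * count:
            continue
        if median is not None and ages[count // 2] != median:
            continue
        if oldest is not None and ages[-1] != oldest:
            continue
        matches.append(ages)
    return matches

# Hypothetical published statistics for a block of three people.
some_stats = consistent_ages(3, mean=44, median=30)
more_stats = consistent_ages(3, mean=44, median=30, oldest=84)
print(len(some_stats), "candidate age sets from the mean and median")
print(len(more_stats), "candidate age sets once the oldest age is added")
```

In this toy case, the mean and median leave 31 possible sets of ages, and adding the oldest resident's age leaves exactly one. Each additional published statistic narrows the possibilities, which is why agencies have to weigh how much detail to release.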
“Today, people might be concerned with citizenship status. If the Census Bureau publishes block-level detail on voting age and citizenship status, it might be possible for somebody to retroactively look at this data and have some knowledge about where non-citizens live and piece together the status of specific individuals by combining the bureau’s data with other information.”
To guard against such attacks, it’s essential that statistical agencies publish only limited sets of data. Schmutte’s research provides scientific guidance on how to weigh modern privacy concerns so that policymakers don’t over- or under-report data based on political pressures.
“We might be in a situation where there’s a small but vocal contingent that is concerned about privacy, which could lead policymakers to a decision that over-emphasizes privacy without thinking about the public benefit that comes from having more accurate statistics,” Schmutte said. “Right now, there is no real coherent or rigorous economic framework to guide policymakers in making these sorts of choices, which is what we’re trying to achieve.”
The framework his research describes is currently being implemented at the Census Bureau as it prepares for the 2020 Decennial Census.
The research, “An Economic Analysis of Privacy Protection and Statistical Accuracy as Social Choices,” was co-authored by John Abowd of the U.S. Census Bureau and Cornell University.