Research Lead: Matthew Bishop
UC Campus(es): UC Davis
Problem Statement: Policy development is driven by information, some of which involves people and organizations and therefore includes personally identifiable information (PII) and other sensitive information. Typically, sensitive information cannot be released, so if data is to be released to justify a policy, or the manner in which a policy is implemented, the data must first be sanitized to remove any sensitive data points. This must be done in a way that preserves the utility of the data, so that the policies and practices can still be justified. Improving transportation infrastructure requires gathering data to be used in developing, or validating, transportation policies and practices. Sometimes this data must be made available to third parties to carry out analyses or to independently confirm claims made about policy elements and the way policies are put into practice. Data can be released in two forms: raw data, which is the data as it is gathered, and aggregated data, in which only summaries are released. One approach to preventing the release of sensitive information is to sanitize the data, i.e., to remove or suppress the sensitive information. The goal of sanitization is to protect sensitive information while enabling analyses of the data that produce the same results as analyses of the unsanitized data. Protecting that information, however, requires that the sanitized data cannot be linked to data from other sources in a manner that leads to desanitization.
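The removal and suppression strategies described above can be illustrated with a minimal sketch. The record fields below (a hypothetical transportation-survey entry) are invented for illustration; real sanitization pipelines are considerably more involved.

```python
def sanitize(record):
    """Return a sanitized copy of one hypothetical survey record."""
    clean = dict(record)
    # Suppression: drop direct identifiers entirely.
    for field in ("name", "license_plate"):
        clean.pop(field, None)
    # Generalization: coarsen quasi-identifiers that could be linked
    # to external data (exact age -> 10-year bracket; 5-digit ZIP ->
    # 3-digit prefix).
    if "age" in clean:
        low = (clean["age"] // 10) * 10
        clean["age"] = f"{low}-{low + 9}"
    if "zip" in clean:
        clean["zip"] = clean["zip"][:3] + "**"
    return clean

record = {"name": "J. Doe", "license_plate": "7ABC123",
          "age": 47, "zip": "95616", "trips_per_week": 9}
print(sanitize(record))
# -> {'age': '40-49', 'zip': '956**', 'trips_per_week': 9}
```

The tension the problem statement describes is visible even here: the retained fields keep the data useful for analysis (e.g., trips per week by age bracket), yet the coarsened quasi-identifiers are exactly what a linkage attack would target.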
Project Description: This project reviews typical strategies used to sanitize datasets, research showing how some of these strategies fail, and the questions that must be addressed to better understand the risks of desanitization. The project also precisely identifies and characterizes the gaps in existing research on data sanitization, with the aim of preserving the balance between utility and privacy. To identify the research gaps that create uncertainty, the researcher met with a variety of parties, including policy makers, data analysts, and privacy officers, to answer the following questions: What specific threats is data sanitization intended to guard against, and how can one validate that those are the correct threats? How do we prevent false inferences from being drawn from sanitized data? How do we determine what external data is needed to desanitize the target data? And what information must one or more external datasets contain so that an adversary could desanitize enough of the sanitized dataset for their purpose?
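The last two questions concern linkage attacks, and a minimal sketch may make the risk concrete. The datasets and field names below are hypothetical: an adversary joins a "sanitized" release with an external dataset (e.g., a public roster) on shared quasi-identifiers, and any row that matches exactly one external record is likely re-identified.

```python
# Hypothetical sanitized release: PII removed, quasi-identifiers kept.
sanitized = [
    {"age_bracket": "40-49", "zip3": "956", "commute_mode": "bike"},
    {"age_bracket": "20-29", "zip3": "956", "commute_mode": "car"},
]

# Hypothetical external dataset available to the adversary.
external = [
    {"name": "J. Doe", "age_bracket": "40-49", "zip3": "956"},
]

def link(sanitized_rows, external_rows, keys=("age_bracket", "zip3")):
    """Return sanitized rows whose quasi-identifiers match exactly
    one external record, annotated with that record's name."""
    matches = []
    for row in sanitized_rows:
        hits = [e for e in external_rows
                if all(e[k] == row[k] for k in keys)]
        if len(hits) == 1:  # unique match => probable re-identification
            matches.append({**row, "name": hits[0]["name"]})
    return matches

print(link(sanitized, external))
# The 40-49 / 956 row links uniquely to "J. Doe", so the sanitized
# commute_mode value becomes attributable to a named person.
```

This sketch also frames the project's questions: whether such an external dataset exists, and which fields it must share with the release, determine how much of the sanitized data an adversary can recover.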