Optimization Technology for Data Security
Government agencies and commercial organizations that collect, store and publish data typically have a responsibility to protect the confidentiality of the sources of these data. This responsibility has become more difficult as the quantity of published data has increased and methods to extract information from data have grown more sophisticated. These agencies and organizations release many data products in the form of statistical tables, which are tabulations derived from information collected on individual persons or establishments. A typical survey can involve a large number of interrelated higher-dimensional tables, making analysis and assessment complex. Several statistical methods exist for the analysis or adjustment of tabular data, notably iterative proportional fitting, log-linear models, and Markov chain Monte Carlo sampling. A number of important problems in survey methodology, notably statistical disclosure limitation, data editing and two-way stratified survey sampling, have been formulated as mathematical optimization models.
OptTek has developed advanced optimization technologies to address the challenges of confidentiality protection. With significant support from the Centers for Disease Control and Prevention’s National Center for Health Statistics and the US Department of Transportation’s Bureau of Transportation Statistics, OptTek has developed mathematical and statistical methodologies that limit disclosure while maximizing the quality of the released information. This research area is called controlled tabular adjustment (CTA).
The underlying techniques are based upon Synthetic Data Substitution (SDS). SDS, as introduced by Dandekar and Cox (2002), overcomes many of the problems associated with traditional cell suppression and perturbation methods. SDS introduces controlled perturbations into tabular data based on suppression protection ranges, and it minimizes data loss, measured as the amount of perturbation required to achieve the protection level specified by those ranges. The goal is to establish a table with minimal data loss that has a desired set of protection ranges. SDS is computationally scalable to large multi-dimensional tables and provides confidentiality protection in such a way that the outside observer has no information on how the data were modified. This reduces the risk of disclosure and information loss. SDS is a new method that can be viewed as a combination of cell suppression and controlled rounding. It offers the promise of enhancing data access while protecting confidentiality, all within an efficient computational method.
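The goal stated above can be written as a small optimization problem. The following is only a sketch in assumed notation, not a formulation taken from OptTek or from Dandekar and Cox: a_i is the original value of cell i, x_i its adjusted value, S the set of sensitive cells, l_i and u_i the lower and upper protection limits of a sensitive cell, and M a matrix encoding the additive relations among cells and totals, so that M a = 0 holds for the original table.

```latex
\begin{aligned}
\min_{x}\quad & \sum_{i=1}^{n} \lvert x_i - a_i \rvert \\
\text{subject to}\quad & M x = 0, \\
& x_i \le a_i - l_i \quad \text{or} \quad x_i \ge a_i + u_i, \qquad i \in S .
\end{aligned}
```

The either/or condition on each sensitive cell is what makes the problem combinatorial: one direction has to be chosen per sensitive cell, and the objective then measures the total perturbation needed to keep the table additive.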
The objective in generating synthetic tabular data is to perturb the original data as little as possible while modifying sensitive cell values enough to ensure confidentiality. The value of each sensitive cell is replaced by a “synthetic” value selected to lie at the minimum distance from the true cell value that still provides protection. This minimum distance is determined by the protection limit rules given for cell suppression. Some of the nonsensitive cell values are then adjusted from their true values by as small an amount as possible to restore original totals within the tabular system. An optimal solution replaces the sensitive cells with new values and minimizes the disruption of the rest of the data. An appropriate function, such as the sum of the absolute values of the perturbations, can be used to define the optimal solution. Additionally, nonsensitive cell perturbations are limited to be within sampling variability or some other legitimate limit. In general, zero cells are not modified. The end result is a completely populated table that affords the proper protection for the sensitive cells. An intruder is stymied by the fact that sensitive cells are not identified, as they would be if cell suppression were applied.
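The steps described above can be illustrated on a very small table. The following Python sketch is a hypothetical illustration, not OptTek's implementation: it replaces the optimization solver with an exhaustive search, which is only practical for tiny tables of integer counts. The function name `adjust_table` and the parameters `sensitive` and `capacity` are invented for this example, and it assumes that each sensitive cell moves exactly to its protection limit, that row and column totals are held fixed, and that adjusted values must stay nonnegative.

```python
from itertools import product


def adjust_table(table, sensitive, capacity):
    """Exhaustive-search sketch of controlled adjustment for a tiny two-way table.

    table     : list of rows of nonnegative integer cell values
    sensitive : dict {(row, col): protection_limit} (hypothetical input format)
    capacity  : largest absolute change allowed for any nonsensitive cell
    Returns (adjusted_table, total_absolute_change), or None if infeasible.
    """
    n_rows, n_cols = len(table), len(table[0])

    # Build the list of allowed changes for every cell, in row-major order.
    choices = []
    for r in range(n_rows):
        for c in range(n_cols):
            value = table[r][c]
            if (r, c) in sensitive:
                p = sensitive[(r, c)]
                # Move up by p, or down by p if the cell stays nonnegative.
                choices.append((p, -p) if value >= p else (p,))
            elif value == 0:
                choices.append((0,))  # zero cells are not modified
            else:
                low = -min(capacity, value)  # keep adjusted value nonnegative
                choices.append(tuple(range(low, capacity + 1)))

    best_cost, best_grid = None, None
    for deltas in product(*choices):
        grid = [deltas[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]
        # Additivity: every row total and column total must be unchanged.
        if any(sum(row) != 0 for row in grid):
            continue
        if any(sum(col) != 0 for col in zip(*grid)):
            continue
        cost = sum(abs(d) for d in deltas)  # sum of absolute perturbations
        if best_cost is None or cost < best_cost:
            best_cost, best_grid = cost, grid

    if best_grid is None:
        return None
    adjusted = [[table[r][c] + best_grid[r][c] for c in range(n_cols)]
                for r in range(n_rows)]
    return adjusted, best_cost


# Made-up example: cell (1, 0) is sensitive with a protection limit of 3.
example_table = [[20, 50, 10],
                 [8, 19, 22],
                 [17, 0, 12]]
adjusted, cost = adjust_table(example_table, sensitive={(1, 0): 3}, capacity=2)
print(adjusted, cost)
```

On this example the search finds a total absolute change of 12, four times the protection limit. With one sensitive cell and fixed row and column totals, that is the smallest change possible, because the move must be offset within its row, within its column, and again among the remaining cells. Several adjusted tables achieve this cost, and the sketch returns the first one it encounters. A table of realistic size would need a linear or mixed-integer programming solver in place of the enumeration.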