What is Amnesia?
Amnesia is a flexible data anonymization tool that transforms relational and transactional databases into datasets for which formal privacy guarantees hold.
Data anonymization
Amnesia implements data anonymization techniques from the field of Privacy Preserving Data Publishing (PPDP). The key idea in anonymization is that identifying information is removed from the published data, so that no sensitive information can be attributed to a person. The anonymization procedure is not limited to the removal of direct identifiers that might exist in a dataset, e.g., the name or the Social Security Number of a person; it also includes removing secondary information, e.g., age or zip code, that might lead indirectly to the true identity of an individual. This secondary information is referred to as quasi-identifiers.
To better understand how secondary information can be used to re-identify a person, consider the following example. A publisher that owns medical data of patients wants to publish an anonymized version of that data. The data are superficially anonymized by removing direct identifiers, e.g., names and social security numbers, but descriptive information such as the zip code of the patient’s residence and the patient’s age remains. An adversary who wants to identify the patients behind the anonymized records may have access to the same descriptive information from other sources, e.g., a voter registry. Re-identification can be achieved by matching the descriptive information (Zipcode, Age) of the anonymized data against the public registry. If a given combination produces a single match, then the patient can be accurately identified. The sparser the data are, the more unique combinations exist, and the easier it is for an adversary to locate unique records that correspond to specific individuals.
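The matching step in this example can be illustrated with a minimal Python sketch. It is not part of Amnesia; the function names, field names, and all records below are hypothetical and chosen only to show how a join on quasi-identifiers singles out individuals.

```python
from collections import defaultdict

def link_records(anonymized, registry, quasi_identifiers):
    """For each anonymized record, list the registry entries that agree
    with it on every quasi-identifier."""
    # Index the public registry by its quasi-identifier values.
    index = defaultdict(list)
    for person in registry:
        key = tuple(person[q] for q in quasi_identifiers)
        index[key].append(person)
    # Look up each anonymized record in that index.
    matches = {}
    for i, record in enumerate(anonymized):
        key = tuple(record[q] for q in quasi_identifiers)
        matches[i] = index.get(key, [])
    return matches

def reidentified(matches):
    """Keep only the records that match exactly one registry entry."""
    return {i: m[0]["name"] for i, m in matches.items() if len(m) == 1}

# Hypothetical data: names were removed, Zipcode and Age were kept.
anonymized = [
    {"zipcode": "12345", "age": 34, "diagnosis": "flu"},
    {"zipcode": "12345", "age": 51, "diagnosis": "asthma"},
    {"zipcode": "54321", "age": 29, "diagnosis": "diabetes"},
]
# Hypothetical public registry available to the adversary.
registry = [
    {"name": "Alice", "zipcode": "12345", "age": 34},
    {"name": "Bob", "zipcode": "12345", "age": 51},
    {"name": "Carol", "zipcode": "12345", "age": 51},
    {"name": "Dave", "zipcode": "54321", "age": 29},
]

matches = link_records(anonymized, registry, ["zipcode", "age"])
exposed = reidentified(matches)
```

In this made-up data, records 0 and 2 each match a single registry entry and are therefore re-identified, while record 1 matches two people and remains ambiguous.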
Anonymization techniques present descriptive information in an obscured or generalized way to guarantee that such matches cannot take place. There are several methods for transforming the data and several different guarantees that can be provided for the anonymized data. Amnesia currently focuses on k-anonymity. k-anonymity guarantees that every record in the anonymized data is indistinguishable from at least k-1 other records in the same dataset with respect to the quasi-identifiers. An example of k-anonymization (for k=4) appears in Figure 2. Consider the leftmost table: Age and Zipcode are quasi-identifiers that can be used to re-identify a person in the anonymized data. Diagnosis is the sensitive information in each record and is not used as a quasi-identifier: it is not common knowledge, and if the adversary already has it, then there is nothing more to be revealed by re-identification. The anonymization process transforms the quasi-identifiers into a form where each combination of values appears at least k=4 times.
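The k-anonymity condition can be checked with a few lines of Python. The sketch below is only an illustration of the definition, not Amnesia's implementation; the function name and the sample records are hypothetical and do not reproduce the contents of Figure 2.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    that occurs in the records occurs at least k times."""
    counts = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return all(count >= k for count in counts.values())

# Hypothetical original records: every (age, zipcode) pair is unique.
original = [
    {"age": 31, "zipcode": "12341", "diagnosis": "flu"},
    {"age": 34, "zipcode": "12345", "diagnosis": "asthma"},
    {"age": 36, "zipcode": "12347", "diagnosis": "diabetes"},
    {"age": 39, "zipcode": "12349", "diagnosis": "flu"},
]

# The same records after generalizing the quasi-identifiers
# (age to a range, zipcode to a prefix); diagnosis is left untouched.
generalized = [
    {"age": "30-39", "zipcode": "1234*", "diagnosis": r["diagnosis"]}
    for r in original
]

before = is_k_anonymous(original, ["age", "zipcode"], k=4)
after = is_k_anonymous(generalized, ["age", "zipcode"], k=4)
```

Here `before` is `False` because each original record is unique on (age, zipcode), and `after` is `True` because all four generalized records share the same quasi-identifier values.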
Anonymization algorithms have to transform the data into a form that provides a privacy guarantee with the minimum possible distortion of the original data. A significant challenge for every anonymization method is to provide the best trade-off between the strength of the privacy guarantee and the quality of the anonymized data. In real-world applications the number of quasi-identifiers is often very large, and it is difficult to provide k-anonymity and at the same time preserve any useful information in the anonymized data. To tackle this problem in high-dimensional data, Amnesia supports another, more flexible privacy guarantee, namely k^m-anonymity. k^m-anonymity requires that each combination of up to m quasi-identifiers appears at least k times in the published data. The intuition behind k^m-anonymity is that there is little privacy gain from protecting against adversaries who already know most of the terms of a record, and significant information loss in the effort to do so. An example of how k^m-anonymity works is presented in Figure 3, which depicts customers who have bought products at a supermarket (in a very simplified form). For more information on k^m-anonymity see .
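The k^m-anonymity condition on transactional data can be sketched in Python as follows. This is a brute-force illustration of the definition, not Amnesia's algorithm; the function names and the supermarket baskets are hypothetical and do not reproduce Figure 3. Enumerating every combination is exponential in m, so the sketch is only practical for small m.

```python
from collections import Counter
from itertools import combinations

def violating_itemsets(transactions, k, m):
    """Return the combinations of up to m items that occur in the data
    but are contained in fewer than k transactions."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # canonical order, no duplicates
        for size in range(1, m + 1):
            # Every size-`size` subset of this transaction is supported by it.
            counts.update(combinations(items, size))
    return {combo: n for combo, n in counts.items() if n < k}

def is_km_anonymous(transactions, k, m):
    """True if no combination of up to m items is rarer than k."""
    return not violating_itemsets(transactions, k, m)

# Hypothetical supermarket baskets.
baskets = [
    ["milk", "bread"],
    ["milk", "bread"],
    ["milk", "beer"],
    ["milk", "beer"],
    ["milk", "bread", "beer"],
]

ok_m1 = is_km_anonymous(baskets, k=2, m=1)
ok_m2 = is_km_anonymous(baskets, k=2, m=2)
rare = violating_itemsets(baskets, k=2, m=2)
```

With these baskets every single item appears at least twice, so the data is 2^1-anonymous. The pair (beer, bread) appears in only one basket, so an adversary who knows two purchased items could single out that customer, and the data is not 2^2-anonymous.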