What is Amnesia?

Amnesia is a flexible data anonymization tool that transforms relational and transactional databases into datasets for which formal privacy guarantees hold.

Data anonymization

Amnesia implements data anonymization techniques from the field of Privacy Preserving Data Publishing (PPDP). The key idea in anonymization is that identifying information is removed from the published data, so that no sensitive information can be attributed to a person. The anonymization procedure is not limited to the removal of direct identifiers that might exist in a dataset, e.g., the name or the Social Security Number of a person; it also includes removing or obscuring secondary information, e.g., age or zip code, that might lead indirectly to the true identity of an individual. This secondary information is referred to as quasi-identifiers.

To better understand how secondary information can be used to re-identify a person, consider the following example. A publisher who owns medical data of patients wants to publish an anonymized version of the data. The data are superficially anonymized by removing direct identifiers, e.g., names and Social Security Numbers, but descriptive information such as the zip code of the patient’s residence and the patient’s age remains. An adversary who wants to identify the patients behind the anonymized data may have access to such descriptive information from other sources, e.g., a voter registry. Re-identification can be achieved by matching the descriptive information (zip code, age) in the anonymized data against the public registry. If a given combination produces a single match, then the patient can be accurately identified. The sparser the data are, the more unique combinations exist, and the easier it is for an adversary to locate unique records that correspond to specific individuals.

Figure 1 - Data linking
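
The linking attack described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the names, and the `link` helper are all hypothetical and are not part of Amnesia.

```python
from collections import defaultdict

# Hypothetical "superficially anonymized" release: names removed,
# quasi-identifiers (zip, age) still present.
medical = [
    {"zip": "10001", "age": 34, "diagnosis": "flu"},
    {"zip": "10001", "age": 51, "diagnosis": "diabetes"},
    {"zip": "10002", "age": 34, "diagnosis": "asthma"},
]

# Hypothetical public registry that contains names.
registry = [
    {"name": "Alice", "zip": "10001", "age": 34},
    {"name": "Bob", "zip": "10001", "age": 51},
    {"name": "Carol", "zip": "10002", "age": 34},
    {"name": "Dave", "zip": "10002", "age": 34},
]

def link(released, public, quasi_identifiers):
    """Return (name, diagnosis) pairs for released records that match
    exactly one registry entry on the quasi-identifiers."""
    index = defaultdict(list)
    for person in public:
        key = tuple(person[q] for q in quasi_identifiers)
        index[key].append(person["name"])
    matches = []
    for record in released:
        key = tuple(record[q] for q in quasi_identifiers)
        candidates = index.get(key, [])
        if len(candidates) == 1:  # a single match means re-identification
            matches.append((candidates[0], record["diagnosis"]))
    return matches

reidentified = link(medical, registry, ["zip", "age"])
```

In this toy data, the first two released records each match a single registry entry and are re-identified, while the third matches two people and stays ambiguous.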

Anonymization techniques present descriptive information in an obscured or generalized way, to guarantee that such matches cannot take place. There are several methods for transforming the data and several different guarantees that can be provided for the anonymized data. Amnesia currently focuses on k-anonymity [1]. k-anonymity guarantees that every record in the anonymized data is indistinguishable from at least k-1 other records in the same dataset with respect to the quasi-identifiers. An example of k-anonymization (for k=4) appears in Figure 2. Consider the leftmost table: Age and Zipcode are quasi-identifiers that can be used to re-identify a person in the anonymized data. Diagnosis is the sensitive information in each record and is not used as a quasi-identifier: it is not common knowledge, and if the adversary already has it, then re-identification reveals nothing more. The anonymization process transforms the quasi-identifiers into a form where each combination of values appears at least k=4 times.

Figure 2 - Original and anonymized dataset with k=4
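
The k-anonymity condition can be checked by counting how often each combination of quasi-identifier values occurs. The sketch below assumes records stored as Python dictionaries; the function name `is_k_anonymous` and the sample values are hypothetical and do not reproduce the data of Figure 2.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values that occurs
    in `records` occurs at least k times."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(c >= k for c in counts.values())

# Hypothetical original data: every (age, zip) combination is unique.
original = [
    {"age": 21, "zip": "12345", "diagnosis": "flu"},
    {"age": 24, "zip": "12346", "diagnosis": "asthma"},
    {"age": 27, "zip": "12347", "diagnosis": "diabetes"},
    {"age": 29, "zip": "12348", "diagnosis": "flu"},
]

# The same records after generalizing the quasi-identifiers only;
# the sensitive attribute is left untouched.
anonymized = [
    {"age": "20-29", "zip": "1234*", "diagnosis": r["diagnosis"]}
    for r in original
]

original_ok = is_k_anonymous(original, ["age", "zip"], k=4)
anonymized_ok = is_k_anonymous(anonymized, ["age", "zip"], k=4)
```

Here `original_ok` is false because each combination appears once, and `anonymized_ok` is true because all four records share the same generalized combination.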

Anonymization algorithms have to transform the data into a form that provides a privacy guarantee with the minimum possible distortion of the original data. A significant challenge for every anonymization method is to provide the best trade-off between the strength of the privacy guarantee and the quality of the anonymized data. In real-world applications the number of quasi-identifiers is often very large, and it is difficult to provide k-anonymity and at the same time preserve useful information in the anonymized data. To tackle this problem in high-dimensional data, Amnesia supports another, more flexible privacy guarantee, namely km-anonymity (k with superscript m, also written k^m-anonymity). km-anonymity requires that each combination of up to m quasi-identifiers appears at least k times in the published data. The intuition behind km-anonymity is that there is little privacy gain from protecting against adversaries who already know most of the terms of a record, and significant information loss in the effort to do so. An example of how km-anonymity works is presented in Figure 3, which depicts customers who have bought products at a supermarket (in a very simplified form). For more information on km-anonymity see [2].

Figure 3 - km-anonymity with k=2 and m=2
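
For set-valued data such as shopping baskets, the km-anonymity condition can be checked by counting every combination of up to m items that occurs in some record. The sketch below is a naive check, not the algorithm Amnesia uses; the function name `is_km_anonymous` and the baskets are hypothetical and do not reproduce Figure 3.

```python
from collections import Counter
from itertools import combinations

def is_km_anonymous(transactions, k, m):
    """True if every combination of up to m items that appears in some
    transaction appears in at least k transactions."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # canonical order, no duplicates
        for size in range(1, m + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    return all(c >= k for c in counts.values())

# Hypothetical baskets: each single item appears twice,
# but every pair of items appears only once.
baskets = [
    ["cola", "bread"],
    ["pepsi", "bread"],
    ["cola", "milk"],
    ["pepsi", "milk"],
]

# Generalizing both brands to "soft drink" makes every pair appear twice.
generalized = [
    ["soft drink" if item in ("cola", "pepsi") else item for item in basket]
    for basket in baskets
]

before = is_km_anonymous(baskets, k=2, m=2)
after = is_km_anonymous(generalized, k=2, m=2)
```

The original baskets satisfy the condition for m=1 but not for m=2; after generalization the condition holds for k=2 and m=2.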

Amnesia algorithms

Amnesia reads the original data and transforms them into anonymized data by using generalization and suppression. Generalization is the substitution of a value in the original file, e.g., "Athens", with a more abstract one, e.g., "Greece". Substitutions take place according to a predefined hierarchy of values, e.g., "Athens" < "Greece" < "Europe", which is either defined by the user or created automatically by the tool. Currently Amnesia supports global recoding, i.e., when a value, e.g., "Greece", is replaced by a more abstract one, e.g., "Europe", all appearances of this value in the specified attribute are replaced by the new abstract value.

Amnesia supports two algorithms for k-anonymity: Incognito [3] and a parallel version of the Flash algorithm [4]. Both algorithms find the optimal solution. Incognito is provided as a point of reference, since the parallel version of Flash outperforms it in all settings. These algorithms follow the strategy of full domain generalization, i.e., when one value, e.g., "Greece", is generalized to a more abstract value, e.g., "Europe", all values at the same generalization level as "Greece", i.e., single countries, are generalized to the same abstract level as "Europe", i.e., continents. The k-anonymization algorithms produce a lattice of candidate solutions, which the user can explore visually. Different solutions correspond to different generalization levels for each quasi-identifier. The user can explore the quality of the data with ad hoc queries and a visual representation of the value distribution. The user can also choose to suppress outliers in cases where generalization alone is not enough to guarantee k-anonymity. km-anonymity is provided by the a-priori algorithm [5], which is adjusted to work for both relational and set-valued data.

Amnesia focuses on usability and flexibility, to allow the user to understand and guide the anonymization process. Since anonymization methods have not been extensively used in practice, it is essential that users are able to tailor the anonymization process, and especially the information loss in the anonymized data, to their needs. To this end Amnesia has an easy-to-use hierarchy creator and editor that allows the semi-automatic creation of hierarchies, and several tools to assess the information loss in the candidate anonymization solutions.
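
Full domain generalization and the lattice of candidate solutions can be illustrated with a small brute-force sketch. It is not Incognito or Flash, which prune the lattice far more cleverly; it simply tries every combination of generalization levels. The hierarchies, records, and function names below are hypothetical.

```python
from collections import Counter
from itertools import product

# Hypothetical hierarchies: each value maps to its generalizations,
# index 0 being the original value and higher indexes more abstract.
CITY = {
    "Athens": ["Athens", "Greece", "Europe"],
    "Patras": ["Patras", "Greece", "Europe"],
    "Rome": ["Rome", "Italy", "Europe"],
    "Milan": ["Milan", "Italy", "Europe"],
}
AGE = {a: [str(a), f"{a // 10 * 10}-{a // 10 * 10 + 9}", "*"] for a in range(100)}

def generalize(records, hierarchies, levels):
    """Full domain generalization: all values of attribute i are
    replaced by their ancestor at level levels[i]."""
    return [
        tuple(hierarchies[i][value][levels[i]] for i, value in enumerate(record))
        for record in records
    ]

def is_k_anonymous(rows, k):
    return all(count >= k for count in Counter(rows).values())

def minimal_solutions(records, hierarchies, k):
    """Enumerate every node of the lattice (one level per attribute) and
    keep the k-anonymous nodes that have no k-anonymous node below them."""
    heights = [len(next(iter(h.values()))) for h in hierarchies]
    safe = [
        levels
        for levels in product(*(range(h) for h in heights))
        if is_k_anonymous(generalize(records, hierarchies, levels), k)
    ]

    def below(a, b):  # a generalizes no attribute more than b, and a != b
        return a != b and all(x <= y for x, y in zip(a, b))

    return [s for s in safe if not any(below(o, s) for o in safe)]

records = [("Athens", 23), ("Patras", 27), ("Rome", 31), ("Milan", 38)]
solutions = minimal_solutions(records, [CITY, AGE], k=2)
published = generalize(records, [CITY, AGE], solutions[0])
```

For these four records and k=2, the only minimal node generalizes cities to countries and ages to decades. A real tool would additionally rank candidate nodes by information loss, which this sketch leaves out.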