Documentation

The Amnesia anonymization tool is software written in Java and JavaScript and should be used locally for anonymizing personal and sensitive data. The basic idea behind anonymization is that users load a file containing personal data (original data) to Amnesia, and Amnesia transforms it into an anonymous dataset, which can then be stored locally. The transformation is guided by user selections and provides an anonymization guarantee for the resulting dataset. Amnesia currently supports k-anonymity and km-anonymity guarantees. The following five steps summarize the anonymization process.

Dataset

The datasets that Amnesia can process are stored as delimited text files. Each line in the text file is a different record, and each distinct value of the record is separated from the next by a delimiter. The user must provide the delimiter used in the original file to the tool when importing a dataset. Anonymized files are saved in the same format as the original files. The data models supported by Amnesia are relational tables, set collections, and object-relational tables (all of them stored as delimited text files).

  • Relational tables have a fixed number of columns, and as a result, each record has the same number of values. Each column can have a different data type.
  • Set collections are datasets of arbitrary length records. Each record can have an arbitrary number of values of the same data type (currently supporting strings).
  • Object-relational tables are the combination of the above. These tables have a fixed number of columns, but one column is a set, i.e., it contains an arbitrary number of values of the same type. To correctly parse a delimited file containing an object-relational table, the user must provide Amnesia with two delimiters: one separating the values of different columns and one separating the values inside the set column.
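
As a rough illustration of the two-delimiter idea (not Amnesia's own parser), the following Python sketch reads an object-relational file. The function name, the default delimiters, the position of the set column, and the sample records are all assumptions made for this example.

```python
import csv
import io


def parse_object_relational(text, column_delimiter=",", set_delimiter="|", set_column=2):
    """Parse delimited text in which one column holds a set of values."""
    records = []
    reader = csv.reader(io.StringIO(text), delimiter=column_delimiter)
    for row in reader:
        if not row:
            continue  # skip blank lines
        # Split the set-valued column into its individual values.
        row[set_column] = [v for v in row[set_column].split(set_delimiter) if v]
        records.append(row)
    return records


# Hypothetical input: age, city, and a set of diagnoses per record.
sample = "34,Athens,flu|asthma\n51,Berlin,diabetes\n"
for record in parse_object_relational(sample):
    print(record)
```

A relational table corresponds to the case with no set column, and a set collection to the case where every line is a single set.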

Amnesia supports four data types: strings, integers, doubles (floating point), and dates. When loading a dataset, Amnesia will try to guess the data type, but it will do so based only on the first lines of the imported dataset. The user should check these recommendations and correct them accordingly.
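
The reason the guessed types need checking can be shown with a small Python sketch. It is not Amnesia's detection logic; the function names, the sample size, and the date format are assumptions. It infers each column's type from the first lines only, so a later value that does not fit goes unnoticed.

```python
from datetime import datetime


def guess_type(values, date_format="%Y-%m-%d"):
    """Guess 'int', 'double', 'date', or 'string' from sample values."""
    def all_parse(parser):
        try:
            for v in values:
                parser(v)
            return True
        except ValueError:
            return False

    if not values:
        return "string"
    if all_parse(int):
        return "int"
    if all_parse(float):
        return "double"
    if all_parse(lambda v: datetime.strptime(v, date_format)):
        return "date"
    return "string"


def guess_column_types(rows, sample_size=5):
    """Guess the type of every column using only the first sample_size rows."""
    columns = zip(*rows[:sample_size])
    return [guess_type(list(col)) for col in columns]


rows = [
    ["34", "1.5", "2020-01-31", "flu"],
    ["51", "2", "2021-12-01", "asthma"],
    ["N/A", "3.2", "unknown", "flu"],  # this row does not fit the guesses
]
print(guess_column_types(rows, sample_size=2))
```

With a sample of two rows the first column is guessed as an integer even though the third row contains "N/A", which is the kind of mismatch the user should look for.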

Load a dataset

Importing a dataset to Amnesia can be initiated in four different ways:

  1. By clicking the orange button "Choose Dataset" located on the upper-left corner of the Index screen.
  2. By using the left-side menu: "Source" -> "Load From Local".
  3. By clicking the image ("Drop files to upload"): when the dataset appears on the screen, the user should press "Upload", located in the upper-right corner of the image.
  4. Through the dataset screen ("Source" -> "Manage"), by selecting "Load New Dataset", located in the upper-right menu.

After the dataset loading has been initiated, a wizard guides the user to model the data in Amnesia correctly. In the first step, the user should choose how the input dataset should be processed:

  1. Simple Table: a relational table, i.e., a table with a fixed number of columns of possibly different data type.
  2. Sets of values: collections of records with an arbitrary number of values of the same type.
  3. Table with a set-valued attribute: a table with a fixed number of columns where one column is a set. Tables with a set-valued attribute will need two delimiters: one for separating columns and one for separating values inside the set-valued attribute.
  4. Disk-based simple table: a simple table that resides on the hard disk; this is the suitable option for very large datasets.

Amnesia parses the first lines of the dataset and presents a preview to the user. Amnesia guesses the data types, which have to be confirmed by the user. By using the check box next to each attribute, the user can also choose which columns will appear in the output dataset.

Load From Zenodo

Amnesia can import data directly from Zenodo. The process is initiated by selecting the left-side menu: "Source" -> "Load From Zenodo". A wizard guides the user to connect with Zenodo so that Amnesia acquires access to their files. In the first step, the wizard requires the access token found on the user's Zenodo profile. After connecting to the user's account, Amnesia presents a table with the user's full file description. The user can then choose a dataset by selecting it.

Load From Dataverse

Amnesia can import data directly from Dataverse. The process can be initiated by selecting the left-side menu: "Source" -> "Load From Dataverse". A wizard guides the user to give the required credentials so that Amnesia connects with the Dataverse server that contains the user's files. In the first step, the wizard requires the Dataverse server URL, the user access token, and the persistent ID of a specific Dataverse Dataset (e.g., doi:10.70122/FK2/TZAXXS). All necessary information exists on the user's Dataverse profile. After connecting to the user's account, Amnesia presents a table with the user's full file description. The user can then choose a dataset by selecting it.

Load DICOM Images

Users can load DICOM images by clicking the orange button "Choose Dataset" located on the upper-left corner of the Index screen. Then, they must choose "Load folder with DICOM images" and select a folder containing DICOM images. Amnesia loads all .dcm files in the directory and stores sensitive metadata such as "PatientID", "PatientName", "PatientAge", "Modality", "PatientSex", "PatientBirthDate", "PhotometricInterpretation", "BodyPartExamined", "PatientOrientation", "ViewPosition", "ConversionType", and "SamplesPerPixel" as a simple table. After the anonymization, the anonymized table is converted to a compressed folder of .dcm files. By clicking "Save To Local", the user can download the anonymized .zip file.

Save

Datasets can be saved locally only after the pseudo-anonymization process by navigating to the dataset screen ("Source" -> "Manage") on the left-side menu, then by applying pseudo-anonymization to one or more string columns, and finally by clicking "Save To Local", which is located on the upper-right of the screen. Amnesia stores the data in .txt comma-delimited files along with another .txt file which maps the randomized order of the pseudo-anonymized data to the original order.

Save To Zenodo

Amnesia can store datasets to Zenodo. From the left-side menu, the user should navigate to the dataset screen ("Source" -> "Manage") and click "Save To Zenodo", located on the upper-right of the screen. Then, a wizard will guide the user through the process. In the first step, the user will be asked to provide the User Authentication Token (found on their Zenodo account), Author, Affiliation, the Filename, Title, Description, Contributors, and Keywords. In the next step, Amnesia displays a summary describing the dataset to be published in Zenodo that needs user confirmation. The last column of the table is the percentage of similarity between each listed file and the file that the user wants to save. This percentage results from comparing the attributes fileName, keywords, and checksum of the two files. Upon confirmation by the user, the file is published. Amnesia stores the data in .txt comma-delimited files.

Save To Dataverse

Amnesia can store datasets to a Dataverse server. The process is available from the left-side menu; the user should navigate to the dataset screen ("Source" -> "Manage") and click "Save To Dataverse" located on the upper-right of the screen. A wizard, then, assists the user through the process. In the first step, the user will be asked for the User Authentication Token, the Dataverse Dataset persistent ID (both can be found on the user's Dataverse account), and a Description. Amnesia stores the file on the user's Dataverse Dataset specified from the given persistent ID in .txt or .csv comma-delimited files.

Check Anonymization

This option allows the user to check whether the source dataset is already k-anonymous. It is available through the dataset screen ("Source" -> "Manage") by clicking "Check Anonymization". A wizard asks the user for the preferred k parameter of the anonymization guarantee. In the next step, Amnesia presents a pie chart of the dataset. The chart shows all groups of records with their sizes and highlights the percentage of records that fall into groups of size less than k. The user can trivially anonymize the dataset by suppressing all records that fall into the latter category.
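
The check can be sketched in a few lines of Python. This is a minimal illustration of the k-anonymity test and of trivial suppression, not Amnesia's implementation; the function names and the sample records are assumptions.

```python
from collections import Counter


def check_k_anonymity(records, quasi_identifiers, k):
    """Return (group sizes, fraction of records in groups smaller than k)."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    violating = sum(size for size in groups.values() if size < k)
    fraction = violating / len(records) if records else 0.0
    return groups, fraction


def suppress_small_groups(records, quasi_identifiers, k):
    """Drop every record whose quasi-identifier group has fewer than k members."""
    groups, _ = check_k_anonymity(records, quasi_identifiers, k)
    return [r for r in records
            if groups[tuple(r[q] for q in quasi_identifiers)] >= k]


# Hypothetical records; "age" and "zip" act as the quasi-identifiers.
records = [
    {"age": "30-39", "zip": "104**"},
    {"age": "30-39", "zip": "104**"},
    {"age": "40-49", "zip": "546**"},
]
groups, fraction = check_k_anonymity(records, ["age", "zip"], k=2)
print(dict(groups), fraction)
print(suppress_small_groups(records, ["age", "zip"], k=2))
```

The group sizes correspond to the slices of the pie chart, and the fraction is the highlighted share of records in groups smaller than k.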

Load Anonymization Rules

Saved solutions, i.e., collections of rules for generalizing values, can be loaded and applied to different datasets. Note that applying a solution to a different dataset does not guarantee that the new dataset will be anonymous. Loading of anonymization rules is accessible from the dataset screen ("Source" -> "Manage") by selecting "Load Anon Rules" located on the upper-right menu.
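
The caveat about reusing rules can be illustrated with a short Python sketch. The rule representation (a plain value-to-value mapping) and the sample data are assumptions for this example and do not reflect Amnesia's rule file format.

```python
from collections import Counter

# Hypothetical rule set: original value -> generalized value.
RULES = {"Greece": "Europe", "France": "Europe", "Japan": "Asia", "China": "Asia"}


def apply_rules(values, rules):
    """Generalize values; values without a matching rule stay unchanged."""
    return [rules.get(v, v) for v in values]


def is_k_anonymous(values, k):
    """True if every distinct value occurs at least k times."""
    return all(count >= k for count in Counter(values).values())


original = ["Greece", "France", "Japan", "China"]
other = ["Greece", "France", "Japan", "Brazil"]

print(is_k_anonymous(apply_rules(original, RULES), k=2))  # rules fit this dataset
print(is_k_anonymous(apply_rules(other, RULES), k=2))     # rules do not cover "Brazil"
```

The same rules that make the first dataset 2-anonymous leave singleton groups in the second one, so the result should be checked again after loading rules.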

Hierarchy

Generalization hierarchies are sets of rules that define how specific values should be substituted by more general ones when anonymizing the data. The key idea here is that values that are specific enough to be identifying (e.g., a residence zip code) are replaced by more general ones (e.g., city names) so that they can no longer reveal a person's identity. An example hierarchy is depicted in the following figure.

Amnesia will use the hierarchy to replace specific values with more general ones until the privacy guarantee is reached. A characteristic of generalization hierarchies is that all nodes lead up to a single node (root). This property guarantees that Amnesia will be able to replace all values with a common one if needed.
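
One simple way to picture a generalization hierarchy is a child-to-parent map. The Python sketch below is only an illustration; the codes, city names, and function names are made up for this example and are not taken from Amnesia's hierarchy files.

```python
# Hypothetical hierarchy: each node points to its parent, the root points to None.
PARENT = {
    "10431": "Athens",
    "10432": "Athens",
    "54621": "Thessaloniki",
    "Athens": "Greece",
    "Thessaloniki": "Greece",
    "Greece": None,  # root
}


def generalize(value, levels, parent=PARENT):
    """Move a value up the hierarchy by the given number of levels, stopping at the root."""
    for _ in range(levels):
        if parent[value] is None:
            break
        value = parent[value]
    return value


def root_of(value, parent=PARENT):
    """Follow parent links until the root is reached."""
    while parent[value] is not None:
        value = parent[value]
    return value


print(generalize("10431", 1))
print(generalize("54621", 2))
# Every node leads to the same root, so all values can be merged if needed.
print({root_of(node) for node in PARENT})
```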

Generalization hierarchies contain semantic information that the user has to provide. In the case of domains where semantic information is linked to a total order (e.g., numbers, dates, etc.), Amnesia can help the user generate new hierarchies.

Hierarchies can be stored and loaded by Amnesia in a predefined format. We offer ready-to-use hierarchies for some important real-world ontologies (ICD codes, ZIP codes).

Load

Loading hierarchies can be initiated in two ways:

  1. by selecting "Hierarchy" -> "Load From Local" on the left-side menu.
  2. by selecting "Hierarchy" -> "Manage" and clicking "Load New Hierarchy" on the upper-right corner of the menu.

Save

Hierarchies created by Amnesia can be saved as local files. To save a hierarchy, users have to select "Hierarchy" -> "Manage" on the Hierarchy screen and click "Save Hierarchy" in the upper-right menu.

Auto Generate

Amnesia helps users create custom hierarchies based on the original dataset file. This feature is accessible from the left-side menu by selecting "Hierarchy" -> "Auto Generate", or through the hierarchy screen by selecting "Hierarchy" -> "Manage" and then clicking "Autogenerate Hierarchy" in the upper-right menu. The hierarchy is created so that it contains the active domain (i.e., all the values) of an attribute of the input file. The user must first choose the attribute, the type (distinct or range), and the hierarchy's variable type (domain). In the case of distinct values, the user can further customize the hierarchy creation by choosing:

  • the sorting function for the domain values (numeric, alphabetical, random),
  • the name of the hierarchy, and
  • the fanout (i.e., the average number of children of each node).
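
To make the roles of the sorting function and the fanout concrete, here is a hedged Python sketch that builds a hierarchy over distinct values. It is one plausible construction, not Amnesia's algorithm; the function name and the synthetic node names (such as "L1_0") are assumptions.

```python
def build_distinct_hierarchy(values, fanout, sort_key=None):
    """Return a child -> parent map whose leaves are the distinct input values."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    level = sorted(set(values), key=sort_key)
    if not level:
        raise ValueError("at least one value is required")
    parent = {}
    depth = 0
    while len(level) > 1:
        depth += 1
        next_level = []
        for i in range(0, len(level), fanout):
            node = f"L{depth}_{i // fanout}"  # synthetic name for the new inner node
            for child in level[i:i + fanout]:
                parent[child] = node
            next_level.append(node)
        level = next_level
    parent[level[0]] = None  # the single remaining node is the root
    return parent


hierarchy = build_distinct_hierarchy(["b", "a", "d", "c", "a"], fanout=2)
for child, node in hierarchy.items():
    print(child, "->", node)
```

Sorting first means that neighbouring values end up under the same parent, which is why the choice of sorting function affects how meaningful the generalized groups are.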

In the case of range values, the user must choose

  • the name of the hierarchy,
  • the boundaries of the attribute domain,
  • the step (i.e., the size of ranges at the lower level of the hierarchy), and
  • the fanout.
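
The interplay of boundaries, step, and fanout can be sketched as follows. This Python example is an assumption-laden illustration (integer boundaries, half-open ranges, a hypothetical function name) rather than the construction Amnesia uses.

```python
def build_range_hierarchy(low, high, step, fanout):
    """Return a list of levels, lowest first; each level is a list of [start, end) ranges."""
    if step <= 0 or fanout < 2 or low >= high:
        raise ValueError("need step > 0, fanout >= 2, and low < high")
    # Lowest level: ranges of width `step` covering the attribute domain.
    level = [(start, min(start + step, high)) for start in range(low, high, step)]
    levels = [level]
    # Each higher level merges up to `fanout` consecutive ranges into one node.
    while len(level) > 1:
        level = [
            (level[i][0], level[min(i + fanout, len(level)) - 1][1])
            for i in range(0, len(level), fanout)
        ]
        levels.append(level)
    return levels


for depth, ranges in enumerate(build_range_hierarchy(0, 80, step=10, fanout=2)):
    print(depth, ranges)
```

A smaller step gives finer leaves and less information loss at low generalization levels, while a larger fanout gives a shallower hierarchy with bigger jumps between levels.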

A special case of range hierarchies is date values. Because dates are not based on the decimal system, the user must define several ranges for different granularity levels. Specifically, the user must define how many days, months, and years will be grouped in each node. Moreover, the user must define the fanout, which is the number of year ranges that will be grouped in a single node.

Edit Hierarchy

Existing or auto-generated hierarchies can be edited by the user through a visual interface (by pressing the "Edit" button in the hierarchy panel). Editing includes adding, removing, and renaming nodes, as well as moving them from one place in the tree to another. A detailed guide on hierarchy editing with visual examples is also available.

Generate hierarchy with demographic data

Amnesia supports population-based hierarchies by postal code or age. These hierarchies are available only for simple table or disk-based simple table data and are based on demographic data from various countries such as France, Germany, Great Britain, USA, etc. The central concept is that if an attribute value (age or postal code) does not satisfy k-anonymity on the local dataset, Amnesia will check whether the value guarantees k-anonymity in the population distribution of the hierarchy node to which it belongs. With this method, Amnesia reduces information loss. Finally, these hierarchies cannot be saved or edited and cannot be combined with custom hierarchies.

Algorithms

Execute

Algorithm execution is initiated through the Algorithms screen, which is accessible from the left-side menu. Amnesia displays the input dataset and the hierarchies on the upper part of the screen. The user can make their selections on the bottom part of the screen:

  • The user must associate each quasi-identifier with an already loaded hierarchy on the left. A hierarchy can be used to anonymize several quasi-identifiers. One hierarchy must be defined for each quasi-identifier.
  • The user must choose the anonymization algorithm and its parameters on the bottom-right side. Currently, Amnesia supports:
    1. k-anonymity for simple tables
    2. k-anonymity with local recoding for disk-based simple tables
    3. km-anonymity for sets, and
    4. k-anonymity and km-anonymity for relational tables with a set-valued attribute
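
For readers unfamiliar with km-anonymity, the guarantee is commonly stated as follows: every combination of up to m values that appears in the set collection must appear in at least k records. The Python sketch below checks this by brute force; it is an illustration of the definition under that assumption, not the heuristic algorithm Amnesia runs, and the function name and sample records are hypothetical.

```python
from collections import Counter
from itertools import combinations


def km_violations(records, k, m):
    """Return the itemsets of size <= m that occur in fewer than k records."""
    counts = Counter()
    for record in records:
        items = sorted(set(record))  # ignore duplicates and ordering within a record
        for size in range(1, m + 1):
            counts.update(combinations(items, size))
    return {itemset: n for itemset, n in counts.items() if n < k}


# Hypothetical set collection: each record is a set of items.
records = [["a", "b"], ["a", "b"], ["a", "c"]]
print(km_violations(records, k=2, m=2))
```

An empty result means the collection satisfies the guarantee for the chosen k and m. Brute-force enumeration grows quickly with m, which is one reason practical algorithms rely on heuristics.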

Anonymization is initiated by clicking "Execute".

Pseudo-Anonymization

Pseudo-anonymization is applied using the masking method. After the user loads a dataset to Amnesia, there is a "Pseudo-anonymization" button next to every string attribute. By clicking it, a pop-up window depicts every character of a random value of the column in a small box. The user is then asked to set the desired special character for the mask (e.g., *,&,^ etc.). Finally, the user chooses which characters from the sample value to hide with the mask character.
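
The masking step can be pictured with the short Python sketch below. The function name is hypothetical, and how Amnesia treats values that are shorter than the sample value is not documented here; this sketch simply ignores positions that do not exist.

```python
def mask_value(value, positions, mask_char="*"):
    """Replace the characters at the chosen positions with the mask character."""
    hidden = set(positions)
    return "".join(mask_char if i in hidden else ch for i, ch in enumerate(value))


# Hide the third to fifth characters of every value in a string column.
column = ["AB12345", "CD67890", "EF1"]
print([mask_value(v, positions=[2, 3, 4]) for v in column])
```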

Solution Graph

Amnesia is oriented toward enabling the user to tailor every step in the anonymization process. For simple k-anonymity, this entails the ability to choose any valid solution. Amnesia depicts the solution space and allows the user to choose from the solutions that guarantee k-anonymity. Unfortunately, this is not possible for km-anonymity or for local recoding k-anonymity since the solution space in these cases is too large to be completely explored or visualized. The algorithms in these cases use heuristics to quickly identify a good, but possibly not the overall best solution. The functionality described below is available only for simple k-anonymity.

Solution Selection

To view the complete solution space for simple k-anonymity, the user must navigate to the solution screen (Solution Graph). The different solutions are represented as nodes in a graph. Each node corresponds to a different combination of anonymization levels for each quasi-identifier. For example, if we have the quasi-identifiers "Age" and "Zipcode", one solution will represent "Age" anonymized to the first hierarchy level and "Zipcode" to the second, and another will represent "Age" anonymized to the second hierarchy level and "Zipcode" to the first. All possible combinations are represented. Blue nodes indicate the safe solutions, while red nodes indicate the unsafe ones. Double-clicking a node applies the respective solution, and the tool redirects the user to the anonymized dataset.

Amnesia allows the user to customize the final solution further by transforming an unsafe solution into a safe one using suppression. It is often the case that an unsafe solution violates the desired k-anonymity guarantee just for a few records. For example, in a dataset with 100k records, there might be only 5 records that do not fall into k-sized groups. Instead of further generalizing the whole dataset, the user might opt for suppressing (i.e., removing) these 5 records. To see the percentage of records that violate the guarantee, follow these steps:

  1. Choose an unsafe solution in the solution graph,
  2. Choose "Show statistics",
  3. Select the combination of all available quasi-identifiers.

Amnesia will show the percentage of records that violate the privacy guarantee based on the combination of quasi-identifiers chosen on the left part of the pop-up window. If the user presses the "Suppress" button at the bottom-right part of the window, these records will be removed, and the unsafe solution will be transformed into a safe one and will be directly applied to the dataset.
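
The solution space and the effect of suppression can be sketched in Python. Everything in this example is an assumption made for illustration: the two toy hierarchies for "Age" and "Zipcode", the sample records, and the function names. It enumerates every combination of hierarchy levels and reports the fraction of records that violate k-anonymity for each one.

```python
from collections import Counter
from itertools import product


def generalize_age(age, level):
    # Level 0: exact age, level 1: 10-year range, level 2: fully generalized.
    if level == 0:
        return str(age)
    if level == 1:
        low = age // 10 * 10
        return f"{low}-{low + 9}"
    return "*"


def generalize_zip(zipcode, level):
    # Level 0: full code, level 1: first three digits, level 2: fully generalized.
    if level == 0:
        return zipcode
    if level == 1:
        return zipcode[:3] + "**"
    return "*"


def violating_fraction(records, levels, k):
    """Fraction of records that fall into groups smaller than k at the given levels."""
    groups = Counter(
        (generalize_age(age, levels[0]), generalize_zip(zipcode, levels[1]))
        for age, zipcode in records
    )
    return sum(n for n in groups.values() if n < k) / len(records)


def solution_graph(records, k, max_level=2):
    """Map every (age level, zip level) combination to its violating fraction."""
    return {
        levels: violating_fraction(records, levels, k)
        for levels in product(range(max_level + 1), repeat=2)
    }


records = [(31, "10431"), (34, "10432"), (36, "10435"),
           (52, "54621"), (57, "54622"), (75, "10431")]
for levels, fraction in sorted(solution_graph(records, k=2).items()):
    status = "safe" if fraction == 0 else "unsafe"
    print(levels, status, round(fraction, 2))
```

In this toy dataset the node (1, 1) is unsafe only because of a single record, so suppressing that record would make it safe while keeping the remaining values less generalized than in the safe nodes.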

Show Sample of Anonymized Dataset

The user can get a preview of each solution in the solution graph. When a solution node is clicked, a pop-up menu will appear with the option "Preview of the Anonymized dataset".

Anonymized Dataset

Save Anonymization Rules

An anonymization solution comprises a series of anonymization rules that define how each quasi-identifier must be anonymized (e.g., Rule 1: "Country" attribute should be anonymized to the continent level). Amnesia allows the user to save these rules so that they can be reused in the same or similar datasets in the future. Rules are saved by using the "Save Rules" button in the results screen.

Save Anonymized Dataset

Saving the anonymized dataset is possible through the results screen or the anonymized dataset screen ("Anonymized->Source") by clicking "Save To Local" located in the upper-right menu.

Complete scenarios of anonymization using Amnesia

These tutorials provide an end-to-end demonstration of how to anonymize a dataset using a specific privacy model.

Amnesia ReST API

See detailed instructions and examples of how to use Amnesia backend procedures.