The datasets that Amnesia can process are stored as delimited text files. Each line in the text file is a different record, and each distinct value of the record is separated from the next by a delimiter. The user must provide the delimiter used in the original file to the tool when importing a dataset. Anonymized files are saved in the same format as the original files. The data models supported by Amnesia are relational tables, set collections, and object-relational tables (all of them stored as delimited text files).
Amnesia supports four data types: strings, integers, doubles (floating point), and dates. When loading a dataset, Amnesia will try to guess the data type, but it will do so based only on the first lines of the imported dataset. The user should check these recommendations and correct them accordingly.
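To illustrate why guesses based on the first lines need checking, here is a minimal sketch of sample-based type inference. It is not Amnesia's implementation: the function names, the `sample_size` parameter, the presence of a header row, and the `%Y-%m-%d` date format are all assumptions made for this example.

```python
import csv
import io
from datetime import datetime

def guess_type(values):
    """Return 'int', 'double', 'date', or 'string' for a list of raw strings."""
    def all_parse(parser):
        try:
            for v in values:
                parser(v)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "int"
    if all_parse(float):
        return "double"
    # Assumed date format; real files may use a different one.
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "date"
    return "string"

def guess_column_types(text, delimiter=",", sample_size=5):
    """Guess one type per column from the first `sample_size` records.

    Assumes the first line is a header with the column names.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    sample = [row for _, row in zip(range(sample_size), reader)]
    columns = zip(*sample)
    return {name: guess_type(list(col)) for name, col in zip(header, columns)}

data = "age;city\n34;Athens\n51;Patras\nunknown;Volos\n"
# With only two sampled records, the non-numeric third value is never seen.
print(guess_column_types(data, delimiter=";", sample_size=2))
```

With `sample_size=2` the sketch reports `age` as an integer column even though a later record holds a non-numeric value, which is the kind of case the user is expected to catch and correct.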
Importing a dataset to Amnesia can be initiated in three different ways:
After initializing the dataset loading, a wizard guides the user to model the data in Amnesia correctly. In the first step, the user should choose how the input dataset should be processed
Amnesia parses the first lines of the dataset and presents a preview to the user. Amnesia guesses the data types, which the user must confirm. Using the check box next to each attribute, the user can also choose which columns will appear in the output dataset.
Amnesia can import data directly from Zenodo. The process is initiated by selecting the left-side menu: "Source" -> "Load From Zenodo". A wizard guides the user to connect with Zenodo so that Amnesia acquires access to their files. In the first step, the wizard requires the access token found on the user's Zenodo profile. After connecting to the user's account, Amnesia presents a table with the user's full file description. The user can then choose a dataset by selecting it.
Amnesia can import data directly from Dataverse. The process can be initiated by selecting the left-side menu: "Source" -> "Load From Dataverse". A wizard guides the user to give the required credentials so that Amnesia connects with the Dataverse server that contains the user's files. In the first step, the wizard requires the Dataverse server URL, the user access token, and the persistent ID of a specific Dataverse Dataset (e.g., doi:10.70122/FK2/TZAXXS). All necessary information exists on the user's Dataverse profile. After connecting to the user's account, Amnesia presents a table with the user's full file description. The user can then choose a dataset by selecting it.
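For readers who want to see how the three wizard inputs relate to each other, the sketch below builds (but does not send) a request against the Dataverse native API, which looks up a dataset by persistent ID and authenticates through the `X-Dataverse-key` header. This is not Amnesia's code; the function name and the server URL in the usage line are placeholders chosen for the example.

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_dataset_request(server_url, api_token, persistent_id):
    """Build a GET request for a dataset's metadata and file listing."""
    query = urlencode({"persistentId": persistent_id})
    url = f"{server_url.rstrip('/')}/api/datasets/:persistentId/?{query}"
    # The access token is passed as a header rather than in the URL.
    return Request(url, headers={"X-Dataverse-key": api_token})

# Placeholder server and token; the persistent ID is the example from the text.
request = build_dataset_request(
    "https://dataverse.example.org/", "my-token", "doi:10.70122/FK2/TZAXXS"
)
print(request.full_url)
```

Passing the request to `urllib.request.urlopen` would perform the actual call; it is left out here so the sketch runs without network access.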
Users can load DICOM images by clicking the orange button "Choose Dataset" located on the upper-left corner of the Index screen. Then, they must choose "Load folder with DICOM images" and select a folder containing DICOM images. Amnesia then loads all .dcm files in the directory and stores sensitive metadata such as "PatientID", "PatientName", "PatientAge", "Modality", "PatientSex", "PatientBirthDate", "PhotometricInterpretation", "BodyPartExamined", "PatientOrientation", "ViewPosition", "ConversionType", and "SamplesPerPixel" as a simple table. After the anonymization, the anonymized table is converted to a compressed folder of .dcm files. By clicking "Save To Local", the user can download the anonymized .zip file.
Datasets can be saved locally only after the pseudo-anonymization process by navigating to the dataset screen ("Source" -> "Manage") on the left-side menu, then by applying pseudo-anonymization to one or more string columns, and finally by clicking "Save To Local", which is located on the upper-right of the screen. Amnesia stores the data in .txt comma-delimited files along with another .txt file which maps the randomized order of the pseudo-anonymized data to the original order.
Amnesia can store datasets to Zenodo. From the left-side menu, the user should navigate to the dataset screen ("Source" -> "Manage") and click "Save To Zenodo", located on the upper-right of the screen. Then, a wizard will guide the user through the process. In the first step, the user will be asked to provide the User Authentication Token (found on their Zenodo account), Author, Affiliation, Filename, Title, Description, Contributors, and Keywords. In the next step, Amnesia displays a summary table describing the dataset to be published in Zenodo, which the user must confirm. The last column of the table is the percentage of similarity between this specific file and the file that the user wants to save. This percentage results from comparing the attributes fileName, keywords, and checksum of the two files. Upon confirmation by the user, the file is published. Amnesia stores the data in .txt comma-delimited files.
Amnesia can store datasets to a Dataverse server. From the left-side menu, the user should navigate to the dataset screen ("Source" -> "Manage") and click "Save To Dataverse", located on the upper-right of the screen. A wizard then assists the user through the process. In the first step, the user will be asked for the User Authentication Token, the Dataverse Dataset persistent ID (both can be found on the user's Dataverse account), and a Description. Amnesia stores the file in the user's Dataverse Dataset specified by the given persistent ID, as a .txt or .csv comma-delimited file.
This option allows the user to check whether the source dataset is already anonymous according to k-anonymity. It is available through the dataset screen ("Source" -> "Manage") by clicking "Check Anonymization". A wizard will ask the user for the preferred k parameter of the anonymization guarantee. In the next step, Amnesia presents a pie-chart representation of the dataset. This chart shows all groups of records with their size and highlights the percentage of records that fall into groups of size less than k. The user can trivially anonymize the dataset by suppressing all records that fall in the latter category.
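The check described above can be sketched in a few lines: group the records by their quasi-identifier values, measure the share of records in groups smaller than k, and optionally suppress them. This is an illustrative sketch rather than Amnesia's code; the function name, the record layout (a list of dictionaries), and the sample values are assumptions.

```python
from collections import Counter

def check_k_anonymity(records, quasi_identifiers, k):
    """Return (violating_fraction, safe_records) for a list of dict records."""
    def key(record):
        return tuple(record[q] for q in quasi_identifiers)

    group_sizes = Counter(key(r) for r in records)
    # Records in groups of size >= k are kept; the rest would be suppressed.
    safe = [r for r in records if group_sizes[key(r)] >= k]
    violating = len(records) - len(safe)
    fraction = violating / len(records) if records else 0.0
    return fraction, safe

records = [
    {"age": "30-39", "zip": "104**", "disease": "flu"},
    {"age": "30-39", "zip": "104**", "disease": "asthma"},
    {"age": "30-39", "zip": "104**", "disease": "flu"},
    {"age": "40-49", "zip": "262**", "disease": "diabetes"},
    {"age": "40-49", "zip": "262**", "disease": "flu"},
]
fraction, safe = check_k_anonymity(records, ["age", "zip"], k=3)
print(fraction, len(safe))  # 0.4 3
```

Here two of the five records fall into a group of size 2, so for k = 3 they make up the highlighted 40% and would be the records removed by suppression.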
Saved solutions, i.e., collections of rules for generalizing values, can be loaded and applied to different datasets. Note that applying a solution to a different dataset does not guarantee that the new dataset will be anonymous. Loading of anonymization rules is accessible from the dataset screen ("Source" -> "Manage") by selecting "Load Anon Rules" located on the upper-right menu.
Generalization hierarchies are sets of rules that define how specific values should be substituted by more general ones when anonymizing the data. The key idea is that values that are specific enough to be identifying (e.g., a residence zip code) are replaced by more general ones (e.g., city names) so that they can no longer reveal a person's identity. An example hierarchy is depicted in the following figure.
Amnesia will use the hierarchy to replace specific values with more general ones until the privacy guarantee is reached. A characteristic of generalization hierarchies is that all nodes lead up to a single node (the root). This property guarantees that Amnesia will be able to replace all values with a common one if needed.
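As a rough illustration of these two properties, a hierarchy can be modelled as a child-to-parent map in which every path ends at the same root. The values below are made up for the example, and the `generalize` helper is hypothetical; Amnesia's own hierarchy format is not shown here.

```python
# Hypothetical hierarchy: each value maps to its parent; the root has no parent.
PARENT = {
    "10431": "Athens",
    "10432": "Athens",
    "26221": "Patras",
    "Athens": "Greece",
    "Patras": "Greece",
    "Greece": None,
}

def generalize(value, levels, parent=PARENT):
    """Climb `levels` steps up the hierarchy, stopping at the root."""
    for _ in range(levels):
        up = parent[value]
        if up is None:
            break
        value = up
    return value

print(generalize("10431", 1))  # Athens
print(generalize("26221", 5))  # Greece
```

Because every node eventually reaches the root, asking for more levels than the hierarchy has simply returns the root value, so any two values can always be generalized to a common one.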
Generalization hierarchies contain semantic information that the user has to provide. In the case of domains where semantic information is linked to a total order (e.g., numbers, dates, etc.), Amnesia can help the user generate new hierarchies.
Hierarchies can be stored and loaded by Amnesia in a predefined format. We offer ready-to-use hierarchies for some important real-world ontologies (ICD codes, ZIP codes).
Loading hierarchies can be initiated in two ways:
Hierarchies created by Amnesia can be saved as local files. To save a hierarchy, users have to select "Hierarchy" -> "Manage" on the Hierarchy screen and click "Save Hierarchy" in the upper-right menu.
Amnesia helps users create custom hierarchies based on the original dataset file. This feature is accessible from the left-side menu by selecting "Hierarchy" -> "Auto Generate", or through the hierarchy screen by selecting "Hierarchy" -> "Manage" and then clicking "Autogenerate Hierarchy" located on the upper-right menu. The hierarchy is created so that it contains the active domain (i.e., all the values) of an attribute of the input file. The user must first choose the attribute, the type (distinct or range), and the hierarchy's variable type (domain). In the case of distinct values, the user can further customize the hierarchy creation by choosing
In the case of range values, the user must choose
A special case of range hierarchies is date values. Because dates are not based on the decimal system, the user must define several ranges for different granularity levels. Specifically, the user must define how many days, months, and years will be grouped in each node. Moreover, the user must define the fanout, which is the number of year ranges that will be grouped in a single node.
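The role of the fanout is easiest to see on plain integer ranges. The sketch below builds a range hierarchy bottom-up: leaf ranges of a fixed width, then parents that each group `fanout` consecutive children, until a single root remains. The function name and the `start`, `end`, and `step` parameters are assumptions for this example and do not reflect the exact options of Amnesia's wizard; a date hierarchy would apply the same idea separately to days, months, and years.

```python
def build_range_hierarchy(start, end, step, fanout):
    """Return a list of levels, from the finest ranges up to a single root.

    Each range is a half-open (low, high) pair covering [low, high).
    """
    if step <= 0 or fanout < 2:
        raise ValueError("step must be positive and fanout at least 2")
    level = [(lo, min(lo + step, end)) for lo in range(start, end, step)]
    levels = [level]
    while len(level) > 1:
        # Merge each run of `fanout` neighbouring ranges into one parent range.
        level = [
            (level[i][0], level[min(i + fanout, len(level)) - 1][1])
            for i in range(0, len(level), fanout)
        ]
        levels.append(level)
    return levels

for depth, ranges in enumerate(build_range_hierarchy(0, 80, 10, 2)):
    print(depth, ranges)
```

For the values 0 to 80 with a step of 10 and a fanout of 2, this yields four levels: eight ranges of width 10, four of width 20, two of width 40, and the root range covering everything. A larger fanout gives a shallower tree with coarser jumps between levels.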
Existing or auto-generated hierarchies can be edited by the user through a visual interface (by pressing the "Edit" button in the hierarchy panel). Editing includes adding, removing, renaming nodes, and moving them from one place of the tree to another. A detailed guide on hierarchy editing with visual examples is available here
Amnesia supports population-based hierarchies by postal code or age. These hierarchies are available only for simple table or disk-based simple table data and are based on demographic data from various countries such as France, Germany, Great Britain, USA, etc. The central concept is that if an attribute value (age or postal code) does not satisfy k-anonymity on the local dataset, Amnesia will check whether the value guarantees k-anonymity in the population distribution of the hierarchy node to which it belongs. With this method, Amnesia reduces information loss. Finally, these hierarchies cannot be saved or edited and cannot be combined with custom hierarchies.
Algorithm execution is initiated through the Algorithms screen, which is accessible from the left-side menu. Amnesia displays the input dataset and the hierarchies on the upper part of the screen. The user can make their selections on the bottom part of the screen:
Anonymization is initiated by clicking "Execute".
Pseudo-anonymization is applied using the masking method. After the user loads a dataset to Amnesia, there is a "Pseudo-anonymization" button next to every string attribute. When it is clicked, a pop-up window depicts every character of a random value of the column in a small box. The user is then asked to set the desired special character for the mask (e.g., *, &, ^). Finally, the user chooses which characters from the sample value to hide with the mask character.
Amnesia is oriented toward enabling the user to tailor every step in the anonymization process. For simple k-anonymity, this entails the ability to choose any valid solution. Amnesia depicts the solution space and allows the user to choose from the solutions that guarantee k-anonymity. Unfortunately, this is not possible for km-anonymity or for local recoding k-anonymity, since the solution space in these cases is too large to be completely explored or visualized. The algorithms in these cases use heuristics to quickly identify a good, but possibly not the overall best, solution. The functionality described below is available only for simple k-anonymity.
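The masking step can be pictured as replacing selected character positions in every value of the column. The following is a minimal sketch under that reading, not Amnesia's implementation; the function names and the 0-based `positions` argument are assumptions, as is the choice to ignore positions beyond the end of shorter values.

```python
def mask_value(value, positions, mask_char="*"):
    """Replace the characters at the given 0-based positions with mask_char."""
    hidden = set(positions)
    return "".join(mask_char if i in hidden else ch for i, ch in enumerate(value))

def mask_column(values, positions, mask_char="*"):
    """Apply the same mask to every value of a string column."""
    return [mask_value(v, positions, mask_char) for v in values]

# Hide everything after the first two characters of each identifier.
print(mask_column(["AB123456", "CD98"], range(2, 8)))  # ['AB******', 'CD**']
```

The positions are chosen once on a sample value and then applied uniformly, so values of different lengths are masked consistently from the left.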
To view the complete solution space for simple k-anonymity, the user must navigate to the solution screen (Solution Graph). The different solutions are represented as nodes in a graph. Each node corresponds to a different combination of anonymization levels for each quasi-identifier. For example, if we have the quasi-identifiers "Age" and "Zipcode", one solution will represent "Age" anonymized to the first hierarchy level and "Zipcode" to the second, and another will represent "Age" anonymized to the second hierarchy level and "Zipcode" to the first. All possible combinations are represented. Blue nodes indicate the safe solutions, while red nodes indicate the unsafe ones. Double-clicking a node applies the respective solution, and the tool redirects the user to the anonymized dataset.
Amnesia allows the user to customize the final solution further by transforming an unsafe solution into a safe one using suppression. It is often the case that an unsafe solution violates the desired k-anonymity guarantee just for a few records. For example, in a dataset with 100k records, there might be only 5 records that do not fall into k-sized groups. Instead of further generalizing the whole dataset, the user might opt for suppressing (i.e., removing) these 5 records. To see the percentage of records that violate the guarantee, the user follows these steps:
1. Choose an unsafe solution in the solution graph.
2. Choose "Show statistics".
3. Select the combination of all available quasi-identifiers.
Amnesia will show the percentage of records that violate the privacy guarantee based on the combination of quasi-identifiers chosen on the left part of the pop-up window. If the user presses the "Suppress" button at the bottom-right part of the window, these records will be removed, and the unsafe solution will be transformed into a safe one and will be directly applied to the dataset.
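The solution space and the suppression statistic can be sketched together: enumerate every combination of hierarchy levels, apply it, and compute the share of records left in groups smaller than k. A node with a share of zero corresponds to a safe solution; a small non-zero share marks a candidate for suppression. The per-level generalization functions, the sample records, and the function names below are all hypothetical and only serve the illustration.

```python
from collections import Counter
from itertools import product

# Hypothetical generalization functions per attribute, indexed by level.
AGE_LEVELS = [
    lambda a: str(a),                                 # level 0: original value
    lambda a: f"{a // 10 * 10}-{a // 10 * 10 + 9}",   # level 1: decade
    lambda a: "*",                                    # level 2: root
]
ZIP_LEVELS = [
    lambda z: z,                                      # level 0: original value
    lambda z: z[:3] + "**",                           # level 1: prefix
    lambda z: "*",                                    # level 2: root
]

RECORDS = [
    (23, "10431"), (27, "10432"), (25, "10439"),
    (41, "26221"), (48, "26225"), (62, "10431"),
]

def violating_fraction(records, age_level, zip_level, k):
    """Share of records that fall into groups smaller than k."""
    groups = Counter(
        (AGE_LEVELS[age_level](a), ZIP_LEVELS[zip_level](z)) for a, z in records
    )
    violating = sum(size for size in groups.values() if size < k)
    return violating / len(records)

def solution_space(records, k):
    """Map every (age_level, zip_level) node to its violating fraction."""
    levels = product(range(len(AGE_LEVELS)), range(len(ZIP_LEVELS)))
    return {(al, zl): violating_fraction(records, al, zl, k) for al, zl in levels}

for node, fraction in sorted(solution_space(RECORDS, k=2).items()):
    status = "safe" if fraction == 0 else f"unsafe ({fraction:.0%} violating)"
    print(node, status)
```

In this toy dataset the node (1, 1) is unsafe only because of a single record, so suppressing that record gives a 2-anonymous result with far less generalization than the safe nodes (2, 1) or (2, 2) would require.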
The user can get a preview of each solution in the solution graph. When a solution node is clicked, a pop-up menu will appear with the option "Preview of the Anonymized dataset".
An anonymization solution comprises a series of anonymization rules that define how each quasi-identifier must be anonymized (e.g., Rule 1: "Country" attribute should be anonymized to the continent level). Amnesia allows the user to save these rules so that they can be reused in the same or similar datasets in the future. Rules are saved by using the "Save Rules" button in the results screen.
Saving the anonymized dataset is possible through the results screen or the anonymized dataset screen ("Anonymized->Source") by clicking "Save To Local" located in the upper-right menu.