Datasets that can be processed by Amnesia are stored as delimited text files. Each line in the text file is a different record, and each distinct value of the record is separated from the next by a delimiter. When importing a dataset, the user must tell the application which delimiter is used in the original file. Anonymized files are saved in the same format as the original files. The data models supported by Amnesia are relational tables, set collections and object relational tables (all of them are stored as delimited text files). Relational tables have a fixed number of columns, and as a result each record has the same number of values. Each column can have a different data type. Set collections are datasets consisting of records of arbitrary length. Each record can have an arbitrary number of values of the same data type (only strings are currently supported for sets). Object relational tables combine the two: they are tables that have a fixed number of columns, but one column is a set, i.e., it contains an arbitrary number of values of the same type. To correctly parse a delimited file containing an object relational table, the user must provide Amnesia with two delimiters: a) the one separating values of different columns and b) the one separating different values in the set column. Amnesia supports 4 data types: strings, integers, doubles (floating point) and dates. When loading a dataset, Amnesia will try to guess the type of the data, but it will do so based only on the first lines of the imported dataset. The user should check these suggestions and correct them where necessary.
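To make the two-delimiter idea and the type guessing more concrete, here is a minimal Python sketch. It is not Amnesia's own parser: the function names, the sample data, the "|" set delimiter and the "%Y-%m-%d" date format are all assumptions chosen for illustration.

```python
import csv
import io
from datetime import datetime


def guess_type(samples):
    """Guess one of the four data types from a few sample strings."""
    def all_parse(parse):
        try:
            for s in samples:
                parse(s)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "int"
    if all_parse(float):
        return "double"
    # The date format below is an assumption made for this sketch.
    if all_parse(lambda s: datetime.strptime(s, "%Y-%m-%d")):
        return "date"
    return "string"


def read_object_relational(text, column_delim, set_delim, set_column):
    """Parse a table whose column `set_column` holds a set of values."""
    rows = []
    for record in csv.reader(io.StringIO(text), delimiter=column_delim):
        # The second delimiter splits the values inside the set column.
        record[set_column] = record[set_column].split(set_delim)
        rows.append(record)
    return rows


# Hypothetical object relational table: age, visit date, set of diagnoses.
sample = "34,2020-01-15,flu|cough\n51,2019-11-02,asthma\n"
rows = read_object_relational(sample, ",", "|", set_column=2)

# Guess the types of the two plain columns from the first lines only.
column_types = [guess_type([r[i] for r in rows[:100]]) for i in range(2)]
```

Because the guess looks only at the first lines, a column whose later records contain non-numeric values would still be reported as numeric, which is why the guessed types need to be reviewed by the user.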
A dataset with personal data can be imported into Amnesia in 4 different ways: a) by clicking the orange button “Load Dataset” at the top left of the Index screen, b) by using the left side menu: “Source” -> “Load From Local”, c) by clicking the image (“Drop files to upload”); when the dataset appears on the screen, you need to press the button “upload”, which is located at the top right inside the image, and d) through the dataset screen (“Source” -> “Manage”) by pressing the button “Load New Dataset”, located in the top right menu. Once loading has been initiated, a wizard guides the user to correctly model the data in Amnesia. In the first step, the user chooses how the input dataset should be treated. There are 4 choices: a) “Simple Table”, b) “Sets of values”, c) “Table with a set-valued attribute” and d) “Disk based simple table”. “Simple Table” refers to a relational table, i.e., a table with a fixed number of columns of possibly different data types; “Sets of values” refers to collections of records with an arbitrary number of values of the same type; “Table with a set-valued attribute” is a table with a fixed number of columns where one column is a set. Tables with a set-valued attribute need 2 delimiters: one for separating columns and one for separating values inside the set-valued attribute. Finally, “Disk based simple table” is a simple table that resides on the hard disk; it is the suitable option for very large datasets. Amnesia parses the first lines of the dataset and presents a preview to the user. Amnesia guesses the data types, which have to be confirmed by the user. By using the check box next to each column name, the user can also choose which columns will appear in the output dataset.
Amnesia can import data directly from Zenodo. The process is initiated by using the left side menu: "Source" -> "Load From Zenodo". A wizard guides the user through setting up a connection with Zenodo, so that Amnesia gains access to her or his files. In the first step, the wizard asks for the user's access token, which can be found in her or his Zenodo profile. Once connected, Amnesia presents a table with the full description of the files that are saved in the user’s Zenodo profile. The user can choose a file by clicking on it.
Data is only saved locally with this option. From the left menu, navigate to the dataset screen ("Source" -> "Manage") and click the button “Save To Local”, located at the top right of the screen. The saved file is in txt format with comma as delimiter.
With this option, data is saved to Zenodo. From the left menu, navigate to the dataset screen ("Source" -> "Manage") and click the button “Save To Zenodo”, located at the top right of the screen. A wizard will be initiated to guide the user through the saving process. In the first step, the user will be asked for the user authentication token (taken from her or his Zenodo account), Author, Affiliation, Filename, Title, Description, Contributors and Keywords. In the next step, a summary that describes the file that will be published in Zenodo is displayed, to be confirmed by the user. The last column of the table is the percentage of similarity between each listed file and the file that the user wants to save. This similarity percentage is derived from comparing the attributes fileName, keywords and checksum of the two files. Upon confirmation by the user, the file is published. The saved file is in txt format with comma as delimiter.
This option allows the user to check whether the source dataset is already anonymous or not, according to k-anonymity. It is available through the dataset screen ("Source" -> "Manage"), by clicking the “Check Anonymization” button, located at the bottom right. A wizard will be initiated and will ask the user for the k parameter of the anonymization guarantee. In the next step, a graphical representation of the dataset appears as a pie chart. The pie chart shows all groups of records and their sizes, and highlights the percentage of records that fall into groups with size less than k. The user can trivially anonymize the dataset by suppressing all records that fall into the latter category.
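The check described above can be sketched in a few lines of Python. This is an illustrative sketch rather than Amnesia's code; the function name, the record layout (a list of dictionaries) and the sample values are assumptions.

```python
from collections import Counter


def check_k_anonymity(records, quasi_identifiers, k):
    """Return (is_anonymous, fraction_violating, safe_records)."""
    def key(record):
        return tuple(record[q] for q in quasi_identifiers)

    # Records with identical quasi-identifier values form one group.
    group_sizes = Counter(key(r) for r in records)
    # Suppression keeps only the records whose group has at least k members.
    safe = [r for r in records if group_sizes[key(r)] >= k]
    violating = len(records) - len(safe)
    fraction = violating / len(records) if records else 0.0
    return violating == 0, fraction, safe


# Hypothetical sample data.
records = [
    {"age": "30-39", "zip": "115**"},
    {"age": "30-39", "zip": "115**"},
    {"age": "40-49", "zip": "116**"},
]
ok, fraction, suppressed = check_k_anonymity(records, ["age", "zip"], k=2)
```

Here one of the three records sits alone in its group, so the dataset is not 2-anonymous. Since suppression removes whole groups and leaves the others unchanged, the remaining records pass the same check.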
Saved solutions, i.e., collections of rules for the generalization of values, can be loaded and applied to a different dataset. Note that applying a solution to a different dataset does not guarantee that the new dataset will be anonymous. Loading of anonymization rules is accessible from the dataset screen ("Source" -> "Manage"), through the button “Load Anon Rules” located in the top right menu.
A generalization hierarchy is a set of rules that define how detailed values can be replaced by more generic ones in the process of anonymizing the data. The key idea here is that values that are specific enough to be identifying, e.g., a residence zip code, are replaced by more generic ones, e.g., by the city that contains the respective zip code, so that they can no longer reveal the identity of a person. An example hierarchy is depicted in the following figure.
The anonymization algorithm will use the hierarchy to replace specific values with more generic ones, until the privacy guarantee is reached. A characteristic of generalization hierarchies is that all nodes lead up to a single node (the root). This property guarantees that, if needed, the algorithm will be able to replace all values by a common one.
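One simple way to picture such a hierarchy is as a mapping from each value to its parent. The Python sketch below uses that representation; the zip codes, city names and function names are hypothetical, and this is not the file format Amnesia uses.

```python
# Hypothetical hierarchy: each value maps to its parent; the root has none.
PARENT = {
    "11521": "Athens",
    "11522": "Athens",
    "54621": "Thessaloniki",
    "Athens": "Greece",
    "Thessaloniki": "Greece",
    "Greece": None,
}


def generalize(value, levels, parent=PARENT):
    """Climb `levels` steps up the hierarchy, stopping at the root."""
    for _ in range(levels):
        if parent[value] is None:
            break
        value = parent[value]
    return value


def root_of(value, parent=PARENT):
    """Follow parent links until the node without a parent is reached."""
    while parent[value] is not None:
        value = parent[value]
    return value


city = generalize("11521", 1)
country = generalize("11521", 5)  # more steps than levels: stops at the root
roots = {root_of(v) for v in PARENT}
```

Because every value reaches the same root, `roots` contains a single element, which mirrors the property that all values can be replaced by a common one if necessary.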
Generalization hierarchies contain semantic information, and the user has to provide them as input. For domains where the semantic information is linked to a total order, e.g., numbers, Amnesia can provide substantial help to the user in generating new ones.
Hierarchies can be saved and loaded by Amnesia in a predefined format. You can see an example here.
For several important real-world ontologies (ICD codes, Zip codes) we offer hierarchies here.
Loading of hierarchies can be initiated from two places: a) by using the left menu: "Hierarchy" -> "Load From Local" and b) from the hierarchy screen ("Hierarchy" -> "Manage") by pressing the button “Load New Hierarchy” located in the top right menu.
Hierarchies created by Amnesia can be saved as local files. Saving a hierarchy is accessible from the Hierarchy screen ("Hierarchy" -> "Manage") by pressing the button “Save Hierarchy” in the top right menu.
Amnesia helps the user to create custom hierarchies based on the input file. The hierarchy is created in such a way that it contains the active domain (i.e., all the values) of an attribute of the input file. The user must first choose the attribute, the type (distinct or range) and the variable type (domain) of the hierarchy. The next choices depend on whether a hierarchy of ranges or of distinct values will be created. For distinct values, the user can further customize the hierarchy creation by choosing the sorting function for the domain values (numeric, alphabetical, random), the name of the hierarchy and the fanout (i.e., the average number of children of each node). For ranges, the user must choose the name of the hierarchy, the boundaries of the attribute domain, the step, i.e., the size of the ranges at the lowest level of the hierarchy, and the fanout. This feature is accessible by using the left menu: "Hierarchy" -> "Auto Generate" and through the hierarchy screen ("Hierarchy" -> "Manage") by pressing the button “Autogenerate Hierarchy” located in the top right menu.
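To illustrate how the boundaries, the step and the fanout interact for a range hierarchy, here is a minimal Python sketch. It is one plausible construction, not necessarily the one Amnesia uses: it assumes integer boundaries, half-open ranges [start, end) and an exact (not average) fanout, and the function name is hypothetical.

```python
def build_range_hierarchy(low, high, step, fanout):
    """Return a list of levels; each level is a list of (start, end) ranges.

    Level 0 holds the finest ranges of width `step`; every higher level
    merges up to `fanout` consecutive ranges, ending with a single root.
    """
    if step <= 0 or fanout < 2 or low >= high:
        raise ValueError("need step > 0, fanout >= 2 and low < high")
    # Lowest level: ranges of width `step`, clipped at the upper boundary.
    level = [(s, min(s + step, high)) for s in range(low, high, step)]
    levels = [level]
    while len(level) > 1:
        # Merge each run of `fanout` neighbours into one parent range.
        level = [
            (level[i][0], level[min(i + fanout, len(level)) - 1][1])
            for i in range(0, len(level), fanout)
        ]
        levels.append(level)
    return levels


levels = build_range_hierarchy(0, 80, step=10, fanout=2)
```

With boundaries 0 and 80, step 10 and fanout 2, the lowest level has eight ranges, which are merged pairwise into four, then two, and finally the single root range covering the whole domain.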
Dates are a special case of range hierarchies. Because dates are not based on the decimal system, the user must define several ranges for different granularity levels. More specifically, the user must define how many days, how many months and how many years will be grouped in each node. Moreover, the user must define the fanout, which is the number of year ranges that will be grouped in a single node.
Existing or auto-generated hierarchies can be edited by the user through a visual interface (by pressing the “Edit” button in the hierarchy panel). Editing includes adding and removing nodes, renaming nodes, and moving nodes from one place in the tree to another. A detailed guide on hierarchy editing with visual examples is available here.
Algorithm execution is initiated through the algorithms screen, which is accessible from the left menu. In the upper part of the screen, the input dataset and the hierarchies are displayed. The user makes her or his choices in the lower part of the screen. On the left, the user must associate each attribute that acts as a quasi-identifier with an already loaded hierarchy. The same hierarchy can be used for several attributes, but one hierarchy must be defined for each quasi-identifier. Finally, in the bottom right panel, the user can choose the anonymization algorithm and its parameters. Currently the user has the following choices: k-anonymity for simple tables, k-anonymity with local recoding for disk based simple tables, km-anonymity for sets, and either k-anonymity or km-anonymity for relational tables with a set-valued attribute. Anonymization is initiated by clicking the button “execute”.
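The text above does not define km-anonymity (usually written k^m-anonymity). In the commonly used definition, a set-valued dataset satisfies it when every combination of at most m items that occurs in some record occurs in at least k records. The Python sketch below checks this by brute force; it illustrates the guarantee only and is not the heuristic algorithm Amnesia runs. The function name and the sample data are assumptions.

```python
from collections import Counter
from itertools import combinations


def violating_itemsets(records, k, m):
    """Return item combinations of size <= m found in fewer than k records."""
    support = Counter()
    for record in records:
        items = sorted(set(record))
        for size in range(1, m + 1):
            # Count every combination of `size` items once per record.
            support.update(combinations(items, size))
    return {combo for combo, count in support.items() if count < k}


# Hypothetical set collection: each record is a set of purchased items.
baskets = [
    ["bread", "milk"],
    ["bread", "milk", "wine"],
    ["bread", "wine"],
]
violations_m2 = violating_itemsets(baskets, k=2, m=2)
violations_m1 = violating_itemsets(baskets, k=2, m=1)
```

In this sample every single item appears in at least two records, so the data is safe for m = 1, but the pair of milk and wine appears only once, so it is not safe for k = 2 and m = 2. The number of combinations grows quickly with m, which is why exhaustive checks like this one do not scale to large datasets.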
Amnesia is designed to let the user tailor every step in the anonymization process. For simple k-anonymity, this entails the ability to choose any valid solution. Amnesia depicts the solution space and allows the user to choose one of the solutions that guarantee k-anonymity. Unfortunately, this is not possible for km-anonymity or for local recoding k-anonymity, since the solution space in these cases is too large to be completely explored or visualized. The algorithms in these cases use heuristics to quickly identify a good, but possibly not the overall best, solution. The functionality described below is available only for simple k-anonymity.
To view the complete solution space for simple k-anonymity, the user has to navigate to the solution screen (Solution Graph). The different solutions are represented as nodes in a graph. Each node corresponds to a different combination of anonymization levels, one for each quasi-identifier. For example, if we have the quasi-identifiers "Age" and "Zipcode", one solution will represent "Age" anonymized to the first hierarchy level and "Zipcode" to the second, and another will represent "Age" anonymized to the second hierarchy level and "Zipcode" to the first. All possible combinations will be represented. Blue nodes indicate safe solutions and red nodes unsafe ones. By double-clicking on a node, the respective solution is applied and the platform redirects the user to the anonymized dataset.
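The solution space can be sketched in Python as the Cartesian product of the hierarchy levels of each quasi-identifier, with each combination classified as safe or unsafe. The generalization functions, the level numbering (0 means the original value) and the sample records below are assumptions for illustration, not Amnesia's implementation.

```python
from collections import Counter
from itertools import product

# Hypothetical per-level generalization functions; level 0 keeps the value.
GENERALIZERS = {
    "Age": [
        lambda v: v,
        lambda v: f"{v // 10 * 10}-{v // 10 * 10 + 9}",
        lambda v: "*",
    ],
    "Zipcode": [
        lambda v: v,
        lambda v: v[:3] + "**",
        lambda v: "*",
    ],
}


def is_k_anonymous(records, levels, k):
    """Check k-anonymity after generalizing each attribute to its level."""
    groups = Counter(
        tuple(GENERALIZERS[a][lvl](r[a]) for a, lvl in levels.items())
        for r in records
    )
    return all(size >= k for size in groups.values())


def classify_solutions(records, k):
    """Map every combination of levels to True (safe) or False (unsafe)."""
    names = list(GENERALIZERS)
    result = {}
    for combo in product(*(range(len(GENERALIZERS[n])) for n in names)):
        result[combo] = is_k_anonymous(records, dict(zip(names, combo)), k)
    return result


records = [
    {"Age": 31, "Zipcode": "11521"},
    {"Age": 36, "Zipcode": "11528"},
    {"Age": 42, "Zipcode": "11634"},
    {"Age": 47, "Zipcode": "11639"},
]
solutions = classify_solutions(records, k=2)
```

With three levels per attribute there are nine nodes. In this sample, generalizing only one of the two attributes by a single level leaves every record unique, while generalizing both by one level already yields groups of two.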
Amnesia gives the user the opportunity to further customize the final solution by transforming an unsafe solution into a safe one using suppression. It is often the case that an unsafe solution violates the desired k-anonymity guarantee for just a few records. For example, in a dataset with 100k records, there might be only 5 records that do not fall into k-sized groups. Instead of further generalizing the whole dataset, the user might opt for suppressing, i.e., removing, these 5 records. Amnesia will show the percentage of records that violate the guarantee if the user follows these steps: 1. Choose an unsafe solution in the solution graph, 2. Choose “Show statistics”, 3. Select the combination of all available quasi-identifiers. Amnesia will show, on the left part of the pop-up window, the percentage of records that violate the privacy guarantee based on the combination of quasi-identifiers chosen. If the user presses the suppress button at the bottom right of the window, these records will be removed, and the unsafe solution will be transformed into a safe one and applied directly to the dataset.
The user can get a preview of each solution in the solution graph. When the user selects a solution node with a single click, a pop-up menu appears; the user then clicks the option “Preview of the Anonymized dataset”.
An anonymization solution comprises a series of anonymization rules that define how each quasi-identifier must be anonymized, e.g., Rule 1: the "Country" attribute should be anonymized to the continent level. Amnesia allows the user to save these rules, so that they can be reused on the same or similar datasets in the future. Rules are saved by using the "Save Rules" button in the results screen.
Saving the anonymized dataset is possible through the results screen or the anonymized dataset screen (Anonymized->Source), by pressing the button “Save To Local”, located in the top right menu.