Features

Pseudo-anonymization (also called pseudonymization) is a de-identification method that removes or replaces the direct identifiers (names, IDs, phone numbers, etc.) in a sensitive dataset (health records, medical prescriptions, financial information, online surveys, workplace files, etc.).

Masking is a pseudo-anonymization technique that hides part of the information in a dataset by replacing it with substitute characters. It is widely used to hide part of a credit card number during payment processing.
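As an illustration, masking a card number could look like the following minimal sketch. The function name `mask_card_number` and its parameters are hypothetical and chosen for this example only. It is one possible way to mask a value, not a description of any particular product's implementation.

```python
def mask_card_number(card_number: str, visible: int = 4, mask_char: str = "*") -> str:
    """Replace every digit except the last `visible` ones with `mask_char`.

    Separators such as spaces or dashes are kept in place.
    """
    total_digits = sum(ch.isdigit() for ch in card_number)
    to_mask = max(total_digits - visible, 0)  # number of leading digits to hide
    out = []
    for ch in card_number:
        if ch.isdigit() and to_mask > 0:
            out.append(mask_char)
            to_mask -= 1
        else:
            out.append(ch)
    return "".join(out)


print(mask_card_number("4111 1111 1111 1111"))  # **** **** **** 1111
```

Keeping the last four digits visible is a common convention for receipts, but the number of visible characters is a choice that depends on the use case.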

However, the remaining identifying data (date of birth, ZIP code, gender, marital status, etc.) can be combined to re-identify individuals and compromise their privacy. Note that the triple (date of birth, gender, ZIP code) is estimated to uniquely identify about 87% of the US population in publicly available datasets. Such identifying data are called quasi-identifiers or indirect identifiers.
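To make the re-identification risk concrete, the sketch below measures how many records in a dataset are unique on a chosen set of quasi-identifiers. The helper `unique_fraction`, the field names, and the sample records are all invented for illustration.

```python
from collections import Counter


def unique_fraction(records, quasi_identifiers):
    """Fraction of records whose quasi-identifier combination occurs exactly once."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(keys) if keys else 0.0


# Made-up records with direct identifiers already removed.
records = [
    {"dob": "1985-03-12", "gender": "F", "zip": "02139"},
    {"dob": "1985-03-12", "gender": "F", "zip": "02139"},
    {"dob": "1990-07-01", "gender": "M", "zip": "02139"},
    {"dob": "1972-11-30", "gender": "F", "zip": "10001"},
]

print(unique_fraction(records, ["dob", "gender", "zip"]))  # 0.5
```

Here two of the four records are the only ones with their (date of birth, gender, ZIP code) combination, so anyone who knows those three values for a person can single out that person's record.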

K-anonymity is a formal privacy guarantee with a precise mathematical definition. It guarantees that every combination of quasi-identifier values in the dataset matches at least k persons, who are therefore indistinguishable from one another on those values. To achieve this, records are pooled into larger groups that share the same quasi-identifier values, so that no group corresponds to a single person. K is the minimum number of occurrences of any combination of quasi-identifier values in the dataset. If k=3, the dataset is 3-anonymous, meaning that at least three records (persons) share each combination of quasi-identifier values.

Pooling records into larger groups is achieved through generalization, which reduces an attribute's specificity by replacing a specific value with a more general one. For example, specific ages can be generalized into age groups (e.g., the group 30-35 covers the ages 30, 31, 32, 33, 34, and 35). For data values that are irrelevant to the purpose of the data collection, k-anonymity can also be achieved by using suppression as a complement to generalization. Suppression removes an attribute's value from the dataset entirely.
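The following sketch shows how generalization and suppression can raise k on a toy dataset. All names (`k_of`, `generalize_age`, `anonymize`), the six-year age bins, the three-digit ZIP prefix, and the decision to suppress marital status are assumptions made for this example. Real tools typically search for the least lossy generalization rather than hard-coding one.

```python
from collections import Counter


def k_of(records, quasi_identifiers):
    """Return the largest k for which the records are k-anonymous."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(counts.values()) if counts else 0


def generalize_age(age, width=6, origin=30):
    """Map an age to a range label such as '30-35' (bins of `width` years)."""
    low = origin + ((age - origin) // width) * width
    return f"{low}-{low + width - 1}"


def anonymize(records):
    """Generalize age and ZIP code, and suppress marital status."""
    return [
        {
            "age": generalize_age(r["age"]),  # generalization: age -> age group
            "zip": r["zip"][:3] + "**",       # generalization: keep 3-digit prefix
            "marital_status": "*",            # suppression: value removed
        }
        for r in records
    ]


# Made-up records for illustration.
records = [
    {"age": 30, "zip": "02139", "marital_status": "single"},
    {"age": 33, "zip": "02138", "marital_status": "married"},
    {"age": 35, "zip": "02141", "marital_status": "single"},
    {"age": 36, "zip": "10001", "marital_status": "married"},
    {"age": 38, "zip": "10002", "marital_status": "single"},
    {"age": 41, "zip": "10003", "marital_status": "divorced"},
]
qi = ["age", "zip", "marital_status"]

print(k_of(records, qi))             # 1
print(k_of(anonymize(records), qi))  # 3
```

Before anonymization every record is unique (k=1). Afterwards the records fall into two groups of three, so the dataset is 3-anonymous, at the cost of less precise ages and ZIP codes.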

Km-anonymity (written k^m-anonymity) is a weaker form of k-anonymity that is better suited to high-dimensional data. As in k-anonymity, the algorithm considers a number n of quasi-identifiers, but the guarantee now holds only against adversaries who know only m of the n quasi-identifiers (m << n). In other words, the anonymization algorithm guarantees that each combination of values of any m quasi-identifiers that occurs in the dataset appears at least k times, regardless of the total number n of quasi-identifiers.
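A direct check of this definition can be sketched as follows. The function `is_km_anonymous` and the sample records are hypothetical. The check enumerates every subset of m quasi-identifiers, which is fine for a small example but grows quickly with n.

```python
from collections import Counter
from itertools import combinations


def is_km_anonymous(records, quasi_identifiers, k, m):
    """True if every value combination of any m quasi-identifiers occurs >= k times."""
    m = min(m, len(quasi_identifiers))  # an adversary cannot know more than n attributes
    for subset in combinations(quasi_identifiers, m):
        counts = Counter(tuple(r[q] for q in subset) for r in records)
        if any(c < k for c in counts.values()):
            return False
    return True


# Made-up, already generalized records.
records = [
    {"age": "30-35", "zip": "021**", "gender": "F"},
    {"age": "30-35", "zip": "100**", "gender": "M"},
    {"age": "36-41", "zip": "021**", "gender": "M"},
    {"age": "36-41", "zip": "100**", "gender": "F"},
]
qi = ["age", "zip", "gender"]

print(is_km_anonymous(records, qi, k=2, m=1))  # True
print(is_km_anonymous(records, qi, k=2, m=2))  # False
print(is_km_anonymous(records, qi, k=2, m=3))  # False
```

In this toy dataset every single attribute value appears twice, so an adversary who knows one quasi-identifier cannot narrow a person down to fewer than two records. Knowing any two quasi-identifiers isolates a single record, so the data is 2^1-anonymous but not 2-anonymous. Checking subsets of exactly m attributes is sufficient, because a combination of fewer attributes can only match more records.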

Generalization and suppression should be applied in a way that preserves the usability of the data. To reduce information loss, demographic statistics can be used to provide limited yet valuable privacy guarantees. In this case, k-anonymity may not hold within the dataset itself; however, it is guaranteed with respect to the population distribution.