Individuals are not present in information systems; only data about individuals is. When we talk about identity, then, we mean the relationship between a piece of information and the person or people associated with it. For example, a data item may be created by, about, or sent to an individual. Identity encapsulates everything we know about that person.
Degrees of Identifiability
Identifiability is the extent to which a person can be identified. Highly identifiable individuals who engage in certain activities (such as opening a new account or providing demographic data to a website) may be more vulnerable to identity theft and surveillance. The clearest case of identification is when we can connect the data to a single person and know exactly who that person is. This gives us an identified person, the strongest form of identity.
A pseudonym is a weaker form of identity. When the person a data item is about is known only by a pseudonym, we can still link multiple data items about that same person. One major privacy benefit of digital identity is the ability to create and use pseudonyms, which lets us separate an individual's online persona from their real-life persona. This kind of secrecy can be deceptive, however, because it is often possible to determine who is really behind a pseudonym.
Anonymity is the weakest form of identity. When data is genuinely anonymous, we cannot identify the person it pertains to, or even determine whether two data items pertain to the same person.
Quasi-identifiers are attributes that, combined with outside knowledge such as publicly available information, can identify a person. In the general population, for instance, “age 25” might not be a strong identifier, but “age 105” would be.
We can see the distinctions by applying a formal definition. Suppose we have a set of data items D = {d1, …, dn} and an identity function I(d) that tells us who a data item d is about. For a known individual i, I(d) = i means that d is about an identified person. If we can state that I(dj) = I(dk), meaning that the two data items are about the same person, but we do not know who that person is, then I(dk) is a pseudonym. If we can determine neither statement (neither an identified individual nor a pseudonym), the data is anonymous.
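To make the definition concrete, here is a minimal Python sketch of the identity function I(d). The DataItem fields, class names, and example values are hypothetical, chosen only to illustrate the three cases:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DataItem:
    payload: str
    subject_name: Optional[str] = None  # known real-world identity, if any
    pseudonym: Optional[str] = None     # stable pseudonymous ID, if any

def identity_class(d: DataItem) -> str:
    """Classify a data item by the strongest identity claim we can make."""
    if d.subject_name is not None:
        return "identified"    # I(d) = i for a known individual i
    if d.pseudonym is not None:
        return "pseudonymous"  # we can test I(dj) == I(dk) without naming anyone
    return "anonymous"         # neither statement can be determined

items = [
    DataItem("opened account", subject_name="Alice Smith"),
    DataItem("search: privacy law", pseudonym="user-7f3a"),
    DataItem("search: k-anonymity", pseudonym="user-7f3a"),
    DataItem("page view"),
]

for d in items:
    print(identity_class(d), "-", d.payload)

# Two pseudonymous items are linkable even though the person is unknown:
print(items[1].pseudonym == items[2].pseudonym)  # True
```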
How Is Identity Used?
One strategy is to use unique identifiers that have been created elsewhere. This has benefits and drawbacks. Externally created identifiers are often more user-friendly because the user can reuse an existing identifier, minimizing the number of identifiers the user needs to remember. These identifiers also make it easier to link data across different systems.
Linkability, however, can pose a privacy problem, because connecting data from several platforms facilitates identity theft and other fraudulent activities. A better approach is to give the user the option of choosing the external identifier. Email addresses, for example, are guaranteed to be unique by the nature of the internet, yet users are free to have more than one, giving them the opportunity to remain anonymous if they wish.
Non-repudiation, the assurance that an individual cannot deny an action or transaction, is a further justification for establishing identity. For instance, to process a credit card transaction, the retailer must be able to verify the legitimacy of the purchase. Here again, even though systems are often configured to require a specific person, a role (an authorized user) is all that is actually required. While this distinction can be difficult to maintain in in-person interactions, it offers new ways to safeguard identity while guaranteeing the accuracy of information.
Information systems also employ identity to improve the user experience, especially through personalization. For instance, a recognized person's web searches can be tailored according to their past behavior or stated preferences.
In this situation, a pseudonym is all that is needed: knowing that a collection of searches all came from the same individual, along with any preferences offered by that individual, is just as effective even if we do not actually know who that individual is.
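As a rough illustration of this point, the following sketch personalizes results using only a random pseudonym; the in-memory store and function names are invented for the example:

```python
import secrets

# pseudonym -> profile; no real-world identity is stored anywhere
preferences: dict = {}

def new_pseudonym() -> str:
    # A random, meaningless token that reveals nothing about the person.
    return secrets.token_hex(8)

def record_search(pseudonym: str, query: str) -> None:
    preferences.setdefault(pseudonym, {"history": []})["history"].append(query)

def personalize(pseudonym: str) -> list:
    # Tailor results using only this pseudonym's own history.
    return preferences.get(pseudonym, {}).get("history", [])

pid = new_pseudonym()
record_search(pid, "privacy engineering")
record_search(pid, "k-anonymity")
print(personalize(pid))  # ['privacy engineering', 'k-anonymity']
```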
Representing Identity
Identity can be represented in many ways, each with pros and cons. The simplest method is to rely on external, easily remembered identifiers, such as a person's name or login details. Unfortunately, these are rarely unique, which raises the risk of mistaken identity.
This is why banking typically requires a name and date of birth, and online credit card transactions ask for a name and address. Often the combination of elements is unique, or the non-unique occurrences are rare enough to be handled after the fact, so these combinations usually lead to a specific person being recognized.
Another option is a system-generated user ID or unique number. This offers the same opportunity for pseudonymity and, if constructed appropriately, can provide some privacy, for instance by excluding user names from the user ID. While user-generated user IDs can still offer privacy, a system-generated, unique user ID protects privacy for all users, even those who are unaware of the privacy hazards of identifiability.
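A minimal sketch of such a system-generated identifier, assuming a random UUID is an acceptable format; the function name is illustrative:

```python
import uuid

def new_user_id() -> str:
    # uuid4 is random, so the ID embeds no names, dates, or locations
    # and cannot be reversed into attributes of the user.
    return str(uuid.uuid4())

print(new_user_id())  # unique and content-free
```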
By contrast, U.S. Social Security numbers, which are unique nine-digit numbers, were historically generated using location and date information that can help link persons to their numbers. To properly support anonymity, a system user ID must not be derived from other identifying data. Technologies designed specifically for this purpose can also be used to express identity. Commercial programs such as Google Wallet and Microsoft Passport offer a versatile architecture for storing and managing identifying data, and public-key infrastructure and cryptographic certificates provide identity-verification procedures. These systems typically combine identity representations with other identity-related data (name, address, etc.) and can offer authentication methods.
De-Identification
Data needs to be de-identified before disclosure, both for legal compliance and to protect the identity, and hence the privacy, of individuals.
De-identification is the removal or modification of information in a dataset to make it more difficult or impossible to identify specific individuals. In other words, de-identification moves a dataset along the identifiability spectrum toward anonymity, protecting privacy by reducing the ability to link data with specific individuals.
There are numerous ways to de-identify data, which are briefly described in this section. De-identification is not a single technique but a collection of methods and algorithms that can be applied to data, with different outcomes and levels of efficacy. It is one of the primary techniques used to prevent an individual's identity from being connected to their personal information. De-identification can be accomplished in many ways, including:
- Pseudonymization: Individual identifiers (such as names) are replaced with numbers, letters, symbols, or a combination, such that the data points are not directly associated with a specific individual.
- Anonymization: Direct and indirect identifiers are removed and mechanisms are put in place to prevent re-identification.
- Tokenization: An example of pseudonymization that de-identifies data by using random tokens as stand-ins for meaningful data.
- K-anonymity, L-diversity, and T-closeness: Formal models created to lessen the possibility that someone could combine the data with existing outside information to draw conclusions about specific individuals within a dataset.
K-anonymity depends on the creation of generalized, truncated, or redacted quasi-identifiers as replacements for direct identifiers, so that a given minimum number (“K”) of individuals in a dataset share the same identifier. L-diversity builds on K-anonymity by requiring at least “L” distinct values for sensitive attributes within each group of K records. T-closeness refines L-diversity by requiring that the distribution of a sensitive attribute within each group be close to its distribution in the dataset as a whole. Privacy technologists should understand the advantages and disadvantages of each methodology, since employing these methods to secure personal data may make the resulting dataset less useful. There is often a tension between de-identifying data to protect privacy and retaining identifiable data to maintain its utility. A sketch of the first two checks follows.
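As a rough illustration (the records, generalization rules, and function names are all invented for this sketch), the code below generalizes two quasi-identifiers and then checks K-anonymity and L-diversity on the result:

```python
from collections import Counter

# Toy records: the quasi-identifiers are age and ZIP code;
# the sensitive attribute is the medical condition.
records = [
    {"age": 25, "zip": "47677", "condition": "flu"},
    {"age": 27, "zip": "47602", "condition": "flu"},
    {"age": 25, "zip": "47678", "condition": "cold"},
    {"age": 29, "zip": "47605", "condition": "cold"},
]

def generalize(r: dict) -> tuple:
    # Coarsen age into a ten-year band and truncate the ZIP code.
    low = (r["age"] // 10) * 10
    return (f"{low}-{low + 9}", r["zip"][:3] + "**")

def is_k_anonymous(rows: list, k: int) -> bool:
    # Every combination of generalized quasi-identifiers occurs >= k times.
    counts = Counter(generalize(r) for r in rows)
    return all(n >= k for n in counts.values())

def is_l_diverse(rows: list, l: int) -> bool:
    # Each group must contain >= l distinct sensitive values.
    groups = {}
    for r in rows:
        groups.setdefault(generalize(r), set()).add(r["condition"])
    return all(len(vals) >= l for vals in groups.values())

print(is_k_anonymous(records, k=2))  # True
print(is_l_diverse(records, l=2))    # True
```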
Various laws and regulations worldwide define de-identification differently. For example, the U.S. Health Insurance Portability and Accountability Act (HIPAA) covers the processing of health data but does not restrict disclosure of defined “de-identified” information, where there is “no reasonable basis to believe that the information can be used to identify an individual.” The EU's General Data Protection Regulation (GDPR) uses the terms “anonymous” and “pseudonymous.”
In general, information that has been rendered irreversibly anonymous, such that an individual is no longer identifiable, is not subject to the requirements of the GDPR. In contrast, if the information has been “pseudonymized” in such a way that it can still be used to identify or re-identify an individual, it remains “personal data” and continues to fall within the GDPR's scope. One important risk to consider when releasing de-identified data is re-identification: an attacker or other third party could re-associate the data with the specific individuals to whom it relates unless the data is rendered irreversibly anonymous, which is a very difficult standard to meet given evolving technology and the ubiquity of external datasets.
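A linkage attack of this kind can be sketched in a few lines. The datasets below are invented, but the mechanics, joining a release with an external dataset on shared quasi-identifiers such as age and ZIP code, mirror how re-identification typically occurs:

```python
# Invented datasets for illustration only.
released = [  # a "de-identified" health data release
    {"age": 34, "zip": "02139", "diagnosis": "asthma"},
]
public = [    # an external dataset that includes names
    {"name": "Pat Doe", "age": 34, "zip": "02139"},
]

for r in released:
    matches = [p for p in public
               if p["age"] == r["age"] and p["zip"] == r["zip"]]
    if len(matches) == 1:  # a unique match re-identifies the record
        print(f"Re-identified: {matches[0]['name']} -> {r['diagnosis']}")
```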
Recommended Actions
By consistently applying basic privacy standards, operational teams can reduce the risks of disclosing de-identified data through measures such as the following:
- De-identify datasets to the extent possible and practical under the circumstances (e.g., by stripping all unnecessary personal data or by using other techniques such as aggregation and tokenization; see the tokenization sketch after this list).
- Whenever possible and practical, disclose de-identified data only to reputable third parties that contractually commit to use the data only for legitimate purposes and not to combine the dataset with any other external data without prior consent.
- Apply the collection limitation and use limitation principles to restrict data to what is directly relevant and necessary to accomplish the specified purpose. For example, do not process or publish data unless it is necessary for scientific or research analysis.
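As an illustration of the tokenization technique mentioned above, here is a minimal sketch; the vault structure and function names are hypothetical:

```python
import secrets

# token -> original value; kept in a separately secured store, never disclosed
vault = {}

def tokenize(value: str) -> str:
    # Replace a meaningful identifier with a random, meaningless token.
    token = secrets.token_urlsafe(12)
    vault[token] = value
    return token

record = {"name": "Alice Smith", "diagnosis": "flu"}
disclosed = {"name": tokenize(record["name"]), "diagnosis": record["diagnosis"]}
print(disclosed)  # the name field is now an opaque token
```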
Ultimately, it is the designated owner of a dataset who must ensure that the dataset is de-identified as much as practically possible under the circumstances and in compliance with applicable privacy and data protection regulations.