De-Identification in Research Data Management - Omnibus

# Overview

Researchers routinely make commitments of confidentiality to the people whose data they collect, typically through informed consent and IRB-approved protocols. Confidentiality is the ethical promise that information about identifiable individuals will not be disclosed without authorization. *De-identification* is one of the technical means of upholding that promise: It modifies a dataset so the individuals it describes cannot reasonably be re-identified, making the data safer to store, analyze, share, and archive.

De-identification is required (with varying specificity) by frameworks including HIPAA in the US health context, FERPA for US educational data, and GDPR in the EU.

There are two important distinctions.

- **De-identification vs. anonymization.** These terms are often used interchangeably, but they are not the same. _Anonymization_ implies irreversible removal of all linkages between data and individual. _De-identification_ is a weaker standard; it removes obvious identifiers but typically retains the possibility of re-identification by someone with sufficient auxiliary information. Most "anonymized" research data is, more accurately, de-identified.
- **De-identification vs. pseudonymization.** Replacing names with codes (e.g., UUIDs or SGICs) is _pseudonymization_: The data are protected against casual inspection, but a linking key may still exist somewhere. Pseudonymized data remain regulated under HIPAA, GDPR, and most IRB frameworks.

The central practical question is not _whether_ data are "de-identified" in some absolute sense, but _what is the residual risk of re-identification_, and whether that risk is acceptable given the data's sensitivity, the intended audience, and the use case. De-identification, in short, is a risk-management activity in service of a confidentiality commitment, not a checkbox.
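To make the pseudonymization distinction concrete, here is a minimal Python sketch. The function name `pseudonymize`, the field names, and the sample records are all hypothetical, chosen only for illustration. It swaps a direct identifier for a random UUID and returns the linking key separately; as long as that key exists somewhere, the coded records can be traced back to individuals.

```python
import uuid


def pseudonymize(records, id_field="name"):
    """Replace a direct identifier with a random code.

    Returns (coded_records, linking_key). The field names and record
    layout are illustrative assumptions, not a standard.
    """
    linking_key = {}  # code -> original identifier; store securely or destroy
    codes = {}        # original identifier -> code, so repeat rows share a code
    coded_records = []
    for record in records:
        original = record[id_field]
        if original not in codes:
            codes[original] = str(uuid.uuid4())
            linking_key[codes[original]] = original
        # Copy every field except the direct identifier, then attach the code.
        coded = {k: v for k, v in record.items() if k != id_field}
        coded["participant_id"] = codes[original]
        coded_records.append(coded)
    return coded_records, linking_key


# Hypothetical sample data
records = [
    {"name": "Ana Ruiz", "zip": "02139", "birth_year": 1984},
    {"name": "Ben Cho", "zip": "02139", "birth_year": 1991},
    {"name": "Ana Ruiz", "zip": "02139", "birth_year": 1984},
]
coded, key = pseudonymize(records)
```

Note that the other columns (`zip` and `birth_year` in this toy data) pass through unchanged, so the coded records still carry whatever identifying power those variables have in combination.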
# The Limits of Simply Removing Names

A common intuition is that removing direct identifiers (names, SSNs, addresses, phone numbers) makes a dataset safe to share. This intuition is wrong, and demonstrably so.

The foundational result is Latanya Sweeney's 2000 finding that the combination of **5-digit ZIP code, full date of birth, and sex** uniquely identified approximately **87% of the US population** based on 1990 census data. A 2006 re-analysis by Golle, using more careful methodology, revised this figure to roughly 61–63%, which is still a substantial majority. The implication is direct: A dataset stripped of name and address but containing these three demographic variables is _not_ de-identified in any meaningful sense, because those three variables in combination act as a near-unique fingerprint that can be linked back to publicly available records like voter rolls.

This is why HIPAA's **Safe Harbor** rule specifies 18 categories of identifiers that must be removed, not just name and SSN. The list includes geographic subdivisions smaller than a state (ZIP codes can be retained only at the 3-digit level, and only where the 3-digit area contains more than 20,000 people), all date elements more specific than year, ages over 89, and a catch-all "any other unique identifying number, characteristic, or code." Safe Harbor reflects an empirical lesson: Re-identification attacks succeed by combining innocuous-looking variables, so the legal standard had to expand far beyond the obvious identifiers.

The general principle: **Identification is a property of variable combinations, not individual variables.** A dataset can be deeply identifying even when every column, viewed alone, looks harmless.

# Quasi-Identifiers

The technical term for these "innocuous in isolation, identifying in combination" variables is **quasi-identifiers**. The standard taxonomy (used in k-anonymity work, HIPAA guidance, and FERPA implementation) recognizes four kinds of variables.
- **Direct identifiers** are variables that identify a person on their own (name, SSN, email address, full street address). These should be removed or replaced for any dataset intended for sharing.
- **Quasi-identifiers** are variables that do not identify on their own but can identify in combination, especially when linked to external data sources. Common examples include date of birth, ZIP code or other geography, sex, race/ethnicity, occupation, education level, job title, employer, and household composition. In educational research, common quasi-identifiers include grade level, school, district, and disability status.
- **Sensitive attributes** are variables whose disclosure causes harm or embarrassment (medical diagnosis, income, test scores, disciplinary records). These are what de-identification is _protecting_, not what it is removing.
- **Non-identifying attributes** are variables that pose no realistic identification risk (e.g., the outcome of a randomized intervention if reported only in aggregate).

The hard part is that the boundary between "quasi-identifier" and "non-identifying" is **context-dependent**. A variable that is harmless in a national dataset of 100,000 cases may be identifying in a school-district dataset of 200. This is why a fixed rule like Safe Harbor is necessarily imperfect; it cannot anticipate every context, and an "expert determination" (HIPAA's alternative pathway) requires actually assessing the joint distribution of variables in the dataset at hand.

A useful heuristic: **Assume an adversary has access to public records** (voter rolls, social media, school directories, news coverage) and ask which combinations of your variables, plus those public sources, could uniquely identify a respondent. If the answer is "many," the dataset is not ready to share.
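A first pass at the heuristic above is to count how many records share each combination of candidate quasi-identifiers. The sketch below is a minimal Python illustration under stated assumptions: the function names, column names, and sample records are hypothetical, and it inspects only the dataset itself, not the external public sources an adversary might link to.

```python
from collections import Counter


def combination_counts(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )


def risky_combinations(records, quasi_identifiers, threshold=1):
    """Return the combinations shared by `threshold` or fewer records."""
    counts = combination_counts(records, quasi_identifiers)
    return {combo: n for combo, n in counts.items() if n <= threshold}


# Hypothetical survey records (illustrative values only)
records = [
    {"district": "A", "grade": 5, "years_experience": 22},
    {"district": "A", "grade": 5, "years_experience": 3},
    {"district": "A", "grade": 5, "years_experience": 3},
    {"district": "B", "grade": 2, "years_experience": 10},
    {"district": "B", "grade": 2, "years_experience": 10},
]

quasi_identifiers = ["district", "grade", "years_experience"]
unique = risky_combinations(records, quasi_identifiers)
smallest_group = min(combination_counts(records, quasi_identifiers).values())
```

In this toy data, `unique` holds one combination that a single record carries, so that respondent stands alone on the chosen variables. Raising `threshold` flags small groups as well as singletons. A real assessment would also need to consider which of these variables appear in outside sources.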
## k-Anonymity

**k-anonymity** is the most widely used formal model for evaluating re-identification risk based on quasi-identifiers, introduced by Samarati and Sweeney in the late 1990s. It is essentially a way to operationalize the heuristic above.

The core idea is simple: A dataset is **k-anonymous** with respect to a chosen set of quasi-identifiers if every combination of those quasi-identifier values appears in at least _k_ records. In other words, every individual in the dataset is indistinguishable from at least _k_ − 1 others on the basis of those variables.

A concrete example. Suppose a dataset of teacher survey responses contains the quasi-identifiers {district, grade level, years of experience} alongside sensitive responses about administrator trust. If a particular combination (say, _District A, 5th grade, 22 years experience_) appears for only one teacher, that teacher is uniquely identifiable to anyone who knows her demographics. The dataset is **1-anonymous** with respect to those quasi-identifiers, which is to say, not anonymous at all. If the dataset is restructured so every combination matches at least five teachers, it becomes **5-anonymous**, and any specific demographic profile traces to at least five possible respondents.

Two standard techniques produce k-anonymity.

- **Generalization** replaces precise values with broader categories. Date of birth becomes birth year; ZIP code becomes 3-digit ZIP; "22 years experience" becomes "20–29 years."
- **Suppression** removes or masks values, either at the cell level (replacing rare values with a missing code) or at the record level (dropping outlier rows that cannot be generalized into a sufficient group).

**The k-anonymity tradeoff.** Higher _k_ means lower re-identification risk but greater information loss.
With aggressive generalization, the data become safer but also less analytically useful; categories grow coarser, rare-but-important cases disappear, and statistical power for subgroup analyses drops. There is no value of _k_ that is universally "right"; it depends on the sensitivity of the sensitive attributes, the size of the dataset, the audience, and the analytic uses planned. FERPA's de-identification guidance, for instance, implicitly endorses _k_ = 5 (at least four others with the same quasi-identifier combination).

**What k-anonymity does not protect against.** Two well-documented limitations are worth knowing.

- **Homogeneity attack.** If all _k_ records in a group share the same sensitive attribute value (e.g., all five teachers with profile _District A / 5th grade / 20–29 years experience_ report low administrator trust), the adversary learns the sensitive value even though they cannot identify the individual. Refinements like **l-diversity** and **t-closeness** address this by requiring variation in sensitive attributes within each group.
- **Background-knowledge attack.** An adversary who already knows something about a target (e.g., "I know she is in this dataset, and I know she has a child with a disability") can narrow the candidates within a k-anonymous group, sometimes to one.

These limitations do not invalidate k-anonymity; they clarify that it is a _minimum_ standard, not a guarantee. In practice, k-anonymity is best treated as a structured way to think about residual risk and to make explicit, defensible choices, rather than a formula that produces "safe" data.

# Practical Implications for RDM

Several practical principles follow from the discussion above.

- **De-identification serves confidentiality; it does not replace it.** Confidentiality is an ethical commitment to participants. De-identification is one technical tool for honoring that commitment, alongside secure storage, access controls, and disciplined data-sharing practices.
- **Pseudonymization (UUIDs, SGICs, replacement IDs) is necessary but not sufficient.** Removing names protects against casual inspection; it does not protect against linkage attacks via quasi-identifiers.
- **Variables are not identifying or non-identifying in themselves; combinations are.** Researchers should learn to look at their data as an adversary would, asking which joint distributions of variables are sparse enough to identify individuals.
- **Small samples amplify risk.** Most applied research in education and the behavioral sciences uses samples small enough that k-anonymity with even modest _k_ requires aggressive generalization. This tension between data utility and disclosure risk is itself a topic worth surfacing in graduate-level RDM training.

___

# Sources

- Sweeney, L. (2000). _Simple Demographics Often Identify People Uniquely_ (Carnegie Mellon University, Data Privacy Working Paper 3). https://dataprivacylab.org/projects/identifiability/paper1.pdf
- Sweeney, L. (2002). k-Anonymity: A model for protecting privacy. _International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems_, 10(5), 557–570.
- Golle, P. (2006). Revisiting the uniqueness of simple demographics in the US population. In _Proceedings of the 5th ACM Workshop on Privacy in Electronic Society_. https://crypto.stanford.edu/~pgolle/papers/census.pdf
- US Department of Health and Human Services. _Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the HIPAA Privacy Rule._ https://www.hhs.gov/hipaa/for-professionals/special-topics/de-identification/index.html
- Utrecht University. _Data Privacy Handbook_: K-anonymity, l-diversity and t-closeness. https://utrechtuniversity.github.io/dataprivacyhandbook/k-l-t-anonymity.html
- Wikipedia. _k-anonymity._ https://en.wikipedia.org/wiki/K-anonymity