Self-Generated Identification Codes (SGICs) - Omnibus

> [!cite] Source
> Yurek, L. A., Vasey, J., & Havens, D. S. (2008). The use of self-generated identification codes in longitudinal research. *Evaluation Review*, *32*(5), 435–452. https://doi.org/10.1177/0193841X08316676

## Purpose and Uses

A *self-generated identification code* (SGIC) is a respondent-constructed alias used to consistently link the same individual's data across multiple data collection waves **without** collecting identifying information. Longitudinal research has a recurring practical problem: it is hard to match paired observations of the same individuals over time while preserving their anonymity. SGICs offer a solution to that problem.

SGICs are particularly useful when:

- The design is longitudinal (panel, cohort, repeated measures) and requires matching the same respondents across waves so paired data are not mistakenly treated as independent samples.
- Survey content is sensitive (e.g., substance use, workplace attitudes, health behaviors) and anonymity assurances are needed to improve response validity.
- Respondents are in subordinate or vulnerable positions (e.g., employees, students, patients) where perceived identifiability could discourage candid responses.
- A traditional identifier (name, SSN, employee ID) would either compromise anonymity or impose unacceptable administrative burden.

Reported match rates depend heavily on the interval between waves: roughly 88–100% at 1–4 months, declining to ~50–75% at 12–18 months. Yurek et al. (2008) achieved 51–67% match rates across 6-, 12-, and 18-month intervals with a hospital nurse sample.

## Constructing an SGIC

The SGIC is built from a short set of researcher-constructed questions that the respondent answers identically at each wave. Items should be:

- **Stable**: The answer should not change over time (e.g., birth month is fully stable; "number of older brothers" is less stable than "number of older brothers, living and deceased").
- **Variable**: Answers should be well-distributed across the sample so that the combined code is probabilistically unique (e.g., ethnicity offers poor variability in a homogeneous sample).
- **Proximate to self**: Items about the respondent personally yield lower error and omission rates than items about other people. In Yurek et al.'s data, items about the respondent's mother or siblings had error/omission rates 3 to 15 times higher than items about the respondent's own birth month or middle name.

A typical four-element question set (from the article) and two worked examples:

| Question                                       | Maria Hernandez  | David Chen   |
| ---------------------------------------------- | :--------------: | :----------: |
| First letter of mother's first name            | Elena → `E`      | Wei → `W`    |
| Number of older brothers (living and deceased) | 2 → `02`         | 0 → `00`     |
| Month of birth (two digits)                    | September → `09` | March → `03` |
| First letter of own middle name (X if none)    | Luisa → `L`      | (none) → `X` |
| SGIC                                           | `E0209L`         | `W0003X`     |

To improve match rates, researchers commonly apply a two-phase match: an exact match first, then an "off-one" match allowing a single element to differ or be missing if other demographic information is consistent.

## Practical Notes for the Course

- SGICs are a privacy-preserving design choice, not a data-cleaning afterthought. Thus, the item set must be planned before Wave 1.
- Match rates degrade with longer intervals and with less stable item sets; researchers should expect and document attrition between waves.
- More elements generally improve uniqueness, but at the cost of increased respondent burden and higher cumulative error rates. Item quality matters more than item count.
- Items must be specific enough to discriminate respondents but not so individualized that anonymity is compromised.
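
### Illustrative sketch: building and matching codes

The four-element construction and the two-phase match described under "Constructing an SGIC" can be sketched in Python. This is a minimal illustration, not the procedure used by Yurek et al. (2008): the function names (`build_elements`, `two_phase_match`), the row IDs, the use of an empty string for a skipped item, and the `demographics_consistent` hook are all assumptions introduced here. The sketch also assumes an off-one match is accepted only when exactly one candidate remains.

```python
from typing import Callable, Optional

# (mother's initial, older brothers, birth month, own middle initial)
Elements = tuple[str, str, str, str]


def build_elements(mother_first_name: str, older_brothers: int,
                   birth_month: int, middle_name: Optional[str]) -> Elements:
    """Encode the four answers as fixed-width code elements."""
    if not 1 <= birth_month <= 12:
        raise ValueError("birth_month must be between 1 and 12")
    if not 0 <= older_brothers <= 99:
        raise ValueError("older_brothers must fit in two digits")
    has_middle = bool(middle_name and middle_name.strip())
    return (
        mother_first_name.strip()[0].upper(),
        f"{older_brothers:02d}",
        f"{birth_month:02d}",
        middle_name.strip()[0].upper() if has_middle else "X",
    )


def sgic(elements: Elements) -> str:
    """Join the elements into the final code string."""
    return "".join(elements)


def differing_elements(a: Elements, b: Elements) -> int:
    """Count elements that differ; a skipped item ("") counts as a difference."""
    return sum(1 for x, y in zip(a, b) if x != y)


def two_phase_match(
    wave1: dict[str, Elements],
    wave2: dict[str, Elements],
    demographics_consistent: Callable[[str, str], bool] = lambda id1, id2: True,
) -> dict[str, str]:
    """Return {wave2_row_id: wave1_row_id} for exact, then off-one, matches."""
    matches: dict[str, str] = {}
    unmatched1 = dict(wave1)
    for allowed_diff in (0, 1):  # phase 1: exact; phase 2: off-one
        for id2, code2 in wave2.items():
            if id2 in matches:
                continue
            candidates = [
                id1 for id1, code1 in unmatched1.items()
                if differing_elements(code1, code2) == allowed_diff
                # Hypothetical hook: only off-one matches need the extra check.
                and (allowed_diff == 0 or demographics_consistent(id1, id2))
            ]
            if len(candidates) == 1:  # skip ambiguous cases rather than guess
                matches[id2] = candidates[0]
                del unmatched1[candidates[0]]
    return matches


# Worked examples from the table above.
maria = build_elements("Elena", 2, 9, "Luisa")
david = build_elements("Wei", 0, 3, None)
print(sgic(maria), sgic(david))  # E0209L W0003X

# Hypothetical row IDs; at wave 2 David skips the older-brothers item.
wave1 = {"w1-001": maria, "w1-002": david}
wave2 = {"w2-017": maria, "w2-023": ("W", "", "03", "X")}
result = two_phase_match(wave1, wave2)
print(result)
```

Storing the elements separately rather than only the joined string is what makes the off-one comparison straightforward. In practice, `demographics_consistent` would compare whatever non-identifying demographic items the survey collects at both waves.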