> [!cite] Source
> Yurek, L. A., Vasey, J., & Havens, D. S. (2008). The use of self-generated identification codes in longitudinal research. *Evaluation Review*, *32*(5), 435–452. https://doi.org/10.1177/0193841X08316676
## Purpose and Uses
A *self-generated identification code* (SGIC) is a respondent-constructed alias used to link the same individual's data across multiple data collection waves **without** collecting identifying information. It addresses a recurring practical problem in longitudinal research: matching paired observations of the same individuals over time while preserving their anonymity.
SGICs are particularly useful when:
- The design is longitudinal (panel, cohort, repeated measures) and requires matching the same respondents across waves so paired data are not mistakenly treated as independent samples.
- Survey content is sensitive (e.g., substance use, workplace attitudes, health behaviors) and anonymity assurances are needed to improve response validity.
- Respondents are in subordinate or vulnerable positions (e.g., employees, students, patients) where perceived identifiability could discourage candid responses.
- A traditional identifier (name, SSN, employee ID) would either compromise anonymity or impose unacceptable administrative burden.
Reported match rates depend heavily on the interval between waves: roughly 88–100% at 1–4 months, declining to ~50–75% at 12–18 months. Yurek et al. (2008) achieved 51–67% match rates across 6-, 12-, and 18-month intervals with a hospital nurse sample.
## Constructing an SGIC
The SGIC is built from a short set of researcher-constructed questions that the respondent answers identically at each wave. Items should be:
- **Stable**: The answer should not change over time (e.g., birth month is fully stable; "number of older brothers" is less stable than "number of older brothers, living and deceased").
- **Variable**: Answers should be well-distributed across the sample so that the combined code is probabilistically unique (e.g., ethnicity offers poor variability in a homogeneous sample).
- **Proximate to self**: Items about the respondent personally yield lower error and omission rates than items about other people. In Yurek et al.'s data, items about the respondent's mother or siblings had error/omission rates 3 to 15 times higher than items about the respondent's own birth month or middle name.
A typical four-element question set (from the article) and two worked examples:
| Question | Maria Hernandez | David Chen |
| ---------------------------------------------- | :--------------: | :----------: |
| First letter of mother's first name | Elena → `E` | Wei → `W` |
| Number of older brothers (living and deceased) | 2 → `02` | 0 → `00` |
| Month of birth (two digits) | September → `09` | March → `03` |
| First letter of own middle name (X if none) | Luisa → `L` | (none) → `X` |
| SGIC | `E0209L` | `W0003X` |
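
The assembly rule in the table can be sketched in a few lines of Python. This is a minimal illustration, not code from the article: the function name `build_sgic`, its parameter names, and the input validation are assumptions made for the example.

```python
def build_sgic(mother_first_name, older_brothers, birth_month, middle_name=None):
    """Assemble the four-element code from the question set in the table.

    All names here are hypothetical; only the element order and formats
    (letter, two digits, two digits, letter) come from the table.
    """
    if not 1 <= birth_month <= 12:
        raise ValueError("birth_month must be between 1 and 12")
    if older_brothers < 0:
        raise ValueError("older_brothers cannot be negative")

    mother_initial = mother_first_name.strip()[0].upper()
    brothers = f"{older_brothers:02d}"   # zero-padded, e.g. 2 -> "02"
    month = f"{birth_month:02d}"         # zero-padded, e.g. 9 -> "09"
    # "X" stands in when the respondent has no middle name.
    has_middle = bool(middle_name and middle_name.strip())
    middle_initial = middle_name.strip()[0].upper() if has_middle else "X"

    return f"{mother_initial}{brothers}{month}{middle_initial}"


maria = build_sgic("Elena", 2, 9, "Luisa")
david = build_sgic("Wei", 0, 3)
print(maria)  # E0209L
print(david)  # W0003X
```

In an actual survey the respondent writes each element directly on the form; a function like this is only useful for checking worked examples or validating entered codes.
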
To improve match rates, researchers commonly apply a two-phase match: an exact match first, then an "off-one" match allowing a single element to differ or be missing if other demographic information is consistent.
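
One way to picture the two-phase match is the sketch below. It rests on several assumptions that are not from the article: codes are stored as tuples of elements with `None` for a missing element, the demographic check is a caller-supplied function (`demographics_consistent`, hypothetical), and a respondent with more than one candidate is left unmatched rather than guessed.

```python
def count_mismatches(code_a, code_b):
    """Count elements that differ or are missing (None) in either code."""
    return sum(
        1 for a, b in zip(code_a, code_b)
        if a is None or b is None or a != b
    )


def two_phase_match(wave1, wave2, demographics_consistent):
    """Match wave-1 record IDs to wave-2 record IDs.

    wave1, wave2: dicts mapping a record ID to a tuple of SGIC elements.
    demographics_consistent: callable (id1, id2) -> bool, a stand-in for
    whatever demographic comparison the study uses (assumption).
    """
    matches = {}
    remaining = dict(wave2)  # wave-2 records not yet claimed

    # Phase 1 accepts exact matches (0 mismatches);
    # phase 2 accepts "off-one" matches (exactly 1 mismatch).
    for allowed_mismatches in (0, 1):
        for id1, code1 in wave1.items():
            if id1 in matches:
                continue
            candidates = [
                id2 for id2, code2 in remaining.items()
                if count_mismatches(code1, code2) == allowed_mismatches
                and (allowed_mismatches == 0 or demographics_consistent(id1, id2))
            ]
            # Ambiguous cases stay unmatched (a design choice of this sketch).
            if len(candidates) == 1:
                matches[id1] = candidates[0]
                del remaining[candidates[0]]
    return matches


wave1 = {"a1": ("E", "02", "09", "L"), "a2": ("W", "00", "03", "X")}
wave2 = {"b1": ("E", "02", "09", "L"), "b2": ("W", "00", None, "X")}

matches = two_phase_match(wave1, wave2, lambda id1, id2: True)
print(matches)  # {'a1': 'b1', 'a2': 'b2'}
```

Here `a1` pairs with `b1` in the exact phase, and `a2` pairs with `b2` in the off-one phase because only the birth-month element is missing.
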
## Practical Notes for the Course
- SGICs are a privacy-preserving design choice, not a data-cleaning afterthought, so the item set must be planned before Wave 1.
- Match rates degrade with longer intervals and with less stable item sets; researchers should expect and document attrition between waves.
- More elements generally improve uniqueness, but at the cost of increased respondent burden and higher cumulative error rates. Item quality matters more than item count.
- Items must be specific enough to discriminate respondents but not so individualized that anonymity is compromised.
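
The link between element count and uniqueness noted in the list above can be made concrete with a rough birthday-problem estimate. The sketch assumes every code is equally likely, which real answers are not (birth months and initials are unevenly distributed), so it overstates uniqueness. The answer ranges used for the four-element set, including six plausible values for the number of older brothers, are illustrative assumptions and not figures from the article.

```python
import math


def prob_all_unique(n_respondents, code_space):
    """Probability that n codes drawn uniformly at random are all distinct."""
    if n_respondents > code_space:
        return 0.0
    # Sum logs of (1 - i / code_space) for numerical stability.
    log_p = sum(math.log1p(-i / code_space) for i in range(n_respondents))
    return math.exp(log_p)


# Assumed answer ranges (hypothetical):
#   mother's initial: 26, older brothers: 6 (0-5), birth month: 12,
#   own middle initial: 26 (with X doubling as "none")
code_space = 26 * 6 * 12 * 26  # 48,672 possible codes

for n in (50, 200, 1000):
    print(n, round(prob_all_unique(n, code_space), 3))
```

Under these assumptions, adding a fifth element multiplies `code_space` and raises the probability that all codes are distinct, but the estimate says nothing about the added respondent burden or error rate that the extra item brings.
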