Universally Unique Identifiers (UUID) - Omnibus

# Definition

A *UUID* (Universally Unique Identifier), also called a *GUID* (Globally Unique Identifier), is a 128-bit value used to label digital entities with a guarantee (probabilistic, not absolute) of uniqueness across space and time. UUIDs are defined by IETF [RFC 9562](https://www.rfc-editor.org/rfc/rfc9562.html) (which obsoletes RFC 4122) and ISO/IEC 9834-8. Crucially, they require no central registration authority: any system can generate one independently with negligible risk of collision.

A UUID is conventionally written as 32 hexadecimal digits in five hyphen-separated groups (8-4-4-4-12), for example **`f47ac10b-58cc-4372-a567-0e02b2c3d479`**. For the random version of UUID (the most common), the probability of a duplicate within 103 trillion UUIDs is roughly one in a billion.

# UUID Versions

RFC 9562 defines eight versions, distinguished by how the 128 bits are produced. They fall into three functional groups, plus a custom slot:

**Random** (non-reproducible, no embedded information):

- **v4** is generated from random or pseudo-random numbers. Each UUID is independent: there is no input to reuse, so every call produces a different UUID. **This is the default in most libraries and the safe default for most research data management (RDM) uses.**

**Time-based** (sortable by creation time, but with caveats):

- **v1** combines a timestamp with the generating machine's MAC address. The MAC address is identifying information that should not be shared externally.
- **v2** (DCE Security) is a v1 variant that embeds POSIX user/group IDs. Rarely supported outside its original niche.
- **v6** reorders v1's fields so the timestamp comes first, making UUIDs sortable. Same MAC-address concern as v1.
- **v7** combines a millisecond timestamp with random bits (sortable, no MAC address). This is the modern recommendation when time-ordering is useful.

**Name-based** (deterministic and reproducible):

- **v3** (MD5 hash) and **v5** (SHA-1 hash) deterministically hash a namespace UUID plus a name string.
  The same inputs always produce the same UUID, on any machine, at any time. v5 is preferred per the RFC; v3 remains available for compatibility. See the Limitations section for why this is rarely the right tool for participant identification.

**Custom:**

- **v8** is reserved for experimental or vendor-specific UUIDs. The internal structure is up to the implementer.

> [!tip] Practical recommendation
> Assume **v4** unless you have a specific reason to choose otherwise. Use **v7** for new systems where insert order is useful. Avoid v1, v2, and v6 because they can leak host identity through the embedded MAC address. v5 has narrow legitimate uses but is easy to misuse (see Limitations).

# Uses in Research Data Management

UUIDs are useful in RDM whenever an unambiguous, decentralized identifier is needed and a registered persistent identifier (DOI, ORCID, ARK) is unavailable or unnecessary. Common applications:

- **Participant/case IDs** that are unambiguous and carry no personally identifying information.
- **Record-level keys** for relational data, linking observations across tables (e.g., demographics, survey responses, biomarker measurements) without relying on names or institutional IDs.
- **File and dataset identifiers** for versioned data products, derived datasets, or intermediate analysis files, where local accession numbers risk collision when datasets are combined across projects or institutions.
- **Provenance tracking**: tagging extracted samples, instrument runs, or analysis outputs so they can be traced back to their source even if filenames or directory structures change.
- **Distributed or multi-site projects** where teams need to generate IDs independently without coordinating a central numbering scheme.

# Generating UUIDs

UUIDs can be generated in essentially any modern computing environment.

In R, two CRAN packages cover the practical landscape:

- **'uuid'** (Simon Urbanek; based on `libuuid`): the long-established package.
  - v4 (random): `UUIDgenerate()` (default), or `UUIDgenerate(n = 50)` for a batch
  - v1 (time-based): `UUIDgenerate(use.time = TRUE)`
  - v3 and v5 (name-based): `UUIDfromName(namespace, name, type = "sha1")` for v5 (default), `type = "md5"` for v3
  - Does not support v6, v7, or v8.
- **'uuidx'** (Thomas Bryce Kelly; Rust backend; released April 2026): RFC 9562-compliant, supports v1, v4, v5, v6, and v7, with v7 as the default. Useful if you specifically need v7 or v6; otherwise `uuid` is sufficient for most coursework and applied use. Note that installation requires a Rust toolchain, which may be a barrier on locked-down machines.

A minimal example generating participant IDs with `uuid`:

```r
library(uuid)

# Assign one random (v4) UUID to each of 50 participants
participants <- tibble::tibble(
  id = UUIDgenerate(n = 50)
)
```

In other environments:

- **Online generators** are convenient for one-off needs, but unsuitable for batch participant ID assignment in research workflows.
- **Python** has the standard-library `uuid` module: `uuid.uuid1()`, `uuid.uuid3()`, `uuid.uuid4()`, `uuid.uuid5()`. From Python 3.14 the module also provides `uuid.uuid6()`, `uuid.uuid7()`, and `uuid.uuid8()`; earlier versions require a third-party package for v6, v7, and v8.
- **Excel/Google Sheets** have no built-in UUID function, but custom formulas or scripts (e.g., Apps Script) can produce them.

# Limitations and Cautions

UUIDs solve a narrow problem well, but they are easy to misuse. Things UUIDs cannot do, and pitfalls to keep in mind:

- **They are not persistent identifiers in the scholarly sense.** A UUID is just a string; it does not resolve to anything, is not curated, and is not registered with any authority. It cannot replace a DOI for citing a published dataset, nor an ORCID for identifying a researcher. UUIDs identify; persistent identifiers also _locate_ and _attest_.
- **They are not respondent-recoverable.** Unlike a self-generated identification code (SGIC), a UUID cannot be regenerated by a participant from memory. If you assign UUIDs to participants and then need them to reidentify themselves at Wave 2 without holding a linking key, you cannot do it.
  UUIDs require the researcher to maintain a separate (and protected) crosswalk between UUID and respondent.
- **Reproducible UUIDs (v3/v5) are usually the wrong tool for participant IDs.** It is technically possible to generate the same UUID for the same person across time and place using name-based UUIDs (v3 or v5). However, hashing a predictable input (a name, an email, a roster entry) produces a UUID that is reversible by brute force, since an adversary can simply hash the same plausible inputs and check for matches. This is _pseudonymization presented as anonymization_, and it is a well-known failure mode. Salting the input with a secret restores security but reintroduces the need to protect a key, eliminating the original advantage. For reproducible participant tracking without a key, an SGIC is the appropriate tool; for reproducible record linkage with a key, a v5 UUID is acceptable as long as the salting and key management are handled correctly.
- **They are not by themselves a de-identification strategy.** Replacing names with UUIDs in a dataset is _pseudonymization_, not anonymization. If a linking key exists anywhere, the data are still re-identifiable. HIPAA and IRB requirements still apply.
- **They carry no meaning and are not human-friendly.** A 36-character hex string is unreadable, error-prone if hand-transcribed, and useless for visually scanning records. For purposes where humans need to recognize or communicate an ID (e.g., site codes, condition labels, classroom rosters), short structured IDs are usually better.
- **Most versions are not sortable in a meaningful way.** v4 is random, so sorting produces no insight about creation order or any other attribute. v7 fixes this for time order; v6 does as well but with the MAC-address caveat.
- **Uniqueness is probabilistic, not guaranteed.** Collision is astronomically unlikely for v4 but not strictly impossible.
  For most research uses this is irrelevant; for systems backing critical infrastructure, it is worth understanding.
- **They do not enforce referential integrity on their own.** Using UUIDs as keys does not prevent typos, duplicates, or broken joins. The usual data validation steps still apply.

In short, UUIDs are excellent internal record identifiers for research data. They are not a substitute for a persistent identifier service, an anonymization scheme, or a participant-facing tracking mechanism.

___

# Sources

- IETF RFC 9562, _Universally Unique IDentifiers (UUIDs)_. https://www.rfc-editor.org/rfc/rfc9562.html
- Wikipedia, [Universally unique identifier](https://en.wikipedia.org/wiki/Universally_unique_identifier).
- FAIR Cookbook, [Unique, persistent identifiers](https://fairplus.github.io/the-fair-cookbook/content/recipes/findability/identifiers.html).
- CRAN, [`uuid` package](https://cran.r-project.org/package=uuid) and [`uuidx` package](https://cran.r-project.org/package=uuidx).
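
___

The brute-force concern raised under Limitations for name-based UUIDs can be made concrete with a minimal sketch using Python's standard-library `uuid` module. Everything specific here is an assumption for illustration only: the choice of `uuid.NAMESPACE_DNS` as the namespace, the example email addresses, and the helper names `participant_id` and `reverse_v5`. It is not a recommended scheme, just a demonstration of why an unsalted v5 UUID derived from a guessable input is not anonymous.

```python
import uuid
from typing import Iterable, Optional

# Assumption: the study derived IDs under the well-known DNS namespace.
# Any publicly known namespace UUID would behave the same way.
NAMESPACE = uuid.NAMESPACE_DNS


def participant_id(email: str) -> uuid.UUID:
    """Derive a v5 UUID from a normalized email (hypothetical scheme)."""
    return uuid.uuid5(NAMESPACE, email.strip().lower())


def reverse_v5(target: uuid.UUID, candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate whose v5 UUID equals target, else None."""
    for candidate in candidates:
        if participant_id(candidate) == target:
            return candidate
    return None


# Determinism: the same (normalized) input always yields the same UUID.
released_id = participant_id("alice@example.org")
assert released_id == participant_id(" Alice@Example.org")

# An adversary holding a plausible roster simply hashes each entry.
roster = ["bob@example.org", "alice@example.org", "carol@example.org"]
recovered = reverse_v5(released_id, roster)
print(recovered)  # alice@example.org

# A random v4 UUID has no input to guess, so the same attack finds nothing.
print(reverse_v5(uuid.uuid4(), roster))  # None
```

The attack needs only the namespace and a list of plausible inputs, both of which are often public or easy to guess. Adding a secret salt to the hashed name would defeat this particular loop, at the cost of having a key to protect, which is the trade-off described in the Limitations section.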