Documenting an Instrument, a Study, and a Dataset - Omnibus

# Overview

Research data documentation is often discussed under a single umbrella ("metadata," "documentation," "data management"), and the standard terms (codebook, data dictionary) are used loosely and interchangeably even in otherwise authoritative sources. This document proposes a sharper four-part taxonomy that maps onto four genuinely distinct questions a researcher needs to answer about their data, at four different points in the research timeline.

The four documents:

1. **Manual:** What is this instrument, and how does it work in general?
2. **Codebook:** How is this instrument operationalized for this study?
3. **Data dictionary:** What is the structure of this specific dataset?
4. **Data log:** What has been done to this dataset, and why?

A useful organizing principle is that each document belongs to a different _temporal phase_ of the project, is authored by a different _role_, and serves a different _audience_. Conflating them (which is common) creates downstream problems: Cleaning history ends up in codebooks, instrument-level information ends up in data dictionaries, and information about what was actually done to the data has nowhere to live.

## The Four Documents

### Manual

The **manual** is the technical specification for an instrument, typically created by the instrument's developer. It is upstream of any particular study; it describes the instrument in its abstract, generalizable form. Examples include the MMPI manual, the PHQ-9 scoring guide, the NSSE administration manual, and the manuals accompanying commercial assessment batteries.

A manual typically includes:

- Conceptual rationale and theoretical grounding
- Item content, scale structure, and scoring conventions
- Psychometric properties (reliability, validity evidence, factor structure)
- Normative data and intended populations
- Administration procedures and timing
- Permitted modifications, translations, and licensing terms

**Authored by:** The instrument developer.
**Temporal position:** Exists before any given study begins.
**Audience:** Any researcher considering using the instrument.

### Codebook

The **codebook** documents how an instrument is operationalized for a specific study. It is study-level, not instrument-level, and it captures the decisions a research team makes about _how_ they will use one or more instruments in their particular project.

%% A codebook is the documentation of metadata for a dataset. %%

A codebook typically includes:

- Which items from the instrument are administered, and in what order
- The full text of items and response options as presented to respondents
- Scoring decisions specific to this study (e.g., which items are reverse-scored, how composite scores are calculated)
- Missing-data codes and their meanings (e.g., -9 = "Refused", -8 = "Don't know", -1 = "Skipped due to filter logic")
- Skip patterns and the universe of respondents who saw each item
- Recoding plans (which raw values are combined into which categories, and the substantive rationale)

**Authored by:** The study PI or research team.
**Temporal position:** Drafted during study design, often before data collection begins.
**Audience:** Human readers (collaborators, replicators, reviewers).

The codebook is often _instrument-specific_: A study using three instruments will typically have three corresponding codebook sections, each tied to its source manual.

### Data Dictionary

The **data dictionary** describes the structure of a specific dataset. It is the dataset's schema, in the relational-database sense from which the term originally comes. Where the codebook documents the _study's design_, the data dictionary documents the _data as they actually exist_ after collection.

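
To make the "schema" idea concrete, here is a minimal sketch of a machine-readable data dictionary and a check of data rows against it. Everything specific in it is an assumption made for illustration: the variable names (`resp_id`, `phq9_total`, `consent`), the CSV column layout, and the helper functions `load_dictionary` and `check_record` are hypothetical, and the sketch covers only value ranges, missing-value codes, and a required flag.

```python
import csv
import io

# Hypothetical data dictionary, one row per variable (layout is an assumption).
# The 0-27 range reflects a PHQ-9 total: nine items scored 0-3.
DICTIONARY_CSV = """name,min,max,missing_codes,required
resp_id,1,9999,,yes
phq9_total,0,27,-9;-8,no
consent,0,1,,yes
"""


def load_dictionary(text):
    """Parse the dictionary CSV into {variable name: spec}."""
    specs = {}
    for row in csv.DictReader(io.StringIO(text)):
        specs[row["name"]] = {
            "min": int(row["min"]),
            "max": int(row["max"]),
            # Missing-value codes are stored as a ';'-separated list.
            "missing": {int(c) for c in row["missing_codes"].split(";") if c},
            "required": row["required"] == "yes",
        }
    return specs


def check_record(record, specs):
    """Return a list of human-readable problems for one data row."""
    problems = []
    for name, spec in specs.items():
        value = record.get(name)
        if value is None:
            if spec["required"]:
                problems.append(f"{name}: required value is absent")
            continue
        if value in spec["missing"]:
            continue  # a declared missing-value code counts as valid
        if not spec["min"] <= value <= spec["max"]:
            problems.append(
                f"{name}: {value} outside [{spec['min']}, {spec['max']}]"
            )
    # Variables present in the data but undocumented are also flagged.
    for name in record:
        if name not in specs:
            problems.append(f"{name}: not in data dictionary")
    return problems


specs = load_dictionary(DICTIONARY_CSV)
print(check_record({"resp_id": 17, "phq9_total": -9, "consent": 1}, specs))
print(check_record({"resp_id": 18, "phq9_total": 31, "consent": 1}, specs))
```

The first record passes because -9 is a declared missing-value code; the second is flagged because 31 falls outside the documented range. Keeping the dictionary in a plain CSV is one possible design choice here: the same file can be read by an analyst and enforced by a script.
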

A data dictionary typically includes, one row per variable:

- Variable name as it appears in the data file
- Data type (numeric, character, factor, date, logical)
- Units of measurement and precision
- Allowable range or set of valid values
- Code assignments for categorical variables (e.g., 1 = "Yes", 0 = "No")
- Missing-value conventions in the file
- Relationships to other variables (foreign keys, derived-from relationships)
- Constraints (required, unique, non-null)

**Authored by:** The data manager or analyst.
**Temporal position:** Created and maintained after data collection, alongside the dataset itself.
**Audience:** Both machine and analyst; data dictionaries are often written in machine-readable form (CSV, JSON, YAML).

A single study's codebook and data dictionary will overlap in content (both will mention variable names and value codes), but they answer different questions: The codebook says _what the variable means and why it was measured this way_; the data dictionary says _what the variable looks like in this file_.

### Data Log

The **data log** documents what has been done to a dataset and why. It is the running record of cleaning, recoding, transformation, exclusion, and other analyst decisions that take a raw dataset to an analytic one. Unlike the other three documents, which describe states (the instrument, the study design, the dataset's structure), the data log describes _changes_.

A data log typically includes dated entries with:

- What was changed (records deleted, variables recoded, outliers flagged, files merged)
- Why the decision was made (preregistered criterion, suspected data-entry error, attention-check failure)
- Who made the decision (matters in team projects)
- How many records or variables were affected
- What the prior state was, in case the decision is later revisited

**Authored by:** The analyst.
**Temporal position:** Maintained continuously throughout data preparation and analysis.
**Audience:** The analyst's future self, collaborators, replicators, and reviewers.

A note on the term: "Data log" is not firmly established in the RDM literature. Closely related terms include _cleaning log_, _change log_, _data processing log_, and the broader concept of a _provenance record_. The word "log" has a strong association with automatically generated records (server logs, transaction logs), but the data log as described here is deliberately curated by a human analyst. Be explicit about this distinction with students; some may prefer "data processing log" or "analyst's log" to head off the ambiguity.

The data log is complementary to, not a substitute for, the **cleaning code itself**. In a reproducible workflow, the code (an R script, RMarkdown document, or Quarto file) is the primary, executable record of what was done. The data log captures the _reasoning_ behind decisions that the code alone cannot fully convey: Why this case was treated as an outlier rather than a data-entry error, why this cutoff was chosen, what was tried and rejected.

## The Taxonomy at a Glance

|Document|Question it answers|Authored by|Temporal position|
|---|---|---|---|
|**Manual**|What is this instrument, and how does it work in general?|Instrument developer|Exists before the study|
|**Codebook**|How is this instrument operationalized for this study?|Study PI / research team|Study design phase|
|**Data dictionary**|What is the structure of this specific dataset?|Data manager / analyst|After data collection|
|**Data log**|What has been done to this dataset, and why?|Analyst|Throughout analysis|

## A Note on Real-World Usage

The terms _codebook_ and _data dictionary_ are widely used interchangeably in published RDM guides and in practice. Many otherwise authoritative sources (university libraries, repository documentation, federal data archives) treat them as synonyms. Students should expect to encounter the terms used loosely in the literature and in datasets they reuse.

The taxonomy above is a working convention for this course, not a universal standard. It is defended on pedagogical and conceptual grounds: The four documents answer four genuinely distinct questions, and conflating them creates real downstream problems. Teaching students to distinguish them gives them both a clean internal model and the awareness that real-world documents will sometimes be misnamed.

## Pedagogical Note

The temporal axis is worth surfacing explicitly when teaching this material. Students who can place each document on a project timeline are much less likely to conflate them:

- The **manual** exists before the researcher ever sees it.
- The **codebook** is drafted during study design, before data collection.
- The **data dictionary** is created with the dataset, after data collection.
- The **data log** is maintained continuously during data preparation and analysis.

Each document also has a characteristic _failure mode_ worth naming for students:

- Manuals are often ignored, with researchers using an instrument without consulting how the developer intended it to work.
- Codebooks are often retrofitted after data collection rather than drafted during design, which loses the connection to study rationale.
- Data dictionaries are often missing entirely, or replaced by reliance on variable names and the analyst's memory.
- Data logs are almost universally underdeveloped: Decisions get made, code gets edited, and the rationale evaporates.

A complete documentation practice produces all four, treats them as complementary rather than competing, and keeps each one in its proper lane.
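
One way to counter the data-log failure mode (rationale evaporating) is to make writing an entry cheap. The following is a minimal sketch, not an established format: the `LogEntry` field names, the JSON Lines layout, the helper functions, and the example values (analyst initials, counts, file name) are all hypothetical and chosen only to mirror the what / why / who / how many / prior state elements of a data log entry.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass
class LogEntry:
    """One human-curated data-log entry (field names are assumptions)."""
    when: str         # ISO date of the decision
    who: str          # analyst who made the decision
    what: str         # what was changed
    why: str          # rationale for the change
    n_affected: int   # records or variables affected
    prior_state: str  # how to recover the earlier state


def append_entry(log_text, entry):
    """Return the log text with one JSON line appended."""
    return log_text + json.dumps(asdict(entry)) + "\n"


def read_log(log_text):
    """Parse the log text back into a list of dicts, one per entry."""
    return [json.loads(line) for line in log_text.splitlines() if line.strip()]


# Hypothetical example entry; in practice the text would live in a file
# kept alongside the cleaning script.
log = ""
log = append_entry(log, LogEntry(
    when=date(2024, 3, 1).isoformat(),
    who="A.B.",
    what="Excluded records failing the attention check",
    why="Preregistered exclusion criterion",
    n_affected=12,
    prior_state="Raw file raw_v1.csv retained unchanged",
))

for entry in read_log(log):
    print(entry["when"], entry["what"], f"(n={entry['n_affected']})")
```

The log stays separate from the cleaning code, which remains the executable record; the entries carry the reasoning the code cannot.
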