(1.3) Portability -- Importing & Exporting Data - Omnibus

# Introduction A central concern in research data management is *portability*: the ability to move data between different software systems, computing environments, and users/collaborators without losing essential information. In practice, data rarely remain within a single tool or system. A dataset might be collected in an online survey platform, inspected in Excel, analyzed in R, archived in a repository, and later reused by another researcher using entirely different software. Understanding these operations therefore requires understanding *data interchange*. # Import, Export, and Data Interchange In most research workflows, data must move between different software tools. These transfers are part of a broader process known as *data interchange*. Data interchange refers to the transfer of data between different software systems or platforms using file formats that each system can interpret. Because different programs often use their own specialized file formats for data storage, moving data between systems requires either: - converting data into a format another system can read, or - using tools capable of directly interpreting another program’s native format. These transitions require the ability to *import* and *export* data. Importing refers to reading external data into a software environment so that it can be used, manipulated, or analyzed. Exporting refers to writing data from a software environment to an external file in a given format and file type. These operations are always relative to a specific software context. A dataset exported from one system is typically imported into another as part of a broader workflow. Two broad categories of file formats are particularly important in this context. ## Native File Formats A *native file format* is a file type designed primarily for a specific software application. Examples include the data files for R and SPSS. | Software | Native format | | -------- | :----------------------: | | R | `.RData`, `.rda`, `.rds` | | SPSS | `.sav` | | Stata | `.dta` | | SAS | `.sas7bdat` | Native formats are typically optimized for the internal data structures of the software that created them. As a result, they often preserve rich metadata such as: - variable (column) data types - variable labels - value labels - missing value codes However, native formats are not always readable or editable by other applications or platforms without specialized tools. ## Interchange (Universal) File Formats An *interchange format* is a file format designed to facilitate data transfer between different software systems. These formats prioritize compatibility and simplicity over software-specific features. Common interchange formats include: | Format | Typical use | | ----------------- | ------------------------------------ | | `.csv` | tabular data exchange | | `.tsv` | tabular data exchange | | `.txt` | plain-text tabular data\* | | `.json` | hierarchical data exchange | | `.yaml` / `.toml` | structured configuration or metadata | \* This is typically, the case, but TXT files can contain any plain-text data arranged in any logical data structure. Because these formats are simple and widely supported, they can typically be read by many different tools, including statistical software, databases, spreadsheet programs, and programming languages. For this reason, interchange formats are often used when data must move between multiple systems. ## Excel Files as a "Grey Area" Microsoft Excel’s `.xlsx` format occupies an intermediate position. Technically, `.xlsx` is the native format of Microsoft Excel. Excel files can contain: - multiple worksheets - formulas - cell formatting - charts and visualizations These features make `.xlsx` files rich working documents within Excel. However, Excel files are also commonly used for data interchange, especially when datasets are shared among collaborators who may not use the same analytical software. For example: ``` R → XLSX → collaborator ``` Because `.xlsx` files are widely supported, many programs—including R—can import them. Despite this practical convenience, Excel files are not ideal interchange formats because: - they may contain multiple sheets - formatting can obscure underlying data structure - formulas may not translate correctly to other systems - automatic type conversion can introduce errors For these reasons, simple formats such as CSV or TSV are generally preferred for reliable data interchange. ## An Important Distinction Between Importing a File vs. Opening a File A common source of confusion for beginners is the difference between opening a file and importing data from a file. Although these actions may appear similar from the user’s perspective, they are conceptually different. When an application opens a file, it is working directly with that file’s native format. For example, Excel can open an `.xlsx` workbook, and SPSS can open a `.sav` dataset. In these situations, the program loads the file into memory so that it can be viewed or edited, and when the user saves the file, the changes are written back to that same file (or to a new file if “Save As” is used). In this sense, the program is operating directly on its own native document format. By contrast, importing data is a process in which a program reads data from an external file and creates an internal representation of that data within the program’s working environment. The original file remains unchanged. Instead, the software constructs a new in-memory object—often called a buffer or data object—based on the contents of the file. Any changes made afterward apply only to this internal representation, not to the original source file. For example, when R imports a CSV file, the contents of that file are read and converted into a data frame or tibble stored in working memory. Subsequent modifications affect only this in-memory dataset. The original CSV file on disk is not altered in any way. After importing, the dataset may be saved in a new format, often the native format of the software performing the import. For instance, data imported into R from a CSV file might later be saved as an `.rda` or `.RData` file. The key point is that importing is a read-only operation with respect to the source file. The source file serves merely as the input from which a new working copy of the data is constructed. # Common Interchange Pathways In practice, data can move between systems through several common pathways. ## 1. Export to a Universal Interchange Format One program exports data to a widely supported interchange format, which can then be imported by many other systems. Example: ``` SPSS (.sav) → CSV ``` In this pathway example, SPSS exports the dataset to a CSV file. Another statistical platform (such as R) could then import the CSV file using an appropriate import function. This is the most common and robust method of interchange, because formats such as CSV, TSV, and JSON are supported by many different software tools. ## 2. Import from a Universal Interchange Format Instead of exporting to another program’s native format, a program may import data from an existing interchange file. Examples: ``` CSV → R CSV → Excel CSV → Stata ``` Many datasets distributed publicly (for example, government data releases or research archives) are provided in such universal formats to maximize accessibility. ## 3. Direct Export to Another Program’s Native Format Some software systems can export data directly to the native format of another program. Example: ``` SPSS (.sav) → Stata (.dta) R → SPSS (.sav) ``` This approach attempts to preserve more metadata—such as variable labels or value labels—than would be possible when exporting to simpler formats like CSV. However, such conversions depend on the exporting software correctly translating the structures expected by the destination system. ## 4. Direct Import of Another Program’s Native Format Some programs can directly read native datasets produced by other software. Example: ``` SPSS (.sav) → R Stata (.dta) → R SAS (.sas7bdat) → R ``` In R, packages such as `haven` allow users to import native datasets from SPSS, Stata, and SAS without requiring an intermediate conversion step. # Importing Data into R R can import data from a wide variety of sources. ## Files from Other Statistical Software Datasets created in other statistical packages can often be imported directly into R. Common examples include: - SPSS (`.sav`) - Stata (`.dta`) - SAS (`.sas7bdat`) In R, these files are commonly imported using packages such as 'haven'. ## Universal File Formats R can also read many widely used interchange formats. Common examples include: - CSV and TSV files - Excel spreadsheets (`.xlsx`) - plain-text data files (`.txt`) - JSON files - YAML or TOML configuration-style data Different packages provide tools for reading these formats, including the 'tidyverse' ecosystem. ## Databases Large datasets are often stored in relational database systems rather than flat files. In these situations, data are typically retrieved through *database queries*, which extract subsets of the database into a working dataset that can be analyzed in R. # Exporting Data Exporting data involves writing a dataset from R to an external file so that it can be used elsewhere. Common export targets include: - Excel (`.xlsx`) - CSV or TSV files - native formats used by other statistical software Exporting may also occur when preparing datasets for collaborators, publications, or public repositories. %% # Format vs. Structure During Import When data are imported into R, two distinct issues must be addressed: file format and data structure. - Format refers to the encoding of the file itself (CSV, XLSX, JSON, etc.). - Structure refers to how the data are organized conceptually once imported (variables, observations, hierarchical relationships). Import functions typically translate a file’s format into an internal data structure such as a data frame or tibble. However, this translation may require assumptions about: - column types - delimiters - missing values - text encoding These assumptions can affect how the imported data are interpreted. %% # Lossless and Lossy Export Export operations may be either *lossless* or *lossy*. A lossless export preserves all information present in the original dataset. A lossy export discards or alters some information. For example, exporting an SPSS dataset to CSV may lose: - column data types (numeric, text, date) - variable labels - value labels - custom missing value definitions Because CSV files store only raw tabular data, these forms of metadata cannot be preserved. Understanding the potential for information loss is an important part of responsible data management. # Assumptions Made by Import Functions When importing data, software must often infer properties of the dataset automatically. Import functions may attempt to determine: - column data types (numeric, text, date) - delimiters - character encoding - missing value representations These assumptions can occasionally produce incorrect interpretations. For example, when identification numbers are interpreted as numeric values or when date formats are ambiguous. For this reason, it is often important to inspect imported data carefully and verify that the resulting data structure matches expectations. # Database Queries Large data systems such as relational databases are often not exported in their entirety. Instead, researchers typically extract subsets of data using *queries*. In this workflow: ``` Database → query → dataset → R ``` The query retrieves a subset of the database and returns it in a format that can be imported into R or exported to an interchange file such as CSV.