---
title: "fetch"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{fetch}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
In 2020 I wrote a package called [libr](https://libr.r-sassy.org) to deal
with the problem of managing multiple related datasets. This package allows
you to manage data from several popular data sources:
SAS, Excel, CSV, and R. While the package received generally positive feedback,
users requested some features that the package did not support:
* Users wanted to examine the data *before* loading it into memory.
* Users wanted to pick only *some* datasets to load into memory.
* Users wanted to subset all datasets in the library based on a condition.
After consideration of these requests, I determined that these features would
be most easily satisfied by creating a new package. That package is **fetch**.
## Key Features
The **fetch** package allows you to retrieve data from any of several different
data sources while minimizing memory usage.
There are only two steps to retrieve data:
* Create a data catalog
* Fetch data from the catalog
The above two steps are accomplished by the following functions:
* `catalog()`: Defines a data catalog for a particular data source
* `fetch()`: Pulls a data item from the catalog and loads it into memory
The package has the following key features:
* The `catalog()` function allows you to explore the data before reading from the source.
* The `fetch()` function loads only the data you need for your analysis.
* The "where" parameter on the `catalog()` function allows you to subset across
all datasets in the catalog.
* The "select" and "top" parameters on the `fetch()` function let you further
limit the data returned.
* The "import_spec" parameters on the `catalog()` and `fetch()` functions
give you control over the column data types.
## How to use **fetch**
Let's start with a simple example. In this example we will create a data catalog,
examine its contents, and fetch a dataset from the catalog. First, create
and view the data catalog:
```{r eval=FALSE, echo=TRUE}
# Get sample data directory
pkg <- system.file("extdata", package = "fetch")
# Create catalog
ct <- catalog(pkg, engines$csv)
# View catalog
ct
# data catalog: 6 items
# - Source: C:/packages/fetch/inst/extdata
# - Engine: csv
# - Items:
# data item 'ADAE': 56 cols 150 rows
# data item 'ADEX': 17 cols 348 rows
# data item 'ADPR': 37 cols 552 rows
# data item 'ADPSGA': 42 cols 695 rows
# data item 'ADSL': 56 cols 87 rows
# data item 'ADVS': 37 cols 3617 rows
```
As can be seen above, this catalog has 6 csv files in it. The number of columns
and rows are displayed for each file. Information about each file is contained
in the catalog. This information can be accessed using list notion, like so:
```{r eval=FALSE, echo=TRUE}
# View info for catalog item
ct$ADEX
# data item 'ADEX': 17 cols 348 rows
# - Engine: csv
# - Size: 70.7 Kb
# - Last Modified: 2020-09-18 14:30:22
# Name Column Class Label Format NAs MaxChar
# 1 ADEX STUDYID character NA 0 3
# 2 ADEX USUBJID character NA 0 10
# 3 ADEX SUBJID character NA 0 3
# 4 ADEX SITEID character NA 0 2
# 5 ADEX TRTP character NA 8 5
# 6 ADEX TRTPN numeric NA 8 1
# 7 ADEX TRTA character NA 8 5
# 8 ADEX TRTAN numeric NA 8 1
# 9 ADEX RANDFL character NA 0 1
# 10 ADEX SAFFL character NA 0 1
# 11 ADEX MITTFL character NA 0 1
# 12 ADEX PPROTFL character NA 0 1
# 13 ADEX PARAM character NA 0 45
# 14 ADEX PARAMCD character NA 0 8
# 15 ADEX PARAMN numeric NA 0 1
# 16 ADEX AVAL numeric NA 16 4
# 17 ADEX AVALCAT1 character NA 87 10
```
When the catalog item is printed, it shows a data dictionary for the
specified dataset. Column names and data types are displayed, along with
some other useful attributes of your data.
Once the catalog is created, you are ready to fetch data from the catalog:
```{r eval=FALSE, echo=TRUE}
# Fetch data from a catalog
dt <- fetch(ct$ADEX)
# View data
dt
# A tibble: 348 × 17
# STUDYID USUBJID SUBJID SITEID TRTP TRTPN TRTA TRTAN RANDFL SAFFL MITTFL PPROTFL
#
# 1 ABC ABC-01-049 049 01 ARM D 4 ARM D 4 Y Y Y Y
# 2 ABC ABC-01-049 049 01 ARM D 4 ARM D 4 Y Y Y Y
# 3 ABC ABC-01-049 049 01 ARM D 4 ARM D 4 Y Y Y Y
# 4 ABC ABC-01-049 049 01 ARM D 4 ARM D 4 Y Y Y Y
# 5 ABC ABC-01-050 050 01 ARM B 2 ARM B 2 Y Y Y Y
# 6 ABC ABC-01-050 050 01 ARM B 2 ARM B 2 Y Y Y Y
# 7 ABC ABC-01-050 050 01 ARM B 2 ARM B 2 Y Y Y Y
# 8 ABC ABC-01-050 050 01 ARM B 2 ARM B 2 Y Y Y Y
# 9 ABC ABC-01-051 051 01 ARM A 1 ARM A 1 Y Y Y Y
# 10 ABC ABC-01-051 051 01 ARM A 1 ARM A 1 Y Y Y Y
# 338 more rows
# 5 more variables: PARAM , PARAMCD , PARAMN , AVAL ,
# AVALCAT1
# Use `print(n = ...)` to see more rows
```
At this point, you can proceed with your analysis, or fetch additional datasets
from the catalog as needed. So easy!
## Additional Features
The above example covers the very basics. Now let's look at some more useful
features of the **fetch** package.
### Apply a Where Expression to a Fetch Operation
What if you are performing an analysis on a subset of data? Is there a way
to apply a subset when the data is fetched? You can apply subset criteria
using the "where" parameter on the `fetch()` function. Like this:
```{r eval=FALSE, echo=TRUE}
# Get sample data directory
pkg <- system.file("extdata", package = "fetch")
# Create catalog
ct <- catalog(pkg, engines$csv)
# Subset data for a specific subject
dt <- fetch(ct$ADVS, where = expression(SUBJID == '049'))
# View data subset
dt
# A tibble: 44 × 37
# STUDYID USUBJID SUBJID SITEID SRCDOM SRCVAR SRCSEQ TRTP TRTPN TRTA TRTAN RANDFL
#
# 1 ABC ABC-01-049 049 01 VS VSSTRESN 9 ARM D 4 ARM D 4 Y
# 2 ABC ABC-01-049 049 01 VS VSSTRESN 41 ARM D 4 ARM D 4 Y
# 3 ABC ABC-01-049 049 01 NA NA NA ARM D 4 ARM D 4 Y
# 4 ABC ABC-01-049 049 01 VS VSSTRESN 34 ARM D 4 ARM D 4 Y
# 5 ABC ABC-01-049 049 01 VS VSSTRESN 35 ARM D 4 ARM D 4 Y
# 6 ABC ABC-01-049 049 01 VS VSSTRESN 36 ARM D 4 ARM D 4 Y
# 7 ABC ABC-01-049 049 01 VS VSSTRESN 37 ARM D 4 ARM D 4 Y
# 8 ABC ABC-01-049 049 01 VS VSSTRESN 38 ARM D 4 ARM D 4 Y
# 9 ABC ABC-01-049 049 01 VS VSSTRESN 39 ARM D 4 ARM D 4 Y
# 10 ABC ABC-01-049 049 01 VS VSSTRESN 40 ARM D 4 ARM D 4 Y
# 34 more rows
# 25 more variables: SAFFL , MITTFL , PPROTFL , TRTSDT ,
# TRTEDT , ADT , ADY , ADTF , AVISIT , AVISITN ,
# PARAM , PARAMCD , PARAMN , PARAMTYP , AVAL , BASE ,
# CHG , AWRANGE , AWTARGET , AWTDIFF , AWLO , AWHI ,
# AWU , ABLFL , ANL01FL
# Use `print(n = ...)` to see more rows
```
The `expression` function can be used to define any R expression. This function
accepts unquoted variable names, logical operators, and R functions like
`is.null()` and `is.na()`. You can use the the "where" parameter with an
expression to narrow down the data you are working with and reduce
the memory footprint of your program.
### Get Top N Rows
Another way to limit the data returned by your program is to use the "top"
parameter. This parameter allows you to return only the "top N" number of rows
in the target data item. The "top" parameter is useful for exploratory purposes.
Here is an example:
```{r eval=FALSE, echo=TRUE}
# Get sample data directory
pkg <- system.file("extdata", package = "fetch")
# Create catalog
ct <- catalog(pkg, engines$csv)
# Subset data for a specific subject
dt <- fetch(ct$ADVS, top = 5, where = expression(SUBJID == '049'))
# View results
dt
# A tibble: 5 × 37
# STUDYID USUBJID SUBJID SITEID SRCDOM SRCVAR SRCSEQ TRTP TRTPN TRTA TRTAN RANDFL SAFFL
#
# 1 ABC ABC-01-… 049 01 VS VSSTR… 9 ARM D 4 ARM D 4 Y Y
# 2 ABC ABC-01-… 049 01 VS VSSTR… 41 ARM D 4 ARM D 4 Y Y
# 3 ABC ABC-01-… 049 01 NA NA NA ARM D 4 ARM D 4 Y Y
# 4 ABC ABC-01-… 049 01 VS VSSTR… 34 ARM D 4 ARM D 4 Y Y
# 5 ABC ABC-01-… 049 01 VS VSSTR… 35 ARM D 4 ARM D 4 Y Y
# 24 more variables: MITTFL , PPROTFL , TRTSDT , TRTEDT ,
# ADT , ADY , ADTF , AVISIT , AVISITN , PARAM ,
# PARAMCD , PARAMN , PARAMTYP , AVAL , BASE , CHG ,
# AWRANGE , AWTARGET , AWTDIFF , AWLO , AWHI , AWU ,
# ABLFL , ANL01FL
```
The above fetch returned the top 5 rows of the dataset for subject '049'. Note
that the "top" parameter can be used with or without the "where" parameter.
### Apply a Pattern to a Catalog
The previous examples loaded all datasets from the data source into the data
catalog. There may be instances, however, when you have a very large number
of datasets, and want to limit the data items loaded into the catalog. For
this situation, you can use the "pattern" parameter on the `catalog()` function,
like so:
```{r eval=FALSE, echo=TRUE}
# Get sample data directory
pkg <- system.file("extdata", package = "fetch")
# Create catalog, applying pattern to dataset names
ct <- catalog(pkg, engines$csv, pattern = "*S*")
# View catalog
ct
# data catalog: 3 items
# - Source: C:/packages/fetch/inst/extdata
# - Engine: csv
# - Pattern: *S*
# - Items:
# data item 'ADPSGA': 42 cols 695 rows
# data item 'ADSL': 56 cols 87 rows
# data item 'ADVS': 37 cols 3617 rows
```
The above example limits the items in the catalog to those with an "S" in the
name. While this is not a realistic example, it nevertheless demonstrates
how the "pattern" parameter can be used to reduce the datasets loaded into
the catalog.
### Apply a Where Expression to an Entire Catalog
Sometimes you want to subset all the datasets in your catalog. This subset
can be accomplished with the "where" parameter on the `catalog()` function.
Observe:
```{r eval=FALSE, echo=TRUE}
# Get sample data directory
pkg <- system.file("extdata", package = "fetch")
# Create catalog without where expression
ct1 <- catalog(pkg, engines$csv)
# View catalog
ct1
# data catalog: 6 items
# - Source: C:/packages/fetch/inst/extdata
# - Engine: csv
# - Items:
# data item 'ADAE': 56 cols 150 rows
# data item 'ADEX': 17 cols 348 rows
# data item 'ADPR': 37 cols 552 rows
# data item 'ADPSGA': 42 cols 695 rows
# data item 'ADSL': 56 cols 87 rows
# data item 'ADVS': 37 cols 3617 rows
# Create catalog with where expression
ct2 <- catalog(pkg, engines$csv, where = expression(SUBJID == '049'))
# View catalog
ct2
# data catalog: 6 items
# - Source: C:/packages/fetch/inst/extdata
# - Engine: csv
# - Where: SUBJID == "049"
# - Items:
# # data item 'ADAE': 56 cols 5 rows
# - Where: SUBJID == "049"
# # data item 'ADEX': 17 cols 4 rows
# - Where: SUBJID == "049"
# # data item 'ADPR': 37 cols 7 rows
# - Where: SUBJID == "049"
# # data item 'ADPSGA': 42 cols 10 rows
# - Where: SUBJID == "049"
# # data item 'ADSL': 56 cols 1 rows
# - Where: SUBJID == "049"
# # data item 'ADVS': 37 cols 44 rows
# - Where: SUBJID == "049"
```
Notice that a where expression has been added to each data item in the catalog,
and the row counts for each item have been reduced. Now
let's fetch some data from this catalog:
```{r eval=FALSE, echo=TRUE}
# Subset data for a specific subject
dt1 <- fetch(ct2$ADVS)
# View results of ADVS fetch
dt1
# A tibble: 44 × 37
# STUDYID USUBJID SUBJID SITEID SRCDOM SRCVAR SRCSEQ TRTP TRTPN TRTA TRTAN RANDFL SAFFL
#
# 1 ABC ABC-01-049 049 01 VS VSSTRESN 9 ARM D 4 ARM D 4 Y Y
# 2 ABC ABC-01-049 049 01 VS VSSTRESN 41 ARM D 4 ARM D 4 Y Y
# 3 ABC ABC-01-049 049 01 NA NA NA ARM D 4 ARM D 4 Y Y
# 4 ABC ABC-01-049 049 01 VS VSSTRESN 34 ARM D 4 ARM D 4 Y Y
# 5 ABC ABC-01-049 049 01 VS VSSTRESN 35 ARM D 4 ARM D 4 Y Y
# 6 ABC ABC-01-049 049 01 VS VSSTRESN 36 ARM D 4 ARM D 4 Y Y
# 7 ABC ABC-01-049 049 01 VS VSSTRESN 37 ARM D 4 ARM D 4 Y Y
# 8 ABC ABC-01-049 049 01 VS VSSTRESN 38 ARM D 4 ARM D 4 Y Y
# 9 ABC ABC-01-049 049 01 VS VSSTRESN 39 ARM D 4 ARM D 4 Y Y
# 10 ABC ABC-01-049 049 01 VS VSSTRESN 40 ARM D 4 ARM D 4 Y Y
# 34 more rows
# 24 more variables: MITTFL , PPROTFL , TRTSDT , TRTEDT , ADT ,
# ADY , ADTF , AVISIT , AVISITN , PARAM , PARAMCD ,
# PARAMN , PARAMTYP , AVAL , BASE , CHG , AWRANGE ,
# AWTARGET , AWTDIFF , AWLO , AWHI , AWU , ABLFL ,
# ANL01FL
# Use `print(n = ...)` to see more rows
# View results of ADSL fetch
dt2 <- fetch(ct2$ADSL)
# A tibble: 1 × 56
# STUDYID USUBJID SUBJID SITEID AGE AGEU AGEGR1 SEX RACE RACEN ETHNIC ETHNICN COUNTRY
#
# 1 ABC ABC-01-049 049 01 39 YEARS 30-39 y… M WHITE 5 NOT H… 2 NA
# 43 more variables: ARM , ACTARM , TRT01P , TRT01PN , TRT01A ,
# TRT01AN , TRTSDT , TRTEDT , TRTDURN , TRTDURU , TR01SDT ,
# TR01EDT , INCNFL , RANDFL , RANDEXC1 , RANDEXC2 , RANDEXC3 ,
# RANDEXC4 , SAFFL , SAFEXC1 , SAFEXC2 , MITTFL , MITTEXC1 ,
# MITTEXC2 , PPROTFL , PPROTEX1 , PPROTEX2 , PPROTEX3 ,
# PPROTEX4 , COMPLFL , STDYDISP , STDYREAS , INCNDT , RANDDT ,
# DTHDT , DTHFL , MISSDOSE , TP1TRTR , TP2TRTR , TP3TRTR ,...
```
As you can see, both the ADVS and ADSL datasets have been subset for subject '049'.
This feature of the *fetch* package makes it easy to target just the rows
you are looking for.
### Add Import Specifications to a Catalog
Some data files contain accurate data type information within them, and others
do not. For example, CSV files do not contain data type information, but
RDATA files do. If the data type information is not available,
the data engine will try to guess the data types. However, it does not
always guess accurately. Sometimes you need to help the data engine determine
what the data type of a column is supposed to be. Import specs allow you to
to specify the correct data type for a column.
Examine the dictionary for the ADVS dataset:
```{r eval=FALSE, echo=TRUE}
# Get sample data directory
pkg <- system.file("extdata", package = "fetch")
# Create catalog without import spec
ct <- catalog(pkg, engines$csv)
# View dictionary for ADVS
ct$ADVS
# data item 'ADVS': 37 cols 3617 rows
# - Engine: csv
# - Size: 1.1 Mb
# - Last Modified: 2020-09-18 14:30:22
# Name Column Class Label Format NAs MaxChar
# 1 ADVS STUDYID character NA 0 3
# 2 ADVS USUBJID character NA 0 10
# 3 ADVS SUBJID character NA 0 3
# 4 ADVS SITEID character NA 0 2
# 5 ADVS SRCDOM character NA 85 2
# 6 ADVS SRCVAR character NA 85 8
# 7 ADVS SRCSEQ numeric NA 85 2
# 8 ADVS TRTP character NA 85 5
# 9 ADVS TRTPN numeric NA 85 1
# 10 ADVS TRTA character NA 85 5
# 11 ADVS TRTAN numeric NA 85 1
# 12 ADVS RANDFL character NA 0 1
# 13 ADVS SAFFL character NA 0 1
# 14 ADVS MITTFL character NA 0 1
# 15 ADVS PPROTFL character NA 0 1
# 16 ADVS TRTSDT character NA 54 9 # Character by default
# 17 ADVS TRTEDT character NA 119 9 # Character by default
# 18 ADVS ADT character NA 0 9
# 19 ADVS ADY numeric NA 54 4
# 20 ADVS ADTF logical NA 3617 0
# 21 ADVS AVISIT character NA 50 14
# 22 ADVS AVISITN numeric NA 50 2
# 23 ADVS PARAM character NA 0 35
# 24 ADVS PARAMCD character NA 0 6
# 25 ADVS PARAMN numeric NA 0 1
# 26 ADVS PARAMTYP character NA 3532 7
# 27 ADVS AVAL numeric NA 0 5
# 28 ADVS BASE numeric NA 70 5
# 29 ADVS CHG numeric NA 1312 4
# 30 ADVS AWRANGE character NA 1331 25
# 31 ADVS AWTARGET numeric NA 1331 3
# 32 ADVS AWTDIFF numeric NA 1331 2
# 33 ADVS AWLO numeric NA 1331 3
# 34 ADVS AWHI numeric NA 1331 3
# 35 ADVS AWU character NA 1331 4
# 36 ADVS ABLFL character NA 2869 1
# 37 ADVS ANL01FL character NA 448 1
```
Note that the data types of the "TRTSDT" and "TRTEDT" are listed as character.
In fact, these columns are supposed to be a date. You can force
the date data type assignment with an import spec, as follows:
```{r eval=FALSE, echo=TRUE}
# Get sample data directory
pkg <- system.file("extdata", package = "fetch")
# Create import spec
spc <- import_spec(TRTSDT = "date=%d%b%Y",
TRTEDT = "date=%d%b%Y")
# Create catalog with import spec
ct <- catalog(pkg, engines$csv, import_specs = spc)
# View dictionary for ADVS with Import Spec
ct$ADVS
# data item 'ADVS': 37 cols 3617 rows
# - Engine: csv
# - Size: 1.1 Mb
# - Last Modified: 2020-09-18 14:30:22
# Name Column Class Label Format NAs MaxChar
# 1 ADVS STUDYID character NA 0 3
# 2 ADVS USUBJID character NA 0 10
# 3 ADVS SUBJID character NA 0 3
# 4 ADVS SITEID character NA 0 2
# 5 ADVS SRCDOM character NA 85 2
# 6 ADVS SRCVAR character NA 85 8
# 7 ADVS SRCSEQ numeric NA 85 2
# 8 ADVS TRTP character NA 85 5
# 9 ADVS TRTPN numeric NA 85 1
# 10 ADVS TRTA character NA 85 5
# 11 ADVS TRTAN numeric NA 85 1
# 12 ADVS RANDFL character NA 0 1
# 13 ADVS SAFFL character NA 0 1
# 14 ADVS MITTFL character NA 0 1
# 15 ADVS PPROTFL character NA 0 1
# 16 ADVS TRTSDT Date NA 54 10 # Converted to Date
# 17 ADVS TRTEDT Date NA 119 10 # Converted to Date
# 18 ADVS ADT character NA 0 9
# 19 ADVS ADY numeric NA 54 4
# 20 ADVS ADTF logical