--- title: "fetch" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{fetch} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` In 2020 I wrote a package called [libr](https://libr.r-sassy.org) to deal with the problem of managing multiple related datasets. This package allows you to manage data from several popular data sources: SAS, Excel, CSV, and R. While the package received generally positive feedback, users requested some features that the package did not support: * Users wanted to examine the data *before* loading it into memory. * Users wanted to pick only *some* datasets to load into memory. * Users wanted to subset all datasets in the library based on a condition. After consideration of these requests, I determined that these features would be most easily satisfied by creating a new package. That package is **fetch**. ## Key Features The **fetch** package allows you to retrieve data from any of several different data sources while minimizing memory usage. There are only two steps to retrieve data: * Create a data catalog * Fetch data from the catalog The above two steps are accomplished by the following functions: * `catalog()`: Defines a data catalog for a particular data source * `fetch()`: Pulls a data item from the catalog and loads it into memory The package has the following key features: * The `catalog()` function allows you to explore the data before reading from the source. * The `fetch()` function loads only the data you need for your analysis. * The "where" parameter on the `catalog()` function allows you to subset across all datasets in the catalog. * The "select" and "top" parameters on the `fetch()` function let you further limit the data returned. * The "import_spec" parameters on the `catalog()` and `fetch()` functions give you control over the column data types. ## How to use **fetch** Let's start with a simple example. In this example we will create a data catalog, examine its contents, and fetch a dataset from the catalog. First, create and view the data catalog: ```{r eval=FALSE, echo=TRUE} # Get sample data directory pkg <- system.file("extdata", package = "fetch") # Create catalog ct <- catalog(pkg, engines$csv) # View catalog ct # data catalog: 6 items # - Source: C:/packages/fetch/inst/extdata # - Engine: csv # - Items: # data item 'ADAE': 56 cols 150 rows # data item 'ADEX': 17 cols 348 rows # data item 'ADPR': 37 cols 552 rows # data item 'ADPSGA': 42 cols 695 rows # data item 'ADSL': 56 cols 87 rows # data item 'ADVS': 37 cols 3617 rows ``` As can be seen above, this catalog has 6 csv files in it. The number of columns and rows are displayed for each file. Information about each file is contained in the catalog. This information can be accessed using list notion, like so: ```{r eval=FALSE, echo=TRUE} # View info for catalog item ct$ADEX # data item 'ADEX': 17 cols 348 rows # - Engine: csv # - Size: 70.7 Kb # - Last Modified: 2020-09-18 14:30:22 # Name Column Class Label Format NAs MaxChar # 1 ADEX STUDYID character NA 0 3 # 2 ADEX USUBJID character NA 0 10 # 3 ADEX SUBJID character NA 0 3 # 4 ADEX SITEID character NA 0 2 # 5 ADEX TRTP character NA 8 5 # 6 ADEX TRTPN numeric NA 8 1 # 7 ADEX TRTA character NA 8 5 # 8 ADEX TRTAN numeric NA 8 1 # 9 ADEX RANDFL character NA 0 1 # 10 ADEX SAFFL character NA 0 1 # 11 ADEX MITTFL character NA 0 1 # 12 ADEX PPROTFL character NA 0 1 # 13 ADEX PARAM character NA 0 45 # 14 ADEX PARAMCD character NA 0 8 # 15 ADEX PARAMN numeric NA 0 1 # 16 ADEX AVAL numeric NA 16 4 # 17 ADEX AVALCAT1 character NA 87 10 ``` When the catalog item is printed, it shows a data dictionary for the specified dataset. Column names and data types are displayed, along with some other useful attributes of your data. Once the catalog is created, you are ready to fetch data from the catalog: ```{r eval=FALSE, echo=TRUE} # Fetch data from a catalog dt <- fetch(ct$ADEX) # View data dt # A tibble: 348 × 17 # STUDYID USUBJID SUBJID SITEID TRTP TRTPN TRTA TRTAN RANDFL SAFFL MITTFL PPROTFL # # 1 ABC ABC-01-049 049 01 ARM D 4 ARM D 4 Y Y Y Y # 2 ABC ABC-01-049 049 01 ARM D 4 ARM D 4 Y Y Y Y # 3 ABC ABC-01-049 049 01 ARM D 4 ARM D 4 Y Y Y Y # 4 ABC ABC-01-049 049 01 ARM D 4 ARM D 4 Y Y Y Y # 5 ABC ABC-01-050 050 01 ARM B 2 ARM B 2 Y Y Y Y # 6 ABC ABC-01-050 050 01 ARM B 2 ARM B 2 Y Y Y Y # 7 ABC ABC-01-050 050 01 ARM B 2 ARM B 2 Y Y Y Y # 8 ABC ABC-01-050 050 01 ARM B 2 ARM B 2 Y Y Y Y # 9 ABC ABC-01-051 051 01 ARM A 1 ARM A 1 Y Y Y Y # 10 ABC ABC-01-051 051 01 ARM A 1 ARM A 1 Y Y Y Y # 338 more rows # 5 more variables: PARAM , PARAMCD , PARAMN , AVAL , # AVALCAT1 # Use `print(n = ...)` to see more rows ``` At this point, you can proceed with your analysis, or fetch additional datasets from the catalog as needed. So easy! ## Additional Features The above example covers the very basics. Now let's look at some more useful features of the **fetch** package. ### Apply a Where Expression to a Fetch Operation What if you are performing an analysis on a subset of data? Is there a way to apply a subset when the data is fetched? You can apply subset criteria using the "where" parameter on the `fetch()` function. Like this: ```{r eval=FALSE, echo=TRUE} # Get sample data directory pkg <- system.file("extdata", package = "fetch") # Create catalog ct <- catalog(pkg, engines$csv) # Subset data for a specific subject dt <- fetch(ct$ADVS, where = expression(SUBJID == '049')) # View data subset dt # A tibble: 44 × 37 # STUDYID USUBJID SUBJID SITEID SRCDOM SRCVAR SRCSEQ TRTP TRTPN TRTA TRTAN RANDFL # # 1 ABC ABC-01-049 049 01 VS VSSTRESN 9 ARM D 4 ARM D 4 Y # 2 ABC ABC-01-049 049 01 VS VSSTRESN 41 ARM D 4 ARM D 4 Y # 3 ABC ABC-01-049 049 01 NA NA NA ARM D 4 ARM D 4 Y # 4 ABC ABC-01-049 049 01 VS VSSTRESN 34 ARM D 4 ARM D 4 Y # 5 ABC ABC-01-049 049 01 VS VSSTRESN 35 ARM D 4 ARM D 4 Y # 6 ABC ABC-01-049 049 01 VS VSSTRESN 36 ARM D 4 ARM D 4 Y # 7 ABC ABC-01-049 049 01 VS VSSTRESN 37 ARM D 4 ARM D 4 Y # 8 ABC ABC-01-049 049 01 VS VSSTRESN 38 ARM D 4 ARM D 4 Y # 9 ABC ABC-01-049 049 01 VS VSSTRESN 39 ARM D 4 ARM D 4 Y # 10 ABC ABC-01-049 049 01 VS VSSTRESN 40 ARM D 4 ARM D 4 Y # 34 more rows # 25 more variables: SAFFL , MITTFL , PPROTFL , TRTSDT , # TRTEDT , ADT , ADY , ADTF , AVISIT , AVISITN , # PARAM , PARAMCD , PARAMN , PARAMTYP , AVAL , BASE , # CHG , AWRANGE , AWTARGET , AWTDIFF , AWLO , AWHI , # AWU , ABLFL , ANL01FL # Use `print(n = ...)` to see more rows ``` The `expression` function can be used to define any R expression. This function accepts unquoted variable names, logical operators, and R functions like `is.null()` and `is.na()`. You can use the the "where" parameter with an expression to narrow down the data you are working with and reduce the memory footprint of your program. ### Get Top N Rows Another way to limit the data returned by your program is to use the "top" parameter. This parameter allows you to return only the "top N" number of rows in the target data item. The "top" parameter is useful for exploratory purposes. Here is an example: ```{r eval=FALSE, echo=TRUE} # Get sample data directory pkg <- system.file("extdata", package = "fetch") # Create catalog ct <- catalog(pkg, engines$csv) # Subset data for a specific subject dt <- fetch(ct$ADVS, top = 5, where = expression(SUBJID == '049')) # View results dt # A tibble: 5 × 37 # STUDYID USUBJID SUBJID SITEID SRCDOM SRCVAR SRCSEQ TRTP TRTPN TRTA TRTAN RANDFL SAFFL # # 1 ABC ABC-01-… 049 01 VS VSSTR… 9 ARM D 4 ARM D 4 Y Y # 2 ABC ABC-01-… 049 01 VS VSSTR… 41 ARM D 4 ARM D 4 Y Y # 3 ABC ABC-01-… 049 01 NA NA NA ARM D 4 ARM D 4 Y Y # 4 ABC ABC-01-… 049 01 VS VSSTR… 34 ARM D 4 ARM D 4 Y Y # 5 ABC ABC-01-… 049 01 VS VSSTR… 35 ARM D 4 ARM D 4 Y Y # 24 more variables: MITTFL , PPROTFL , TRTSDT , TRTEDT , # ADT , ADY , ADTF , AVISIT , AVISITN , PARAM , # PARAMCD , PARAMN , PARAMTYP , AVAL , BASE , CHG , # AWRANGE , AWTARGET , AWTDIFF , AWLO , AWHI , AWU , # ABLFL , ANL01FL ``` The above fetch returned the top 5 rows of the dataset for subject '049'. Note that the "top" parameter can be used with or without the "where" parameter. ### Apply a Pattern to a Catalog The previous examples loaded all datasets from the data source into the data catalog. There may be instances, however, when you have a very large number of datasets, and want to limit the data items loaded into the catalog. For this situation, you can use the "pattern" parameter on the `catalog()` function, like so: ```{r eval=FALSE, echo=TRUE} # Get sample data directory pkg <- system.file("extdata", package = "fetch") # Create catalog, applying pattern to dataset names ct <- catalog(pkg, engines$csv, pattern = "*S*") # View catalog ct # data catalog: 3 items # - Source: C:/packages/fetch/inst/extdata # - Engine: csv # - Pattern: *S* # - Items: # data item 'ADPSGA': 42 cols 695 rows # data item 'ADSL': 56 cols 87 rows # data item 'ADVS': 37 cols 3617 rows ``` The above example limits the items in the catalog to those with an "S" in the name. While this is not a realistic example, it nevertheless demonstrates how the "pattern" parameter can be used to reduce the datasets loaded into the catalog. ### Apply a Where Expression to an Entire Catalog Sometimes you want to subset all the datasets in your catalog. This subset can be accomplished with the "where" parameter on the `catalog()` function. Observe: ```{r eval=FALSE, echo=TRUE} # Get sample data directory pkg <- system.file("extdata", package = "fetch") # Create catalog without where expression ct1 <- catalog(pkg, engines$csv) # View catalog ct1 # data catalog: 6 items # - Source: C:/packages/fetch/inst/extdata # - Engine: csv # - Items: # data item 'ADAE': 56 cols 150 rows # data item 'ADEX': 17 cols 348 rows # data item 'ADPR': 37 cols 552 rows # data item 'ADPSGA': 42 cols 695 rows # data item 'ADSL': 56 cols 87 rows # data item 'ADVS': 37 cols 3617 rows # Create catalog with where expression ct2 <- catalog(pkg, engines$csv, where = expression(SUBJID == '049')) # View catalog ct2 # data catalog: 6 items # - Source: C:/packages/fetch/inst/extdata # - Engine: csv # - Where: SUBJID == "049" # - Items: # # data item 'ADAE': 56 cols 5 rows # - Where: SUBJID == "049" # # data item 'ADEX': 17 cols 4 rows # - Where: SUBJID == "049" # # data item 'ADPR': 37 cols 7 rows # - Where: SUBJID == "049" # # data item 'ADPSGA': 42 cols 10 rows # - Where: SUBJID == "049" # # data item 'ADSL': 56 cols 1 rows # - Where: SUBJID == "049" # # data item 'ADVS': 37 cols 44 rows # - Where: SUBJID == "049" ``` Notice that a where expression has been added to each data item in the catalog, and the row counts for each item have been reduced. Now let's fetch some data from this catalog: ```{r eval=FALSE, echo=TRUE} # Subset data for a specific subject dt1 <- fetch(ct2$ADVS) # View results of ADVS fetch dt1 # A tibble: 44 × 37 # STUDYID USUBJID SUBJID SITEID SRCDOM SRCVAR SRCSEQ TRTP TRTPN TRTA TRTAN RANDFL SAFFL # # 1 ABC ABC-01-049 049 01 VS VSSTRESN 9 ARM D 4 ARM D 4 Y Y # 2 ABC ABC-01-049 049 01 VS VSSTRESN 41 ARM D 4 ARM D 4 Y Y # 3 ABC ABC-01-049 049 01 NA NA NA ARM D 4 ARM D 4 Y Y # 4 ABC ABC-01-049 049 01 VS VSSTRESN 34 ARM D 4 ARM D 4 Y Y # 5 ABC ABC-01-049 049 01 VS VSSTRESN 35 ARM D 4 ARM D 4 Y Y # 6 ABC ABC-01-049 049 01 VS VSSTRESN 36 ARM D 4 ARM D 4 Y Y # 7 ABC ABC-01-049 049 01 VS VSSTRESN 37 ARM D 4 ARM D 4 Y Y # 8 ABC ABC-01-049 049 01 VS VSSTRESN 38 ARM D 4 ARM D 4 Y Y # 9 ABC ABC-01-049 049 01 VS VSSTRESN 39 ARM D 4 ARM D 4 Y Y # 10 ABC ABC-01-049 049 01 VS VSSTRESN 40 ARM D 4 ARM D 4 Y Y # 34 more rows # 24 more variables: MITTFL , PPROTFL , TRTSDT , TRTEDT , ADT , # ADY , ADTF , AVISIT , AVISITN , PARAM , PARAMCD , # PARAMN , PARAMTYP , AVAL , BASE , CHG , AWRANGE , # AWTARGET , AWTDIFF , AWLO , AWHI , AWU , ABLFL , # ANL01FL # Use `print(n = ...)` to see more rows # View results of ADSL fetch dt2 <- fetch(ct2$ADSL) # A tibble: 1 × 56 # STUDYID USUBJID SUBJID SITEID AGE AGEU AGEGR1 SEX RACE RACEN ETHNIC ETHNICN COUNTRY # # 1 ABC ABC-01-049 049 01 39 YEARS 30-39 y… M WHITE 5 NOT H… 2 NA # 43 more variables: ARM , ACTARM , TRT01P , TRT01PN , TRT01A , # TRT01AN , TRTSDT , TRTEDT , TRTDURN , TRTDURU , TR01SDT , # TR01EDT , INCNFL , RANDFL , RANDEXC1 , RANDEXC2 , RANDEXC3 , # RANDEXC4 , SAFFL , SAFEXC1 , SAFEXC2 , MITTFL , MITTEXC1 , # MITTEXC2 , PPROTFL , PPROTEX1 , PPROTEX2 , PPROTEX3 , # PPROTEX4 , COMPLFL , STDYDISP , STDYREAS , INCNDT , RANDDT , # DTHDT , DTHFL , MISSDOSE , TP1TRTR , TP2TRTR , TP3TRTR ,... ``` As you can see, both the ADVS and ADSL datasets have been subset for subject '049'. This feature of the *fetch* package makes it easy to target just the rows you are looking for. ### Add Import Specifications to a Catalog Some data files contain accurate data type information within them, and others do not. For example, CSV files do not contain data type information, but RDATA files do. If the data type information is not available, the data engine will try to guess the data types. However, it does not always guess accurately. Sometimes you need to help the data engine determine what the data type of a column is supposed to be. Import specs allow you to to specify the correct data type for a column. Examine the dictionary for the ADVS dataset: ```{r eval=FALSE, echo=TRUE} # Get sample data directory pkg <- system.file("extdata", package = "fetch") # Create catalog without import spec ct <- catalog(pkg, engines$csv) # View dictionary for ADVS ct$ADVS # data item 'ADVS': 37 cols 3617 rows # - Engine: csv # - Size: 1.1 Mb # - Last Modified: 2020-09-18 14:30:22 # Name Column Class Label Format NAs MaxChar # 1 ADVS STUDYID character NA 0 3 # 2 ADVS USUBJID character NA 0 10 # 3 ADVS SUBJID character NA 0 3 # 4 ADVS SITEID character NA 0 2 # 5 ADVS SRCDOM character NA 85 2 # 6 ADVS SRCVAR character NA 85 8 # 7 ADVS SRCSEQ numeric NA 85 2 # 8 ADVS TRTP character NA 85 5 # 9 ADVS TRTPN numeric NA 85 1 # 10 ADVS TRTA character NA 85 5 # 11 ADVS TRTAN numeric NA 85 1 # 12 ADVS RANDFL character NA 0 1 # 13 ADVS SAFFL character NA 0 1 # 14 ADVS MITTFL character NA 0 1 # 15 ADVS PPROTFL character NA 0 1 # 16 ADVS TRTSDT character NA 54 9 # Character by default # 17 ADVS TRTEDT character NA 119 9 # Character by default # 18 ADVS ADT character NA 0 9 # 19 ADVS ADY numeric NA 54 4 # 20 ADVS ADTF logical NA 3617 0 # 21 ADVS AVISIT character NA 50 14 # 22 ADVS AVISITN numeric NA 50 2 # 23 ADVS PARAM character NA 0 35 # 24 ADVS PARAMCD character NA 0 6 # 25 ADVS PARAMN numeric NA 0 1 # 26 ADVS PARAMTYP character NA 3532 7 # 27 ADVS AVAL numeric NA 0 5 # 28 ADVS BASE numeric NA 70 5 # 29 ADVS CHG numeric NA 1312 4 # 30 ADVS AWRANGE character NA 1331 25 # 31 ADVS AWTARGET numeric NA 1331 3 # 32 ADVS AWTDIFF numeric NA 1331 2 # 33 ADVS AWLO numeric NA 1331 3 # 34 ADVS AWHI numeric NA 1331 3 # 35 ADVS AWU character NA 1331 4 # 36 ADVS ABLFL character NA 2869 1 # 37 ADVS ANL01FL character NA 448 1 ``` Note that the data types of the "TRTSDT" and "TRTEDT" are listed as character. In fact, these columns are supposed to be a date. You can force the date data type assignment with an import spec, as follows: ```{r eval=FALSE, echo=TRUE} # Get sample data directory pkg <- system.file("extdata", package = "fetch") # Create import spec spc <- import_spec(TRTSDT = "date=%d%b%Y", TRTEDT = "date=%d%b%Y") # Create catalog with import spec ct <- catalog(pkg, engines$csv, import_specs = spc) # View dictionary for ADVS with Import Spec ct$ADVS # data item 'ADVS': 37 cols 3617 rows # - Engine: csv # - Size: 1.1 Mb # - Last Modified: 2020-09-18 14:30:22 # Name Column Class Label Format NAs MaxChar # 1 ADVS STUDYID character NA 0 3 # 2 ADVS USUBJID character NA 0 10 # 3 ADVS SUBJID character NA 0 3 # 4 ADVS SITEID character NA 0 2 # 5 ADVS SRCDOM character NA 85 2 # 6 ADVS SRCVAR character NA 85 8 # 7 ADVS SRCSEQ numeric NA 85 2 # 8 ADVS TRTP character NA 85 5 # 9 ADVS TRTPN numeric NA 85 1 # 10 ADVS TRTA character NA 85 5 # 11 ADVS TRTAN numeric NA 85 1 # 12 ADVS RANDFL character NA 0 1 # 13 ADVS SAFFL character NA 0 1 # 14 ADVS MITTFL character NA 0 1 # 15 ADVS PPROTFL character NA 0 1 # 16 ADVS TRTSDT Date NA 54 10 # Converted to Date # 17 ADVS TRTEDT Date NA 119 10 # Converted to Date # 18 ADVS ADT character NA 0 9 # 19 ADVS ADY numeric NA 54 4 # 20 ADVS ADTF logical NA 3617 0 # 21 ADVS AVISIT character NA 50 14 # 22 ADVS AVISITN numeric NA 50 2 # 23 ADVS PARAM character NA 0 35 # 24 ADVS PARAMCD character NA 0 6 # 25 ADVS PARAMN numeric NA 0 1 # 26 ADVS PARAMTYP character NA 3532 7 # 27 ADVS AVAL numeric NA 0 5 # 28 ADVS BASE numeric NA 70 5 # 29 ADVS CHG numeric NA 1312 4 # 30 ADVS AWRANGE character NA 1331 25 # 31 ADVS AWTARGET numeric NA 1331 3 # 32 ADVS AWTDIFF numeric NA 1331 2 # 33 ADVS AWLO numeric NA 1331 3 # 34 ADVS AWHI numeric NA 1331 3 # 35 ADVS AWU character NA 1331 4 # 36 ADVS ABLFL character NA 2869 1 # 37 ADVS ANL01FL character NA 448 1 ``` Observe that the columns "TRTSDT" and "TRTEDT" are now converted to Date columns. This import spec will apply to all "TRTSDT" and "TRTEDT" columns in all datasets. The format "%d%b%Y" was written specifically to read in the data value this files uses. See the `import_spec()` documentation for more information on these data formats, and for other possible data type specifications. Also see the `specs()` documentation if you want to assign a unique import specifications to each data item in the catalog. ### Add Import Specifications to a Fetch You may also add an import specification to the fetch operation. The import specification is defined with the `import_spec()` function in the same way as the catalog operation: ```{r eval=FALSE, echo=TRUE} # Get sample data directory pkg <- system.file("extdata", package = "fetch") # Create import spec spc <- import_spec(TRTSDT = "date=%d%b%Y", TRTEDT = "date=%d%b%Y") # Create catalog with import spec ct <- catalog(pkg, engines$csv, import_specs = spc) # View dictionary for ADVS with Import Spec ct$ADVS # data item 'ADVS': 37 cols 3617 rows # - Engine: csv # - Size: 1.1 Mb # - Last Modified: 2020-09-18 14:30:22 # Name Column Class Label Format NAs MaxChar # 1 ADVS STUDYID character NA 0 3 # 2 ADVS USUBJID character NA 0 10 # 3 ADVS SUBJID character NA 0 3 # 4 ADVS SITEID character NA 0 2 # 5 ADVS SRCDOM character NA 85 2 # 6 ADVS SRCVAR character NA 85 8 # 7 ADVS SRCSEQ numeric NA 85 2 # 8 ADVS TRTP character NA 85 5 # 9 ADVS TRTPN numeric NA 85 1 # 10 ADVS TRTA character NA 85 5 # 11 ADVS TRTAN numeric NA 85 1 # 12 ADVS RANDFL character NA 0 1 # 13 ADVS SAFFL character NA 0 1 # 14 ADVS MITTFL character NA 0 1 # 15 ADVS PPROTFL character NA 0 1 # 16 ADVS TRTSDT Date NA 54 10 # 17 ADVS TRTEDT Date NA 119 10 # 18 ADVS ADT character NA 0 9 # 19 ADVS ADY numeric NA 54 4 # 20 ADVS ADTF logical NA 3617 0 # 21 ADVS AVISIT character NA 50 14 # 22 ADVS AVISITN numeric NA 50 2 # 23 ADVS PARAM character NA 0 35 # 24 ADVS PARAMCD character NA 0 6 # 25 ADVS PARAMN numeric NA 0 1 # 26 ADVS PARAMTYP character NA 3532 7 # 27 ADVS AVAL numeric NA 0 5 # 28 ADVS BASE numeric NA 70 5 # 29 ADVS CHG numeric NA 1312 4 # 30 ADVS AWRANGE character NA 1331 25 # 31 ADVS AWTARGET numeric NA 1331 3 # 32 ADVS AWTDIFF numeric NA 1331 2 # 33 ADVS AWLO numeric NA 1331 3 # 34 ADVS AWHI numeric NA 1331 3 # 35 ADVS AWU character NA 1331 4 # 36 ADVS ABLFL character NA 2869 1 # 37 ADVS ANL01FL character NA 448 1 ``` The results of this operation are similar to the previous example, when we applied the import spec to the catalog. The difference is that now the import spec applies only to this specific fetch operation, not the entire catalog. ## Next Steps For next steps, it is suggested to play around with the **fetch** functions, and note how easy these functions are to use. You may also wish to review the documentation for the `catalog()`, `fetch()`, `import_spec()` functions. The documentation contains some additional examples and descriptions of all parameters.