A data library is a collection of data sets. The purpose of
the data library is to combine related data sets and provide the opportunity
to manipulate all of them as a single object. A data library is created using
the libname function, which allows you to
load an entire directory of data into memory in one step. The libr
package contains additional functions to add and remove data from the library,
copy the library, and write any changed data to the file system.
libname(
name,
directory_path,
engine = "rds",
read_only = FALSE,
env = parent.frame(),
import_specs = NULL,
filter = NULL,
standard_eval = FALSE,
quiet = FALSE,
log = TRUE,
where = NULL
)
The unquoted name of the library to create. The library name will
be created as a variable in the environment specified on the env
parameter. The default environment is the parent frame. If you want
to pass the library name as a quoted string or a variable,
set the standard_eval
parameter to TRUE to turn off the
non-standard evaluation.
A directory path to associate with the library. If
the directory contains data files of the type specified on the
engine
parameter, they will be imported into the library list. If
the directory does not contain data sets of the appropriate type,
the library will be created empty. If the directory does not exist,
it will be created by the libname
function.
The engine to associate with the library. The specified engine will be used to import and export data. The engine name corresponds to the standard file extension of the data file type. The default engine is 'rds'. Valid values are 'rds', 'Rdata', 'rda', 'sas7bdat', 'xpt', 'xls', 'xlsx', 'dbf', 'csv', 'parquet'.
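For example, a minimal sketch that assigns the csv engine to a library. The library name and directory are illustrative; if the directory contains no csv files, the library is simply created empty.

# Create a library that reads and writes csv files
# (directory path is illustrative)
libname(dat_csv, file.path(tempdir(), "csv_data"), engine = "csv")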
Whether the library should be created as read-only. Default is FALSE. If TRUE, the user will be restricted from appending, removing, or writing any data from memory to the file system.
The environment to use for the libname. Default is parent.frame().
When working inside a function, the parent.frame() will refer to the
local function scope. When working outside a function, the parent.frame()
will be the global environment. If the env parameter is set to a custom
environment, the custom environment will be used for all subsequent
operations with that libname.
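As an illustrative sketch (the environment and library names are assumptions), the library variable can be placed in a custom environment:

# Create a custom environment for the library name
myenv <- new.env()

# The variable 'mylib' is created in 'myenv' rather than the parent frame
libname(mylib, tempdir(), env = myenv)

# Confirm the library variable exists in the custom environment
exists("mylib", envir = myenv, inherits = FALSE)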
A collection of import specifications,
defined using the specs
function.
The import specs should be named according to the file names in
the library directory. See the specs
function for additional
information.
One or more quoted strings to use as filters for the incoming file names. For more than one filter string, pass them as a vector of strings. The filter string can be a full or partial file name, without extension. If using a partial file name, use a wild-card character (*) to identify the missing portion. The match will be case-insensitive.
A TRUE or FALSE value which indicates whether to
use standard (quoted) or non-standard (unquoted) evaluation on the library
name
parameter. Use standard evaluation when you want to pass
the library name with a variable. Default is FALSE.
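For instance, a brief sketch (names illustrative) that passes the library name in a variable:

# Hold the library name in a variable
libnm <- "dat_sales"

# Standard evaluation resolves 'libnm' to the string it contains,
# creating a library named 'dat_sales'
libname(libnm, tempdir(), standard_eval = TRUE)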
When TRUE, minimizes output to the console when loading files. Default is FALSE.
Whether to log the libname operation. Default is TRUE. This parameter is used internally.
An expression used to subset all datasets in the library.
The where clause will be executed when the library is created. Use the
Base R expression
function to define the subset. If a where clause
is supplied, the library will be opened read-only.
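A short sketch of a where clause, assuming a directory that contains only the 'trees' data set (directory and library names are illustrative):

# Set up a directory with a single data set
whrdir <- file.path(tempdir(), "whr")
dir.create(whrdir, showWarnings = FALSE)
saveRDS(trees, file.path(whrdir, "trees.rds"))

# Subset every data set on load; the library is opened read-only
libname(dat_whr, whrdir, where = expression(Height > 75))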
The library object, with all data files loaded into the library list. Items in the list will be named according to the file name, minus the file extension.
For most projects, a data file does not exist in isolation. There are sets of
related files of the same file type. The aim of the libname
function
is to take advantage of this fact, and give you an easy way to manage
the entire set.
The libname
function points to a directory of data files, and associates
a name with that set of data. The name refers to an object of class 'lib',
which at its heart is a named list. When the libname
function
executes, it will load all the data in the directory into the list, and assign
the file name (without extension) as the list item name. Data can be accessed
using list syntax, or loaded directly into the local environment using the
lib_load
function.
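For example, a small sketch of list-style access (library name and directory are illustrative):

# Populate a temp directory and create a library
tmp <- tempdir()
saveRDS(trees, file.path(tmp, "trees.rds"))
libname(dat, tmp)

# Access an item with list syntax
summary(dat$trees)
nrow(dat[["trees"]])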
The libname
function provides several data engines to read
data of different types. For example, there is an engine for Excel
files, and another engine for SAS® datasets. The engines are identified
by the extension of the file type they handle. The available engines are
'rds', 'RData', 'rda', 'csv', 'xlsx', 'xls', 'sas7bdat', 'xpt', 'dbf', and
'parquet'.
Once an engine has been assigned to a library, all other read/write
operations will be performed by that engine.
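For instance, a brief sketch (names illustrative): with the default 'rds' engine, data added to the library and any subsequent writes are handled as rds files.

# Create a library with the default rds engine
libname(dat_rw, tempdir())

# Add a data set; the rds engine writes it to the library directory
lib_add(dat_rw, mtcars)

# Write any changed data back to the file system with the same engine
lib_write(dat_rw)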
The data engines largely hide file import details from you.
The purpose of the libname
function is to make it easy to
import a set of related data files that follow standard conventions.
The function assumes that the
data has file extensions that match the file type, and then makes further
assumptions based on each type of file. As a result, there are very few
import options on the libname
function. If your data does not
follow standard conventions, it is recommended that you import your
data using a package that gives you more control over import options.
The libname
function currently provides several different engines, each handling a
different type of data file.
Here is a complete list of available engines and some commentary about each:
rds: For R data sets. This engine is the default. Because detailed data type and attribute information can be stored inside the rds file, the rds engine is the most reliable and easiest to use.
Rdata and rda: Older R data storage formats. Like the 'rds' engine, these storage types retain column attributes and data types.
csv: For comma separated value files. This engine assumes
that the first row has column names, and that strings containing commas are
quoted. Blank values and the string 'NA' will be interpreted as NA.
Because data type information is not stored in csv files, the
csv engine will attempt to guess the data types based on the available data.
For most columns, the csv engine is able to guess accurately. Where
it fails most commonly is with date and time columns. For csv date and time
columns, it is therefore recommended to assign an import spec that tells
the engine how to read the date or time. See the specs
documentation for additional details.
xlsx: For Excel files produced with the current version of Excel. Excel provides more data type information than csv, but it is not as accurate as rds. Therefore, you may also need to provide import specifications with Excel files. Also note that currently the xlsx import engine will only import the first sheet of an Excel workbook. If you need to import a sheet that is not the first sheet, use a different package to import the data.
xls: An Excel file format used between 1997 and 2003, and still used in some organizations. As with xlsx, this file format provides more information than csv, but is not entirely reliable. Therefore, you may need to provide import specifications to the xls engine. Also note that the xls engine can read, but not write xls files. Any xls files read with the xls engine will be written as an xlsx file. Like the xlsx engine, the xls engine can only read the first sheet of a workbook.
sas7bdat: Handles SAS® datasets. SAS® datasets provide better type information than either csv or Excel. In most cases, you will not need to define import specifications for SAS® datasets. The sas7bdat engine interprets empty strings, single blanks, and a single dot (".") as missing values. While the import of SAS® datasets is fairly reliable, sas7bdat files cannot be written or exported with the sas7bdat engine. In these cases, it is recommended to export to another file format, such as csv, dbf, or parquet, and then import into SAS®.
xpt: The SAS® transport file engine. Transport format is a platform independent file format. Similar to SAS® datasets, it provides data type information. In most cases, you will not need to define import specifications. The xpt engine also interprets empty strings, single blanks, and a single dot (".") as missing values.
dbf: The DBASE file format engine. The DBASE engine was added to the libr package because many types of software can read and write in DBASE format reliably. Therefore it is a useful file format for interchange between software systems. The DBASE file format contains type information.
parquet: The parquet file format is an open source format supported by many languages and software. The format is strongly typed, and offers fast data storage and retrieval. The libr package can both read from and write to parquet files.
If you wish to import only a portion of your data files into a library,
you may accomplish it with the filter
parameter. The filter
parameter allows you to pass a vector of strings corresponding to the
names of the files you want to import. The function allows a
wild-card (*) for partial matching. For example, "te*"
means any
file name that begins with "te", and "*st"
means any file name
that ends with "st".
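Here is a small sketch of the filter parameter (directory and library names are illustrative):

# Populate a temp directory with two data sets
fdir <- file.path(tempdir(), "fdat")
dir.create(fdir, showWarnings = FALSE)
saveRDS(trees, file.path(fdir, "trees.rds"))
saveRDS(rock, file.path(fdir, "rocks.rds"))

# Import only file names that begin with "t"
libname(dat_t, fdir, filter = "t*")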
In most cases, it is not necessary to specify the data types for incoming columns in your data. Either the file format will preserve the appropriate data type information, or the assigned engine will guess correctly.
However, in some cases it will be necessary to control the column data types.
For these cases, use the
import_specs
parameter. The import_specs
parameter allows you
to specify the data types by data set and column name. All the data type
specifications are contained within a specs
collection, and the
specifications for a particular data set are defined by an
import_spec
function. See the specs
and
import_spec
documentation for further information
and examples of defining an import spec.
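The sketch below illustrates the general shape of an import specification, assuming a csv file named 'AE' with a date column 'AESTDT'; the file name, column name, and date format string are assumptions, and the exact type strings accepted are documented under import_spec.

# Define import specifications: one import_spec() per data set,
# named after the file, with column = type pairs
# (the "date=" format string is illustrative)
spcs <- specs(AE = import_spec(AESTDT = "date=%Y-%m-%d"))

# Apply the specifications when the library is created
libname(dat_ae, file.path(tempdir(), "ae_csv"), engine = "csv",
        import_specs = spcs)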
specs
to define import specifications,
dictionary
to view the data dictionary for a library,
and datastep
to perform a data step.
Other lib:
is.lib(), lib_add(), lib_copy(), lib_delete(), lib_export(), lib_info(),
lib_load(), lib_path(), lib_remove(), lib_replace(), lib_size(), lib_sync(),
lib_unload(), lib_write(), print.lib()
# Create temp directory
tmp <- tempdir()
# Save some data to temp directory
# for illustration purposes
saveRDS(trees, file.path(tmp, "trees.rds"))
saveRDS(rock, file.path(tmp, "rocks.rds"))
saveRDS(beaver1, file.path(tmp, "beaver1.rds"))
# Create data library
libname(dat, tmp)
# # library 'dat': 3 items
# - attributes: rds not loaded
# - path: C:\Users\User\AppData\Local\Temp\RtmpklJcfl
# - items:
# Name Extension Rows Cols Size LastModified
# 1 beaver1 rds 114 4 5.9 Kb 2020-12-06 15:21:30
# 2 rocks rds 48 4 3.6 Kb 2020-12-06 15:21:30
# 3 trees rds 31 3 2.9 Kb 2020-12-06 15:21:30
# Print dictionary for library
dictionary(dat)
# A tibble: 11 x 10
# Name Column Class Label Description Format Width Justify Rows NAs
# <chr> <chr> <chr> <chr> <chr> <lgl> <lgl> <chr> <int> <int>
# 1 beaver1 day numeric NA NA NA NA NA 114 0
# 2 beaver1 time numeric NA NA NA NA NA 114 0
# 3 beaver1 temp numeric NA NA NA NA NA 114 0
# 4 beaver1 activ numeric NA NA NA NA NA 114 0
# 5 rocks area integer NA NA NA NA NA 48 0
# 6 rocks peri numeric NA NA NA NA NA 48 0
# 7 rocks shape numeric NA NA NA NA NA 48 0
# 8 rocks perm numeric NA NA NA NA NA 48 0
# 9 trees Girth numeric NA NA NA NA NA 31 0
# 10 trees Height numeric NA NA NA NA NA 31 0
# 11 trees Volume numeric NA NA NA NA NA 31 0
# Load library into workspace
lib_load(dat)
# Print summaries for each data frame
# Note that once loaded into the workspace,
# data can be accessed using two-level syntax.
summary(dat.rocks)
summary(dat.trees)
summary(dat.beaver1)
# Unload from workspace
lib_unload(dat)
# Clean up
lib_delete(dat)