A data library is a collection of data sets. The purpose of
the data library is to combine related data sets and provide the opportunity
to manipulate all of them as a single object. A data library is created using
the libname function. The libname function allows you to
load an entire directory of data into memory in one step. The libr
package contains additional functions to add and remove data from the library,
copy the library, and write any changed data to the file system.
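As a rough illustration of the add, remove, and write operations mentioned above, the following sketch uses the libr functions lib_add, lib_remove, lib_write, and lib_delete. The directory name "libr_intro_demo" and the added column are hypothetical, chosen only for this example.

```r
library(libr)

# Hypothetical scratch directory, used only for this sketch
tmp <- file.path(tempdir(), "libr_intro_demo")
dir.create(tmp, showWarnings = FALSE)

# Create an (initially empty) library on the directory
libname(dat, tmp, quiet = TRUE)

# Add two built-in data frames to the library
lib_add(dat, mtcars, iris)

# Remove one of them again
lib_remove(dat, name = "iris")

# Change a data set in memory, then write changed data to the file system
dat$mtcars$kpl <- dat$mtcars$mpg * 0.425
lib_write(dat)

# Record what is left in the library before cleaning up
remaining <- names(dat)
print(remaining)

# Clean up: removes the library and its files
lib_delete(dat)
```

The item names are captured before lib_delete is called, because the library variable is no longer usable afterward.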
libname(
  name,
  directory_path,
  engine = "rds",
  read_only = FALSE,
  env = parent.frame(),
  import_specs = NULL,
  filter = NULL,
  standard_eval = FALSE,
  quiet = FALSE
)
name: The unquoted name of the library to create. The library name will be created as a variable in the environment specified on the env parameter.
directory_path: A directory path to associate with the library. If the directory contains data files of the type specified on the engine parameter, they will be loaded into the library.
engine: The engine to associate with the library. The specified engine will be used to import and export data. The engine name corresponds to the standard file extension of the data file type. The default engine is 'rds'. Valid values are 'rds', 'sas7bdat', 'xpt', 'xls', 'xlsx', 'dbf', and 'csv'.
read_only: Whether the library should be created as read-only. Default is FALSE. If TRUE, the user will be restricted from appending, removing, or writing any data from memory to the file system.
env: The environment to use for the libname. The default is the parent frame.
import_specs: A collection of import specifications, defined using the specs function.
filter: One or more quoted strings to use as filters for the incoming file names. For more than one filter string, pass them as a vector of strings. The filter string can be a full or partial file name, without extension. If using a partial file name, use a wild-card character (*) to identify the missing portion. The match will be case-insensitive.
standard_eval: A TRUE or FALSE value which indicates whether to use standard (quoted) or non-standard (unquoted) evaluation on the library name. Default is FALSE.
quiet: When TRUE, minimizes output to the console when loading files. Default is FALSE.
The library object, with all data files loaded into the library list. Items in the list will be named according to the file name, minus the file extension.
For most projects, a data file does not exist in isolation. There are sets of
related files of the same file type. The aim of the libname function
is to take advantage of this fact, and give you an easy way to manage
the entire set.
The libname function points to a directory of data files, and associates
a name with that set of data. The name refers to an object of class 'lib',
which at its heart is a named list. When the libname function
executes, it will load all the data in the directory into the list, and assign
the file name (without extension) as the list item name. Data can be accessed
using list syntax, or loaded directly into the local environment using the
lib_load function.
The libname function provides several data engines to read
data of different types. For example, there is an engine for Excel
files, and another engine for SAS® datasets. The engines are identified
by the extension of the file type they handle. The available engines are
'rds', 'csv', 'xlsx', 'xls', 'sas7bdat', 'xpt', and 'dbf'.
Once an engine has been assigned to a library, all other read/write
operations will be performed by that engine.
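Because a library is at heart a named list, its items can be reached with ordinary list syntax. The sketch below assumes the default rds engine and a hypothetical scratch directory named "libr_list_demo".

```r
library(libr)

# Hypothetical scratch directory, used only for this sketch
tmp <- file.path(tempdir(), "libr_list_demo")
dir.create(tmp, showWarnings = FALSE)

# Save two built-in data frames as rds files
saveRDS(trees, file.path(tmp, "trees.rds"))
saveRDS(rock, file.path(tmp, "rocks.rds"))

# Load the whole directory with the default rds engine
libname(dat, tmp, quiet = TRUE)

# Item names come from the file names, minus the extension
item_names <- names(dat)
print(item_names)

# Access individual data sets with list syntax
tree_rows <- nrow(dat$trees)
rock_cols <- ncol(dat[["rocks"]])

# Clean up: removes the library and its files
lib_delete(dat)
```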
The data engines largely hide file import details from you.
The purpose of the
libname function is to make it easy to
import a set of related data files that follow standard conventions.
The function assumes that the
data has file extensions that match the file type, and then makes further
assumptions based on each type of file. As a result, there are very few
import options on the
libname function. If your data does not
follow standard conventions, it is recommended that you import your
data using a package that gives you more control over import options.
The libname function currently provides seven different engines for
seven different types of data files.
Here is a complete list of available engines and some commentary about each:
rds: For R data sets. This engine is the default. Because detailed data type and attribute information can be stored inside the rds file, the rds engine is the most reliable and easiest to use.
csv: For comma-separated value files. This engine assumes
that the first row has column names, and that strings containing commas are
quoted. Blank values and the string 'NA' will be interpreted as NA.
Because data type information is not stored in csv files, the
csv engine will attempt to guess the data types based on the available data.
For most columns, the csv engine is able to guess accurately. Where
it fails most commonly is with date and time columns. For csv date and time
columns, it is therefore recommended to assign an import spec that tells
the engine how to read the date or time. See the import_spec
documentation for additional details.
xlsx: For Excel files produced with the current version of Excel. Excel provides more data type information than csv, but it is not as accurate as rds. Therefore, you may also need to provide import specifications with Excel files. Also note that currently the xlsx import engine will only import the first sheet of an Excel workbook. If you need to import a sheet that is not the first sheet, use a different package to import the data.
xls: An Excel file format used between 1997 and 2003, and still used in some organizations. As with xlsx, this file format provides more information than csv, but is not entirely reliable. Therefore, you may need to provide import specifications to the xls engine. Also note that the xls engine can read, but not write xls files. Any xls files read with the xls engine will be written as an xlsx file. Like the xlsx engine, the xls engine can only read the first sheet of a workbook.
sas7bdat: Handles SAS® datasets. SAS® datasets provide better type information than either csv or Excel. In most cases, you will not need to define import specifications for SAS® datasets. The sas7bdat engine interprets empty strings, single blanks, and a single dot (".") as missing values. While the import of SAS® datasets is fairly reliable, sas7bdat files exported with the sas7bdat engine sometimes cannot be read by SAS® software. In these cases, it is recommended to export to another file format, such as csv or dbf, and then import into SAS®.
xpt: The SAS® transport file engine. Transport format is a platform independent file format. Similar to SAS® datasets, it provides data type information. In most cases, you will not need to define import specifications. The xpt engine also interprets empty strings, single blanks, and a single dot (".") as missing values.
dbf: The DBASE file format engine. The DBASE engine was added to the libr package because many types of software can read and write in DBASE format reliably. Therefore it is a useful file format for interchange between software systems. The DBASE file format contains type information.
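To illustrate choosing an engine other than the default, here is a minimal sketch that writes a csv file with base R and loads it through the csv engine. The directory name "libr_csv_demo" and the library name csvlib are hypothetical. The column classes are printed rather than asserted, since they depend on what the engine guesses.

```r
library(libr)

# Hypothetical scratch directory, used only for this sketch
tmp <- file.path(tempdir(), "libr_csv_demo")
dir.create(tmp, showWarnings = FALSE)

# Write a csv file whose first row holds the column names
write.csv(trees, file.path(tmp, "trees.csv"), row.names = FALSE)

# Assign the csv engine to the library
libname(csvlib, tmp, engine = "csv", quiet = TRUE)

# Inspect the data types the engine guessed for each column
csv_classes <- sapply(csvlib$trees, class)
print(csv_classes)

# Clean up: removes the library and its files
lib_delete(csvlib)
```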
If you wish to import only a portion of your data files into a library,
you may accomplish it with the
filter parameter. The filter
parameter allows you to pass a vector of strings corresponding to the
names of the files you want to import. The function allows a
wild-card (*) for partial matching. For example,
"te*" means any
file name that begins with "te", and
"*st" means any file name
that ends with "st".
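The filtering described above might look like the following sketch. The directory name "libr_filter_demo" and the filter strings are hypothetical. Under the wild-card rules described here, "tr*" should match trees and "*ks" should match rocks, leaving beaver1 out.

```r
library(libr)

# Hypothetical scratch directory, used only for this sketch
tmp <- file.path(tempdir(), "libr_filter_demo")
dir.create(tmp, showWarnings = FALSE)

# Save three rds files
saveRDS(trees, file.path(tmp, "trees.rds"))
saveRDS(rock, file.path(tmp, "rocks.rds"))
saveRDS(beaver1, file.path(tmp, "beaver1.rds"))

# Import only the files whose names match one of the filter strings
libname(dat, tmp, filter = c("tr*", "*ks"), quiet = TRUE)

# Record which items were loaded
loaded <- names(dat)
print(loaded)

# Clean up the library, then remove the scratch directory
# together with any file that was filtered out
lib_delete(dat)
unlink(tmp, recursive = TRUE)
```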
In most cases, it is not necessary to specify the data types for incoming columns in your data. Either the file format will preserve the appropriate data type information, or the assigned engine will guess correctly.
However, in some cases it will be necessary to control the column data types.
For these cases, use the
import_specs parameter. The
import_specs parameter allows you
to specify the data types by data set and column name. All the data type
specifications are contained within a
specs collection, and the
specifications for a particular data set are defined by an
import_spec function. See the
import_spec documentation for further information
and examples of defining an import spec.
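As a hedged sketch of how an import spec can be passed to libname, the example below writes a small csv file with a date column in month/day/year form and tells the csv engine how to read it. The data frame visits, its columns, and the directory name "libr_specs_demo" are invented for illustration. The "date=" format string follows the convention described in the import_spec documentation.

```r
library(libr)

# Hypothetical scratch directory, used only for this sketch
tmp <- file.path(tempdir(), "libr_specs_demo")
dir.create(tmp, showWarnings = FALSE)

# Invented sample data with dates stored as month/day/year text
visits <- data.frame(
  id = 1:3,
  visit_date = c("01/15/2020", "02/20/2020", "03/25/2020")
)
write.csv(visits, file.path(tmp, "visits.csv"), row.names = FALSE)

# One import_spec per data set, named after the file (minus extension)
spcs <- specs(
  visits = import_spec(id = "integer", visit_date = "date=%m/%d/%Y")
)

# Load the directory with the csv engine and the import specs
libname(vis, tmp, engine = "csv", import_specs = spcs, quiet = TRUE)

# Check how the date column was read
date_class <- class(vis$visits$visit_date)
print(date_class)

# Clean up: removes the library and its files
lib_delete(vis)
```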
library(libr)

# Create temp directory
tmp <- tempdir()

# Save some data to temp directory
# for illustration purposes
saveRDS(trees, file.path(tmp, "trees.rds"))
saveRDS(rock, file.path(tmp, "rocks.rds"))
saveRDS(beaver1, file.path(tmp, "beaver1.rds"))

# Create data library
libname(dat, tmp)

# # library 'dat': 3 items
# - attributes: rds not loaded
# - path: C:\Users\User\AppData\Local\Temp\RtmpklJcfl
# - items:
#      Name Extension Rows Cols   Size        LastModified
# 1 beaver1       rds  114    4 5.9 Kb 2020-12-06 15:21:30
# 2   rocks       rds   48    4 3.6 Kb 2020-12-06 15:21:30
# 3   trees       rds   31    3 2.9 Kb 2020-12-06 15:21:30

# Print dictionary for library
dictionary(dat)

# A tibble: 11 x 10
#    Name    Column Class   Label Description Format Width Justify  Rows   NAs
#    <chr>   <chr>  <chr>   <chr> <chr>       <lgl>  <lgl> <chr>   <int> <int>
#  1 beaver1 day    numeric NA    NA          NA     NA    NA        114     0
#  2 beaver1 time   numeric NA    NA          NA     NA    NA        114     0
#  3 beaver1 temp   numeric NA    NA          NA     NA    NA        114     0
#  4 beaver1 activ  numeric NA    NA          NA     NA    NA        114     0
#  5 rocks   area   integer NA    NA          NA     NA    NA         48     0
#  6 rocks   peri   numeric NA    NA          NA     NA    NA         48     0
#  7 rocks   shape  numeric NA    NA          NA     NA    NA         48     0
#  8 rocks   perm   numeric NA    NA          NA     NA    NA         48     0
#  9 trees   Girth  numeric NA    NA          NA     NA    NA         31     0
# 10 trees   Height numeric NA    NA          NA     NA    NA         31     0
# 11 trees   Volume numeric NA    NA          NA     NA    NA         31     0

# Load library into workspace
lib_load(dat)

# Print summaries for each data frame
# Note that once loaded into the workspace,
# data can be accessed using two-level syntax.
summary(dat.rocks)
summary(dat.trees)
summary(dat.beaver1)

# Unload from workspace
lib_unload(dat)

# Clean up
lib_delete(dat)