Create an unregistered, in-memory Dataset from delimited files. Use this method to read delimited text files when you want to control the options used.

create_tabular_dataset_from_delimited_files(
  path,
  validate = TRUE,
  include_path = FALSE,
  infer_column_types = TRUE,
  set_column_types = NULL,
  separator = ",",
  header = TRUE,
  partition_format = NULL,
  support_multi_line = FALSE,
  empty_as_string = FALSE
)

Arguments

path

A data path in a registered datastore, a local path, or an HTTP URL.

validate

Boolean indicating whether to validate that data can be loaded from the returned dataset. Defaults to TRUE. Validation requires that the data source is accessible from the current compute.

include_path

Whether to include a column containing the path of the file from which the data was read. Defaults to FALSE. This is useful when you are reading multiple files and want to know which file a particular record originated from, or when the file path itself carries information you want to keep.

infer_column_types

Whether to infer column data types. Defaults to TRUE.

set_column_types

A named list that sets column data types, where each name is a column name and each value is the data type for that column. Defaults to NULL.

separator

The separator used to split columns. Defaults to ",".

header

Controls how column headers are promoted when reading from files. Defaults to TRUE, which assumes that all files have the same header. When header = FALSE, files are read as having no header. More options can be specified using PromoteHeadersBehavior.

partition_format

Specifies the partition format of the path. The format part '{x}' creates a string column named 'x', and the format part '{x:yyyy/MM/dd/HH/mm/ss}' creates a datetime column named 'x', where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' are used to extract the year, month, day, hour, minute and second for the datetime type. The format should start from the position of the first partition key and continue to the end of the file path. For example, given the file path '../USA/2019/01/01/data.csv', where the data is partitioned by country and time, the format '/{Country}/{PartitionDate:yyyy/MM/dd}/data.csv' creates a column 'Country' of string type and a column 'PartitionDate' of datetime type. Defaults to NULL.
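
The mapping from path segments to columns in the example above can be illustrated with a small base R sketch. This is not how the SDK parses partition formats; the regular expression is a hand-written stand-in for the example format, and the variable names are chosen for illustration only.

```r
# Illustrative only: mimic the example partition format with a regular
# expression to show which path segments end up in which column.
path <- "../USA/2019/01/01/data.csv"

# One capture group for the country, three for year, month and day.
pattern <- "/([^/]+)/([0-9]{4})/([0-9]{2})/([0-9]{2})/data\\.csv$"
m <- regmatches(path, regexec(pattern, path))[[1]]

# m[1] is the full match; the capture groups follow.
country <- m[2]
partition_date <- as.POSIXct(paste(m[3], m[4], m[5], sep = "-"), tz = "UTC")

# The two values correspond to the 'Country' (string) and
# 'PartitionDate' (datetime) columns described above.
partition_columns <- data.frame(
  Country = country,
  PartitionDate = partition_date,
  stringsAsFactors = FALSE
)
```

Note that the match starts at the first partition key ('USA'), which mirrors the requirement that the format begin at the position of the first partition key.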

support_multi_line

By default (support_multi_line=FALSE), all line breaks, including those in quoted field values, will be interpreted as a record break. Reading data this way is faster and more optimized for parallel execution on multiple CPU cores. However, it may result in silently producing more records with misaligned field values. This should be set to TRUE when the delimited files are known to contain quoted line breaks.
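
The effect of quoted line breaks can be seen in a small base R sketch. It does not use the SDK reader; it only contrasts splitting on every line break (analogous to the default behaviour described above) with a quote-aware parse, using made-up sample data.

```r
# Sample data with a line break inside a quoted field (illustrative only).
csv_text <- 'id,comment\n1,"first line\nsecond line"\n2,"single line"\n'

# Treat every line break as a record break, then drop the header line.
naive_records <- strsplit(csv_text, "\n", fixed = TRUE)[[1]][-1]

# Quote-aware parse: the line break inside the quotes stays in the field.
parsed <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# The naive split yields one more record than the quote-aware parse,
# and the extra record has its fields misaligned.
extra_records <- length(naive_records) - nrow(parsed)
```

If a check like this on a sample of your files shows extra records, that suggests setting support_multi_line = TRUE.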

empty_as_string

Specify if empty field values should be loaded as empty strings. The default (FALSE) will read empty field values as nulls. Passing this as TRUE will read empty field values as empty strings. If the values are converted to numeric or datetime then this has no effect, as empty values will be converted to nulls.

Value

The Tabular Dataset object.

See also

data_path
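
Examples

The following is a minimal sketch of a typical call, assuming a configured workspace whose default datastore contains a comma-separated file. The helper name load_sales_dataset, the datastore path "sales/orders.csv" and the column names amount and notes are hypothetical and only serve the illustration.

```r
# Hypothetical helper: builds an unregistered tabular dataset from a CSV file
# on the workspace's default datastore. The path and column names are
# assumptions made for this example.
load_sales_dataset <- function(ws) {
  datastore <- azuremlsdk::get_default_datastore(ws)
  csv_path <- azuremlsdk::data_path(datastore, "sales/orders.csv")

  azuremlsdk::create_tabular_dataset_from_delimited_files(
    path = csv_path,
    validate = TRUE,
    include_path = FALSE,
    infer_column_types = TRUE,
    # Override inference for two (hypothetical) columns.
    set_column_types = list(
      amount = azuremlsdk::data_type_double(),
      notes = azuremlsdk::data_type_string()
    ),
    separator = ",",
    header = TRUE,
    # Assumes the notes column may contain quoted line breaks.
    support_multi_line = TRUE,
    empty_as_string = FALSE
  )
}

# Usage (requires the azuremlsdk package and a workspace config file):
# ws <- azuremlsdk::load_workspace_from_config()
# sales <- load_sales_dataset(ws)
# df <- azuremlsdk::load_dataset_into_data_frame(sales)
```

Wrapping the call in a function keeps the workspace lookup separate from the read options, so the same options can be reused across workspaces.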