R/datasets.R
create_tabular_dataset_from_delimited_files.Rd
Create an unregistered, in-memory Dataset from delimited files. Use this method to read delimited text files when you want to control the options used.
create_tabular_dataset_from_delimited_files( path, validate = TRUE, include_path = FALSE, infer_column_types = TRUE, set_column_types = NULL, separator = ",", header = TRUE, partition_format = NULL, support_multi_line = FALSE, empty_as_string = FALSE )
path | A data path in a registered datastore, a local path, or an HTTP URL. |
---|---|
validate | Whether to validate that data can be loaded from the returned dataset. Defaults to TRUE. Validation requires that the data source is accessible from the current compute. |
include_path | Whether to include a column containing the path of the file from which the data was read. This is useful when you are reading multiple files and want to know which file a particular record originated from, or when the file path itself carries useful information. |
infer_column_types | Indicates whether column data types are inferred. |
set_column_types | A named list that sets column data types, where each name is a column name and each value is a data type. |
separator | The separator used to split columns. |
header | Controls how column headers are promoted when reading from files. Defaults to TRUE, which assumes all files have the same header. When header = FALSE, files are read as having no header. More options can be specified using |
partition_format | Specifies the partition format of the path. A format part '{x}' creates a string column and '{x:yyyy/MM/dd/HH/mm/ss}' creates a datetime column, where 'yyyy', 'MM', 'dd', 'HH', 'mm' and 'ss' are used to extract the year, month, day, hour, minute and second for the datetime type. The format should start from the position of the first partition key and run to the end of the file path. For example, given a file path '../USA/2019/01/01/data.csv' where the data is partitioned by country and time, we can define '/{Country}/{PartitionDate:yyyy/MM/dd}/data.csv' to create a 'Country' column of string type and a 'PartitionDate' column of datetime type. |
support_multi_line | By default (support_multi_line=FALSE), all line breaks, including those in quoted field values, will be interpreted as a record break. Reading data this way is faster and more optimized for parallel execution on multiple CPU cores. However, it may result in silently producing more records with misaligned field values. This should be set to TRUE when the delimited files are known to contain quoted line breaks. |
empty_as_string | Specify if empty field values should be loaded as empty strings. The default (FALSE) will read empty field values as nulls. Passing this as TRUE will read empty field values as empty strings. If the values are converted to numeric or datetime then this has no effect, as empty values will be converted to nulls. |
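
The trade-off described for support_multi_line can be illustrated without Azure at all. The sketch below uses only base R and a small made-up CSV string (the `csv_text` content is an assumption for illustration). It compares a line-oriented record count, which treats every line break as a record break, with a quote-aware parse by `read.csv`. It does not call azuremlsdk and is not how the service reads files; it only shows why quoted line breaks inflate the record count when they are not honored.

```r
# Hypothetical in-memory file: the first record's "note" field contains a
# line break inside double quotes.
csv_text <- "id,note\n1,\"first line\nsecond line\"\n2,plain\n"

# Line-oriented view: every line break ends a record (header line excluded).
physical_lines <- strsplit(csv_text, "\n", fixed = TRUE)[[1]]
naive_records <- length(physical_lines) - 1L

# Quote-aware view: read.csv keeps the quoted line break inside the field.
parsed <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# The two counts differ: the line-oriented view sees one extra, misaligned record.
cat("line-oriented records:", naive_records, "\n")
cat("quote-aware records:  ", nrow(parsed), "\n")
```

When the line-oriented count is larger than the quote-aware count for a sample of your files, that is a hint that support_multi_line = TRUE is needed.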
The Tabular Dataset object.
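
A minimal usage sketch follows, under several assumptions: the azuremlsdk package is installed, a workspace `config.json` is discoverable, and the default datastore holds files laid out like `weather/USA/2019/01/01/data.csv`. The folder name `weather`, the glob pattern, the `temperature` column, and the helper name `load_partitioned_csv` are all hypothetical. The brace syntax in `partition_format` follows the convention of the underlying Python SDK and should be checked against your installed version. The calls are wrapped in a function so that nothing contacts Azure until it is invoked.

```r
# Sketch only: requires azuremlsdk and an accessible Azure ML workspace.
# Defining the function has no side effects; call it to run the steps.
load_partitioned_csv <- function(path_on_datastore = "weather/*/*/*/*/data.csv") {
  # Assumes a workspace config.json can be found from the working directory.
  ws <- azuremlsdk::load_workspace_from_config()
  datastore <- azuremlsdk::get_default_datastore(ws)

  dataset <- azuremlsdk::create_tabular_dataset_from_delimited_files(
    path = azuremlsdk::data_path(datastore, path_on_datastore),
    separator = ",",
    header = TRUE,
    include_path = TRUE,  # keep the source file path as a column
    # Hypothetical layout: <Country>/<yyyy>/<MM>/<dd>/data.csv under "weather".
    partition_format = "/{Country}/{PartitionDate:yyyy/MM/dd}/data.csv",
    # Hypothetical column name; overrides inference for that column only.
    set_column_types = list(temperature = azuremlsdk::data_type_double()),
    support_multi_line = FALSE
  )

  # Materialize the unregistered dataset as a local data frame.
  azuremlsdk::load_dataset_into_data_frame(dataset)
}
```

Calling `df <- load_partitioned_csv()` would then be expected to return a data frame with the file columns plus the `Country` and `PartitionDate` partition columns, provided the assumed layout matches the datastore.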
data_path