Releases: hubverse-org/hubData
hubData 2.2.1
- Fixed bug in `collect_hub()` that caused returned tibbles to contain arrow-ALTREP-backed columns rather than plain materialised R vectors (#141). The ALTREP views could not be safely `save()`d, `serialize()`d, or sent to parallel workers in R sessions without `arrow` installed, silently corrupting columns (most came back length-0). `collect_hub()` now sets `arrow.use_altrep = FALSE` for the duration of the call, restoring the long-standing `dplyr::collect()` contract of materialising into plain in-memory R vectors. Users who want lazy / ALTREP-friendly access can continue to use `connect_hub()` and work with the arrow dataset directly.
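The idea of setting an option "for the duration of the call" can be sketched in base R as follows. This is a minimal illustration, not hubData's actual implementation: `with_altrep_disabled()` is a hypothetical helper, and the sketch assumes that arrow consults the `arrow.use_altrep` option at the moment data are materialised.

```r
# Hypothetical helper (not part of hubData): evaluate `expr` with the
# `arrow.use_altrep` option set to FALSE, then restore the previous value.
with_altrep_disabled <- function(expr) {
  old <- options(arrow.use_altrep = FALSE)
  on.exit(options(old), add = TRUE)  # restore even if `expr` errors
  force(expr)                        # `expr` is evaluated lazily, i.e. here
}

# Observe the option before, during and after the scoped call.
before <- getOption("arrow.use_altrep")
inside <- with_altrep_disabled(getOption("arrow.use_altrep"))
after <- getOption("arrow.use_altrep")
```

In real use the expression passed in would be a collection step such as `dplyr::collect()` on an arrow query, so the option change never leaks into the rest of the session.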
hubData 2.2.0
- Fixed bug in `load_model_metadata()` that caused array-valued top-level fields in model metadata YAML files to parse incorrectly or fail to parse at all (#137). Array-valued top-level fields are now translated into list columns, per the function specification. Note: The bug also meant that length-1 top-level arrays parsed equivalently to top-level scalars, e.g. both `y: "x"` and `y: ["x"]` parsed as a character column `y` with value `x`. With the fix, those entries parse differently: `y: "x"` parses as a character column; `y: ["x"]` parses as a list column. Thanks @dylanhmorris.
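The practical difference between the two column types can be illustrated with plain data frames. This sketch does not call `load_model_metadata()`. The data frames, the `model_id` value and the unwrapping step are illustrative stand-ins for what downstream code might encounter.

```r
# One row per model; `y` stands in for any top-level metadata field.
scalar_meta <- data.frame(model_id = "team-model")
scalar_meta$y <- "x"        # shape expected from `y: "x"`: character column

array_meta <- data.frame(model_id = "team-model")
array_meta$y <- list("x")   # shape expected from `y: ["x"]`: list column

# Code that previously assumed a character column can unwrap
# length-1 list entries explicitly.
unwrapped <- vapply(array_meta$y, function(v) v[[1]], character(1))
```

Code that indexed such a field as a character vector may therefore need an explicit unwrapping step like the one above after upgrading.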
hubData 2.1.0
New features and improvements
- Added `date_col` parameter to `create_oracle_output_schema()` and `connect_target_oracle_output()` to match time-series functionality (#121). In inference mode (when `target-data.json` is not present), users can now specify which column contains date information to ensure it's typed as `date32()`. The parameter is ignored in v6+ config mode where date columns are defined in the configuration.
Bug fixes and minor improvements
- `create_timeseries_schema()` and `create_oracle_output_schema()` now issue a warning instead of an error when no date column is detected during schema inference (when `target-data.json` config is not present) (#119). This allows schema creation to succeed and enables validation tools to properly catch and report missing date columns with better error messages.
hubData 2.0.0
Breaking changes
- BREAKING: Changed default value of `skip_checks` parameter from `FALSE` to `TRUE` in `connect_hub()` (#114). This significantly improves performance, especially for large cloud hubs (AWS S3, GCS), by skipping file validation checks that require high I/O operations. The previous default behavior of detecting and excluding invalid files can still be accessed by explicitly setting `skip_checks = FALSE`. This change aligns with the Python hubdata package default and reflects that hubs validated through hubValidations should not require additional file checks. Users with model output directories containing invalid files should either:
  - Use the `ignore_files` argument to exclude specific files, or
  - Set `skip_checks = FALSE` explicitly, or
  - Ensure their model-output directories contain only valid model output files.
- Note: `connect_model_output()` retains its default of `skip_checks = FALSE` as it is designed for working with model output directories that may be in draft form.
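For the `ignore_files` option, prefix-based exclusion can be pictured with the base R sketch below. `drop_ignored()` and the example paths are hypothetical, and matching against the base file name is an assumption. hubData's internal matching may differ in detail.

```r
# Hypothetical helper, not hubData's implementation: drop every path whose
# file name starts with one of the given prefixes.
drop_ignored <- function(paths, ignore_files) {
  file_names <- basename(paths)
  ignored <- vapply(
    file_names,
    function(nm) any(startsWith(nm, ignore_files)),
    logical(1),
    USE.NAMES = FALSE
  )
  paths[!ignored]
}

# Illustrative paths only.
paths <- c(
  "model-output/team-model/2024-01-06-team-model.csv",
  "model-output/team-model/README.md",
  "model-output/team-model/draft-2024-01-13.csv"
)
kept <- drop_ignored(paths, ignore_files = c("README", "draft"))
```

With hubData itself, the equivalent choices would be along the lines of `connect_hub(hub_path, ignore_files = c("README", "draft"))` or `connect_hub(hub_path, skip_checks = FALSE)`, where `hub_path` is a placeholder for the path to a hub.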
New features and improvements
- Added comprehensive "Accessing Target Data" vignette demonstrating how to use `connect_target_timeseries()` and `connect_target_oracle_output()` to access target data, including filtering, joining with model outputs, and working with cloud-based hubs (#108).
- Added `r_to_arrow_datatypes()` function providing an inverse mapping from R data types to Arrow data types, enabling vectorized type conversion when processing `target-data.json` configurations (#107).
- Enhanced `create_timeseries_schema()` and `create_oracle_output_schema()` to support config-based schema creation when `target-data.json` (v6.0.0+) is present (#107). This enables fast, deterministic schema creation without filesystem I/O, especially beneficial for cloud storage. Functions automatically fall back to inference-based schema creation for pre-v6 hubs or hubs without `target-data.json`, maintaining backward compatibility. This functionality is propagated to `connect_target_timeseries()` and `connect_target_oracle_output()`, which use these schema creation functions internally.
- Enhanced documentation for `connect_target_timeseries()` and `connect_target_oracle_output()` to clarify column ordering behavior: v6+ Parquet files are reordered to hubverse convention, while CSV files preserve original ordering to avoid column name/position mismatches during collection (#107).
- Added `get_target_data_colnames()` function for extracting and ordering expected column names for target data from `target-data.json` configuration files (#109).
hubData 1.5.0
- Added Arrow schema utilities for safely converting and validating column types from `arrow::Schema` objects:
  - `as_r_schema()`: Converts an Arrow schema to a named character vector of equivalent R types (e.g., `"int32"` → `"integer"`). Errors on unsupported types.
  - `arrow_schema_to_string()`: Extracts the raw Arrow type strings for each field in a schema.
  - `is_supported_arrow_type()`: Returns a named logical vector indicating which schema fields have supported types.
  - `validate_arrow_schema()`: Validates that all field types in an Arrow schema are supported. Throws a helpful error otherwise.
- Added `arrow_to_r_datatypes`, a named character vector defining the mapping of safe and portable Arrow types to their R equivalents.
- Added `r_schema` argument to `create_timeseries_schema()` and `create_oracle_output_schema()` functions to enable returning the schema as a vector of R data types instead of an `arrow::Schema` object (#95).
- Added `output_type_id_datatype` argument to `create_oracle_output_schema()` and `connect_target_oracle_output()` functions to allow users to explicitly specify the data type of the `output_type_id` column in the schema. This ensures compatibility with `create_hub_schema()` and `connect_hub()` (#95).
- (Internal) Refactored target data schema and connection tests to use embedded example hubs and reusable schema fixtures, improving reliability and making tests independent of dataset size and ordering.
- Added utilities for working with hive-partitioned data file paths:
  - `extract_hive_partitions()` for extracting key value pairs from paths to hive-partitioned data files.
  - `is_hive_partitioned_path()` for checking if a path is hive-partitioned.
- `create_oracle_output_schema()` and `create_timeseries_schema()` now define a schema for hive-partitions whose data types are defined in the `tasks.json` config (#89).
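As a rough picture of what extracting key value pairs from a hive-partitioned path involves, the base R sketch below parses `key=value` directory segments. Both helpers and the example path are hypothetical. They are not the hubData functions named above and make no claim about their return types or edge-case handling.

```r
# Hypothetical helper: return the `key=value` segments of a path as a
# named character vector (names are keys, elements are values).
parse_hive_path <- function(path) {
  segments <- strsplit(path, "/", fixed = TRUE)[[1]]
  pairs <- segments[grepl("^[^=]+=", segments)]
  keys <- sub("=.*$", "", pairs)
  values <- sub("^[^=]+=", "", pairs)
  stats::setNames(values, keys)
}

# Hypothetical helper: a path counts as hive-partitioned if it has at
# least one `key=value` segment.
has_hive_partitions <- function(path) {
  length(parse_hive_path(path)) > 0
}

# Illustrative path only.
parts <- parse_hive_path(
  "target-data/oracle-output/target=flu/date=2024-01-06/part-0.parquet"
)
```

Values extracted this way are always strings, which is why a schema for the partition columns is useful for typing them correctly.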
hubData 1.4.0
- Added `connect_target_timeseries()` function (experimental) for accessing time-series target data from a hub (#71). This includes accessing target data from cloud hubs (#75).
- Added `create_timeseries_schema()` function for creating a schema for time-series target data (#71).
- Added `connect_target_oracle_output()` function (experimental) for accessing oracle-output target data from a hub (#72). This includes accessing target data from cloud hubs (#76).
- Added `create_oracle_output_schema()` function for creating a schema for oracle-output target data (#72).
- Added `get_target_path()` function for retrieving the path to the appropriate target data file or directory in a hub.
- Added `get_target_file_ext()` function for retrieving the file extensions of target data file(s) in a hub.
- Added `get_s3_bucket_name()` for extracting the bucket name of a cloud enabled hub from a hub's config (#75).
- Added `na` argument to `connect_hub()`, `connect_model_output()`, `connect_target_timeseries()`, `connect_target_oracle_output()`, `create_timeseries_schema()`, and `create_oracle_output_schema()` to allow for the specification of how to handle missing values in CSV files. The default is to use `NA` or `""`, but users can restrict this to `""` (empty string) when needing to include character `"NA"` values in their CSV data (#80). Note this approach only works if `NA` values are written to the CSV file as `""` (empty string) and not as `NA` or `"NA"`.
- Added `ignore_files` argument to `connect_hub()` and `connect_model_output()` to allow users to specify a vector of file name prefixes to ignore when scanning the hub's model output directory for files. This is useful for excluding files that are not relevant to the hub's model output, such as README files or other documentation as well as potentially invalid files (#87). The feature is also used internally in `connect_hub()` to enable skipping expensive file validity checks when connecting to cloud-based hubs with multiple file formats using `skip_checks = TRUE`.
- Refactored `connect_hub()` and `connect_model_output()` internally to reduce the number of calls to cloud hubs, improving performance when connecting to cloud-based hubs.
- Added `ignore_files` argument to `connect_target_oracle_output()`, `connect_target_timeseries()`, `create_timeseries_schema()`, and `create_oracle_output_schema()` to allow users to specify a vector of file name prefixes to ignore when scanning the hub's target data directory for files (#87).
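The effect of the `na` argument described above can be illustrated with a base R analogue. The sketch uses `read.csv()` and its `na.strings` argument rather than hubData, whose CSV reading works through arrow datasets, so it only shows the idea. The CSV content is made up for the example.

```r
# Illustrative CSV: the first location is the literal string "NA",
# the third location is genuinely missing (empty field).
csv_text <- "location,value\nNA,1\nUS,2\n,3\n"

# Treat both "NA" and "" as missing: the literal "NA" is lost.
default_na <- read.csv(
  text = csv_text, na.strings = c("NA", ""), stringsAsFactors = FALSE
)

# Treat only "" as missing: the literal "NA" survives as a string.
empty_only <- read.csv(
  text = csv_text, na.strings = "", stringsAsFactors = FALSE
)
```

With hubData, the corresponding choice would be made through the `na` argument, for example `connect_hub(hub_path, na = "")`, where `hub_path` is a placeholder for the path to a hub.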
hubData 1.3.0 - Support v4.0.0 schema
- Support the determination of hub schema from v4 configuration files (#63).
- Also fixes bug in `create_hub_schema()` where `output_type_id` data type was being incorrectly auto-determined as `logical` when only point estimate output types were being collected by a hub. Now `character` data type is returned for the `output_type_id` for all schema versions in such situations when auto-determined.
hubData 1.2.3
- Fix bug in `create_hub_schema()` where `output_type_id` data type was being incorrectly determined as `Date` instead of `character` (Reported in reichlab/variant-nowcast-hub#87 (comment) and fixed in #58).
Full Changelog: v1.2.2...v1.2.3
v1.2.2
What's Changed
- Split hubutils - Release v0.0.1 by @annakrystalli in #1
- Remote refs from hubverse remotes by @annakrystalli in #12
- Replace all null task id properties with required = NA by @annakrystalli in #19
- Add collect_hub function by @annakrystalli in #18
- use dev version of arrow until fixed version v16 released on CRAN by @annakrystalli in #23
- fix broken vignette link by @nickreich in #28
- add load_forecasts_zoltar() function by @matthewcornell in #36
- Change org name by @annakrystalli in #42
- Support v3.0.0 schema sample specification by @annakrystalli in #31
- Defunct expand_model_out_val_grid and create_model_out_submit_tmpl and move to hubValidations by @annakrystalli in #46
- 44/ Configure output_type_id from config by @annakrystalli in #49
- fix tidyselect warnings by @zkamvar in #51
- Bsweger/skip data checks option by @bsweger in #47
- Hotfix/zoltr dep by @annakrystalli in #53
- change hubdocs URL to hubverse.io by @zkamvar in #52
- Hotfix / Use CRAN arrow by @annakrystalli in #55
New Contributors
- @annakrystalli made their first contribution in #1
- @nickreich made their first contribution in #28
- @matthewcornell made their first contribution in #36
- @zkamvar made their first contribution in #51
- @bsweger made their first contribution in #47
Full Changelog: https://github.com/hubverse-org/hubData/commits/v1.2.2