Releases · hubverse-org/hubData

15 May 13:16

annakrystalli

v2.2.1

73a329b

hubData 2.2.1 Latest

Fixed bug in collect_hub() that caused returned tibbles to contain arrow-ALTREP-backed columns rather than plain materialised R vectors (#141). The ALTREP views could not be safely save()d, serialize()d, or sent to parallel workers in R sessions without arrow installed, silently corrupting columns (most came back length-0). collect_hub() now sets arrow.use_altrep = FALSE for the duration of the call, restoring the long-standing dplyr::collect() contract of materialising into plain in-memory R vectors. Users who want lazy / ALTREP-friendly access can continue to use connect_hub() and work with the arrow dataset directly.

07 May 12:14

annakrystalli

v2.2.0

49b7e0a

hubData 2.2.0

Fixed bug in load_model_metadata() that caused array-valued top-level fields in model metadata YAML files to parse incorrectly or fail to parse at all (#137). Array-valued top-level fields are now translated into list columns, per the function specification. Note: The bug also meant that length-1 top-level arrays parsed equivalently to top-level scalars, e.g. both y: "x" and y: ["x"] parsed as a character column y with value x. With the fix, those entries parse differently: y: "x" parses as a character column; y: ["x"] parses as a list column. Thanks @dylanhmorris.
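
The scalar-versus-array distinction can be illustrated outside R. The following is a minimal Python analogy, not hubData's implementation: it assumes the YAML has already been parsed into plain Python values, and `column_kind` is a hypothetical helper introduced only for this sketch.

```python
def column_kind(value):
    """Classify a parsed top-level metadata value (hypothetical helper).

    A YAML sequence arrives as a Python list even when it holds a single
    element, so it is reported as a list column rather than being
    collapsed to a scalar.
    """
    if isinstance(value, list):
        return "list"
    if isinstance(value, str):
        return "character"
    return "other scalar"


# y: "x"   -> parsed as the string "x"
# y: ["x"] -> parsed as the one-element list ["x"]
scalar_kind = column_kind("x")
array_kind = column_kind(["x"])
```

Under these assumptions the two spellings no longer collapse to the same kind, which mirrors the post-fix behaviour described above.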

Contributors

dylanhmorris

13 Jan 14:23

annakrystalli

v2.1.0

34245de

hubData 2.1.0

New features and improvements

Added date_col parameter to create_oracle_output_schema() and connect_target_oracle_output() to match time-series functionality (#121). In inference mode (when target-data.json is not present), users can now specify which column contains date information to ensure it's typed as date32(). The parameter is ignored in v6+ config mode where date columns are defined in the configuration.

Bug fixes and minor improvements

create_timeseries_schema() and create_oracle_output_schema() now issue a warning instead of an error when no date column is detected during schema inference (when target-data.json config is not present) (#119). This allows schema creation to succeed and enables validation tools to properly catch and report missing date columns with better error messages.

20 Nov 13:12

annakrystalli

v2.0.0

fd6d6e0

hubData 2.0.0

Breaking changes

BREAKING: Changed default value of skip_checks parameter from FALSE to TRUE in connect_hub() (#114). This significantly improves performance, especially for large cloud hubs (AWS S3, GCS), by skipping file validation checks that require high I/O operations. The previous default behavior of detecting and excluding invalid files can still be accessed by explicitly setting skip_checks = FALSE. This change aligns with the Python hubdata package default and reflects that hubs validated through hubValidations should not require additional file checks. Users with model output directories containing invalid files should either:
- Use the ignore_files argument to exclude specific files, or
- Set skip_checks = FALSE explicitly, or
- Ensure their model-output directories contain only valid model output files
Note: connect_model_output() retains its default of skip_checks = FALSE as it is designed for working with model output directories that may be in draft form.
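
The 1.4.0 notes describe ignore_files as a vector of file name prefixes. As a rough illustration of that prefix matching, here is a minimal Python sketch; it is an analogy rather than hubData's R implementation, and the helper name `drop_ignored`, its default value, and the example paths are all hypothetical.

```python
from pathlib import PurePosixPath


def drop_ignored(paths, ignore_files=("README",)):
    """Keep only paths whose file name does not start with an ignored prefix.

    `ignore_files` and its default are illustrative assumptions.
    """
    prefixes = tuple(ignore_files)
    return [
        path
        for path in paths
        if not PurePosixPath(path).name.startswith(prefixes)
    ]


# Hypothetical listing of a model output directory.
listing = [
    "model-output/README.md",
    "model-output/team-a/2024-01-06-team-a.csv",
]
kept = drop_ignored(listing)
```

Matching is done on the file name only, so a directory whose name happens to start with an ignored prefix does not exclude the files inside it in this sketch.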

New features and improvements

Added comprehensive "Accessing Target Data" vignette demonstrating how to use connect_target_timeseries() and connect_target_oracle_output() to access target data, including filtering, joining with model outputs, and working with cloud-based hubs (#108).
Added r_to_arrow_datatypes() function providing an inverse mapping from R data types to Arrow data types, enabling vectorized type conversion when processing target-data.json configurations (#107).
Enhanced create_timeseries_schema() and create_oracle_output_schema() to support config-based schema creation when target-data.json (v6.0.0+) is present (#107). This enables fast, deterministic schema creation without filesystem I/O, especially beneficial for cloud storage. Functions automatically fall back to inference-based schema creation for pre-v6 hubs or hubs without target-data.json, maintaining backward compatibility. This functionality is propagated to connect_target_timeseries() and connect_target_oracle_output(), which use these schema creation functions internally.
Enhanced documentation for connect_target_timeseries() and connect_target_oracle_output() to clarify column ordering behavior: v6+ Parquet files are reordered to hubverse convention, while CSV files preserve original ordering to avoid column name/position mismatches during collection (#107).
Added get_target_data_colnames() function for extracting and ordering expected column names for target data from target-data.json configuration files (#109).

23 Sep 16:38

annakrystalli

v1.5.0

3bda7d0

hubData 1.5.0

Added Arrow schema utilities for safely converting and validating column types from arrow::Schema objects:
- as_r_schema(): Converts an Arrow schema to a named character vector of equivalent R types (e.g., "int32" → "integer"). Errors on unsupported types.
- arrow_schema_to_string(): Extracts the raw Arrow type strings for each field in a schema.
- is_supported_arrow_type(): Returns a named logical vector indicating which schema fields have supported types.
- validate_arrow_schema(): Validates that all field types in an Arrow schema are supported. Throws a helpful error otherwise.
Added arrow_to_r_datatypes, a named character vector defining the mapping of safe and portable Arrow types to their R equivalents.
Added r_schema argument to create_timeseries_schema() and create_oracle_output_schema() functions to enable returning the schema as a vector of R data types instead of an arrow::Schema object (#95).
Added output_type_id_datatype argument to create_oracle_output_schema() and connect_target_oracle_output() functions to allow users to explicitly specify the data type of the output_type_id column in the schema. This ensures compatibility with create_hub_schema() and connect_hub() (#95).
(Internal) Refactored target data schema and connection tests to use embedded example hubs and reusable schema fixtures, improving reliability and making tests independent of dataset size and ordering.
Added utilities for working with hive-partitioned data file paths:
- extract_hive_partitions() for extracting key-value pairs from paths to hive-partitioned data files.
- is_hive_partitioned_path() for checking if a path is hive-partitioned.

create_oracle_output_schema() and create_timeseries_schema() now define a schema for hive-partitions whose data types are defined in the tasks.json config (#89).
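
For readers unfamiliar with hive partitioning, the idea behind the path utilities above is that directory segments of the form `key=value` encode column values. The following is a minimal Python sketch of that idea, not the hubData functions themselves: `parse_hive_partitions` and `looks_hive_partitioned` are hypothetical names, the example paths are invented, and URL-decoding of values is an assumption about how partition values are commonly written.

```python
from urllib.parse import unquote


def parse_hive_partitions(path):
    """Return {key: value} for each `key=value` directory segment in `path`.

    Assumes `path` points at a data file, so the final segment (the file
    name) is never treated as a partition.
    """
    partitions = {}
    directories = path.replace("\\", "/").split("/")[:-1]
    for segment in directories:
        key, sep, value = segment.partition("=")
        if sep and key:
            # Assumption: values may be URL-encoded (e.g. %20 for a space).
            partitions[key] = unquote(value)
    return partitions


def looks_hive_partitioned(path):
    """True when at least one directory segment is a `key=value` pair."""
    return bool(parse_hive_partitions(path))


example = parse_hive_partitions(
    "target-data/time-series/target=wk%20inc/part-0.parquet"
)
```

Every extracted value is a string here; assigning proper data types to partition columns is a separate step, which is the role the entry above gives to the tasks.json config.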

13 Jun 13:29

annakrystalli

v1.4.0

a63cdfb

hubData 1.4.0

Added connect_target_timeseries() function (experimental) for accessing time-series target data from a hub (#71). This includes accessing target data from cloud hubs (#75).
Added create_timeseries_schema() function for creating a schema for time-series target data (#71).
Added connect_target_oracle_output() function (experimental) for accessing oracle-output target data from a hub (#72). This includes accessing target data from cloud hubs (#76).
Added create_oracle_output_schema() function for creating a schema for oracle-output target data (#72).
Added get_target_path() function for retrieving the path to the appropriate target data file or directory in a hub.
Added get_target_file_ext() function for retrieving the file extensions of target data file(s) in a hub.
Added get_s3_bucket_name() for extracting the bucket name of a cloud enabled hub from a hub's config (#75).
Added na argument to connect_hub(), connect_model_output(), connect_target_timeseries(), connect_target_oracle_output(), create_timeseries_schema(), and create_oracle_output_schema() to allow for the specification of how to handle missing values in CSV files. The default is to use NA or "", but users can restrict this to "" (empty string) when needing to include character "NA" values in their CSV data (#80). Note this approach only works if NA values are written to the CSV file as "" (empty string) and not as NA or "NA".
Added ignore_files argument to connect_hub() and connect_model_output() to allow users to specify a vector of file name prefixes to ignore when scanning the hub's model output directory for files. This is useful for excluding files that are not relevant to the hub's model output, such as README files or other documentation as well as potentially invalid files (#87). The feature is also used internally in connect_hub() to enable skipping expensive file validity checks when connecting to cloud-based hubs with multiple file formats using skip_checks = TRUE.
Refactored connect_hub() and connect_model_output() internally to reduce the number of calls to cloud hubs, improving performance when connecting to cloud-based hubs.
Added ignore_files argument to connect_target_oracle_output(), connect_target_timeseries(), create_timeseries_schema(), and create_oracle_output_schema() to allow users to specify a vector of file name prefixes to ignore when scanning the hub's target data directory for files (#87).
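
The effect of restricting the na argument can be shown with a small, language-neutral example. The following Python sketch is an analogy only, not how hubData or arrow read CSV files; `read_csv_with_na` and the sample data are hypothetical.

```python
import csv
import io


def read_csv_with_na(text, na=("NA", "")):
    """Parse CSV text, replacing any cell whose content is in `na` with None."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {col: (None if cell in na else cell) for col, cell in row.items()}
        for row in reader
    ]


# Hypothetical data: the first row holds a genuine "NA" code,
# the second row has a truly missing location written as an empty string.
csv_text = "location,value\nNA,1\n,2\n"

default_rows = read_csv_with_na(csv_text)            # "NA" and "" both missing
strict_rows = read_csv_with_na(csv_text, na=("",))   # only "" is missing
```

With the default, the genuine "NA" code is lost; restricting the missing-value markers to the empty string preserves it, which is why missing values must then be written to the file as empty strings.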

25 Nov 09:51

annakrystalli

v1.3.0

b7dd168

hubData 1.3.0 - Support v4.0.0 schema

Support the determination of hub schema from v4 configuration files (#63).
Also fixes a bug in create_hub_schema() where the output_type_id data type was incorrectly auto-determined as logical when only point estimate output types were being collected by a hub. When auto-determined in such situations, the output_type_id data type is now character for all schema versions.

02 Oct 15:14

zkamvar

v1.2.3

50bff18

hubData 1.2.3

Fixed bug in create_hub_schema() where the output_type_id data type was incorrectly determined as Date instead of character (reported in a comment on reichlab/variant-nowcast-hub#87 and fixed in #58).

Full Changelog: v1.2.2...v1.2.3

20 Sep 18:45

zkamvar

v1.2.2

c9d55de

v1.2.2

What's Changed

Split hubutils - Release v0.0.1 by @annakrystalli in #1
Remote refs from hubverse remotes by @annakrystalli in #12
Replace all null task id properties with required = NA by @annakrystalli in #19
Add collect_hub function by @annakrystalli in #18
use dev version of arrow until fixed version v16 released on CRAN by @annakrystalli in #23
fix broken vignette link by @nickreich in #28
add load_forecasts_zoltar() function by @matthewcornell in #36
Change org name by @annakrystalli in #42
Support v3.0.0 schema sample specification by @annakrystalli in #31
Defunct expand_model_out_val_grid and create_model_out_submit_tmpl and move to hubValidations by @annakrystalli in #46
44/ Configure output_type_id from config by @annakrystalli in #49
fix tidyselect warnings by @zkamvar in #51
Bsweger/skip data checks option by @bsweger in #47
Hotfix/zoltr dep by @annakrystalli in #53
change hubdocs URL to hubverse.io by @zkamvar in #52
Hotfix / Use CRAN arrow by @annakrystalli in #55

New Contributors

@annakrystalli made their first contribution in #1
@nickreich made their first contribution in #28
@matthewcornell made their first contribution in #36
@zkamvar made their first contribution in #51
@bsweger made their first contribution in #47

Full Changelog: https://github.com/hubverse-org/hubData/commits/v1.2.2

Contributors

bsweger, matthewcornell, and 3 other contributors

26 Aug 17:36

zkamvar

1.2.2

c9d55de

hubData 1.2.2

What's Changed

Hotfix / Use CRAN arrow by @annakrystalli in #55

Full Changelog: ef58779...1.2.2

Contributors

annakrystalli
