Releases: hubverse-org/hubData

hubData 2.2.1

15 May 13:16
v2.2.1
73a329b

  • Fixed bug in collect_hub() that caused returned tibbles to contain arrow-ALTREP-backed columns rather than plain materialised R vectors (#141). The ALTREP views could not be safely save()d, serialize()d, or sent to parallel workers in R sessions without arrow installed, silently corrupting columns (most came back length-0). collect_hub() now sets arrow.use_altrep = FALSE for the duration of the call, restoring the long-standing dplyr::collect() contract of materialising into plain in-memory R vectors. Users who want lazy / ALTREP-friendly access can continue to use connect_hub() and work with the arrow dataset directly.
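The "set an option for the duration of a call" pattern described above can be sketched in base R. This is a minimal illustration, not hubData's actual implementation: with_altrep_disabled() is a hypothetical helper name, and only the arrow.use_altrep option name comes from the note above.

```r
# Hypothetical helper (not part of hubData): evaluate `expr` with the
# `arrow.use_altrep` option set to FALSE, then restore the previous value.
with_altrep_disabled <- function(expr) {
  old <- options(arrow.use_altrep = FALSE)
  # Restore the caller's setting even if `expr` throws an error.
  on.exit(options(old), add = TRUE)
  # `expr` is a lazy promise, so it is only evaluated here,
  # after the option has been switched off.
  force(expr)
}

# Demonstration without arrow: inspect the option inside and after the call.
options(arrow.use_altrep = TRUE)
inside <- with_altrep_disabled(getOption("arrow.use_altrep"))
after <- getOption("arrow.use_altrep")
```

Here inside is FALSE and after is TRUE again, so the change does not leak into the rest of the session. A collecting call would be passed as the expr argument in the same way.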

hubData 2.2.0

07 May 12:14
v2.2.0
49b7e0a

  • Fixed bug in load_model_metadata() that caused array-valued top-level fields in model metadata YAML files to parse incorrectly or fail to parse at all (#137). Array-valued top-level fields are now translated into list columns, per the function specification. Note: The bug also meant that length-1 top-level arrays parsed equivalently to top-level scalars, e.g. both y: "x" and y: ["x"] parsed as a character column y with value x. With the fix, those entries parse differently: y: "x" parses as a character column; y: ["x"] parses as a list column. Thanks @dylanhmorris.
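The difference between the two resulting column types can be illustrated with base R data frames. This is a sketch only: the frames below are built by hand to mimic the two shapes, not produced by load_model_metadata(), and the column name y is taken from the example above.

```r
# Shape produced by a top-level scalar, y: "x" -> character column.
scalar_form <- data.frame(y = "x")

# Shape produced by a top-level array, y: ["x"] -> list column.
# I() keeps the list as a single list column instead of splitting it.
array_form <- data.frame(y = I(list("x")))

is.character(scalar_form$y)  # TRUE
is.list(array_form$y)        # TRUE

# Downstream code that indexed the value directly needs one extra
# level of extraction for the list column.
scalar_value <- scalar_form$y[1]
array_value <- array_form$y[[1]]
```

Code that previously relied on length-1 arrays arriving as plain character values may therefore need a [[ ]] or unlist() step after upgrading.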

hubData 2.1.0

13 Jan 14:23
v2.1.0
34245de

New features and improvements

  • Added date_col parameter to create_oracle_output_schema() and connect_target_oracle_output() to match time-series functionality (#121). In inference mode (when target-data.json is not present), users can now specify which column contains date information to ensure it's typed as date32(). The parameter is ignored in v6+ config mode where date columns are defined in the configuration.

Bug fixes and minor improvements

  • create_timeseries_schema() and create_oracle_output_schema() now issue a warning instead of an error when no date column is detected during schema inference (when target-data.json config is not present) (#119). This allows schema creation to succeed and enables validation tools to properly catch and report missing date columns with better error messages.

hubData 2.0.0

20 Nov 13:12
v2.0.0
fd6d6e0

Breaking changes

  • BREAKING: Changed default value of skip_checks parameter from FALSE to TRUE in connect_hub() (#114). This significantly improves performance, especially for large cloud hubs (AWS S3, GCS), by skipping I/O-intensive file validation checks. The previous default behavior of detecting and excluding invalid files is still available by explicitly setting skip_checks = FALSE. This change aligns with the default in the Python hubdata package and reflects that hubs validated through hubValidations should not require additional file checks. Users with model output directories containing invalid files should do one of the following:
    • Use the ignore_files argument to exclude specific files, or
    • Set skip_checks = FALSE explicitly, or
    • Ensure their model-output directories contain only valid model output files
  • Note: connect_model_output() retains its default of skip_checks = FALSE as it is designed for working with model output directories that may be in draft form.
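A minimal sketch of opting back into the earlier behavior follows. connect_with_checks() is a hypothetical wrapper name, the ignore_files prefixes are illustrative, and the example hub path assumes the "testhubs/simple" hub bundled with the hubUtils package is available. The usage part is guarded so it only runs when both packages are installed.

```r
# Hypothetical wrapper (not part of hubData): connect with file validity
# checks re-enabled and a set of file name prefixes to ignore.
connect_with_checks <- function(hub_path, ignore = c("README", ".DS_Store")) {
  hubData::connect_hub(
    hub_path,
    skip_checks = FALSE,   # pre-2.0.0 default: detect and exclude invalid files
    ignore_files = ignore  # prefixes are illustrative, adjust to your hub
  )
}

# Guarded usage: assumes the example hub shipped with hubUtils.
if (requireNamespace("hubData", quietly = TRUE) &&
    requireNamespace("hubUtils", quietly = TRUE)) {
  hub_path <- system.file("testhubs/simple", package = "hubUtils")
  if (nzchar(hub_path)) {
    hub_con <- connect_with_checks(hub_path)
  }
}
```

For hubs already validated through hubValidations, the new default (skip_checks = TRUE) should need no wrapper at all.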

New features and improvements

  • Added comprehensive "Accessing Target Data" vignette demonstrating how to use connect_target_timeseries() and connect_target_oracle_output() to access target data, including filtering, joining with model outputs, and working with cloud-based hubs (#108).
  • Added r_to_arrow_datatypes() function providing an inverse mapping from R data types to Arrow data types, enabling vectorized type conversion when processing target-data.json configurations (#107).
  • Enhanced create_timeseries_schema() and create_oracle_output_schema() to support config-based schema creation when target-data.json (v6.0.0+) is present (#107). This enables fast, deterministic schema creation without filesystem I/O, especially beneficial for cloud storage. Functions automatically fall back to inference-based schema creation for pre-v6 hubs or hubs without target-data.json, maintaining backward compatibility. This functionality is propagated to connect_target_timeseries() and connect_target_oracle_output(), which use these schema creation functions internally.
  • Enhanced documentation for connect_target_timeseries() and connect_target_oracle_output() to clarify column ordering behavior: v6+ Parquet files are reordered to hubverse convention, while CSV files preserve original ordering to avoid column name/position mismatches during collection (#107).
  • Added get_target_data_colnames() function for extracting and ordering expected column names for target data from target-data.json configuration files (#109).

hubData 1.5.0

23 Sep 16:38
v1.5.0
3bda7d0

  • Added Arrow schema utilities for safely converting and validating column types from arrow::Schema objects:
    • as_r_schema(): Converts an Arrow schema to a named character vector of equivalent R types (e.g., "int32" → "integer"). Errors on unsupported types.
    • arrow_schema_to_string(): Extracts the raw Arrow type strings for each field in a schema.
    • is_supported_arrow_type(): Returns a named logical vector indicating which schema fields have supported types.
    • validate_arrow_schema(): Validates that all field types in an Arrow schema are supported. Throws a helpful error otherwise.
  • Added arrow_to_r_datatypes, a named character vector defining the mapping of safe and portable Arrow types to their R equivalents.
  • Added r_schema argument to create_timeseries_schema() and create_oracle_output_schema() functions to enable returning the schema as a vector of R data types instead of an arrow::Schema object (#95).
  • Added output_type_id_datatype argument to create_oracle_output_schema() and connect_target_oracle_output() functions to allow users to explicitly specify the data type of the output_type_id column in the schema. This ensures compatibility with create_hub_schema() and connect_hub() (#95).
  • (Internal) Refactored target data schema and connection tests to use embedded example hubs and reusable schema fixtures, improving reliability and making tests independent of dataset size and ordering.
  • Added utilities for working with hive-partitioned data file paths:
    • extract_hive_partitions() for extracting key value pairs from paths to hive-partitioned data files.
    • is_hive_partitioned_path() for checking if a path is hive-partitioned.
  • create_oracle_output_schema() and create_timeseries_schema() now define a schema for hive-partitions whose data types are defined in the tasks.json config (#89).
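To make the idea of hive-partitioned paths concrete, here is a base R sketch of pulling key = value pairs out of such a path. parse_hive_partitions() is a hypothetical stand-in written for illustration. It is not the implementation of extract_hive_partitions(), whose exact return format may differ, and the example path is invented.

```r
# Hypothetical illustration (not hubData's implementation): collect the
# key=value directory segments of a hive-partitioned file path.
parse_hive_partitions <- function(path) {
  segments <- strsplit(path, "/", fixed = TRUE)[[1]]
  # Only directory segments can be partitions, so drop the file name.
  dirs <- segments[-length(segments)]
  parts <- dirs[grepl("^[^=]+=", dirs)]
  keys <- sub("=.*$", "", parts)        # text before the first "="
  values <- sub("^[^=]+=", "", parts)   # text after the first "="
  stats::setNames(values, keys)
}

partitions <- parse_hive_partitions(
  "oracle-output/output_type=quantile/target=flu/part-0.parquet"
)
# A path is treated as hive-partitioned if at least one pair is found.
has_partitions <- length(partitions) > 0
no_partitions <- parse_hive_partitions("oracle-output/part-0.parquet")
```

In this sketch partitions is a named character vector with output_type = "quantile" and target = "flu", while no_partitions is empty. All values come back as character, which is why a schema is needed to give partition columns their data types.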

hubData 1.4.0

13 Jun 13:29
v1.4.0
a63cdfb

  • Added connect_target_timeseries() function (experimental) for accessing time-series target data from a hub (#71). This includes accessing target data from cloud hubs (#75).
  • Added create_timeseries_schema() function for creating a schema for time-series target data (#71).
  • Added connect_target_oracle_output() function (experimental) for accessing oracle-output target data from a hub (#72). This includes accessing target data from cloud hubs (#76).
  • Added create_oracle_output_schema() function for creating a schema for oracle-output target data (#72).
  • Added get_target_path() function for retrieving the path to the appropriate target data file or directory in a hub.
  • Added get_target_file_ext() function for retrieving the file extensions of target data file(s) in a hub.
  • Added get_s3_bucket_name() for extracting the bucket name of a cloud enabled hub from a hub's config (#75).
  • Added na argument to connect_hub(), connect_model_output(), connect_target_timeseries(), connect_target_oracle_output(), create_timeseries_schema(), and create_oracle_output_schema() to allow for the specification of how to handle missing values in CSV files. The default is to use NA or "", but users can restrict this to "" (empty string) when needing to include character "NA" values in their CSV data (#80). Note this approach only works if NA values are written to the CSV file as "" (empty string) and not as NA or "NA".
  • Added ignore_files argument to connect_hub() and connect_model_output() to allow users to specify a vector of file name prefixes to ignore when scanning the hub's model output directory for files. This is useful for excluding files that are not relevant to the hub's model output, such as README files or other documentation as well as potentially invalid files (#87). The feature is also used internally in connect_hub() to enable skipping expensive file validity checks when connecting to cloud-based hubs with multiple file formats using skip_checks = TRUE.
  • Refactored connect_hub() and connect_model_output() internally to reduce the number of calls to cloud hubs, improving performance when connecting to cloud-based hubs.
  • Added ignore_files argument to connect_target_oracle_output(), connect_target_timeseries(), create_timeseries_schema(), and create_oracle_output_schema() to allow users to specify a vector of file name prefixes to ignore when scanning the hub's target data directory for files (#87).
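The effect of the na argument described above can be illustrated with base R's read.csv(), whose na.strings argument plays an analogous role. This is an analogy only: hubData reads files through arrow, not read.csv(), and the CSV content below is invented.

```r
# A location column where "NA" is a real value and the
# second row has a genuinely missing (empty) location.
csv_text <- "location,value\nNA,1\n,2\n"
col_types <- c("character", "numeric")

# Analogue of the default na = c("NA", ""): both rows become missing.
default_na <- read.csv(text = csv_text, na.strings = c("NA", ""),
                       colClasses = col_types)

# Analogue of na = "": only the empty field is missing and the
# literal string "NA" survives as character data.
strict_na <- read.csv(text = csv_text, na.strings = "",
                      colClasses = col_types)

default_na$location  # NA NA
strict_na$location   # "NA" NA
```

As the note above says, this only works when missing values are written to the file as empty strings. A file that writes them as NA gives the reader no way to tell the two cases apart.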

hubData 1.3.0 - Support v4.0.0 schema

25 Nov 09:51
v1.3.0
b7dd168

  • Support the determination of hub schema from v4 configuration files (#63).
  • Also fixed a bug in create_hub_schema() where the output_type_id data type was incorrectly auto-determined as logical when only point estimate output types were being collected by a hub. When auto-determined in such situations, the output_type_id data type is now character for all schema versions.

hubData 1.2.3

02 Oct 15:14
v1.2.3
50bff18

Full Changelog: v1.2.2...v1.2.3

v1.2.2

20 Sep 18:45
v1.2.2
c9d55de

Full Changelog: https://github.com/hubverse-org/hubData/commits/v1.2.2

hubData 1.2.2

26 Aug 17:36
1.2.2
c9d55de

Full Changelog: ef58779...1.2.2