
Issue 1460 export results as parquet format#1487

Open
samuelafolabi wants to merge 11 commits into wadpac:main from samuelafolabi:issue_1460_export_results_as_parquet_format

Conversation

@samuelafolabi

@samuelafolabi samuelafolabi commented Mar 27, 2026

Related to #1460

This PR adds a new parameter save_dashboard_parquet to GGIR that allows users to export a consolidated, web dashboard-ready Parquet file from the GGIR output CSVs.

  • When save_dashboard_parquet = TRUE, GGIR writes results/ggir_results.parquet after all requested parts have completed
  • Attaches Parquet key-value metadata recording the variable dictionary, activity threshold configuration, and accelerometer metric used
  • Cleans all column names to be SQL-friendly (lowercase, no special characters) for direct use with DuckDB or similar query engines
  • The web dashboard repo and live link are here: https://github.com/samuelafolabi/ggir-web-dashboard
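The SQL-friendly column cleaning described above can be sketched in base R. The exact rules used in the PR are not shown in this thread, so the ones below (lowercase, underscores instead of special characters, no leading digit) are assumptions, not the actual implementation:

```r
# Hypothetical sketch of SQL-friendly column-name cleaning for DuckDB-style
# query engines; the real rules in the PR may differ.
clean_colnames <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9]+", "_", x)   # collapse runs of special characters
  x <- gsub("^_+|_+$", "", x)       # trim leading/trailing underscores
  x <- gsub("^([0-9])", "x\\1", x)  # SQL identifiers should not start with a digit
  x
}

clean_colnames(c("ACC day mg", "L5hr_ENMO_mg_0-24hr", "24hr mean"))
# "acc_day_mg" "l5hr_enmo_mg_0_24hr" "x24hr_mean"
```

Names cleaned this way can be used unquoted in SQL queries against the Parquet file.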

Checklist before merging:

  • Existing tests still work (check by running the test suite, e.g. from RStudio).
  • Added tests (if you added functionality) or fixed existing test (if you fixed a bug).
  • Clean code has been attempted, e.g. intuitive object names and no code redundancy.
  • Documentation updated:
    • Function documentation
    • Chapter vignettes for GitHub IO
    • Vignettes for CRAN
  • Corresponding issue tagged in PR message. If no issue exists, please create an issue and tag it.
  • Updated release notes in inst/NEWS.Rd with a user-readable summary. Please, include references to relevant issues or PR discussions.
  • If you think you made a significant contribution, add your name to the contributors lists in the DESCRIPTION, zenodo.json, and inst/CITATION files.
  • GGIR parameters were added/removed. If yes, please also complete the checklist below.
    If NEW GGIR parameter(s) were added then these NEW parameter(s) are:
    • documented in man/GGIR.Rd
    • included with a default in R/load_params.R
    • included with value class check in R/check_params.R
    • included in table of vignettes/GGIRParameters.Rmd with references to the GGIR parts the parameter is used in.
    • mentioned in NEWS.Rd as NEW parameter

@vincentvanhees

Thanks @samuelafolabi looks good, I will review it tomorrow.
I just made some updates to the main branch, could you update your branch with those changes? I do not expect them to interfere with your work, but it may be good to have it all up to date.

Member

@vincentvanhees vincentvanhees left a comment


Code looks great Samuel, only a few comments below. My main concern is an error message I get when I test it on one of my files (Axivity cwa).

I am only using GGIR default parameter values and save_dashboard_parquet = TRUE.

```r
GGIR(
  datadir = datadir,
  outputdir = outputdir,
  overwrite = TRUE,
  mode = 1:5,
  do.report = c(2, 4, 5),
  save_dashboard_parquet = TRUE
)
```

which results in the error

```
Error: Invalid: Can only convert data frames to Struct type
```

I think the error originates around line 174 in write_parquet.R, where the nested epoch-level time series are attached as a list-column. From that point onward arrow::arrow_table(consolidated) is no longer possible.
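A small base-R check can make this hypothesis easier to confirm: arrow::arrow_table() cannot convert a data frame whose list-column elements are themselves data frames, so flagging such columns before conversion narrows down the failing step. The has_nested_df helper and the consolidated example below are illustrative, not code from the PR:

```r
# Hypothetical diagnostic for the failure mode described above: find
# list-columns whose elements are nested data frames, which Arrow
# cannot convert to a Struct type.
has_nested_df <- function(df) {
  vapply(df, function(col) {
    is.list(col) && any(vapply(col, is.data.frame, logical(1)))
  }, logical(1))
}

consolidated <- data.frame(id = c("p1", "p2"))
consolidated$epochs <- list(data.frame(t = 1:3), data.frame(t = 1:2))  # nested time series

has_nested_df(consolidated)  # id: FALSE, epochs: TRUE
```

Such columns would need to be unnested (or written to a separate epoch-level file) before arrow::arrow_table() is called.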

Comment thread R/write_parquet.R
Comment thread vignettes/GGIRParameters.Rmd Outdated
@samuelafolabi

samuelafolabi commented Apr 2, 2026

> Code looks great Samuel, only a few comments below. My main concern is an error message I get when I test it on one of my files (Axivity cwa).
>
> I am only using GGIR default parameter values and save_dashboard_parquet = TRUE.
>
> ```r
> GGIR(
>   datadir = datadir,
>   outputdir = outputdir,
>   overwrite = TRUE,
>   mode = 1:5,
>   do.report = c(2, 4, 5),
>   save_dashboard_parquet = TRUE
> )
> ```
>
> which results in the error `Error: Invalid: Can only convert data frames to Struct type`
>
> I think the error originates around line 174 in write_parquet.R, where the nested epoch-level time series are attached as a list-column. From that point onward arrow::arrow_table(consolidated) is no longer possible.

@vincentvanhees Thank you for reviewing.
I'll check this out and fix it as soon as I can. I was only able to test with the test file I created using the create_test_file function before I made the PR.

Would it be possible for you to share some files with me, so I can test rigorously before tagging you on the PR again?

Thanks once again.

@samuelafolabi

> Thanks @samuelafolabi looks good, I will review it tomorrow. I just made some updates to the main branch, could you update your branch with those changes? I do not expect them to interfere with your work, but it may be good to have it all up to date.

Yes, thanks for the heads-up. I saw the recent commits and will pull them in before making these fixes.

@vincentvanhees

> Is it possible you share some files

No problem, I have sent you an email with the link.

Improve dashboard parquet generation by hardening epoch merge behavior and filename-to-ID matching for edge-case recordings, preventing Arrow struct conversion failures.
Add and refine .Rd docs for parquet export helpers (including author info and dashboard privacy-focused usage note), and align documentation style with GGIR conventions.
@vincentvanhees vincentvanhees left a comment


Thanks, all looks better now. A few more comments.

Please also update the NEWS.md file in the root of the repository. As you will see there is a block for GGIR version 3.3-? at the top of that file to reflect the upcoming release. Please add a line to this block. For example:

- Functionality added to save all key output to one parquet file per person. #1460

where #1460 is my way of tracking the GitHub item an update corresponds to.

Comment thread R/write_parquet.R
```r
# ---------------------------------------------------------------
# Build Arrow table and attach Parquet key-value metadata
# ---------------------------------------------------------------
parquet_path = paste0(results_dir, "/ggir_epochs.parquet")
```
Member


GGIR users expect to be able to easily navigate the results folder while searching for the main GGIR output files. This will be complicated if all individual parquet files are stored in the root of this folder.
Please create a new subdirectory and store the parquet files in there, e.g. results/parquet.
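The suggested layout could be sketched as follows; the paths are illustrative, with tempdir() standing in for the real GGIR output directory:

```r
# Store per-recording parquet files under results/parquet rather than the
# results root, keeping the main GGIR output files easy to find.
results_dir <- file.path(tempdir(), "results")
parquet_dir <- file.path(results_dir, "parquet")
dir.create(parquet_dir, recursive = TRUE, showWarnings = FALSE)

parquet_path <- file.path(parquet_dir, "ggir_epochs.parquet")
dir.exists(parquet_dir)  # TRUE
```

Using file.path() rather than paste0() also keeps path separators portable across platforms.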

Author


Okay. Duly noted.

```r
metadatadir = "path/to/output_run/output_test_run",
params_output = list(),
params_general = list(desiredtz = "UTC", acc.metric = "ENMO"),
params_phyact = list(part6_threshold_combi = "WW_L40M100V400_T5A5"),
```
Member


This is not a plausible setting, please replace it with:
part6_threshold_combi = "40_100_400"
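For illustration only: assuming "40_100_400" encodes the light/moderate/vigorous acceleration thresholds (in mg) joined by underscores, which is inferred from the reviewer's example rather than stated in this thread, a small parser might look like:

```r
# Hypothetical parser for a part6_threshold_combi value such as "40_100_400";
# the assumed format is three numeric thresholds separated by underscores.
parse_threshold_combi <- function(combi) {
  parts <- as.numeric(strsplit(combi, "_", fixed = TRUE)[[1]])
  stopifnot(length(parts) == 3, !anyNA(parts))
  setNames(parts, c("light", "moderate", "vigorous"))
}

parse_threshold_combi("40_100_400")
# light moderate vigorous
#    40      100      400
```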

Author


Noted

```r
write_epoch_parquet(
metadatadir = "path/to/output_run/output_test_run",
params_output = list(),
params_general = list(desiredtz = "UTC", acc.metric = "ENMO"),
```
Member


tz = "UTC" is not meaningful for GGIR. Data is always collected in a specific timezone with specific DST conditions, so GGIR expects a concrete timezone in order to understand how to account for DST. Time is always expressed in local time and never in UTC time (Greenwich winter time).
Instead, use tz = "" as an example, which uses the timezone of the machine where the data is processed.
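The DST point can be illustrated with base R: across a DST transition, equal steps in absolute time are unequal steps in local clock time, which is why GGIR needs a concrete timezone. Europe/Amsterdam and the 2026 spring transition date are just example values here:

```r
# Across the spring DST transition in Europe/Amsterdam (clocks move forward
# on 2026-03-29), three hours of absolute time span four hours of clock time.
before <- as.POSIXct("2026-03-29 00:30:00", tz = "UTC")
after  <- before + 3 * 3600  # three hours later in absolute time

format(before, tz = "Europe/Amsterdam", usetz = TRUE)  # "2026-03-29 01:30:00 CET"
format(after,  tz = "Europe/Amsterdam", usetz = TRUE)  # "2026-03-29 05:30:00 CEST"
```

With tz = "UTC" this shift would be invisible, and epoch timestamps would drift one hour relative to the participant's actual day.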

Author


Okay. Thanks for the clarification! I was actually wondering what happens when a participant moves between two or more places in different timezones. This is clearer now.

Member


People moving between timezones is not accounted for at the moment, but GGIR does account for DST within the same time zone.

Author


Okay. Thank you!

```
}
\arguments{
\item{metadatadir}{
Directory that holds a folder 'meta' and inside this a folder 'basic'
```
Member


The folder 'basic' is not used; should this be replaced by ms5.outraw?

```
\arguments{
\item{metadatadir}{
Directory that holds a folder 'meta' and inside this a folder 'basic'
which contains the milestone data produced by \link{g.part1}. The folder structure
```
Member


Should \link{g.part1} be \link{g.part5}? Same comment for next sentence.

@vincentvanhees

vincentvanhees commented Apr 15, 2026

"Related to #1460" in the first message should probably be "Fixes #1460" because that will make sure the issue is auto-closed once the PR is merged.

@samuelafolabi

> "Related to #1460" in the first message should probably be "Fixes #1460" because that will make sure the issue is auto-closed once the PR is merged.

Oh yes! I had planned to read through the comment section of the issue again after the PR is merged, since I assumed it would help reinforce my understanding and avoid similar mistakes in the future. The issue could then be closed manually after that. @vincentvanhees
