Issue 1460 export results as parquet format#1487
Issue 1460 export results as parquet format#1487samuelafolabi wants to merge 11 commits intowadpac:mainfrom
Conversation
…ue_1460_export_results_as_parquet_format # Please enter a commit message to explain why this merge is necessary, # especially if it merges an updated upstream into a topic branch. # # Lines starting with '#' will be ignored, and an empty message aborts # the commit.
|
Thanks @samuelafolabi looks good, I will review it tomorrow. |
vincentvanhees
left a comment
There was a problem hiding this comment.
Code looks great Samuel, only a few comments below. My main concern is an error message I get when I test it on one of my files (Axivity cwa).
I am only using GGIR defaults parameter values and save_dashboard_parquet = TRUE.
GGIR(
datadir = datadir,
outputdir = outputdir,
overwrite = TRUE,
mode = 1:5,
do.report = c(2, 4, 5),
save_dashboard_parquet = TRUE
)
which results in error
Error: Invalid: Can only convert data frames to Struct type
I think the error originates around line 174 in write_parquet.R, where the nested epoch-level time series are attached as a list-column. From that point onward arrow::arrow_table(consolidated) is no longer possible.
@vincentvanhees Thank you for reveiwing. Is it possible you share some files with me so I could rigorously test before tagging you to the PR again? Thanks once again. |
Yea, thanks for the heads-up. I saw the recent commits. I'll pull them in before I make these fixes. |
No problem, I have sent you an email with the link. |
Improve dashboard parquet generation by hardening epoch merge behavior and filename-to-ID matching for edge-case recordings, preventing Arrow struct conversion failures. Add and refine .Rd docs for parquet export helpers (including author info and dashboard privacy-focused usage note), and align documentation style with GGIR conventions.
…_export_results_as_parquet_format
vincentvanhees
left a comment
There was a problem hiding this comment.
Thanks, all looks better now. A few more comments.
Please also update the NEWS.md file in the root of the repository. As you will see there is a block for GGIR version 3.3-? at the top of that file to reflect the upcoming release. Please add a line to this block. For example:
- Functionality added to save all key output to one parquet file per person. #1460
where the #1460 is my way of tracking the GitHub item an update corresponded to.
| # --------------------------------------------------------------- | ||
| # Build Arrow table and attach Parquet key-value metadata | ||
| # --------------------------------------------------------------- | ||
| parquet_path = paste0(results_dir, "/ggir_epochs.parquet") |
There was a problem hiding this comment.
GGIR users expect to be able to easily navigate the results folder while searching for the main GGIR output files. This will be complicated if all individual parquet files are stored in the root of this folder.
Please create a new subdirectory and store the parquet files in there, e.g. results/parquet.
| metadatadir = "path/to/output_run/output_test_run", | ||
| params_output = list(), | ||
| params_general = list(desiredtz = "UTC", acc.metric = "ENMO"), | ||
| params_phyact = list(part6_threshold_combi = "WW_L40M100V400_T5A5"), |
There was a problem hiding this comment.
This is not a plausible setting, please replace by:
part6_threshold_combi = "40_100_400"
| write_epoch_parquet( | ||
| metadatadir = "path/to/output_run/output_test_run", | ||
| params_output = list(), | ||
| params_general = list(desiredtz = "UTC", acc.metric = "ENMO"), |
There was a problem hiding this comment.
tz = "UTC" is not meaningful for GGIR. Data is always collected in a specific timezone with specific DST conditions. Therefore, GGIR expects a specific timezone in order to understand how to account for DST. Time is always expressed in local time and never in UTC time (Greenwich winter time).
Instead, use tz = "" as example, which uses the timezone where the data is processed.
There was a problem hiding this comment.
Okay. Thanks for the clarification! I was actually wondering about what happens when a participant moves between two or more places with different timezones. This is clearer now.
There was a problem hiding this comment.
People moving between timezones is not accounted for at the moment, but GGIR does account for DST within the same time zone.
| } | ||
| \arguments{ | ||
| \item{metadatadir}{ | ||
| Directory that holds a folder 'meta' and inside this a folder 'basic' |
There was a problem hiding this comment.
folder basic is not used, should this be replaced by ms5.outraw?
| \arguments{ | ||
| \item{metadatadir}{ | ||
| Directory that holds a folder 'meta' and inside this a folder 'basic' | ||
| which contains the milestone data produced by \link{g.part1}. The folder structure |
There was a problem hiding this comment.
Should \link{g.part1} be \link{g.part5}? Same comment for next sentence.
Oh yes! I had plans of reading the comment section of the issue over again after the PR is merged. I assumed it would help reinforce my understanding of things and avoid making similar mistakes in future. Then the issue could be manually closed after that. @vincentvanhees |
Related to #1460
This PR adds a new parameter save_dashboard_parquet to GGIR that allows users to export a consolidated, web dashboard-ready Parquet file from the GGIR output CSVs.
Checklist before merging:
inst/NEWS.Rdwith a user-readable summary. Please, include references to relevant issues or PR discussions.DESCRIPTION,zenodo.json, andinst/CITATIONfiles.If NEW GGIR parameter(s) were added then these NEW parameter(s) are:
man/GGIR.RdR/load_params.RR/check_params.Rvignettes/GGIRParameters.Rmdwith references to the GGIR parts the parameter is used in.