Add analytical data search notebook#443

Merged
troyraen merged 12 commits into nasa-fornax:main from adrianlucy:beyond_metadata
Feb 26, 2026

Conversation

@adrianlucy
Contributor

@adrianlucy adrianlucy commented Jul 10, 2025

This notebook from MAST for Fornax, "Beyond metadata: analytical data search in the cloud with JWST spectral cubes", answers a data-intensive archival search query ("I want to find JWST spectral image cubes containing Fe II emission from jets launched by young stellar objects"), with broader pedagogical goals related to large-scale target coordinate searches beyond the capabilities of archive services, parallelizing with dask, algorithmically searching for select data characteristics, and loading data from cloud URIs into memory.

The focus is on JWST spectral cubes from MAST, but I hope that the strategies taught will be of more general use for other missions and archives. I also touch briefly on SOFIA data from IRSA at the end, and hope to maybe switch that to SPHEREx in a future PR when more SPHEREx data is available.

The notebook has been reviewed internally by several MAST staff, their feedback has been incorporated, and we are ready for review by the Fornax team!

A couple practical notes:

  • You may need to pip install the requirements file. I've recently noticed that the Default Astrophysics kernel in Fornax no longer has boto3 and pyarrow, but it used to as recently as a couple weeks ago. I'm wondering if this might be accidental; I'll file an issue later today or tomorrow. (Edit: actually I'm not sure where I'm supposed to report that. Probably not this repo...)
  • In any case, you need a more recent astroquery version than the one currently installed in Fornax (last time I checked). This is discussed in the notebook.
  • An executed and rendered copy is available in a temporary repo, in case it helps illuminate the issue if something goes wrong.

@troyraen troyraen requested review from bsipocz and jkrick July 11, 2025 16:44
@troyraen
Contributor

Thanks @adrianlucy! A couple of folks will take a look, likely next week. We usually do a science review and a technical review. The checklists are posted here https://github.com/nasa-fornax/fornax-documentation/blob/main/notebook_review_process.md, in case you haven't seen them yet. Notebooks go through a two-step process: the first step gets the notebook into the repo, and the second gets it into the published tutorials here https://nasa-fornax.github.io/fornax-demo-notebooks/. This PR is the first step. The notebook doesn't have to meet all of the checklist items before merging, but they will give you an idea of what the reviewers will look for. Items not covered in this PR can be handled before or during step two.

Contributor

@jkrick jkrick left a comment


This notebook looks great and is very impressive! Thank you for writing it. I have only a few, very minor, comments, with no major changes requested.

I would suggest a different title, as I had no idea what to expect from the notebook when I first saw its title. Maybe something like: "detect_emission_spectral_cubes" or "detect_FE_YSOs" or "search_YSO_cubes" or "FE_emission_YSOs" or ??

Comment thread data_search/analytical_data_search.md
Comment thread data_search/analytical_data_search.md Outdated

+++

For this notebook, we recommend using the [Fornax science platform](https://nasa-fornax.github.io/fornax-demo-notebooks/documentation/README.html) with Server Type = "Large - 64GB RAM/16 CPU" and Environment = "Default Astrophysics". As of June 2025, running this notebook once on that server will take about 10 minutes.
Contributor


A suggestion to use the Medium server as a benchmark, only because resources are limited on Fornax and it will run to completion on the Medium server (I just did it, but didn't time it).

Contributor Author

@adrianlucy adrianlucy Jul 16, 2025


I get a total 33 min on Medium, vs. 10 min on Large.

I definitely appreciate Fornax's concerns about overtaxing your servers, and I'm not very familiar with what expected resource utilization looks like on your end once you leave beta. But I do have a couple concerns about this proposed change that I'd appreciate your thoughts on:

  1. As a user, it feels a lot more interactive and conducive to trial-and-error/scientist workflows to have a single cell take 2 minutes (on Large) vs. 13 minutes (on Medium).

  2. It's a lot harder to sell a user on the utility of parallelizing when using dask only gets you a ~1.5x speed-up (on Medium), as opposed to a ~4x speed-up (on Large).

The performance I'm seeing on that second point is interesting. For the results below, I'm looking at:

  • Without dask: L276-292
  • With Dask: Take L375, but for a fair comparison remove gnomonic projection from the check_points_in_polygon function.*

And I get:

  • Large server, without Dask: 9.4 minutes, <10% CPU usage
  • Large server, with Dask: 2.4 minutes, 70--80% CPU usage
  • Medium server, without Dask: 12 minutes, 30--40% CPU usage
  • Medium server, with Dask: 7.5 minutes, 80% CPU usage

So, a lot going on here. Maybe there's some overhead imposed by using dask that makes using it on 4 cores in this case barely worth it? Also, seems like maybe we're using more than one core even without using dask, at least on Medium.

In any case, my main concern is that the improvement from parallelizing is much more salient and instructive to the user on the Large server (4x) than on Medium (1.5x).

*check_points_in_polygon() without gnomonic projection, used for the "with Dask" results above:

def check_points_in_polygon(row, coordinates_to_test):
    polygon = parse_polygon(row['s_region'])
    polygon_path = Path(polygon, closed=True)
    if polygon_path.contains_points(coordinates_to_test).any():
        return row
    else:
        return pd.Series([np.nan] * len(row), index=row.index)

Contributor


Probably, given all of this valuable information, we should just say both server sizes in the runtime section with their respective times, or maybe include that mini-table with the 4 lines and their runtimes + CPU usages. Likely people aren't going to be reading the runtime section to figure out how to optimize their code and whether or not to use dask, but it is valuable info to keep around.

Contributor Author


Good idea!

(Notes to self so I don't forget, no need to read: 1. If I show that table, or something like it in prose, maybe do with gnomonic projection instead, for all combinations. 2. Check for instances throughout notebook where I talk about how long a cell will take, and maybe include estimates for both medium and large. Similarly, if I talk about speed-up from parallelizing, mention that this is dependent on the number of available cores. Also fix the place where I say "next cell" referring to like 3 cells downwards.)

Contributor


You might want to consider using python's builtin multiprocessing for the parallelization instead of dask. After a quick look at your code, I'm guessing the calls with dask are using the same number of workers as there are CPUs, so 4 on the Medium server and 16 on Large. While the 4x speedup on Large is much better than the 1.5x on Medium, both are poor relative to the number of workers (and CPUs, and probably memory as well) being used. If the parallelization were going well, I would expect a 16x speedup from 16 workers.

As you guessed above, dask comes with a lot of overhead. The way the code is written also affects the efficiency of the parallelization but, while I didn't look at your code in enough detail to comment on it, I've had enough experience with dask similar to what you're reporting here that I suspect dask just isn't doing a very good job of running it efficiently. Perhaps dask can be made to do better, but unfortunately I haven't seen much documentation instructing users how to do that (though I haven't checked in quite a while). Also, I'm hearing reports of significant issues with dask development right now, with things in broken states and few developers available to fix them.

So multiprocessing seems the better option all around. That being said, I would certainly understand if you don't want to put the effort into switching your code at this point.

Contributor Author

@adrianlucy adrianlucy Jul 21, 2025


@troyraen Interesting, thank you! Yeah, dask wasn't a deeply considered choice on my part, I was just following guidance that it's the strategic direction we wanted to focus on. But we shouldn't recommend it in cases where it's the wrong choice.

Indeed, I am seeing much better results with multiprocessing (with num_workers = num_cores). Relative to no explicit parallelization at all, now a 10x speed-up on Large (was 4x with dask), 3x speed-up on Medium (was 1.5x with dask). CPU usage is now ~100% (was ~70-80% with dask).

The choice between dask and multiprocessing may also be affected by how brief the parallelized function is, in this case ~30--90 milliseconds per call.

The speed-up is still short of [number of cores]x. The reason may be that my code needs improvement, but my best guess is that it's at least partially because some python libraries are implicitly managing to make use of multiple cores even without explicit parallelization. On Medium, CPU usage without explicit parallelization was 30--40% according to the kernel usage tab in Fornax, so a 3x speed-up from parallelizing seems about right. On Large, CPU usage without explicit parallelization was 7%, so our 10x speed-up is a bit short of the 14x speed-up I might expect, but in this case 14x vs. 10x is a difference of 15 seconds, so I could see start-up/wrap-up overhead being an issue. No improvement from increasing chunksize.

Anyway, I think I will indeed switch this to multiprocessing. And I guess I'll try to see if changing the parallelized function to apply to batches of rows instead of individual rows may yield better performance on either or both of dask and/or multiprocessing.
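For what it's worth, implicit use of multiple cores in scientific Python usually comes from numpy's BLAS/OpenMP backends. A minimal sketch of how one might pin those thread pools to make speed-up comparisons cleaner (these are the commonly used environment variable names, not anything from this notebook):

```python
# Pin common native thread pools to a single thread per process, so that any
# measured speed-up reflects our explicit parallelization rather than
# BLAS/OpenMP silently using multiple cores. These variables must be set
# before numpy (or similar) is first imported.
import os

for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS",
            "MKL_NUM_THREADS", "NUMEXPR_NUM_THREADS"):
    os.environ[var] = "1"
```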

Contributor


Oh good, ya those are obviously better speedups. And I should have added a few more words to this sentence since the overhead may not be completely surmountable... "If the parallelization were going well, I would expect closer to a 16x speedup from 16 workers." There are probably things you can tweak to get more efficiency, but also fine to leave as is.

I should also mention, multiprocessing is not without issues. One in particular is that the behavior is a bit different on Linux, Mac, and Windows and it can be tricky to write code that works out of the box on all of them. In this repo we assume Linux, since that's what Fornax Console runs, and then include some text with info/instructions for Mac and Windows if we can. We've been hoping that dask would eventually be a better solution but so far haven't had enough success to fully adopt it.

And I guess I'll try to see if changing the parallelized function to apply to batches of rows instead of individual rows

One simple way to do this is with the chunksize argument. The default is usually 1 which means that a new task is generated for every item in the input iterable, whereas a chunksize of (eg) 100 only generates a new task for every 100 items and so reduces the overhead. My rule of thumb is to generate at least 2-3 chunks per worker using something like chunksize = min(100, len(my_iterable) // (3 * nworkers)). Making chunks too big can result in some workers finishing a lot earlier than others and then just sitting around idle because all tasks have already been assigned out.

Another thing you can look into is the different functions that are available and what they're recommended for. For example, if the iterable is very long and the order in which results are returned is not important, Pool.imap_unordered() is recommended. But again, the 3x and 10x speedups are already reasonably good.

Contributor Author


Switched from dask to multiprocessing. One interesting nuance is that a task with an I/O bound component (opening FITS files) benefits from having more workers than cores, so that a core always has something to do.

Contributor Author


Runtimes for both medium and large server now included, removed advice to use large server.

Contributor Author


(Also I did try Pool.imap_unordered, changing chunksize, etc., but didn't find anything that improved performance except as noted above.)

Co-authored-by: jkrick <jkrick@caltech.edu>
@adrianlucy
Contributor Author

adrianlucy commented Jul 16, 2025

I would suggest a different title, as I had no idea what to expect from the notebook when I first saw its title. Maybe something like: "detect_emission_spectral_cubes" or "detect_FE_YSOs" or "search_YSO_cubes" or "FE_emission_YSOs" or ??

Thank you @jkrick! Just to check, when you ask for a different title, are you only referring to the filename, or do you also suggest changing the in-notebook title ("Beyond metadata: analytical data search in the cloud with JWST spectral cubes")?

Also, do you want the directory holding the notebook to be named the same as the file? Currently, the directory name is even more generic - I guess I was imagining that maybe you'd want to group this notebook with ones sharing similar goals in the future, or something like that.

Contributor

@jkrick jkrick left a comment


I can see the rationale for the folder name being either data_search or something more specific. You can choose and then if we feel it doesn't fit later, we can always change it. And I think the in-notebook title is also fine. It really was a comment about the filename itself. Other people might have stronger opinions on this organization question.

Comment thread data_search/analytical_data_search.md Outdated
Member

@bsipocz bsipocz left a comment


First round (dry-run -- I haven't run the notebook) review. The biggest issue is with the very tight requirements.

And the manual TOC will certainly be an issue as this further evolves.

Also, I wonder if we can try to add this to the rendering as part of this PR, or would you prefer to do that in a follow-up (e.g. TOC cleanup can be postponed to that one).

Comment thread data_search/requirements_analytical_data_search.txt Outdated
Comment thread data_search/README.md
Comment thread data_search/analytical_data_search.md Outdated
plt.show()
```

> Figure 4 image description: three astronomical images in pixel coordinates, side by side, display three different wavebin slices of a spectral cube. The left and right images are labeled "-10 wavebins away" and "+10 wavebins away", respectively, and feature a bright central source with faint hints of adjacent extension to the lower right. The middle image is labeled "5.34 micron line". In the middle image, the central source is much dimmer, emission extends all the way to the lower-right corner, and a blob in the lower-right corner is the brightest feature, in a region where the other two images exhibit no emission.
Member


Are these meant to be ALT-texts? If yes, then we should be able to work them properly into the figures. If they're for all users, then I would suggest working them more into the narrative text; this notebook has very little of that.

Contributor Author


Yes, they are intended to be a best effort at alt text. My understanding was that matplotlib unfortunately doesn't have alt text support, and Jupyter in general is not well designed for accessibility. Very open to ideas!

Member


I know that there has been some recent work on it in the wider ecosystem, and I've also asked around.
What's certain atm is that myst supports alt text for figures, but you may need to wizard around to have it added to the mpl-generated figure.

https://myst-parser.readthedocs.io/en/latest/syntax/images_and_figures.html
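For concreteness, a MyST figure directive with alt text looks something like this (path, alt text, and caption are hypothetical, not taken from the notebook):

````markdown
```{figure} figures/fig4_wavebin_slices.png
:alt: Three wavebin slices of a spectral cube shown side by side.
:name: fig4-wavebin-slices

Caption text for all readers goes here.
```
````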

Contributor Author


This is something that MAST is very interested in as well---it's really critical with so much user support moving to notebooks on science platforms---but we haven't made much progress on it. We've had conversations with QuantStack about notebook accessibility issues, and it sounds like they recently got some funding for that general topic.

I took a look at the link you mentioned, but I suspect where you say "wizard around" might unfortunately mean "carry out a major work effort". Our team thinks that putting alt text in block quotes is the best quick stop-gap solution we have right now. Note that, for sighted users, this looks like the attached screenshot.
[Screenshot: the block-quote alt text as rendered for sighted users]

Anyway, I changed "image description" to say "alt text" for more clarity. I can't think of any better solution that we have the capacity for at MAST. Let me know if you want me to delete the alt text, but my team strongly believes that imperfectly located alt text is better than no alt text.

And please let me know if you do come up with any alternatives!

@adrianlucy
Contributor Author

adrianlucy commented Jul 23, 2025

Also, I wonder if we can try to add this to the rendering as part of this PR, or would you prefer to do that in a follow-up

I'd be happy to do it all in one PR. I'll work on incorporating the feedback thus far soon, probably finishing sometime next week.

@troyraen
Contributor

The rendering actually requires some configuration that would be easier to do in a separate PR. We can handle that bit once this one gets merged. All of the above comments would be good to address, but here's the minimum necessary before it can be added to the rendering and published:

  • Loosen the dependency version requirements where possible.
  • Remove the TOC.
  • Better solution for the alt text in images. Holler if you continue to have trouble with it and we can help more to work something out.
  • Remove the try/except if possible. If it is needed, restrict the except to catch only the error you're expecting.

adrianlucy and others added 4 commits July 29, 2025 14:27
-delete TOC

-loosen requirements

-switch from dask to multiprocessing module

-rename row to obs throughout

-rename exists function to not_none

-more elegant wrangling of yso_coords

-admonitions instead of html

-preface alt text more clearly

-further refine multiprocessing, add more workers for IO bound task

-move parse_polygon to code_src

-change requirement to astroquery 0.4.11, but add admonition that on Fornax need to install boto3 currently

-fix function docstrings

-use Quantities for wavelengths
@adrianlucy
Contributor Author

Hi all - sorry for the extraordinary delay, we're a bit short-staffed here. I believe I've now addressed all comments.

In particular, to Troy's compiled list of all the items that need to be addressed before this notebook can be added to the rendering/published: all have been done except fixing the alt text, which doesn't seem feasible with the resources I have access to (though I did revise the language to make it clearer that it's alt text). See my post just now in the discussion thread for alt text, for more details: #443 (comment) Let me know if you have any ideas!

Many other comments were addressed, and a summary change log is in this commit: 5da4e9d The main highlight is the switch from dask to multiprocessing, and a little further tuning in multiprocessing.

I also wanted to bring to your attention that boto3 is not installed in the Default Astrophysics kernel. I think maybe it should be, since at least astroquery.mast needs it for retrieval of AWS cloud data. Anyway, I added an admonition tag noting that the user needs to install it, which we can remove if the situation changes.

Let me know:

  1. if we can merge this in, and
  2. what the next step is to get this added to the rendering.

Thank you! Hope you're all doing okay, and sorry again for the delay.

@troyraen
Contributor

Thanks Adrian and no worries!

@bsipocz can you take another look here? I plan to as well.

@troyraen troyraen requested review from bsipocz and troyraen December 11, 2025 04:47
Contributor

@troyraen troyraen left a comment


Thanks @adrianlucy, this is an interesting use case. Thanks for putting so much work into it! Pending @bsipocz's review, I think we should go ahead and merge this PR and move forward with releasing the notebook.

I took a quick look with respect to the notebook review checklists. The major thing I want to check on is the multi-archive access (since it's a key characteristic of notebooks in this repo):

  • Does it include all three archives HEASARC, MAST, IRSA?
    • if not, is that justified?
  • Has each NASA archive been given the option to comment on modules for their relevant data access?
    • Discourse staff channel is one method of contacting people to check for this

Have you considered including any HEASARC datasets? And have you checked with IRSA about SOFIA/Herschel/Spitzer spectral cube access? (Maybe @jkrick covered that in her review?) If there are preferred improvements and/or additional datasets, they don't necessarily need to be implemented immediately but let's at least make a plan before releasing the notebook.

I left some minor suggestions in code comments below. It's mostly about clarifying function input as that's something we've gotten feedback from users about.

Comment thread spectral_cubes/emission_search_cubes.md Outdated
Comment thread spectral_cubes/emission_search_cubes.md
Comment thread spectral_cubes/code_src/parse_polygon.py
Member

@bsipocz bsipocz left a comment


Thanks for the updates to this PR. I have a lot of comments here, some of them are blockers while others are questions or pointers for follow-ups.

Also, please note that the review was done on the code without running the notebook; execution is currently stuck on L332 with some traceback coming from the cell, but it's still running. So ultimately more comments could possibly come, but I'm not blocking this review on that right now.

Comment thread spectral_cubes/requirements_emission_search_cubes.txt Outdated
Comment on lines +1091 to +1101
### References

This notebook relies on the following papers:
- Assani et al. 2025 ([2025arXiv250402136A](https://ui.adsabs.harvard.edu/abs/2025arXiv250402136A/abstract))
- Karska et al. 2025 ([2025A&A...697A.186K](https://ui.adsabs.harvard.edu/abs/2025A%26A...697A.186K/abstract))

And the following software:
- Astroquery; Ginsburg et al. 2019 (2019AJ....157...98G)
- Astropy; Astropy Collaboration 2022, Astropy Collaboration 2018, Astropy Collaboration 2013 (2022ApJ...935..167A, 2018AJ....156..123A, 2013A&A...558A..33A)
- Matplotlib; Hunter 2007 (2007CSE.....9...90H)
- SciPy; Pauli et al. 2020 (2020NatMe..17..261V)
Member


Myst can handle references super nicely, I would recommend using that rather than a manual list at the bottom: https://mystmd.org/guide/citations

Contributor Author

@adrianlucy adrianlucy Feb 12, 2026


I'm going to say that (judging by the approach taken so far by other Fornax notebooks) this is not needed for MVP, if that's okay. But I agree that's a good idea.

Comment on lines +363 to +382
```{code-cell} ipython3
tmc1a_meta = Simbad.query_object('TMC1A')
tmc1a_ra = float(tmc1a_meta['ra']) # ICRS RA of TMC1A in degrees
tmc1a_dec = float(tmc1a_meta['dec']) # ICRS Dec of TMC1A in degrees
```

...and retrieve its observations from our table of JWST spectral cube observations of YSOs:

```{code-cell} ipython3
tmc1a_jwst = []

# As before...
for obs in yso_jwst_obstable:
    polygon = parse_polygon(obs['s_region'])
    polygon_path = Path(polygon, closed=True)
    # We'll use matplotlib.Path.contains_point (singular) this time,
    # instead of .contains_points (plural)
    is_inside = polygon_path.contains_point((tmc1a_ra, tmc1a_dec))
    if is_inside:
        tmc1a_jwst.append(obs)
```
Member


This should be possible natively with astropy and regions; please use that rather than brute force mpl.

Contributor Author


I still disagree; please see my comment linked here (and the preceding comment above it) in our previous discussion about this, and we can continue the conversation in that thread if necessary. #443 (comment)
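For readers following the containment discussion: the test matplotlib's Path performs boils down to ray casting, which a few lines of pure Python can illustrate. A self-contained flat-plane sketch (not the notebook's code, and it ignores the spherical-geometry subtleties that the gnomonic projection addresses):

```python
def point_in_polygon(x, y, vertices):
    """Even-odd ray casting: cast a ray to the right, count edge crossings.

    `vertices` is a list of (x, y) tuples. RA wraparound and cos(dec)
    distortion are deliberately ignored in this flat-plane sketch.
    """
    inside = False
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        # Does a horizontal ray from (x, y) cross the edge (x1,y1)-(x2,y2)?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(point_in_polygon(0.5, 0.5, square))  # True
print(point_in_polygon(2.0, 0.5, square))  # False
```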

Comment on lines +337 to +344
# Convert back to a pandas dataframe
results = pd.DataFrame(results)

# Remove empty rows, where check_points_in_polygon reached the "else" part of the return statement.
results = results.dropna(how='all')

# Convert back to an astropy table.
yso_jwst_obstable = Table.from_pandas(results)
Member


Why convert back and forth between pandas and astropy? If you want to work with astropy Tables, then stick with those all the way.

Contributor Author


I looked at this a while back. Unfortunately, looping through an equivalent iterable of an astropy table (the output/input of astroquery) drastically hurts multiprocessing performance relative to iterating through a pandas dataframe, in this case. But I have cleaned up the code further up to define an iterable straight away rather than bothering to define a pandas table.

Comment thread spectral_cubes/code_src/detect_spikes.py Outdated
Comment thread spectral_cubes/emission_search_cubes.md Outdated
plt.show()
```

> Figure 1 alt text: a line plot of "Fraction of non-NaN pixels above threshold" versus "Wavelength bin index", the latter ranging from around 0 to 1000. The plotted values have substantial scatter and features, but one feature stands out by a factor of two above all others: a narrow, abrupt spike near wavelength bin index 548.
Member


It's great to have alt-text for the figures, but I have serious concerns about its accuracy over time. Also, for actual alt-text I strongly suggest using some mystmd-native approach for embedding the output into a figure directive.
Or, as this code cell is nothing but plotting, use the code-cell caption?

What to do with this is more of a question to @troyraen -- I don't think we can guarantee that any of these type of alt-text would stay accurate, unless we somehow auto-generate it from the data/with matplotlib.

Contributor


I don't think we can guarantee that any of these type of alt-text would stay accurate

Do we expect the figures to change if the notebook/code itself doesn't change?

Member


Sure, we don't expect that, but it may still change if something goes wrong unnoticed -- thus the alt-text can bit rot. Captions are usually more abstract, as they don't really spell out details and numerical values from the figure, so they should not be affected.

Contributor Author

I strongly suggest to use some mystmd native approach for embedding the output into a figure directive.

As discussed in an existing thread here #443 (comment), I welcome your ideas but I don't personally see how to implement a near-term solution in that direction. FWIW our team thinks that putting alt text in block quotes is the best quick stop-gap solution/universal design approach we have right now.

Or, as this code cell is nothing but the plotting, use the code-cell caption?

Interesting, but I can't quite figure out what you're referring to. Are you using Quarto? https://quarto.org/docs/reference/cells/cells-jupyter.html Or are you talking about something else?

Sure, we don't expect that but it may still change if something goes wrong unnoticed -- thus the alt-text can bit rot

Certainly, accessibility has a labor cost, but there are 508/ADA requirements at the very least. In any case, I believe I have written this alt text to be flexible enough that it should not need maintenance.

But of course, I leave it up to you.

Member

Interesting, but I can't quite figure out what you're referring to. Are you using Quarto? quarto.org/docs/reference/cells/cells-jupyter.html Or are you talking about something else?

No, we're using Jupyter-book and the mystmd engine: https://mystmd.org/guide/code#code-blocks

Putting captions properly into captions and alt-text into alt-text would help with accessibility much better than leaving it to the screen readers to scrape the quoted text for alt-text.

Here is another entry point; but also would recommend following the links to the {figure} and {image} directives.
https://mystmd.org/guide/reuse-jupyter-outputs#alternative-text-for-accessibility
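
For concreteness, the embed pattern that page describes looks roughly like this (a sketch only; the label, option names, and colon-fence syntax should be checked against the mystmd docs rather than taken as exact): label the code cell, then embed its output in a `{figure}` directive that carries the alt text and caption.

````markdown
```{code-cell} ipython3
:label: spike-fraction-plot

plt.plot(test_blobs, color='k')
plt.xlabel('Wavelength bin index')
plt.ylabel('Fraction of non-NaN pixels above threshold')
plt.show()
```

:::{figure} #spike-fraction-plot
:alt: Line plot with a single narrow spike standing out near wavelength bin index 548.

Fraction of non-NaN pixels above threshold, per wavelength bin.
:::
````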

Contributor Author

Oo, very interesting, thank you! I'll try that today.

Contributor Author

adrianlucy commented Feb 13, 2026

@bsipocz, I feel foolish but I can't figure out how this helps include alt-text or a caption in a notebook. If you think it can do that, could you perhaps show the change to the markdown that you think would work? I can't get it to work. e.g. for something easy to test I tried this in the markdown:

```{code-cell} ipython3
:caption: TESTING CAPTIONS

plt.plot(test_blobs, color='k')
plt.xlabel('Wavelength bin index')
plt.ylabel('Fraction of non-NaN pixels above threshold')
plt.show()
```

"TESTING CAPTIONS" does not appear in the executed notebook, and in my reading nothing in the mystmd documentation suggests it would. But maybe I'm doing something wrong.

Perhaps you mean that the alt-text or caption would manifest in the render. But isn't it better for alt-text to be available when interacting with the notebook itself in the Fornax platform, even if imperfectly?

Thank you!

Comment thread spectral_cubes/emission_search_cubes.md
return obs
```

Phew! Let's try this out on our TMC1A observations.
Member

Please keep the narrative professional.

Contributor Author

Sure, revised. FWIW it might be useful to write a style guide that explains the tone you are aiming for more concretely. While the content you want removed is clear in this case, "professional" is a moving target; the MAST Notebook Repository, for our part, generally does not have the same position on conversational tone.

@troyraen
Contributor

Also, please note that the review was done on the code without running the notebook, execution of it is currently stuck on L332 with some traceback coming from the cell, but it's still running.

I believe that multiprocessing on OSX or Windows requires that the functions be in .py files and imported, so I would recommend moving them out of the notebook. (When functions are defined directly in the notebook, they only exist in-memory in the main process and do not get copied into child processes.)
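
As a minimal sketch of why this matters (illustrative names, not code from the notebook): with the spawn start method that macOS and Windows use by default, a worker function must live somewhere the child process can re-import it.

```python
import multiprocessing as mp

def square(x):
    # Under the spawn start method, child processes re-import the main
    # module rather than inheriting the parent's memory, so this function
    # must be defined at module level in an importable file, not only in
    # a notebook cell that exists in the parent's memory.
    return x * x

if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # the default on macOS and Windows
    with ctx.Pool(processes=2) as pool:
        results = pool.map(square, range(5))
    print(results)  # [0, 1, 4, 9, 16]
```

The same code with `square` defined only in a notebook cell typically fails under spawn with a pickling or attribute-lookup error, which is why moving the functions into `.py` files fixes it.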

@bsipocz
Member

bsipocz commented Dec 23, 2025

Also, please note that the review was done on the code without running the notebook, execution of it is currently stuck on L332 with some traceback coming from the cell, but it's still running.

I believe that multiprocessing on OSX or Windows requires that the functions be in .py files and imported, so I would recommend moving them out of the notebook. (When functions are defined directly in the notebook, they only exist in-memory in the main process and do not get copied into child processes.)

This made me think: what if we save out a cell -- or do something in a dropdown? That could keep the best of both worlds.
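
A rough illustration of the mechanics behind that idea (the module and function names here are hypothetical, not the notebook's actual code): the cell body gets written to a `.py` file, which is then imported so that spawned workers can find the function by name.

```python
from pathlib import Path
import importlib
import sys
import tempfile

# Stand-in for what %%writefile does: persist the cell source as a module.
module_dir = tempfile.mkdtemp()
Path(module_dir, "worker_funcs.py").write_text(
    "def detect_spikes(values, threshold=2.0):\n"
    "    return [i for i, v in enumerate(values) if v > threshold]\n"
)

# Import the freshly written module; child processes spawned later can
# re-import it by name because it now exists on disk and on sys.path.
sys.path.insert(0, module_dir)
worker_funcs = importlib.import_module("worker_funcs")

print(worker_funcs.detect_spikes([0.1, 3.5, 0.2, 2.7]))  # [1, 3]
```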

- Biggest change is support for macOS and Windows by saving and importing the functions defined in the notebook
- See PR review responses for details
@adrianlucy
Contributor Author

From @troyraen:

Have you considered including any HEASARC datasets? And have you checked with IRSA about SOFIA/Herschel/Spitzer spectral cube access? (Maybe @jkrick covered that in her review?) If there are preferred improvements and/or additional datasets, they don't necessarily need to be implemented immediately but let's at least make a plan before releasing the notebook.

I did consider this already, but I don't think there are any cubically-formatted spectral images in the HEASARC, so creating new algorithms for HEASARC would greatly overcomplicate the narrative. IRSA is discussed in the last section of the notebook; @jkrick please do let me know if you see anything wrong there. My suggestion would be that we plan to look at the possibility of expanding the IRSA section when SPHEREx all-sky data cubes are released.

@adrianlucy
Contributor Author

From @troyraen:

I believe that multiprocessing on OSX or Windows requires that the functions be in .py files and imported, so I would recommend moving them out of the notebook. (When functions are defined directly in the notebook, they only exist in-memory in the main process and do not get copied into child processes.)

From @bsipocz:

This made me thing that what if we save out a cell -- or do something in a dropdown? That could keep the best of both worlds.

Thanks very much for the great catch. I'm now implementing @bsipocz's "save out a cell" idea, which I like, with the %%writefile magic. That does introduce a little complexity with imports, but I think it's worth it.

Note that there's another nuance here because passing in e.g. cloud_uri_map in every iteration would be a drastic performance hit, and cloud_uri_map is generated on the fly in the notebook. What I'd done before is pass it in within a higher wrapper function, but that can't be done with an imported function.

So what I've done now is create an option to serialize such objects (like cloud_uri_map) on the fly in a pickle file, deserializing in every iteration. This turns out to be only a ~10% performance hit, which seems totally fine for this.
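
Sketched concretely (the variable and file names are illustrative, not the notebook's exact code), the pattern is: serialize the large object once in the notebook, then have the imported worker function deserialize it on each call instead of receiving it as an argument.

```python
import os
import pickle
import tempfile

# In the notebook: build the large lookup object once and pickle it.
cloud_uri_map = {f"obs_{i}": f"s3://bucket/cube_{i}.fits" for i in range(3)}
pickle_path = os.path.join(tempfile.gettempdir(), "cloud_uri_map.pkl")
with open(pickle_path, "wb") as f:
    pickle.dump(cloud_uri_map, f)

# In the imported worker module: re-load the map per call, so the function
# stays importable and needs no closure over in-memory notebook state.
def lookup_uri(obs_id, path=pickle_path):
    with open(path, "rb") as fh:
        return pickle.load(fh)[obs_id]

print(lookup_uri("obs_1"))  # s3://bucket/cube_1.fits
```

Re-reading the pickle on each iteration trades a small, roughly constant deserialization cost for not shipping the object through the multiprocessing pickler on every task.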

@adrianlucy
Contributor Author

adrianlucy commented Feb 12, 2026

Thank you all very much for the second round of comments, @troyraen @jkrick @bsipocz. I've done my best to incorporate them, and I think the notebook is much improved from where we started.

I'm hopeful that, wherever at all possible, we can consider the current version to be good enough for an initial merge and rendering, while identifying priorities for future improvement. Leadership here would quite like to get this added to the rendering.

From @troyraen's initial summary of MVP to be added to the rendering, everything is implemented except finding an alternative solution for alt text, which I don't currently see a way forward for on my end. (See two threads above for discussion on that topic: #443 (comment), #443 (comment))

Let me know what you think. Thank you!

@troyraen
Contributor

Thanks for your patience and persistence @adrianlucy. This is a lot of work! I haven't tracked all the recent conversations yet, but after a quick look, I see that you've certainly been responsive to feedback. And also have two approving reviews. I'm headed out for a long weekend but will follow up next week.

%%writefile magic hadn't been on my radar - looking forward to seeing if it's a solution for problems with a couple notebooks I wrote for IRSA!

@troyraen
Contributor

I think we're ready to merge this. After it's merged, our team will work on some integration and deployment tasks. Things I'm aware of that are likely to be included are:

  • Merge the files into the existing spectroscopy directory.
  • Reframe the section currently called "Postscript" to be an introduction to spectral cube products from other NASA missions that highlights the similarities, with less focus on how hard it would be to do exactly the same thing as is done for JWST.
  • Configure the notebook to be tested by CI. We may need to limit the number of sources in the input sample in order to do this. We would leave all the code to get the full sample but then proceed to operate on a subset by default. That would also help users by providing a notebook that runs by default on modest compute resources in a shorter amount of time while still giving interested users the option to run it at full scale. This is how we've handled this for the other notebooks in this repo, which have the same tension between wanting a scientifically robust sample and needing a notebook that doesn't require a ton of compute power and/or time by default.
  • Add it to the website and Fornax deployments.

After it's released, we'll reach out to you or MAST if/when we become aware of something substantive that needs to be updated in the notebook text or code. You're also welcome to make other updates on an ongoing basis.

Do you have any comments about the above, or anything else you want to add to this PR before merging, @adrianlucy?

@adrianlucy
Contributor Author

@troyraen That all sounds good to me—thank you!

Don't hesitate to let me know if you need anything from me during this process. I'm out Wednesday and Thursday this week, but back Friday. (PS, if you happen to have an approximate guess on when it will be added to the website rendering, there are folks asking me for updates. No worries if not!)

Many thanks for the work that you all have and will put into this, and for your extremely useful feedback throughout this review.

@troyraen troyraen merged commit 14c6278 into nasa-fornax:main Feb 26, 2026
0 of 2 checks passed
github-actions Bot pushed a commit that referenced this pull request Feb 26, 2026