Add analytical data search notebook by adrianlucy · Pull Request #443 · nasa-fornax/fornax-demo-notebooks

adrianlucy · 2025-07-10T22:21:28Z

This notebook from MAST for Fornax, "Beyond metadata: analytical data search in the cloud with JWST spectral cubes", answers a data-intensive archival search query ("I want to find JWST spectral image cubes containing Fe II emission from jets launched by young stellar objects"), with broader pedagogical goals related to large-scale target coordinate searches beyond the capabilities of archive services, parallelizing with dask, algorithmically searching for select data characteristics, and loading data from cloud URIs into memory.

The focus is on JWST spectral cubes from MAST, but I hope that the strategies taught will be of more general use for other missions and archives. I also touch briefly on SOFIA data from IRSA at the end, and hope to maybe switch that to SPHEREx in a future PR when more SPHEREx data is available.

The notebook has been reviewed internally by several staff internal to MAST, their feedback has been incorporated, and we are ready for review by the Fornax team!

A couple practical notes:

You may need to pip install the requirements file. I've recently noticed that the Default Astrophysics kernel in Fornax no longer has boto3 and pyarrow, but it used to as recently as a couple weeks ago. I'm wondering if this might be accidental; I'll file an issue later today or tomorrow. (Edit: actually I'm not sure where I'm supposed to report that. Probably not this repo...)
In any case, you need a more recent astroquery version than the one currently installed in Fornax (last time I checked). This is discussed in the notebook.
An executed and rendered copy is temporarily available in a temporary repo, in case it helps illuminate the issue if something goes wrong.

troyraen · 2025-07-11T17:27:24Z

Thanks @adrianlucy! A couple folks will take a look, likely next week. We usually do a science review and a technical review. The checklists are posted here https://github.com/nasa-fornax/fornax-documentation/blob/main/notebook_review_process.md, in case you haven't seen them yet. Notebooks go through a two step process: first step is into the repo and the second gets it into the published tutorials here https://nasa-fornax.github.io/fornax-demo-notebooks/. This PR is the first step. The notebook doesn't have to meet all of the checklist items before merging, but that will give you an idea of what the reviewers will look for. Items not covered in this PR can be handled before or during step two.

jkrick

This notebook looks great and is very impressive! Thank you for writing it. I have only a few, very minor, comments, with no major changes requested.

I would suggest a different title, as I had no idea what to expect from the notebook when I first saw its title. Maybe something like: "detect_emission_spectral_cubes" or "detect_FE_YSOs" or "search_YSO_cubes" or "FE_emission_YSOs" or ??

jkrick · 2025-07-15T17:24:15Z

+
+++
+
+For this notebook, we recommend using the [Fornax science platform](https://nasa-fornax.github.io/fornax-demo-notebooks/documentation/README.html) with Server Type = "Large - 64GB RAM/16 CPU" and Environment = "Default Astrophysics". As of June 2025, running this notebook once on that server will take about 10 minutes.


A suggestion to use the Medium server as a benchmark, only because resources are limited on Fornax and it will run to completion on the medium server ( I just did it, but didn't time it).

I get a total 33 min on Medium, vs. 10 min on Large.

I definitely appreciate Fornax's concerns about overtaxing your servers, and I'm not very familiar with what expected resource utilization looks like on your end once you leave beta. But I do have a couple concerns about this proposed change that I'd appreciate your thoughts on:

As a user, it feels a lot more interactive and conducive to trial-and-error/scientist workflows to have a single cell take 2 minutes (on Large) vs. 13 minutes (on Medium).

It's a lot harder to sell a user on the utility of parallelizing when using dask only gets you a ~1.5x speed-up (on Medium), as opposed to a ~4x speed-up (on Large).

The performance I'm seeing on that second point is interesting. For the results below, I'm looking at:

Without dask: L276-292

With Dask: Take L375, but for a fair comparison remove gnomonic projection from the check_points_in_polygon function.*

And I get:

Large server, without Dask: 9.4 minutes, <10% CPU usage

Large server, with Dask: 2.4 minutes, 70--80% CPU usage

Medium server, without Dask: 12 minutes, 30--40% CPU usage

Medium server, with Dask: 7.5 minutes, 80% CPU usage

So, a lot going on here. Maybe there's some overhead imposed by using dask that makes using it on 4 cores in this case barely worth it? Also, seems like maybe we're using more than one core even without using dask, at least on Medium.

In any case, my main concern is that the improvement from parallelizing is much more salient and instructive to the user on the Large server (4x) than on Medium (1.5x).

*check_points_in_polygon() without gnomonic projection, used for the "with Dask" results above:

def check_points_in_polygon(row, coordinates_to_test): polygon = parse_polygon(row[‘s_region’]) polygon_path = Path(polygon, closed=True) if polygon_path.contains_points(coordinates_to_test).any(): return row else: return pd.Series([np.nan] * len(row), index=row.index)

probably, given all of this valuable information, we should just say both server sizes in the runtime section with their respective times, or maybe that mini-table with the 4 lines and their runtimes + CPU usages. Likely people aren't going to be reading the runtime section to figure out how to optimize their codes and weather or not to use dask, but it is valuable info to keep around.

Good idea!

(Notes to self so I don't forget, no need to read: 1. If I show that table, or something like it in prose, maybe do with gnomonic projection instead, for all combinations. 2. Check for instances throughout notebook where I talk about how long a cell will take, and maybe include estimates for both medium and large. Similarly, if I talk about speed-up from parallelizing, mention that this is dependent on the number of available cores. Also fix the place where I say "next cell" referring to like 3 cells downwards.)

You might want to consider using python's builtin multiprocessing for the parallelization instead of dask. After a quick look at your code, I'm guessing the calls with dask are using the same number of workers as there are CPU, so 4 on Medium server and 16 on Large. While the 4x speedup on Large is much better than the 1.5x on Medium, both are poor relative to the number of workers (and CPU, and probably memory as well) being used. If the parallelization were going well, I would expect a 16x speedup from 16 workers. As you guessed above, dask comes with a lot of overhead. The way the code is written also affects the efficiency of the parallelization but, while I didn't look at your code in enough detail to comment on it, I've had enough experience with dask that is similar to what you're reporting here that I suspect dask just isn't doing a very good job running it efficiently. Perhaps dask can be made to do better, but unfortunately I haven't seen much documentation instructing users about how to do that (though I haven't checked in quite awhile). Also, I'm hearing reports of significant issues with dask development right now with things in broken states and few developers available to fix it. So multiprocessing seems the better option all around. But that being said, I would certainly understand if you don't want to put the effort into switching your code at this point.

@troyraen Interesting, thank you! Yeah, dask wasn't a deeply considered choice on my part, I was just following guidance that it's the strategic direction we wanted to focus on. But we shouldn't recommend it in cases where it's the wrong choice.

Indeed, I am seeing much better results with multiprocessing (with num_workers = num_cores). Relative to no explicit parallelization at all, now a 10x speed-up on Large (was 4x with dask), 3x speed-up on Medium (was 1.5x with dask). CPU usage is now ~100% (was ~70-80% with dask).

The choice between dask and multiprocessing may also have to do with the function being parallelized being very brief, in this case ~30--90 milliseconds per usage.

The speed up is still short of [number of cores]x. The reason may be that my code needs improvement, but my best guess is that it's at least partially because some python libraries are implicitly managing to make use of multiple cores even without explicit parallelization. On medium, CPU usage without explicit parallelization was 30--40% according to the kernel usage tab in Fornax, so a 3x speed-up from parallelizing seems about right. On large, CPU usage without explicit parallelization was 7%, so our 10x speed-up is a bit short of the 14x speed-up I might expect—but in this case 14x vs. 10x is a difference of 15 seconds so I could see start-up/wrap-up overhead being an issue. No improvement from increasing chunksize.

Anyway, I think I will indeed switch this to multiprocessing. And I guess I'll try to see if changing the parallelized function to apply to batches of rows instead of individual rows may yield better performance on either or both of dask and/or multiprocessing.

Oh good, ya those are obviously better speedups. And I should have added a few more words to this sentence since the overhead may not be completely surmountable... "If the parallelization were going well, I would expect closer to a 16x speedup from 16 workers." There are probably things you can tweak to get more efficiency, but also fine to leave as is.

I should also mention, multiprocessing is not without issues. One in particular is that the behavior is a bit different on Linux, Mac, and Windows and it can be tricky to write code that works out of the box on all of them. In this repo we assume Linux, since that's what Fornax Console runs, and then include some text with info/instructions for Mac and Windows if we can. We've been hoping that dask would eventually be a better solution but so far haven't had enough success to fully adopt it.

And I guess I'll try to see if changing the parallelized function to apply to batches of rows instead of individual rows

One simple way to do this is with the chunksize argument. The default is usually 1 which means that a new task is generated for every item in the input iterable, whereas a chunksize of (eg) 100 only generates a new task for every 100 items and so reduces the overhead. My rule of thumb is to generate at least 2-3 chunks per worker using something like chunksize = min(100, len(my_iterable) // (3 * nworkers)). Making chunks too big can result in some workers finishing a lot earlier than others and then just sitting around idle because all tasks have already been assigned out.

Another thing you can look into is the different functions that are available and what they're recommended for. For example, if the iterable is very long and the order in which results are returned is not important, Pool.imap_unordered() is recommended. But again, the 3x and 10x speedups are already reasonably good.

Switched from dask to multiprocessing. One interesting nuance is that a task with an I/O bound component (opening FITS files) benefits from having more workers than cores, so that a core always has something to do.

Runtimes for both medium and large server now included, removed advice to use large server.

(Also I did try Pool.imap_unordered, changing chunksize, etc., but didn't find anything that improved performance except as noted above.)

Co-authored-by: jkrick <jkrick@caltech.edu>

adrianlucy · 2025-07-16T20:01:56Z

I would suggest a different title, as I had no idea what to expect from the notebook when I first saw its title. Maybe something like: "detect_emission_spectral_cubes" or "detect_FE_YSOs" or "search_YSO_cubes" or "FE_emission_YSOs" or ??

Thank you @jkrick! Just to check, when you ask for a different title, are you only referring to the filename, or do you also suggest changing the in-notebook title ("Beyond metadata: analytical data search in the cloud with JWST spectral cubes")?

Also, do you want the directory holding the notebook to be named the same as the file? Currently, the directory name is even more generic - I guess I was imagining that maybe you'd want to group this notebook with ones sharing similar goals in the future, or something like that.

jkrick

I can see the rationale for the folder name being either data_search or something more specific. You can choose and then if we feel it doesn't fit later, we can always change it. And I think the in-notebook title is also fine. It really was a comment about the filename itself. Other people might have stronger opinions on this organization question.

jkrick · 2025-07-16T21:31:16Z

+
+++
+
+For this notebook, we recommend using the [Fornax science platform](https://nasa-fornax.github.io/fornax-demo-notebooks/documentation/README.html) with Server Type = "Large - 64GB RAM/16 CPU" and Environment = "Default Astrophysics". As of June 2025, running this notebook once on that server will take about 10 minutes.


probably, given all of this valuable information, we should just say both server sizes in the runtime section with their respective times, or maybe that mini-table with the 4 lines and their runtimes + CPU usages. Likely people aren't going to be reading the runtime section to figure out how to optimize their codes and weather or not to use dask, but it is valuable info to keep around.

bsipocz

first round (dry-run -- I haven't run the notebook) reviews. The biggest issue is with the very right requirements.

And the manual TOC will certainly be an issue as this further evolves.

Also, I wonder if we can try to add this to the rendering as part of this PR, or would you prefer to do that in a follow-up (e.g. TOC cleanup can be postponed to that one).

bsipocz · 2025-07-23T18:11:33Z

+plt.show()
+```
+
+> Figure 4 image description: three astronomical images in pixel coordinates, side by side, display three different wavebin slices of a spectral cube. The left and right images are labeled "-10 wavebins away" and "+10 wavebins away", respectively, and feature a bright central source with faint hints of adjacent extention to the lower right. The middle image is labeled "5.34 micron line". In the middle image, the central source is much dimmer, emission extends all the way to the lower-right corner, and a blob in the lower-right corner is the brightest feature, in a region where the other two images exhibit no emission.


Are these meant to be ALT-texts? If yes, then we should be able to work it properly into the figure. If it's for all users, then I would suggest to work it more into the narrative text, this notebook has very little of that.

Yes, they are intended to be a best effort at alt text. My understanding was that matplotlib unfortunately doesn't have alt text support, and Jupyter in general is not well designed for accessibility. Very open to ideas!

I know that there has been some recent work on it in the wider ecosystem, and also asked around.
What's certain atm, is that myst supports alt text for figures, but you may need to wizard around to have it added to the mpl generated figure.

https://myst-parser.readthedocs.io/en/latest/syntax/images_and_figures.html

This is something that MAST is very interested in as well---it's really critical with so much user support moving to notebooks on science platforms---but we haven't made much progress on it. We've had conversations with QuantStack about notebook accessibility issues, and it sounds like they recently got some funding for that general topic.

I took a look at the link you mentioned, but I suspect where you say "wizard around" might unfortunately mean "carry out a major work effort". Our team thinks that putting alt text in block quotes is the best quick stop-gap solution we have right now. Note that, for sighted users, this looks like the attached screenshot.

Anyway, I changed "image description" to say "alt text" for more clarity. I can't think of any better solution that we have the capacity for at MAST. Let me know if you want me to delete the alt text, but my team strongly believes that imperfectly located alt text is better than no alt text.

And please let me know if you do come up with any alternatives!

adrianlucy · 2025-07-23T18:23:07Z

Also, I wonder if we can try to add this to the rendering as part of this PR, or would you prefer to do that in a follow-up

I'd be happy to do it all in one PR. I'll work on incorporating the feedback thus far soon, probably finishing sometime next week.

troyraen · 2025-07-24T07:51:03Z

The rendering actually requires some configuration that would be easier to do in a separate PR. We can handle that bit once this one gets merged. All of the above comments would be good to address but here's the list of minimum necessary before it can be added to the rendering to get published:

Loosen the dependency version requirements where possible.
Remove the TOC.
Better solution for the alt text in images. Holler if you continue to have trouble with it and we can help more to work something out.
Remove the try/except if possible. If it is needed, restrict the except to catch only the error you're expecting.

-delete TOC -loosen requirements -switch from dask to multiprocessing module -rename row to obs throughout -rename exists function to not_none -more elegant wrangling of yso_coords -admonitions instead of html -preface alt text more clearly -further refine multiprocessing, add more workers for IO bound task -move parse_polygon to code_src -change requirement to astroquery 0.4.11, but add admonition that on Fornax need to install boto3 currently -fix function docstrings -use Quantities for wavelengths

adrianlucy · 2025-12-02T23:55:28Z

Hi all - sorry for the extraordinary delay, we're a bit short-staffed here. I believe I've now addressed all comments.

In particular, to Troy's compiled list of all the items that need to be addressed before this notebook can be added to the rendering/published: all have been done except fixing the alt text, which doesn't seem feasible with the resources I have access to (though I did revise the language to make it clearer that it's alt text). See my post just now in the discussion thread for alt text, for more details: #443 (comment) Let me know if you have any ideas!

Many other comments were addressed, and a summary change log is in this commit: 5da4e9d The main highlight is the switch from dask to multiprocessing, and a little further tuning in multiprocessing.

I also wanted to bring to your attention that boto3 is not installed in the Default Astrophysics kernel. I think maybe it should be, since at least astroquery.mast needs it for retrieval of AWS cloud data. Anyway, I added an admonition tag noting that the user needs to install it, which we can remove if the situation changes.

Let me know:

if we can merge this in, and
what the next step is to get this added to the rendering.

Thank you! Hope you're all doing okay, and sorry again for the delay.

troyraen · 2025-12-11T04:46:55Z

Thanks Adrian and no worries!

@bsipocz can you take another look here? I plan to as well.

troyraen

Thanks @adrianlucy, this is an interesting use case. Thanks for putting so much work into it! Pending @bsipocz's review, I think we should go ahead and merge this PR and move forward with releasing the notebook.

I took a quick look with respect to the notebook review checklists. The major thing I want to check on is the multi-archive access (since it's a key characteristic of notebooks in this repo):

Does it include all three archives HEASARC, MAST, IRSA?

if not, is that justified?

Has each NASA archive been given the option to comment on modules for their relevant data access?

Discourse staff channel is one method of contacting people to check for this

Have you considered including any HEASARC datasets? And have you checked with IRSA about SOFIA/Herschel/Spitzer spectral cube access? (Maybe @jkrick covered that in her review?) If there are preferred improvements and/or additional datasets, they don't necessarily need to be implemented immediately but let's at least make a plan before releasing the notebook.

I left some minor suggestions in code comments below. It's mostly about clarifying function input as that's something we've gotten feedback from users about.

bsipocz

Thanks for the updates to this PR. I have a lot of comments here, some of them are blockers while others are questions or pointers for follow-ups.

Also, please note that the review was done on the code without running the notebook, execution of it is currently stuck on L332 with some traceback coming from the cell, but it's still running. So ultimately more comments could possible come, but I do not block this review it that right now.

bsipocz · 2025-12-12T03:40:43Z

+### References
+
+This notebook relies on the following papers:
+- Assani et al. 2025 ([2025arXiv250402136A](https://ui.adsabs.harvard.edu/abs/2025arXiv250402136A/abstract))
+- Karska et al. 2025 ([2025A&A...697A.186K](https://ui.adsabs.harvard.edu/abs/2025A%26A...697A.186K/abstract))
+
+And the following software:
+- Astroquery; Ginsburg et al. 2019 (2019AJ….157…98G)
+- Astropy; Astropy Collaboration 2022, Astropy Collaboration 2018, Astropy Collaboration 2013 (2022ApJ…935..167A, 2018AJ….156..123A, 2013A&A…558A..33A)
+- Matplotlib; Hunter 2007 (2007CSE.....9...90H)
+- SciPy; Pauli et al. 2020 (2020NatMe..17..261V)


Myst can handle references super nicely, I would recommend using that rather than a manual list at the bottom: https://mystmd.org/guide/citations

I'm going to say that (judging by the approach taken so far by other Fornax notebooks) this is not needed for MVP, if that's okay. But I agree that's a good idea.

bsipocz · 2025-12-12T03:49:06Z

+```{code-cell} ipython3
+tmc1a_meta = Simbad.query_object('TMC1A')
+tmc1a_ra = float(tmc1a_meta['ra'])  # ICRS RA of TMC1A in degrees
+tmc1a_dec = float(tmc1a_meta['dec'])  # ICRS Dec of TMC1A in degrees
+```
+
+...and retrieve its observations from our table of JWST spectral cube observations of YSOs:
+
+```{code-cell} ipython3
+tmc1a_jwst = []
+
+# As before...
+for obs in yso_jwst_obstable:
+    polygon = parse_polygon(obs['s_region'])
+    polygon_path = Path(polygon, closed=True)
+    # We'll use matplotlib.Path.contains_point (singular) this time,
+    # instead of .contains_points (plural)
+    is_inside = polygon_path.contains_point((tmc1a_ra, tmc1a_dec))
+    if is_inside:
+        tmc1a_jwst.append(obs)


This should be possible natively with astropy and regions; please use that rather than brute force mpl.

I still disagree; please see my comment linked here (and the preceding comment above it) in our previous discussion about this, and we can continue the conversation in that thread if necessary. #443 (comment)

bsipocz · 2025-12-12T03:51:33Z

+# Convert back to a pandas dataframe
+results = pd.DataFrame(results)
+
+# Remove empty rows, where check_points_in_polygon reached the "else" part of the return statement.
+results = results.dropna(how='all')
+
+# Convert back to an astropy table.
+yso_jwst_obstable = Table.from_pandas(results)


Why convert back and forth between pandas and astropy? If you want to work with astropy Tables, then stick with those all the way.

I looked at this a while back. Unfortunately, looping through an equivalent iterable of an astropy table (the output/input of astroquery) drastically hurts multiprocessing performance relative to iterating through a pandas dataframe, in this case. But I have cleaned up the code further up to define an iterable straight away rather than bothering to define a pandas table.

bsipocz · 2025-12-23T22:18:53Z

+plt.show()
+```
+
+> Figure 1 alt text: a line plot of "Fraction of non-NaN pixels above threshold" versus "Wavelength bin index", the latter ranging from around 0 to 1000. The plotted values have substantial scatter and features, but one feature stands out by a factor of two above all others: a narrow, abrupt spike near wavelength bin index 548.


It's great to have alt-text for the figure, but I have serious concerns with its accuracy over time. Also, for actual alt-text I strongly suggest to use some mystmd native approach for embedding the output into a figure directive.
Or, as this code cell is nothing but the plotting, use the code-cell caption?

What to do with this is more of a question to @troyraen -- I don't think we can guarantee that any of these type of alt-text would stay accurate, unless we somehow auto-generate it from the data/with matplotlib.

I don't think we can guarantee that any of these type of alt-text would stay accurate

Do we expect the figures to change if the notebook/code itself doesn't change?

Sure, we don't expect that but it may still change if something goes wrong unnoticed -- thus the alt-text can bit rot. Captions are usually more abstract as it doesn't really spell out details and numericals from the figure, so those should not be affected.

I strongly suggest to use some mystmd native approach for embedding the output into a figure directive.

As discussed in an existing thread here #443 (comment), I welcome your ideas but I don't personally see how to implement a near-term solution in that direction. FWIW our team thinks that putting alt text in block quotes is the best quick stop-gap solution/universal design approach we have right now.

Or, as this code cell is nothing but the plotting, use the code-cell caption?

Interesting, but I can't quite figure out what you're referring to. Are you using Quarto? https://quarto.org/docs/reference/cells/cells-jupyter.html Or are you talking about something else?

Sure, we don't expect that but it may still change if something goes wrong unnoticed -- thus the alt-text can bit rot

Certainly, accessibility has a labor cost, but there are 508/ADA requirements at the very least. In any case, in this case I believe I have written the alt text to be flexible enough that it should not need maintenance.

But of course, I leave it up to you.

Interesting, but I can't quite figure out what you're referring to. Are you using Quarto? quarto.org/docs/reference/cells/cells-jupyter.html Or are you talking about something else?

No, we're using Jupyter-book and the mystmd engine: https://mystmd.org/guide/code#code-blocks

Putting captions properly into captions and alt-text into alt-text would help with accessibility much better than leaving it to the screen readers to scrape the quoted text for alt-text.

Here is another entry point; but also would recommend following the links to the {figure} and {image} directives.
https://mystmd.org/guide/reuse-jupyter-outputs#alternative-text-for-accessibility

Oo, very interesting, thank you! I'll try that today.

@bsipocz, I feel foolish but I can't figure out how this helps include alt-text or a caption in a notebook. If you think it can do that, could you perhaps show the change to the markdown that you think would work? I can't get it to work. e.g. for something easy to test I tried this in the markdown:

```{code-cell} ipython3 :caption: TESTING CAPTIONS plt.plot(test_blobs, color='k') plt.xlabel('Wavelength bin index') plt.ylabel('Fraction of non-NaN pixels above threshold') plt.show()

"TESTING CAPTIONS" does not appear in the executed notebook, and in my reading nothing in the mystmd documentation suggests it would. But maybe I'm doing something wrong.

Perhaps you mean that the alt-text or caption would manifest in the render. But isn't it better for alt-text to be available when interacting with the notebook itself in the Fornax platform, even if imperfectly?

Thank you!

bsipocz · 2025-12-23T22:26:09Z

+    return obs
+```
+
+Phew! Let's try this out on our TMC1A observations.


Please keep the narrative professional.

Sure, revised. FWIW it might be useful to write a style guide that explains the tone you are aiming for more concretely. While the content you want removed is clear in this case, "professional" is a moving target; the MAST Notebook Repository, for our part, generally does not have the same position on conversational tone.

troyraen · 2025-12-23T23:21:30Z

Also, please note that the review was done on the code without running the notebook, execution of it is currently stuck on L332 with some traceback coming from the cell, but it's still running.

I believe that multiprocessing on OSX or Windows requires that the functions be in .py files and imported, so I would recommend moving them out of the notebook. (When functions are defined directly in the notebook, they only exist in-memory in the main process and do not get copied into child processes.)

bsipocz · 2025-12-23T23:36:24Z

Also, please note that the review was done on the code without running the notebook, execution of it is currently stuck on L332 with some traceback coming from the cell, but it's still running.

I believe that multiprocessing on OSX or Windows requires that the functions be in .py files and imported, so I would recommend moving them out of the notebook. (When functions are defined directly in the notebook, they only exist in-memory in the main process and do not get copied into child processes.)

This made me thing that what if we save out a cell -- or do something in a dropdown? That could keep the best of both worlds.

-Biggest change is support for macOS and Windows by saving and importing the functions defined in the notebook -See PR review responses for details

adrianlucy · 2026-02-12T09:09:29Z

From @troyraen:

Have you considered including any HEASARC datasets? And have you checked with IRSA about SOFIA/Herschel/Spitzer spectral cube access? (Maybe @jkrick covered that in her review?) If there are preferred improvements and/or additional datasets, they don't necessarily need to be implemented immediately but let's at least make a plan before releasing the notebook.

I did consider this already, but I don't think there are any cubically-formatted spectral images in the HEASARC, so creating new algorithms for HEASARC would greatly overcomplicate the narrative. IRSA is discussed in the last section of the notebook; @jkrick please do let me know if you see anything wrong there. My suggestion would be that we plan to look at the possibility of expanding the IRSA section when SphereX all-sky data cubes are released.

adrianlucy · 2026-02-12T09:21:08Z

From @troyraen:

I believe that multiprocessing on OSX or Windows requires that the functions be in .py files and imported, so I would recommend moving them out of the notebook. (When functions are defined directly in the notebook, they only exist in-memory in the main process and do not get copied into child processes.)

From @bsipocz:

This made me thing that what if we save out a cell -- or do something in a dropdown? That could keep the best of both worlds.

Thanks very much for the great catch. I'm now implementing @bsipocz's "save out a cell" idea, which I like, with the %%writefile magic. That does introduce a little complexity with imports, but I think it's worth it.

Note that there's another nuance here because passing in e.g. cloud_uri_map in every iteration would be a drastic performance hit, and cloud_uri_map is generated on the fly in the notebook. What I'd done before is pass it in within a higher wrapper function, but that can't be done with an imported function.

So what I've done now is create an option to serialize such objects (like cloud_uri_map) on the fly in a pickle file, deserializing in every iteration. This turns out to be only a ~10% performance hit, which seems totally fine for this.

adrianlucy · 2026-02-12T09:28:29Z

Thank you all very much for the second round of comments, @troyraen @jkrick @bsipocz. I've done my best to incorporate them, and I think the notebook is much improved from where we started.

I'm hopeful that, wherever at all possible, we can consider the current version to be good enough for an initial merge and rendering, while identifying priorities for future improvement. Leadership here would quite like to get this added to the rendering.

From @troyraen's initial summary of MVP to be added to the rendering, everything is implemented except finding an alternative solution for alt text, which I don't currently see a way forward for on my end. (See two threads above for discussion on that topic: #443 (comment), #443 (comment))

Let me know what you think. Thank you!

troyraen · 2026-02-13T04:08:32Z

Thanks for your patience and persistence @adrianlucy. This is a lot of work! I haven't tracked all the recent conversations yet, but after a quick look, I see that you've certainly been responsive to feedback. And also have two approving reviews. I'm headed out for a long weekend but will follow up next week.

%%writefile magic hadn't been on my radar - looking forward to seeing if it's a solution for problems with a couple notebooks I wrote for IRSA!

troyraen · 2026-02-25T00:23:59Z

I think we're ready to merge this. After it's merged, our team will work on some integration and deployment tasks. Things I'm aware of that are likely to be included are:

Merge the files into the existing spectroscopy directory.
Reframe the section currently called "Postscript" to be an introduction to spectral cube products from other NASA missions that highlights the similarities, with less focus on how hard it would be to do exactly the same thing as being done for JWST.
Configure the notebook to be tested by CI. We may need to limit the number of sources in the input sample in order to do this. We would leave all the code to get the full sample but then proceed to operate on a subset by default. That would also help users by providing a notebook that runs by default on modest compute resources in a shorter amount of time while still giving interested users the option to run it at full scale. This is how we've handled this for the other notebooks in this repo, which have the same tension between wanting a scientifically robust sample and needing a notebook that doesn't require a ton of compute power and/or time by default.
Add it to the website and Fornax deployments.

After it's released, we'll reach out to you or MAST if/when we become aware of something substantive that needs to be updated in the notebook text or code. You're also welcome to make other updates on an ongoing basis.

Do you have any comments about the above, or anything else you want to add to this PR before merging, @adrianlucy?

adrianlucy · 2026-02-25T03:05:08Z

@troyraen That all sounds good to me—thank you!

Don't hesitate to let me know if you need anything from me during this process. I'm out Wedn and Thurs this week, but back Friday. (PS, if you happen to have an approximate guess on when it will be added to the website rendering, there are folks asking me for updates. No worries if not!)

Many thanks for the work that you all have and will put into this, and for your extremely useful feedback throughout this review.

Add analytical data search notebook 14c6278

Add data search notebook

3f02564

troyraen requested review from bsipocz and jkrick July 11, 2025 16:44

jkrick approved these changes Jul 15, 2025

View reviewed changes

Add runtime section header

f17b8da

Co-authored-by: jkrick <jkrick@caltech.edu>

jkrick reviewed Jul 16, 2025

View reviewed changes

bsipocz reviewed Jul 23, 2025

View reviewed changes

adrianlucy and others added 4 commits July 29, 2025 14:27

Change directory and file names

a2a3cd8

Adjust note about needing dev astroquery

672f282

Remove slideshow metadata

a06baff

adrianlucy added 5 commits December 2, 2025 19:29

Fix typo

43292d1

Fix requirements typo leftover from testing

841933d

Move runtime section under introduction

679a8d0

Remove out-of-date timing estimate

0461a55

Remove dask citation from references

e26ed20

troyraen requested review from bsipocz and troyraen December 11, 2025 04:47

troyraen approved these changes Dec 23, 2025

View reviewed changes

bsipocz reviewed Dec 23, 2025

View reviewed changes

Incorporate second-round feedback on JWST emission line search

5b03033

-Biggest change is support for macOS and Windows by saving and importing the functions defined in the notebook -See PR review responses for details

troyraen merged commit 14c6278 into nasa-fornax:main Feb 26, 2026
0 of 2 checks passed

github-actions Bot pushed a commit that referenced this pull request Feb 26, 2026

Merge pull request #443 from adrianlucy/beyond_metadata

466ae95

Add analytical data search notebook 14c6278

troyraen mentioned this pull request Feb 26, 2026

Integrate and deploy the spectral cubes notebook #650

Open


		+++

		For this notebook, we recommend using the [Fornax science platform](https://nasa-fornax.github.io/fornax-demo-notebooks/documentation/README.html) with Server Type = "Large - 64GB RAM/16 CPU" and Environment = "Default Astrophysics". As of June 2025, running this notebook once on that server will take about 10 minutes.

Conversation

adrianlucy commented Jul 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

troyraen commented Jul 11, 2025

Uh oh!

jkrick left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

adrianlucy Jul 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

adrianlucy Jul 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

adrianlucy commented Jul 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jkrick left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bsipocz left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

adrianlucy commented Jul 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

troyraen commented Jul 24, 2025

Uh oh!

adrianlucy commented Dec 2, 2025

Uh oh!

troyraen commented Dec 11, 2025

Uh oh!

troyraen left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

adrianlucy commented Jul 10, 2025 •

edited

Loading

adrianlucy Jul 16, 2025 •

edited

Loading

adrianlucy Jul 21, 2025 •

edited

Loading

adrianlucy commented Jul 16, 2025 •

edited

Loading

adrianlucy commented Jul 23, 2025 •

edited

Loading

adrianlucy Feb 12, 2026 •

edited

Loading

adrianlucy Feb 13, 2026 •

edited

Loading