Add analytical data search notebook #443
Conversation
Thanks @adrianlucy! A couple folks will take a look, likely next week. We usually do a science review and a technical review. The checklists are posted here https://github.com/nasa-fornax/fornax-documentation/blob/main/notebook_review_process.md, in case you haven't seen them yet. Notebooks go through a two-step process: the first step is into the repo, and the second gets it into the published tutorials here https://nasa-fornax.github.io/fornax-demo-notebooks/. This PR is the first step. The notebook doesn't have to meet all of the checklist items before merging, but that will give you an idea of what the reviewers will look for. Items not covered in this PR can be handled before or during step two.
jkrick
left a comment
This notebook looks great and is very impressive! Thank you for writing it. I have only a few, very minor, comments, with no major changes requested.
I would suggest a different title, as I had no idea what to expect from the notebook when I first saw its title. Maybe something like: "detect_emission_spectral_cubes" or "detect_FE_YSOs" or "search_YSO_cubes" or "FE_emission_YSOs" or ??
> For this notebook, we recommend using the [Fornax science platform](https://nasa-fornax.github.io/fornax-demo-notebooks/documentation/README.html) with Server Type = "Large - 64GB RAM/16 CPU" and Environment = "Default Astrophysics". As of June 2025, running this notebook once on that server will take about 10 minutes.
A suggestion to use the Medium server as a benchmark, only because resources are limited on Fornax and it will run to completion on the Medium server (I just did it, but didn't time it).
I get a total of 33 min on Medium, vs. 10 min on Large.
I definitely appreciate Fornax's concerns about overtaxing your servers, and I'm not very familiar with what expected resource utilization looks like on your end once you leave beta. But I do have a couple concerns about this proposed change that I'd appreciate your thoughts on:
- As a user, it feels a lot more interactive and conducive to trial-and-error/scientist workflows to have a single cell take 2 minutes (on Large) vs. 13 minutes (on Medium).
- It's a lot harder to sell a user on the utility of parallelizing when using dask only gets you a ~1.5x speed-up (on Medium), as opposed to a ~4x speed-up (on Large).
The performance I'm seeing on that second point is interesting. For the results below, I'm looking at:
- Without dask: L276-292
- With Dask: Take L375, but for a fair comparison remove gnomonic projection from the check_points_in_polygon function.*
And I get:
- Large server, without Dask: 9.4 minutes, <10% CPU usage
- Large server, with Dask: 2.4 minutes, 70--80% CPU usage
- Medium server, without Dask: 12 minutes, 30--40% CPU usage
- Medium server, with Dask: 7.5 minutes, 80% CPU usage
So, a lot going on here. Maybe there's some overhead imposed by using dask that makes using it on 4 cores in this case barely worth it? Also, seems like maybe we're using more than one core even without using dask, at least on Medium.
In any case, my main concern is that the improvement from parallelizing is much more salient and instructive to the user on the Large server (4x) than on Medium (1.5x).
*check_points_in_polygon() without gnomonic projection, used for the "with Dask" results above:
def check_points_in_polygon(row, coordinates_to_test):
    polygon = parse_polygon(row['s_region'])
    polygon_path = Path(polygon, closed=True)
    if polygon_path.contains_points(coordinates_to_test).any():
        return row
    else:
        return pd.Series([np.nan] * len(row), index=row.index)
probably, given all of this valuable information, we should just say both server sizes in the runtime section with their respective times, or maybe that mini-table with the 4 lines and their runtimes + CPU usages. Likely people aren't going to be reading the runtime section to figure out how to optimize their code and whether or not to use dask, but it is valuable info to keep around.
Good idea!
(Notes to self so I don't forget, no need to read: 1. If I show that table, or something like it in prose, maybe do with gnomonic projection instead, for all combinations. 2. Check for instances throughout notebook where I talk about how long a cell will take, and maybe include estimates for both medium and large. Similarly, if I talk about speed-up from parallelizing, mention that this is dependent on the number of available cores. Also fix the place where I say "next cell" referring to like 3 cells downwards.)
You might want to consider using python's builtin multiprocessing for the parallelization instead of dask. After a quick look at your code, I'm guessing the calls with dask are using the same number of workers as there are CPU, so 4 on Medium server and 16 on Large. While the 4x speedup on Large is much better than the 1.5x on Medium, both are poor relative to the number of workers (and CPU, and probably memory as well) being used. If the parallelization were going well, I would expect a 16x speedup from 16 workers. As you guessed above, dask comes with a lot of overhead. The way the code is written also affects the efficiency of the parallelization but, while I didn't look at your code in enough detail to comment on it, I've had enough experience with dask similar to what you're reporting here that I suspect dask just isn't doing a very good job running it efficiently. Perhaps dask can be made to do better, but unfortunately I haven't seen much documentation instructing users about how to do that (though I haven't checked in quite a while). Also, I'm hearing reports of significant issues with dask development right now with things in broken states and few developers available to fix it. So multiprocessing seems the better option all around. But that being said, I would certainly understand if you don't want to put the effort into switching your code at this point.
@troyraen Interesting, thank you! Yeah, dask wasn't a deeply considered choice on my part, I was just following guidance that it's the strategic direction we wanted to focus on. But we shouldn't recommend it in cases where it's the wrong choice.
Indeed, I am seeing much better results with multiprocessing (with num_workers = num_cores). Relative to no explicit parallelization at all, now a 10x speed-up on Large (was 4x with dask), 3x speed-up on Medium (was 1.5x with dask). CPU usage is now ~100% (was ~70-80% with dask).
The choice between dask and multiprocessing may also have to do with how brief the parallelized function is, in this case ~30--90 milliseconds per call.
The speed-up is still short of [number of cores]x. The reason may be that my code needs improvement, but my best guess is that it's at least partially because some python libraries are implicitly managing to make use of multiple cores even without explicit parallelization. On Medium, CPU usage without explicit parallelization was 30--40% according to the kernel usage tab in Fornax, so a 3x speed-up from parallelizing seems about right. On Large, CPU usage without explicit parallelization was 7%, so our 10x speed-up is a bit short of the 14x speed-up I might expect—but in this case 14x vs. 10x is a difference of 15 seconds, so I could see start-up/wrap-up overhead being an issue. No improvement from increasing chunksize.
Anyway, I think I will indeed switch this to multiprocessing. And I guess I'll try to see if changing the parallelized function to apply to batches of rows instead of individual rows yields better performance with dask, multiprocessing, or both.
Oh good, ya those are obviously better speedups. And I should have added a few more words to this sentence since the overhead may not be completely surmountable... "If the parallelization were going well, I would expect closer to a 16x speedup from 16 workers." There are probably things you can tweak to get more efficiency, but also fine to leave as is.
I should also mention, multiprocessing is not without issues. One in particular is that the behavior is a bit different on Linux, Mac, and Windows and it can be tricky to write code that works out of the box on all of them. In this repo we assume Linux, since that's what Fornax Console runs, and then include some text with info/instructions for Mac and Windows if we can. We've been hoping that dask would eventually be a better solution but so far haven't had enough success to fully adopt it.
And I guess I'll try to see if changing the parallelized function to apply to batches of rows instead of individual rows
One simple way to do this is with the chunksize argument. The default is usually 1 which means that a new task is generated for every item in the input iterable, whereas a chunksize of (eg) 100 only generates a new task for every 100 items and so reduces the overhead. My rule of thumb is to generate at least 2-3 chunks per worker using something like chunksize = min(100, len(my_iterable) // (3 * nworkers)). Making chunks too big can result in some workers finishing a lot earlier than others and then just sitting around idle because all tasks have already been assigned out.
Another thing you can look into is the different functions that are available and what they're recommended for. For example, if the iterable is very long and the order in which results are returned is not important, Pool.imap_unordered() is recommended. But again, the 3x and 10x speedups are already reasonably good.
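To make the chunksize rule of thumb and `Pool.imap_unordered()` concrete, here is a minimal sketch; the worker function and iterable are made up, and since this repo assumes Linux (fork start method), the pool can run at module level without a `__main__` guard:

```python
import multiprocessing as mp

def process_item(x):
    # Stand-in for the real per-observation work
    # (e.g. a point-in-polygon test taking tens of milliseconds).
    return x * x

items = list(range(10_000))
nworkers = mp.cpu_count()

# Rule of thumb from above: generate at least 2-3 chunks per worker,
# capping chunk size so no worker sits idle at the end of the run.
chunksize = max(1, min(100, len(items) // (3 * nworkers)))

# imap_unordered returns results as they finish, which suits long
# iterables where the output order doesn't matter.
with mp.Pool(processes=nworkers) as pool:
    results = list(pool.imap_unordered(process_item, items, chunksize=chunksize))
```

On macOS and Windows (spawn start method) the same code would need the usual `if __name__ == "__main__":` guard.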
Switched from dask to multiprocessing. One interesting nuance is that a task with an I/O bound component (opening FITS files) benefits from having more workers than cores, so that a core always has something to do.
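That I/O-bound nuance can be sketched with a toy example; the sleep here is a made-up stand-in for opening a FITS file, and a thread pool is used to keep the demo simple, though the same oversubscription principle applies to a process pool:

```python
import os
import time
from multiprocessing.pool import ThreadPool

def load_cube(path):
    # Stand-in for opening a FITS file: during the I/O wait the worker
    # holds no CPU, so another worker's compute can proceed.
    time.sleep(0.01)
    return path.endswith(".fits")

paths = [f"cube_{i}.fits" for i in range(64)]
ncores = os.cpu_count() or 1

# Oversubscribe: more workers than cores keeps every core busy
# while other workers are blocked waiting on I/O.
with ThreadPool(processes=2 * ncores) as pool:
    loaded = pool.map(load_cube, paths)
```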
Runtimes for both medium and large server now included, removed advice to use large server.
(Also I did try Pool.imap_unordered, changing chunksize, etc., but didn't find anything that improved performance except as noted above.)
Co-authored-by: jkrick <jkrick@caltech.edu>
Thank you @jkrick! Just to check, when you ask for a different title, are you only referring to the filename, or do you also suggest changing the in-notebook title ("Beyond metadata: analytical data search in the cloud with JWST spectral cubes")? Also, do you want the directory holding the notebook to be named the same as the file? Currently, the directory name is even more generic - I guess I was imagining that maybe you'd want to group this notebook with ones sharing similar goals in the future, or something like that.
jkrick
left a comment
I can see the rationale for the folder name being either data_search or something more specific. You can choose and then if we feel it doesn't fit later, we can always change it. And I think the in-notebook title is also fine. It really was a comment about the filename itself. Other people might have stronger opinions on this organization question.
bsipocz
left a comment
first round (dry-run -- I haven't run the notebook) review. The biggest issue is with the very tight requirements.
And the manual TOC will certainly be an issue as this further evolves.
Also, I wonder if we can try to add this to the rendering as part of this PR, or would you prefer to do that in a follow-up (e.g. TOC cleanup can be postponed to that one).
plt.show()

> Figure 4 image description: three astronomical images in pixel coordinates, side by side, display three different wavebin slices of a spectral cube. The left and right images are labeled "-10 wavebins away" and "+10 wavebins away", respectively, and feature a bright central source with faint hints of adjacent extension to the lower right. The middle image is labeled "5.34 micron line". In the middle image, the central source is much dimmer, emission extends all the way to the lower-right corner, and a blob in the lower-right corner is the brightest feature, in a region where the other two images exhibit no emission.
Are these meant to be ALT-texts? If yes, then we should be able to work it properly into the figure. If it's for all users, then I would suggest working it more into the narrative text; this notebook has very little of that.
Yes, they are intended to be a best effort at alt text. My understanding was that matplotlib unfortunately doesn't have alt text support, and Jupyter in general is not well designed for accessibility. Very open to ideas!
I know that there has been some recent work on it in the wider ecosystem, and also asked around.
What's certain atm, is that myst supports alt text for figures, but you may need to wizard around to have it added to the mpl generated figure.
https://myst-parser.readthedocs.io/en/latest/syntax/images_and_figures.html
This is something that MAST is very interested in as well---it's really critical with so much user support moving to notebooks on science platforms---but we haven't made much progress on it. We've had conversations with QuantStack about notebook accessibility issues, and it sounds like they recently got some funding for that general topic.
I took a look at the link you mentioned, but I suspect where you say "wizard around" might unfortunately mean "carry out a major work effort". Our team thinks that putting alt text in block quotes is the best quick stop-gap solution we have right now. Note that, for sighted users, this looks like the attached screenshot.

Anyway, I changed "image description" to say "alt text" for more clarity. I can't think of any better solution that we have the capacity for at MAST. Let me know if you want me to delete the alt text, but my team strongly believes that imperfectly located alt text is better than no alt text.
And please let me know if you do come up with any alternatives!
I'd be happy to do it all in one PR. I'll work on incorporating the feedback thus far soon, probably finishing sometime next week.
The rendering actually requires some configuration that would be easier to do in a separate PR. We can handle that bit once this one gets merged. All of the above comments would be good to address but here's the list of minimum necessary before it can be added to the rendering to get published:
- delete TOC
- loosen requirements
- switch from dask to multiprocessing module
- rename row to obs throughout
- rename exists function to not_none
- more elegant wrangling of yso_coords
- admonitions instead of html
- preface alt text more clearly
- further refine multiprocessing, add more workers for IO bound task
- move parse_polygon to code_src
- change requirement to astroquery 0.4.11, but add admonition that on Fornax need to install boto3 currently
- fix function docstrings
- use Quantities for wavelengths
Hi all - sorry for the extraordinary delay, we're a bit short-staffed here. I believe I've now addressed all comments. In particular, to Troy's compiled list of all the items that need to be addressed before this notebook can be added to the rendering/published: all have been done except fixing the alt text, which doesn't seem feasible with the resources I have access to (though I did revise the language to make it clearer that it's alt text). See my post just now in the discussion thread for alt text, for more details: #443 (comment) Let me know if you have any ideas! Many other comments were addressed, and a summary change log is in this commit: 5da4e9d. The main highlight is the switch from dask to multiprocessing; I also wanted to bring a few other changes to your attention, per the change log. Let me know what you think!
Thank you! Hope you're all doing okay, and sorry again for the delay.
Thanks Adrian and no worries! @bsipocz can you take another look here? I plan to as well.
troyraen
left a comment
Thanks @adrianlucy, this is an interesting use case. Thanks for putting so much work into it! Pending @bsipocz's review, I think we should go ahead and merge this PR and move forward with releasing the notebook.
I took a quick look with respect to the notebook review checklists. The major thing I want to check on is the multi-archive access (since it's a key characteristic of notebooks in this repo):
- Does it include all three archives HEASARC, MAST, IRSA?
- if not, is that justified?
- Has each NASA archive been given the option to comment on modules for their relevant data access?
- Discourse staff channel is one method of contacting people to check for this
Have you considered including any HEASARC datasets? And have you checked with IRSA about SOFIA/Herschel/Spitzer spectral cube access? (Maybe @jkrick covered that in her review?) If there are preferred improvements and/or additional datasets, they don't necessarily need to be implemented immediately but let's at least make a plan before releasing the notebook.
I left some minor suggestions in code comments below. It's mostly about clarifying function input as that's something we've gotten feedback from users about.
bsipocz
left a comment
Thanks for the updates to this PR. I have a lot of comments here, some of them are blockers while others are questions or pointers for follow-ups.
Also, please note that the review was done on the code without running the notebook; execution of it is currently stuck on L332 with some traceback coming from the cell, but it's still running. So ultimately more comments could possibly come, but I am not blocking this review on that right now.
### References

This notebook relies on the following papers:
- Assani et al. 2025 ([2025arXiv250402136A](https://ui.adsabs.harvard.edu/abs/2025arXiv250402136A/abstract))
- Karska et al. 2025 ([2025A&A...697A.186K](https://ui.adsabs.harvard.edu/abs/2025A%26A...697A.186K/abstract))

And the following software:
- Astroquery; Ginsburg et al. 2019 (2019AJ....157...98G)
- Astropy; Astropy Collaboration 2022, 2018, 2013 (2022ApJ...935..167A, 2018AJ....156..123A, 2013A&A...558A..33A)
- Matplotlib; Hunter 2007 (2007CSE.....9...90H)
- SciPy; Virtanen et al. 2020 (2020NatMe..17..261V)
Myst can handle references super nicely, I would recommend using that rather than a manual list at the bottom: https://mystmd.org/guide/citations
I'm going to say that (judging by the approach taken so far by other Fornax notebooks) this is not needed for MVP, if that's okay. But I agree that's a good idea.
```{code-cell} ipython3
tmc1a_meta = Simbad.query_object('TMC1A')
tmc1a_ra = float(tmc1a_meta['ra'])  # ICRS RA of TMC1A in degrees
tmc1a_dec = float(tmc1a_meta['dec'])  # ICRS Dec of TMC1A in degrees
```

...and retrieve its observations from our table of JWST spectral cube observations of YSOs:

```{code-cell} ipython3
tmc1a_jwst = []

# As before...
for obs in yso_jwst_obstable:
    polygon = parse_polygon(obs['s_region'])
    polygon_path = Path(polygon, closed=True)
    # We'll use matplotlib.Path.contains_point (singular) this time,
    # instead of .contains_points (plural)
    is_inside = polygon_path.contains_point((tmc1a_ra, tmc1a_dec))
    if is_inside:
        tmc1a_jwst.append(obs)
```
This should be possible natively with astropy and regions; please use that rather than brute force mpl.
I still disagree; please see my comment linked here (and the preceding comment above it) in our previous discussion about this, and we can continue the conversation in that thread if necessary. #443 (comment)
# Convert back to a pandas dataframe
results = pd.DataFrame(results)

# Remove empty rows, where check_points_in_polygon reached the "else" part of the return statement.
results = results.dropna(how='all')

# Convert back to an astropy table.
yso_jwst_obstable = Table.from_pandas(results)
Why convert back and forth between pandas and astropy? If you want to work with astropy Tables, then stick with those all the way.
I looked at this a while back. Unfortunately, looping through an equivalent iterable of an astropy table (the output/input of astroquery) drastically hurts multiprocessing performance relative to iterating through a pandas dataframe, in this case. But I have cleaned up the code further up to define an iterable straight away rather than bothering to define a pandas table.
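One lightweight way to sketch that "define an iterable straight away" approach is to materialize plain-Python rows once before handing them to the worker pool; the toy table below is invented to stand in for the astroquery result:

```python
import pandas as pd

# Toy observation table standing in for the astroquery output.
df = pd.DataFrame({
    "obs_id": ["a", "b", "c"],
    "s_region": ["POLYGON 0 0 0 1 1 1", None, "POLYGON 2 2 2 3 3 3"],
})

# Plain dicts are cheap to pickle and fast to iterate, which matters
# when each row is shipped to a multiprocessing worker.
rows = df.to_dict(orient="records")

# Downstream filtering then works on ordinary Python objects.
kept = [r for r in rows if r["s_region"] is not None]
print([r["obs_id"] for r in kept])  # ['a', 'c']
```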
plt.show()

> Figure 1 alt text: a line plot of "Fraction of non-NaN pixels above threshold" versus "Wavelength bin index", the latter ranging from around 0 to 1000. The plotted values have substantial scatter and features, but one feature stands out by a factor of two above all others: a narrow, abrupt spike near wavelength bin index 548.
It's great to have alt-text for the figure, but I have serious concerns with its accuracy over time. Also, for actual alt-text I strongly suggest to use some mystmd native approach for embedding the output into a figure directive.
Or, as this code cell is nothing but the plotting, use the code-cell caption?
What to do with this is more of a question to @troyraen -- I don't think we can guarantee that any of these type of alt-text would stay accurate, unless we somehow auto-generate it from the data/with matplotlib.
I don't think we can guarantee that any of these type of alt-text would stay accurate
Do we expect the figures to change if the notebook/code itself doesn't change?
Sure, we don't expect that, but it may still change if something goes wrong unnoticed -- thus the alt-text can bit-rot. Captions are usually more abstract, as they don't really spell out details and numbers from the figure, so those should not be affected.
I strongly suggest to use some mystmd native approach for embedding the output into a figure directive.
As discussed in an existing thread here #443 (comment), I welcome your ideas but I don't personally see how to implement a near-term solution in that direction. FWIW our team thinks that putting alt text in block quotes is the best quick stop-gap solution/universal design approach we have right now.
Or, as this code cell is nothing but the plotting, use the code-cell caption?
Interesting, but I can't quite figure out what you're referring to. Are you using Quarto? https://quarto.org/docs/reference/cells/cells-jupyter.html Or are you talking about something else?
Sure, we don't expect that but it may still change if something goes wrong unnoticed -- thus the alt-text can bit rot
Certainly, accessibility has a labor cost, but there are 508/ADA requirements at the very least. In any case, I believe I have written this alt text to be flexible enough that it should not need maintenance.
But of course, I leave it up to you.
Interesting, but I can't quite figure out what you're referring to. Are you using Quarto? quarto.org/docs/reference/cells/cells-jupyter.html Or are you talking about something else?
No, we're using Jupyter-book and the mystmd engine: https://mystmd.org/guide/code#code-blocks
Putting captions properly into captions and alt-text into alt-text would help with accessibility much better than leaving it to the screen readers to scrape the quoted text for alt-text.
Here is another entry point; but also would recommend following the links to the {figure} and {image} directives.
https://mystmd.org/guide/reuse-jupyter-outputs#alternative-text-for-accessibility
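If I'm reading that page right, the pattern would be to label the plotting cell and then embed its output in a figure directive with explicit alt text; a rough, untested sketch (the label name here is made up):

````md
```{code-cell} ipython3
:label: blob-fraction-plot
plt.plot(test_blobs, color='k')
plt.xlabel('Wavelength bin index')
plt.ylabel('Fraction of non-NaN pixels above threshold')
plt.show()
```

:::{figure} #blob-fraction-plot
:alt: Line plot with a narrow, abrupt spike near wavelength bin index 548, standing out above an otherwise scattered baseline.
A caption for all readers can go here.
:::
````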
Oo, very interesting, thank you! I'll try that today.
@bsipocz, I feel foolish but I can't figure out how this helps include alt-text or a caption in a notebook. If you think it can do that, could you perhaps show the change to the markdown that you think would work? I can't get it to work. e.g. for something easy to test I tried this in the markdown:
```{code-cell} ipython3
:caption: TESTING CAPTIONS
plt.plot(test_blobs, color='k')
plt.xlabel('Wavelength bin index')
plt.ylabel('Fraction of non-NaN pixels above threshold')
plt.show()
```

"TESTING CAPTIONS" does not appear in the executed notebook, and in my reading nothing in the mystmd documentation suggests it would. But maybe I'm doing something wrong.
Perhaps you mean that the alt-text or caption would manifest in the render. But isn't it better for alt-text to be available when interacting with the notebook itself in the Fornax platform, even if imperfectly?
Thank you!
return obs

Phew! Let's try this out on our TMC1A observations.
Please keep the narrative professional.
Sure, revised. FWIW it might be useful to write a style guide that explains the tone you are aiming for more concretely. While the content you want removed is clear in this case, "professional" is a moving target; the MAST Notebook Repository, for our part, generally does not have the same position on conversational tone.
I believe that
This made me think: what if we save out a cell -- or do something in a dropdown? That could keep the best of both worlds.
- Biggest change is support for macOS and Windows by saving and importing the functions defined in the notebook
- See PR review responses for details
From @troyraen:
I did consider this already, but I don't think there are any cubically-formatted spectral images in the HEASARC, so creating new algorithms for HEASARC would greatly overcomplicate the narrative. IRSA is discussed in the last section of the notebook; @jkrick please do let me know if you see anything wrong there. My suggestion would be that we plan to look at the possibility of expanding the IRSA section when SPHEREx all-sky data cubes are released.
From @troyraen:
From @bsipocz:
Thanks very much for the great catch. I'm now implementing @bsipocz's "save out a cell" idea, which I like, with the %%writefile magic. Note that there's another nuance here because passing in e.g. So what I've done now is create an option to serialize such objects (like
Thank you all very much for the second round of comments, @troyraen @jkrick @bsipocz. I've done my best to incorporate them, and I think the notebook is much improved from where we started. I'm hopeful that, wherever at all possible, we can consider the current version to be good enough for an initial merge and rendering, while identifying priorities for future improvement. Leadership here would quite like to get this added to the rendering. From @troyraen's initial summary of MVP to be added to the rendering, everything is implemented except finding an alternative solution for alt text, which I don't currently see a way forward for on my end. (See two threads above for discussion on that topic: #443 (comment), #443 (comment)) Let me know what you think. Thank you!
Thanks for your patience and persistence @adrianlucy. This is a lot of work! I haven't tracked all the recent conversations yet, but after a quick look, I see that you've certainly been responsive to feedback. And also have two approving reviews. I'm headed out for a long weekend but will follow up next week. %%writefile magic hadn't been on my radar - looking forward to seeing if it's a solution for problems with a couple notebooks I wrote for IRSA!
I think we're ready to merge this. After it's merged, our team will work on some integration and deployment tasks. Things I'm aware of that are likely to be included are:
After it's released, we'll reach out to you or MAST if/when we become aware of something substantive that needs to be updated in the notebook text or code. You're also welcome to make other updates on an ongoing basis. Do you have any comments about the above, or anything else you want to add to this PR before merging, @adrianlucy?
@troyraen That all sounds good to me—thank you! Don't hesitate to let me know if you need anything from me during this process. I'm out Wed and Thurs this week, but back Friday. (PS, if you happen to have an approximate guess on when it will be added to the website rendering, there are folks asking me for updates. No worries if not!) Many thanks for the work that you all have and will put into this, and for your extremely useful feedback throughout this review.
Add analytical data search notebook 14c6278
This notebook from MAST for Fornax, "Beyond metadata: analytical data search in the cloud with JWST spectral cubes", answers a data-intensive archival search query ("I want to find JWST spectral image cubes containing Fe II emission from jets launched by young stellar objects"), with broader pedagogical goals related to large-scale target coordinate searches beyond the capabilities of archive services, parallelizing with dask, algorithmically searching for select data characteristics, and loading data from cloud URIs into memory.
The focus is on JWST spectral cubes from MAST, but I hope that the strategies taught will be of more general use for other missions and archives. I also touch briefly on SOFIA data from IRSA at the end, and hope to maybe switch that to SPHEREx in a future PR when more SPHEREx data is available.
The notebook has been reviewed internally by several MAST staff, their feedback has been incorporated, and we are ready for review by the Fornax team!
A couple practical notes: