Skip to content

Pyloos subset#132

Open
lgsmith wants to merge 6 commits into
GrossfieldLab:mainfrom
lgsmith:pyloos-subset
Open

Pyloos subset#132
lgsmith wants to merge 6 commits into
GrossfieldLab:mainfrom
lgsmith:pyloos-subset

Conversation

@lgsmith
Copy link
Copy Markdown
Contributor

@lgsmith lgsmith commented May 14, 2026


name: Pyloos Subset
about: adding an optional subset functionality that deploys update group coords on the subset kwarg AtomicGroup.
title: 'Pyloos Subset'
labels: 'feature'
assignees: ''

Changes proposed in this pull request

  • Within the PyLOOS.Trajectory object, set full_update=False to set the self._target atomic group to the self._subset atomic group already defined. Otherwise _model is bound to target.
  • updateGroupCoords is always applied to self._target
  • Emit a warning about how this will leave whatever group you used for 'model' to instantiate the traj as stale.
  • define a method refreshModel() that calls updateGroupCoords on the self._model AG instead.
    • This allows users to do a 'two step' analysis, if they want.

Comment

This was done to enhance performance on reading subsets of trajectories where the difference in atom counts beween the full frame and the subset is really significant. For example, when analyzing a protein in a large water box, the fraction of C$\alpha$s to all atoms might be 0.2%. At that kind of ratio, I was seeing a 4.4x speedup from using this subsetting for just reading through the coordinates I wanted.

@agrossfield
Copy link
Copy Markdown
Member

agrossfield commented May 14, 2026

@tromo : I thought we had a version of this implemented in the C++, but poking around in a couple of the implementations, it doesn't look like it. The closest I could find is DCD::mappedCoords, but that's not called anywhere and theres a comment elsewhere saying it's slated for removal. Am I crazy?

In any event, I have some misgivings about this feature, because the main AtomicGroup (and thus any other AtomicGroups created by selecting from it) will be stale, which could lead to surprising outcomes, since in general we make a big deal about not having to worry about it. However, @lgsmith put LOTS of warnings in there, so it's probably ok.

My bigger concern is with the call to updateGroupCoords. I think in general that just does a straight for-loop over atoms, so it's not clear to me how that will work if the subset selection isn't a contiguous block at the beginning of the full model. @lgsmith: have you tested this with a selection for eg alpha carbons? I'd need to look more carefully, but I suspect you won't get the same thing.

For a test, I'd suggest opening 2 copies of the same trajectory, one using subset update, one using full update, then comparing the coordinates for the subset.

Moreover, you'll need to test it with a couple of different trajectory formats, because they each have their own implementation of updateGroupCoords.

@lgsmith
Copy link
Copy Markdown
Contributor Author

lgsmith commented May 14, 2026 via email

@lgsmith
Copy link
Copy Markdown
Contributor Author

lgsmith commented May 15, 2026

Actually hang on a minute. The warnings are irritating, so I'm going to give the user a flag to suppress them that will be more typing once but won't fill stderr with nonsense about your potentially foolhardy selection choices if you read many trajs in one script.

@agrossfield
Copy link
Copy Markdown
Member

Did you check with multiple trajectory file formats? From offline conversation, I think you tried it with DCD.

Maybe mdtraj's hdf5 format? or Amber NETCDF? They don't share an implementation of updateGroupCoords(), so I want to make sure this works everywhere before we merge.

@agrossfield
Copy link
Copy Markdown
Member

Actually hang on a minute. The warnings are irritating, so I'm going to give the user a flag to suppress them that will be more typing once but won't fill stderr with nonsense about your potentially foolhardy selection choices if you read many trajs in one script.

I don't think it's actually necessary have the warnings get printed -- the point is to warn developers, not end users. Maybe a giant box comment above the implementation? (I recognize you've already marked it as dangerous in the options)

In addition, it would be wonderful if you prepped a howto for the wiki, explaining the use case and benefits as well as the risks. (that's not a barrier to merging, just something that would make me happy)

@lgsmith
Copy link
Copy Markdown
Contributor Author

lgsmith commented May 15, 2026

OK, this should be working the way that I want it to now. Here's the test I did to check for the coordinate match:

import time
import loos
import loos.pyloos
import random
import numpy as np

PDB = '/home/lsmith/loostests/1ezg-sol.pdb'
DCD = '/home/lsmith/loostests/1ezg-sol.dcd'

REPEATS = 5


def time_iter(update_full, modelfn, trajfn):
    model = loos.createSystem(modelfn)
    traj = loos.pyloos.Trajectory(
        trajfn, model, subset='name == "CA"', update_full=update_full, suppress_update_full_warning=True
    )
    t0 = time.perf_counter()
    n = 0
    for frame in traj:
        n += len(frame)
    return time.perf_counter() - t0, n


# warm OS cache
print("warming OS cache...")
time_iter(True, PDB, DCD)

print(f"\n{'mode':<18} {'time (s)':>10}  {'atoms touched':>14}")
print('-' * 46)
for mode_name, update_full in (("update_full=True", True),
                               ("update_full=False", False)):
    times = []
    for _ in range(REPEATS):
        dt, n = time_iter(update_full, PDB, DCD)
        times.append(dt)
    best = min(times)
    mean = sum(times) / len(times)
    print(f"{mode_name:<18} {best:>10.3f}  {n:>14d}    (mean over {REPEATS}: {mean:.3f}s)")


# sanity: same subset coords either way
def grab(update_full, modelfn, trajfn, frame_ix):
    model = loos.createSystem(modelfn)
    t = loos.pyloos.Trajectory(trajfn, model, subset='name == "CA"',
                               update_full=update_full, suppress_update_full_warning=True)
    frame = t[frame_ix]
    return frame.getCoords()
    # return [(frame[i].coords().x(), frame[i].coords().y(), frame[i].coords().z())
    #        for i in range(len(frame))]

model = loos.createSystem(PDB)
traj = loos.pyloos.Trajectory(DCD, model)
nframes = len(traj)
test_frame_ix = random.randint(0, nframes - 1)
print(f"\nsanity check: subset coords match between modes for a randomly chosen frame: {test_frame_ix}?")
coords_a = grab(True, PDB, DCD, test_frame_ix)
coords_b = grab(False, PDB, DCD, test_frame_ix)
print("  match:", np.array_equal(coords_a, coords_b), f"(n={len(coords_a)})")

@agrossfield
Copy link
Copy Markdown
Member

The test looks good, but I want to see it for other trajectory file formats as well.

@lgsmith
Copy link
Copy Markdown
Contributor Author

lgsmith commented May 15, 2026

Here's an updated benchmark:

import time
import loos
import loos.pyloos
import random
import numpy as np

PDB = '/home/lsmith/loostests/1ezg-sol.pdb'
TRAJFN = '/home/lsmith/loostests/1ezg-sol.xtc'

REPEATS = 10


def time_iter(update_full, modelfn, trajfn):
    model = loos.createSystem(modelfn)
    traj = loos.pyloos.Trajectory(
        trajfn, model, subset='name == "CA"', update_full=update_full, suppress_update_full_warning=True
    )
    t0 = time.perf_counter()
    n = 0
    for frame in traj:
        n += len(frame)
    return time.perf_counter() - t0, n


# warm OS cache
print("warming OS cache...")
time_iter(True, PDB, TRAJFN)

print(f"\n{'mode':<18} {'time (s)':>10}  {'atoms touched':>14}")
print('-' * 46)
model_means = {}
for mode_name, update_full in (("update_full=True", True),
                               ("update_full=False", False)):
    times = []
    for _ in range(REPEATS):
        dt, n = time_iter(update_full, PDB, TRAJFN)
        times.append(dt)
    best = min(times)
    mean = sum(times) / len(times)
    model_means[mode_name] = (best, mean)
    print(f"{mode_name:<18} {best:>10.3f}  {n:>14d}    (mean over {REPEATS}: {mean:.3f}s)")

ratio_best = model_means['update_full=True'][0]/model_means['update_full=False'][0]
ratio_means = model_means['update_full=True'][1]/model_means['update_full=False'][1]
print(f'(update_full=True)/(update_full=False): best = {ratio_best}, mean = {ratio_means}')

# sanity: same subset coords either way
def grab(update_full, modelfn, trajfn, frame_ix):
    model = loos.createSystem(modelfn)
    t = loos.pyloos.Trajectory(trajfn, model, subset='name == "CA"',
                               update_full=update_full, suppress_update_full_warning=True)
    frame = t[frame_ix]
    return frame.getCoords()
    # return [(frame[i].coords().x(), frame[i].coords().y(), frame[i].coords().z())
    #        for i in range(len(frame))]

model = loos.createSystem(PDB)
traj = loos.pyloos.Trajectory(TRAJFN, model)
nframes = len(traj)
test_frame_ix = random.randint(0, nframes - 1)
print(f"\nsanity check: subset coords match between modes for a randomly chosen frame: {test_frame_ix}?")
coords_a = grab(True, PDB, TRAJFN, test_frame_ix)
coords_b = grab(False, PDB, TRAJFN, test_frame_ix)
print("  match:", np.array_equal(coords_a, coords_b), f"(n={len(coords_a)})")

Here's the output from running it with the xtc extension (note the performance improvement is pretty modest here; I think this trajectory is so short that dealing with the xtc compression is still dominating the cost of reading it)

warming OS cache...

mode                 time (s)   atoms touched
----------------------------------------------
update_full=True        2.279           84000    (mean over 5: 2.292s)
update_full=False       1.776           84000    (mean over 5: 1.780s)

sanity check: subset coords match between modes for a randomly chosen frame: 667?
  match: True (n=84)
(omm) [lsmith@ccblin095 loostests]$ vim bench_update_full.py 
(omm) [lsmith@ccblin095 loostests]$ sed -i 's/DCD/TRAJFN/g' bench_update_full.py 
(omm) [lsmith@ccblin095 loostests]$ vim bench_update_full.py 
(omm) [lsmith@ccblin095 loostests]$ python bench_update_full.py 
warming OS cache...

mode                 time (s)   atoms touched
----------------------------------------------
update_full=True        2.258           84000    (mean over 10: 2.267s)
update_full=False       1.782           84000    (mean over 10: 1.787s)
(update_full=True)/(update_full=False): best = 1.266670277102126, mean = 1.2687202012060974

And if I change the file name back to the dcd (I just converted it by having loos write an xtc copy):

model: /home/lsmith/loostests/1ezg-sol.pdb traj: /home/lsmith/loostests/1ezg-sol.dcd
warming OS cache...

mode                 time (s)   atoms touched
----------------------------------------------
update_full=True        0.619           84000    (mean over 10: 0.623s)
update_full=False       0.138           84000    (mean over 10: 0.139s)
(update_full=True)/(update_full=False): best = 4.473243117514091, mean = 4.486294288455289

sanity check: subset coords match between modes for a randomly chosen frame: 740?
  match: True (n=84)

Make sense?

@lgsmith
Copy link
Copy Markdown
Contributor Author

lgsmith commented May 15, 2026

Regarding the comment about the warnings: I am happy to have fewer warnings (warnings about unused fields from the beginnings of pdbs litter my outfiles and I don't always write a grep -v WARN to get rid of them...) but I think people who use loos nowadays probably are writing a lot of one-off python scripts on the fly; maybe they're incorporating it into a notebook, or something. It's those people who I'm most worried about, because they're likely to have fairly ad-hoc trajectories and unlikely to look at the source. What's a more graceful way to deal with that?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants