Changes from 74 commits
75 commits
37848a6
some progress on hub label integration?
electricEpilith Nov 12, 2025
297a11f
hub labeling in (debugging not finished), also changes to deal with C…
electricEpilith Jan 23, 2026
9d4c2e2
Point at compatible libbdsg and get build working on Mac
adamnovak Jan 23, 2026
8c13cf3
Use the new indexing types and accessors to avoid fetching nodes by t…
adamnovak Jan 23, 2026
788224d
Use accessors so we can build the Tiny oversized snarl test index and…
adamnovak Jan 23, 2026
4985468
Try dumping hub label data for debugging
adamnovak Jan 26, 2026
86e4e31
Add synthetic Boost graph dumping code, and missing semicolon, and li…
adamnovak Jan 27, 2026
ee5bd54
Merge remote-tracking branch 'origin/master' into hublabel
adamnovak Jan 28, 2026
e84e657
Use libbdsg with slightly more implemented hub labeling integration
adamnovak Jan 28, 2026
4f31496
Make sure NodeProp fields are not used before initialization
adamnovak Jan 29, 2026
232a589
Stop trying to look up removed trivial snarls
adamnovak Jan 29, 2026
30e392a
Add the debugging to subgraph finding that I needed to fix ChainRecor…
adamnovak Jan 29, 2026
9639b68
Stop trying to interpret the root as a chain in debug prints
adamnovak Feb 2, 2026
c0db406
Turn off debugging after passing existing snarl distance index tests
adamnovak Feb 2, 2026
ce1027f
Merge remote-tracking branch 'origin/master' into hublabel
adamnovak Feb 10, 2026
77c2ec2
Merge remote-tracking branch 'origin/hublabel' into hublabel
adamnovak Feb 10, 2026
163764f
Make randomized graph test actually exercise oversized snarls sometimes
adamnovak Feb 10, 2026
ddce5f4
Add function for loading a handlegraph from JSON
adamnovak Feb 10, 2026
e56353f
Allow cactus-ifying all handle graphs
adamnovak Feb 10, 2026
4f66c25
Add synthetic fix for actually populating the unique_ptr right
adamnovak Feb 10, 2026
5436d73
Commit partial synthetic refactor to use new JSON load method
adamnovak Feb 10, 2026
2c3721d
Revert "Commit partial synthetic refactor to use new JSON load method"
adamnovak Feb 10, 2026
695cff5
Replace string_to_graph with json2graph
adamnovak Feb 10, 2026
8ede8df
Remove a bunch of mostly unused functions for working with Protobuf G…
adamnovak Feb 10, 2026
e472799
Mostly-automatically convert tests to use vg::io::json2graph
adamnovak Feb 10, 2026
904f445
Remove duplicative JSON to graph function
adamnovak Feb 10, 2026
809a766
Set up tiny test that breaks oversized snarl logic
adamnovak Feb 10, 2026
f2d4f08
Remove unused cases
adamnovak Feb 10, 2026
caaa512
Fill in the dustances through oversized snarls to pass more distance …
adamnovak Feb 11, 2026
f15a5f9
Add exhaustive test for small snarls
adamnovak Feb 11, 2026
a0c71e6
Add a test for one of the failing possible small oversized snarls spe…
adamnovak Feb 11, 2026
8c048e2
Pin down one small graph
adamnovak Feb 11, 2026
e036967
Add more test cases from sequential and random graphs and make them pass
adamnovak Feb 13, 2026
8860e3a
Turn off debugging after passing random graph test
adamnovak Feb 13, 2026
eeaa175
Add initial synthetic code for SnarlDecompositionFuzzer and randomly_…
adamnovak Feb 13, 2026
c2f3279
Synthesize code that puts child chains the right way around
adamnovak Feb 13, 2026
c3cd3b7
Synthesize much simpler code that I designed myself because stochasti…
adamnovak Feb 13, 2026
c75e7ed
Synthesize slightly more encapsulated code
adamnovak Feb 13, 2026
a9ba894
Simplify the cursor loop and the flipping determination
adamnovak Feb 13, 2026
836e7ee
Hook up orientation fuzzers to random graph tests and fail to find mo…
adamnovak Feb 13, 2026
22ff0c4
Dump mostly-synthetic hot tips for cool robots
adamnovak Feb 14, 2026
e92a036
Implement populating is_regular and the single strict notion of regul…
adamnovak Mar 20, 2026
e4e1dae
Add debugging and reduce Saturn levels by not consuming all_children
adamnovak Mar 20, 2026
934e53e
Turn off debugging and don't count bounds as children for regularity
adamnovak Mar 20, 2026
21d2bb6
Set looping "distances" in distanceless index so we can tell snarls a…
adamnovak Mar 20, 2026
a203cb6
Use libbdsg that tries not to make way too many MPHF threads
adamnovak Mar 20, 2026
07d0472
Add another test to make sure we aren't missing reversals hiding in t…
adamnovak Mar 20, 2026
9aa832a
don't build tests for sparsehash due to C++20 incompatibility
electricEpilith Mar 21, 2026
85b2f44
don't build tests for sparsehash due to C++20 incompatibility
electricEpilith Mar 21, 2026
180dbf0
Parallelize cache_payloads and re-preload distance index before it
electricEpilith Mar 24, 2026
b8d310d
Atomic-ize progress and remove uninformative comment text
adamnovak Mar 24, 2026
0281b81
Replace snarl tree depth limit with fixed point check
adamnovak Mar 24, 2026
4422b08
Preload distance index only once
adamnovak Mar 24, 2026
aa2d827
Remove extra argument
adamnovak Mar 24, 2026
96b57ee
Use libbdsg that should define child snarl count function
adamnovak Mar 24, 2026
3c27df0
Regular-ify simple snarls
adamnovak Mar 24, 2026
9b48cfe
Merge pull request #4860 from vgteam/parallel-payload-caching
adamnovak Mar 26, 2026
957ebeb
Merge pull request #4857 from vgteam/hublabel-debug
adamnovak Mar 26, 2026
e6343e0
merge newer commits into hublabel
electricEpilith Mar 30, 2026
644b900
move libbdsg up
electricEpilith Mar 30, 2026
44321a3
added back (more) preloading to speed minimizer back up
electricEpilith Mar 30, 2026
0c5c0d9
update oldest-supported-compiler-job, upgrade gcc requirement to 10 t…
electricEpilith Mar 31, 2026
992c1cc
edit correct place for GCC version notice
electricEpilith Mar 31, 2026
7569cc1
move up libbdsg to upgrade snarl distance index version number
electricEpilith Apr 2, 2026
b81e331
add (a substantial amount of) instrumentation for vg giraffe
electricEpilith Apr 3, 2026
528ec4e
fix abs() errors on Mac
electricEpilith Apr 3, 2026
a3760d1
additional abs() fix
electricEpilith Apr 4, 2026
7920c76
minor print changes
electricEpilith Apr 4, 2026
190469a
Revert "minor print changes"
electricEpilith Apr 4, 2026
3e573bd
Revert "add (a substantial amount of) instrumentation for vg giraffe"
electricEpilith Apr 4, 2026
6cf266c
second try at instrumentation
electricEpilith Apr 4, 2026
e5513aa
Revert "second try at instrumentation"
electricEpilith Apr 6, 2026
bea2589
Merge pull request #4868 from electricEpilith/hublabel
electricEpilith Apr 6, 2026
e9c3d40
snarl distance index version number update
electricEpilith Apr 7, 2026
e286bf5
Fix typo in src/snarl_distance_index.cpp
electricEpilith Apr 8, 2026
2 changes: 1 addition & 1 deletion .gitlab-ci.yml
@@ -158,7 +158,7 @@ oldest-supported-compiler-job:
 GIT_SUBMODULE_STRATEGY: none
 # DO NOT change this version number without updating the README to reflect
 # the requirement bump.
-COMPILER_VERSION: 9
+COMPILER_VERSION: 10


 # We define one job to do the Docker container build
75 changes: 75 additions & 0 deletions BOTS.md
@@ -0,0 +1,75 @@
# VG Project Notes

## Building
- New `.cpp` files are auto-discovered by the build.
- Build with `make -j8` or `make obj/whatever.o` to build just one .o.
- You may be getting errors from `clangd`. If these errors seem spurious, stop and demand a `clangd` that works properly.

## Testing

### Running Bash-TAP Tests
Use `prove -v` (not `bash`) to execute Bash-TAP tests. This provides proper test harness output and better error reporting.

**Important**: Run `prove` from the `test/` directory:
```bash
cd test
prove -v t/26_deconstruct.t
```

### Running Unit Tests
To run all unit tests:
```bash
./bin/vg test
```
- `./bin/vg test "[tag]"` runs tests matching a tag

#### Writing Unit Tests
- Framework: Catch v2 (header-only)
- Include: `#include "catch.hpp"` (in `src/unittest/catch.hpp`)
- Macros: `TEST_CASE("name", "[tags]")`, `SECTION("name")`, `REQUIRE(cond)`
- Namespace: `vg::unittest`
- Directory: `src/unittest/`

### Running All Tests
```bash
make test
```

## Writing Code

### HandleGraph API
The interfaces in libhandlegraph model a bidirected sequence graph (where nodes have DNA sequences and edges can connect to either the start or end of each involved node).

#### Core types
- `handle_t` - opaque 64-bit value
- `nid_t` - node ID type
- `edge_t` = `pair<handle_t, handle_t>`

#### Key HandleGraph methods
- `get_handle(nid_t, bool is_reverse=false)` → `handle_t`
- `get_id(handle_t)` → `nid_t`
- `get_is_reverse(handle_t)` → `bool`
- `flip(handle_t)` → `handle_t` (toggle orientation)
- `get_sequence(handle_t)` → `string` (in handle's orientation)
- `follow_edges(handle_t, bool go_left, iteratee)` - iterate neighbors
- `for_each_handle(iteratee, bool parallel=false)` - iterate all nodes
- `for_each_edge(iteratee, bool parallel=false)` - iterate all edges
- `has_edge(handle_t left, handle_t right)` → `bool`
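The handle methods above can be pictured with a toy model. This is a minimal sketch, assuming a handle packs the node ID and orientation into one 64-bit value with the orientation in the low bit; the real `handle_t` is opaque and each implementation chooses its own encoding. All `toy_*` names are hypothetical and are not part of libhandlegraph.

```cpp
#include <cstdint>

// Hypothetical toy types for illustration only; the real handle_t is opaque.
struct toy_handle_t { std::uint64_t bits; };
using toy_nid_t = std::int64_t;

// Pack a node ID and an orientation into one 64-bit value (assumed layout:
// ID in the high 63 bits, orientation flag in the low bit).
inline toy_handle_t toy_get_handle(toy_nid_t id, bool is_reverse = false) {
    return toy_handle_t{(static_cast<std::uint64_t>(id) << 1) | (is_reverse ? 1u : 0u)};
}

// Recover the node ID by dropping the orientation bit.
inline toy_nid_t toy_get_id(toy_handle_t h) {
    return static_cast<toy_nid_t>(h.bits >> 1);
}

// Recover the orientation from the low bit.
inline bool toy_get_is_reverse(toy_handle_t h) {
    return (h.bits & 1u) != 0;
}

// Toggle the orientation; the node ID is unchanged.
inline toy_handle_t toy_flip(toy_handle_t h) {
    return toy_handle_t{h.bits ^ 1u};
}
```

In this model, `toy_flip` applied twice returns the original handle, and a handle and its flipped twin always report the same node ID.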

#### MutableHandleGraph additions
- `create_handle(string seq)` / `create_handle(string seq, nid_t id)` → `handle_t`
- `create_edge(handle_t left, handle_t right)`
- `destroy_handle(handle_t)` / `destroy_edge(handle_t, handle_t)`

#### HandleGraph algorithms
- Things like `topological_sort.hpp` and `copy_graph.hpp` are in `deps/libhandlegraph/src/include/handlegraph/algorithms`.

#### bdsg::HashGraph
- Header: `deps/libbdsg/bdsg/include/bdsg/hash_graph.hpp`
- Implements MutablePathMutableHandleGraph
- Go-to handlegraph implementation to use
- In libbdsg

### Utilities
- `reverse_complement(string)` → `string` in src/utility.hpp
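
For readers unfamiliar with the operation, here is a minimal standalone sketch of what a reverse-complement helper does. It is not the implementation in `src/utility.hpp`; the `toy_` names are hypothetical, and mapping every non-ACGT character to `N` is an assumption made for this example.

```cpp
#include <algorithm>
#include <string>

// Complement a single DNA base, preserving case. Anything that is not
// A, C, G, or T becomes 'N' (an assumption for this sketch).
inline char toy_complement(char base) {
    switch (base) {
        case 'A': return 'T';
        case 'C': return 'G';
        case 'G': return 'C';
        case 'T': return 'A';
        case 'a': return 't';
        case 'c': return 'g';
        case 'g': return 'c';
        case 't': return 'a';
        default:  return 'N';
    }
}

// Reverse the sequence, then complement each base in place.
inline std::string toy_reverse_complement(const std::string& seq) {
    std::string result(seq.rbegin(), seq.rend());
    std::transform(result.begin(), result.end(), result.begin(), toy_complement);
    return result;
}
```

In a bidirected sequence graph, the sequence read through the flipped handle of a node is the reverse complement of the sequence read through the forward handle, which is why a helper like this sits next to the HandleGraph notes.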

1 change: 1 addition & 0 deletions CLAUDE.md
10 changes: 6 additions & 4 deletions Makefile
@@ -104,7 +104,8 @@ ifeq ($(shell uname -s),Darwin)
 LD_UTIL_RPATH_FLAGS=""

 # Homebrew installs a Protobuf that uses an Abseil that is built with C++17, so we need to build with at least C++17
-CXX_STANDARD?=17
+# C++20 for spaceship operator and ranges
+CXX_STANDARD?=20

 # We may need libraries from Macports
 ifeq ($(shell if [ -d /opt/local/lib ];then echo 1;else echo 0;fi), 1)
@@ -229,8 +230,9 @@ else
 $(info Compiler $(CXX) is assumed to be GCC)

 # gbwtgraph uses inline variables and our oldest supported compiler has
-# C++17, so we should use C++17
-CXX_STANDARD?=17
+# C++17, so we should use at least C++17.
+# C++20 for spaceship operator and ranges
+CXX_STANDARD?=20

 # Set an rpath for vg and dependency utils to find installed libraries
 LD_UTIL_RPATH_FLAGS="-Wl,-rpath,$(CWD)/$(LIB_DIR)"
@@ -820,7 +822,7 @@ $(INC_DIR)/dynamic/dynamic.hpp: $(DYNAMIC_DIR)/include/dynamic/*.hpp $(DYNAMIC_D
 +mkdir -p $(INC_DIR)/dynamic && cp -r $(CWD)/$(DYNAMIC_DIR)/include/dynamic/* $(INC_DIR)/dynamic/

 $(INC_DIR)/sparsehash/sparse_hash_map: $(wildcard $(SPARSEHASH_DIR)/**/*.cc) $(wildcard $(SPARSEHASH_DIR)/**/*.h)
-+cd $(SPARSEHASH_DIR) && ./autogen.sh && LDFLAGS="$(LD_LIB_DIR_FLAGS) $(LDFLAGS)" ./configure --prefix=$(CWD) $(FILTER) && $(MAKE) $(FILTER) && $(MAKE) install
++cd $(SPARSEHASH_DIR) && ./autogen.sh && LDFLAGS="$(LD_LIB_DIR_FLAGS) $(LDFLAGS)" ./configure --prefix=$(CWD) $(FILTER) && $(MAKE) src/sparsehash/internal/sparseconfig.h $(FILTER) && $(MAKE) install-data $(FILTER)

 $(INC_DIR)/sparsepp/spp.h: $(wildcard $(SPARSEPP_DIR)/sparsepp/*.h)
 +cp -r $(SPARSEPP_DIR)/sparsepp $(INC_DIR)/
2 changes: 1 addition & 1 deletion README.md
@@ -93,7 +93,7 @@ On other distros, or if you do not have root access, you will need to perform th
 liblzma-dev liblz4-dev libffi-dev libcairo-dev libboost-all-dev \
 libzstd-dev pybind11-dev python3-pybind11 libssl-dev kmc

-At present, you will need GCC version 9 or greater, with support for C++17, to compile vg. (Check your version with `gcc --version`.) GCC up to 11.4.0 is supported.
+At present, you will need GCC version 10 or greater, with support for C++20, to compile vg. (Check your version with `gcc --version`.) GCC up to 11.4.0 is supported.

 Other libraries may be required. Please report any build difficulties.

2 changes: 1 addition & 1 deletion deps/gbwtgraph
4 changes: 2 additions & 2 deletions src/cactus.cpp
@@ -999,8 +999,8 @@ VG cactus_to_vg(stCactusGraph* cactus_graph) {
 return vg_graph;
 }

-VG cactusify(VG& graph) {
-if (graph.size() == 0) {
+VG cactusify(const PathHandleGraph& graph) {
+if (graph.get_node_count() == 0) {
 return VG();
 }
 auto parts = handle_graph_to_cactus(graph, unordered_set<string>());
2 changes: 1 addition & 1 deletion src/cactus.hpp
@@ -46,7 +46,7 @@ VG cactus_to_vg(stCactusGraph* cactus_graph);

 // Convert vg into vg formatted cactus representation
 // Input graph must be sorted!
-VG cactusify(VG& graph);
+VG cactusify(const PathHandleGraph& graph);

 }

82 changes: 41 additions & 41 deletions src/cluster.hpp
@@ -212,8 +212,8 @@ class MEMClusterer {

 protected:

-class HitNode;
 class HitEdge;
+class HitNode;
 class HitGraph;
 class DPScoreComparator;

@@ -232,7 +232,47 @@ class MEMClusterer {
 /// is closest to the optimal separation
 void deduplicate_cluster_pairs(vector<pair<pair<size_t, size_t>, int64_t>>& cluster_pairs, int64_t optimal_separation);
 };

+class MEMClusterer::HitEdge {
+public:
+HitEdge(size_t to_idx, int32_t weight, int64_t distance) : to_idx(to_idx), weight(weight), distance(distance) {}
+HitEdge() = default;
+~HitEdge() = default;
+
+/// Index of the node that the edge points to
+size_t to_idx;
+
+/// Weight for dynamic programming
+int32_t weight;
+
+/// Estimated distance
+int64_t distance;
+};
+
+class MEMClusterer::HitNode {
+public:
+HitNode(const MaximalExactMatch& mem, pos_t start_pos, int32_t score) : mem(&mem), start_pos(start_pos), score(score) { }
+HitNode() = default;
+~HitNode() = default;
+
+const MaximalExactMatch* mem;
+
+/// Position of GCSA hit in the graph
+pos_t start_pos;
+
+/// Score of the exact match this node represents
+int32_t score;
+
+/// Score used in dynamic programming
+int32_t dp_score;
+
+/// Edges from this node that are colinear with the read
+vector<HitEdge> edges_from;
+
+/// Edges to this node that are colinear with the read
+vector<HitEdge> edges_to;
+};
+
 class MEMClusterer::HitGraph {
 public:

@@ -286,46 +326,6 @@ class MEMClusterer::HitGraph {
 UnionFind components;
 };

-class MEMClusterer::HitNode {
-public:
-HitNode(const MaximalExactMatch& mem, pos_t start_pos, int32_t score) : mem(&mem), start_pos(start_pos), score(score) { }
-HitNode() = default;
-~HitNode() = default;
-
-const MaximalExactMatch* mem;
-
-/// Position of GCSA hit in the graph
-pos_t start_pos;
-
-/// Score of the exact match this node represents
-int32_t score;
-
-/// Score used in dynamic programming
-int32_t dp_score;
-
-/// Edges from this node that are colinear with the read
-vector<HitEdge> edges_from;
-
-/// Edges to this node that are colinear with the read
-vector<HitEdge> edges_to;
-};
-
-class MEMClusterer::HitEdge {
-public:
-HitEdge(size_t to_idx, int32_t weight, int64_t distance) : to_idx(to_idx), weight(weight), distance(distance) {}
-HitEdge() = default;
-~HitEdge() = default;
-
-/// Index of the node that the edge points to
-size_t to_idx;
-
-/// Weight for dynamic programming
-int32_t weight;
-
-/// Estimated distance
-int64_t distance;
-};
-
 struct MEMClusterer::DPScoreComparator {
 private:
 const vector<HitNode>& nodes;
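The `src/cluster.hpp` change only reorders definitions: `HitEdge` and `HitNode` are now fully defined before `HitGraph`, and the forward declarations follow the same order. The PR does not state the reason. One plausible motivation, and this is an assumption, is that under C++20 a class holding `std::vector<T>` members is more likely to need `T` to be complete at the point where the class is defined. A minimal sketch of the dependency-ordered layout, using hypothetical `Toy*` names that are not from vg:

```cpp
#include <cstddef>
#include <vector>

class ToyClusterer {
public:
    // Forward declarations, listed in dependency order.
    class Edge;
    class Node;
    class Graph;
};

// Edge has no dependencies, so it is defined first.
class ToyClusterer::Edge {
public:
    Edge() = default;
    explicit Edge(std::size_t to_idx) : to_idx(to_idx) {}

    /// Index of the node that the edge points to
    std::size_t to_idx = 0;
};

// Node stores Edge by value in a vector, so Edge is already complete here.
class ToyClusterer::Node {
public:
    Node() = default;
    std::vector<Edge> edges_from;
};

// Graph stores Node by value in a vector, so Node is already complete here.
class ToyClusterer::Graph {
public:
    explicit Graph(std::size_t node_count) : nodes(node_count) {}

    // Record a directed edge between two node indexes.
    void add_edge(std::size_t from, std::size_t to) {
        nodes.at(from).edges_from.emplace_back(to);
    }

    std::vector<Node> nodes;
};
```

Defining each type before the first class that stores it by value keeps the header independent of how eagerly a given compiler and standard library instantiate `std::vector` members.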
43 changes: 34 additions & 9 deletions src/gbwtgraph_helper.cpp
@@ -431,7 +431,7 @@ std::vector<key_type> find_frequent_kmers(const gbwtgraph::GBZ& gbz, const Minim
 void cache_payloads(
 const gbwtgraph::GBZ& gbz,
 const SnarlDistanceIndex& distance_index,
-hash_map<nid_t, payload_t>& node_id_to_payload,
+vg::hash_map<nid_t, payload_t>& node_id_to_payload,
 ZipCodeCollection* oversized_zipcodes,
 bool progress
 ) {
@@ -442,22 +442,37 @@ void cache_payloads(

 const handlegraph::HandleGraph* graph_ptr = (const handlegraph::HandleGraph*) &gbz.graph;

+double total_zipcode_time = 0.0, total_decoder_time = 0.0;
+std::atomic<uint64_t> node_count = 0;
 gbz.graph.for_each_handle([&](const handle_t& handle) {
 nid_t node_id = gbz.graph.get_id(handle);
-ZipCode zipcode;
 pos_t pos = make_pos_t(node_id, false, 0);
-zipcode.fill_in_zipcode_from_pos(distance_index, pos, true, graph_ptr);
+ZipCode zipcode;
+zipcode.fill_in_zipcode_from_pos(distance_index, pos, false, graph_ptr);
+zipcode.fill_in_full_decoder();
+if (++node_count % 10000 == 0 && progress) {
+double telapsed = gbwt::readTimer() - start;
+#pragma omp critical (cerr)
+std::cerr << " Cached " << node_count << " nodes in " << telapsed << "s" << std::endl;
+}
+
 payload_t payload = zipcode.get_payload_from_zip();
 if (payload == MIPayload::NO_CODE && oversized_zipcodes != nullptr) {
 // The zipcode is too large for the payload field.
 // Add it to the oversized zipcode list.
-zipcode.fill_in_full_decoder();
-size_t offset = oversized_zipcodes->size();
-oversized_zipcodes->emplace_back(zipcode);
+size_t offset;
+#pragma omp critical (cache_payloads_zipcodes)
+{
+offset = oversized_zipcodes->size();
+oversized_zipcodes->emplace_back(zipcode);
+}
 payload = { 0, offset };
 }
-node_id_to_payload.emplace(node_id, payload);
-});
+#pragma omp critical (cache_payloads_map)
+{
+node_id_to_payload.emplace(node_id, payload);
+}
+}, true);

 if (progress) {
 double seconds = gbwt::readTimer() - start;
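The hunk above makes `cache_payloads` parallel by passing `true` to `for_each_handle`, counting nodes with a `std::atomic`, and guarding each shared container with its own named `omp critical` section. The same pattern can be sketched in isolation. Everything here is a hypothetical stand-in (`ToyCache`, `toy_payload_t`, and the rule that every third node is "oversized"); none of it is vg code. Without `-fopenmp` the pragmas are ignored and the loop runs serially with the same result.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for the real payload type.
using toy_payload_t = std::pair<std::uint64_t, std::uint64_t>;

struct ToyCache {
    std::unordered_map<std::int64_t, toy_payload_t> node_to_payload;
    std::vector<std::int64_t> oversized;  // stand-in for the oversized zipcode list
    std::uint64_t nodes_seen = 0;
};

ToyCache cache_toy_payloads(const std::vector<std::int64_t>& node_ids) {
    ToyCache result;
    // The counter is shared by all threads, so it must be atomic.
    std::atomic<std::uint64_t> node_count{0};

    #pragma omp parallel for
    for (std::size_t i = 0; i < node_ids.size(); ++i) {
        const std::int64_t id = node_ids[i];
        ++node_count;

        toy_payload_t payload{static_cast<std::uint64_t>(id), 0};
        if (id % 3 == 0) {
            // Pretend this node's code does not fit in the payload field.
            // Reading the size and appending must happen together, or two
            // threads could claim the same offset.
            std::size_t offset = 0;
            #pragma omp critical (toy_oversized)
            {
                offset = result.oversized.size();
                result.oversized.push_back(id);
            }
            payload = {0, offset};
        }

        // The hash map is not safe for concurrent insertion either.
        #pragma omp critical (toy_map)
        {
            result.node_to_payload.emplace(id, payload);
        }
    }

    result.nodes_seen = node_count.load();
    return result;
}
```

Using two differently named critical sections lets a thread append to the oversized list while another inserts into the map. The order of entries in the oversized list is no longer deterministic, which is harmless here because each payload stores the offset it was actually given.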
@@ -519,8 +534,18 @@ gbwtgraph::DefaultMinimizerIndex build_minimizer_index(
 } else {
 // Cache payloads before building the index.
 // A zipcode only depends on the node id.
-hash_map<nid_t, payload_t> node_id_to_payload;
+vg::hash_map<nid_t, payload_t> node_id_to_payload;
 node_id_to_payload.reserve(gbz.graph.max_node_id() - gbz.graph.min_node_id());
+// Re-preload the distance index right before use. find_frequent_kmers
+// runs for a long time and may evict the mmap'd index pages from the OS
+// page cache. We also preload eagerly right after loading the index (in
+// minimizer_main.cpp) so the kernel treats those pages as recently-used;
+// together the two preloads prevent cache_payloads from page-faulting on
+// every node under the memory pressure of 32 parallel threads.
Comment on lines +541 to +544
This doesn't really explain how two passes of preloading could possibly help, either.

+if (params.progress) {
+std::cerr << "Preloading distance index";
+}
+distance_index->preload(true);
 cache_payloads(gbz, *distance_index, node_id_to_payload, oversized_zipcodes, params.progress);

 auto get_payload = [&](const pos_t& pos) -> const code_type* {