Hello,
I found an issue in mmseqs translatenucs that can cause a segfault under specific conditions, namely when the size of the FASTA file used to create the database is a multiple of the memory page size.
Repro
python3 -c "import random; [print(f\">0\n{''.join(random.choice('ACGT') for _ in range(60))}\") for _ in range(64 * 15362)]" > repro.fasta
mmseqs createdb --createdb-mode 1 --dbtype 2 repro.fasta db
mmseqs translatenucs db out --threads 1
Cause
The issue is located in src/util/translatenucs.cpp:70
size_t length = reader.getEntryLen(i) - 1;
if ((data[length] != '\n' && length % 3 != 0) && (data[length - 1] == '\n' && (length - 1) % 3 != 0)) {
...
}
Here length is the sequence length including \n. For all sequences except the last one, data[length] reads the start of the next sequence header >:
(gdb) print data
$2 = 0x7ffff7ff5003 "GTGGCGCCAGGGACAGAGAGCCTGAGACAGCAGGCTTACTTTGGGCGTAACTCCAACCTG\n>0\nATAAGTAGG"...
(gdb) print data[length]
$3 = 62 '>'
But for the last sequence, data[length] reads beyond the file boundary.
In practice, the crash often does not occur, for one of two reasons.
- For non page-aligned files, the remaining bytes in the partial page at the end of the mapping are zeroed, and data[length] reads a null byte without error:
(gdb) print data
$2 = 0x7ffff7ff5f83 "TGACTCGGCCTGTTTCCTCGAATCTGCCATGTCACCGAGATGTCGGAGGAAGGTGCACTC\n"
(gdb) print data[length]
$3 = 0 '\000'
- For page-aligned files, other memory may be mapped immediately after the mapping, so the read succeeds but returns garbage data. For example, here the progress bar output sits right after the mapping:
(gdb) print data
$2 = 0x7ffff7ff5fc3 "CAATGGCCAGAATGGCCGGTATCCCTATCGAAGGTACTCCACGTGCTTATGAACTCTTCA\n[", '=' <repeats 65 times>, "] 100.00% 64 0s 9ms\n \r \tfalse\nVerbosity \t3\nCompressed "...
(gdb) print data[length]
$3 = 91 '['
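The first case can be observed in isolation with a small POSIX program. This is only an illustrative sketch: the helper name byteAfterFileEnd and the temporary file path are hypothetical and unrelated to MMseqs2, and the sketch assumes a POSIX system where mmap zero-fills the tail of the last partial page.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical demo helper (POSIX only, not MMseqs2 code).
// Writes `size` bytes to `path`, maps the file read-only, and returns the
// byte located just past the end of the file. `size` must not be a multiple
// of the page size, so that this byte still lies inside the last mapped page.
// Returns -1 if a system call fails.
inline int byteAfterFileEnd(const char* path, std::size_t size) {
    const std::size_t pageSize = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    assert(size > 0 && size % pageSize != 0);

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) {
        return -1;
    }
    const std::string content(size, 'A');
    if (write(fd, content.data(), size) != static_cast<ssize_t>(size)) {
        close(fd);
        unlink(path);
        return -1;
    }

    void* mapping = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (mapping == MAP_FAILED) {
        unlink(path);
        return -1;
    }

    const char* data = static_cast<const char*>(mapping);
    // One byte past the file contents, but still within the same mapped page.
    const int value = static_cast<unsigned char>(data[size]);

    munmap(mapping, size);
    unlink(path);
    return value;
}
```

For a 61-byte file, byteAfterFileEnd returns 0, which matches the gdb output of the non page-aligned case. The same read on a page-aligned file would land on the first byte of the next page, which is the situation that can crash.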
However, when no memory is allocated after the mapping end, the read results in a segfault:
(gdb) print data
$2 = 0x7ffff7e01fc3 "AGGAAAAAAGACGGGTTAGAACTGACTTTGGCCTCCATCACGCAGCATACAAGCGCCGGG\n"<error: Cannot access memory at address 0x7ffff7e02000>
(gdb) print data[length]
Cannot access memory at address 0x7ffff7e02000
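One possible direction for a fix, as a minimal sketch only: decide whether the entry ends in a newline by looking at data[length - 1], which is always inside the entry, instead of data[length]. The helper names residueCount and hasPartialCodon are hypothetical and do not exist in MMseqs2. The sketch also assumes that the purpose of the check is to detect entries whose residue count is not divisible by three, and that the length bytes starting at data are readable.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of MMseqs2: counts the residues in an entry
// of `length` valid bytes, ignoring one trailing newline if present.
// Only data[0 .. length - 1] is read, so the byte after the entry (which may
// lie past the end of the mapping) is never touched.
inline std::size_t residueCount(const char* data, std::size_t length) {
    assert(data != nullptr || length == 0);
    if (length > 0 && data[length - 1] == '\n') {
        return length - 1;
    }
    return length;
}

// Hypothetical helper: true when the entry cannot be split into whole codons.
inline bool hasPartialCodon(const char* data, std::size_t length) {
    return residueCount(data, length) % 3 != 0;
}
```

With these helpers, the condition in translatenucs.cpp could be reduced to a single call such as hasPartialCodon(data, length), under the assumptions stated above.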
I can propose a fix if needed.
Teo