mmseqs translatenucs can segfault when input file size aligns to memory page size #1107

@tlemane

Hello,

I found an issue in mmseqs translatenucs that can lead to a segfault under specific conditions, namely when the size of the FASTA file used to create the database is a multiple of the memory page size.

Repro

python3 -c "import random; [print(f\">0\n{''.join(random.choice('ACGT') for _ in range(60))}\") for _ in range(64 * 15362)]" > repro.fasta

mmseqs createdb --createdb-mode 1 --dbtype 2 repro.fasta db
mmseqs translatenucs db out --threads 1
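
For reference, the arithmetic behind the repro can be checked with a small sketch. It assumes a 4096-byte page size (typical on x86-64 Linux, but not stated above) and uses illustrative constant names. Each record written by the Python one-liner is the 3-byte header ">0\n", 60 bases, and the newline added by print, i.e. 64 bytes.

```cpp
#include <cassert>
#include <cstddef>

// Sizes of one record produced by the Python one-liner in the repro.
constexpr std::size_t kHeaderBytes = 3;     // ">0\n"
constexpr std::size_t kSequenceBytes = 60;  // 60 random bases
constexpr std::size_t kNewlineBytes = 1;    // newline appended by print()
constexpr std::size_t kRecords = 64 * 15362;

// Assumption: 4096-byte pages; check `getconf PAGESIZE` on the target machine.
constexpr std::size_t kAssumedPageSize = 4096;

constexpr std::size_t reproFileSize() {
    return kRecords * (kHeaderBytes + kSequenceBytes + kNewlineBytes);
}

constexpr bool isPageAligned(std::size_t size, std::size_t page) {
    return size % page == 0;
}

// 64 * 15362 records of 64 bytes each is exactly 15362 pages of 4096 bytes.
static_assert(reproFileSize() == 62922752, "unexpected repro file size");
static_assert(isPageAligned(reproFileSize(), kAssumedPageSize),
              "repro file should end on a page boundary");
```

On a system with a different page size, the record count would need to be adjusted so that the total stays a multiple of the page size.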

Cause

The issue is located at src/util/translatenucs.cpp:70:

size_t length = reader.getEntryLen(i) - 1;
if ((data[length] != '\n' && length % 3 != 0) && (data[length - 1] == '\n' && (length - 1) % 3 != 0)) {
    ...
}

Here length is the sequence length including the trailing \n. For all sequences except the last one, data[length] reads the first byte of the next sequence header, >:

(gdb) print data
$2 = 0x7ffff7ff5003 "GTGGCGCCAGGGACAGAGAGCCTGAGACAGCAGGCTTACTTTGGGCGTAACTCCAACCTG\n>0\nATAAGTAGG"...
(gdb) print data[length]
$3 = 62 '>'

But for the last sequence, data[length] reads beyond the file boundary.
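
One way to avoid such a read is to bounds-check the lookahead. The following is only a minimal sketch, not the MMseqs2 code or a proposed patch: byteAtOrNul is a hypothetical helper, and it assumes the caller knows the total mapped size (how MMseqs2 exposes that is not shown in this issue). Out-of-range offsets yield a NUL byte, the same value the zero-filled tail of a non-page-aligned mapping returns.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of MMseqs2): return the byte at `offset`,
// or '\0' when `offset` lies at or beyond the end of the mapped data.
inline char byteAtOrNul(const char* data, std::size_t mappedSize, std::size_t offset) {
    assert(data != nullptr);
    return offset < mappedSize ? data[offset] : '\0';
}

// Example lookahead: what follows an entry of `length` bytes starting at
// `entryStart`? For the last entry this returns '\0' without touching memory
// outside the buffer.
inline char byteAfterEntry(const std::string& buffer, std::size_t entryStart, std::size_t length) {
    return byteAtOrNul(buffer.data(), buffer.size(), entryStart + length);
}
```

With a buffer holding "ACGT\n>0\nTTGA\n", the byte after the first 5-byte entry is '>', while the byte after the last entry is reported as '\0' instead of being read from past the end.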

In practice, the crash often does not occur, for two reasons:

  1. For files that are not page-aligned, the remaining bytes of the partial page at the end of the mapping are zero-filled, so data[length] reads a NUL byte without error:
(gdb) print data
$2 = 0x7ffff7ff5f83 "TGACTCGGCCTGTTTCCTCGAATCTGCCATGTCACCGAGATGTCGGAGGAAGGTGCACTC\n"
(gdb) print data[length]
$3 = 0 '\000'
  2. For page-aligned files, other memory may happen to be mapped immediately after the file mapping, so the read succeeds but returns garbage. For example, here the progress bar content comes right after:
(gdb) print data
$2 = 0x7ffff7ff5fc3 "CAATGGCCAGAATGGCCGGTATCCCTATCGAAGGTACTCCACGTGCTTATGAACTCTTCA\n[", '=' <repeats 65 times>, "] 100.00% 64 0s 9ms\n     \r    \tfalse\nVerbosity        \t3\nCompressed      "...
(gdb) print data[length]
$3 = 91 '['

However, when nothing is mapped after the end of the file mapping, the read results in a segfault:

(gdb) print data
$2 = 0x7ffff7e01fc3 "AGGAAAAAAGACGGGTTAGAACTGACTTTGGCCTCCATCACGCAGCATACAAGCGCCGGG\n"<error: Cannot access memory at address 0x7ffff7e02000>
(gdb) print data[length]
Cannot access memory at address 0x7ffff7e02000
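
The zero-filled tail from the first case above can be observed outside MMseqs2 with a small POSIX sketch. It is an illustration under stated assumptions, not the MMseqs2 mapping code: byteJustPastEnd is a hypothetical function, the temporary file path is arbitrary, and the function deliberately refuses to read when the file size is a multiple of the page size, since that read may crash as shown above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical demo: map a temporary file of `fileSize` bytes and return the
// byte one past its end. Returns -1 if setup fails or if the file is empty or
// page-aligned (in which case that byte lies outside the mapping).
int byteJustPastEnd(std::size_t fileSize) {
    const long pageSize = sysconf(_SC_PAGESIZE);
    assert(pageSize > 0);
    if (fileSize == 0 || fileSize % static_cast<std::size_t>(pageSize) == 0) {
        return -1;  // nothing guarantees that the next page is mapped
    }
    char path[] = "/tmp/mmap_tail_XXXXXX";  // arbitrary temporary location
    const int fd = mkstemp(path);
    if (fd < 0) {
        return -1;
    }
    unlink(path);  // the file disappears once the descriptor is closed
    const std::string payload(fileSize, 'A');
    if (write(fd, payload.data(), payload.size()) != static_cast<ssize_t>(payload.size())) {
        close(fd);
        return -1;
    }
    void* map = mmap(nullptr, fileSize, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (map == MAP_FAILED) {
        return -1;
    }
    // Still inside the last mapped page, which is zero-filled past the file end.
    const int value = static_cast<unsigned char>(static_cast<const char*>(map)[fileSize]);
    munmap(map, fileSize);
    return value;
}
```

Calling byteJustPastEnd(100) should return 0, matching the '\000' seen in gdb for the non-aligned file, while a page-sized argument returns -1 because the sketch skips the unsafe read.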

I can propose a fix if needed.

Teo
