Byte buffers holding XLUnicodeString prematurely truncated by sftse · Pull Request #521 · tafia/calamine

sftse · 2025-07-02T09:52:51Z

This PR is another batch of commits from #463.

The first commit b5b948a introduces a test that fails.
1c58893 fixes one of the mistakes, so that a different error message appears.
599eae8 fixes the next failure due to a spec nonconformity of the test file, which seems like a logical extension of the standard. Already fixed in upstreamed PR.

Edit: updated commit hashes after push

jmcnamara · 2025-07-03T10:30:19Z

Looks like a good patchset with tests and a sensible fix. I vote +1 for merge and if there is no dissent I will merge in a few days.

sftse · 2025-07-04T14:25:36Z

Before merge I want to add the particular, spec-nonconformant XLUnicodeString as a unit test.

jmcnamara · 2025-07-04T14:33:46Z

Before merge I want to add the particular, spec-nonconformant XLUnicodeString as a unit test.

Okay. I'll wait for that. Let me know.

sftse · 2025-07-06T12:29:36Z

While revisiting this to add a unit test, I noticed that the formula that is triggering this high byte string panic is also the one from row 43 col 9 that was the point of discussion here.

This makes me think there is more to this test than initially thought. The other thing I'm not able to piece together is that the XlsEncoding is windows-1252, so the high-byte being set for two byte characters doesn't make a lot of sense? The root cause for this error might be something else, and the patch does fixup an error, just a different one, while accidentally fixing the crash.

If there are any doubts about how the MS-XLS 2.5.296 says one should read an XLUnicodeStringNoCch, then we should hold off merging, if not I think we can merge? A testcase would be nice though, since I'm no longer certain the .xls file added even contains any.

sftse · 2025-07-06T12:38:46Z

Actually, let's hold off on merge until we can investigate this further.

jmcnamara · 2025-07-06T16:06:37Z

While revisiting this to add a unit test, I noticed that the formula that is triggering this high byte string panic is also the one from row 43 col 9 that was the point of discussion here.

I created a test xls file called j44.xls that contains the formula =REPT("O",I44) in cell J44 like in the failing test file high_byte_string.xls.

j44.xls

This has the following binary data for the formula:

    17 01 00 4F 44 2B 00 08 C0 41 1E 00

I intrepreted this as:

17    ptgStr
0001  cch = 1 and fHighByte not set.
4F    string "O"

44    ptgRe = 0x04 +  PtgDataType = 0x2 = Value
002B  Row 43
08    Col  8 (combined to make I44)
C0    = 1100_0000b. Bit 8 indicates row is relative (not absolute). Bit 7 is the same for columns.

41    ptgFunc Value
001E  REPT

(As an aside, I said in another comment that C0 was the record "TOOLBAREND" but after a bit of detective work I re-reminded myself that the second byte of the column word is used to store 2 bits that indicate if the row or column are relative or absolute. I.e., A1 vs $A$1)

The formula in high_byte_string.xls looks like this:

   17 01 4F 44 2B C0 08 41 1E 00

Comparing the 2 binary data sets:

high_byte: 17 | 01    4F | 44 2B C0 08    | 41 1E 00

j44:       17 | 01 00 4F | 44 2B 00 08 C0 | 41 1E 00

There are 2 odd things here.

There is no fHighByte field in the the ShortXLUnicodeString. ~~This may be because, as you point out, the CODEPAGE is 1252 which might mean that all strings are simple single byte encodings so the encoding/fHighByte field can be omitted.~~ It is probably a XLUnicodeStringNoCch string as you point out.
The row in the cell reference appears to be a single byte only. And the C0 token is in an odd place.

All in all, it looks like a junk formula except for the fact that Excel (including a couple of older versions that I have) reads it. This is one of the reasons I stopped working with the xls format once xlsx came along. :-)

We could maybe just fix the cch = 1 issue and not fix whatever this is.

Update 2: I think I understand the difference between the 2 ptgRef structures above.

The first in high_byte_string.xlsx probably uses an older 14 bit encoding for rows up to 16,383 in Excel 95 and earlier and uses bits 15-16 to encode the row/col relative/absolute. I couldn't find a 3-byte RgceLoc (2.5.198.109) variant but it is in the old paper copy I have. On the other hand both test files report Biff5 in the BOF so I don't know how a parser should distinguish between a 3 byte and 4 byte RgceLoc.

The second is the 4 byte RgceLoc (2.5.198.109) with the row/col relative/absolute bit in the column entry.

jmcnamara · 2025-07-06T22:58:18Z

@sftse I undated the previous comment to clarify what C0 is.

jmcnamara · 2025-07-07T11:31:43Z

@sftse let me know when this patchset ready for merge?

When reading a PtgStr 2.5.198.89 the cch byte of ShortXLUnicodeString indicates the number of characters in the string. The error here is twofold: 1. The byte buffer holding the string characters in prematurely truncated before calling fn read_unicode_string_no_cch() based on cch, although the correct length in bytes can only be known inside fn read_unicode..() after checking the fHighByte flag. The fix is to not truncate the buffer at all pass it in its entirety so that fn read_unicode_string_no_cch() may decide how many bytes to read. 2. The second error then advances the offset into the buffer based on this erroneous length, which later leads to crashes.

…d in spec

sftse · 2025-10-30T16:05:03Z

Haven't forgotten about this, thanks for your review. Rebased, will revisit.

sftse force-pushed the unicode-strings branch from 9f2af03 to 051de6a Compare July 2, 2025 09:55

jmcnamara mentioned this pull request Jul 3, 2025

Oom alloc #463

Closed

sftse marked this pull request as draft July 6, 2025 12:39

sftse force-pushed the unicode-strings branch from 051de6a to 5a43311 Compare July 7, 2025 10:22

sftse added 5 commits October 30, 2025 15:15

test: add high byte strings testcase

b5b948a

test: expand test to check for missing text

f6630cc

fix: add record type 214 that comes out of nowhere and cannot be foun…

74b66f8

…d in spec

test: add future FIXME for unrecognized formula

b9ed9a7

sftse force-pushed the unicode-strings branch from 5a43311 to b9ed9a7 Compare October 30, 2025 16:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Byte buffers holding XLUnicodeString prematurely truncated#521

Byte buffers holding XLUnicodeString prematurely truncated#521
sftse wants to merge 5 commits intotafia:masterfrom
sftse:unicode-strings

sftse commented Jul 2, 2025 •

edited

Loading

Uh oh!

jmcnamara commented Jul 3, 2025

Uh oh!

sftse commented Jul 4, 2025

Uh oh!

jmcnamara commented Jul 4, 2025 •

edited

Loading

Uh oh!

sftse commented Jul 6, 2025

Uh oh!

sftse commented Jul 6, 2025

Uh oh!

jmcnamara commented Jul 6, 2025 •

edited

Loading

Uh oh!

jmcnamara commented Jul 6, 2025

Uh oh!

jmcnamara commented Jul 7, 2025 •

edited

Loading

Uh oh!

sftse commented Oct 30, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

sftse commented Jul 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jmcnamara commented Jul 3, 2025

Uh oh!

sftse commented Jul 4, 2025

Uh oh!

jmcnamara commented Jul 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sftse commented Jul 6, 2025

Uh oh!

sftse commented Jul 6, 2025

Uh oh!

jmcnamara commented Jul 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jmcnamara commented Jul 6, 2025

Uh oh!

jmcnamara commented Jul 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sftse commented Oct 30, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

sftse commented Jul 2, 2025 •

edited

Loading

jmcnamara commented Jul 4, 2025 •

edited

Loading

jmcnamara commented Jul 6, 2025 •

edited

Loading

jmcnamara commented Jul 7, 2025 •

edited

Loading