Skip to content

Byte buffers holding XLUnicodeString prematurely truncated#521

Draft
sftse wants to merge 5 commits intotafia:masterfrom
sftse:unicode-strings
Draft

Byte buffers holding XLUnicodeString prematurely truncated#521
sftse wants to merge 5 commits intotafia:masterfrom
sftse:unicode-strings

Conversation

@sftse
Copy link
Copy Markdown
Contributor

@sftse sftse commented Jul 2, 2025

This PR is another batch of commits from #463.

The first commit b5b948a introduces a test that fails.
1c58893 fixes one of the mistakes, so that a different error message appears.
599eae8 fixes the next failure due to a spec nonconformity of the test file, which seems like a logical extension of the standard.
Already fixed in upstreamed PR.

Edit: updated commit hashes after push

@sftse sftse force-pushed the unicode-strings branch from 9f2af03 to 051de6a Compare July 2, 2025 09:55
@jmcnamara
Copy link
Copy Markdown
Collaborator

Looks like a good patchset with tests and a sensible fix. I vote +1 for merge and if there is no dissent I will merge in a few days.

@jmcnamara jmcnamara mentioned this pull request Jul 3, 2025
@sftse
Copy link
Copy Markdown
Contributor Author

sftse commented Jul 4, 2025

Before merge I want to add the particular, spec-nonconformant XLUnicodeString as a unit test.

@jmcnamara
Copy link
Copy Markdown
Collaborator

jmcnamara commented Jul 4, 2025

Before merge I want to add the particular, spec-nonconformant XLUnicodeString as a unit test.

Okay. I'll wait for that. Let me know.

@sftse
Copy link
Copy Markdown
Contributor Author

sftse commented Jul 6, 2025

While revisiting this to add a unit test, I noticed that the formula that is triggering this high byte string panic is also the one from row 43 col 9 that was the point of discussion here.

This makes me think there is more to this test than initially thought. The other thing I'm not able to piece together is that the XlsEncoding is windows-1252, so the high-byte being set for two byte characters doesn't make a lot of sense? The root cause for this error might be something else, and the patch does fixup an error, just a different one, while accidentally fixing the crash.

If there are any doubts about how the MS-XLS 2.5.296 says one should read an XLUnicodeStringNoCch, then we should hold off merging, if not I think we can merge? A testcase would be nice though, since I'm no longer certain the .xls file added even contains any.

@sftse
Copy link
Copy Markdown
Contributor Author

sftse commented Jul 6, 2025

Actually, let's hold off on merge until we can investigate this further.

@sftse sftse marked this pull request as draft July 6, 2025 12:39
@jmcnamara
Copy link
Copy Markdown
Collaborator

jmcnamara commented Jul 6, 2025

While revisiting this to add a unit test, I noticed that the formula that is triggering this high byte string panic is also the one from row 43 col 9 that was the point of discussion here.

I created a test xls file called j44.xls that contains the formula =REPT("O",I44) in cell J44 like in the failing test file high_byte_string.xls.

j44.xls

This has the following binary data for the formula:

    17 01 00 4F 44 2B 00 08 C0 41 1E 00

I intrepreted this as:

17    ptgStr
0001  cch = 1 and fHighByte not set.
4F    string "O"

44    ptgRe = 0x04 +  PtgDataType = 0x2 = Value
002B  Row 43
08    Col  8 (combined to make I44)
C0    = 1100_0000b. Bit 8 indicates row is relative (not absolute). Bit 7 is the same for columns.

41    ptgFunc Value
001E  REPT

(As an aside, I said in another comment that C0 was the record "TOOLBAREND" but after a bit of detective work I re-reminded myself that the second byte of the column word is used to store 2 bits that indicate if the row or column are relative or absolute. I.e., A1 vs $A$1)

The formula in high_byte_string.xls looks like this:

   17 01 4F 44 2B C0 08 41 1E 00 

Comparing the 2 binary data sets:

high_byte: 17 | 01    4F | 44 2B C0 08    | 41 1E 00

j44:       17 | 01 00 4F | 44 2B 00 08 C0 | 41 1E 00

There are 2 odd things here.

  1. There is no fHighByte field in the the ShortXLUnicodeString. This may be because, as you point out, the CODEPAGE is 1252 which might mean that all strings are simple single byte encodings so the encoding/fHighByte field can be omitted. It is probably a XLUnicodeStringNoCch string as you point out.
  2. The row in the cell reference appears to be a single byte only. And the C0 token is in an odd place.

All in all, it looks like a junk formula except for the fact that Excel (including a couple of older versions that I have) reads it. This is one of the reasons I stopped working with the xls format once xlsx came along. :-)

We could maybe just fix the cch = 1 issue and not fix whatever this is.

Update 2: I think I understand the difference between the 2 ptgRef structures above.

The first in high_byte_string.xlsx probably uses an older 14 bit encoding for rows up to 16,383 in Excel 95 and earlier and uses bits 15-16 to encode the row/col relative/absolute. I couldn't find a 3-byte RgceLoc (2.5.198.109) variant but it is in the old paper copy I have. On the other hand both test files report Biff5 in the BOF so I don't know how a parser should distinguish between a 3 byte and 4 byte RgceLoc.

The second is the 4 byte RgceLoc (2.5.198.109) with the row/col relative/absolute bit in the column entry.

@jmcnamara
Copy link
Copy Markdown
Collaborator

@sftse I undated the previous comment to clarify what C0 is.

@sftse sftse force-pushed the unicode-strings branch from 051de6a to 5a43311 Compare July 7, 2025 10:22
@jmcnamara
Copy link
Copy Markdown
Collaborator

jmcnamara commented Jul 7, 2025

@sftse let me know when this patchset ready for merge?

sftse added 5 commits October 30, 2025 15:15
When reading a PtgStr 2.5.198.89 the cch byte of ShortXLUnicodeString
indicates the number of characters in the string. The error here is twofold:

1. The byte buffer holding the string characters in prematurely truncated
before calling fn read_unicode_string_no_cch() based on cch, although
the correct length in bytes can only be known inside fn read_unicode..()
after checking the fHighByte flag. The fix is to not truncate the buffer at all
pass it in its entirety so that fn read_unicode_string_no_cch() may decide
how many bytes to read.

2. The second error then advances the offset into the buffer based on this
erroneous length, which later leads to crashes.
@sftse
Copy link
Copy Markdown
Contributor Author

sftse commented Oct 30, 2025

Haven't forgotten about this, thanks for your review. Rebased, will revisit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants