Skip to content

xls: fix issue with incorrect SST record#549

Merged
jmcnamara merged 1 commit intotafia:masterfrom
jmcnamara:gh548_incorrect_sst_unique_count
Sep 6, 2025
Merged

xls: fix issue with incorrect SST record#549
jmcnamara merged 1 commit intotafia:masterfrom
jmcnamara:gh548_incorrect_sst_unique_count

Conversation

@jmcnamara
Copy link
Copy Markdown
Collaborator

This commit fixes an issue in the Excel xls SST record (Shared String Table) when the unique string count is incorrect.

Fixes #548

@jmcnamara jmcnamara force-pushed the gh548_incorrect_sst_unique_count branch 2 times, most recently from 450c21f to bfef628 Compare September 3, 2025 19:17
@jmcnamara jmcnamara requested a review from Copilot September 3, 2025 19:18
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes an issue in Excel xls file parsing where the SST (Shared String Table) record contains an incorrect unique string count, causing parsing to fail or miss strings.

  • Modified the SST parsing logic to read all available strings instead of relying on the declared count
  • Updated string parsing conditions to handle continuation records properly
  • Added test coverage for the specific issue with incorrect SST unique count

Reviewed Changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 1 comment.

File Description
src/xls.rs Updates SST parsing to read all strings until data is exhausted rather than trusting the count field
tests/test.rs Adds test case to verify parsing of files with incorrect SST unique string counts
Changelog.md Documents the bug fix for issue #548

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Comment thread src/xls.rs
@jmcnamara jmcnamara force-pushed the gh548_incorrect_sst_unique_count branch from bfef628 to e1c63e0 Compare September 3, 2025 23:38
@jmcnamara
Copy link
Copy Markdown
Collaborator Author

@sftse If you get a chance could you review this Xls fix.

@jmcnamara jmcnamara mentioned this pull request Sep 3, 2025
2 tasks
@jmcnamara jmcnamara force-pushed the gh548_incorrect_sst_unique_count branch 2 times, most recently from affb801 to aeef06d Compare September 4, 2025 09:55
@timcoote
Copy link
Copy Markdown

timcoote commented Sep 4, 2025

documentation point: the breakage is in a .xls file, not xlsx.

should the user be made aware of the error in the xls file? I think that they should, but I don't know how calamine would do that (I'm almost totally ignorant of the package).

@sftse
Copy link
Copy Markdown
Contributor

sftse commented Sep 4, 2025

If the len is generally considered untrusted, can we remove the Vec::with_capacity(len) above it? This kind of rightsizing seems like a footgun, see #529, and would potentially lead to OOM if we had a fuzzer sophisticated enough to reach those codepaths.

Will try to give a more thorough review within a day or two.

@jmcnamara jmcnamara force-pushed the gh548_incorrect_sst_unique_count branch from aeef06d to 189e02b Compare September 4, 2025 12:46
@jmcnamara
Copy link
Copy Markdown
Collaborator Author

@timcoote

documentation point: the breakage is in a .xls file, not xlsx.

Thanks. Fixed.

should the user be made aware of the error in the xls file?

In this particular case I don't think the information is useful to most end-users and unless they wrote the xls exporting software there isn't anything that they can do about it. If you know what software produced these files we could raise the issue upstream.

@jmcnamara
Copy link
Copy Markdown
Collaborator Author

@sftse

If the len is generally considered untrusted, can we remove the Vec::with_capacity(len) above it? This kind of rightsizing seems like a footgun

Agreed. I removed that and rebased.

There are a lot of other similar instances in the code base. There are probably very few that have a measurable benefit to performance and some could led to crashes with deliberate or accidentally incorrect values in the parse file.

Will try to give a more thorough review within a day or two.

Thanks.

@jmcnamara jmcnamara changed the title xlsx: fix issue with incorrect SST record xls: fix issue with incorrect SST record Sep 4, 2025
@timcoote
Copy link
Copy Markdown

timcoote commented Sep 4, 2025

In this particular case I don't think the information is useful to most end-users and unless they wrote the xls exporting software there isn't anything that they can do about it. If you know what software produced these files we could raise the issue upstream.

I've not yet bottomed out how this file was created. It comes from UK NHS. My concern is that such orgs have poor processes for keeping s/w current (or the export formats, for that matter: still on .xls), and anything that makes such mistakes more visible would reduce the risks from not updating often enough.
Reasonable to ignore.

What's my fastest route for validating the fix against other files - build the dev environment, or wait for the PR to flow thro' ?

@jmcnamara
Copy link
Copy Markdown
Collaborator Author

@timcoote

What's my fastest route for validating the fix against other files - build the dev environment, or wait for the PR to flow thro' ?

It depends on how urgently you need it. The fix should be released over the weekend at which point python-calamine can consume it.

If you want to test it in the interim you can modify the Cargo.toml file for python-calamine and modify the calamine dependency entry to the following:

calamine = { 
    version = "0.30.0", 
    features = ["dates"] , 
    git = "https://github.com/jmcnamara/calamine.git", 
    branch = "gh548_incorrect_sst_unique_count"}

You will then need to rebuild the python module and install it. This will be slightly tricky if you haven't done it before.

This commit fixes an issue in the Excel xls SST record (Shared
String Table) when the unique string count is incorrect.

Fixes tafia#548
@jmcnamara jmcnamara force-pushed the gh548_incorrect_sst_unique_count branch from 189e02b to c243386 Compare September 5, 2025 00:22
@timcoote
Copy link
Copy Markdown

timcoote commented Sep 6, 2025

Indeed. It failed on pytest - I'll chase up downstream. However, I also had to put the Cargo.toml dependency on one line for it to parse. Is this expected?

@jmcnamara jmcnamara merged commit a09420f into tafia:master Sep 6, 2025
5 checks passed
@jmcnamara
Copy link
Copy Markdown
Collaborator Author

Merged so that the fix can be released to unblock downstream.

@jmcnamara jmcnamara deleted the gh548_incorrect_sst_unique_count branch October 21, 2025 19:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: Excel file will not parse

4 participants