Use namespace-aware XML parsing for workbook.#632
Use namespace-aware XML parsing for workbook.#632aquasync wants to merge 1 commit intotafia:masterfrom
Conversation
|
There's a bit of churn due to templating |
28ee3c4 to
55871a8
Compare
|
Thanks for the fix. It is good as a rule to make interfaces generic but in this particular case there will only be two Readers so separate functions and a small amount of code duplication is probably okay. That would give a less invasive fix like this: diff --git a/src/xlsx/mod.rs b/src/xlsx/mod.rs
index 879b170..2a6056e 100644
--- a/src/xlsx/mod.rs
+++ b/src/xlsx/mod.rs
@@ -16,7 +16,7 @@ use log::warn;
use quick_xml::events::attributes::{AttrError, Attribute, Attributes};
use quick_xml::events::BytesStart;
use quick_xml::events::Event;
-use quick_xml::name::QName;
+use quick_xml::name::{Namespace, QName, ResolveResult};
use quick_xml::Decoder;
use quick_xml::Reader as XmlReader;
use zip::read::{ZipArchive, ZipFile};
@@ -35,6 +35,7 @@ use crate::{
pub use cells_reader::XlsxCellReader;
pub(crate) type XlReader<'a, RS> = XmlReader<BufReader<ZipFile<'a, RS>>>;
+pub(crate) type XlNsReader<'a, RS> = quick_xml::NsReader<BufReader<ZipFile<'a, RS>>>;
/// Maximum number of rows allowed in an XLSX file.
pub const MAX_ROWS: u32 = 1_048_576;
@@ -398,10 +399,14 @@ impl<RS: Read + Seek> Xlsx<RS> {
}
fn read_workbook(&mut self, relationships: &HashMap<Vec<u8>, String>) -> Result<(), XlsxError> {
- let mut xml = match xml_reader(&mut self.zip, "xl/workbook.xml", &self.zip_path_cache) {
+ const NS_RELATIONSHIPS: Namespace =
+ Namespace(b"http://schemas.openxmlformats.org/officeDocument/2006/relationships");
+
+ let mut xml = match xml_ns_reader(&mut self.zip, "xl/workbook.xml", &self.zip_path_cache) {
None => return Ok(()),
Some(x) => x?,
};
+
let mut defined_names = Vec::new();
let mut buf = Vec::with_capacity(1024);
let mut val_buf = Vec::with_capacity(1024);
@@ -414,17 +419,12 @@ impl<RS: Read + Seek> Xlsx<RS> {
let mut visible = SheetVisible::Visible;
for a in e.attributes() {
let a = a?;
- match a {
- Attribute {
- key: QName(b"name"),
- ..
- } => {
+ let (ns, key) = xml.resolver().resolve_attribute(a.key);
+ match (ns, key.as_ref()) {
+ (ResolveResult::Unbound, b"name") => {
name = a.decode_and_unescape_value(xml.decoder())?.to_string();
}
- Attribute {
- key: QName(b"state"),
- ..
- } => {
+ (ResolveResult::Unbound, b"state") => {
visible = match a.decode_and_unescape_value(xml.decoder())?.as_ref()
{
"visible" => SheetVisible::Visible,
@@ -438,12 +438,9 @@ impl<RS: Read + Seek> Xlsx<RS> {
}
}
}
- Attribute {
- key: QName(b"r:id" | b"relationships:id"),
- value: v,
- } => {
+ (ResolveResult::Bound(NS_RELATIONSHIPS), b"id") => {
let r = &relationships
- .get(&*v)
+ .get(&*a.value)
.ok_or(XlsxError::RelationshipNotFound)?[..];
// target may have prepended "/xl/" or "xl/" path;
// strip if present
@@ -1757,6 +1754,7 @@ impl<RS: Read + Seek> ReaderRef<RS> for Xlsx<RS> {
}
}
+// Default XML reader. Not namespace-aware.
fn xml_reader<'a, RS: Read + Seek>(
zip: &'a mut ZipArchive<RS>,
path: &str,
@@ -1767,11 +1765,26 @@ fn xml_reader<'a, RS: Read + Seek>(
match zip.by_name(zip_path) {
Ok(f) => {
let mut r = XmlReader::from_reader(BufReader::new(f));
- let config = r.config_mut();
- config.check_end_names = false;
- config.trim_text(false);
- config.check_comments = false;
- config.expand_empty_elements = true;
+ configure_reader(r.config_mut());
+ Some(Ok(r))
+ }
+ Err(ZipError::FileNotFound) => None,
+ Err(e) => Some(Err(e.into())),
+ }
+}
+
+// Namespace-aware XML reader. Used for attributes with namespaces.
+fn xml_ns_reader<'a, RS: Read + Seek>(
+ zip: &'a mut ZipArchive<RS>,
+ path: &str,
+ cache: &HashMap<String, String>,
+) -> Option<Result<XlNsReader<'a, RS>, XlsxError>> {
+ let zip_path = cached_zip_path(cache, path);
+
+ match zip.by_name(zip_path) {
+ Ok(f) => {
+ let mut r = quick_xml::NsReader::from_reader(BufReader::new(f));
+ configure_reader(r.config_mut());
Some(Ok(r))
}
Err(ZipError::FileNotFound) => None,
@@ -1779,6 +1792,14 @@ fn xml_reader<'a, RS: Read + Seek>(
}
}
+// Configure the XML reader. Used for the default and namespace-aware readers.
+fn configure_reader(config: &mut quick_xml::reader::Config) {
+ config.check_end_names = false;
+ config.trim_text(false);
+ config.check_comments = false;
+ config.expand_empty_elements = true;
+}
+
/// search through an Element's attributes for the named one
pub(crate) fn get_attribute<'a>(
atts: Attributes<'a>,Alternatively, we could ignore the namespace and just handle the id. That would give a very simple fix: diff --git a/src/xlsx/mod.rs b/src/xlsx/mod.rs
index 879b170..dd7e68a 100644
--- a/src/xlsx/mod.rs
+++ b/src/xlsx/mod.rs
@@ -438,10 +438,9 @@ impl<RS: Read + Seek> Xlsx<RS> {
}
}
}
- Attribute {
- key: QName(b"r:id" | b"relationships:id"),
- value: v,
- } => {
+ // Ignore the "r:id" attribute namespace and match on the "id" name.
+ a if a.key.local_name().as_ref() == b"id" => {
+ let v = a.value;
let r = &relationships
.get(&*v)
.ok_or(XlsxError::RelationshipNotFound)?[..];@jqnatividad Any thoughts on this issue and the potential fixes. |
|
Yes I had something similar initially but was trying to remove the duplication. Using the shared reader config stops it being too bad though. As to whether we could just get away with ignoring namespaces entirely - I don't have a good feel for that. I can't see instances in the Happy to go with the simplest thing that works. OTOH I have another test case from the same suite that has namespaced nodes: Not just in the workbook, but even in the shared strings: And the actual sheet data: File uploaded here: Ekaterinburg_IP_9.xlsx. Note that it opens fine in LibreOffice and Excel. Not sure if the preference would be pervasive namespace-stripping, or properly resolving. I'd go with the latter but wary of performance implications given it'd be needed for the cell data also. |
Namespaced element names are already handled using the same So, since that seems to be the preferred method in the existing codebase let's go with that. Thanks for the diligence. |
Previously accepted only "r:" and "relationships:" prefixes. Closes tafia#634
55871a8 to
7fa6766
Compare
|
Ah that's great, I had missed that. Ok all the more reason to go with namespace stripping then. I've updated based on that suggestion. Thanks! |
|
Thanks for that. That looks good. You managed to maintain the I will leave this open until the weekend to see if there are any other comments and I will merge it then. |
|
I use calamine in my project - qsv - where performance is the top goal. The last release saw a big performance improvement - from 11.251 seconds to 8.725 seconds to export a million row sample of NYC's 311 data. See https://qsv.dathere.com/benchmarks and filter for Hopefully, we don't get a perf regression with this change, but it LGTM after a quick pass. On a related matter, it'd be useful to have a test suite for calamine with real-world Excel and ODS files. In that way, we can track both accuracy and performance over releases. |
This change won't affect performance. It is only on the
That is almost all down to @alexander-beedie's work. Could you take a look at #621 which offers even more performance gains but contains a more fundamental change to the internal processing. |
Test spreadsheet comes from the R readxl package test suite. Returns a parse error without this change.
Currently uses some common hard-coded namespace prefixes (
r:andrelationships:) for parsing relationships. This changes it to match on the namespace URL.Rather than change all XML parsing to be namespace-aware (which would likely have performance implications?), this only touches the workbook parsing.