corrupted data when generating a searchable pdf with hocr-pdf #186

Description

@pprw

I am trying to generate a searchable PDF from a JPEG file and an hOCR file using hocr-pdf.

Both files are in the same folder. Running `hocr-pdf . > out.pdf` generates a PDF, but I cannot search inside it.

The PDF reader (Evince) reports "some font thing failed" when displaying the file, although the image itself is visible.

When I extract the text from the PDF, pdfminer prints a warning:

$ pdf2txt out.pdf -o out.txt
WARNING:pdfminer.pdftypes:Data-loss while decompressing corrupted data

and out.txt contains only unmapped glyph placeholders (excerpt):

(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0) (cid:0)

(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0) (cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)

(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)(cid:0) (cid:0)(cid:0)(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)(cid:0)
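To put a number on how much of the extracted text is unreadable, the placeholders can be counted. This is only a minimal sketch, assuming the `(cid:N)` notation shown in the excerpt above; `cid_stats` is a hypothetical helper and not part of pdfminer or hocr-tools.

```python
import re

# Matches the "(cid:N)" placeholders seen in the pdf2txt output above.
CID_RE = re.compile(r"\(cid:\d+\)")


def cid_stats(text):
    """Return (placeholder_count, readable_char_count) for extracted text.

    placeholder_count: number of "(cid:N)" tokens.
    readable_char_count: non-whitespace characters left once the
    placeholders are removed.
    """
    placeholders = len(CID_RE.findall(text))
    remainder = CID_RE.sub("", text)
    readable = sum(1 for ch in remainder if not ch.isspace())
    return placeholders, readable


if __name__ == "__main__":
    sample = "(cid:0)(cid:0)\n\n(cid:0)(cid:0)(cid:0)\n"
    placeholders, readable = cid_stats(sample)
    print(f"{placeholders} placeholders, {readable} readable characters")
```

On out.txt, zero readable characters would mean that none of the text layer could be mapped back to characters.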

My hOCR file was generated by kraken.

The kraken documentation says:

hOCR output is slightly different from hOCR files produced by ocropus. Each ocr_line span contains not only the bounding box of the line but also character boxes (x_bboxes attribute) indicating the coordinates of each character. In each line alternating sequences of alphanumeric and non-alphanumeric (in the unicode sense) characters are put into ocrx_word spans. Both have bounding boxes as attributes and the recognition confidence for each character in the x_conf attribute.

Paragraph detection has been removed as it was deemed to be unduly dependent on certain typographic features which may not be valid for your input.
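To check that the hOCR input itself carries the expected text before it reaches hocr-pdf, the ocrx_word spans described in the quoted documentation can be listed with the Python standard library. This is a rough sketch, not how hocr-pdf reads the file; `WordCollector` and `extract_words` are hypothetical names, and the sample markup in the usage part is made up for illustration rather than taken from real kraken output.

```python
import re
from html.parser import HTMLParser

# hOCR stores the bounding box in the title attribute: "bbox x0 y0 x1 y1; ..."
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class WordCollector(HTMLParser):
    """Collect (text, bbox) pairs for every span with class ocrx_word."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._depth = 0    # span nesting depth inside the current word
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._depth:            # nested span inside a word
            self._depth += 1
            return
        attributes = dict(attrs)
        if "ocrx_word" in (attributes.get("class") or "").split():
            match = BBOX_RE.search(attributes.get("title") or "")
            self._bbox = tuple(int(g) for g in match.groups()) if match else None
            self._text = []
            self._depth = 1

    def handle_data(self, data):
        if self._depth:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1
            if self._depth == 0:   # the word span just closed
                self.words.append(("".join(self._text), self._bbox))


def extract_words(hocr):
    """Return a list of (text, bbox) tuples found in an hOCR string."""
    collector = WordCollector()
    collector.feed(hocr)
    collector.close()
    return collector.words


if __name__ == "__main__":
    # Illustrative markup only, not real kraken output.
    sample = (
        '<span class="ocr_line" title="bbox 10 20 300 60">'
        '<span class="ocrx_word" title="bbox 10 20 100 60">Hello</span>'
        '<span class="ocrx_word" title="bbox 110 20 120 60">, </span>'
        "</span>"
    )
    for text, bbox in extract_words(sample):
        print(repr(text), bbox)
```

If the words come out intact here, the hOCR file is probably fine and the problem is more likely in how the PDF text layer or its font is written.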

So I also tried with an ALTO file (also generated by kraken), which I converted to hOCR with ocr-fileformat. The result is the same.
