Conversation
Prawn does not perform OpenType text shaping, so Arabic characters render as disconnected isolated glyphs. This adds a shaping module that converts Arabic characters to their Unicode Presentation Forms (initial/medial/final/isolated) before Prawn renders them. Features: - All standard Arabic letters (U+0621-U+064A) - Extended Arabic (Farsi, Urdu: Peh, Tcheh, Gaf, etc.) - Lam-Alef mandatory ligatures - Diacritical marks preservation - Intercepts all Prawn text methods via document singleton See also: prawnpdf/prawn#1392 for upstream PR
|
|
||
| if arabic_letter?(cp) | ||
| # Collect a run of Arabic characters (letters + marks) | ||
| run_start = i |
There was a problem hiding this comment.
Pull request overview
This PR adds an internal Arabic shaping layer to the PDF export pipeline so that Arabic (and extended Arabic) text renders with correct contextual forms in Prawn-generated PDFs.
Changes:
- Add
Exports::PDF::Common::ArabicShapingto shape Arabic characters into Unicode Presentation Forms and perform a simplified RTL reordering. - Prepend interceptors onto the Prawn document instance in
Exports::PDF::Common::Viewto run shaping for all major text entry points (text,draw_text,formatted_text,text_box,formatted_text_box,make_cell). - Add specs to cover the shaping module and its integration with the PDF view.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
app/models/exports/pdf/common/arabic_shaping.rb |
Implements shaping + RTL reordering logic for Arabic text destined for Prawn. |
app/models/exports/pdf/common/view.rb |
Hooks shaping into PDF rendering by prepending a module on the document instance. |
spec/models/exports/pdf/common/arabic_shaping_spec.rb |
Adds unit specs for Arabic shaping behavior. |
spec/models/exports/pdf/common/view_arabic_spec.rb |
Adds integration-level coverage for shaping being applied to common Prawn text methods. |
| require 'spec_helper' | ||
|
|
||
| RSpec.describe Exports::PDF::Common::View do | ||
| describe 'Arabic text shaping integration' do | ||
| let(:view) { described_class.new(:en) } | ||
|
|
||
| it 'applies Arabic shaping to text method' do | ||
| pdf = view.document | ||
| # text method should accept Arabic without error | ||
| expect { pdf.text('مرحبا بالعالم') }.not_to raise_error | ||
| end | ||
|
|
||
| it 'applies Arabic shaping to formatted_text method' do | ||
| pdf = view.document | ||
| expect { pdf.formatted_text([{ text: 'مرحبا' }]) }.not_to raise_error | ||
| end | ||
|
|
||
| it 'applies Arabic shaping to make_cell method' do | ||
| pdf = view.document | ||
| expect { pdf.make_cell('مرحبا', {}) }.not_to raise_error |
There was a problem hiding this comment.
These new specs don’t follow the repository’s Ruby/RuboCop conventions (missing # frozen_string_literal: true header and using single-quoted strings). This will likely fail linting; align the header and string literal quoting with existing specs in this directory (e.g., spec/models/exports/pdf/common/view_spec.rb).
| require 'spec_helper' | |
| RSpec.describe Exports::PDF::Common::View do | |
| describe 'Arabic text shaping integration' do | |
| let(:view) { described_class.new(:en) } | |
| it 'applies Arabic shaping to text method' do | |
| pdf = view.document | |
| # text method should accept Arabic without error | |
| expect { pdf.text('مرحبا بالعالم') }.not_to raise_error | |
| end | |
| it 'applies Arabic shaping to formatted_text method' do | |
| pdf = view.document | |
| expect { pdf.formatted_text([{ text: 'مرحبا' }]) }.not_to raise_error | |
| end | |
| it 'applies Arabic shaping to make_cell method' do | |
| pdf = view.document | |
| expect { pdf.make_cell('مرحبا', {}) }.not_to raise_error | |
| # frozen_string_literal: true | |
| require "spec_helper" | |
| RSpec.describe Exports::PDF::Common::View do | |
| describe "Arabic text shaping integration" do | |
| let(:view) { described_class.new(:en) } | |
| it "applies Arabic shaping to text method" do | |
| pdf = view.document | |
| # text method should accept Arabic without error | |
| expect { pdf.text("مرحبا بالعالم") }.not_to raise_error | |
| end | |
| it "applies Arabic shaping to formatted_text method" do | |
| pdf = view.document | |
| expect { pdf.formatted_text([{ text: "مرحبا" }]) }.not_to raise_error | |
| end | |
| it "applies Arabic shaping to make_cell method" do | |
| pdf = view.document | |
| expect { pdf.make_cell("مرحبا", {}) }.not_to raise_error |
| # text method should accept Arabic without error | ||
| expect { pdf.text('مرحبا بالعالم') }.not_to raise_error | ||
| end | ||
|
|
||
| it 'applies Arabic shaping to formatted_text method' do | ||
| pdf = view.document | ||
| expect { pdf.formatted_text([{ text: 'مرحبا' }]) }.not_to raise_error | ||
| end | ||
|
|
||
| it 'applies Arabic shaping to make_cell method' do | ||
| pdf = view.document | ||
| expect { pdf.make_cell('مرحبا', {}) }.not_to raise_error |
There was a problem hiding this comment.
These examples only assert that Prawn methods don’t raise when given Arabic text; they don’t verify that shaping is actually applied. This test would still pass if apply_arabic_shaping! were removed. Consider asserting that Exports::PDF::Common::ArabicShaping.process is invoked (e.g., via a spy) or that the returned text passed into Prawn is transformed.
| # text method should accept Arabic without error | |
| expect { pdf.text('مرحبا بالعالم') }.not_to raise_error | |
| end | |
| it 'applies Arabic shaping to formatted_text method' do | |
| pdf = view.document | |
| expect { pdf.formatted_text([{ text: 'مرحبا' }]) }.not_to raise_error | |
| end | |
| it 'applies Arabic shaping to make_cell method' do | |
| pdf = view.document | |
| expect { pdf.make_cell('مرحبا', {}) }.not_to raise_error | |
| expect(Exports::PDF::Common::ArabicShaping).to receive(:process).with('مرحبا بالعالم').and_call_original | |
| pdf.text('مرحبا بالعالم') | |
| end | |
| it 'applies Arabic shaping to formatted_text method' do | |
| pdf = view.document | |
| expect(Exports::PDF::Common::ArabicShaping).to receive(:process).with('مرحبا').and_call_original | |
| pdf.formatted_text([{ text: 'مرحبا' }]) | |
| end | |
| it 'applies Arabic shaping to make_cell method' do | |
| pdf = view.document | |
| expect(Exports::PDF::Common::ArabicShaping).to receive(:process).with('مرحبا').and_call_original | |
| pdf.make_cell('مرحبا', {}) |
| require 'spec_helper' | ||
|
|
||
| RSpec.describe Exports::PDF::Common::ArabicShaping do | ||
| describe '.contains_arabic?' do | ||
| it 'returns true for Arabic text' do | ||
| expect(described_class.contains_arabic?('مرحبا')).to be true | ||
| end | ||
|
|
||
| it 'returns false for Latin text' do | ||
| expect(described_class.contains_arabic?('Hello')).to be false | ||
| end | ||
|
|
||
| it 'returns true for mixed text' do | ||
| expect(described_class.contains_arabic?('Hello مرحبا')).to be true | ||
| end | ||
| end | ||
|
|
||
| describe '.shape' do | ||
| it 'converts Arabic to presentation forms' do | ||
| shaped = described_class.shape('مرحبا') | ||
| shaped.codepoints.each do |cp| | ||
| expect(cp >= 0xFE70 && cp <= 0xFEFF || cp >= 0xFB50 && cp <= 0xFDFF).to be true | ||
| end | ||
| end | ||
|
|
||
| it 'creates Lam-Alef ligatures' do | ||
| shaped = described_class.shape('لا') | ||
| expect(shaped.codepoints).to eq([0xFEFB]) | ||
| end | ||
|
|
||
| it 'preserves diacritical marks' do | ||
| shaped = described_class.shape("بِسْمِ") | ||
| marks = shaped.codepoints.select { |cp| (0x064B..0x065F).cover?(cp) } | ||
| expect(marks).not_to be_empty | ||
| end | ||
|
|
||
| it 'passes through non-Arabic text' do | ||
| expect(described_class.shape('Hello')).to eq('Hello') | ||
| end | ||
|
|
||
| it 'handles nil and empty' do | ||
| expect(described_class.process(nil)).to be_nil | ||
| expect(described_class.process('')).to eq('') | ||
| end | ||
|
|
||
| it 'preserves spaces in mixed text' do | ||
| shaped = described_class.shape('مرحبا بالعالم') | ||
| expect(shaped).to include(' ') | ||
| end | ||
| end | ||
|
|
||
| describe '.process' do | ||
| it 'shapes and reorders Arabic text' do | ||
| result = described_class.process('مرحبا') | ||
| expect(result).not_to eq('مرحبا') |
There was a problem hiding this comment.
These new specs don’t follow the repository’s Ruby/RuboCop conventions (missing # frozen_string_literal: true header and using single-quoted strings). This will likely fail linting; align with existing specs’ headers and quoting.
| require 'spec_helper' | |
| RSpec.describe Exports::PDF::Common::ArabicShaping do | |
| describe '.contains_arabic?' do | |
| it 'returns true for Arabic text' do | |
| expect(described_class.contains_arabic?('مرحبا')).to be true | |
| end | |
| it 'returns false for Latin text' do | |
| expect(described_class.contains_arabic?('Hello')).to be false | |
| end | |
| it 'returns true for mixed text' do | |
| expect(described_class.contains_arabic?('Hello مرحبا')).to be true | |
| end | |
| end | |
| describe '.shape' do | |
| it 'converts Arabic to presentation forms' do | |
| shaped = described_class.shape('مرحبا') | |
| shaped.codepoints.each do |cp| | |
| expect(cp >= 0xFE70 && cp <= 0xFEFF || cp >= 0xFB50 && cp <= 0xFDFF).to be true | |
| end | |
| end | |
| it 'creates Lam-Alef ligatures' do | |
| shaped = described_class.shape('لا') | |
| expect(shaped.codepoints).to eq([0xFEFB]) | |
| end | |
| it 'preserves diacritical marks' do | |
| shaped = described_class.shape("بِسْمِ") | |
| marks = shaped.codepoints.select { |cp| (0x064B..0x065F).cover?(cp) } | |
| expect(marks).not_to be_empty | |
| end | |
| it 'passes through non-Arabic text' do | |
| expect(described_class.shape('Hello')).to eq('Hello') | |
| end | |
| it 'handles nil and empty' do | |
| expect(described_class.process(nil)).to be_nil | |
| expect(described_class.process('')).to eq('') | |
| end | |
| it 'preserves spaces in mixed text' do | |
| shaped = described_class.shape('مرحبا بالعالم') | |
| expect(shaped).to include(' ') | |
| end | |
| end | |
| describe '.process' do | |
| it 'shapes and reorders Arabic text' do | |
| result = described_class.process('مرحبا') | |
| expect(result).not_to eq('مرحبا') | |
| # frozen_string_literal: true | |
| require "spec_helper" | |
| RSpec.describe Exports::PDF::Common::ArabicShaping do | |
| describe ".contains_arabic?" do | |
| it "returns true for Arabic text" do | |
| expect(described_class.contains_arabic?("مرحبا")).to be true | |
| end | |
| it "returns false for Latin text" do | |
| expect(described_class.contains_arabic?("Hello")).to be false | |
| end | |
| it "returns true for mixed text" do | |
| expect(described_class.contains_arabic?("Hello مرحبا")).to be true | |
| end | |
| end | |
| describe ".shape" do | |
| it "converts Arabic to presentation forms" do | |
| shaped = described_class.shape("مرحبا") | |
| shaped.codepoints.each do |cp| | |
| expect(cp >= 0xFE70 && cp <= 0xFEFF || cp >= 0xFB50 && cp <= 0xFDFF).to be true | |
| end | |
| end | |
| it "creates Lam-Alef ligatures" do | |
| shaped = described_class.shape("لا") | |
| expect(shaped.codepoints).to eq([0xFEFB]) | |
| end | |
| it "preserves diacritical marks" do | |
| shaped = described_class.shape("بِسْمِ") | |
| marks = shaped.codepoints.select { |cp| (0x064B..0x065F).cover?(cp) } | |
| expect(marks).not_to be_empty | |
| end | |
| it "passes through non-Arabic text" do | |
| expect(described_class.shape("Hello")).to eq("Hello") | |
| end | |
| it "handles nil and empty" do | |
| expect(described_class.process(nil)).to be_nil | |
| expect(described_class.process("")).to eq("") | |
| end | |
| it "preserves spaces in mixed text" do | |
| shaped = described_class.shape("مرحبا بالعالم") | |
| expect(shaped).to include(" ") | |
| end | |
| end | |
| describe ".process" do | |
| it "shapes and reorders Arabic text" do | |
| result = described_class.process("مرحبا") | |
| expect(result).not_to eq("مرحبا") |
|
|
||
| if arabic_letter?(cp) | ||
| # Collect a run of Arabic characters (letters + marks) | ||
| run_start = i |
There was a problem hiding this comment.
run_start is assigned but never used. This will trigger RuboCop’s unused variable check and should be removed (or used if it was intended for something).
| run_start = i |
| segments.reverse_each do |segment| | ||
| if segment[:dir] == :rtl | ||
| # Reverse RTL characters (they were in logical order) | ||
| result.concat(segment[:cps].reverse) | ||
| else | ||
| # LTR text keeps its order | ||
| result.concat(segment[:cps]) | ||
| end |
There was a problem hiding this comment.
reorder_rtl_line reverses all codepoints in RTL segments. For combining marks (tashkeel), this changes the required order from base+mark to mark+base (e.g., "بِ" becomes kasra before beh), which can break rendering. Consider reversing RTL text by grapheme clusters (base + following combining marks) or otherwise ensuring marks remain after their base character when reversing.
| it 'preserves diacritical marks' do | ||
| shaped = described_class.shape("بِسْمِ") | ||
| marks = shaped.codepoints.select { |cp| (0x064B..0x065F).cover?(cp) } | ||
| expect(marks).not_to be_empty | ||
| end |
There was a problem hiding this comment.
The current spec for diacritics only checks that marks are present, but it wouldn’t catch the mark/base order regression introduced by reorder_rtl (marks can move before their base). Add an assertion that combining marks remain after the intended base character in the output of .process (e.g., for a short string like "بِ").
- Use push() instead of concat([]) for ARABIC_MARKS - Rename short param cp to codepoint (min 3 chars) - Use text.blank? instead of text.nil? || text.empty? - Extract collect_arabic_run to reduce AbcSize of shape method - Remove unused run_start variable - Remove unused RTL reordering code (direction handled by Prawn align) - Fix extra spacing in hash comment alignment
|
Hi. |
Summary
Exports::PDF::Common::ArabicShapingmodule that converts Arabic characters to Unicode Presentation Forms based on joining contexttext,draw_text,formatted_text,text_box,make_cell) on the PDF document instanceContext
Arabic is a cursive script where characters must connect based on position. Prawn uses ttfunk for font parsing but does not run the OpenType shaping engine, rendering all Arabic characters in isolated form — completely unreadable.
Test plan