Skip to content

Add Arabic text shaping for PDF export#22656

Closed
alghanim wants to merge 2 commits intoopf:devfrom
alghanim:feature/pdf-arabic-shaping
Closed

Add Arabic text shaping for PDF export#22656
alghanim wants to merge 2 commits intoopf:devfrom
alghanim:feature/pdf-arabic-shaping

Conversation

@alghanim
Copy link
Copy Markdown

@alghanim alghanim commented Apr 2, 2026

Summary

  • Add Exports::PDF::Common::ArabicShaping module that converts Arabic characters to Unicode Presentation Forms based on joining context
  • Intercept all Prawn text methods (text, draw_text, formatted_text, text_box, make_cell) on the PDF document instance
  • Supports all standard Arabic letters, extended Arabic (Farsi, Urdu), Lam-Alef ligatures, and diacritical marks
  • Zero performance impact on non-Arabic text (early return)
  • Upstream PR on Prawn: Add Arabic text shaping support prawnpdf/prawn#1392

Context

Arabic is a cursive script where characters must connect based on position. Prawn uses ttfunk for font parsing but does not run the OpenType shaping engine, rendering all Arabic characters in isolated form — completely unreadable.

Test plan

  • Export work package with Arabic title/description to PDF — characters connected
  • Export work package table with Arabic content — cells render correctly
  • Mixed Arabic/English text displays correctly
  • Arabic with diacritics renders correctly
  • Lam-Alef combinations form ligatures
  • Non-Arabic exports unaffected

Prawn does not perform OpenType text shaping, so Arabic characters
render as disconnected isolated glyphs. This adds a shaping module
that converts Arabic characters to their Unicode Presentation Forms
(initial/medial/final/isolated) before Prawn renders them.

Features:
- All standard Arabic letters (U+0621-U+064A)
- Extended Arabic (Farsi, Urdu: Peh, Tcheh, Gaf, etc.)
- Lam-Alef mandatory ligatures
- Diacritical marks preservation
- Intercepts all Prawn text methods via document singleton

See also: prawnpdf/prawn#1392 for upstream PR

if arabic_letter?(cp)
# Collect a run of Arabic characters (letters + marks)
run_start = i
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds an internal Arabic shaping layer to the PDF export pipeline so that Arabic (and extended Arabic) text renders with correct contextual forms in Prawn-generated PDFs.

Changes:

  • Add Exports::PDF::Common::ArabicShaping to shape Arabic characters into Unicode Presentation Forms and perform a simplified RTL reordering.
  • Prepend interceptors onto the Prawn document instance in Exports::PDF::Common::View to run shaping for all major text entry points (text, draw_text, formatted_text, text_box, formatted_text_box, make_cell).
  • Add specs to cover the shaping module and its integration with the PDF view.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

File Description
app/models/exports/pdf/common/arabic_shaping.rb Implements shaping + RTL reordering logic for Arabic text destined for Prawn.
app/models/exports/pdf/common/view.rb Hooks shaping into PDF rendering by prepending a module on the document instance.
spec/models/exports/pdf/common/arabic_shaping_spec.rb Adds unit specs for Arabic shaping behavior.
spec/models/exports/pdf/common/view_arabic_spec.rb Adds integration-level coverage for shaping being applied to common Prawn text methods.

Comment on lines +1 to +20
require 'spec_helper'

RSpec.describe Exports::PDF::Common::View do
describe 'Arabic text shaping integration' do
let(:view) { described_class.new(:en) }

it 'applies Arabic shaping to text method' do
pdf = view.document
# text method should accept Arabic without error
expect { pdf.text('مرحبا بالعالم') }.not_to raise_error
end

it 'applies Arabic shaping to formatted_text method' do
pdf = view.document
expect { pdf.formatted_text([{ text: 'مرحبا' }]) }.not_to raise_error
end

it 'applies Arabic shaping to make_cell method' do
pdf = view.document
expect { pdf.make_cell('مرحبا', {}) }.not_to raise_error
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These new specs don’t follow the repository’s Ruby/RuboCop conventions (missing # frozen_string_literal: true header and using single-quoted strings). This will likely fail linting; align the header and string literal quoting with existing specs in this directory (e.g., spec/models/exports/pdf/common/view_spec.rb).

Suggested change
require 'spec_helper'
RSpec.describe Exports::PDF::Common::View do
describe 'Arabic text shaping integration' do
let(:view) { described_class.new(:en) }
it 'applies Arabic shaping to text method' do
pdf = view.document
# text method should accept Arabic without error
expect { pdf.text('مرحبا بالعالم') }.not_to raise_error
end
it 'applies Arabic shaping to formatted_text method' do
pdf = view.document
expect { pdf.formatted_text([{ text: 'مرحبا' }]) }.not_to raise_error
end
it 'applies Arabic shaping to make_cell method' do
pdf = view.document
expect { pdf.make_cell('مرحبا', {}) }.not_to raise_error
# frozen_string_literal: true
require "spec_helper"
RSpec.describe Exports::PDF::Common::View do
describe "Arabic text shaping integration" do
let(:view) { described_class.new(:en) }
it "applies Arabic shaping to text method" do
pdf = view.document
# text method should accept Arabic without error
expect { pdf.text("مرحبا بالعالم") }.not_to raise_error
end
it "applies Arabic shaping to formatted_text method" do
pdf = view.document
expect { pdf.formatted_text([{ text: "مرحبا" }]) }.not_to raise_error
end
it "applies Arabic shaping to make_cell method" do
pdf = view.document
expect { pdf.make_cell("مرحبا", {}) }.not_to raise_error

Copilot uses AI. Check for mistakes.
Comment on lines +9 to +20
# text method should accept Arabic without error
expect { pdf.text('مرحبا بالعالم') }.not_to raise_error
end

it 'applies Arabic shaping to formatted_text method' do
pdf = view.document
expect { pdf.formatted_text([{ text: 'مرحبا' }]) }.not_to raise_error
end

it 'applies Arabic shaping to make_cell method' do
pdf = view.document
expect { pdf.make_cell('مرحبا', {}) }.not_to raise_error
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These examples only assert that Prawn methods don’t raise when given Arabic text; they don’t verify that shaping is actually applied. This test would still pass if apply_arabic_shaping! were removed. Consider asserting that Exports::PDF::Common::ArabicShaping.process is invoked (e.g., via a spy) or that the returned text passed into Prawn is transformed.

Suggested change
# text method should accept Arabic without error
expect { pdf.text('مرحبا بالعالم') }.not_to raise_error
end
it 'applies Arabic shaping to formatted_text method' do
pdf = view.document
expect { pdf.formatted_text([{ text: 'مرحبا' }]) }.not_to raise_error
end
it 'applies Arabic shaping to make_cell method' do
pdf = view.document
expect { pdf.make_cell('مرحبا', {}) }.not_to raise_error
expect(Exports::PDF::Common::ArabicShaping).to receive(:process).with('مرحبا بالعالم').and_call_original
pdf.text('مرحبا بالعالم')
end
it 'applies Arabic shaping to formatted_text method' do
pdf = view.document
expect(Exports::PDF::Common::ArabicShaping).to receive(:process).with('مرحبا').and_call_original
pdf.formatted_text([{ text: 'مرحبا' }])
end
it 'applies Arabic shaping to make_cell method' do
pdf = view.document
expect(Exports::PDF::Common::ArabicShaping).to receive(:process).with('مرحبا').and_call_original
pdf.make_cell('مرحبا', {})

Copilot uses AI. Check for mistakes.
Comment on lines +1 to +55
require 'spec_helper'

RSpec.describe Exports::PDF::Common::ArabicShaping do
describe '.contains_arabic?' do
it 'returns true for Arabic text' do
expect(described_class.contains_arabic?('مرحبا')).to be true
end

it 'returns false for Latin text' do
expect(described_class.contains_arabic?('Hello')).to be false
end

it 'returns true for mixed text' do
expect(described_class.contains_arabic?('Hello مرحبا')).to be true
end
end

describe '.shape' do
it 'converts Arabic to presentation forms' do
shaped = described_class.shape('مرحبا')
shaped.codepoints.each do |cp|
expect(cp >= 0xFE70 && cp <= 0xFEFF || cp >= 0xFB50 && cp <= 0xFDFF).to be true
end
end

it 'creates Lam-Alef ligatures' do
shaped = described_class.shape('لا')
expect(shaped.codepoints).to eq([0xFEFB])
end

it 'preserves diacritical marks' do
shaped = described_class.shape("بِسْمِ")
marks = shaped.codepoints.select { |cp| (0x064B..0x065F).cover?(cp) }
expect(marks).not_to be_empty
end

it 'passes through non-Arabic text' do
expect(described_class.shape('Hello')).to eq('Hello')
end

it 'handles nil and empty' do
expect(described_class.process(nil)).to be_nil
expect(described_class.process('')).to eq('')
end

it 'preserves spaces in mixed text' do
shaped = described_class.shape('مرحبا بالعالم')
expect(shaped).to include(' ')
end
end

describe '.process' do
it 'shapes and reorders Arabic text' do
result = described_class.process('مرحبا')
expect(result).not_to eq('مرحبا')
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These new specs don’t follow the repository’s Ruby/RuboCop conventions (missing # frozen_string_literal: true header and using single-quoted strings). This will likely fail linting; align with existing specs’ headers and quoting.

Suggested change
require 'spec_helper'
RSpec.describe Exports::PDF::Common::ArabicShaping do
describe '.contains_arabic?' do
it 'returns true for Arabic text' do
expect(described_class.contains_arabic?('مرحبا')).to be true
end
it 'returns false for Latin text' do
expect(described_class.contains_arabic?('Hello')).to be false
end
it 'returns true for mixed text' do
expect(described_class.contains_arabic?('Hello مرحبا')).to be true
end
end
describe '.shape' do
it 'converts Arabic to presentation forms' do
shaped = described_class.shape('مرحبا')
shaped.codepoints.each do |cp|
expect(cp >= 0xFE70 && cp <= 0xFEFF || cp >= 0xFB50 && cp <= 0xFDFF).to be true
end
end
it 'creates Lam-Alef ligatures' do
shaped = described_class.shape('لا')
expect(shaped.codepoints).to eq([0xFEFB])
end
it 'preserves diacritical marks' do
shaped = described_class.shape("بِسْمِ")
marks = shaped.codepoints.select { |cp| (0x064B..0x065F).cover?(cp) }
expect(marks).not_to be_empty
end
it 'passes through non-Arabic text' do
expect(described_class.shape('Hello')).to eq('Hello')
end
it 'handles nil and empty' do
expect(described_class.process(nil)).to be_nil
expect(described_class.process('')).to eq('')
end
it 'preserves spaces in mixed text' do
shaped = described_class.shape('مرحبا بالعالم')
expect(shaped).to include(' ')
end
end
describe '.process' do
it 'shapes and reorders Arabic text' do
result = described_class.process('مرحبا')
expect(result).not_to eq('مرحبا')
# frozen_string_literal: true
require "spec_helper"
RSpec.describe Exports::PDF::Common::ArabicShaping do
describe ".contains_arabic?" do
it "returns true for Arabic text" do
expect(described_class.contains_arabic?("مرحبا")).to be true
end
it "returns false for Latin text" do
expect(described_class.contains_arabic?("Hello")).to be false
end
it "returns true for mixed text" do
expect(described_class.contains_arabic?("Hello مرحبا")).to be true
end
end
describe ".shape" do
it "converts Arabic to presentation forms" do
shaped = described_class.shape("مرحبا")
shaped.codepoints.each do |cp|
expect(cp >= 0xFE70 && cp <= 0xFEFF || cp >= 0xFB50 && cp <= 0xFDFF).to be true
end
end
it "creates Lam-Alef ligatures" do
shaped = described_class.shape("لا")
expect(shaped.codepoints).to eq([0xFEFB])
end
it "preserves diacritical marks" do
shaped = described_class.shape("بِسْمِ")
marks = shaped.codepoints.select { |cp| (0x064B..0x065F).cover?(cp) }
expect(marks).not_to be_empty
end
it "passes through non-Arabic text" do
expect(described_class.shape("Hello")).to eq("Hello")
end
it "handles nil and empty" do
expect(described_class.process(nil)).to be_nil
expect(described_class.process("")).to eq("")
end
it "preserves spaces in mixed text" do
shaped = described_class.shape("مرحبا بالعالم")
expect(shaped).to include(" ")
end
end
describe ".process" do
it "shapes and reorders Arabic text" do
result = described_class.process("مرحبا")
expect(result).not_to eq("مرحبا")

Copilot uses AI. Check for mistakes.

if arabic_letter?(cp)
# Collect a run of Arabic characters (letters + marks)
run_start = i
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

run_start is assigned but never used. This will trigger RuboCop’s unused variable check and should be removed (or used if it was intended for something).

Suggested change
run_start = i

Copilot uses AI. Check for mistakes.
Comment on lines +369 to +376
segments.reverse_each do |segment|
if segment[:dir] == :rtl
# Reverse RTL characters (they were in logical order)
result.concat(segment[:cps].reverse)
else
# LTR text keeps its order
result.concat(segment[:cps])
end
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reorder_rtl_line reverses all codepoints in RTL segments. For combining marks (tashkeel), this changes the required order from base+mark to mark+base (e.g., "بِ" becomes kasra before beh), which can break rendering. Consider reversing RTL text by grapheme clusters (base + following combining marks) or otherwise ensuring marks remain after their base character when reversing.

Copilot uses AI. Check for mistakes.
Comment on lines +31 to +35
it 'preserves diacritical marks' do
shaped = described_class.shape("بِسْمِ")
marks = shaped.codepoints.select { |cp| (0x064B..0x065F).cover?(cp) }
expect(marks).not_to be_empty
end
Copy link

Copilot AI Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current spec for diacritics only checks that marks are present, but it wouldn’t catch the mark/base order regression introduced by reorder_rtl (marks can move before their base). Add an assertion that combining marks remain after the intended base character in the output of .process (e.g., for a short string like "بِ").

Copilot uses AI. Check for mistakes.
- Use push() instead of concat([]) for ARABIC_MARKS
- Rename short param cp to codepoint (min 3 chars)
- Use text.blank? instead of text.nil? || text.empty?
- Extract collect_arabic_run to reduce AbcSize of shape method
- Remove unused run_start variable
- Remove unused RTL reordering code (direction handled by Prawn align)
- Fix extra spacing in hash comment alignment
@as-op as-op requested review from myabc and removed request for as-op April 4, 2026 09:52
@as-op
Copy link
Copy Markdown
Contributor

as-op commented Apr 7, 2026

Hi.
This is again the monkey-patching of text shaping that should not be done on the application level and also would also be a duplication by your PR to prawn. Instead of hacking the shaping support in, there is a better way mentioned as a comment in your PR.

@as-op as-op closed this Apr 7, 2026
@github-actions github-actions bot locked and limited conversation to collaborators Apr 7, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Development

Successfully merging this pull request may close these issues.

5 participants