Robo Khutt (خط)

Where Tradition Meets Technology العربي

Robo Khutt is an innovative project that leverages advanced diffusion models to generate high-quality Arabic calligraphy. By combining the artistry of traditional calligraphy with the power of artificial intelligence, RoboKhutt aims to provide beautifully rendered text that can be used in various applications.

The project is called "RoboKhutt" to reflect its focus on text rendering and calligraphy and AI Automation. The name combines "Robo" for AI / robotics and "Khutt" the Arabic (خط) word for calligraphy

Problem Statement

In traditional text rendering, we often encounter challenges with laying out text properly or finding suitable substitutions for certain characters or words. As a fun and ambitious solution, I propose training a diffusion model to become a calligrapher! However, recognizing the enormity of this task, I plan to take an incremental approach.

Approach

My approach involves training a model to render small sentences or even single words initially. At runtime, the rendered image will be used to feed the diffusion model. This approach allows us to start with simpler tasks and gradually scale up to more complex text generation.

Limitations

Image Dimensions: The generated images will be 512x128 pixels.
Maximum Word Length: Given the constraints of the image dimensions and the average character width (approximately 20 pixels), the model is designed to handle words up to 20-25 characters in length.
Character Adaptability: The model adapts to varying character widths and different calligraphy styles and fonts, ensuring optimal readability and aesthetic quality.

Additional Notes

I'm excited about the potential of this project to merge the elegance of Arabic calligraphy with the power of modern AI. Any feedback or contributions to further enhance RoboKhutt are welcome!

Features

Text-to-Image Generation: Convert Arabic text into stunning calligraphic images.
Customizable Styles: Supports various calligraphy styles and fonts.
Incremental Learning: Starts with individual characters and scales up to words and sentences.
High-Quality Output: Utilizes state-of-the-art diffusion models to ensure visually appealing results.

graph TD
  A[Root Directory] --> B[diffusion]
  A --> C[data_processing.py]
  A --> D[img_utilities.py]
  A --> E[lang_utilities.py]
  A --> F[model_inference.py]
  A --> G[model_training.py]
  B --> H[context_unet.py]
  B --> I[dataset.py]
  B --> J[diffusion_model.py]
  B --> K[model_blocks.py]
  B --> L[utilities.py]

UNet Implementation in ContextUnet

The UNet architecture in the ContextUnet implementation consists of multiple layers, forming a characteristic "U" shape. It starts with an initial convolution block, followed by three downsampling layers, a bottleneck layer, and then proceeds with four upsampling layers to restore the spatial dimensions of the input image. Embedding layers are also included to incorporate time and context information.

Detailed Breakdown of Layers and Steps

Initial Convolution: (Blue)
- init_conv: ResidualConvBlock (3, 64) - Applies initial convolution with residual connections to capture basic features.
Downsampling Path (Encoder): (Green)
- down1: UnetDown (64, 128) - Reduces spatial dimensions while increasing the feature depth to capture more complex features.
- down2: UnetDown (128, 256) - Further reduces spatial dimensions and increases feature depth for deeper feature extraction.
- down3: UnetDown (256, 512) - Deepest downsampling layer capturing the most complex and abstract features.
Bottleneck Layer:
- to_vec: AvgPool2d + GELU (512, 512) - Compresses the feature maps into a vector, maintaining critical information for upsampling.
Embedding Layers: (Red)
- timeembed1: EmbedFC (1, 512) - Embeds the time information into a 512-dimensional vector.
- timeembed2: EmbedFC (1, 256) - Further embeds the time information into a 256-dimensional vector.
- contextembed1: EmbedFC (n_cfeat, 512) - Embeds context features into a 512-dimensional vector.
- contextembed2: EmbedFC (n_cfeat, 256) - Further embeds context features into a 256-dimensional vector.
Upsampling Path (Decoder): (Yellow)
- up0: ConvTranspose2d + GroupNorm + ReLU (512) - Begins upsampling, reversing the compression while adding normalization and activation.
- up1: UnetUp (512, 256) - Upsamples and combines with corresponding encoder features to refine the image details.
- up2: UnetUp (256, 128) - Further upsampling and refinement, progressively increasing spatial dimensions.
- up3: UnetUp (128, 64) - Final upsampling layer, restoring the image to near original spatial dimensions.
Output Convolution: (Purple)
- out: Conv2d + GroupNorm + ReLU + Conv2d (128, 3) - Final convolution to produce the 3-channel RGB output, with normalization and activation for final adjustments.

Summary of the Hyperparameters in RoboKhutt Implementation

ContextUnet Hyperparameters

in_channels: Number of input channels in the images (typically 3 for RGB images).
n_feat: Number of features for the convolution layers (default is 64).
n_cfeat: Number of context features (default is 10).
height: Height of the input images (default is 128).

ResidualConvBlock Hyperparameters

in_channels: Number of input channels.
out_channels: Number of output channels.
is_res: Boolean indicating whether the block is a residual block (default is False).

UnetDown Hyperparameters

in_channels: Number of input channels.
out_channels: Number of output channels.

UnetUp Hyperparameters

in_channels: Number of input channels.
out_channels: Number of output channels.

EmbedFC Hyperparameters

input_dim: Dimensionality of the input.
emb_dim: Dimensionality of the embedding.

TextImageDataset Hyperparameters

alphabet: List of characters representing the alphabet.
max_length: Maximum length of the generated text strings.
font_name: Name of the font to be used for rendering text images.
font_size: Size of the font.
image_size: Size of the generated images (tuple of width and height).
is_arabic: Boolean indicating whether the text is Arabic (default is False).

Diffusion Model Training Hyperparameters

epochs: Number of training epochs.
device: Device used for training (CPU or GPU).
save_path: Path to save the trained model.
learning_rate: Learning rate for the optimizer (default is 1e-3).

Utilities Hyperparameters

render_text_image

text: Text to be rendered on the image.
image_size: Size of the image (tuple of width and height).
font_name: Name of the font.
font_size: Size of the font.
position: Position to align the text (e.g., 'top-left', 'center').
is_arabic: Boolean indicating whether the text is Arabic (default is False).

Project Structure

Getting Started

Prerequisites

Python 3.8+
PyTorch
Hugging Face Transformers
PIL (Pillow)
Other dependencies listed in requirements.txt

Installation

Clone the repository:

   git clone https://github.com/your_username/RoboKhutt.git
   cd RoboKhutt

Dependencies

python -m pip install --upgrade pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Create a virtual environment and activate it:

python -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`

Install the dependencies:

pip install -r requirements.txt

Running tests

python -m unittest discover
python -m unittest discover -s tests -p "test_lang_utilities.py"
python -m unittest discover -s tests -p "test_diffusion_utilities.py" > .generated\out.log 2>&1

python -m unittest tests.test_diffusion_evaluation.TestDiffusionEvaluation.test_evaluate_model

or

set PYTHONPATH=%CD%\src;%CD%
set PYTHONIOENCODING=utf-8
set LOGGING_LEVEL=DEBUG

set CUDA_HOME=C:\mltools\cuda_12_4
set CUDA_PATH=C:\mltools\cuda_12_4
set CUDA_PATH_V12_4=C:\mltools\cuda_12_4

pip install -e .

pytest

taskkill /F /IM python.exe /T

Usage

Data Preparation:

Collect a dataset of Arabic text and corresponding images. Use provided scripts in the scripts/ directory to preprocess the data.

Training:

Train the diffusion model on your dataset using:

Copy code
python src/model_training.py --data_dir path_to_your_data --output_dir path_to_save_model

Inference:

Generate calligraphic images from text using:

Copy code
python src/model_inference.py --input_text "Your Arabic Text Here" --model_dir

Name		Name	Last commit message	Last commit date
Latest commit History 56 Commits
docs		docs
models		models
notebooks		notebooks
scripts		scripts
src		src
tests		tests
.env.template		.env.template
.gitignore		.gitignore
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
LICENSE		LICENSE
README.md		README.md
README_ar.md		README_ar.md
__init__.py		__init__.py
logo.png		logo.png
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Robo Khutt (خط)

Problem Statement

Approach

Limitations

Additional Notes

Features

UNet Implementation in ContextUnet

Detailed Breakdown of Layers and Steps

Summary of the Hyperparameters in RoboKhutt Implementation

ContextUnet Hyperparameters

ResidualConvBlock Hyperparameters

UnetDown Hyperparameters

UnetUp Hyperparameters

EmbedFC Hyperparameters

TextImageDataset Hyperparameters

Diffusion Model Training Hyperparameters

Utilities Hyperparameters

render_text_image

Project Structure

Getting Started

Prerequisites

Installation

Usage

Data Preparation:

Training:

Inference:

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Robo Khutt (خط)

Problem Statement

Approach

Limitations

Additional Notes

Features

UNet Implementation in ContextUnet

Detailed Breakdown of Layers and Steps

Summary of the Hyperparameters in RoboKhutt Implementation

ContextUnet Hyperparameters

ResidualConvBlock Hyperparameters

UnetDown Hyperparameters

UnetUp Hyperparameters

EmbedFC Hyperparameters

TextImageDataset Hyperparameters

Diffusion Model Training Hyperparameters

Utilities Hyperparameters

render_text_image

Project Structure

Getting Started

Prerequisites

Installation

Usage

Data Preparation:

Training:

Inference:

About

Resources

License

Code of conduct

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages