feat: add resilient background job queue with retry and monitoring #913

Open
tarai-dl wants to merge 1 commit into rohitdash08:main from tarai-dl:rn/resilient-job-queue
Conversation

@tarai-dl

Summary

Implements a production-ready background job queue system with retry logic and monitoring.

Changes

  • Redis-backed job queue with priority support (lower number = higher priority)
  • Exponential backoff retry with configurable max retries (default: 5)
  • Dead-letter queue for permanently failed jobs
  • Job status tracking in PostgreSQL
  • Distributed locking to prevent duplicate execution
  • APScheduler integration for automatic job processing every 5 seconds
  • REST API for job monitoring and management (/api/jobs/*)
  • Prometheus metrics for queue depth and success/failure rates
  • Tests for queue operations, worker, and retry logic
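The retry behaviour listed above can be illustrated with a minimal sketch. The PR description does not state the base delay, a delay cap, or whether jitter is applied, so `base_delay`, `max_delay`, and the function names here are assumptions for illustration only. Only the default of 5 retries comes from the description.

```python
from typing import Optional

MAX_RETRIES = 5  # default stated in the PR description


def backoff_delay(attempt: int, base_delay: float = 1.0, max_delay: float = 300.0) -> float:
    """Seconds to wait before retry number `attempt` (1-based).

    base_delay and max_delay are hypothetical values, not taken from the PR.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    # Double the delay on every retry, but never exceed the cap.
    return min(base_delay * 2 ** (attempt - 1), max_delay)


def next_action(retries_used: int, max_retries: int = MAX_RETRIES) -> Optional[float]:
    """Return the delay before the next retry, or None once retries are exhausted.

    A None result is the point at which a job would move to the dead-letter queue.
    """
    if retries_used >= max_retries:
        return None
    return backoff_delay(retries_used + 1)
```

With these assumed defaults, the five retries would wait 1, 2, 4, 8, and 16 seconds. A real worker might also add random jitter so that many failing jobs do not retry at the same instant.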

API Endpoints

  • GET /api/jobs — List jobs with filters
  • GET /api/jobs/<id> — Get job status
  • POST /api/jobs/<id>/cancel — Cancel pending job
  • POST /api/jobs/<id>/retry — Retry dead-lettered job
  • GET /api/jobs/stats — Queue statistics
  • POST /api/jobs/enqueue — Manually enqueue a job
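As a rough illustration of how a client might address these endpoints, the sketch below builds (but does not send) the corresponding HTTP requests with the standard library. The base URL, the `status` filter name, and the enqueue payload fields are assumptions; the PR description lists only the paths and methods, and it does not say whether authentication is required.

```python
import json
from typing import Any, Dict, Optional
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://localhost:5000"  # hypothetical host and port


def list_jobs(filters: Optional[Dict[str, str]] = None) -> Request:
    """GET /api/jobs, with optional query-string filters (filter names are assumed)."""
    query = f"?{urlencode(filters)}" if filters else ""
    return Request(f"{BASE_URL}/api/jobs{query}", method="GET")


def job_action(job_id: str, action: str) -> Request:
    """POST /api/jobs/<id>/cancel or POST /api/jobs/<id>/retry."""
    if action not in ("cancel", "retry"):
        raise ValueError("action must be 'cancel' or 'retry'")
    return Request(f"{BASE_URL}/api/jobs/{job_id}/{action}", method="POST")


def enqueue(payload: Dict[str, Any]) -> Request:
    """POST /api/jobs/enqueue with a JSON body (the payload schema is assumed)."""
    body = json.dumps(payload).encode("utf-8")
    return Request(
        f"{BASE_URL}/api/jobs/enqueue",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing any of these `Request` objects to `urllib.request.urlopen` would send it to a running backend.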

Files Added

  • packages/backend/app/services/job_queue.py — Core queue logic
  • packages/backend/app/services/job_worker.py — Worker with retry
  • packages/backend/app/routes/jobs.py — API endpoints
  • packages/backend/tests/test_jobs.py — Tests
  • docs/job-queue.md — Documentation

Files Modified

  • packages/backend/app/routes/__init__.py — Register jobs blueprint
  • packages/backend/app/__init__.py — Start APScheduler worker

Acceptance Criteria

  • ✅ Production-ready implementation
  • ✅ Includes tests
  • ✅ Documentation updated

Closes #130

Implement a production-ready background job system:

- Redis-backed job queue with priority support
- Exponential backoff retry (configurable max retries, default 5)
- Dead-letter queue for permanently failed jobs
- Job status tracking in PostgreSQL
- Distributed locking to prevent duplicate execution
- APScheduler integration for automatic job processing
- REST API for job monitoring and management
- Prometheus metrics integration
- Tests for queue, worker, and retry logic
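To make the interaction between priority ordering, retries, and the dead-letter queue concrete, here is a small in-memory model. It is not the code in this PR: it uses `heapq` as a stand-in for Redis, retries immediately instead of waiting for a backoff delay, and reads "max retries" as the number of re-executions allowed after the first attempt, which is an assumption. All class and method names are hypothetical.

```python
import heapq
import itertools
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Job:
    name: str
    priority: int  # lower number = higher priority, as in the PR description
    attempts: int = 0  # number of failed executions so far


class InMemoryQueue:
    """Toy stand-in for a Redis-backed queue; illustration only."""

    def __init__(self, max_retries: int = 5) -> None:
        self.max_retries = max_retries
        self._heap: List[Tuple[int, int, Job]] = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order within a priority
        self.dead_letter: List[Job] = []

    def enqueue(self, job: Job) -> None:
        heapq.heappush(self._heap, (job.priority, next(self._seq), job))

    def process_next(self, handler: Callable[[Job], None]) -> bool:
        """Run the highest-priority job; return False if the queue is empty."""
        if not self._heap:
            return False
        _, _, job = heapq.heappop(self._heap)
        try:
            handler(job)
        except Exception:
            job.attempts += 1
            if job.attempts > self.max_retries:
                self.dead_letter.append(job)  # retries exhausted
            else:
                self.enqueue(job)  # a real worker would delay this re-enqueue
        return True
```

In this model a job with `max_retries=2` runs at most three times before it lands in `dead_letter`.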

Closes rohitdash08#130
@tarai-dl tarai-dl requested a review from rohitdash08 as a code owner April 17, 2026 18:21
Development

Successfully merging this pull request may close these issues.

Resilient background job retry & monitoring
