Skip to content

freelawproject/kent

Repository files navigation

Jkent

Alpha — Jkent is in early-stage development. APIs and database schemas may change without notice.

Jkent is a scraper-driver framework for structured web scraping. It separates parsing logic (scrapers) from I/O orchestration (drivers): a scraper is a collection of pure parsing steps that yield data and follow-on requests, while the driver handles HTTP, file storage, rate limiting, and persistence.

Installation

The base package is the scraper SDK — everything needed to write scrapers (data types, decorators, parsing helpers), with no driver machinery:

pip install jkent

To actually run scrapers, install the operational extra (database engine, HTTP/browser transports):

pip install "jkent[operational]"
playwright install chromium   # only for browser-transport scrapers

Writing a scraper

Scrapers subclass BaseScraper, declare entry points with @entry, and parse pages in @step methods that yield ParsedData and further Requests:

from collections.abc import Generator

from jkent.common.decorators import entry, step
from jkent.data_types import (
    BaseScraper,
    HttpMethod,
    HTTPRequestParams,
    ParsedData,
    Request,
)


class MyCourtScraper(BaseScraper[dict]):
    @entry(dict)
    def get_entry(self) -> Generator[Request, None, None]:
        yield Request(
            request=HTTPRequestParams(
                method=HttpMethod.GET,
                url="https://court.example.com/cases",
            ),
            continuation="parse_list",
        )

    @step
    def parse_list(self, lxml_tree) -> Generator[Request, None, None]:
        for href in lxml_tree.checked_xpath("//a[@class='case']/@href", "case links"):
            yield Request(
                request=HTTPRequestParams(method=HttpMethod.GET, url=href),
                continuation="parse_detail",
            )

    @step
    def parse_detail(self, lxml_tree) -> Generator[ParsedData, None, None]:
        yield ParsedData(data={...})

Running a scraper

RunBootstrapper reads the scraper's driver_requirements, selects and stitches the right transport/engine/storage, and drives a resumable run backed by a SQLite database:

from pathlib import Path

from jkent.driver.unified_driver import RunBootstrapper

async with RunBootstrapper(
    MyCourtScraper(), db_path=Path("run.db")
) as run:
    await run.run()

Development

Jkent uses uv for dependency management:

uv sync
uv run pytest -n auto
uvx pre-commit run --all-files

License

BSD 2-Clause. See LICENSE.

About

A scraper-driver framework for web scraping court sites

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages