diff --git a/.dockerignore b/.dockerignore new file mode 100644 index 000000000..d1144a248 --- /dev/null +++ b/.dockerignore @@ -0,0 +1,54 @@ +# Python +__pycache__/ +*.py[cod] +*$py.class +*.so +.Python +*.egg-info/ +dist/ +build/ +*.egg +.pytest_cache/ +.mypy_cache/ +.ruff_cache/ + +# Virtual environments +.venv/ +venv/ +ENV/ + +# IDE +.idea/ +.vscode/ +*.swp +*.swo +*~ + +# Git +.gitignore +.gitattributes + +# Documentation +docs/ +*.md +!README.md + +# Tests +tests/ +.pytest_cache/ + +# CI/CD +.github/ +.pre-commit-config.yaml + +# Environment files (will be passed as build args or mounted) +.env + +# Chainlit specific +.chainlit/ +.files/ + +# Misc +*.log +.DS_Store +literature/ \ No newline at end of file diff --git a/.gitignore b/.gitignore index a9be6f5f2..ba62b5dee 100644 --- a/.gitignore +++ b/.gitignore @@ -312,3 +312,6 @@ tests/example2.* # Client data src/paperqa/clients/client_data/retractions.csv +/src/.chainlit/* +.chainlit/* +chainlit.md diff --git a/DOCKER_README.md b/DOCKER_README.md new file mode 100644 index 000000000..d6b22dd7c --- /dev/null +++ b/DOCKER_README.md @@ -0,0 +1,102 @@ +# Docker Setup for Chainlit UI + +This guide explains how to run the DESTINY repo search agent UI in a Docker container. + +## Quick Start + +### Using Docker Compose (Recommended) + +1. **Build and run the container:** + ```bash + docker compose up --build + ``` + +2. **Access the UI:** + Open your browser to http://localhost:8000 + +3. **Stop the container:** + ```bash + docker compose down + ``` + +### Using Docker Directly (Untested) + +1. **Build the image:** + ```bash + docker build -t paper-qa-chainlit . + ``` + +2. **Run the container:** + ```bash + docker run -p 8000:8000 -p 42071:42071 --env-file .env paper-qa-chainlit + ``` + +## Environment Variables + +The application requires several environment variables for API access. These are automatically loaded from your `.env` file in the project's root directory when using Docker Compose. 
+ +If you haven't created one yet, make sure to do so! + +**Required variables:** +- `AZURE_API_BASE` - Azure OpenAI endpoint +- `AZURE_API_KEY` - Azure OpenAI API key +- `DESTINY_API_URL` - DESTINY API URL +- `DESTINY_CLIENT_ID` - DESTINY OAuth client ID +- `DESTINY_AUTHORITY` - DESTINY OAuth authority +- `DESTINY_LOGIN_HINT` - User email for DESTINY +- `DESTINY_SCOPES` - DESTINY API scopes + +## Data Persistence + +Docker Compose sets up volumes for persistent data: +- `chainlit-data` - Chainlit configuration and session data +- `chainlit-files` - Uploaded files +- `chainlit-public` - Public assets + +To view or backup this data: +```bash +docker volume ls +docker volume inspect paper-qa-chainlit-data +``` + +## Troubleshooting + +### Port already in use +If port 8000 is already in use, change it in docker-compose.yml: +```yaml +ports: + - "8080:8000" # Use port 8080 on host +``` + +### Authentication issues with DESTINY +The DESTINY OAuth flow requires interactive authentication using MSAL. The container exposes port 42071 for the authentication callback. + +**How it works:** +1. When you start the container and visit `localhost:8000` in your browser, MSAL will print an authentication URL in your terminal. +2. Copy the second URL and open it in your **host machine's browser** (not inside the container) +3. Complete the Microsoft login +4. The callback will be sent to `localhost:42071` which is forwarded to the container +5. 
Authentication completes and the token is cached + +**If authentication fails:** +- Ensure port 42071 is not blocked by your firewall +- Check that the port mapping is correct: `docker compose ps` +- Ensure you're visiting the **second** authentication URL from your terminal +- Ensure you visit `localhost:8000` in your browser before looking for the auth link in your terminal + +### View logs +```bash +docker compose logs -f chainlit-ui +``` + +## Cleanup + +Remove all containers and volumes: +```bash +docker compose down -v +``` + +Remove the image: +```bash +docker rmi paper-qa-chainlit +``` \ No newline at end of file diff --git a/Dockerfile b/Dockerfile new file mode 100644 index 000000000..8d6685285 --- /dev/null +++ b/Dockerfile @@ -0,0 +1,26 @@ +FROM python:3.12-slim-trixie +COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/ + +# Install git (required by setuptools-scm for version detection) +RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/* + +# Copy the project into the image +COPY . /app + +# Set working directory +WORKDIR /app + +# Install Python dependencies (skip dev dependencies) +RUN uv sync --locked --no-dev + +# Add the virtual environment to PATH so we use the installed packages +ENV PATH="/app/.venv/bin:$PATH" + +# Expose Chainlit default port +EXPOSE 8000 + +# Expose MSAL authentication callback port +EXPOSE 42071 + +# Run the Chainlit app directly from the venv +CMD ["chainlit", "run", "app.py", "--host", "0.0.0.0", "--port", "8000"] \ No newline at end of file diff --git a/README.md b/README.md index 8b3208854..6ce22d201 100644 --- a/README.md +++ b/README.md @@ -1,5 +1,9 @@ # PaperQA2 +## To run the project locally + +In line with the existing [CONTRIBUTING.md](CONTRIBUTING.md) file, executing `uv sync` in the project root is sufficient to start editing and running the project code locally. 
+ ## To run on our infrastructure There is a basic `azure.json` configuration file in `src/paperqa/configs` that provides a simple configuration `paperqa`'s `Settings` object. @@ -14,6 +18,85 @@ For it to work, it requires a `.env` file in the project root directory populate - `OPENALEX_MAILTO` To make use of the configuration, simply create a `Settings` object using its `from_name` class method, passing the stem of the json config as a string, i.e. `Settings.from_name("azure")`. + +## To run the DESTINY repo paper helper + +The following additional environment variables are required: + +- `DESTINY_API_URL` (ATTOW https://destiny-repository-stag-app.proudmeadow-2a76e8ac.swedencentral.azurecontainerapps.io) +- `DESTINY_CLIENT_ID` (ATTOW 96ed941e-15dc-4ec0-b9e7-e4eda99efd2e) +- `DESTINY_AUTHORITY` (ATTOW https://login.microsoftonline.com/f870e5ae-5521-4a94-b9ff-cdde7d36dd35) +- `DESTINY_SCOPES` (ATTOW api://14e3f6c0-b8aa-46c6-98d9-29b0dd2a0f7c/.default as a list, i.e. between double quotes ending with a comma) +- `DESTINY_LOGIN_HINT` (your UCL email address, see Lena's authentication notebook in the teams channel for more) + +See `test_contribs.py` for an example of running the paper helper. 
+ +Using this forked version of paper-qa as a local package/dependency should work if not: + +```python +import os +from dotenv import load_dotenv +from paperqa import Settings +from paperqa.contrib.destiny_paper_helper import DESTINYPaperHelper +from paperqa.settings import IndexSettings + +load_dotenv() + +paper_directory = "~/some-directory" + +settings = Settings.from_name("azure").model_copy( + update={ + "paper_directory": paper_directory, + "index": IndexSettings(paper_directory=paper_directory) + } +) +helper = DESTINYPaperHelper( + settings, + api_url=os.getenv("DESTINY_API_URL"), + client_id=os.getenv("DESTINY_CLIENT_ID"), + authority=os.getenv("DESTINY_AUTHORITY"), + login_hint=os.getenv("DESTINY_LOGIN_HINT"), + scopes=os.getenv("DESTINY_SCOPES").split(","), +) + +question = "What is the progress on climate change intervention research?" + +papers = await helper.fetch_relevant_papers(question) + +docs = await helper.aadd_docs(papers) + +session = await docs.aquery(question, settings=helper.settings) + +print(session.answer) +``` + +## To run the agent with a DESTINY repo search tool + +```python +from dotenv import load_dotenv +from paperqa import Settings, agent_query + +load_dotenv() # load your environment variables + +paper_directory = "~/some-directory" + +settings = Settings.from_name("search_only_destiny").model_copy( + update={ + "paper_directory": paper_directory, + "verbosity": 0 # to reduce output + } +) + +query = "What are the greatest health risks brought about by climate change?" 
+ +answer_response = await agent_query( + query=query, + settings=settings +) + +print(answer_response.session.answer) # show the agent's response +``` + [![GitHub](https://img.shields.io/badge/GitHub-black?logo=github&logoColor=white)](https://github.com/Future-House/paper-qa) diff --git a/app.py b/app.py new file mode 100644 index 000000000..0906f93c7 --- /dev/null +++ b/app.py @@ -0,0 +1,99 @@ +import tempfile + +import chainlit as cl +from lmi.utils import update_litellm_max_callbacks + +from paperqa import agent_query, Settings +from paperqa.sources.destiny_repo import get_access_token + +# Suppress LiteLLM callback warnings +update_litellm_max_callbacks() + + +@cl.on_chat_start +async def on_chat_start(): + print("Starting new chat session.") + print("Attempting to retrieve DESTINY access token...") + get_access_token() + print("Retrieved access token successfully.") + print("Creating paperqa settings...") + tempdir = tempfile.TemporaryDirectory() + cl.user_session.set("tempdir", tempdir) + settings = Settings.from_name("search_only_destiny") + settings.agent.index.paper_directory = tempdir.name + settings.verbosity = 0 + cl.user_session.set("paperqa-settings", settings) + + +@cl.on_chat_end +def on_chat_end(): + tempdir = cl.user_session.get("tempdir") + tempdir.cleanup() + + +@cl.on_message +async def main(message: cl.Message): + # Store steps for updating them during callbacks + current_step = None + step_count = 0 + + async def on_agent_action_callback(action, _state): + """Called when agent takes an action (tool call).""" + nonlocal current_step, step_count + step_count += 1 + + # Extract tool names and details from the action + tool_names = [tc.function.name for tc in action.tool_calls] + display_names = [name.replace("_", " ").title() for name in tool_names] + step_name = f"{', '.join(display_names)} Tool - Step {step_count}" + + # Build detailed output for each tool call + tool_details = [] + for tc in action.tool_calls: + tool_name = tc.function.name 
+ + # Check if this is a DESTINY search and extract the query + if tool_name == "destiny_search": + query = tc.function.arguments.get("query", "N/A") + tool_details.append(f"**DESTINY API Search Query**: `{query}`") + current_step = cl.Step(name=step_name, show_input=True) + else: + # For other tools, just show the name + tool_details.append(f"**{tool_name}**") + current_step = cl.Step(name=step_name, show_input=False) + + # Convert tool names to title case for display + await current_step.__aenter__() + current_step.input = "\n".join(tool_details) + await current_step.update() + + async def on_env_step_callback(obs, _reward, _done, _truncated): + """Called after environment processes the action.""" + nonlocal current_step + + if current_step is not None: + # Format the observations (tool results) + result_output = "" + for msg in obs: + if hasattr(msg, 'role') and msg.role == 'tool': + result_output = "\n\n" + str(msg.content) + + # Append the result to existing output instead of replacing it + current_step.output = f"\n\n**Completed**{result_output}" + await current_step.update() + await current_step.__aexit__(None, None, None) + current_step = None + + # Create a final answer step + async with cl.Step(name="Generating Answer") as answer_step: + answer_response = await agent_query( + query=str(message.content), + settings=cl.user_session.get("paperqa-settings"), + on_agent_action_callback=on_agent_action_callback, + on_env_step_callback=on_env_step_callback + ) + answer_step.output = "Agent completed processing" + + await cl.Message( + content=answer_response.session.answer + ).send() diff --git a/docker-compose.yml b/docker-compose.yml new file mode 100644 index 000000000..83208c305 --- /dev/null +++ b/docker-compose.yml @@ -0,0 +1,34 @@ +version: '3.8' + +services: + chainlit-ui: + build: + context: . 
+ dockerfile: Dockerfile + ports: + - "8000:8000" + - "42071:42071" # MSAL authentication callback port + environment: + # Azure OpenAI + - AZURE_API_BASE=${AZURE_API_BASE} + - AZURE_API_KEY=${AZURE_API_KEY} + + # DESTINY + - DESTINY_API_URL=${DESTINY_API_URL} + - DESTINY_CLIENT_ID=${DESTINY_CLIENT_ID} + - DESTINY_AUTHORITY=${DESTINY_AUTHORITY} + - DESTINY_LOGIN_HINT=${DESTINY_LOGIN_HINT} + - DESTINY_SCOPES=${DESTINY_SCOPES} + volumes: + + # Persist Chainlit data + - chainlit-data:/app/.chainlit + - chainlit-files:/app/.files + - chainlit-public:/app/public + + restart: unless-stopped + +volumes: + chainlit-data: + chainlit-files: + chainlit-public: \ No newline at end of file diff --git a/public/logo_dark.png b/public/logo_dark.png new file mode 100644 index 000000000..c9d6eca89 Binary files /dev/null and b/public/logo_dark.png differ diff --git a/public/logo_light.png b/public/logo_light.png new file mode 100644 index 000000000..14a28daee Binary files /dev/null and b/public/logo_light.png differ diff --git a/pyproject.toml b/pyproject.toml index 640b45ce8..3b45798c8 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -20,7 +20,6 @@ classifiers = [ "License :: OSI Approved :: Apache Software License", "Operating System :: OS Independent", "Programming Language :: Python :: 3 :: Only", - "Programming Language :: Python :: 3.11", "Programming Language :: Python :: 3.12", "Programming Language :: Python :: 3.13", "Programming Language :: Python", @@ -28,11 +27,15 @@ classifiers = [ ] dependencies = [ "anyio", + "chainlit>=2.6.3", + "destiny-sdk>=0.6.0", "fhaviary[llm]>=0.27", # For partial tool concurrency "fhlmi>=0.41.0", # Pin for LiteLLMModel.get_router "html2text", # TODO: evaluate moving to an opt-in dependency "httpx", "httpx-aiohttp", + "luqum>=1.0.0", + "msal>=1.34.0", "numpy", "paper-qa-pypdf", # TODO: after https://peps.python.org/pep-0771/, make this opt-out if 'pymupdf' extra is specified` "pyalex>=0.19", @@ -56,7 +59,7 @@ maintainers = [ ] name = 
"paper-qa" readme = "README.md" -requires-python = ">=3.11" +requires-python = ">=3.12" [project.optional-dependencies] dev = [ diff --git a/src/paperqa/agents/env.py b/src/paperqa/agents/env.py index 135eb9a2c..6675d2d09 100644 --- a/src/paperqa/agents/env.py +++ b/src/paperqa/agents/env.py @@ -36,7 +36,7 @@ GenerateAnswer, NamedTool, PaperSearch, - Reset, + Reset, DESTINYPaperSearch, ) logger = logging.getLogger(__name__) @@ -131,6 +131,12 @@ def make_tool(fn: Callable, tool_type: type[NamedTool] = tool_type) -> Tool: search_count=settings.agent.search_count, settings=settings ).clinical_trials_search ) + elif issubclass(tool_type, DESTINYPaperSearch): + tool = make_tool( + DESTINYPaperSearch( + settings=settings + ).destiny_search + ) else: raise NotImplementedError(f"Didn't handle tool type {tool_type}.") if tool.info.name == Complete.complete.__name__: diff --git a/src/paperqa/agents/tools.py b/src/paperqa/agents/tools.py index abc992f8f..ef94337f8 100644 --- a/src/paperqa/agents/tools.py +++ b/src/paperqa/agents/tools.py @@ -20,6 +20,7 @@ from paperqa.types import Context, DocDetails, PQASession from .search import get_directory_index +from ..sources.destiny_repo import add_destiny_references_to_docs logger = logging.getLogger(__name__) @@ -479,30 +480,30 @@ async def clinical_trials_search(self, query: str, state: EnvironmentState) -> s Query Syntax: Basic Search: Simple text automatically uses default EXPANSION[Relaxation] and COVERAGE[Contains] - >>> "heart attack" + "heart attack" Modified Search: Use operators to modify search behavior: - >>> 'EXPANSION[None]COVERAGE[FullMatch]"exact phrase"' - >>> 'EXPANSION[Concept]heart attack' + 'EXPANSION[None]COVERAGE[FullMatch]"exact phrase"' + 'EXPANSION[Concept]heart attack' Field Search: Specify fields using AREA operator: - >>> 'AREA[InterventionName]aspirin' - >>> 'AREA[Phase]PHASE3' + 'AREA[InterventionName]aspirin' + 'AREA[Phase]PHASE3' Location Search: Use SEARCH operator for compound location queries: - 
>>> 'cancer AND SEARCH[Location](AREA[LocationCity]Boston AND AREA[LocationState]Massachusetts)' + 'cancer AND SEARCH[Location](AREA[LocationCity]Boston AND AREA[LocationState]Massachusetts)' Complex Boolean: Combine terms with AND, OR, NOT and parentheses: - >>> '(cancer OR tumor) AND NOT (EXPANSION[None]pediatric OR AREA[StdAge]CHILD)' + '(cancer OR tumor) AND NOT (EXPANSION[None]pediatric OR AREA[StdAge]CHILD)' Date Ranges: Use RANGE to specify date ranges with formats like "yyyy-MM" or "yyyy-MM-dd". Note that MIN and MAX can be used for open-ended ranges: - >>> AREA[ResultsFirstPostDate]RANGE[2015-01-01, MAX] + AREA[ResultsFirstPostDate]RANGE[2015-01-01, MAX] Operators: EXPANSION[type]: Controls term expansion @@ -682,7 +683,67 @@ async def clinical_trials_search(self, query: str, state: EnvironmentState) -> s f" {offset + new_result_count} among {total_result_count} total" f" results. {state.status}" ) - return f"Error in clinical trial query syntax: {error_message}" + return f"Error in DESTINY Search API: {error_message}" + + +class DESTINYPaperSearch(NamedTool): + TOOL_FN_NAME = "destiny_search" + + CONCURRENCY_SAFE = True + + model_config = ConfigDict(extra="forbid") + + previous_searches: dict[str, int] = Field(default_factory=dict) + settings: Settings = Field(default_factory=Settings) + + async def destiny_search( + self, + query: str, + state: EnvironmentState + ) -> str: + """ + Search the DESTINY repository for climate and health papers to increase the paper count. + + Repeat previous calls with the same query to continue a search. + This tool can be called concurrently. + This tool introduces novel papers from DESTINY's curated repository, so invoke this tool when just beginning or when unsatisfied with the current evidence. + + Args: + query: A Lucene search query string starting with ?q=. Basic format is ?q=keyword1 AND keyword2. 
+ Examples: + - ?q=climate change AND health + - ?q=adaptation OR mitigation + - ?q=title:"climate change"&start_year=2015&end_year=2020 + state: Current state. + + Returns: + String describing searched papers and the current status. + """ + try: + page = self.previous_searches[query] + except KeyError: + page = self.previous_searches[query] = 1 + + total_result_count, new_result_count, error_message = ( + await add_destiny_references_to_docs( + query, + state.docs, + self.settings, + page=page + ) + ) + + self.previous_searches[query] += 1 + + if error_message is None: + return ( + f"Search found a total of {total_result_count} reference papers on DESTINY's repository." + f" From page {page} of the search results, {new_result_count} were successfully added to the environment docs." + f" {state.status}" + ) + + return f"Searching the DESTINY repository failed: {error_message}" + AVAILABLE_TOOL_NAME_TO_CLASS: dict[str, type[NamedTool]] = { diff --git a/src/paperqa/configs/azure.json b/src/paperqa/configs/azure.json index 572189e78..ddc731d56 100644 --- a/src/paperqa/configs/azure.json +++ b/src/paperqa/configs/azure.json @@ -87,9 +87,6 @@ } ] }, - "agent_type": "ToolSelector", - "index": { - "paper_directory": "literature/" - } + "agent_type": "ToolSelector" } } diff --git a/src/paperqa/configs/search_only_destiny.json b/src/paperqa/configs/search_only_destiny.json new file mode 100644 index 000000000..8e32f1da7 --- /dev/null +++ b/src/paperqa/configs/search_only_destiny.json @@ -0,0 +1,103 @@ +{ + "llm": "gpt-4.1-mini", + "llm_config": { + "model_list": [ + { + "model_name": "gpt-4.1-mini", + "litellm_params": { + "model": "azure/gpt-4.1-mini", + "base_model": "gpt-4o-mini", + "api_base": "https://eppireasoning.openai.azure.com/" + } + }, + { + "model_name": "text-embedding-3-small", + "litellm_params": { + "model": "azure/text-embedding-3-small", + "base_model": "text-embedding-3-small", + "api_base": "https://eppireasoning.openai.azure.com/" + } + } + ] + }, + 
"summary_llm": "gpt-4.1-mini", + "summary_llm_config": { + "model_list": [ + { + "model_name": "gpt-4.1-mini", + "litellm_params": { + "model": "azure/gpt-4.1-mini", + "base_model": "gpt-4o-mini", + "api_base": "https://eppireasoning.openai.azure.com/" + } + }, + { + "model_name": "text-embedding-3-small", + "litellm_params": { + "model": "azure/text-embedding-3-small", + "base_model": "text-embedding-3-small", + "api_base": "https://eppireasoning.openai.azure.com/" + } + } + ] + }, + "embedding": "text-embedding-3-small", + "embedding_config": { + "model_list": [ + { + "model_name": "gpt-4.1-mini", + "litellm_params": { + "model": "azure/gpt-4.1-mini", + "base_model": "gpt-4o-mini", + "api_base": "https://eppireasoning.openai.azure.com/" + } + }, + { + "model_name": "text-embedding-3-small", + "litellm_params": { + "model": "azure/text-embedding-3-small", + "base_model": "text-embedding-3-small", + "api_base": "https://eppireasoning.openai.azure.com/" + } + } + ] + }, + "parsing": { + "multimodal": false + }, + "agent": { + "agent_llm": "gpt-4.1-mini", + "agent_llm_config": { + "model_list": [ + { + "model_name": "gpt-4.1-mini", + "litellm_params": { + "model": "azure/gpt-4.1-mini", + "base_model": "gpt-4o-mini", + "api_base": "https://eppireasoning.openai.azure.com/" + } + }, + { + "model_name": "text-embedding-3-small", + "litellm_params": { + "model": "azure/text-embedding-3-small", + "base_model": "text-embedding-3-small", + "api_base": "https://eppireasoning.openai.azure.com/" + } + } + ] + }, + "tool_names": [ + "destiny_search", + "gather_evidence", + "gen_answer", + "complete" + ], + "agent_type": "ToolSelector" + }, + "answer": { + "evidence_k": 15, + "answer_max_sources": 5, + "max_concurrent_requests": 10 + } +} diff --git a/src/paperqa/contrib/__init__.py b/src/paperqa/contrib/__init__.py index 5552a5539..bf683f4bf 100644 --- a/src/paperqa/contrib/__init__.py +++ b/src/paperqa/contrib/__init__.py @@ -1,3 +1,5 @@ -from .zotero import ZoteroDB - -__all__ 
= ["ZoteroDB"] +try: + from .zotero import ZoteroDB + __all__ = ["ZoteroDB"] +except ImportError: + __all__ = [] diff --git a/src/paperqa/contrib/destiny_paper_helper.py b/src/paperqa/contrib/destiny_paper_helper.py new file mode 100644 index 000000000..a7faba2d0 --- /dev/null +++ b/src/paperqa/contrib/destiny_paper_helper.py @@ -0,0 +1,241 @@ +import json +import logging +from pathlib import Path +from typing import Any + +import anyio +import httpx +import httpx_aiohttp +from aviary.message import Message +from destiny_sdk.enhancements import EnhancementType +from destiny_sdk.identifiers import ExternalIdentifierType +from destiny_sdk.references import ReferenceSearchResult, Reference +from lmi import LiteLLMModel +from luqum.parser import parser, ParseSyntaxError +from msal import PublicClientApplication +from pydantic import Field, BaseModel, field_validator +from pydantic import ValidationError  # v2 error type raised by model_validate + +from paperqa import Settings, Docs +from paperqa.prompts import DESTINY_search_api_docs + +logger = logging.getLogger(__name__) + +class DESTINYSearchQuery(BaseModel): + query: str = Field(description="The search query to be passed to DESTINY's Search API") + +class FailedToGetRelevantPapersError(Exception): + pass + +class LuceneQuery(BaseModel): + """Validated Lucene query model.""" + + query: str = Field( + description="A valid Lucene query syntax string" + ) + + @field_validator('query') + @classmethod + def validate_lucene_syntax(cls, v: str) -> str: + """Validate that the query is valid Lucene syntax.""" + try: + parser.parse(v) + return v + except ParseSyntaxError as e: + raise ValueError(f"Invalid Lucene syntax: {e}") from e + +class DESTINYPaperHelper: + def __init__( + self, + settings: Settings, + api_url: str, + client_id: str, + authority: str, + login_hint: str, + scopes: list[str], + search_endpoint: str = "/v1/references/search/", + max_timeout: float = 15.0, + max_attempts: int = 5 + ): + self.settings = settings + 
Path(settings.paper_directory).mkdir(parents=True, exist_ok=True) + + self.api_url = api_url + self.search_endpoint = search_endpoint + + self.app = PublicClientApplication( + client_id=client_id, + authority=authority, + client_credential=None + ) + # TODO we might not need a token + self.token = self.app.acquire_token_interactive( + login_hint=login_hint, + scopes=scopes + ) + + self.access_token = self.token["access_token"] + + self.max_timeout = max_timeout + self.max_attempts = max_attempts + + self.llm_model = LiteLLMModel( + name=self.settings.llm, + config=self.settings.llm_config + ) + async def fetch_relevant_papers(self, question: str) -> dict[str, Reference]: + """Get relevant papers/references for a given question using an LLM.""" + relevant_references = await self._get_relevant_references(question) + await self.download_papers(relevant_references) + return {str(ref.id):ref for ref in relevant_references} + + async def download_papers(self, references: list[Reference]) -> None: + """Download PDFs of all relevant papers found from the DESTINY repository search.""" + downloaded_references = Path(self.settings.paper_directory).glob("*.pdf") + downloaded_ids = {ref.stem for ref in downloaded_references} + for ref in references: + if str(ref.id) not in downloaded_ids: + await self._download_pdf(ref) + + async def _download_pdf(self, reference: Reference) -> bool: + """Download a single PDF file""" + pdf_urls = self._parse_pdf_urls_from_reference(reference) + + async with httpx_aiohttp.HttpxAiohttpClient( + follow_redirects=True, + timeout=self.max_timeout + ) as client: + for url in pdf_urls: + try: + response = await client.get(url) + response.raise_for_status() + async with await anyio.open_file( + f"{self.settings.paper_directory}/{str(reference.id)}.pdf", "wb" + ) as f: + await f.write(response.content) + logger.info(f"Successfully downloaded {str(reference.id)}.pdf") + return True + except httpx.HTTPStatusError as e: + logger.warning( + f"Failed to 
download the PDF. Status code: {e.response.status_code}, text:" + f" {response.text}" + ) + except httpx.ReadTimeout as e: + logger.warning( + f"Failed to download the {str(reference.id)}.pdf. Timeout reached: {e}" + ) + return False + + def _parse_pdf_urls_from_reference(self, ref: Reference) -> list[str]: + pdf_urls = [] + for enhancement in ref.enhancements: + metadata = enhancement.content + if metadata.enhancement_type is EnhancementType.LOCATION: + # pdf urls are instances of HttpUrl so need to be cast to strings + pdf_urls += [ + str(location.pdf_url) for location in metadata.locations + if location.pdf_url is not None + ] + + return pdf_urls + + async def _get_relevant_references(self, question: str) -> list[Reference]: + """Perform a search using DESTINY's search API using an LLM generated search query.""" + search_query = await self._generate_lucene_search_query(question) + for _ in range(self.max_attempts): + try: + resp = httpx.get( + f"{self.api_url}{self.search_endpoint}?q={search_query.query}", + headers={"Authorization": f"Bearer {self.access_token}"}, + timeout=self.max_timeout + ) + resp.raise_for_status() + search_result = ReferenceSearchResult.model_validate(resp.json()) + references = search_result.references + if not references: + raise ValueError(f"No references found for {search_query.query}") + return references + except ValidationError as e: + print(f"Invalid response format: {e}") + raise e + except ValueError as e: + print(f"Value Error: {e}") + additional_context = f"The last search returned no references: {e}. Try creating a different search query to {search_query}." + search_query = await self._generate_lucene_search_query(question, additional_context) + except httpx.HTTPStatusError as e: + print(f"HTTP Status Error: {e}") + additional_context = f"The last search query produced: {e} with response: {resp.json()["detail"]}. Try creating a different search query to {search_query}." 
+ search_query = await self._generate_lucene_search_query(question, additional_context) + raise FailedToGetRelevantPapersError( + f"Failed to get relevant papers after {self.max_attempts} attempts. Last search_query: {search_query}" + ) + + # TODO sometimes the model generates a query that passes our validation but fails on the API call + async def _generate_lucene_search_query(self, question: str, additional_context: str = "") -> LuceneQuery: + prompt = f"{additional_context}\n\n" + ( + "You are the helper model that aims to generate a search query in Lucene query syntax to retrieve relevant papers" + " for the user's question from the DESTINY Repository.\n\n" + "User's question:\n" + ) + f"{question}\n\n{DESTINY_search_api_docs}" + + response = await self.llm_model.call_single( + messages=[Message(role="user", content=prompt)], + output_type=LuceneQuery, + temperature=0.1 + ) + + # unsure if using LuceneQuery as output type runs the validation check + # so we explicitly run it here + # TODO we should check if this is redundant + lucene_query = LuceneQuery.model_validate(json.loads(str(response.text))) + + return lucene_query + + def _parse_metadata_from_reference(self, ref: Reference) -> dict[str, Any]: + metadata = {} + + for identifier in ref.identifiers: + if identifier.identifier_type is ExternalIdentifierType.DOI: + metadata["doi"] = identifier.identifier + + for enhancement in ref.enhancements: + content = enhancement.content + + match content.enhancement_type: + case EnhancementType.BIBLIOGRAPHIC: + metadata["authors"] = [author.display_name for author in content.authorship] + metadata["title"] = content.title + case EnhancementType.ABSTRACT: + metadata["abstract"] = content.abstract + + return metadata + + async def aadd_docs( + self, references: dict[str, Reference] | None = None, docs: Docs | None = None + ) -> Docs: + if docs is None: + docs = Docs() + for doc_path in Path(self.settings.paper_directory).rglob( # noqa: ASYNC240 + "*.pdf" + ): + ref = 
references.get(doc_path.stem) if references is not None else None + if ref: + metadata = self._parse_metadata_from_reference(ref) + # TODO find a way to use bibliographic data + try: + await docs.aadd( + doc_path, + settings=self.settings, + title=metadata.get("title", "Unknown"), + abstract=metadata.get("abstract", "Unknown"), + doi=metadata.get("doi", "Unknown"), + authors=metadata.get("authors", None) + ) + except ValueError as e: + logger.warning(f"Failed to add {doc_path} to Docs: {e}") + else: + await docs.aadd(doc_path, settings=self.settings) + return docs + + + + diff --git a/src/paperqa/prompts.py b/src/paperqa/prompts.py index 65335e35f..93ec3bf95 100644 --- a/src/paperqa/prompts.py +++ b/src/paperqa/prompts.py @@ -220,3 +220,201 @@ "\n\n{context_text}Describe the screenshot," # Allow for empty context_text " or if uncertain on a description please state why:" ) +DESTINY_search_api_docs = """ +### API Query String Search + +The simplest API interface for searching references is the [query string search](https://destiny-repository-prod-app.politesea-556f2857.swedencentral.azurecontainerapps.io/redoc#tag/search/operation/search_references_v1_references_search__get) at /v1/references/search/. This endpoint requires [authentication](https://destiny-evidence.github.io/destiny-repository/procedures/oauth.html). + +#### Parameters + +The only required parameter is the query string `q`. Additional optional parameters can be provided to filter, sort, and page through results. 
+ +##### Query String (required) + +The `q` parameter is a query string in the [Lucene syntax](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#query-string-syntax). + +At its simplest, this can be a simple keyword search, which will search over `title` and `abstract`: + +# Get references with "climate change" anywhere in the title or abstract: +?q=climate change + +# Get references with both "climate change" and "health" anywhere in the title or abstract: +?q=climate change AND health + +Note + +Query parameters must be [URL-encoded](https://www.w3schools.com/tags/ref_urlencode.ASP). For example, spaces must be encoded as `%20` or `+`. Most HTTP client libraries will do this automatically. + +More complex queries can be constructed using the search syntax and the set of [searchable fields](https://destiny-evidence.github.io/destiny-repository/procedures/search.html#search-fields). + +# Get references with "climate", "climatology" etc in the title and either "John Doe" or "Jane Smith" as an author: +?q=title:"climat*" AND authors:("John Doe" OR "Jane Smith") + +# Get references with "adaptation" or "mitigation" in the abstract that haven't yet been classified against the `Intervention` taxonomy: +?q=abstract:(adaptation OR mitigation) AND NOT evaluated_schemes:classification:taxonomy:Intervention + +# Get references with "climate change" in any order and a typoed "health": +?q="change climate"~2 AND helth~ + +##### Start Year and End Year + +The minimum and maximum publication years (inclusive) for references to return. 
+ +# Get references published from 2015 onwards: +?q=...&start_year=2015 + +# Get references published up to and including 2020: +?q=...&end_year=2020 + +# Get references published from 2015 to 2020: +?q=...&start_year=2015&end_year=2020 + +##### Annotations + +The `annotation` parameter can be used to filter results based on their annotations. + +These are provided in the format `[/