WikidataSearch is the API and web app for semantic retrieval over the Wikidata Vector Database from the Wikidata Embedding Project.
This repository powers the public service. The intended usage is the hosted API, not running your own deployment.
Hosted Web App: https://wd-vectordb.wmcloud.org/
Hosted API Docs (OpenAPI): https://wd-vectordb.wmcloud.org/docs
Project Page: https://www.wikidata.org/wiki/Wikidata:Vector_Database
Base URL:
https://wd-vectordb.wmcloud.org
Use a descriptive User-Agent for query endpoints. Generic user agents are rejected.
Example header:
User-Agent: WikidataSearch-Client/1.0 (your-email@example.org)
Current operational constraints:
- Rate limit is applied per User-Agent (default: 30/minute).
- Query endpoints require a descriptive User-Agent header.
- Current vector shards are en, fr, ar, and de.
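Because the limit is counted per User-Agent, a client can pace itself instead of waiting to be rejected. The following is a minimal sketch of a client-side sliding-window throttle, assuming the default of 30 requests per minute. The class name MinuteRateLimiter is hypothetical, and the server's own limiting algorithm is not described here, so treat the window logic as an assumption.

```python
import time
from collections import deque


class MinuteRateLimiter:
    """Client-side sliding window: at most `limit` calls per `window` seconds."""

    def __init__(self, limit=30, window=60.0, clock=time.monotonic, sleep=time.sleep):
        self.limit = limit
        self.window = window
        self.clock = clock    # injectable so the limiter can be tested without waiting
        self.sleep = sleep
        self.calls = deque()  # timestamps of calls still inside the window

    def _prune(self, now):
        # Forget calls that have left the window.
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()

    def wait(self):
        """Block until another call is allowed, then record it."""
        now = self.clock()
        self._prune(now)
        if len(self.calls) >= self.limit:
            # Sleep until the oldest recorded call leaves the window.
            self.sleep(self.window - (now - self.calls[0]))
            now = self.clock()
            self._prune(now)
        self.calls.append(now)
```

Calling `wait()` immediately before each request keeps a single client under the default limit.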
Semantic + keyword search for Wikidata items (QIDs), fused with Reciprocal Rank Fusion (RRF).
Parameters:
- query (required): natural-language query or ID.
- lang (default: all): vector shard language; unknown languages are translated then searched globally.
- K (default/max: 50): number of top results requested.
- instanceof (optional): comma-separated QIDs used as P31 filter.
- rerank (default: false): apply reranker on textified Wikidata content.
- return_vectors (default: false): include vectors in response payload.
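The parameters above can also be assembled from Python. This is a sketch using only the standard library; the helper name build_item_query is hypothetical, the lower bound of 1 for K is an assumption (only the maximum of 50 is documented), and booleans are sent as lowercase true/false to match the curl examples in this README. The function only builds the request and sends nothing.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://wd-vectordb.wmcloud.org"
USER_AGENT = "WikidataSearch-Client/1.0 (your-email@example.org)"


def build_item_query(query, lang="all", k=50, instanceof=None,
                     rerank=False, return_vectors=False):
    """Return an unsent GET request for /item/query/ (hypothetical helper)."""
    if not query:
        raise ValueError("query is required")
    if not 1 <= k <= 50:  # 50 is the documented maximum; 1 is an assumed minimum
        raise ValueError("K must be between 1 and 50")
    params = {
        "query": query,
        "lang": lang,
        "K": k,
        "rerank": str(rerank).lower(),
        "return_vectors": str(return_vectors).lower(),
    }
    if instanceof:
        # The API expects comma-separated QIDs for the P31 filter.
        params["instanceof"] = ",".join(instanceof)
    url = f"{BASE_URL}/item/query/?{urlencode(params)}"
    return Request(url, headers={"User-Agent": USER_AGENT})
```

The returned object can be passed to `urllib.request.urlopen` and the body decoded with `json.load`.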
Example:
curl -sG 'https://wd-vectordb.wmcloud.org/item/query/' \
--data-urlencode 'query=Douglas Adams' \
--data-urlencode 'lang=en' \
--data-urlencode 'K=10' \
-H 'User-Agent: WikidataSearch-Client/1.0 (your-email@example.org)'
Semantic + keyword search for Wikidata properties (PIDs), fused with RRF.
Parameters:
- query (required)
- lang (default: all)
- K (default/max: 50)
- instanceof (optional): comma-separated QIDs used as P31 filter.
- exclude_external_ids (default: false): excludes properties with datatype external-id.
- rerank (default: false)
- return_vectors (default: false): include vectors in response payload.
Example:
curl -sG 'https://wd-vectordb.wmcloud.org/property/query/' \
--data-urlencode 'query=instance of' \
--data-urlencode 'lang=en' \
--data-urlencode 'exclude_external_ids=true' \
-H 'User-Agent: WikidataSearch-Client/1.0 (your-email@example.org)'
Similarity scoring for a fixed list of Wikidata IDs (QIDs and/or PIDs) against one query.
Parameters:
- query (required)
- qid (required): comma-separated IDs, for example Q42,Q5,P31 (maximum: 100 IDs).
- lang (default: all)
- return_vectors (default: false): include vectors in response payload.
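When more than 100 IDs need scoring, the list has to be split across several requests. The sketch below validates the IDs and yields ready-to-send qid values. The helper name qid_batches is hypothetical, and the ID pattern (Q or P followed by a positive integer) reflects the usual Wikidata ID form rather than a documented server-side check.

```python
import re

MAX_IDS = 100  # per-request maximum stated in the parameter list
ID_PATTERN = re.compile(r"[QP][1-9]\d*")


def qid_batches(ids, size=MAX_IDS):
    """Validate IDs, then yield comma-separated `qid` values of at most `size` IDs."""
    ids = list(ids)
    bad = [i for i in ids if not ID_PATTERN.fullmatch(i)]
    if bad:
        raise ValueError(f"not a QID or PID: {bad}")
    for start in range(0, len(ids), size):
        yield ",".join(ids[start:start + size])
```

Each yielded string can be sent as the qid parameter of one /similarity-score/ request.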
Example:
curl -sG 'https://wd-vectordb.wmcloud.org/similarity-score/' \
--data-urlencode 'query=science fiction writer' \
--data-urlencode 'qid=Q42,Q25169,P31' \
-H 'User-Agent: WikidataSearch-Client/1.0 (your-email@example.org)'
/item/query/ returns objects with:
- QID
- similarity_score
- rrf_score
- source (Vector Search, Keyword Search, or both)
- reranker_score (when rerank=true)
- vector (when return_vectors=true)
/property/query/ returns the same shape with PID instead of QID.
/similarity-score/ returns:
- QID or PID
- similarity_score
- vector (when return_vectors=true)
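The fields listed above can be consumed with a few lines of Python. This sketch assumes the response body decodes to a list of objects and that similarity_score is numeric; neither is confirmed here, so check the OpenAPI docs before relying on it. The helper name best_ids is hypothetical and the sample values are invented for illustration, not real API output.

```python
def best_ids(results, min_score=0.0):
    """Return (id, similarity_score) pairs at or above min_score, best first."""
    pairs = []
    for item in results:
        # /similarity-score/ entries carry either a QID or a PID.
        entity_id = item.get("QID") or item.get("PID")
        score = item.get("similarity_score")
        if entity_id is None or score is None:
            continue  # skip entries that do not match the assumed shape
        if score >= min_score:
            pairs.append((entity_id, score))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Illustrative data only; these scores are made up.
sample = [
    {"QID": "Q42", "similarity_score": 0.91},
    {"PID": "P31", "similarity_score": 0.40},
    {"QID": "Q5", "similarity_score": 0.55},
]
top = best_ids(sample, min_score=0.5)
```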
High-level request flow:
- FastAPI route receives the query and enforces the User-Agent policy and rate limit.
- HybridSearch orchestrates retrieval:
  - Vector path: embeds the query with Jina embeddings and searches Astra DB vector collections across language shards in parallel.
  - Keyword path: runs Wikidata keyword search against wikidata.org.
- Results are fused with Reciprocal Rank Fusion (RRF), preserving source attribution.
- Optional reranking fetches Wikidata text representations and reorders top hits with Jina reranker.
- JSON response is returned and request metadata is logged for analytics.
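The fusion step in the flow above can be illustrated with a generic Reciprocal Rank Fusion sketch. This is not the code in HybridSearch.py: the function name rrf_fuse, the output key id, and the constant k=60 (a commonly used RRF default) are assumptions made for the example. Each ID scores the sum of 1 / (k + rank) over the lists it appears in, and the contributing sources are kept for attribution.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked ID lists with RRF.

    rankings maps a source name to a list of IDs ordered best first.
    """
    scores = defaultdict(float)
    sources = defaultdict(list)
    for source, ids in rankings.items():
        for rank, entity_id in enumerate(ids, start=1):
            scores[entity_id] += 1.0 / (k + rank)  # RRF contribution of this list
            sources[entity_id].append(source)      # preserve source attribution
    ordered = sorted(scores, key=scores.get, reverse=True)
    return [
        {"id": i, "rrf_score": scores[i], "source": sources[i]}
        for i in ordered
    ]


fused = rrf_fuse({
    "Vector Search": ["Q42", "Q5"],
    "Keyword Search": ["Q42", "Q1"],
})
```

An ID returned by both paths, such as Q42 here, accumulates two contributions and therefore ranks above IDs found by only one path.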
Main components in this repo:
- API app and routing: wikidatasearch/main.py, wikidatasearch/routes/
- Retrieval orchestration: wikidatasearch/services/search/HybridSearch.py
- Vector retrieval backend: wikidatasearch/services/search/VectorSearch.py
- Keyword retrieval backend: wikidatasearch/services/search/KeywordSearch.py
- Embeddings/reranking client: wikidatasearch/services/jina.py
See LICENSE.