feat: Excel search improvements (multi-keyword + row context)#59
feat: Excel search improvements (multi-keyword + row context)#59k-ibaraki wants to merge 7 commits into
Conversation
- Add include_surrounding_cells parameter to get row context in search - Reduces API calls from N+1 to 1 when row data needed - Default False for backward compatibility - Add comma-separated multi-keyword OR search - Example: query='budget,forecast' - Backward compatible with single keyword - Update tool description for LLM usability - Add comprehensive test coverage - Fix dictionary iteration issue in _scan_sheet Resolves #55, #56 Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Summary of ChangesHello @k-ibaraki, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the Excel search capabilities by introducing multi-keyword OR search functionality and an option to retrieve full row context for matching cells. These improvements aim to make the Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Pull request overview
Excel検索の使い勝手と取得効率を同時に改善していて、運用上の負荷(追加API呼び出し)を下げる方向性がとても良いです。既存の検索モードを保ちながら、複数キーワード対応と行コンテキスト取得をオプトインで追加する内容になっています。
Changes:
- Excel検索にカンマ区切りの複数キーワード OR 検索を追加
- 検索結果に同一行のセル群を含める
include_surrounding_cellsを追加 - ツール説明・README/usage(日英)とテストを更新
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
src/sharepoint_excel.py |
検索API拡張(複数キーワード/行データ付与)を実装 |
src/server.py |
MCPツール関数の引数・呼び出し・説明文を拡張 |
tests/test_sharepoint_excel.py |
新機能(行コンテキスト/複数キーワード)をユニットテストで追加検証 |
tests/test_server.py |
サーバ層からの呼び出し引数(新パラメータ)をテストに反映 |
docs/usage.md |
新しい検索オプションの利用方法を追記(英語) |
docs/usage_ja.md |
新しい検索オプションの利用方法を追記(日本語) |
README.md |
検索モードの機能説明を更新(英語) |
README_ja.md |
検索モードの機能説明を更新(日本語) |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # 1つ目のマッチ(D2: "在庫あり") | ||
| match1 = result["matches"][0] | ||
| assert match1["coordinate"] == "D2" | ||
| assert match1["value"] == "在庫あり" | ||
| assert "row_data" in match1 | ||
| assert len(match1["row_data"]) == 4 | ||
| assert match1["row_data"][1]["value"] == "商品A" | ||
|
|
||
| # 2つ目のマッチ(D4: "在庫わずか") | ||
| match2 = result["matches"][1] | ||
| assert match2["coordinate"] == "D4" | ||
| assert match2["value"] == "在庫わずか" | ||
| assert "row_data" in match2 | ||
| assert len(match2["row_data"]) == 4 | ||
| assert match2["row_data"][1]["value"] == "商品C" | ||
|
|
There was a problem hiding this comment.
このテストは result["matches"][0] / [1] の順序が D2→D4 になることを前提にしていますが、検索結果の順序は実装詳細(openpyxl の内部辞書順や走査順)に依存して変わり得ます。順序を保証しない API であれば、coordinate で該当 match を取り出して検証する(または座標でソートしてから比較する)形にするとテストが安定します。
| # 1つ目のマッチ(D2: "在庫あり") | |
| match1 = result["matches"][0] | |
| assert match1["coordinate"] == "D2" | |
| assert match1["value"] == "在庫あり" | |
| assert "row_data" in match1 | |
| assert len(match1["row_data"]) == 4 | |
| assert match1["row_data"][1]["value"] == "商品A" | |
| # 2つ目のマッチ(D4: "在庫わずか") | |
| match2 = result["matches"][1] | |
| assert match2["coordinate"] == "D4" | |
| assert match2["value"] == "在庫わずか" | |
| assert "row_data" in match2 | |
| assert len(match2["row_data"]) == 4 | |
| assert match2["row_data"][1]["value"] == "商品C" | |
| # coordinate をキーにしてマッチ結果を引けるようにしておく | |
| matches_by_coord = {m["coordinate"]: m for m in result["matches"]} | |
| # D2 のマッチ("在庫あり") | |
| match_d2 = matches_by_coord["D2"] | |
| assert match_d2["value"] == "在庫あり" | |
| assert "row_data" in match_d2 | |
| assert len(match_d2["row_data"]) == 4 | |
| assert match_d2["row_data"][1]["value"] == "商品A" | |
| # D4 のマッチ("在庫わずか") | |
| match_d4 = matches_by_coord["D4"] | |
| assert match_d4["value"] == "在庫わずか" | |
| assert "row_data" in match_d4 | |
| assert len(match_d4["row_data"]) == 4 | |
| assert match_d4["row_data"][1]["value"] == "商品C" |
| if include_surrounding_cells: | ||
| row_cells = sheet[cell.row] | ||
| match_entry["row_data"] = [ | ||
| { | ||
| "value": self._serialize_value(c.value), | ||
| "coordinate": c.coordinate, | ||
| } | ||
| for c in row_cells | ||
| ] |
There was a problem hiding this comment.
include_surrounding_cells=True のときに row_cells = sheet[cell.row] で行全体を取得していますが、openpyxl ではこのアクセスが max_column までのセルを実体化(空セルも Cell 化)してメモリ/CPU を急増させることがあります。大きいシートや書式だけ付いた広い範囲だと、検索1件でも巨大な row_data を作ってしまうので、(1) 既存セルのみから row_data を組み立てる/必要な列範囲だけに限定する、(2) 既存の config.excel_max_data_cols 相当の上限で打ち切る、などの防御を入れるのが安全です。
| include_surrounding_cells: Trueの場合、マッチしたセルと同じ行の全セルを含める | ||
| """ | ||
| # カンマ区切りで複数キーワードを解析 | ||
| keywords = [kw.strip() for kw in query.split(",") if kw.strip()] |
There was a problem hiding this comment.
keywords = [kw.strip() for kw in query.split(",") if kw.strip()] の結果が空リストになり得ます(例: query="," や " , ")。この場合は any(...) が常に False になり、入力ミスでも 0 件ヒットとして静かに成功してしまいます。少なくとも keywords が空なら ValueError を投げる、または search_cells の戻り JSON に warnings を入れる等で、無効なクエリを明示した方が API として分かりやすいです。
| keywords = [kw.strip() for kw in query.split(",") if kw.strip()] | |
| keywords = [kw.strip() for kw in query.split(",") if kw.strip()] | |
| # キーワードが 1 つも抽出できない場合は、入力ミスなどの無効なクエリとみなす | |
| if not keywords: | |
| raise ValueError( | |
| "無効な検索クエリです。カンマ以外の有効なキーワードを 1 つ以上指定してください。" | |
| ) |
There was a problem hiding this comment.
Code Review
This pull request introduces valuable improvements to the Excel search functionality, including multi-keyword OR search and the ability to fetch row context for matches. However, the implementation of the include_surrounding_cells feature creates a significant Denial of Service (DoS) vulnerability due to a lack of resource limits and deduplication logic, potentially leading to excessive memory consumption and data duplication. The search_cells method also lacks crucial data size limits. Additionally, there is an opportunity to refactor duplicated code to improve maintainability.
| if include_surrounding_cells: | ||
| row_cells = sheet[cell.row] | ||
| match_entry["row_data"] = [ | ||
| { | ||
| "value": self._serialize_value(c.value), | ||
| "coordinate": c.coordinate, | ||
| } | ||
| for c in row_cells | ||
| ] | ||
|
|
||
| matches.append(match_entry) |
There was a problem hiding this comment.
The implementation of include_surrounding_cells can lead to significant memory exhaustion. For every matching cell, the entire row's data is fetched and added to the response. If multiple cells in the same row match the query, the entire row is duplicated in the response for each match. In a wide sheet (e.g., 2000 columns), a few hundred matches can result in a massive JSON response that could crash the server or the client.
Remediation:
- Deduplicate row data so that each row is only included once per search result, even if multiple cells in that row match.
- Implement a hard limit on the total number of matches or cells returned in a single search request.
- Apply the
excel_max_data_colslimit to the row context fetching.
| if include_surrounding_cells: | ||
| row_cells = sheet[cell.row] | ||
| match_entry["row_data"] = [ | ||
| { | ||
| "value": self._serialize_value(c.value), | ||
| "coordinate": c.coordinate, | ||
| } | ||
| for c in row_cells | ||
| ] | ||
|
|
||
| matches.append(match_entry) |
There was a problem hiding this comment.
| for cell in sheet._cells.values(): | ||
| # Note: リストに変換してから反復処理することで、 | ||
| # row_data取得時のシート内部の辞書変更によるエラーを回避 | ||
| for cell in list(sheet._cells.values()): |
There was a problem hiding this comment.
The search_cells method lacks crucial data size limits (excel_max_data_rows, excel_max_data_cols) that are enforced in parse_to_json, creating a Denial of Service (DoS) vulnerability with large workbooks. It is recommended to apply these limits to prevent excessive resource consumption. Additionally, the logic for creating a match entry is duplicated across if hasattr(sheet, "_cells") and its else block, which could be refactored into a helper function to improve maintainability.
Based on real-world usage feedback, OR search was causing poor results
(23 hits → 7 attempts → 26,600 tokens). AND search provides better
precision for narrowing down results.
Changes:
- Replace comma-separated OR search with space-separated AND search
- Update keyword parsing: split(",") → split()
- Change match logic: any() → all()
- Update all tests and documentation
Example:
- Before: query="予算,利益" (OR - finds cells with either keyword)
- After: query="予算 報告" (AND - finds cells with both keywords)
Use case: "簾舞 連絡先" finds cells containing both keywords in one call,
instead of getting too many irrelevant results with OR search.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Update all documentation to reflect the change from OR to AND search: - README.md / README_ja.md: Update feature description - docs/usage.md / docs/usage_ja.md: Update examples and best practices Examples changed: - "budget,forecast" (OR) → "budget report" (AND) - "予算,見積" (OR) → "予算 報告" (AND) Best practice guidance added: - Start with single keyword - Add more keywords to narrow down if needed Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
フィードバックを受けて修正実際のエージェント使用で、OR検索が使いづらいというフィードバックを受けて、AND検索に変更しました。 問題点
修正内容OR検索(削除)→ AND検索(実装) # 変更前(OR検索)
query="予算,利益" # いずれかを含む → 結果多すぎ
# 変更後(AND検索)
query="簾舞 連絡先" # 両方を含む → 1回で絞り込み実装変更
ベストプラクティス
✅ 全テスト通過 |
- Add empty query validation (ValueError on empty/whitespace-only) - Add error handling for include_surrounding_cells row_data retrieval - Add match count limits (1000 max, 500 warning threshold) - Add warnings array to search response for actionable feedback - Add comprehensive test coverage for edge cases Improves production robustness based on agent feedback Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Document empty query restriction - Document 1000 match limit and 500 warning threshold - Add error handling documentation (row_data_error field) - Add warnings array documentation - Update workflow recommendations for large result sets Helps users write effective search queries Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Add query guidelines (empty query, specific keywords) - Document performance considerations (1000 limit, 500 warning) - Add error handling examples - Add good/bad query examples - Add warning handling examples Helps users write effective search queries Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
むしろ精度が下がったので破棄 |
Summary
include_surrounding_cellsparameter for getting row context in one API callChanges
include_surrounding_cells: bool = Falseparameter to searchTest Plan
include_surrounding_cellsfeaturePerformance Impact
Closes #55
Closes #56
🤖 Generated with Claude Code