Duplicate Issue Detection Tool

Overview

This repository includes a tool that helps identify duplicate issues. With over 1200 open issues, finding duplicates manually is impractical. The tool automates the search by scoring pairs of issues on a weighted combination of title, body, label, and keyword similarity.

Quick Start

Prerequisites

Python 3.7 or higher
requests library (installed from requirements.txt in the Installation step)
(Optional) GitHub Personal Access Token for higher API rate limits

Installation

# Run from the repository root, like the commands in the following sections
pip install -r tools/requirements.txt

Basic Usage

# Run with default settings (analyzes all open issues)
python tools/find-duplicates.py

# Or use the helper script
cd tools && ./run.sh

Custom Analysis

# Analyze only the 200 most recent issues with a lower threshold
python tools/find-duplicates.py --max-issues 200 --threshold 0.6

# Use a GitHub token for higher rate limits
python tools/find-duplicates.py --token YOUR_TOKEN

# Save results to a custom location
python tools/find-duplicates.py --output results/my-analysis.json

How It Works

The tool analyzes issues using multiple similarity metrics:

Title Similarity (50% weight): Compares issue titles using sequence matching
Body Similarity (20% weight): Compares the first 500 characters of the issue descriptions
Label Similarity (15% weight): Compares issue labels (bug, feature request, etc.)
Keyword Similarity (15% weight): Compares the WebView2-specific keywords detected in each issue
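
The weighting above can be sketched roughly as follows. This is an illustration under stated assumptions, not the tool's actual calculate_similarity(): the use of difflib.SequenceMatcher for "sequence matching", Jaccard overlap for labels and keywords, and the issue dictionary layout (title, body, labels, keywords) are all hypothetical. Only the four weights and the 500-character body limit come from the list above.

```python
from difflib import SequenceMatcher

# Weights taken from the list above; everything else is illustrative.
WEIGHTS = {"title": 0.50, "body": 0.20, "labels": 0.15, "keywords": 0.15}


def text_ratio(a: str, b: str) -> float:
    """Sequence-matching ratio in [0, 1] for two strings (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def jaccard(a: set, b: set) -> float:
    """Set overlap in [0, 1]; two empty sets count as no evidence (0.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def weighted_similarity(issue_a: dict, issue_b: dict) -> dict:
    """Return the per-metric scores plus their weighted total."""
    scores = {
        "title": text_ratio(issue_a["title"], issue_b["title"]),
        # Only the first 500 characters of each body are compared.
        "body": text_ratio(issue_a["body"][:500], issue_b["body"][:500]),
        "labels": jaccard(set(issue_a["labels"]), set(issue_b["labels"])),
        "keywords": jaccard(set(issue_a["keywords"]), set(issue_b["keywords"])),
    }
    scores["total"] = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return scores
```

Because the weights sum to 1, the total stays in the same 0 to 1 range as the individual metrics, which is what lets a single --threshold value apply to the combined score.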

Smart Normalization

Removes URLs, code blocks, and version numbers
Extracts domain-specific keywords (crash, navigation, scaling, DPI, etc.)
Normalizes text for better comparison
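
One way the normalization steps above might look is sketched below. The regular expressions and the keyword tuple are assumptions made for illustration (the keywords are just the examples named above), and the function bodies are not copied from find-duplicates.py even though they reuse the names normalize_text() and extract_keywords().

```python
import re

# Hypothetical keyword list built only from the examples named above.
KEYWORDS = ("crash", "navigation", "scaling", "dpi")

FENCE = "`" * 3  # markdown code-fence marker, built here to keep the pattern readable
CODE_BLOCK = re.compile(FENCE + r".*?" + FENCE, re.DOTALL)
URL = re.compile(r"https?://\S+")
VERSION = re.compile(r"\bv?\d+(?:\.\d+)+\b")  # e.g. 1.2.3 or v1.2.3


def normalize_text(text: str) -> str:
    """Strip code blocks, URLs, and version numbers, then lowercase and collapse whitespace."""
    text = CODE_BLOCK.sub(" ", text)
    text = URL.sub(" ", text)
    text = VERSION.sub(" ", text)
    return " ".join(text.lower().split())


def extract_keywords(text: str) -> set:
    """Return the known keywords that appear as whole words in the text."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return {kw for kw in KEYWORDS if kw in words}
```

Removing version numbers and URLs before comparison keeps two reports of the same bug from looking different merely because they mention different builds or links.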

Understanding Results

The tool generates two output files:

1. JSON File (machine-readable)

Contains detailed similarity scores and metadata for programmatic processing.

2. Text Report (human-readable)

Easy-to-read report with:

Duplicate groups sorted by number of duplicates
Similarity scores and breakdowns
Direct links to all issues
Labels and creation dates
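
For readers wondering how pairwise scores could turn into the groups listed in the report, here is a minimal sketch. It assumes that pairs at or above the threshold are merged transitively (union-find) and that the lowest issue number becomes the primary issue. Both choices are assumptions, and the tool itself may group differently.

```python
from collections import defaultdict


def group_duplicates(pairs, threshold=0.7):
    """Group issue numbers linked by pairs scoring at or above the threshold.

    pairs: iterable of (issue_a, issue_b, score) tuples.
    Returns a list of (primary, [duplicates]), largest group first.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, score in pairs:
        if score >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                # Keep the lower (older) issue number as the group root.
                parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = defaultdict(list)
    for issue in parent:
        groups[find(issue)].append(issue)

    result = [
        (primary, sorted(m for m in members if m != primary))
        for primary, members in groups.items()
    ]
    # Largest groups first; ties broken by primary issue number.
    result.sort(key=lambda group: (-len(group[1]), group[0]))
    return result
```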

Example Report Section

Group 1: 3 potential duplicates
--------------------------------------------------------------------------------
Primary Issue: #5247
Title: [Problem/Bug]: The UI of an application appears to be frozen when user changes system Scaling
URL: https://github.com/MicrosoftEdge/WebView2Feedback/issues/5247
Created: 2025-05-19T11:39:36Z
Labels: bug

Potential Duplicates:
  - #5248 (Similarity: 79.0%)
    Title: [Problem/Bug]: The UI of an application sporadically appears to be frozen...
    URL: https://github.com/MicrosoftEdge/WebView2Feedback/issues/5248
    Breakdown: Title=0.78, Body=0.65, Labels=1.00, Keywords=0.80

Interpreting Similarity Scores

90-100%: Almost certainly duplicates - investigate immediately
80-89%: Likely duplicates - high priority review
70-79%: Possibly duplicates - manual review recommended
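60-69%: Might be related - check whether the issues describe the same problem (shown only when the threshold is lowered below the 0.7 default)
Below 60%: Likely different issues (hidden unless the threshold is lowered further)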
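
The bands above map directly onto a small lookup. The helper below is a hypothetical convenience for post-processing scores, not part of the tool. It takes scores on the 0 to 1 scale used by --threshold.

```python
def interpret_score(score: float) -> str:
    """Map a similarity score in [0, 1] to the review band described above."""
    if score >= 0.9:
        return "Almost certainly duplicates"
    if score >= 0.8:
        return "Likely duplicates"
    if score >= 0.7:
        return "Possibly duplicates"
    if score >= 0.6:
        return "Might be related"
    return "Likely different issues"
```

Using lower bounds rather than closed ranges avoids leaving gaps for fractional scores such as 0.895.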

Threshold Recommendations

Threshold	Use Case	Expected Results
0.8-0.9	High confidence only	Fewer results, minimal false positives
0.7 (default)	Balanced approach	Good mix of recall and precision
0.6-0.65	Aggressive search	More results, some false positives
0.5-0.55	Exploratory	Many results, requires careful review

Workflow for Closing Duplicates

1. Run the tool with an appropriate threshold (start with 0.7)
2. Review the report, starting with the groups that have the most duplicates
3. Verify duplicates by reading the actual issues
4. Close duplicates by:
- Adding a comment linking to the original issue
- Adding the "duplicate" label
- Closing the issue
5. Track progress to avoid re-analyzing closed issues
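
If the close step is ever scripted, the three actions correspond to three GitHub REST API calls. The sketch below only builds the request plan and sends it through a session object you supply (for example a requests.Session with an Authorization header already set). The helper names, the comment wording, and the idea of scripting this step at all are assumptions, since the tool described here only reports candidates.

```python
API_ROOT = "https://api.github.com"


def close_duplicate_plan(owner: str, repo: str, duplicate: int, original: int) -> list:
    """Return (method, url, json_payload) tuples for closing one duplicate issue."""
    base = f"{API_ROOT}/repos/{owner}/{repo}/issues/{duplicate}"
    return [
        # 1. Comment linking to the original issue.
        ("POST", f"{base}/comments", {"body": f"Duplicate of #{original}"}),
        # 2. Add the "duplicate" label (the label must exist in the repository).
        ("POST", f"{base}/labels", {"labels": ["duplicate"]}),
        # 3. Close the issue.
        ("PATCH", base, {"state": "closed"}),
    ]


def send_plan(session, plan: list) -> None:
    """Send each planned request; session needs a requests-style .request() method."""
    for method, url, payload in plan:
        response = session.request(method, url, json=payload)
        response.raise_for_status()
```

Keeping plan construction separate from sending makes it easy to print the plan for a human to confirm before anything is changed on GitHub.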

GitHub API Rate Limits

Without token: 60 requests/hour
With token: 5000 requests/hour

To create a token:

1. Go to GitHub Settings → Developer settings → Personal access tokens
2. Generate a new token with the public_repo scope
3. Pass it with the --token flag or set the GITHUB_TOKEN environment variable
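
A script that calls the GitHub API typically turns the token into request headers along these lines. This is a sketch, not the tool's code. The function name is hypothetical, and giving an explicit token precedence over GITHUB_TOKEN is an assumption about how the two options interact.

```python
import os


def github_headers(token=None) -> dict:
    """Build GitHub REST API headers, falling back to GITHUB_TOKEN when no token is passed."""
    token = token or os.environ.get("GITHUB_TOKEN")
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        # Authenticated requests get the higher hourly limit.
        headers["Authorization"] = f"Bearer {token}"
    return headers
```

Setting GITHUB_TOKEN in the environment keeps the token out of shell history, which passing it via --token does not.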

Tips for Best Results

Start small: Test with --max-issues 100 first
Review manually: The tool suggests duplicates, but human judgment is essential
Check labels: Issues with identical labels are more likely to be true duplicates
Consider dates: Usually keep the older issue and close newer ones
Look for patterns: Multiple issues from the same user might need different handling
Document decisions: Add comments explaining why issues were marked as duplicates

Common Duplicate Patterns

Common duplicate patterns in this repository include:

Scaling/DPI issues: Multiple reports of UI freezing or incorrect sizing with DPI changes
Navigation failures: Various forms of navigation not working
Authentication issues: Different symptoms of SSO/auth problems
Performance issues: Memory leaks, crashes, freezes
Feature requests: Same feature requested multiple times

Advanced Usage

Analyzing Specific Issue Types

# Filtering by issue type (e.g. bugs only) is not built in:
# the tool fetches all open issues, so pre-filtering requires manual work

# Analyze with very high threshold for obvious duplicates
python tools/find-duplicates.py --threshold 0.85

# Quick analysis of recent issues
python tools/find-duplicates.py --max-issues 300 --threshold 0.7

Integrating with CI/CD

The tool can be integrated into automated workflows:

# Generate report on schedule
python tools/find-duplicates.py --output reports/$(date +%Y-%m-%d)-duplicates.json

Troubleshooting

Rate Limit Errors

Solution: Use a GitHub token, or wait for the rate limit to reset (up to 1 hour)
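
GitHub reports the reset time in its response headers, so a script can work out how long to wait instead of guessing. The helper below is a hypothetical sketch. It reads the X-RateLimit-Remaining and X-RateLimit-Reset headers and assumes they are passed with exactly that capitalization (requests' own header mapping is case-insensitive, a plain dict is not).

```python
import time


def seconds_until_reset(headers, now=None) -> float:
    """Seconds to wait before retrying, based on GitHub's rate-limit headers."""
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    if remaining > 0:
        return 0.0  # quota left, no need to wait
    # X-RateLimit-Reset is the reset time in UTC epoch seconds.
    reset_at = int(headers.get("X-RateLimit-Reset", 0))
    now = time.time() if now is None else now
    return max(0.0, reset_at - now)
```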

No Duplicates Found

Solution: Lower the threshold (try 0.6 or 0.65)

Too Many False Positives

Solution: Raise the threshold (try 0.8) or focus on specific issue types

Script Errors

Solution: Ensure Python 3.7+ and the requests library are installed

Contributing Improvements

To enhance the duplicate detection:

Add keywords: Edit extract_keywords() in find-duplicates.py
Adjust weights: Modify similarity weights in calculate_similarity()
Improve normalization: Enhance the normalize_text() function
Add metrics: Implement additional similarity algorithms

Files Created

tools/
├── find-duplicates.py      # Main duplicate detection script
├── README.md               # Detailed tool documentation
├── requirements.txt        # Python dependencies
├── run.sh                  # Quick start script
├── .gitignore              # Ignore output files
├── duplicate-issues.json   # Output (generated)
└── duplicate-issues.txt    # Report (generated)

Support

For questions or issues with this tool:

Check the tool's README in tools/README.md
Review example output and threshold recommendations
Open an issue with the tools or meta label

License

This tool is part of the WebView2Feedback repository and follows the same license terms.
