This repository now includes a tool to help identify duplicate issues. With over 1200 open issues, finding duplicates manually is challenging. The tool automates the search by scoring pairs of issues on title, body, label, and keyword similarity.
- Python 3.7 or higher
- requests library (will be installed automatically)
- (Optional) GitHub Personal Access Token for higher API rate limits
cd tools
pip install -r requirements.txt
# Run with default settings (analyzes all open issues)
python tools/find-duplicates.py
# Or use the helper script
cd tools && ./run.sh
# Analyze only the 200 most recent issues with a lower threshold
python tools/find-duplicates.py --max-issues 200 --threshold 0.6
# Use a GitHub token for higher rate limits
python tools/find-duplicates.py --token YOUR_TOKEN
# Save results to a custom location
python tools/find-duplicates.py --output results/my-analysis.json
The tool analyzes issues using multiple similarity metrics:
- Title Similarity (50% weight): Compares issue titles using sequence matching
- Body Similarity (20% weight): Analyzes first 500 characters of issue descriptions
- Label Similarity (15% weight): Compares issue labels (bug, feature request, etc.)
- Keyword Similarity (15% weight): Detects WebView2-specific keywords
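The weighting above can be sketched as follows. This is a minimal illustration, not the tool's actual code: only the four weights and the 500-character body limit come from this document. The use of `difflib.SequenceMatcher` for text, Jaccard overlap for labels and keywords, and the names `text_ratio`, `jaccard`, and `combined_score` are assumptions.

```python
from difflib import SequenceMatcher

# Weights as documented above; the rest of this sketch is an assumption.
WEIGHTS = {"title": 0.50, "body": 0.20, "labels": 0.15, "keywords": 0.15}


def text_ratio(a: str, b: str) -> float:
    """Sequence-matching similarity in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def jaccard(a: set, b: set) -> float:
    """Set overlap in [0, 1]; two empty sets are treated as no evidence (0.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def combined_score(issue_a: dict, issue_b: dict) -> float:
    """Weighted sum of the four per-metric similarities."""
    parts = {
        "title": text_ratio(issue_a["title"], issue_b["title"]),
        # Only the first 500 characters of each body are compared.
        "body": text_ratio(issue_a["body"][:500], issue_b["body"][:500]),
        "labels": jaccard(set(issue_a["labels"]), set(issue_b["labels"])),
        "keywords": jaccard(set(issue_a["keywords"]), set(issue_b["keywords"])),
    }
    return sum(WEIGHTS[name] * value for name, value in parts.items())
```

Because the title carries half of the total weight, two issues with near-identical titles can reach a high score even when their bodies differ considerably.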
- Removes URLs, code blocks, and version numbers
- Extracts domain-specific keywords (crash, navigation, scaling, DPI, etc.)
- Normalizes text for better comparison
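A rough sketch of these preprocessing steps, assuming simple regular expressions. The patterns, the sample keyword tuple, and the function names `normalize` and `keywords_in` are illustrative assumptions; the tool's own `normalize_text()` and `extract_keywords()` may work differently.

```python
import re

FENCE = "`" * 3  # Markdown code-fence marker
CODE_BLOCK = re.compile(re.escape(FENCE) + r".*?" + re.escape(FENCE), re.DOTALL)
URL = re.compile(r"https?://\S+")
VERSION = re.compile(r"\b\d+(?:\.\d+)+\b")  # e.g. 1.0.2592.51

# Hypothetical subset of domain keywords, taken from the examples above.
SAMPLE_KEYWORDS = ("crash", "navigation", "scaling", "dpi")


def normalize(text: str) -> str:
    """Strip code blocks, URLs, and version numbers, then lowercase and collapse whitespace."""
    text = CODE_BLOCK.sub(" ", text)
    text = URL.sub(" ", text)
    text = VERSION.sub(" ", text)
    return " ".join(text.lower().split())


def keywords_in(text: str) -> set:
    """Return the sample keywords that appear as whole words in the normalized text."""
    words = set(re.findall(r"[a-z0-9]+", normalize(text)))
    return {kw for kw in SAMPLE_KEYWORDS if kw in words}
```

Removing code blocks before keyword extraction means that a keyword appearing only inside a pasted stack trace or snippet is not counted in this sketch.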
The tool generates two output files:
duplicate-issues.json: Contains detailed similarity scores and metadata for programmatic processing.
duplicate-issues.txt: Easy-to-read report with:
- Duplicate groups sorted by number of duplicates
- Similarity scores and breakdowns
- Direct links to all issues
- Labels and creation dates
Group 1: 3 potential duplicates
--------------------------------------------------------------------------------
Primary Issue: #5247
Title: [Problem/Bug]: The UI of an application appears to be frozen when user changes system Scaling
URL: https://github.com/MicrosoftEdge/WebView2Feedback/issues/5247
Created: 2025-05-19T11:39:36Z
Labels: bug
Potential Duplicates:
- #5248 (Similarity: 85.0%)
Title: [Problem/Bug]: The UI of an application sporadically appears to be frozen...
URL: https://github.com/MicrosoftEdge/WebView2Feedback/issues/5248
Breakdown: Title=0.78, Body=0.65, Labels=1.00, Keywords=0.80
- 90-100%: Almost certainly duplicates - investigate immediately
- 80-89%: Likely duplicates - high priority review
- 70-79%: Possibly duplicates - manual review recommended
- 60-69%: Might be related - check if issues describe same problem
- Below 60%: Likely different issues (not shown by default)
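The bands above translate directly into a small lookup. The sketch below assumes scores are fractions between 0 and 1, as in the breakdown values of the sample report; the function name `interpret` is hypothetical.

```python
def interpret(score: float) -> str:
    """Map a similarity score in [0, 1] to the review bands listed above."""
    if score >= 0.9:
        return "almost certainly duplicates"
    if score >= 0.8:
        return "likely duplicates"
    if score >= 0.7:
        return "possibly duplicates"
    if score >= 0.6:
        return "might be related"
    return "likely different issues"
```

Comparing against fractional cut-offs rather than multiplying by 100 first avoids floating-point surprises at the band edges.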
| Threshold | Use Case | Expected Results |
|---|---|---|
| 0.8-0.9 | High confidence only | Fewer results, minimal false positives |
| 0.7 (default) | Balanced approach | Good mix of recall and precision |
| 0.6-0.65 | Aggressive search | More results, some false positives |
| 0.5-0.55 | Exploratory | Many results, requires careful review |
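To see how the threshold trades result count against precision, consider this small sketch. The issue numbers and scores are made up for illustration, and whether the tool treats the threshold as inclusive is an assumption here.

```python
# (issue_a, issue_b, score) tuples; values are invented for illustration only.
SAMPLE_PAIRS = [
    (101, 102, 0.91),
    (101, 103, 0.74),
    (104, 105, 0.62),
    (106, 107, 0.51),
]


def filter_pairs(pairs, threshold=0.7):
    """Keep pairs scoring at or above the threshold, highest score first."""
    kept = [pair for pair in pairs if pair[2] >= threshold]
    return sorted(kept, key=lambda pair: pair[2], reverse=True)


for threshold in (0.8, 0.7, 0.6, 0.5):
    print(threshold, len(filter_pairs(SAMPLE_PAIRS, threshold)))
```

With this sample data, each step down in threshold admits one more pair, which mirrors the table: lower thresholds surface more candidates but require more manual review.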
- Run the tool with appropriate threshold (start with 0.7)
- Review the report starting with groups with most duplicates
- Verify duplicates by reading the actual issues
- Close duplicates by:
  - Adding a comment linking to the original issue
  - Adding "duplicate" label
  - Closing the issue
- Track progress to avoid re-analyzing closed issues
- Without token: 60 requests/hour
- With token: 5000 requests/hour
To create a token:
- Go to GitHub Settings → Developer settings → Personal access tokens
- Generate new token with public_repo scope
- Use with --token flag or set GITHUB_TOKEN environment variable
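As an illustration of how a token might be picked up and attached to API requests, here is a minimal sketch. It uses the standard library's `urllib` instead of `requests` so that it runs without extra dependencies. The function name `build_request`, the precedence of an explicit token over `GITHUB_TOKEN`, and the `User-Agent` value are assumptions, not the tool's confirmed behavior.

```python
import os
import urllib.request

# Issues endpoint of the GitHub REST API for this repository.
API_URL = "https://api.github.com/repos/MicrosoftEdge/WebView2Feedback/issues"


def build_request(token=None, environ=os.environ):
    """Build an API request, authenticated when a token is available.

    An explicit token (for example from --token) is assumed to take
    precedence over the GITHUB_TOKEN environment variable.
    """
    token = token or environ.get("GITHUB_TOKEN")
    headers = {
        "Accept": "application/vnd.github+json",
        "User-Agent": "find-duplicates-sketch",  # hypothetical client name
    }
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(API_URL, headers=headers)
```

Without the `Authorization` header the request counts against the unauthenticated limit of 60 requests per hour.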
- Start small: Test with --max-issues 100 first
- Review manually: The tool suggests duplicates, but human judgment is essential
- Check labels: Issues with identical labels are more likely to be true duplicates
- Consider dates: Usually keep the older issue and close newer ones
- Look for patterns: Multiple issues from same user might need different handling
- Document decisions: Add comments explaining why issues were marked as duplicates
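The "keep the older issue" tip can be sketched as a sort on creation time. The `created_at` field name and the `split_group` helper are assumptions for illustration; the timestamp format matches the `Created:` value shown in the sample report.

```python
from datetime import datetime


def parse_created(timestamp: str) -> datetime:
    """Parse a GitHub-style UTC timestamp such as 2025-05-19T11:39:36Z."""
    return datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")


def split_group(issues):
    """Return (issue_to_keep, issues_to_close), keeping the oldest issue."""
    ordered = sorted(issues, key=lambda issue: parse_created(issue["created_at"]))
    return ordered[0], ordered[1:]


# Hypothetical group; numbers and dates are invented for illustration.
group = [
    {"number": 2, "created_at": "2025-05-20T08:00:00Z"},
    {"number": 1, "created_at": "2025-05-19T11:39:36Z"},
]
keep, close = split_group(group)
print(keep["number"], [issue["number"] for issue in close])
```

This is only a default: a newer issue with a better reproduction or more discussion may still be the better one to keep.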
Based on the repository, common duplicate issues include:
- Scaling/DPI issues: Multiple reports of UI freezing or incorrect sizing with DPI changes
- Navigation failures: Various forms of navigation not working
- Authentication issues: Different symptoms of SSO/auth problems
- Performance issues: Memory leaks, crashes, freezes
- Feature requests: Same feature requested multiple times
# Focus on bugs only (fetch manually filtered)
# Note: The tool fetches all open issues; pre-filtering requires manual work
# Analyze with very high threshold for obvious duplicates
python tools/find-duplicates.py --threshold 0.85
# Quick analysis of recent issues
python tools/find-duplicates.py --max-issues 300 --threshold 0.7
The tool can be integrated into automated workflows:
# Generate report on schedule
python tools/find-duplicates.py --output reports/$(date +%Y-%m-%d)-duplicates.json
- Rate limit exceeded: Use a GitHub token or wait for the rate limit to reset (1 hour)
- Too few duplicates found: Lower the threshold (try 0.6 or 0.65)
- Too many false positives: Raise the threshold (try 0.8) or focus on specific issue types
- Script fails to start: Ensure Python 3.7+ and requests library are installed
To enhance the duplicate detection:
- Add keywords: Edit extract_keywords() in find-duplicates.py
- Adjust weights: Modify similarity weights in calculate_similarity()
- Improve normalization: Enhance normalize_text() function
- Add metrics: Implement additional similarity algorithms
tools/
├── find-duplicates.py # Main duplicate detection script
├── README.md # Detailed tool documentation
├── requirements.txt # Python dependencies
├── run.sh # Quick start script
├── .gitignore # Ignore output files
├── duplicate-issues.json # Output (generated)
└── duplicate-issues.txt # Report (generated)
For questions or issues with this tool:
- Check the tool's README in tools/README.md
- Review example output and threshold recommendations
- Open an issue with the tools or meta label
This tool is part of the WebView2Feedback repository and follows the same license terms.