Add end to end tests for the SitemapInjector #52

Draft
lfoppiano wants to merge 2 commits into cc from feature/test-sitemap-hreflang

lfoppiano commented Apr 16, 2026

@sebastian-nagel I've managed, not without problems, to get an end-to-end test running around the SitemapInjector. However, this is a draft PR and the tests are still failing; see my question below.

The test should simulate the process via protocol-file, so that Nutch can run the SitemapInjector on any sitemap, load the URLs into the CrawlDB, and verify the result.

I've created two tests, one for the KPMG sitemap and one for your sitemap.xml example.
I'm having trouble understanding how to assert the number of URLs, e.g. what is counted by Hadoop:

2026-04-16 22:01:02,576 INFO o.a.n.c.SitemapInjector [Thread-43] Found 1474 URLs in file:/Users/lfoppiano/development/projects/cc/nutch/src/testresources/sitemaps/sitemap.example.1.xml
2026-04-16 22:01:02,588 INFO o.a.n.c.SitemapInjector [Thread-43] Injected total 4630 URLs for file:/Users/lfoppiano/development/projects/cc/nutch/src/testresources/sitemaps/sitemap.example.1.xml
2026-04-16 22:01:02,812 INFO o.a.n.c.SitemapInjector [main] SitemapInjector:     3156  sitemap_extension_localized_link
2026-04-16 22:01:02,812 INFO o.a.n.c.SitemapInjector [main] SitemapInjector:        1  sitemap_type_xml
2026-04-16 22:01:02,812 INFO o.a.n.c.SitemapInjector [main] SitemapInjector:        1  sitemaps_processed
2026-04-16 22:01:02,812 INFO o.a.n.c.SitemapInjector [main] SitemapInjector:     4630  urls_from_sitemaps_injected
2026-04-16 22:01:04,665 INFO o.a.n.c.SitemapInjector [Thread-160] Found 866 URLs in file:/Users/lfoppiano/development/projects/cc/nutch/src/testresources/sitemaps/sitemap.example.2.xml
2026-04-16 22:01:04,668 INFO o.a.n.c.SitemapInjector [Thread-160] Injected total 2598 URLs for file:/Users/lfoppiano/development/projects/cc/nutch/src/testresources/sitemaps/sitemap.example.2.xml
2026-04-16 22:01:05,357 INFO o.a.n.c.SitemapInjector [main] SitemapInjector:     1732  sitemap_extension_localized_link
2026-04-16 22:01:05,358 INFO o.a.n.c.SitemapInjector [main] SitemapInjector:        1  sitemap_type_xml
2026-04-16 22:01:05,358 INFO o.a.n.c.SitemapInjector [main] SitemapInjector:        1  sitemaps_processed
2026-04-16 22:01:05,358 INFO o.a.n.c.SitemapInjector [main] SitemapInjector:     2598  urls_from_sitemaps_injected

and what is accumulated in the CrawlDB when I check it in the test:

[...]
        SitemapInjector sitemapInjector = new SitemapInjector();
        sitemapInjector.setConf(conf);
        sitemapInjector.inject(crawldbPath, urlPath);

        List<String> injected = readCrawldb();

I did not find any way to manually count the expected URLs and compare the result with injected.size(), but I'm surely missing something here.
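
A note on the numbers in the log above: in both runs the "Injected total" figure equals the "Found" figure plus the sitemap_extension_localized_link counter (1474 + 3156 = 4630 and 866 + 1732 = 2598), so the counter appears to include every hreflang alternate, repeats included. If the CrawlDB keeps one entry per distinct URL, injected.size() could be lower than that. A minimal sketch of an independent count, using only the JDK's DOM parser, might look like the following. `SitemapUrlCounter` and `allUrls` are hypothetical names, not part of Nutch, and the sketch assumes the hreflang alternates are `xhtml:link` elements with an `href` attribute. It does not apply Nutch's URL normalizers or filters, which may change the final count further.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical test helper, not part of Nutch.
class SitemapUrlCounter {

    static final String SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9";
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Returns every URL in the sitemap: all <loc> values, then all xhtml:link hrefs. */
    static List<String> allUrls(InputStream sitemap) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // required for the namespace-based lookups below
        Document doc = factory.newDocumentBuilder().parse(sitemap);

        List<String> urls = new ArrayList<>();

        // Primary URLs: <url><loc>...</loc></url>
        NodeList locs = doc.getElementsByTagNameNS(SITEMAP_NS, "loc");
        for (int i = 0; i < locs.getLength(); i++) {
            urls.add(locs.item(i).getTextContent().trim());
        }

        // Localized alternates: <xhtml:link rel="alternate" hreflang=".." href=".."/>
        NodeList links = doc.getElementsByTagNameNS(XHTML_NS, "link");
        for (int i = 0; i < links.getLength(); i++) {
            String href = ((Element) links.item(i)).getAttribute("href").trim();
            if (!href.isEmpty()) {
                urls.add(href);
            }
        }
        return urls;
    }

    public static void main(String[] args) throws Exception {
        // Two pages that list each other (and themselves) as alternates.
        String xml =
            "<urlset xmlns='" + SITEMAP_NS + "' xmlns:xhtml='" + XHTML_NS + "'>"
            + "<url><loc>https://example.com/en/</loc>"
            + "<xhtml:link rel='alternate' hreflang='en' href='https://example.com/en/'/>"
            + "<xhtml:link rel='alternate' hreflang='de' href='https://example.com/de/'/></url>"
            + "<url><loc>https://example.com/de/</loc>"
            + "<xhtml:link rel='alternate' hreflang='en' href='https://example.com/en/'/>"
            + "<xhtml:link rel='alternate' hreflang='de' href='https://example.com/de/'/></url>"
            + "</urlset>";

        List<String> all = allUrls(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        int distinct = new HashSet<>(all).size();

        // prints "6 total, 2 distinct"
        System.out.println(all.size() + " total, " + distinct + " distinct");
    }
}
```

In a test, the total would be the candidate to compare against urls_from_sitemaps_injected, and the distinct count the candidate to compare against injected.size(), under the assumptions stated above.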
