Submit delete batches in parallel#3123
Conversation
5094aad to
25c3c5a
Compare
ArcticDB Code Review SummaryBuild and Dependencies
API and Compatibility
Documentation
|
25c3c5a to
9852aeb
Compare
9852aeb to
0614e82
Compare
c7ded90 to
31f99ea
Compare
|
|
||
| template<std::ranges::range R> | ||
| requires std::same_as<std::remove_cvref_t<std::ranges::range_value_t<R>>, VariantKey> | ||
| std::vector<KeyType> key_types(R&& keys) { |
There was a problem hiding this comment.
Feels like the return type of this should be an unordered set?
| for (auto& k : variant_keys) { | ||
| auto collection = collection_name(variant_key_type(k)); | ||
| try { | ||
| auto result = client_->remove_keyvalue(db_, collection, k); |
There was a problem hiding this comment.
Does the Mongo client not have a method to delete multiple documents (at least in the same collection) at once?
| public: | ||
| GCPXMLStorage(const LibraryPath& lib, OpenMode mode, const GCPXMLSettings& conf); | ||
|
|
||
| std::optional<size_t> max_delete_batch_size() const override { return std::nullopt; } |
There was a problem hiding this comment.
Google has no limit!?
| for (const auto& k : ks) { | ||
| auto key_type_dir = key_type_folder(root_folder, variant_key_type(k)); | ||
| to_delete.emplace_back(object_path(bucketizer.bucketize(key_type_dir, k), k)); | ||
| } |
There was a problem hiding this comment.
Could use std::ranges::transform
| auto distinct_key_types = key_types(ks); | ||
| stat_timers.reserve(distinct_key_types.size()); | ||
| for (auto kt : distinct_key_types) { | ||
| stat_timers.emplace_back(query_stats::add_task_count_and_time(query_stats::TaskType::S3_DeleteObjects, kt)); |
There was a problem hiding this comment.
I guess the query stats will now look odd for a batch-delete of mixed key types, but I'm not sure users can even trigger this in practice so not worth worrying about
|
|
||
| static std::vector<std::vector<entity::VariantKey>> chunk_keys( | ||
| std::vector<entity::VariantKey>&& keys, size_t batch_size | ||
| ) { |
There was a problem hiding this comment.
This is a pattern we have dotted all over the place, could template the element type and stick it in the utils folder somewhere
| ); | ||
| return folly::collect(std::move(futs)).via(&async::io_executor()).thenValue([](auto&&) { | ||
| return std::vector<RemoveKeyResultType>{}; | ||
| }); |
There was a problem hiding this comment.
https://github.com/man-group/ArcticDB/blob/master/cpp/arcticdb/stream/stream_sink.hpp#L49-L54
Don't have to, but if you want to rip out this pointless abstraction while you're here I won't object
| keys = {}; | ||
| keys.reserve(flush_threshold); | ||
| const auto batch_size = batch.size(); | ||
| store->remove_keys(std::move(batch)).get(); |
There was a problem hiding this comment.
There is a slightly more elegant model that would remove the need for flush_threshold . Instead of calling get immediately, we maintain a set of remove_keys futures we've submitted. Folly let's you check if a future is done without calling get() using isReady(), so the size of this collection could be kept bounded. It's quite a bit more complicated though, and probably doesn't add much in this case.
At least worth a comment about why remove_keys is better than remove_keys_sync in this case though.
| with ( | ||
| config_context("S3Storage.DeleteBatchSize", batch_size), | ||
| config_context("AzureStorage.DeleteBatchSize", batch_size), | ||
| ): |
There was a problem hiding this comment.
There's a config_context_multi for setting multiple options at once
|
|
||
|
|
||
| @pytest.mark.storage | ||
| def test_bulk_delete_batching(object_and_mem_and_lmdb_version_store, sym): |
There was a problem hiding this comment.
This is quite an indirect test of the batching. I get that we don't have query stats for Azure right now, but the test below is much more explicit
Discovered while working on delayed deletes. I created a library on PURE with 1000 symbols, with ~200 versions each and 10 snapshots.
Then running delayed deletes had these results:
Notably running delayed deletes in "dry run" mode (which skips the physical deletion) showed no performance difference between arcticc and ArcticDB, for example on a small example:
arcticc submitted
RemoveBatchTaskin chunks of 1000 https://mangit.maninvestments.com/projects/DATA/repos/arcticc/browse/arcticcxx_impl/async/async_store.hpp#238 but #70 changed it to this implementation.If we delete 10k keys, the old implementation in arcticc would submit 10 tasks to delete 1000 keys in parallel, whereas ArcticDB currently submits 10 HTTP requests to delete 1000 keys in serial.
This explains most of the performance difference between arcticc delayed deletes and enterprise delayed deletes.
To stop storage specific deletion batch size limits leaking out of the storage layer, this PR has each storage declare its maximum deletion batch size, which is then used by the
AsyncStorewhen it calculates the batches.