Add parallel reduction supports for RowIterator and NamedTupleIterator #187
tkf wants to merge 3 commits into JuliaData:master
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##           master     #187      +/-   ##
==========================================
+ Coverage   96.71%   96.80%   +0.09%
==========================================
  Files           6        6
  Lines         456      469      +13
==========================================
+ Hits          441      454      +13
  Misses         15       15
Can you share a bit more on the motivation here? Obviously, we want to be cautious/wary of taking on new dependencies when it will also affect so many downstream packages. Questions that pop up in my mind:
Hi, thanks for the response and sorry for this late reply.
Yes, I understand this and I should've clarified what SplittablesBase.jl is. Essentially I am hoping I probably should open an RFC in JuliaLang/julia but I've been a bit hesitant to do so since I don't feel like this interface is tested outside my packages. I thought of this PR as a step toward accumulating such experience.
The example in the OP with FLoops.jl is one thing. We'd also be able to use ThreadsX.jl with this. I think ThreadsX.jl + OnlineStats.jl integration would appeal to Tables.jl users. Underneath, they all boil down to
If it is already an array, the generic fallback in SplittablesBase.jl covers it already. So, I don't need to add a specific implementation for
My intention is to make it very minimal, although I have to put the implementation for
I think it's almost 1.0-ready, but there is one specification of an optional API
If you want to postpone merging this at least until SplittablesBase.jl hits 1.0, I think that's a very reasonable decision. I can extract this PR into a separate package, SplittableTables.jl, for this to work (by touching the internals of Tables.jl a bit). But it'd be nice if we can tweak
This is a really useful PR, but I see @quinnj's point about introducing an extra dependency on all of the downstream dependents of Tables. Maybe it would make more sense to have a separate package
I'm not sure this is true, given
I think SplittablesBase.jl is fine, given it's a very small dependency. cc @quinnj and @MasonProtter -- being able to use JuliaFolds with Tables and DataFrames would be awesome.
This PR implements the SplittablesBase.jl interface `halve` on `RowIterator` and `NamedTupleIterator`. This lets us use parallel reductions built on top of SplittablesBase.jl, such as Transducers.jl, ThreadsX.jl, and FLoops.jl:

A tricky part of this PR is that, since `SplittablesTesting.test_ordered` uses `isequal` to compare items (rows), I needed to relax `isequal` to ignore the storage type of columns. The difference is that
is `false` before this PR and `true` after this PR. I think it makes sense for `ColumnsRow` to be compared as if they are lowered to `NamedTuple`s. This is also consistent with equality on arrays ignoring the type

What do you think?
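To make the idea of "halve, then reduce in parallel" concrete, here is a minimal sketch in plain Julia. It does not use SplittablesBase.jl or Tables.jl, and it is not the implementation in this PR: `halve_columns`, `nrows`, `rows`, and `par_mapreduce` are hypothetical names introduced only for illustration, and the table is assumed to be a `NamedTuple` of equal-length, non-empty vectors.

```julia
# Hypothetical stand-in for a `halve`-style split on a column table
# (a NamedTuple of equal-length vectors). Not the PR's implementation.
function halve_columns(cols::NamedTuple)
    n = length(first(cols))
    mid = n ÷ 2
    # `map` over a NamedTuple keeps the column names; views avoid copying.
    left = map(c -> view(c, 1:mid), cols)
    right = map(c -> view(c, mid+1:n), cols)
    return left, right
end

# Number of rows in a column table (assumes all columns have equal length).
nrows(cols::NamedTuple) = length(first(cols))

# Lazily iterate the rows of a column table as NamedTuples.
rows(cols::NamedTuple{names}) where {names} =
    (NamedTuple{names}(map(c -> c[i], values(cols))) for i in 1:nrows(cols))

# Divide-and-conquer mapreduce: split until a chunk has at most `basesize`
# rows, reduce each chunk sequentially, then combine the partial results
# with `op`, which is assumed to be associative.
function par_mapreduce(f, op, cols::NamedTuple; basesize::Int = 2)
    if nrows(cols) <= basesize
        return mapreduce(f, op, rows(cols))
    end
    left, right = halve_columns(cols)
    # The right half runs as a separate task; the left half runs here.
    task = Threads.@spawn par_mapreduce(f, op, right; basesize = basesize)
    l = par_mapreduce(f, op, left; basesize = basesize)
    return op(l, fetch(task))
end

table = (a = [1, 2, 3, 4, 5], b = [10.0, 20.0, 30.0, 40.0, 50.0])
total = par_mapreduce(row -> row.a * row.b, +, table)
```

The point of the split-based design is that the reduction only needs a way to cut the row iterator in two, so the same recursion would work for any row source that can be halved cheaply. Left-to-right order of the partial results is preserved, so `op` has to be associative but not commutative.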