You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Unicode Tools does a lot of data processing, and thus it experiences many similar concerns of other data processing projects as their requirements grow.
Common problems include complexity and performance, although they manifest in different concrete ways that are also intertwined. For example:
performance - as the size of datasets grow, the time it takes to compute grows, so we try to solve the problem other ways like concurrency (ex: make tests faster #1365)
caching - caching is usually introduced to speed up performance, but is notorious for adding complexity due to statefulness (ex: stale state)
global mutable state - controlled global state is not necessarily a problem (ex: databases), but when it's mutable, it limits our ability to attempt concurrency (ex: mvn test serially again #1376). Not allowing concurrency will prevent solving performance problems.
multiple versions of data - multiple related datasets / versions can cause people, when writing data processing code, to choose between performance vs. simpler separate of code. Do you update mutable data in place, or do you create an entirely separate copy to represent a different version or transformation output?
Possible Solution(s)
The one technique out of the relevant few that goes a long way to addressing these issues is to use persistent data structures. Persistent data structures are:
immutable
create a reference to the result of any transformation, without affecting the reference to the input
use clever structural sharing to store multiple "snapshots" of transformed data efficiently size-wise in memory
(modern implementations) optimize the design and implementation details to make basic operations amortized O(1) runtime (ex: get, put, at, append)
Persistent data structures can go a long way to eliminating or mitigating the previous coding pitfalls we experience, in terms of complexity and performance by virtue of the fact that they're immutable. Further efficiency concerns are handled through the optimizations of modern implementations.
There are other techniques that can be combined to make code more concurrency-ready and easier to reason about, but persistent data structures are basically a starting point and are already are sufficiently effective at addressing the original concerns.
Possible Implementations for persistent data structures
Given that we have an established codebase, we aren't trying to rewrite anything, let alone in wholesale. These ideas might be useful to consider in smaller self-contained use cases, if needed in the future. Given that we have a Java codebase, and Java is only slowly inching towards being a functional programming-ish language, there are only a few third-party libraries that already implement persistent data structures in a Java-native way:
Paguro - perhaps the simplest & most practical library
Bifurcan - the most rigorous library, but perhaps slightly too pedantic to be the easiest option
Pcollections - feels like a Potemkin village solution -- the API and behavior are technically like persistent data structures, but the implementation just makes whole copies rather than use structural sharing, let alone optimizations thereof. The implementation is closer to a plain immutable collection library like from Guava.
Background
Unicode Tools does a lot of data processing, and thus it experiences many similar concerns of other data processing projects as their requirements grow.
Common problems include complexity and performance, although they manifest in different concrete ways that are also intertwined. For example:
Possible Solution(s)
The one technique out of the relevant few that goes a long way to addressing these issues is to use persistent data structures. Persistent data structures are:
For details about persistent data structures, see 10 mins in this video.
Persistent data structures can go a long way to eliminating or mitigating the previous coding pitfalls we experience, in terms of complexity and performance by virtue of the fact that they're immutable. Further efficiency concerns are handled through the optimizations of modern implementations.
There are other techniques that can be combined to make code more concurrency-ready and easier to reason about, but persistent data structures are basically a starting point and are already are sufficiently effective at addressing the original concerns.
Possible Implementations for persistent data structures
Given that we have an established codebase, we aren't trying to rewrite anything, let alone in wholesale. These ideas might be useful to consider in smaller self-contained use cases, if needed in the future. Given that we have a Java codebase, and Java is only slowly inching towards being a functional programming-ish language, there are only a few third-party libraries that already implement persistent data structures in a Java-native way: