Skip to content

Alleviate coding issues via simpler data structures #1377

Description

@echeran

Background

Unicode Tools does a lot of data processing, and thus it experiences many similar concerns of other data processing projects as their requirements grow.

Common problems include complexity and performance, although they manifest in different concrete ways that are also intertwined. For example:

  • performance - as the size of datasets grow, the time it takes to compute grows, so we try to solve the problem other ways like concurrency (ex: make tests faster #1365)
  • caching - caching is usually introduced to speed up performance, but is notorious for adding complexity due to statefulness (ex: stale state)
  • global mutable state - controlled global state is not necessarily a problem (ex: databases), but when it's mutable, it limits our ability to attempt concurrency (ex: mvn test serially again #1376). Not allowing concurrency will prevent solving performance problems.
  • multiple versions of data - multiple related datasets / versions can cause people, when writing data processing code, to choose between performance vs. simpler separate of code. Do you update mutable data in place, or do you create an entirely separate copy to represent a different version or transformation output?

Possible Solution(s)

The one technique out of the relevant few that goes a long way to addressing these issues is to use persistent data structures. Persistent data structures are:

  1. immutable
  2. create a reference to the result of any transformation, without affecting the reference to the input
  3. use clever structural sharing to store multiple "snapshots" of transformed data efficiently size-wise in memory
  4. (modern implementations) optimize the design and implementation details to make basic operations amortized O(1) runtime (ex: get, put, at, append)

For details about persistent data structures, see 10 mins in this video.

Persistent data structures can go a long way to eliminating or mitigating the previous coding pitfalls we experience, in terms of complexity and performance by virtue of the fact that they're immutable. Further efficiency concerns are handled through the optimizations of modern implementations.

There are other techniques that can be combined to make code more concurrency-ready and easier to reason about, but persistent data structures are basically a starting point and are already are sufficiently effective at addressing the original concerns.

Possible Implementations for persistent data structures

Given that we have an established codebase, we aren't trying to rewrite anything, let alone in wholesale. These ideas might be useful to consider in smaller self-contained use cases, if needed in the future. Given that we have a Java codebase, and Java is only slowly inching towards being a functional programming-ish language, there are only a few third-party libraries that already implement persistent data structures in a Java-native way:

  1. Paguro - perhaps the simplest & most practical library
  2. Bifurcan - the most rigorous library, but perhaps slightly too pedantic to be the easiest option
  3. Pcollections - feels like a Potemkin village solution -- the API and behavior are technically like persistent data structures, but the implementation just makes whole copies rather than use structural sharing, let alone optimizations thereof. The implementation is closer to a plain immutable collection library like from Guava.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions