Alleviate coding issues via simpler data structures

## Background

Unicode Tools does a lot of data processing, and thus it experiences many similar concerns of other data processing projects as their requirements grow.

Common problems include complexity and performance, although they manifest in different concrete ways that are also intertwined. For example:

* performance - as the size of datasets grow, the time it takes to compute grows, so we try to solve the problem other ways like concurrency (ex: #1365)
* caching - caching is usually introduced to speed up performance, but is notorious for adding complexity due to statefulness (ex: stale state)
* global mutable state - controlled global state is not necessarily a problem (ex: databases), but when it's mutable, it limits our ability to attempt concurrency (ex: #1376). Not allowing concurrency will prevent solving performance problems.
* multiple versions of data - multiple related datasets / versions can cause people, when writing data processing code, to choose between performance vs. simpler separate of code. Do you update mutable data in place, or do you create an entirely separate copy to represent a different version or transformation output?

## Possible Solution(s)

The one technique out of the relevant few that goes a long way to addressing these issues is to use persistent data structures. Persistent data structures are:

1. immutable
2. create a reference to the result of any transformation, without affecting the reference to the input
3. use clever structural sharing to store multiple "snapshots" of transformed data efficiently size-wise in memory
4. (modern implementations) optimize the design and implementation details to make basic operations amortized O(1) runtime (ex: get, put, at, append)

For details about persistent data structures, see [10 mins in this video](https://youtu.be/hb3rurFxrZ8?si=FrbbXe1FEGPLwXuT&t=1451).

Persistent data structures can go a long way to eliminating or mitigating the previous coding pitfalls we experience, in terms of complexity and performance by virtue of the fact that they're immutable. Further efficiency concerns are handled through the optimizations of modern implementations.

There are other techniques that can be combined to make code more concurrency-ready and easier to reason about, but persistent data structures are basically a starting point and are already are sufficiently effective at addressing the original concerns.

### Possible Implementations for persistent data structures

Given that we have an established codebase, we aren't trying to rewrite anything, let alone in wholesale. These ideas might be useful to consider in smaller self-contained use cases, if needed in the future. Given that we have a Java codebase, and Java is only slowly inching towards being a functional programming-ish language, there are only a few third-party libraries that already implement persistent data structures in a Java-native way:

1. [Paguro](https://github.com/GlenKPeterson/Paguro) - perhaps the simplest & most practical library
2. [Bifurcan](https://github.com/lacuna/bifurcan) - the most rigorous library, but perhaps slightly too pedantic to be the easiest option
3. [Pcollections](https://github.com/hrldcpr/pcollections) - feels like a Potemkin village solution -- the API and behavior are technically like persistent data structures, but the implementation just makes whole copies rather than use structural sharing, let alone optimizations thereof. The implementation is closer to a plain [immutable collection library like from Guava](https://github.com/google/guava/wiki/ImmutableCollectionsExplained).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Alleviate coding issues via simpler data structures #1377

Background

Possible Solution(s)

Possible Implementations for persistent data structures

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Uh oh!

Alleviate coding issues via simpler data structures #1377

Description

Background

Possible Solution(s)

Possible Implementations for persistent data structures

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions