feat: produce OpenTelemetry traces with hs-opentelemetry#3140
develop7 wants to merge 2 commits into PostgREST:main

Conversation
Awesome work! 🔥 🔥
Found this Nix flake that contains an OTel GUI: https://flakestry.dev/flake/github/FriendsOfOpenTelemetry/opentelemetry-nix/1.0.1
I'll try to integrate that once the PR is ready for review.
(force-pushed 8c0e16a to 64a0ee9)
The recent problem I'm seemingly stuck with is […]. There's a more straightforward […]. It also seems to boil down to the conceptual choice between online and offline trace delivery, i.e. the push vs. pull model. @steve-chavez @wolfgangwalther @laurenceisla what do you think, guys?
@develop7 Would vault help? It was introduced in #1988; I recall it helped with IORef handling. It's still used in postgrest/src/PostgREST/Auth.hs (lines 160 to 165 at d2fb67f). I'm still not that familiar with OTel, but the basic idea I had was to store these traces on AppState and export them async.
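A rough sketch of that "store on AppState, export async" idea, for illustration only: every name below is a hypothetical stand-in (this is not the real PostgREST `AppState`), spans are plain strings, and the exporter is a background thread draining an `IORef` buffer.

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Monad (forever)
import Data.IORef

-- Hypothetical span buffer living in AppState (not the real PostgREST field)
newtype AppState = AppState { stateSpans :: IORef [String] }

-- Handlers push finished spans into the buffer instead of exporting inline
recordSpan :: AppState -> String -> IO ()
recordSpan st s = modifyIORef' (stateSpans st) (s :)

-- A background exporter drains the buffer periodically and hands the
-- spans to whatever export action is supplied
startExporter :: AppState -> ([String] -> IO ()) -> IO ()
startExporter st export = () <$ forkIO (forever $ do
  threadDelay 1000000  -- flush once a second
  spans <- atomicModifyIORef' (stateSpans st) (\xs -> ([], xs))
  export (reverse spans))

main :: IO ()
main = do
  st <- AppState <$> newIORef []
  recordSpan st "request"
  recordSpan st "query"
  -- a single manual drain here, standing in for one exporter tick
  spans <- atomicModifyIORef' (stateSpans st) (\xs -> ([], xs))
  print (reverse spans)
```

The `atomicModifyIORef'` swap keeps the drain race-free against concurrent `recordSpan` calls without needing an MVar.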
(force-pushed 6b891c2 to 586e7a1)
Not only that, you want traces in tests too, for one. The good news is […]
Good call @steve-chavez, thank you for the suggestion. Will try it too.
(force-pushed 0830a45 to dc882f1)
Since now we have an […]:

postgrest/src/PostgREST/App.hs (lines 170 to 172 at 229bc77)
postgrest/src/PostgREST/Observation.hs (lines 15 to 18 at 229bc77)

Perhaps we can add some observations for the timings? Also, the Logger is now used like:

postgrest/src/PostgREST/Logger.hs (lines 53 to 54 at 7c6c056)
postgrest/src/PostgREST/CLI.hs (line 50 at 7c6c056)

For OTel, maybe the following would make sense:

```haskell
otelState <- Otel.init
App.run appState (Logger.logObservation loggerState >> OTel.tracer otelState)
```
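The composition suggested above works because an observation callback is just a function. A minimal standalone version of the idea, with `Observation` stubbed as a `String` (in PostgREST it's a sum type) and hypothetical handler names:

```haskell
-- Observation stubbed as a String; in PostgREST.Observation it's a sum type
type Observation = String
type ObservationHandler = Observation -> IO ()

-- Two hypothetical sinks standing in for Logger.logObservation / OTel.tracer
logObservation :: ObservationHandler
logObservation o = putStrLn ("log: " <> o)

traceObservation :: ObservationHandler
traceObservation o = putStrLn ("span: " <> o)

-- Handlers compose by sequencing: one observation fans out to both sinks
both :: ObservationHandler
both o = logObservation o >> traceObservation o

main :: IO ()
main = both "DBConnected"
```

Adding a third sink later is just another `>>` in the composed handler; the app code emitting observations doesn't change.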
(force-pushed dc882f1 to 7794848)
Agreed, server timings definitely belong there.
(force-pushed 7794848 to 398206b)
(force-pushed 398206b to 4cd99c6)
Okay, the PR has been cooking for long enough; let's pull the plug and start small. Let's have it reviewed while I'm fixing the remaining CI failures.
(force-pushed 4cd99c6 to 94d2b9b)
I don't think we depend on this in the current state. And we should certainly not depend on an even-less-maintained fork of the same. So to go forward here, there needs to be some effort put into the upstream package first, to make it usable for us.
(force-pushed 590d142 to e809a65)
A status update: […]
Hm. I looked at your fork. It depends on support for GHC 9.8 in […]. I guess for GHC 9.8 support it's just a matter of time. What about the other issues mentioned above? Were you able to make progress on those?
In my prototype I actually played with replacing the Hasql Session with an https://github.com/haskell-effectful/effectful based monad to make it extensible: https://github.com/mkleczek/hasql-api/blob/master/src/Hasql/Api/Eff/Session.hs#L37. Using it in PostgREST required some mixins usage in Cabal: 29b946e#diff-eb6a76805a0bd3204e7abf68dcceb024912d0200dee7e4e9b9bce3040153f1e1R140. Some work was required in PostgREST startup/configuration code to set up appropriate effect handlers and middlewares, but the changes were quite well isolated. At the end of the day, I think basing your monad stack on an effect library (effectful, cleff, etc.) is the way forward, as it makes the solution highly extensible and configurable.
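effectful isn't currently a PostgREST dependency, so as an illustration only, here is the same extensibility idea reduced to a plain record-of-handlers sketch over base (the actual proposal swaps this record for an effect row; all names here are made up):

```haskell
import Data.IORef

-- A record of handlers standing in for an effect row: swapping the record
-- swaps the interpretation, which is the extensibility being discussed
newtype SqlHandler = SqlHandler { runStatement :: String -> IO () }

-- A test interpretation that records statements instead of hitting a database;
-- a production handler would run them through Hasql instead
recordingHandler :: IORef [String] -> SqlHandler
recordingHandler ref = SqlHandler (\s -> modifyIORef' ref (++ [s]))

main :: IO ()
main = do
  ref <- newIORef []
  let h = recordingHandler ref
  runStatement h "select 1"
  runStatement h "select 2"
  readIORef ref >>= print
```

An effect library gives the same swap-the-interpreter property, plus composition of multiple effects and static dispatch, at the cost of a new dependency.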
(force-pushed e809a65 to 4697009)
(force-pushed 650d008 to ac33872)
Update: rebased the PR against latest main.
@develop7 […]
I tested the feature in Honeycomb and locally using otel-tui and otel-desktop-viewer, and from what I can see it's working on all of them 🎉! I executed the following (the first, commented-out env vars are for Honeycomb):

```shell
# OTEL_EXPORTER_OTLP_ENDPOINT="https://api.honeycomb.io:443" \
# OTEL_EXPORTER_OTLP_HEADERS="x-honeycomb-team=<REDACTED>" \
OTEL_EXPORTER_OTLP_ENDPOINT='http://localhost:4318' \
OTEL_EXPORTER_OTLP_PROTOCOL='http/protobuf' \
OTEL_SERVICE_NAME='PostgREST' \
OTEL_LOG_LEVEL='debug' \
OTEL_TRACES_SAMPLER='always_on' \
PGRST_SERVER_OTEL_ENABLED=true \
PGRST_DB_AGGREGATES_ENABLED=true \
PGRST_DB_PLAN_ENABLED=true \
PGRST_SERVER_TIMING_ENABLED=true \
PGRST_LOG_LEVEL=info \
postgrest-with-pg-17 -f ./test/spec/fixtures/load.sql postgrest-run
```

Print screens: […]

The only thing out of the ordinary is that […]
@mkleczek it's absolutely worth trying to implement; will look into it.
I'll add that to the examples section of the docs; what collector did you use, BTW?
Seems like an upstream issue, which they just fixed in CtrlSpice/otel-desktop-viewer#203.
Oh, I didn't use a collector, it sent data directly to Honeycomb. I'll try it out with one and come back with the info.
docs/integrations/opentelemetry.rst (outdated)

```
OTEL_TRACES_SAMPLER='always_on' \
postgrest
```
> Since the current OpenTelemetry implementation incurs a small (~6% in our "Loadtest (mixed)" suite) […]
Since OTel is only applied to the timing headers, maybe the loss in perf is not from OTel but from the timing headers? The measured loss matches the one reported in #3410 (comment)
I've enabled both OTel and timings for loadtests temporarily, let's see.
-12% throughput in "loadtest (jwt-hs)", -1% in "Loadtest (mixed)", -9% in "Loadtest (jwt-rsa)"
looks like it's suffering from both
> I've enabled both OTel and timings for loadtests temporarily, let's see.
I think we should enable only OTel in a commit to confirm -- or will OTel do nothing if server timing is not enabled?
> will OTel do nothing if server timing is not enabled?
No, OTel would produce traces regardless of whether server timings are enabled or not, as designed.
While both the `withOTel` and `withTiming` helpers should be abstracted out into a single call (say, `observe` or something), that would be a refactor for PRs to come.
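For illustration only, such a single `observe` call could just nest the two wrappers. `withOTel`/`withTiming` are print stubs here, not PostgREST's real helpers: the real ones would open an OTel span and record a Server-Timing entry.

```haskell
-- Print stub: the real withOTel would wrap the action in an OTel span
withOTel :: String -> IO a -> IO a
withOTel l act =
  putStrLn ("span start: " <> l) *> act <* putStrLn ("span end: " <> l)

-- Stub: the real withTiming would measure and record the duration
withTiming :: String -> IO a -> IO a
withTiming _ act = act

-- The proposed single entry point, composing both wrappers
observe :: String -> IO a -> IO a
observe l = withOTel l . withTiming l

main :: IO ()
main = observe "request" (putStrLn "handling request")
```

Call sites then only ever mention `observe`, so adding or removing an instrumentation backend is a one-place change.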
@steve-chavez I must have missed it, could you reiterate what you would like to see in the docs?
- Links to the type of spans that are produced: it's kind of documented at https://opentelemetry.io/docs/specs/semconv/http/http-spans/ & https://opentelemetry.io/docs/specs/semconv/cli/cli-spans/ (choose all that apply), #3140 (comment)
- That server-timing enabled produces more spans, as mentioned on https://github.com/PostgREST/postgrest/pull/3140#discussion_r2795588328
- That OTel support is not finished in PostgREST and there are some things missing?
> Links to the type of spans that are produced

The upstream (hs-opentelemetry) doesn't have those, so why should we, given we're not introducing any new ones? Is it our job to explain OpenTelemetry basics to users? Out of scope IMO.

> That server-timing enabled produces more spans as mentioned on

That's not what happens: server-timing introduces an extra response header, period, not more spans (which are not sent to the client, BTW). It's just that the interesting places for (experimental) OTel support happened to be the interesting places for server-timing.

> That OTel support is not finished in PostgREST and there are some things missing?

Great point, done, thanks.
> it's not what happens: server-timing introduces an extra response header, period, not more spans

Oh, I misremembered that it only sent two spans when server-timing was disabled, so yeah, you're right here. It doesn't matter if server-timing is enabled or disabled; it sends them as in this example in both cases:
I made a test using the OpenTelemetry Collector and it sent the data just fine to Honeycomb.
@mkleczek I've also added […]
As mentioned before, this PR is already too long; we shouldn't be adding more to it. It's a lot to review as it is.
@steve-chavez done, extracted it to #4666
An idea to test this. Could we have […]? I'm imagining the test could be similar to how we capture our schema cache snapshots (ref).
@develop7 is this something that can be done here? The manual tests are working right now as mentioned in #3140 (comment), so maybe we could implement this test later on if it's not that feasible to do here? (cc. @steve-chavez)
re: tests with an actual otel collector — I've managed to prototype collector-including tests that compile and fail, and now I'm working on making them pass and be useful.
The test is in; I chose to manage the collector binary from Haskell for a change.
```
@@ -0,0 +1,22 @@
[
```
Q: What does this test? 🤔
So my expectation was:
- To have a test that has a client doing some requests
- Capture the generated OTel output in a file
Is it possible to get close to that?
The test does one request, generates 5 (6, actually) spans, and makes sure that 1) all the spans of the types specified in this file are present in the output, and 2) they have spans of the specified type as their parent (here, all of them have the request span as parent). Since spans are located by their unique spanId and parentSpanId respectively, this test also makes sure spans are linked properly; it doesn't test for the absence of stray spans, though, which matches the real-life behavior IMO. This file is kind of sparse, I admit, but that's because the data in traces are either volatile (timestamps, IDs, timings) or fixed through the whole test (library name, SDK name, programming language, etc.). I've attached a sample collector output file for reference, feel free to look it up and suggest more things to test: scratch_4.json

Come to think of it, I'd throw in matching against the SQL query; let me try writing it to the trace.
> I've attached a sample collector output file for the reference, feel free to look it up and suggest more things to test: scratch_4.json
Many thanks for that. By looking at the file I'm finally starting to understand OTel concepts (what's a resource, what's scope, etc).
How do you generate the json file? The way I'm thinking to test this:

1. Generate the otel json file with some command.
2. Use `jq` to filter out the non-deterministic parts.
3. Then compare it to a snapshotted file.

I'm not sure if Haskell is the right tool here, because so far it looks like it results in more code. If we could do it with `jq`, it would be less to maintain, it seems.
> How do you generate the json file?
I start an otelcol OpenTelemetry collector, supplying it a configuration file generated by OTelHelper.collectorConfig; start a PostgREST instance wired to said collector instance; then perform requests against the PostgREST instance; shut everything down, both PostgREST and the collector (so traces get flushed for sure); and take the output file containing the traces' JSON from the temporary directory.
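For reference only, an otelcol configuration along those lines might look like this. It's a sketch, not the output of OTelHelper.collectorConfig: the port and file path are made up, and the `file` exporter requires a collector distribution that includes it (e.g. otelcol-contrib).

```yaml
# Sketch: receive OTLP traces over HTTP and dump them to a JSON file
receivers:
  otlp:
    protocols:
      http:
        endpoint: 127.0.0.1:4318   # made-up port
exporters:
  file:
    path: /tmp/traces.json        # made-up path; the test uses a temp dir
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [file]
```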
> If we could do it with `jq` it would be less to maintain it seems.

Maybe with optics instead? We do have optics-core & optics-extras through dependencies already, so no build-time hit. Let me try that.
> Maybe with optics instead?

Nope, same amount of code, but harder to read :)
Discovered a way to relay metrics to the OpenTelemetry collector with otelcol's own https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/prometheusreceiver; added a rough config draft to the OpenTelemetry page. This should help work around the lack of native metrics in hs-opentelemetry for now. @steve-chavez @wolfgangwalther I've realized […]
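A sketch of what such a relay config might look like. Assumptions here: PostgREST's Prometheus metrics are scraped from the admin server port (PGRST_ADMIN_SERVER_PORT, 3001 in this sketch), the scrape interval is arbitrary, and the forwarding endpoint is made up; the actual draft is on the OpenTelemetry docs page.

```yaml
# Sketch: scrape PostgREST's Prometheus metrics and forward them as OTLP
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: postgrest
          scrape_interval: 15s
          static_configs:
            - targets: ['localhost:3001']   # PGRST_ADMIN_SERVER_PORT
exporters:
  otlphttp:
    endpoint: http://localhost:4318         # made-up destination
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [otlphttp]
```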
* Introduces producing OpenTelemetry traces with hs-opentelemetry.
* Adds OTel spans over the whole application loop and over each request processing phase
* Preliminary OTel tracing support in spec tests
* Disables tracing in load and memory tests





This PR introduces producing OpenTelemetry traces containing, among others, the same metrics as in the `Server-Timing` header from before.

TODO:

* build with Nix as well (for now Stack only): DONE for free
* make an example of exporting log messages: `hs-opentelemetry` doesn't support logging, per Logging roadmap iand675/hs-opentelemetry#100
* make `getTracer` available globally (we're interested in using as many different spans as it makes sense, so `getTracer` should be available everywhere, as described in `hs-opentelemetry-sdk`'s README): seems impossible without major refactoring
* `hs-opentelemetry-wai` middleware
* look into failing Windows builds: `hs-opentelemetry-sdk` depends on `unix`, tracking in Windows support iand675/hs-opentelemetry#109

Running:
I sort of gave up deploying and configuring all the moving bits locally, so you'd need to create a honeycomb.io account for this one (or ask me for the invite). After that, it's quite straightforward:
1. Build with `stack build`, and get the binary's path with `stack exec -- which postgrest`
2. Enter `nix-shell`, then run `postgrest-with-postgresql-15 --fixture ./test/load/fixture.sql -- cat`. Note the server URL, you'll need it when running the PostgREST server
3. `postgrest-jwt --exp 36000 postgrest_test_anonymous`

Tests
hspec tests are also instrumented; for those to produce traces you need to set the `OTEL_*` vars only: