21 changes: 13 additions & 8 deletions docker/Makefile
@@ -1,4 +1,4 @@
.PHONY: up down wait-compose-ready restart logs postgres build-hyperlane-cli build-celestia-devnet-images push-celestia-devnet-images
.PHONY: up down wait-compose-ready restart logs postgres enable-chaos-std enable-chaos-brutal restart-toxiproxy build-hyperlane-cli build-celestia-devnet-images push-celestia-devnet-images

PROJECT_ROOT := $(shell git rev-parse --show-toplevel)
DOCKER_COMPOSE_DIR := $(PROJECT_ROOT)/docker
@@ -47,20 +47,25 @@ postgres:
@echo "Starting fresh PostgreSQL environment..."
docker compose -f $(DOCKER_COMPOSE_DIR)/docker-compose.postgres.yml up

CHAOS_DIR := $(PROJECT_ROOT)/scripts/chaos
DOCKER_CHAOS_ENV := LISTEN_ADDR=0.0.0.0 POSTGRES_UPSTREAM=host.docker.internal:5432

# Steady DA latency on the primary rollup (toxi_scenario P5).
enable-chaos-std:
./toxiproxy/remove_toxics.sh
./toxiproxy/enable_toxics.sh
./toxiproxy/status_chaos.sh
$(CHAOS_DIR)/toxi_scenario.sh clear primary
$(CHAOS_DIR)/toxi_scenario.sh scenario P5 primary
$(CHAOS_DIR)/toxi_scenario.sh list

# Compound failure: DA latency + postgres resets (toxi_scenario P6).
enable-chaos-brutal:
./toxiproxy/remove_toxics.sh
TIMEOUT_RATIO=0.9 LATENCY_RATIO=0.1 LIMIT_DATA_RATIO=0.4 ./toxiproxy/enable_toxics.sh
./toxiproxy/status_chaos.sh
$(CHAOS_DIR)/toxi_scenario.sh clear primary
$(CHAOS_DIR)/toxi_scenario.sh scenario P6 primary
$(CHAOS_DIR)/toxi_scenario.sh list

restart-toxiproxy:
docker compose restart toxiproxy
sleep 5
TOXIPROXY_HOST="localhost" ./toxiproxy/configure.sh
$(DOCKER_CHAOS_ENV) $(CHAOS_DIR)/toxi_apply_config.sh

# Needs docker logged in
# echo $GITHUB_TOKEN | docker login ghcr.io -u USERNAME --password-stdin
62 changes: 34 additions & 28 deletions docker/README.md
@@ -63,46 +63,52 @@ updated during consecutive runs.

## Chaos Engineering

[Toxiproxy](https://github.com/Shopify/toxiproxy) enables chaos engineering by simulating network failures and instabilities.
Use it to test how the rollup behaves when the connection to celestia-node is unreliable.
[Toxiproxy](https://github.com/Shopify/toxiproxy) sits between the rollup and its
upstreams (Postgres + Celestia DA RPC/gRPC) so we can inject latency, timeouts, and
connection resets. The toxiproxy scripts live in [`../scripts/chaos/`](../scripts/chaos/)
and work for both baremetal and docker — baremetal is the default; docker mode is
selected via env vars.

### Setup

1. Uncomment the toxiproxy service in [`docker-compose.yml`](./docker-compose.yml)
2. Configure your rollup to connect to port `26659` (proxied) instead of `26658` (direct)
1. Uncomment the `toxiproxy` service in [`docker-compose.yml`](./docker-compose.yml).
2. Start it: `docker compose up -d toxiproxy`.
3. Populate the seven proxies from the host shell:
```bash
LISTEN_ADDR=0.0.0.0 POSTGRES_UPSTREAM=host.docker.internal:5432 \
../scripts/chaos/toxi_apply_config.sh
```
4. Point the rollup at the proxied ports (`5433` for postgres, `26678` for celestia
RPC, `9091` for celestia gRPC).
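
After step 3 it can help to confirm that all seven proxies actually exist before pointing the rollup at them. The sketch below is not part of the chaos scripts: `missing_proxies` is a hypothetical helper, and the proxy names are assumed from the port comments in `docker-compose.yml`. It relies only on toxiproxy's `GET /proxies` admin endpoint, which returns a JSON object keyed by proxy name.

```bash
# Proxy names assumed from the port comments in docker-compose.yml.
EXPECTED_PROXIES="postgres_1 postgres_2 postgres_3 celestia_rpc_1 celestia_rpc_2 celestia_grpc_1 celestia_grpc_2"

# Print every expected proxy name that does not appear (as a quoted string)
# in the JSON passed as $1, i.e. the body returned by GET /proxies.
missing_proxies() {
  local json="$1" name
  for name in $EXPECTED_PROXIES; do
    case "$json" in
      *"\"$name\""*) ;;        # name found, nothing to report
      *) echo "$name" ;;       # name absent from the response
    esac
  done
}

# Against a live toxiproxy (admin API on 8474, as published in docker-compose.yml):
#   missing_proxies "$(curl -fsS http://127.0.0.1:8474/proxies)"
```

Empty output means every expected proxy is present; any printed name suggests re-running `toxi_apply_config.sh`.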

### Usage

The proxy starts without any network toxics enabled. Use the provided scripts to control network conditions:
Apply chaos via the scenario CLI (or the `make` shortcuts below):

```bash
# Enable standard toxics (light network issues)
docker/toxiproxy/enable_standard_toxics.sh
# Steady DA latency on the primary rollup
make enable-chaos-std

# Enable brutal toxics (severe network issues)
docker/toxiproxy/enable_brutal_toxics.sh
# Compound failure: DA latency + postgres connection resets
make enable-chaos-brutal

# Remove all toxics (restore normal network)
docker/toxiproxy/remove_toxics.sh

# Check current toxic status
docker/toxiproxy/status_chaos.sh
# Or run scenarios directly:
../scripts/chaos/toxi_scenario.sh scenario P5 primary
../scripts/chaos/toxi_scenario.sh clear all
../scripts/chaos/toxi_scenario.sh list
```

Available toxic types include latency, timeouts, connection resets, and bandwidth limiting.
This allows you to test rollup resilience under various network failure scenarios.
See `../scripts/chaos/toxi_scenario.sh --help` for the full list of named toxics
(`rpc-latency`, `rpc-timeout`, `pg-reset`, `pg-latency`) and scenarios (`P1`–`P7`,
`R1`–`R2`).
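
For one-off experiments outside the named scenarios, a toxic can also be added through toxiproxy's admin API directly. This is a rough sketch and does not reflect how `toxi_scenario.sh` works internally: `latency_toxic_json` is a hypothetical helper, and the toxic name `da-latency` and the target proxy `celestia_rpc_1` are assumptions. The payload fields (`type`, `stream`, `toxicity`, and the `latency`/`jitter` attributes in milliseconds) follow toxiproxy's `POST /proxies/{proxy}/toxics` endpoint.

```bash
# Build the JSON body for a latency toxic.
# $1 = toxic name, $2 = latency in ms, $3 = jitter in ms (defaults to 0).
latency_toxic_json() {
  local name="$1" latency_ms="$2" jitter_ms="${3:-0}"
  printf '{"name":"%s","type":"latency","stream":"downstream","toxicity":1.0,"attributes":{"latency":%d,"jitter":%d}}\n' \
    "$name" "$latency_ms" "$jitter_ms"
}

# Apply it to a proxy on a live toxiproxy (proxy name assumed):
#   latency_toxic_json da-latency 500 50 |
#     curl -fsS -X POST -d @- http://127.0.0.1:8474/proxies/celestia_rpc_1/toxics
# Remove it again:
#   curl -fsS -X DELETE http://127.0.0.1:8474/proxies/celestia_rpc_1/toxics/da-latency
```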

### Troubleshooting

**Toxiproxy crashes when adding toxics:**
- This happens when trying to add toxics to a proxy with active connections
- Solution: Restart toxiproxy and try again:
```bash
docker compose restart toxiproxy
# Wait a few seconds, then try adding toxics again
```

**Best practices:**
- Add toxics immediately after starting toxiproxy, before connections are established
- Use the remove script to clean up toxics before stopping services
- Monitor toxiproxy logs for crash indicators: `docker compose logs toxiproxy`
**Toxiproxy crashes when adding toxics to a proxy with active connections:**
```bash
make restart-toxiproxy
```
This restarts the container and re-populates all seven proxies in one shot.

Add toxics immediately after starting toxiproxy, before connections are established.
Tail logs with `docker compose logs toxiproxy`.
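
The `restart-toxiproxy` target waits a fixed five seconds before re-populating the proxies. Where that proves flaky, polling the admin API is one possible alternative. The sketch below is an illustration only: `wait_until` is a hypothetical helper, and the 30-attempt budget is an arbitrary choice. It uses toxiproxy's `GET /version` endpoint as a readiness probe.

```bash
# Retry a command until it succeeds or the attempt budget is exhausted.
# $1 = max attempts, $2 = delay in seconds between attempts, rest = command.
wait_until() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    "$@" >/dev/null 2>&1 && return 0   # command succeeded, stop waiting
    sleep "$delay"
    i=$((i + 1))
  done
  return 1                             # never succeeded within the budget
}

# Possible use after `docker compose restart toxiproxy`:
#   wait_until 30 1 curl -fsS http://127.0.0.1:8474/version &&
#     LISTEN_ADDR=0.0.0.0 POSTGRES_UPSTREAM=host.docker.internal:5432 \
#       ../scripts/chaos/toxi_apply_config.sh
```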
29 changes: 16 additions & 13 deletions docker/docker-compose.yml
@@ -1,21 +1,24 @@
include:
- docker-compose.celestia.yml
# Uncomment this if you want to test network delays/errors
# Uncomment to test network delays/errors against the rollup (see ../scripts/chaos/README.md).
# After `docker compose up -d toxiproxy`, populate the proxies from the host shell:
#
# LISTEN_ADDR=0.0.0.0 POSTGRES_UPSTREAM=host.docker.internal:5432 \
# ../scripts/chaos/toxi_apply_config.sh
#
# Then apply scenarios with ../scripts/chaos/toxi_scenario.sh.
#services:
# toxiproxy:
# image: shopify/toxiproxy:2.1.4
# image: ghcr.io/shopify/toxiproxy:2.12.0
# hostname: toxiproxy
# environment:
# LOG_LEVEL: "debug"
# depends_on:
# - sequencer-0
# ports:
# - "127.0.0.1:26659:26659"
# - "127.0.0.1:8474:8474"
# toxiproxy-config:
# image: curlimages/curl:8.9.1
# depends_on:
# - toxiproxy
# volumes:
# - ./toxiproxy:/opt/toxiproxy
# command: [ "/opt/toxiproxy/configure.sh" ]
# - "127.0.0.1:8474:8474" # admin API
# - "127.0.0.1:5433:5433" # postgres_1
# - "127.0.0.1:5434:5434" # postgres_2
# - "127.0.0.1:5435:5435" # postgres_3
# - "127.0.0.1:26678:26678" # celestia_rpc_1
# - "127.0.0.1:26679:26679" # celestia_rpc_2
# - "127.0.0.1:9091:9091" # celestia_grpc_1
# - "127.0.0.1:9092:9092" # celestia_grpc_2
15 changes: 0 additions & 15 deletions docker/toxiproxy/configure.sh

This file was deleted.

87 changes: 0 additions & 87 deletions docker/toxiproxy/enable_toxics.sh

This file was deleted.

46 changes: 0 additions & 46 deletions docker/toxiproxy/remove_toxics.sh

This file was deleted.

28 changes: 0 additions & 28 deletions docker/toxiproxy/status_chaos.sh

This file was deleted.
