# The generate_traces.py Tool

> 📖 **Estimated reading time:** 10 minutes

## Overview

`scripts/generate_traces.py` generates synthetic Langfuse-compatible trace datasets from a statistical schema. This lesson walks through its implementation, both as a practical tool and as a worked example of the statistical techniques that make synthetic data useful.

The full source is at `scripts/generate_traces.py` in this repo; we'll cover the key sections here.

## Usage

```bash
# Basic: 100 traces to stdout
python3 scripts/generate_traces.py --count 100

# Reproducible: fixed seed, file output
python3 scripts/generate_traces.py --count 500 --seed 2026 --output synthetic-500.ndjson

# Custom error rate and services
python3 scripts/generate_traces.py --count 200 --error-rate 0.15 --services api,ogma,anansi,firefly

# Help
python3 scripts/generate_traces.py --help
```

## The Statistical Core

The most important part of the generator is how it creates realistic values. Here's how latency is generated:

```python
def lognormal_ms(p50: float, p95: float) -> int:
    """
    Generate a log-normal latency value calibrated to real p50/p95.

    Log-normal is the right distribution for latencies because:
    - It's always positive (no negative latencies)
    - It has a long right tail (matches real p99 spikes)
    - It matches observed production latency distributions
    """
    # Convert p50/p95 to log-normal parameters.
    # The median of a log-normal is exp(mu), so mu = ln(p50).
    mu = math.log(p50)
    # p95 = exp(mu + 1.645 * sigma) → solve for sigma
    # (1.645 is the z-score of the 95th percentile of a standard normal)
    sigma = (math.log(p95) - mu) / 1.645
    return max(1, int(random.lognormvariate(mu, sigma)))


# Usage
span_latency = lognormal_ms(p50=200, p95=800)  # milliseconds
```

Token counts use the same approach:

```python
input_tokens = lognormal_ms(p50=600, p95=1800)   # typical prompt size
output_tokens = lognormal_ms(p50=200, p95=600)   # typical completion size
```

This is the key insight: **calibrating the distribution to real p50/p95 values** is what makes synthetic data feel real.
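As a quick sanity check, you can confirm the calibration empirically: draw many samples and compare the observed median and 95th percentile against the targets. The sketch below restates `lognormal_ms` so it runs standalone; the sample size is illustrative:

```python
import math
import random
import statistics


def lognormal_ms(p50: float, p95: float) -> int:
    """Log-normal sample calibrated so median ≈ p50 and 95th percentile ≈ p95."""
    mu = math.log(p50)
    sigma = (math.log(p95) - mu) / 1.645
    return max(1, int(random.lognormvariate(mu, sigma)))


random.seed(0)  # fixed seed so the check is reproducible
samples = sorted(lognormal_ms(p50=200, p95=800) for _ in range(100_000))

observed_p50 = statistics.median(samples)
observed_p95 = samples[int(0.95 * len(samples))]
print(observed_p50, observed_p95)  # both land close to 200 and 800
```

With 100k samples the observed percentiles sit within a few percent of the calibration targets, which is exactly the property that makes the generated latencies plausible.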
If you know your actual p50 and p95 from production, you can generate synthetic data that matches your system's characteristics.

## The Trace Structure

Each generated trace matches the Langfuse batch ingest format:

```python
def generate_trace(
    services: list[str],
    error_rate: float,
    rng: random.Random,
) -> dict:
    trace_id = str(uuid.uuid4())
    is_error = rng.random() < error_rate
    start_time = datetime.datetime.utcnow() - datetime.timedelta(
        seconds=rng.randint(0, 86400)  # spread over last 24h
    )

    # Root trace
    trace = {
        "id": trace_id,
        "name": rng.choice(["story.generate_chapter", "story.create", "agent.run"]),
        "userId": f"user_{rng.randint(1000, 9999)}",
        "sessionId": f"session_{rng.randint(100, 999)}",
        "input": {"request": "generate story chapter"},
        "output": None if is_error else {"status": "success"},
        "metadata": {
            "environment": rng.choice(["production", "staging"]),
            "version": "2.0.0",
        },
        "tags": ["story-generation"],
        "timestamp": start_time.isoformat() + "Z",
    }
    return trace
```

## Multi-Service Cascade Simulation

The most realistic part of the generator is the cascade, which simulates how SandSync's Ogma agent falls through providers:

```python
def generate_cascade_spans(
    trace_id: str,
    start: datetime.datetime,
    rng: random.Random,
) -> list:
    """Simulate the ollama → groq → anthropic cascade."""
    spans = []
    cascade = [
        ("ollama", 0.4),      # 40% chance Ollama is available
        ("groq", 0.85),       # if Ollama fails, 85% chance Groq works
        ("anthropic", 1.0),   # Anthropic always works (final fallback)
    ]
    t = start
    for provider, success_prob in cascade:
        latency = lognormal_ms(p50=300, p95=900)
        succeeded = rng.random() < success_prob
        span = {
            "id": str(uuid.uuid4()),
            "traceId": trace_id,
            "name": f"ogma.{provider}",
            "startTime": t.isoformat() + "Z",
            "endTime": (t + datetime.timedelta(milliseconds=latency)).isoformat() + "Z",
            "attributes": {
                "provider": provider,
                "success": succeeded,
                "latency_ms": latency,
            },
            "status": "OK" if succeeded else "ERROR",
        }
        spans.append(span)
        if succeeded:
            break  # stop the cascade at the first success
        t += datetime.timedelta(milliseconds=latency)
    return spans
```

This produces traces where you can see the exact cascade pattern in the waterfall: the same pattern we saw in real SandSync traces in Module 08.

## Error Injection

Errors are injected with realistic error types, weighted by how commonly they occur:

```python
ERROR_TYPES = [
    ("Request timeout after 30s", 0.30),  # most common
    ("Rate limit exceeded: 100 req/min", 0.25),
    ("Context length exceeded (4K max)", 0.15),
    ("Model overloaded, retrying...", 0.12),
    ("Cascade fallback exhausted", 0.08),
    ("Service unavailable", 0.05),
    ("Invalid API key", 0.03),
    ("Connection refused", 0.02),
]


def pick_error(rng: random.Random) -> str:
    """Weighted sample from ERROR_TYPES (the weights sum to 1.0)."""
    r = rng.random()
    cumulative = 0.0
    for msg, weight in ERROR_TYPES:
        cumulative += weight
        if r < cumulative:
            return msg
    return ERROR_TYPES[-1][0]
```

The weighting means timeout errors make up ~30% of injected errors, which matches real-world LLM API patterns where cold starts and network timeouts dominate over quota errors.

## Deterministic Seeds

The `--seed` flag makes generation fully deterministic:

```bash
# These two commands produce identical output
python3 scripts/generate_traces.py --count 100 --seed 42 > traces-a.ndjson
python3 scripts/generate_traces.py --count 100 --seed 42 > traces-b.ndjson
diff traces-a.ndjson traces-b.ndjson  # no output
```

This matters for reproducibility in course exercises. The `sample_data/synthetic-500.ndjson` file was generated with `--seed 2026`; the exercises in this module reference specific traces by their deterministic IDs.

## The Pre-Generated Dataset

We've already generated a 500-trace dataset at `sample_data/synthetic-500.ndjson`. You can load it directly into Langfuse for the exercises; there's no need to generate your own unless you want to customise it.
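Before loading the file into Langfuse, it can help to inspect it locally. NDJSON is just one JSON object per line, so a summary takes a few lines of stdlib Python. The records inlined below are a made-up stand-in for the real dataset (real records carry the full field set shown in `generate_trace` above); the only convention assumed is that error traces have `output` set to `null`:

```python
import json

# Illustrative stand-in for the contents of sample_data/synthetic-500.ndjson
ndjson_text = """\
{"id": "t1", "name": "story.create", "output": {"status": "success"}}
{"id": "t2", "name": "agent.run", "output": null}
{"id": "t3", "name": "story.create", "output": {"status": "success"}}
"""

# Parse one JSON object per non-empty line
traces = [json.loads(line) for line in ndjson_text.splitlines() if line.strip()]

# Error traces have output = null, per the generate_trace convention above
errors = [t for t in traces if t["output"] is None]

print(f"{len(traces)} traces, {len(errors)} errors "
      f"({len(errors) / len(traces):.0%} error rate)")
```

Point the same loop at the real file (`open("sample_data/synthetic-500.ndjson")`) to check the trace count and error rate before ingesting.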
Characteristics of the pre-generated dataset:

- 500 traces spanning a simulated 24-hour period
- Services: `api`, `ogma`, `anansi`, `firefly`
- Error rate: 5%
- Seed: 2026 (deterministic)
- ~2.1 MB uncompressed

## Summary

- Log-normal distributions calibrated to real p50/p95 values are the key to realistic synthetic latency and token counts
- The cascade simulation generates multi-span traces that match real provider fallback patterns
- Error injection uses weighted sampling to match real-world error type distributions
- `--seed` makes generation deterministic, which is critical for reproducible exercises
- The pre-generated `synthetic-500.ndjson` is ready to use in the exercises below