You are a stability-focused backend engineer using Claude to make failures understandable, traceable, and recoverable in Python systems.
Design Errors with Intent
- Treat exceptions as meaningful signals, not accidents
- Use clear domain-specific error classes for precise handling
- Fail early and loud at trust boundaries; don't hide internal corruption
Reference: https://docs.python.org/3/tutorial/errors.html
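A minimal sketch of what domain-specific error classes and an early failure at a trust boundary might look like. The class names, the `code` and `retryable` attributes, and `parse_quantity` are all hypothetical, chosen only for illustration.

```python
class AppError(Exception):
    """Base class for every error this service raises on purpose (hypothetical)."""

    def __init__(self, message: str, *, code: str, retryable: bool = False):
        super().__init__(message)
        self.code = code            # stable, machine-readable identifier
        self.retryable = retryable  # hint for callers and retry policies


class InvalidOrderError(AppError):
    """The caller sent an order that violates a business rule."""

    def __init__(self, message: str):
        super().__init__(message, code="invalid_order")


class InventoryUnavailableError(AppError):
    """A dependency could not answer; trying again later may work."""

    def __init__(self, message: str):
        super().__init__(message, code="inventory_unavailable", retryable=True)


def parse_quantity(raw) -> int:
    """Validate untrusted input at the trust boundary and fail immediately."""
    try:
        quantity = int(raw)
    except (TypeError, ValueError) as exc:
        # Chain the original exception so the traceback keeps the root cause.
        raise InvalidOrderError(f"quantity must be an integer, got {raw!r}") from exc
    if quantity <= 0:
        raise InvalidOrderError(f"quantity must be positive, got {quantity}")
    return quantity
```

Callers can then catch `InvalidOrderError` specifically, or `AppError` for anything the service raised deliberately, and leave unexpected exceptions to surface on their own.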
Capture Context, Not Just Messages
- Include request identifiers, user intent, and operation metadata
- Avoid leaking private information into logs or traces
- Standardize error payload shapes across all routes
Reference: https://12factor.net/logs
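One way to standardize the error shape while keeping private data out of it, as a sketch. The field names, the `SENSITIVE_KEYS` list, and the `redact` and `error_payload` helpers are assumptions, not a prescribed schema.

```python
import uuid
from typing import Optional

# Assumed list of keys that must never appear in logs or responses.
SENSITIVE_KEYS = {"password", "token", "authorization", "ssn"}


def redact(metadata: dict) -> dict:
    """Replace the values of sensitive keys (top level only) with a marker."""
    return {
        key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        for key, value in metadata.items()
    }


def error_payload(code: str, message: str, *, operation: str,
                  request_id: Optional[str] = None,
                  metadata: Optional[dict] = None) -> dict:
    """Build the single error shape that every route returns and logs."""
    return {
        "error": {
            "code": code,
            "message": message,
            # Reuse the caller's ID when present so logs and responses correlate.
            "request_id": request_id or str(uuid.uuid4()),
            "operation": operation,
            "metadata": redact(metadata or {}),
        }
    }
```

For example, `error_payload("invalid_order", "quantity must be positive", operation="create_order", request_id="req-1", metadata={"sku": "A1", "token": "abc"})` keeps the SKU and replaces the token value with `[REDACTED]`.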
Classify Failures by Impact
- Distinguish infrastructure failures (DB down) from user mistakes (bad input)
- Don't punish users for server faults; respond with a helpful fallback
- Identify "business-critical" failure paths in Claude reviews
- Knowing which errors matter is more important than catching all errors
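The split between user mistakes and infrastructure failures can be sketched as a small classifier. The mapping from built-in exception types to categories, the status codes chosen, and the response wording are assumptions; a real service would classify its own domain errors.

```python
from enum import Enum


class FailureKind(Enum):
    USER_ERROR = "user_error"          # bad input; the caller can fix it
    INFRASTRUCTURE = "infrastructure"  # our side; the caller cannot fix it
    UNKNOWN = "unknown"                # unclassified; treated as a server fault


def classify(exc: BaseException) -> FailureKind:
    """Assumed mapping from built-in exception types to impact categories."""
    if isinstance(exc, ValueError):
        return FailureKind.USER_ERROR
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return FailureKind.INFRASTRUCTURE
    return FailureKind.UNKNOWN


def to_response(exc: BaseException):
    """Return (HTTP status, body) without blaming users for server faults."""
    kind = classify(exc)
    if kind is FailureKind.USER_ERROR:
        # Safe to echo: the message describes the caller's own input.
        return 400, {"error": str(exc)}
    if kind is FailureKind.INFRASTRUCTURE:
        # Hide internals, but tell the user what to do next.
        return 503, {"error": "Service temporarily unavailable. Please retry shortly."}
    return 500, {"error": "Something went wrong on our side."}
```

Only the user-error branch echoes the exception text; server-side details stay in the logs.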
Test How Systems Break
- Simulate dependency outages, timeouts, and partial failures
- Confirm logs and telemetry reflect the failure clearly
- Include these cases in automated regression suites
Reference: https://docs.pytest.org/
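A sketch of a regression test that simulates a dependency timeout and checks both the degraded result and the log record. `fetch_price`, the `client.get_price` method, and the `ListHandler` helper are hypothetical. The test is a plain `test_` function that pytest should collect; it uses only the standard library so it can also be called directly.

```python
import logging
from unittest import mock

logger = logging.getLogger("orders")


def fetch_price(client, sku: str) -> dict:
    """Return a live price, or a clearly marked fallback when pricing times out."""
    try:
        return {"sku": sku, "price": client.get_price(sku), "source": "live"}
    except TimeoutError:
        logger.error("pricing timeout", extra={"sku": sku, "dependency": "pricing"})
        return {"sku": sku, "price": None, "source": "unavailable"}


class ListHandler(logging.Handler):
    """Collect log records in memory so a test can inspect them."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


def test_pricing_timeout_is_logged_and_degrades():
    client = mock.Mock()
    client.get_price.side_effect = TimeoutError("simulated outage")
    handler = ListHandler()
    logger.addHandler(handler)
    try:
        result = fetch_price(client, "A1")
    finally:
        logger.removeHandler(handler)

    # The caller gets a usable, honest fallback instead of an exception.
    assert result == {"sku": "A1", "price": None, "source": "unavailable"}
    # The log names the failing dependency and the affected item.
    assert len(handler.records) == 1
    assert handler.records[0].dependency == "pricing"
    assert handler.records[0].sku == "A1"
```

Under pytest, the built-in `caplog` fixture could replace `ListHandler`.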
Distributed Tracing Signals
- Track cross-service calls with trace & span IDs
- Measure latency inflation from downstream slowness
- Let Claude analyze multi-hop failures by reading traces holistically
Reference: https://opentelemetry.io/
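To show how trace and span IDs tie cross-service calls together, here is a standard-library sketch that passes context in the W3C Trace Context `traceparent` header format (`version-traceid-spanid-flags`). It is a stand-in for illustration, not the OpenTelemetry SDK. `start_span` and `outgoing_headers` are hypothetical helpers, and the validation is simplified (for instance, all-zero IDs are not rejected).

```python
import re
import secrets

# version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def start_span(incoming_traceparent=None) -> dict:
    """Continue the caller's trace if a valid header arrived, else start a new one."""
    match = TRACEPARENT_RE.match(incoming_traceparent or "")
    if match:
        trace_id, parent_span_id = match.group(1), match.group(2)
    else:
        trace_id, parent_span_id = secrets.token_hex(16), None
    return {
        "trace_id": trace_id,              # shared by every hop of one request
        "span_id": secrets.token_hex(8),   # unique to this unit of work
        "parent_span_id": parent_span_id,  # links this hop to its caller
    }


def outgoing_headers(span: dict) -> dict:
    """Headers to attach to a downstream call; flag 01 marks the trace as sampled."""
    return {"traceparent": f"00-{span['trace_id']}-{span['span_id']}-01"}
```

A downstream service that calls `start_span(request_headers.get("traceparent"))` ends up with the same `trace_id` and a `parent_span_id` that points back at the caller, which is what lets a multi-hop failure be read as one journey.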
Logging as a Debugging Contract
- Structure logs as JSON; future you will thank present you
- Write messages for humans, not regex engines
- Rate-limit noisy logs to preserve context in outages
- Good logging tells you "what"; great logging tells you "why"
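A sketch of JSON-structured logs with a simple rate limit, built on the standard `logging` module. The field names (`reason`, `request_id`) and the per-message interval policy are assumptions.

```python
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),                     # the "what"
            "reason": getattr(record, "reason", None),          # the "why", via extra=
            "request_id": getattr(record, "request_id", None),
        }
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


class RateLimitFilter(logging.Filter):
    """Drop repeats of the same message template within `interval` seconds."""

    def __init__(self, interval: float = 5.0, clock=time.monotonic):
        super().__init__()
        self.interval = interval
        self.clock = clock
        self._last_seen = {}  # grows with distinct messages; bound it in real use

    def filter(self, record: logging.LogRecord) -> bool:
        key = (record.name, record.levelno, str(record.msg))
        now = self.clock()
        last = self._last_seen.get(key)
        if last is not None and now - last < self.interval:
            return False  # suppressed: same message seen too recently
        self._last_seen[key] = now
        return True
```

Attach both to a handler with `handler.setFormatter(JsonFormatter())` and `handler.addFilter(RateLimitFilter())`, then log with `logger.error("db timeout", extra={"reason": "pool exhausted", "request_id": "req-1"})`.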
Health, Alerts & Real-Time Insight
- Alert on symptoms users feel, not internal noise
- Pair error alerts with suggested first-actions
- Feed escalations into Claude for diagnosis and blast-radius review
- Alerts should guide, not annoy
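As an illustration of alerting on a symptom users feel (the share of failed requests on a route) and pairing the alert with a first action, here is a rough sketch. The thresholds, the `FIRST_ACTIONS` hints, and the `ErrorRateAlert` class are all hypothetical.

```python
from collections import deque

# Hypothetical runbook hints, keyed by route.
FIRST_ACTIONS = {
    "checkout": "Check payment-provider status, then recent deploys.",
}
DEFAULT_ACTION = "Check recent deploys and dependency health."


class ErrorRateAlert:
    """Track the error rate over the last `window` requests for one route."""

    def __init__(self, route: str, threshold: float = 0.05,
                 window: int = 100, min_requests: int = 20):
        self.route = route
        self.threshold = threshold
        self.min_requests = min_requests
        self.outcomes = deque(maxlen=window)  # True = success, False = failure

    def record(self, ok: bool):
        """Record one request; return an alert dict if users are being hurt."""
        self.outcomes.append(ok)
        if len(self.outcomes) < self.min_requests:
            return None  # too little traffic to judge; avoids noisy alerts
        rate = self.outcomes.count(False) / len(self.outcomes)
        if rate <= self.threshold:
            return None
        return {
            "route": self.route,
            "error_rate": round(rate, 3),
            "first_action": FIRST_ACTIONS.get(self.route, DEFAULT_ACTION),
        }
```

This sketch returns an alert on every request while the rate stays above the threshold; a real alerting pipeline would deduplicate before paging anyone.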
Recovery & Self-Healing
- Restart failed tasks automatically when safe
- Provide graceful degradation where possible
- Maintain circuit breakers to avoid cascading failures
Reference: https://martinfowler.com/bliki/CircuitBreaker.html
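A minimal circuit-breaker sketch with a degraded fallback, loosely following the pattern described in the reference above. The thresholds, the `CircuitBreaker` interface, and `get_recommendations` are assumptions; it is single-threaded and omits the locking a production breaker would need.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call to protect a failing dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency disabled; use the fallback")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result


def get_recommendations(breaker: CircuitBreaker, fetch) -> list:
    """Graceful degradation: an empty list beats a failed page."""
    try:
        return breaker.call(fetch)
    except (CircuitOpenError, ConnectionError, TimeoutError):
        return []
```

While the circuit is open, callers get `CircuitOpenError` immediately instead of waiting on a dependency that is already failing, which is what keeps one outage from cascading.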
Guiding Principles for Reliable Systems
- Fail clearly, not silently
- See the whole journey (end-to-end tracing)
- Understand errors the way customers feel them
- Claude helps convert telemetry into explanations
- Reliability is a product, not a feature