Learning Objectives
By the end of Phase 3, you should be able to:
- Design observability for an ML system (what metrics, what alerts?)
- Interpret Grafana/Prometheus dashboards for model metrics
- Identify data drift and model drift from time-series signals
- Define SLOs for model-based systems
- Communicate operational risks to non-technical stakeholders
Why this is critical: This is where you spend 80% of your career. Phase 3 is not optional; it's your primary value add.
Resource 6: Observability Fundamentals
Title: An Introduction to Observability for LLM-Based Applications Using OpenTelemetry
Link: https://opentelemetry.io/blog/2024/llm-observability/
Time: 30 minutes
Technical relevance: OpenTelemetry is the emerging standard. Three signal types:
Metrics (scalar measurements): token count per request, cost, latency, accuracy
Traces (request lifecycle): one request from ingress → model inference → response, with timing per span
Logs (structured events): errors, warnings, model decisions, sampling for audit
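The three signal types can be sketched in plain Python. This is an illustrative stand-in, not the OpenTelemetry API: the in-memory `metrics` dict, `span` context manager, and `handle_request` flow are all hypothetical, meant only to show how a metric, a trace of timed spans, and a structured log differ for one request.

```python
import json
import logging
import time
import uuid
from contextlib import contextmanager

# --- Metrics: scalar measurements, aggregated over time ---
metrics = {"gen_ai.usage.input_tokens": 0, "requests_total": 0}

# --- Traces: one request lifecycle, broken into timed spans ---
spans = []

@contextmanager
def span(name, trace_id):
    start = time.monotonic()
    try:
        yield
    finally:
        spans.append({"trace_id": trace_id, "name": name,
                      "duration_ms": (time.monotonic() - start) * 1000})

# --- Logs: structured events, e.g. a model decision worth auditing ---
logging.basicConfig(level=logging.INFO)
log = logging.getLogger("bot")

def handle_request(prompt):
    trace_id = uuid.uuid4().hex
    metrics["requests_total"] += 1
    metrics["gen_ai.usage.input_tokens"] += len(prompt.split())
    with span("ingress", trace_id):
        pass  # auth, validation, routing would go here
    with span("model_inference", trace_id):
        confidence = 0.3  # stand-in for a real model call
    log.info(json.dumps({"event": "model_decision",
                         "confidence": confidence}))
    return confidence

handle_request("what is my order status")
```

One request produces one increment to each metric, two spans sharing a trace id (answering "when did inference happen?"), and one structured log event (recording "model returned confidence=0.3").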
Systems analogy: Just like monitoring an API (request rate, latency, error rate, saturation), you monitor ML systems. But the metrics are different: model confidence, accuracy, drift, cost.
What to understand:
- Why you need all three signal types
- How traces differ from logs: traces are for request paths (when did inference happen?), logs are for events (model returned confidence=0.3)
- Why semantic conventions matter: if every team emits accuracy differently, your dashboards break
Resource 7: LLM-Specific Standards
Title: OpenTelemetry for Generative AI
Link: https://opentelemetry.io/blog/2024/otel-generative-ai/
Time: 20 minutes
Technical relevance: The standards for observing LLMs. Metric names like gen_ai.usage.input_tokens, gen_ai.request.duration, gen_ai.model.name. These will be in your dashboards.
What to understand:
- Semantic conventions: agreed-upon metric names and attributes
- Why this matters: you can switch vendors (OpenAI → Anthropic) and your dashboards still work
- What gets emitted: token counts (usage), latency, cost, model name, error status
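The vendor-switching point can be made concrete with a sketch. The `emit` function and its list-backed store are hypothetical stand-ins for a real exporter; the metric name `gen_ai.usage.input_tokens` comes from the conventions above, while the attribute names (`gen_ai.system`, `gen_ai.request.model`) are my reading of the gen_ai conventions and should be checked against the linked post.

```python
# Hypothetical emitter: a list standing in for a real metrics backend.
emitted = []

def emit(name, value, attributes):
    emitted.append({"name": name, "value": value, "attributes": attributes})

# Two different vendors, same semantic-convention metric name:
emit("gen_ai.usage.input_tokens", 152,
     {"gen_ai.system": "openai", "gen_ai.request.model": "gpt-4o"})
emit("gen_ai.usage.input_tokens", 198,
     {"gen_ai.system": "anthropic", "gen_ai.request.model": "claude-3-5-sonnet"})

# Because the name is shared, one dashboard query covers both vendors;
# switching OpenAI -> Anthropic changes an attribute, not the query.
total = sum(m["value"] for m in emitted
            if m["name"] == "gen_ai.usage.input_tokens")
print(total)  # → 350
```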
Resource 8: Data Drift Detection
Title: How to Detect Model Drift in MLOps Monitoring
Link: https://towardsdatascience.com/how-to-detect-model-drift-in-mlops-monitoring-7a039c22eaf9/
Time: 25 minutes
Technical relevance: Statistical tests for distribution shift. When input distribution changes (data drift) or model performance degrades (model drift), your system should alert.
Statistical approaches:
- KL Divergence / Jensen-Shannon distance: how far the current distribution has moved from the training distribution (Jensen-Shannon is the symmetric, bounded variant of KL)
- Population Stability Index (PSI): how much the distribution has shifted between a baseline window and the current window (e.g., month-over-month)
- Kolmogorov-Smirnov test: do two samples come from the same distribution? (compares empirical CDFs)
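A minimal stdlib sketch of the first two tests, computed over pre-binned histograms. The bin counts and the PSI rule-of-thumb thresholds in the comment are conventional but should be tuned per feature; this is an illustration, not a production drift detector.

```python
import math

def _normalize(counts):
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q, eps=1e-10):
    """KL(P || Q): how surprised a model of Q is by data drawn from P."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q))

def js_divergence(p, q):
    """Symmetric, bounded variant of KL (0 = identical distributions)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def psi(expected, actual, eps=1e-10):
    """Population Stability Index over bin counts (same bin edges).
    Common rule of thumb: < 0.1 stable, 0.1-0.25 moderate, > 0.25 major shift."""
    p, q = _normalize(expected), _normalize(actual)
    return sum((qi - pi) * math.log((qi + eps) / (pi + eps))
               for pi, qi in zip(p, q))

train_bins = [300, 500, 200]   # training-time histogram of a feature
today_bins = [100, 400, 500]   # today's histogram, same bin edges
print(round(psi(train_bins, today_bins), 3))  # → 0.517, a major shift
```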
Systems perspective: Like detecting anomalies in application logs, but for feature distributions. Your model assumes "inputs look like training data." When that assumption breaks, accuracy breaks.
Phase 3 Glossary Callout
| Term | Systems meaning |
| --- | --- |
| Observability | Ability to understand system state from measurements; not just uptime |
| Metric | Numeric measurement (scalar) over time; queryable, aggregable |
| Trace | Record of one request lifecycle; shows timing, dependencies, errors |
| Span | One unit within a trace; like a function call with timing |
| Semantic Convention | Agreed-upon naming and structure for metrics (e.g., gen_ai.usage.input_tokens) |
| SLO | Service Level Objective; promise to users (e.g., "99% accuracy", "p99 latency < 200ms") |
| Drift | Distribution shift; inputs or model behavior changed from training |
| Anomaly | Deviation from baseline; statistical threshold breach |
Phase 3 Try This
Design an Observability Stack: Your team just deployed an LLM-based customer service bot. Design three dashboards:
- Operational Dashboard (for ops/SRE): What do you need to know hourly? (latency, error rate, token cost, uptime)
- Quality Dashboard (for ML team): What indicates the model is working? (hallucination rate, confidence distribution, accuracy on labeled samples)
- Drift Dashboard (for product): Is something changing? (input distribution shift, model confidence drift, user complaints trend)
For each dashboard, list 3-5 metrics. For each metric, define: (1) the alert threshold, (2) what you do when it fires.
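One way to make "(1) the alert threshold, (2) what you do when it fires" concrete is a rule table plus an evaluator. Everything here is hypothetical: the metric names, thresholds, and actions are example answers to the exercise, not recommended values.

```python
# Example alert rules for the exercise above (all values hypothetical).
alert_rules = {
    "p99_latency_ms":   {"threshold": 2000, "direction": "above",
                         "action": "page on-call; check provider status"},
    "error_rate":       {"threshold": 0.02, "direction": "above",
                         "action": "page on-call; inspect recent deploys"},
    "daily_token_cost": {"threshold": 500.0, "direction": "above",
                         "action": "notify team; look for runaway prompts"},
    "holdout_accuracy": {"threshold": 0.88, "direction": "below",
                         "action": "open ML incident; sample failing inputs"},
}

def evaluate(observed):
    """Return (metric, action) for every rule breached by the snapshot."""
    fired = []
    for name, rule in alert_rules.items():
        value = observed[name]
        breached = (value > rule["threshold"]
                    if rule["direction"] == "above"
                    else value < rule["threshold"])
        if breached:
            fired.append((name, rule["action"]))
    return fired

snapshot = {"p99_latency_ms": 1850, "error_rate": 0.01,
            "daily_token_cost": 620.0, "holdout_accuracy": 0.91}
for name, action in evaluate(snapshot):
    print(f"ALERT {name}: {action}")  # only daily_token_cost fires
```

Pairing each threshold with an action forces the design question behind the exercise: an alert nobody can act on is noise.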
Phase 3 Teach-Back
To leadership: "We monitor three things: operational health (latency, errors, cost), model quality (accuracy, confidence, hallucination rate), and drift (is the input distribution changing?). If any of these degrades, we alert and investigate. This is how we defend against silent failures."
Phase 3 Production Example
Scenario: An LLM bot has been running smoothly for 3 months. Token usage is constant. Latency is stable. Then accuracy (measured on a labeled holdout set) drops from 92% to 78% over one week. Investigation: users started asking questions outside the training distribution (new product features). The model is answering confidently but incorrectly. Root cause: no drift detection on the input distribution. If the team had been monitoring "how different are today's inputs from training inputs?", they would have caught the shift early and triggered retraining before accuracy collapsed.
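The check the team was missing can be sketched with a two-sample Kolmogorov-Smirnov statistic: the maximum gap between the empirical CDFs of a training-era feature and this week's feature. The data below is synthetic, standing in for a numeric input feature (e.g., an embedding distance or prompt-length score); in a real pipeline you would run this per feature on a schedule and alert past a tuned threshold.

```python
import bisect
import random

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    def cdf(s, x):
        return bisect.bisect_right(s, x) / len(s)  # fraction of s <= x
    return max(abs(cdf(a, v) - cdf(b, v)) for v in sorted(set(a) | set(b)))

random.seed(0)
train = [random.gauss(0.0, 1.0) for _ in range(500)]   # training-era inputs
same  = [random.gauss(0.0, 1.0) for _ in range(500)]   # no drift
drift = [random.gauss(1.5, 1.0) for _ in range(500)]   # shifted inputs

print(ks_statistic(train, same))   # small gap: distributions overlap
print(ks_statistic(train, drift))  # large gap: this would trip a drift alert
```

Note what this catches that the team's dashboards missed: the statistic flags the input shift directly, without waiting a week for labeled holdout accuracy to reveal the damage.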