These are the patterns most likely to waste your time if you don’t recognize them early. Each maps to a specific OTEL signal that would surface it.
Training Instability: Loss Spikes
What you see: Loss drops steadily, then spikes by 5–10x at a random step, then may partially recover.
Root cause: The learning rate is too high for the current phase of training. A single oversized gradient step throws the weights into a worse region of the loss surface, which the optimizer can’t recover from cleanly.
What OTEL would show: gradient_norm spiking 5–10x its rolling average. If you don’t emit gradient_norm, you only see the loss spike after the fact.
What to do: Reduce the learning rate by a factor of 10 and restart from the last good checkpoint. Add gradient clipping (e.g. max_norm=1.0) as a preventive measure.
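Catching the spike requires comparing each gradient_norm reading to its rolling average before the spike contaminates the baseline. A minimal sketch in plain Python (the function name and thresholds are illustrative, standing in for whatever alerting your OTEL backend provides):

```python
from collections import deque


def spike_detector(window=100, factor=5.0):
    """Return a checker that flags a gradient_norm reading
    exceeding `factor` x its rolling average."""
    history = deque(maxlen=window)  # recent gradient-norm readings

    def check(grad_norm):
        # Average only *previous* readings so the spike itself
        # does not dilute the baseline it is compared against.
        avg = sum(history) / len(history) if history else None
        history.append(grad_norm)
        return avg is not None and grad_norm > factor * avg

    return check


check = spike_detector(window=5, factor=5.0)
for g in [1.0, 1.1, 0.9, 1.0]:
    check(g)          # steady readings, no alert
check(8.0)            # 8x the rolling average → alert fires
```

Emitting the boolean as an OTEL event (rather than alerting on raw loss) is what lets you catch the problem at the spiking step instead of after the fact.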
Dead Neurons: Activation Mean Collapse
What you see: Loss stops improving after a certain point. Validation accuracy plateaus even with more training.
Root cause: A layer’s neurons have entered the “dead ReLU” state: their pre-activation is negative for every input, so the ReLU outputs zero, the gradient through it is zero, and the weights never update again.
What OTEL would show: activation_mean for a specific layer approaching 0. dead_neuron_ratio above 20% for that layer.
What to do: Switch to LeakyReLU or ELU activations. Reduce learning rate. Use batch normalization before activation layers.
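The dead_neuron_ratio metric above can be computed from a batch of post-ReLU activations. A minimal NumPy sketch (the function name and eps threshold are illustrative assumptions, not a standard API):

```python
import numpy as np


def dead_neuron_ratio(post_relu, eps=1e-6):
    """Fraction of units that are ~0 for every sample in the batch.

    post_relu: array of shape (batch, units), recorded after the ReLU.
    """
    # A unit is "dead" only if it never fires across the whole batch.
    dead = np.all(np.abs(post_relu) < eps, axis=0)
    return float(dead.mean())


# Unit 0 is silent on every sample; units 1 and 2 still fire.
acts = np.array([[0.0, 1.2, 0.0],
                 [0.0, 0.7, 0.3]])
dead_neuron_ratio(acts)  # → 1/3 of this layer's units are dead
```

Emit this per layer per validation pass; a single batch can undercount (a unit may fire rarely), so a rolling window over several batches gives a steadier signal.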
Observability Gap: Missing the Signal That Matters
What you see: Model ships with apparently good metrics, then underperforms in production. Post-mortem shows training looked fine.
Root cause: You monitored the wrong thing. Training accuracy was high because the model overfit, and validation loss was never emitted to show it. Or the validation set wasn’t representative of production inputs.
What OTEL would show (if properly instrumented): validation_loss / training_loss ratio > 1.3 triggers overfitting alert. Input distribution comparison (KL divergence between val set and production inputs) reveals dataset mismatch.
What to do: Always emit validation_loss, not just training_loss. Validate that your validation set reflects what users will actually send.
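Both checks from the instrumentation note above are a few lines each. A minimal sketch, assuming normalized histograms for the distribution comparison (function names and the 1.3 threshold come from the text; everything else is illustrative):

```python
import math


def overfitting_alert(val_loss, train_loss, threshold=1.3):
    """Fires when validation loss outpaces training loss by the given ratio."""
    return val_loss / train_loss > threshold


def kl_divergence(p, q, eps=1e-12):
    """Discrete KL(p || q) between two normalized histograms, e.g. a
    feature's distribution in the validation set vs production traffic."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


overfitting_alert(1.4, 1.0)                 # ratio 1.4 > 1.3 → alert
kl_divergence([0.5, 0.5], [0.5, 0.5])       # identical distributions → ~0
```

A large KL value means production inputs land in regions your validation set rarely covers, which is exactly the mismatch the post-mortem in this pattern keeps finding.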
OTEL Anti-patterns: Instrumenting Too Much
What you see: Your telemetry pipeline is overwhelmed. Storage costs spike. Dashboards are cluttered with 200 metrics. No one uses them.
Root cause: Emitting every intermediate value — every weight, every neuron activation — instead of aggregated signals.
Rule: Don’t instrument per-parameter. Instrument aggregates: a single global gradient_norm (not every layer’s gradients), activation_mean per layer (not per neuron), aggregate loss (not per-sample loss values). Focus on the core metrics in the OTEL bridge guide.
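The rule above reduces to emitting a handful of scalars per step, however large the model. A minimal sketch of the aggregation (the function and metric names are illustrative; a real setup would hand the resulting dict to an OTEL meter):

```python
import math

import numpy as np


def aggregate_metrics(grads_by_layer, acts_by_layer):
    """Collapse per-parameter tensors into the few scalars worth emitting."""
    # One global gradient norm across all layers: sqrt of the summed
    # squared entries, i.e. the L2 norm of the flattened gradient.
    global_norm = math.sqrt(sum(float((g ** 2).sum())
                                for g in grads_by_layer.values()))
    metrics = {"gradient_norm": global_norm}
    # One mean per layer, not one value per neuron.
    for name, acts in acts_by_layer.items():
        metrics[f"activation_mean.{name}"] = float(acts.mean())
    return metrics


aggregate_metrics(
    {"fc1": np.array([3.0, 4.0])},   # gradient_norm → 5.0
    {"fc1": np.array([1.0, 3.0])},   # activation_mean.fc1 → 2.0
)
```

A model with millions of parameters still emits only one metric per layer plus one global norm, which is what keeps the pipeline and the dashboards usable.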