11 - Configure dashboards & alerts

Learning Outcomes

  • Learn about the pillars of observability (logs, metrics, and traces).
  • Learn about USE (Utilization, Saturation, Errors) & RED (Rate, Errors, Duration) metrics.

Problem Statement

We need to configure Grafana dashboards that simplify the process of observing our system. We also need to configure alerts for different scenarios; you can choose some threshold values to get started with. When an alert fires, a message should be posted in a Slack channel.


The following expectations should be met to complete this milestone.
  • The following Grafana dashboards should be configured for monitoring:
    • DB metrics
    • Application error logs
    • Node Metrics
    • Kube-state metrics
    • Blackbox metrics (endpoint uptime, latency, etc.)
  • Alerts should be created for the following scenarios:
    • When disk or CPU utilization exceeds a certain threshold.
    • When there is a spike in the error rate over the last 10 minutes.
    • When p90, p95, or p99 latency exceeds a certain threshold.
    • When the number of requests exceeds a certain threshold.
    • When the DB, HashiCorp Vault, or ArgoCD servers restart.
  • When an alert fires, a clear, descriptive message should be posted in the Slack channel.
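
To illustrate what a descriptive alert message might contain, here is a minimal Python sketch. It is not part of the milestone itself: in practice Grafana contact points or Alertmanager receivers can deliver to Slack directly. The function names, field names, and example values below are hypothetical; the only external assumption is a Slack incoming webhook, which accepts a JSON body with a `text` field.

```python
import json
import urllib.request


def build_alert_message(alert_name, severity, summary,
                        current_value, threshold, dashboard_url=None):
    """Build a Slack incoming-webhook payload describing one alert.

    All parameter names are illustrative; adapt them to whatever
    labels and annotations your alerting setup actually provides.
    """
    lines = [
        f"*[{severity.upper()}] {alert_name}*",
        summary,
        f"Current value: {current_value} (threshold: {threshold})",
    ]
    if dashboard_url:
        # Link back to the relevant Grafana dashboard, if one is known.
        lines.append(f"Dashboard: {dashboard_url}")
    return {"text": "\n".join(lines)}


def post_to_slack(webhook_url, payload, timeout=5):
    """POST the payload as JSON to a Slack incoming webhook.

    Returns the HTTP status code. The webhook URL is a secret and
    should come from configuration, not from source code.
    """
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Example values are made up for illustration only.
    payload = build_alert_message(
        alert_name="HighCPUUtilization",
        severity="critical",
        summary="CPU utilization on node-1 is above the configured threshold.",
        current_value="92%",
        threshold="85%",
    )
    print(payload["text"])
```

A message along these lines states what fired, how severe it is, the observed value against the threshold, and where to look next, which is usually enough for the person on call to start triaging from Slack.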

Further Reading