Understanding Prometheus & Grafana — My DevOps Journey

November 08, 2025

Today, I want to talk about Prometheus and Grafana. I’m personally very excited about these tools because, Alhamdulillah, within one year I have completed the entire DevOps stack. Now, let’s move to the main topic.

Most people think DevOps means deploying a website or an application, using Ansible, and building CI/CD pipelines, and that’s it.
But the truth is: real DevOps work actually begins after deployment.

Once an application goes live, the work of DevOps/SRE starts. We begin observing:

CPU usage
Memory consumption
Incoming and outgoing traffic
Application health
System behavior

This is the responsibility of an SRE or Senior DevOps Engineer.
And in this phase, three important concepts come in: SLA, SLO, and SLI.

✅ What is SLA? (Service Level Agreement)

An SLA is a formal agreement between two parties, usually the product owner (the service provider) and the customer.

The product owner commits that the application will be:

Running 24/7
Secure and safe
Scalable
Reliable and available
Performing well

In short: Your application should run smoothly without issues, and the owner is responsible for that promise.
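
In practice, the "running 24/7" promise is usually written as an availability percentage. As a rough illustration, the sketch below converts such a percentage into the downtime it allows. The 99.9% target and the 30-day window are assumptions picked for the example, not numbers from any real contract.

```python
from datetime import timedelta


def allowed_downtime(availability_percent: float, window: timedelta) -> timedelta:
    """Return how much downtime an availability target permits over a window."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return timedelta(seconds=window.total_seconds() * unavailable_fraction)


# Hypothetical target: 99.9% availability over a 30-day month.
budget = allowed_downtime(99.9, timedelta(days=30))
print(f"Allowed downtime: {budget.total_seconds() / 60:.1f} minutes")
```

With these assumed numbers the app may be down for roughly 43 minutes in the month before the promise is broken.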

✅ Observability Tools After Deployment

Once an app is deployed, we use different tools to observe, monitor, and track everything.

1. Monitoring

Monitoring tools such as Prometheus collect metrics from the node, the app, and the cluster so we can confirm everything is working properly.
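
To make this a bit more concrete: Prometheus collects metrics by scraping a plain-text page that each target exposes. The sketch below renders one gauge in that text exposition format using only the standard library. The function name, the metric name, and the label are hypothetical; in a real setup an exporter or the official client library would produce this output for you.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name, help_text, samples):
    """Render one gauge metric in the Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        label_str = ",".join(
            f'{key}="{escape_label_value(str(val))}"'
            for key, val in sorted(labels.items())
        )
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric: CPU usage reported for one node.
print(render_gauge(
    "demo_cpu_usage_percent",
    "CPU usage in percent.",
    [({"node": "node-1"}, 42.5)],
))
```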

2. Logging

For logs, we use:

Grafana
Loki
Promtail

Logs help us see which user came, what actions they performed, and what errors occurred.
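
Logs are much easier to search when every line carries the same fields. Here is a minimal sketch, assuming the app is written in Python, of a formatter that writes one JSON object per log line with the user, the action, and any error. The logger name and the field names are my own choices for the example, not something Loki or Promtail requires.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each log record as a single JSON object."""

    def format(self, record):
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            # "user" and "action" arrive through the `extra` argument.
            "user": getattr(record, "user", None),
            "action": getattr(record, "action", None),
        }
        if record.exc_info:
            entry["error"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def build_logger(stream=None):
    """Return a logger that writes JSON lines to `stream` (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("demo_app")  # hypothetical logger name
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger()
logger.info("checkout completed", extra={"user": "alice", "action": "checkout"})
```

A log shipper can then forward these lines as they are, and we can filter by user or action later instead of searching free text.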

3. Tracing

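For tracing, we use eBPF-based tools. eBPF programs run inside the kernel of the operating system, so they can observe system calls and network activity without changing the application code.
This helps in identifying deep system issues and troubleshooting efficiently.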

4. Alerting

For alerting, we use:

Prometheus
Grafana

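These tools notify us when something goes wrong. Prometheus evaluates alerting rules and hands firing alerts to Alertmanager, which sends the notifications. Grafana can also raise alerts from the queries behind its dashboards.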
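
A good alert should not fire on a single spike. Prometheus alerting rules handle this with a `for` clause, which keeps an alert pending until the condition has held for a set duration. The sketch below imitates that idea with consecutive samples. It is only an illustration of the concept, not how Prometheus implements it, and the function name and the 90% threshold are assumptions.

```python
def alert_fires(samples, threshold, for_samples):
    """Return True if the latest `for_samples` readings are all above `threshold`."""
    if for_samples <= 0:
        raise ValueError("for_samples must be positive")
    if len(samples) < for_samples:
        return False  # not enough history yet, so stay quiet
    return all(value > threshold for value in samples[-for_samples:])


# Hypothetical CPU readings (percent), one per scrape.
cpu_spike = [40, 95, 42, 41]       # one short spike
cpu_sustained = [50, 91, 95, 97]   # high for three scrapes in a row

print(alert_fires(cpu_spike, threshold=90, for_samples=3))
print(alert_fires(cpu_sustained, threshold=90, for_samples=3))
```

The short spike is ignored, while the sustained high usage triggers the alert.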

✅ What is SLO? (Service Level Objective)

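An SLO is an internal, measurable target that we set to make sure the SLA is not broken.
It is usually stricter than the SLA, so we notice problems before the customer's agreement is at risk.

Examples of SLOs (the numbers are only illustrative):

99.9% of requests succeed over 30 days.
95% of requests are answered in under 300 ms.

To know whether these targets are being met, the following must be in place:

Monitoring must be enabled.
Logging must be enabled.
Tracing must be enabled.
Alerting must be enabled.

In short, SLOs are the internal targets we hold ourselves to so the SLA remains intact.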
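
One common way to track an SLO is an error budget: the number of failures the target still allows. Below is a small sketch, assuming a request-based SLO, that reports how much of that budget is left. The 99.9% target and the request counts are made-up numbers for illustration.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Return the fraction of the error budget still unused.

    1.0 means no budget spent, 0.0 means fully spent,
    and a negative value means the SLO has been missed.
    """
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures <= 0:
        raise ValueError("the target leaves no room for failures")
    return 1 - failed_requests / allowed_failures


# Hypothetical numbers: 99.9% target, 1,000,000 requests, 250 failures.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"Error budget remaining: {remaining:.0%}")
```

Here the target allows about 1,000 failed requests, so 250 failures leave roughly three quarters of the budget.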

✅ What is SLI? (Service Level Indicator)

SLIs are the metrics that measure how our system is actually performing, and they are what we compare against the SLOs.
They show when CPU or memory goes up or down, how traffic behaves, what share of requests succeed, and so on.
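
As a small worked example, here is a sketch of two typical SLIs computed from raw request data: the share of successful requests and a latency percentile. The function names and sample values are hypothetical, counting only 5xx responses as failures is my assumption, and the percentile uses the simple nearest-rank method.

```python
import math


def availability_sli(status_codes):
    """Share of requests that did not end in a server error (5xx)."""
    if not status_codes:
        raise ValueError("no requests to measure")
    good = sum(1 for code in status_codes if code < 500)
    return good / len(status_codes)


def latency_percentile(latencies_ms, percentile):
    """Nearest-rank percentile of the observed latencies."""
    if not latencies_ms:
        raise ValueError("no latencies to measure")
    ordered = sorted(latencies_ms)
    rank = math.ceil(percentile / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Hypothetical sample of ten requests.
codes = [200, 200, 503, 200, 404, 200, 200, 200, 500, 200]
latencies = [120, 80, 95, 300, 110, 90, 105, 85, 1000, 100]

print(f"Availability SLI: {availability_sli(codes):.0%}")
print(f"Median latency: {latency_percentile(latencies, 50)} ms")
```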

Tools used to measure SLIs:

Prometheus (a monitoring system with a built-in time series database, TSDB)
Grafana
EFK stack
ELK stack
Datadog

✅ Observability — Why This Matters

Observability helps us answer two important questions:

Why is the app running fine?
If it went down, why did it go down?

To visualize observability, we use tools like:

Kibana
CloudWatch
Grafana

These give us a clear picture of what is happening inside the system.

✅ Final Thoughts

I hope this blog helped you clearly understand SLA, SLO, SLI, and how tools like Prometheus and Grafana support observability in a DevOps environment.

Please share, subscribe, and support.
More DevOps content is on the way!

🌐 Online References

LinkedIn: https://www.linkedin.com/in/raees-yaqoob-qazi-ryqs/
Medium Blog: https://medium.com/@raeesyaqubqazi
Blogspot: https://brillertechnologies.blogspot.com/
Facebook Page: https://web.facebook.com/profile.php?id=61553548371216
YouTube Channel: https://www.youtube.com/@RaeesQ.
TikTok: https://www.tiktok.com/@mrryqs?_t=ZS-8y7t0fQfJKu&_r=1

Raees Qazi (RYQ's) | DevOps Engineer | Learner | Mentor | Creator