LGTM Stack MCP Setup Guide for Advanced Observability
Nov 4, 2025

TL;DR
MCP turns LGTM from visibility to understanding. It adds a context layer over Loki and Tempo, correlates logs and traces, and returns an RCA plus ready-to-render Grafana dashboards, no manual LogQL/TraceQL needed.
Setup is straightforward and storage-safe. Run Loki and Tempo with S3/MinIO backends, then point the LGTM MCP chart at the in-cluster services using host-only URLs; MCP queries backends and never touches object storage directly.
Tenants are just headers. For self-hosted single-tenant, use {}; for multi-tenant or Grafana Cloud, attach X-Scope-OrgID and/or Authorization: Basic <user:token> in the tenants map to scope queries cleanly.
A single natural prompt drives the workflow. Ask “check errors, find root cause, build dashboards,” and MCP generates a question plan, runs minimal queries, correlates results, and emits dashboard JSON for validation.
Production-hardening is predictable. Scale Loki/Tempo separately, enforce RBAC and network policies around MCP, set bucket retention/lifecycle rules for cost control, and use indexed labels/TraceQL filters for query performance.
When Logs, Traces, and Metrics Refuse to Talk to Each Other
Picture a familiar night in your on-call rotation where Grafana is red, your alert dashboard looks like a Christmas tree, and you’re juggling three browser tabs: one for Loki logs, one for Tempo traces, and another for Prometheus metrics. Each tool shows part of the story, but none of them agree on the ending. The checkout service is failing, but the trace IDs don’t match the logs, and the metrics don’t explain the latency. You have all the telemetry you could ask for, but no context connecting it.
This is the daily reality for teams running modern distributed systems. The LGTM stack (Loki, Grafana, Tempo, and Mimir) gives engineers visibility, but not always understanding. Context lives in the gaps between these tools, forcing engineers to mentally correlate logs, traces, and metrics during every incident. This results in delayed root-cause analysis and an endless cycle of dashboard-hopping.

Cardinal’s LGTM stack MCP (Model Context Protocol) fixes that missing context. It doesn’t replace your observability tools; it orchestrates them. MCP adds an intelligence layer that lets systems share telemetry context in real time, so “what went wrong” and “why it happened” appear in the same conversation.
In this guide, you’ll learn how to:
Deploy LGTM MCP in both Grafana Cloud and self-hosted Kubernetes environments.
Connect Loki, Tempo, and Grafana through a single contextual protocol.
Automate correlation between logs, traces, and metrics for faster incident triage.
Enable AI-driven root-cause analysis and contextual Grafana dashboards.
By the end, you’ll have a working LGTM MCP environment capable of turning raw observability data into structured, actionable insight, without rewriting your existing monitoring stack.
The LGTM MCP Architecture: Adding Intelligence to the Standard Stack
Traditional LGTM stacks, Loki for logs, Grafana for visualization, Tempo for traces, and Mimir for metrics, work beautifully as independent tools. Each excels in its domain: Loki handles log ingestion at scale, Tempo manages distributed traces efficiently, and Grafana presents it all through dashboards. Yet, the stack’s biggest challenge is integration of meaning. These tools don’t inherently “understand” each other’s context. That’s the gap MCP fills.
How MCP Extends the LGTM Stack
The Model Context Protocol (MCP) acts as an orchestration layer over the LGTM components. It defines how logs, traces, and metrics should exchange context, using a shared schema and API interface. Rather than querying Loki, Tempo, or Mimir separately, MCP serves as a single semantic gateway. It allows a developer, or an AI system, to ask contextual questions like:
“Show all traces related to failed checkout-service requests with latency > 2s.”
Without MCP, answering this query would require joining log lines from Loki, span data from Tempo, and performance metrics from Prometheus, manually. With MCP, these data streams are correlated automatically using shared identifiers (like trace IDs and span attributes).
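For illustration, here is roughly what that manual work looks like, a TraceQL and a LogQL query you would otherwise write and cross-reference by hand (the app label and service name are assumptions about your schema):

```
# TraceQL (Tempo): spans from checkout-service that failed and took longer than 2s
{ resource.service.name = "checkout-service" && status = error && duration > 2s }

# LogQL (Loki): error logs for the same service over the same window
{app="checkout-service"} |= "ERROR"
```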
MCP therefore transforms LGTM from a visualization stack into an intelligence stack. It provides:
Unified query semantics across telemetry types.
Context propagation between traces, logs, and metrics.
A standardized API for LLMs and other AI tools to reason about observability data.
Data Flow: From Telemetry to Insight
A high-level view of how LGTM MCP fits into your observability pipeline:
Promtail & Application Agents collect logs and send them to Loki.
OpenTelemetry SDKs or Tempo agents send traces to Tempo.
Prometheus/Mimir gathers metrics.
MCP connects to all three, enriches them with correlation metadata, and exposes a unified API.
Grafana (or AI copilots) consume MCP outputs, rendering context-aware panels or root-cause narratives.
The MCP layer doesn’t replace existing observability tools; it normalizes their communication. It ensures that each telemetry signal understands the context of the others, effectively turning disparate data into a single, coherent story.
Why This Architecture Is Important
For distributed systems with hundreds of services, troubleshooting depends not on having more data, but on having more connected data.
By abstracting telemetry access into a protocol-driven layer, MCP:
Reduces manual query complexity.
Enables LLM-powered tools (like Cardinal’s Chip or Copilot integrations) to analyze observability data directly.
Simplifies API-level integration between LGTM components.
In essence, MCP turns your observability setup from reactive dashboards to proactive intelligence.

Understanding the LGTM + MCP Flow
At a high level, the LGTM + MCP stack operates as a context orchestration pipeline between telemetry data and developer insight:
Stage | Component | Role |
1. Data Emission | Applications / Services | Generate logs and traces through Promtail and OpenTelemetry exporters. |
2. Ingestion & Storage | Loki & Tempo | Collect telemetry and persist it into object storage (S3, GCS, or MinIO) using TSDB and block stores. |
3. Context Orchestration | MCP Layer | Queries Loki and Tempo APIs (never the buckets directly), applies tenant headers, correlates related logs and traces, and builds a contextual graph of system behavior. |
4. Insight Delivery | Grafana / AI Copilot | Receives structured results, including RCA summaries and ready-to-render dashboard JSON or panel specs. |
5. Validation & Feedback | Developer / SRE | Reviews dashboards, verifies RCA accuracy, and feeds back into the observability loop. |
Why this design works:
MCP never touches raw storage: it delegates reading to Loki and Tempo, ensuring clean separation of data access and context logic.
Tenant awareness: headers like X-Scope-OrgID or Authorization enforce multi-tenant boundaries without complex rewrites.
Compact by design: it keeps only the actors that matter (Apps → Loki/Tempo → MCP → Grafana) while maintaining technical precision for production readers.
Key Capabilities of the LGTM MCP Stack: From Visibility to Understanding
The LGTM MCP stack doesn’t reinvent observability tools; it elevates how they communicate.
While the traditional LGTM stack delivers visibility, LGTM MCP adds understanding by turning fragmented telemetry streams into correlated context. This section explores how that shift transforms debugging from reactive monitoring into proactive analysis.
Contextual Intelligence Instead of Raw Data
Most observability pipelines collect data. Few interpret it. LGTM MCP introduces an intelligent data layer that understands relationships between logs, traces, and metrics.
It does this by:
Correlating telemetry signals through shared identifiers like trace IDs, service names, and span attributes.
Enriching data with metadata such as deployment tags, Kubernetes namespaces, and service topology.
Normalizing queries across Loki, Tempo, and Mimir, so “show me failed requests” retrieves a full stack trace and its corresponding metrics automatically.
This contextual bridge transforms observability data from isolated events into a coherent narrative, reducing the time developers spend switching tools to piece together a root cause.
AI-Ready Observability for Real Workloads
LGTM MCP was designed with AI and LLM-based copilots in mind. Instead of manually building dashboards, engineers or AI agents can query MCP semantically:
“Find all pods in the checkout-service namespace where latency increased after deployment 124.”
Under the hood, MCP translates that query into Loki + Tempo + Mimir lookups, merging results into a single contextual response.
This architecture makes MCP a model-friendly observability gateway, a foundational layer for AI-assisted troubleshooting or autonomous diagnostics systems.
Faster Root-Cause Correlation
When a system incident occurs, MCP automatically links telemetry data that shares common identifiers. For example:
A 500 error log line in Loki is correlated with the trace span from Tempo that includes the same request ID.
MCP then fetches corresponding latency metrics from Mimir, creating a 360° view of the issue.
Instead of manually aligning timestamps or searching for correlation IDs, engineers see the causal chain instantly:
The failure sequence begins with an API latency spike, continues with a database lock wait, and culminates in a pod restart.
This mechanism turns incident investigation from a multi-step process into a single query.
Seamless Grafana Integration
LGTM MCP doesn’t replace Grafana; it augments it.
MCP connects directly into Grafana’s datasource layer, enabling:
Auto-generated dashboards built from real-time telemetry schemas.
Context-aware panels that explain the why behind a metric spike.
Integrated AI assistance for querying and summarizing complex telemetry.
Engineers still use the Grafana interface they know, but the data behind it becomes self-describing and correlated.
Multi-Tenant and Multi-Environment Flexibility
Large organizations rarely operate in one environment. LGTM MCP supports multi-tenant configuration through per-tenant headers and authentication contexts.
Each team, cluster, or environment can have its own:
Loki endpoint
Tempo endpoint
Authentication token
This separation allows enterprise-scale deployments while preserving data boundaries and access control.
In Summary
Traditional LGTM stacks make your telemetry visible; LGTM MCP makes it intelligible. It transforms your logs, traces, and metrics into a shared context graph that powers faster debugging, AI-native observability, and deeper integration with Grafana.
Setting up LGTM MCP with object storage and an OTEL demo
This setup persists logs and traces to object storage (S3/GCS/MinIO), runs Loki and Tempo against those buckets, and exposes LGTM MCP servers that query Loki/Tempo over HTTP. An OpenTelemetry demo then generates logs and traces so MCP can correlate them. Finally, a single copilot prompt drives: (1) a question bank; (2) targeted queries; (3) an RCA narrative; and (4) dashboard panel specs.
MCP never reads buckets directly; Loki and Tempo do. MCP speaks to their APIs and returns contextual results.
Prerequisites that remove guesswork up front
Failed installs usually come from missing permissions, blocked egress, or mismatched URLs. Verifying fundamentals first prevents noisy troubleshooting later.
Kubernetes 1.26+ with outbound access to your object store (or in-cluster MinIO).
Helm 3 and kubectl on your workstation.
Object store credentials with read/write/list scoped to two buckets (one for Loki, one for Tempo).
Optional Grafana instance (self-hosted or Cloud) if you want a dashboard UI.
A namespace (we’ll use default for concreteness).
Export the basics without committing secrets:
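A minimal sketch of those exports, assuming an in-cluster MinIO endpoint and illustrative bucket names and credentials; substitute your own values and keep real secrets out of version control:

```bash
# Illustrative values only -- adjust to your provider
export S3_ENDPOINT="minio.default:9000"   # host:port, no scheme (e.g. s3.us-east-1.amazonaws.com for AWS)
export S3_REGION="us-east-1"
export S3_ACCESS_KEY="minioadmin"
export S3_SECRET_KEY="minioadmin"
export LOKI_BUCKET="loki-data"
export TEMPO_BUCKET="tempo-data"
```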
Definition: S3 path-style vs virtual-hosted style. MinIO and some on-prem S3 gateways require path-style addressing. AWS S3 prefers virtual-hosted. We’ll keep this toggle explicit where needed.
Architecture choices that impact your Helm values
Because your Helm values shape your runtime, pick the smallest thing that could work, then scale up.
Storage pattern: Loki uses TSDB schema and Tempo uses its native block store; both persist to object storage.
Control plane: LGTM MCP is a thin layer that calls Loki (/loki/api/...) and Tempo (/tempo/api/...) and adds correlation/context, not storage.
Topology: Start with single-binary Loki and basic Tempo; move to distributed modes only after data volume demands it.
Loki configuration that writes to an object store
Goal: install Loki in single-binary mode, disable auth, write blocks/indexes to a bucket using the TSDB v13 schema.
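A values sketch (loki-values.yaml) mirroring the fields called out below. Key paths can differ across grafana/loki chart versions (some expect raw config under loki.structuredConfig), so treat this as a starting point rather than a drop-in file:

```yaml
# loki-values.yaml -- single-binary Loki writing to S3/MinIO (sketch)
deploymentMode: SingleBinary
singleBinary:
  replicas: 1
gateway:
  enabled: false          # MCP talks to Loki's HTTP API directly
loki:
  auth_enabled: false     # single-tenant demo
  server:
    http_listen_port: 3100
  schemaConfig:
    configs:
      - from: "2024-01-01"
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: index_
          period: 24h
  storage_config:
    aws:
      endpoint: ${S3_ENDPOINT}
      region: ${S3_REGION}
      bucketnames: ${LOKI_BUCKET}
      access_key_id: ${S3_ACCESS_KEY}
      secret_access_key: ${S3_SECRET_KEY}
      insecure: true            # plain-HTTP MinIO; drop for AWS S3
      s3forcepathstyle: true    # path-style addressing for MinIO / on-prem gateways
```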
Key fields explained:
schemaConfig: tells Loki to use tsdb and the v13 schema from a given date forward.
storage_config.aws: configures the S3/MinIO backend Loki will use to store chunks/index.
server.http_listen_port: exposes Loki’s HTTP API internally on 3100.
gateway.enabled: false: skips the optional Loki gateway since MCP talks directly to Loki’s HTTP API.
Why this works: Loki writes indexes/chunks to your bucket, but MCP never touches the bucket. MCP queries Loki’s API, and Loki reads from the bucket. This clean separation makes credentials and audit easier.
Install with env substitution:
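One way to do this, assuming the grafana Helm repo is added and envsubst is available on your workstation:

```bash
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Substitute the exported $S3_* and $LOKI_BUCKET values into the values file at install time
envsubst < loki-values.yaml | helm upgrade --install loki grafana/loki \
  --namespace default -f -
```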
Tempo configuration that persists traces to an object store
Run Tempo with its block store backed by your object storage.
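A values sketch (tempo-values.yaml) mirroring the fields listed below; verify the exact keys against your grafana/tempo chart version:

```yaml
# tempo-values.yaml -- Tempo with an S3/MinIO block store (sketch)
tempo:
  server:
    http_listen_port: 3200
  storage:
    trace:
      backend: s3
      s3:
        bucket: ${TEMPO_BUCKET}
        endpoint: ${S3_ENDPOINT}     # host:port, no scheme
        region: ${S3_REGION}
        access_key: ${S3_ACCESS_KEY}
        secret_key: ${S3_SECRET_KEY}
        insecure: true               # plain-HTTP MinIO; drop for AWS S3
        forcepathstyle: true         # path-style addressing for MinIO
```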
Key fields:
tempo.storage.trace.backend: s3: selects the S3-compatible backend.
tempo.storage.trace.s3.*: provides bucket credentials and endpoint.
server.http_listen_port: 3200: exposes Tempo’s query API internally on 3200.
OTLP receivers are enabled by default; we’ll use 4317 for the demo.
Install:
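Same pattern as Loki, with the exported variables substituted in:

```bash
envsubst < tempo-values.yaml | helm upgrade --install tempo grafana/tempo \
  --namespace default -f -
```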
Why this works: applications send traces to tempo.default:4317. Tempo writes blocks to your bucket. Later, MCP queries Tempo’s API, and Tempo reads results from storage.
LGTM MCP values that point to in-cluster services (and why host-only URLs)
Why “host-only” matters: the MCP binaries themselves append /loki/api/... and /tempo/api/.... Adding those suffixes to values doubles the path and yields 404s. Keep URLs host-only.
Definition: Tenant in this chart is a logical target (e.g., cluster or environment) with its own HTTP headers. For in-cluster single-tenant, use {} (no headers).
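A values sketch for the MCP chart (lgtm-mcp-values.yaml); the key names below (loki.url, tempo.url, tenants) are illustrative, so check the chart’s own values.yaml before relying on them:

```yaml
# lgtm-mcp-values.yaml (sketch -- key names are assumptions; verify against the chart)
loki:
  url: http://loki.default:3100      # host-only: the MCP server appends /loki/api/...
tempo:
  url: http://tempo.default:3200     # host-only: the MCP server appends /tempo/api/...
tenants:
  default: {}                        # self-hosted single-tenant: no extra headers
```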
Install:
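The install follows the same Helm pattern; the repository and chart names below are placeholders for wherever you source the LGTM MCP chart:

```bash
# Placeholder repo/chart names -- point these at the LGTM MCP chart you are using
helm upgrade --install lgtm-mcp <your-repo>/lgtm-mcp \
  --namespace default -f lgtm-mcp-values.yaml
```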
What just happened: the LGTM MCP pods start with URLs and tenant config mounted from a Secret. They expose /mcp (JSON-RPC) endpoints that translate your tool calls into Loki/Tempo API requests.
OTEL demo that writes real logs and traces into your buckets
Why seed data: empty backends make MCP look broken. A tiny log emitter plus a short trace generator gives deterministic signals to verify ingestion, storage, and query paths.
Promtail ships pod logs to Loki:
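A minimal Promtail values sketch pointing the client at Loki’s push API (the loki.default service name assumes the single-binary install above):

```yaml
# promtail-values.yaml -- ship pod logs to the in-cluster Loki (sketch)
config:
  clients:
    - url: http://loki.default:3100/loki/api/v1/push
```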
Install Promtail and a noisy pod:
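For example (the pod name and log text are illustrative; the app=checkout label matches what the checks below expect):

```bash
helm upgrade --install promtail grafana/promtail \
  --namespace default -f promtail-values.yaml

# A deliberately noisy pod so Loki has ERROR lines labeled app=checkout to ingest
kubectl run checkout-noise --image=busybox --labels=app=checkout --restart=Never -- \
  sh -c 'while true; do echo "ERROR checkout payment failed"; sleep 2; done'
```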
Telemetrygen emits traces to Tempo:
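A one-shot Kubernetes Job using the OpenTelemetry telemetrygen image works well here; the flags and image tag are a sketch, so check the telemetrygen docs for your version:

```yaml
# telemetrygen-job.yaml -- emit a burst of checkout traces to Tempo's OTLP gRPC port
apiVersion: batch/v1
kind: Job
metadata:
  name: checkout-traces
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: telemetrygen
          image: ghcr.io/open-telemetry/opentelemetry-collector-contrib/telemetrygen:latest
          args:
            - traces
            - --otlp-endpoint=tempo.default:4317
            - --otlp-insecure
            - --traces=100
            - --service=checkout
```

Apply it with kubectl apply -f telemetrygen-job.yaml and wait for the Job to complete before running the checks below.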
What to expect: Promtail continuously pushes “ERROR … app=checkout” logs to Loki. The job writes a burst of checkout traces to Tempo. Both land in your buckets; Loki/Tempo queries them; MCP correlates them.
Post-install checks that confirm the plumbing is correct
Why to check now: quick JSON-RPC calls catch URL/path mistakes or missing data before deeper work.
Check Loki MCP:
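A sketch of the call, assuming you port-forward the Loki MCP service locally. The service name, port, and the loki_query argument names are assumptions: run a tools/list request first to see the real tool schema, and note that some MCP servers expect an initialize call before tools/call.

```bash
kubectl port-forward svc/lgtm-mcp-loki 8080:8080 &   # service name and port are assumptions

curl -s http://localhost:8080/mcp \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json, text/event-stream' \
  -d '{
        "jsonrpc": "2.0", "id": 1, "method": "tools/call",
        "params": {
          "name": "loki_query",
          "arguments": { "query": "{app=\"checkout\"} |= \"ERROR\"", "since": "15m" }
        }
      }'
```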
Check Tempo MCP:
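And the Tempo side, with the same caveats about service names and argument names:

```bash
kubectl port-forward svc/lgtm-mcp-tempo 8081:8080 &   # service name and port are assumptions

curl -s http://localhost:8081/mcp \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/json, text/event-stream' \
  -d '{
        "jsonrpc": "2.0", "id": 2, "method": "tools/call",
        "params": {
          "name": "tempo_query",
          "arguments": { "query": "{ resource.service.name = \"checkout\" }", "since": "15m" }
        }
      }'
```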
Reading results: expect at least one Loki entry and several Tempo spans. If empty, widen the time window or re-run the telemetrygen job.
One Simple Prompt to Drive Questions, Queries, RCA, and Dashboards
You don’t need to craft elaborate, structured prompts or remember LogQL syntax; MCP does the heavy lifting. Once deployed, it understands how to query Loki and Tempo, correlate results, and even generate Grafana dashboards automatically.
In practice, you can ask it something as natural as:
“Check my services for errors over the last hour, find the root cause, and build Grafana dashboards so I can validate it.”
That’s it.
Behind the scenes, MCP:
Generates a contextual question bank internally (e.g., “Which services are failing?” “What’s the p95 latency?”).
Runs correlated LogQL and TraceQL queries via the MCP tools (loki_query and tempo_query).

Synthesizes a root cause analysis narrative, supported by evidence from both logs and traces.
Produces Grafana dashboards or JSON specs so you can validate the data visually.


What looks like one simple request actually orchestrates multiple queries and joins under the hood, all aligned to your telemetry schema and tenant context.
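As a rough illustration, this is the kind of query pair MCP might generate for that prompt; the label and attribute names depend on your schema:

```
# LogQL: recent error volume per service
sum by (app) (count_over_time({namespace="default"} |= "ERROR" [10m]))

# TraceQL: failing checkout spans in the same window
{ resource.service.name = "checkout" && status = error }
```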
Why this is significant: The goal of MCP is to make observability conversational and context-aware.
You don’t have to engineer complex prompts or write custom dashboards, just ask what you need to know.
If you want to automate it further (for daily reports or post-incident summaries), you can schedule the same natural-language query through your copilot or CI pipeline and let MCP continuously explain your system’s state in plain English.
Troubleshooting that targets the usual pitfalls
401/“no credentials provided” to Grafana Cloud: ensure the Basic header base64 is USER:TOKEN and that your Helm values actually render into the tenants Secret that the pods read.
404 from MCP to Loki/Tempo: remove /loki or /tempo from URLs and use host-only; MCP appends product paths.
Empty results: confirm Promtail is running and the trace job completed; verify Loki/Tempo logs show writes; widen query time windows.
Object store errors: toggle s3forcepathstyle for MinIO; verify bucket names, permissions, and region; check cluster egress/DNS.
Operating LGTM MCP Beyond the Demo: Scaling, Multi-Tenancy, and Cost Efficiency
Once you’ve proven the end-to-end flow with the demo setup, the next step is evolving it into a production-grade observability stack. The demo runs in a single namespace, with one Loki and Tempo instance and minimal data retention. Production environments demand stricter control over scale, cost, access, and reliability.
This section explains how to move from proof-of-concept to scalable enterprise operation.
Designing a Multi-Tenant Observability Topology
Multi-tenancy is the backbone of enterprise observability. It separates data from different teams, environments, or applications while maintaining shared infrastructure.
How it works in LGTM MCP:
Each tenant corresponds to a logical environment (e.g., prod, staging, or dev).
Tenants can have separate authentication headers (like X-Scope-OrgID).
MCP reads these headers from the tenants map in your values file and attaches them to every query.
Example multi-tenant config:
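A sketch of the tenants map, reusing the illustrative key layout from the install values above; the header names are real, the tenant keys and values are examples:

```yaml
tenants:
  prod:
    X-Scope-OrgID: "prod"
    # Grafana Cloud-style auth: base64 of "<instance-id>:<access-policy-token>"
    Authorization: "Basic <base64 of user:token>"
  staging:
    X-Scope-OrgID: "staging"
  dev: {}           # no headers: plain single-tenant backend
```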
Why this is significant: Multi-tenancy lets you isolate logs and traces per environment while sharing compute. SREs can query prod data securely while developers work with staging datasets, all within the same MCP layer.
Scaling Strategies for Loki, Tempo, and MCP
When workloads grow, LGTM components need to scale differently.
Component | Scaling Strategy | Key Metrics to Watch |
Loki | Move from single-binary to distributed mode; separate the read and write paths. | Ingestion latency, index size, and query concurrency. |
Tempo | Scale the ingesters for trace ingestion and the compactor for block compaction. | Span throughput, compaction lag, query latency. |
MCP | Run multiple replicas of Loki-MCP and Tempo-MCP behind a Kubernetes Service. | Response time, concurrent queries handled. |
Practical note: Loki scales horizontally for write and read throughput, while Tempo scales primarily for trace ingestion and block compaction. MCP is stateless and lightweight; it scales mostly for concurrent user load.
Implementing Cost Controls through Retention and Lifecycle Rules
Problem: Object storage is cheap, but not infinite. Long-lived logs and traces can quietly inflate bills.
Solution: Apply time-based retention and automatic deletion policies.
1. Loki retention via Helm values:
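A hedged example of the relevant Loki settings, shown as raw Loki config; map these into your chart’s values (e.g., under loki: or loki.structuredConfig:, depending on the chart version):

```yaml
# Raw Loki config snippets (sketch)
limits_config:
  retention_period: 720h        # keep logs for 30 days
compactor:
  retention_enabled: true       # the compactor enforces deletes in object storage
  delete_request_store: s3
```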
2. Tempo retention with block lifecycle:
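And the Tempo equivalent, via the compactor’s block retention; where this nests depends on your chart (some charts expose it as a single top-level retention value):

```yaml
# Raw Tempo config snippet (sketch)
compactor:
  compaction:
    block_retention: 720h       # delete trace blocks older than 30 days
```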
3. Object store lifecycle rules (S3/MinIO):
Define bucket lifecycle in your provider (e.g., “delete objects older than 30 days”).
This acts as a second safety net in case the app-level retention fails.
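For example, with the AWS CLI (the same JSON works for most S3-compatible stores; the bucket name is illustrative):

```bash
aws s3api put-bucket-lifecycle-configuration \
  --bucket loki-data \
  --lifecycle-configuration '{
    "Rules": [
      {
        "ID": "expire-old-telemetry",
        "Status": "Enabled",
        "Filter": {},
        "Expiration": { "Days": 30 }
      }
    ]
  }'
```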
Securing MCP in a Multi-Team Environment
MCP servers expose JSON-RPC APIs that can trigger queries and retrieve data. Protect them as you would any internal API.
Security best practices:
Restrict network access: Expose MCP only within the cluster or via an authenticated API gateway.
Add service account RBAC: Only allow pods with proper ServiceAccount roles to invoke the MCP endpoints.
Use network policies: Deny all cross-namespace ingress except from trusted monitoring namespaces (see the sketch after this list).
Rotate API keys regularly: Especially if using CardinalHQ or Grafana Cloud credentials within tenant headers.
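A minimal NetworkPolicy sketch implementing that ingress restriction; the namespace and label names are illustrative, so adjust them to your MCP pods:

```yaml
# Allow MCP ingress only from the monitoring namespace; deny everything else
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: lgtm-mcp-ingress
  namespace: default
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: lgtm-mcp     # adjust to the MCP pods' labels
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
```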
Optional enhancement: Add an Nginx or Istio sidecar that enforces mutual TLS for MCP communications if you integrate with external systems (e.g., CI/CD bots or observability AIs).
Maintaining Performance and Reliability at Scale
When queries become heavier (thousands of series, millions of spans), the main bottleneck shifts from MCP to backend query performance.
Performance optimization checklist:
Use indexed fields in your LogQL and TraceQL queries (e.g., app, namespace, status_code).
Cache results using Grafana’s query caching or a reverse proxy.
Enable parallel query sharding in Loki and Tempo for large range scans.
Add dedicated SSD-backed caches (Memcached or Redis) for active blocks.
Keep Cardinal API latency low; MCP relies on it for external model context lookups.
Example: Real Multi-Environment Setup
Imagine a SaaS platform running multiple services (checkout, search, billing) across staging and prod.
Layer | Setup | Description |
Loki | Two tenants (staging, prod) with separate X-Scope-OrgID headers. | Prevents noisy staging logs from inflating prod queries. |
Tempo | Shared backend, but tenant tags separate blocks. | Enables shared compaction and separate retention. |
MCP | One instance per environment, both talking to the same backends. | Each one provides environment-specific observability. |
This setup gives you full isolation with shared infra efficiency, a sweet spot between control and cost.
Comparing LGTM MCP to Other Observability Patterns: Context as the New Differentiator
As teams mature their observability practices, they often discover that traditional setups, even those using the same components (Loki, Tempo, Grafana), hit a wall in contextual understanding. The difference isn’t in the data you collect; it’s in how intelligently you can connect and interpret it.
LGTM MCP doesn’t replace the classic observability stack; it elevates it. Let’s look at how.
Traditional Grafana + Loki + Tempo: Fragmented Visibility
In a standard Grafana setup:
Loki handles logs with LogQL.
Tempo stores traces with TraceQL.
Prometheus covers metrics with PromQL.
Grafana dashboards visually connect them, but the correlation logic still lives in your head.
This model works for dashboards and alerts, but breaks down under investigative load:
Each query must be handcrafted and re-run.
Correlating logs, traces, and metrics requires constantly switching between tabs and timestamps.
You rely on tribal knowledge of labels, environment variables, or span IDs.
In short: The system collects data; the human provides the context.
Grafana Cloud Pipelines: Managed but Still Manual
Grafana Cloud improves this picture by offering a managed backend and global access tokens, but it doesn’t change the investigation model:
You still construct queries manually.
You still reason about relationships between logs and traces.
Context (the “why” behind an event) is still external, living in human memory or wiki pages.
Cloud pipelines are infrastructure convenience, not cognitive automation.
LGTM MCP: Turning Data Into Context-Aware Conversations
LGTM MCP extends your observability stack by adding a context layer on top of Loki and Tempo.
Instead of manually crafting LogQL or TraceQL, you ask questions in natural terms, and MCP translates, executes, and correlates them.
Example difference:
Task | Traditional Stack | LGTM MCP |
Identify failing checkout requests | Manually search in Loki for HTTP 500 log lines scoped to the checkout service | Ask: “Which checkout endpoints failed in the last 10 minutes?” |
Correlate with traces | Manually copy span IDs to Tempo query | MCP auto-links logs and traces via span context |
Generate dashboards | Build panels one by one in Grafana | MCP emits a Grafana dashboard spec (JSON) pre-filled with key metrics |
MCP brings semantic alignment; it knows how Loki’s log schema, Tempo’s trace attributes, and your system topology fit together.
MCP vs OpenTelemetry Collector: Different Layers of the Stack
OpenTelemetry Collectors generate and route telemetry. MCP consumes and interprets it. You still need OTEL to emit logs/traces, but you use MCP to:
Query and correlate the resulting telemetry.
Explain causality (“this log pattern preceded that span failure”).
Generate dashboards or alerts without manual config.
In essence, OTEL gives you structured data; MCP gives you structured understanding.
When MCP Makes the Most Difference
MCP shows the biggest return in environments that are:
High-scale or distributed: dozens of services, dynamic environments.
Human-intensive: SREs or developers constantly chasing incidents.
Context-rich: where root causes depend on metadata, versions, or recent deploys.
Multi-tenant or regulated: needing per-team isolation and auditable RCA flows.
When logs are simple and systems are small, plain Grafana + Loki + Tempo is fine. When your platform complexity outpaces your ability to reason manually, MCP becomes the multiplier.
For developers:
LGTM MCP doesn’t ask you to abandon Grafana; it extends it with intelligence. It transforms your observability stack from a dashboard viewer into a diagnostic assistant. Instead of debugging by pattern recognition, you investigate through a structured context.
How LGTM MCP Improves Real-World Observability Workflows
When most engineers talk about “observability,” what they actually mean is visibility, the ability to view logs, metrics, and traces. But visibility doesn’t guarantee understanding. MCP’s contextual intelligence bridges that gap. Let’s revisit the scenario from the start of this post, the checkout service with sporadic latency spikes, and see how LGTM MCP changes the story in practice.
Before MCP: Reactive and Fragmented Debugging
Without MCP, debugging a production incident often follows this pattern:
An alert fires in Grafana (“checkout latency > 2 s”).
An SRE pivots to Loki to search logs.
They find HTTP 500 errors, then manually grab a trace ID to query in Tempo.
A different engineer checks Prometheus metrics to verify if resource contention is a factor.
Every hop costs context. Even experienced teams lose time correlating data and rebuilding mental models of what happened.
After MCP: Context-Driven Root-Cause Discovery
With LGTM MCP integrated, you query the system in plain terms:
“Show me why checkout latency spiked at 14:32 yesterday.”
MCP orchestrates three steps automatically:
Generates relevant queries, using embeddings and its question-bank engine trained on your telemetry schema.
Runs correlated Loki (LogQL) and Tempo (TraceQL) queries, pulling metrics and trace spans tied to the same transaction IDs.
Synthesizes insights, e.g., “Database write latency increased by 140 % on node db-2 due to connection pool exhaustion.”
Within seconds, Grafana dashboards appear with panels pre-filtered to that event window, no manual joins, no guesswork.
How It Works Under the Hood
Here’s what happens step-by-step:
Data storage in buckets – OTEL collectors export logs and traces to S3-compatible buckets (otel-logs, otel-traces).
LGTM ingestion – Loki and Tempo read from those buckets as their backend stores.
MCP orchestration – The LGTM MCP layer queries these stores using the tenant credentials you configured, applying the right schema and context.
Context synthesis – CardinalHQ’s model service enriches responses by linking stack traces, metrics, and deployment metadata.
Visualization – MCP emits a Grafana dashboard JSON that can be imported or rendered automatically.
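If you want to push that JSON into Grafana automatically rather than importing it by hand, the standard dashboards API works; the URL, token variable, and file name below are placeholders:

```bash
# dashboard.json is the spec emitted by MCP; wrap it in the API's expected envelope
curl -s -X POST "https://grafana.example.com/api/dashboards/db" \
  -H "Authorization: Bearer $GRAFANA_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"dashboard\": $(cat dashboard.json), \"overwrite\": true}"
```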
The key shift: you’re no longer writing LogQL; you’re expressing intent, and MCP translates that into optimal, multi-source queries.
Quantifiable Benefits in Developer Workflows
Dimension | Without MCP | With LGTM MCP |
MTTR (Mean Time to Resolution) | 30–60 min of manual correlation | 5–10 min automated root-cause inference |
Dashboards Maintained | Dozens of static panels | Dynamic Grafana JSON generated on demand |
Query Complexity | High; requires LogQL/TraceQL expertise | Low; expressed in natural language |
Cross-Team Context | Tribal knowledge only | Standardized, model-backed context layer |
The result is not merely “faster queries” but a shift from data lookup to knowledge retrieval.
A Developer’s Daily Routine, Reimagined
Morning stand-ups look different now:
Instead of sharing screenshots from Grafana, developers run:
“Summarize yesterday’s checkout errors and their causes.”
MCP surfaces a concise report: error spikes, correlated traces, affected nodes, and suggested fixes.
Dashboards are auto-generated only when anomalies appear, reducing dashboard bloat.
This transforms observability from a passive monitoring activity into an active debugging dialogue with your infrastructure.
Conclusion: From Observability to Understanding
Every engineering team collects telemetry, logs, metrics, and traces. But as systems grow, so does the cognitive load of interpreting them. LGTM MCP transforms observability from a passive act of watching dashboards to an active process of understanding system behavior through context.
In the checkout-service story we began with, engineers once chased error graphs, searched logs, and compared traces manually. With MCP, the same process becomes conversational and automated: a single natural query identifies the issue, correlates logs and traces, and produces actionable Grafana dashboards for verification.
By bridging the gap between data visibility and contextual reasoning, LGTM MCP enables:
Rapid root-cause analysis: contextual queries replace ad-hoc searches.
Adaptive dashboards: MCP builds views dynamically from real data.
Scalable observability: one layer orchestrates across tenants and clusters.
Human + machine collaboration: LLMs enrich raw telemetry with causal insight.
For developers and SREs, this means spending less time hunting metrics and more time improving reliability. MCP isn’t replacing Grafana, Loki, or Tempo; it’s augmenting them with intelligence.
The near future is heading toward autonomous observability, systems that not only detect anomalies but also explain and fix them. LGTM MCP lays the foundation: it already understands your telemetry and can reason about it. The next leap is self-healing operations, where insights evolve into automated remediations. To try out the MCP with just Grafana, check out this other blog.
FAQs
How does LGTM MCP authenticate to Loki and Tempo in multi-tenant setups?
LGTM MCP attaches tenant-scoped headers per target (e.g., X-Scope-OrgID, Authorization: Basic <user:token>). In self-hosted single-tenant mode, use {} (no headers). In Grafana Cloud, use the product “User/Instance ID” and an Access Policy token in the Authorization header.
Does LGTM MCP read directly from S3/MinIO buckets for logs and traces?
No. LGTM MCP never touches object storage. It queries Loki and Tempo over HTTP; those backends read from the bucket and return results. This separation simplifies security and auditing.
What’s the best way to set Loki object storage for production performance and cost?
Use TSDB schema (v13) with S3/MinIO, enable retention in Loki, and configure bucket lifecycle rules (e.g., auto-delete after N days). Prefer indexed labels (e.g., app, namespace, status) and consider results cache or Memcached for heavy queries.
How do I export traces to Tempo using OpenTelemetry OTLP?
Point your apps/collector to the OTLP gRPC endpoint (default :4317) of Tempo. Verify the Tempo storage backend (S3/MinIO) and ensure the distributor/ingester is reachable from your workloads. Use TraceQL filters (e.g., status != ok) for efficient queries.
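For SDK-instrumented services, the standard OTLP environment variables are usually enough; the endpoint below assumes the in-cluster Tempo service from earlier:

```bash
export OTEL_EXPORTER_OTLP_ENDPOINT="http://tempo.default:4317"
export OTEL_EXPORTER_OTLP_PROTOCOL="grpc"
export OTEL_SERVICE_NAME="checkout"
```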
How does Grafana ingest and display dashboards generated by MCP?
MCP can return dashboard JSON or panel specs. You can import JSON via Grafana UI or have your copilot call the Grafana HTTP API to create/update dashboards. Tie panels to your Loki/Tempo datasources and apply time/label filters for faster triage.
