Measuring Support Engineering Effectiveness, Metrics That Matter
TL;DR — Half the metrics on a typical support dashboard are vanity or gameable. The ones that actually predict outcomes are reopen rate, time-to-resolution by tier, escalation density, and a single quality-adjusted CSAT. The rest go in an appendix.
I’ve watched too many support orgs spend a quarter optimizing a metric that didn’t matter, only to find that the customer satisfaction story got worse while the dashboard turned green. Goodhart’s law applies to support: when a measure becomes a target, it stops being a good measure. The fix isn’t to stop measuring. It’s to measure the right things and to triangulate so that gaming one number breaks another.
This piece is the metrics framework I’ve used to evaluate, redesign, and run support engineering measurement at three different companies. It’s opinionated. Some of the metrics you’ve probably been reporting for years are in the “retire” list. Some metrics nobody talks about are in the “elevate” list. Read it as a senior engineer or tech support manager who’s been asked to “improve metrics” and is trying to figure out which metrics actually reward improvement.
Pin everything to November 2025 tooling: Postgres 17 as the warehouse, dbt 1.9 for transforms, Grafana 11 for dashboards, Zendesk Explore as the source-of-truth for ticket data. If you’re on Salesforce Service Cloud the SQL changes but the framework doesn’t.
The metrics taxonomy
Start by separating metrics into four categories. Most teams conflate them, which is why their dashboards are noisy.
                 outcome
                    ^
        leading     |     lagging
                    |
    ----------------+----------------
                    |
      operational   |     vanity
                    |
                  input
Outcome metrics: customer retention, expansion revenue, NPS, churn rate. The things the business actually cares about. Slow to move, hard to attribute, but ultimately the only metrics that matter.
Leading metrics: reopen rate, escalation density, time-to-resolution variance. They predict outcome metrics. Faster to move and more actionable.
Operational metrics: handle time, first response time, queue depth, agent occupancy. They tell you how the machine is running. Useful for ops, dangerous as performance targets.
Vanity metrics: ticket volume, number of articles published, “AI-assisted resolutions.” They make the dashboard look busy. They don’t predict outcomes. They get gamed reliably.
Your leadership dashboard should have three outcome metrics, four leading metrics, and an appendix for operational. Vanity metrics get retired, full stop.
Step 1, the metrics to retire
I’m going to start with the deletions because they’re the highest-leverage change.
Total ticket volume. Means nothing in isolation. Volume up could mean more product usage (good) or more product bugs (bad). Always report volume as a ratio (tickets per MAU, tickets per release) or don’t report it at all.
First contact resolution rate (when self-reported). Agents close tickets and reopen them under new IDs to inflate FCR. Unless your tooling can detect reopens across ID changes, FCR is a vanity metric. Reopen rate (next section) is the better signal.
Average handle time as a target. As a diagnostic it’s fine. As a target it incentivizes closing tickets fast, which incentivizes incomplete resolutions, which raises reopen rate two weeks later. The customer notices. The dashboard doesn’t.
Number of KB articles published. I’ve never seen this correlate with anything useful. A team can publish fifty articles a quarter that nobody reads. The real metric is KB article retrieval-to-resolution ratio (covered below).
AI-assisted resolution count. Almost always a fake metric driven by vendor pricing. Whether an LLM was in the loop doesn’t predict outcomes; whether the customer’s problem got solved does.
If your current dashboard has more than two of these, you have a measurement problem. Cut them this week, replace with what’s below.
Step 2, the leading metrics worth tracking
Reopen rate, segmented
A reopened ticket means you closed it before the problem was actually solved. This is the single best signal of resolution quality.
-- models/marts/support/fct_reopen_rate.sql
WITH ticket_status AS (
    SELECT
        ticket_id,
        account_tier,
        category,
        resolved_at,
        reopened_at,
        -- NULL when the ticket was never reopened
        EXTRACT(EPOCH FROM (reopened_at - resolved_at)) / 86400 AS days_to_reopen
    FROM {{ ref('stg_ticket_events') }}
    WHERE resolved_at IS NOT NULL
)
SELECT
    DATE_TRUNC('week', resolved_at) AS week,
    account_tier,
    category,
    COUNT(*) AS resolved_count,
    SUM(CASE WHEN reopened_at IS NOT NULL
              AND days_to_reopen <= 14
             THEN 1 ELSE 0 END)::FLOAT
        / NULLIF(COUNT(*), 0) AS reopen_rate_14d
FROM ticket_status
GROUP BY 1, 2, 3
ORDER BY 1, 2, 3
-- No trailing semicolon: dbt wraps a model's SELECT in its own CREATE statement.
The 14-day window matters. If you only count reopens within 24 hours, you miss the slow-burn ones where the customer tried the suggestion, it didn’t work, and they came back a week later. Healthy reopen rate for enterprise tier is under 8%; for self-serve it’s under 15%. If your number is much lower, you’re probably not capturing reopens correctly.
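To make the window effect concrete, here is a minimal Python sketch on made-up tickets. The `reopen_rate` helper and the sample data are hypothetical, not part of the dbt model above; the point is only that the same tickets give different rates under a 24-hour and a 14-day window.

```python
from datetime import datetime, timedelta


def reopen_rate(tickets, window_days):
    """Share of resolved tickets reopened within window_days of resolution.

    tickets: list of (resolved_at, reopened_at or None) tuples.
    """
    if not tickets:
        return None
    window = timedelta(days=window_days)
    reopened = sum(
        1
        for resolved_at, reopened_at in tickets
        if reopened_at is not None and reopened_at - resolved_at <= window
    )
    return reopened / len(tickets)


base = datetime(2025, 11, 3, 9, 0)
# Hypothetical sample: seven clean resolutions, one fast reopen, one
# slow-burn reopen after six days, one reopen after 30 days (outside both windows).
sample_tickets = [(base, None)] * 7 + [
    (base, base + timedelta(hours=5)),
    (base, base + timedelta(days=6)),
    (base, base + timedelta(days=30)),
]

rate_24h = reopen_rate(sample_tickets, 1)
rate_14d = reopen_rate(sample_tickets, 14)
print(f"24h window: {rate_24h:.0%}, 14-day window: {rate_14d:.0%}")
# prints "24h window: 10%, 14-day window: 20%"
```

In this toy sample the 24-hour window sees half of the reopens that the 14-day window does, and the difference is exactly the customer who came back a week later.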
Time-to-resolution variance
The mean is misleading. The gap between the median and the 90th percentile tells you whether you’re consistent.
SELECT
    DATE_TRUNC('week', resolved_at) AS week,
    priority,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY resolution_minutes) AS p50,
    PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY resolution_minutes) AS p90,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY resolution_minutes) AS p99,
    PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY resolution_minutes)
        - PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY resolution_minutes) AS p90_p50_spread
FROM {{ ref('fct_ticket_lifecycle') }}
WHERE resolved_at > now() - interval '90 days'
GROUP BY 1, 2;
The p90-to-p50 spread is the signal. A small spread means consistent service. A growing spread means you’ve got a small number of tickets that are eating disproportionate time, and those are the ones that customers churn over.
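A quick way to see why the spread is more informative than the median is to compute both on two invented weeks. This is a sketch, not the production query: `percentile_cont` is a hand-rolled helper that follows the linear-interpolation definition Postgres documents for `PERCENTILE_CONT`, and the resolution times are made up.

```python
def percentile_cont(values, fraction):
    """Linear-interpolated percentile, following the PERCENTILE_CONT definition."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("need at least one value")
    position = fraction * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight


def p90_p50_spread(resolution_minutes):
    """Distance between the 90th percentile and the median, in minutes."""
    return (percentile_cont(resolution_minutes, 0.9)
            - percentile_cont(resolution_minutes, 0.5))


# Hypothetical weeks: identical medians, very different tails.
steady_week = [30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130]
long_tail_week = [30, 40, 50, 60, 70, 80, 90, 100, 110, 900, 2400]

for name, week in (("steady", steady_week), ("long tail", long_tail_week)):
    print(name, percentile_cont(week, 0.5), p90_p50_spread(week))
```

Both weeks report an 80-minute median. Only the spread (40 minutes versus 820) shows that the second week had a couple of tickets eating a disproportionate amount of time.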
Escalation density
How often does a ticket need to escalate to L2 or L3? If it’s increasing, either your L1 is under-trained or your product complexity is growing faster than your team’s knowledge.
SELECT
    DATE_TRUNC('week', t.created_at) AS week,
    t.category,
    COUNT(DISTINCT t.id) AS tickets,
    COUNT(DISTINCT e.ticket_id) AS escalated,
    COUNT(DISTINCT e.ticket_id)::FLOAT / NULLIF(COUNT(DISTINCT t.id), 0) AS escalation_rate
FROM {{ ref('stg_tickets') }} t
LEFT JOIN {{ ref('stg_escalations') }} e
    ON e.ticket_id = t.id
    AND e.escalated_to_tier IN ('l2', 'l3')
WHERE t.created_at > now() - interval '90 days'
GROUP BY 1, 2;
Useful breakdowns: by category, by agent (who’s escalating too much), by product area (where’s the complexity creeping in). If a single agent’s escalation rate is 3x the team median, they need training or pairing. If a product area is escalating 3x the rest, engineering owes you better documentation or a runbook.
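The "3x the team median" check is simple enough to sketch in a few lines of Python. The function name, the agent labels, and the rates are all hypothetical; in practice the input would come from the per-agent breakdown of the query above, and you would want a minimum ticket count per agent before trusting any flag.

```python
from statistics import median


def flag_high_escalators(escalation_rates, multiple=3.0):
    """Return agents whose escalation rate is at least `multiple` times the team median.

    escalation_rates: dict mapping agent name to escalation rate (0.0 to 1.0).
    """
    if not escalation_rates:
        return []
    team_median = median(escalation_rates.values())
    if team_median == 0:
        return []  # no meaningful baseline to compare against
    return sorted(
        agent
        for agent, rate in escalation_rates.items()
        if rate >= multiple * team_median
    )


# Hypothetical per-agent escalation rates for one quarter.
rates = {
    "agent_a": 0.08,
    "agent_b": 0.10,
    "agent_c": 0.12,
    "agent_d": 0.09,
    "agent_e": 0.35,
}
flagged = flag_high_escalators(rates)
print(flagged)  # prints ['agent_e']
```

I use the median rather than the mean as the baseline here so that the outlier being tested doesn't drag its own threshold upward.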
Quality-adjusted CSAT
Raw CSAT is noisy. Response rates are low (10-20% in most orgs), satisfied customers under-respond, unhappy customers over-respond. Adjust.
WITH csat AS (
    SELECT
        ticket_id,
        score,
        responded_at
    FROM {{ ref('stg_csat_responses') }}
    WHERE responded_at > now() - interval '90 days'
),
tickets AS (
    SELECT
        ticket_id,
        account_tier,
        category,
        resolved_at,
        reopened_at
    FROM {{ ref('stg_tickets') }}
    WHERE resolved_at > now() - interval '90 days'
)
SELECT
    DATE_TRUNC('week', t.resolved_at) AS week,
    t.account_tier,
    COUNT(c.score)::FLOAT / NULLIF(COUNT(*), 0) AS response_rate,
    AVG(CASE WHEN c.score IS NOT NULL THEN c.score END) AS raw_csat,
    -- A survey score wins when present; a silent reopen counts as a 1;
    -- silent tickets that stayed closed are NULL and drop out of the average.
    AVG(CASE WHEN c.score IS NOT NULL THEN c.score
             WHEN t.reopened_at IS NOT NULL THEN 1
             ELSE NULL END) AS adjusted_csat
FROM tickets t
LEFT JOIN csat c ON c.ticket_id = t.ticket_id
GROUP BY 1, 2;
The adjusted CSAT counts a reopen on a ticket with no survey response as an implicit dissatisfied signal; where the customer did answer the survey, their score stands. Customers who got a real answer rarely reopen. Customers who got a wrong answer often don’t bother filing a complaint; they just come back through the front door. Counting the reopen as a CSAT-of-1 is conservative but it correlates with churn much better than raw CSAT does.
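The arithmetic is easier to trust once you have run it on a small example. The sketch below mirrors the CASE logic of the query in plain Python; `csat_summary` and the sample week are invented for illustration, and a 1-to-5 survey scale is assumed.

```python
def csat_summary(tickets):
    """Compute response rate, raw CSAT and reopen-adjusted CSAT.

    tickets: list of dicts with 'score' (1-5 or None) and 'reopened' (bool).
    """
    def mean(values):
        return sum(values) / len(values) if values else None

    raw_scores = [t["score"] for t in tickets if t["score"] is not None]
    adjusted_scores = []
    for t in tickets:
        if t["score"] is not None:
            adjusted_scores.append(t["score"])  # explicit survey answer wins
        elif t["reopened"]:
            adjusted_scores.append(1)           # silent reopen counted as a 1
        # silent and not reopened: left out, like NULL inside AVG()

    return {
        "response_rate": len(raw_scores) / len(tickets) if tickets else None,
        "raw_csat": mean(raw_scores),
        "adjusted_csat": mean(adjusted_scores),
    }


# Hypothetical week: 20 resolved tickets, 3 survey responses, 2 silent reopens.
week = (
    [{"score": 5, "reopened": False}] * 2
    + [{"score": 4, "reopened": False}]
    + [{"score": None, "reopened": True}] * 2
    + [{"score": None, "reopened": False}] * 15
)
summary = csat_summary(week)
print(summary)
```

With a 15% response rate, the raw score here is about 4.67 and looks healthy. Two silent reopens pull the adjusted score down to 3.2, which is the kind of gap worth investigating.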
Step 3, the operational metrics (in the appendix)
These belong in your team-lead dashboard, not your executive dashboard.
Handle time by category. Diagnostic only. Watch the distribution, not the mean.
First response time. Useful for SLA tracking, useless as a target. Pair with resolution time always.
Queue depth by tier. Real-time operational signal. If L2 queue depth is over the team’s daily capacity, work is going to age.
Agent occupancy. The percentage of an agent’s working hours spent on tickets. Healthy is 60-75%. Anything over 85% sustained means people aren’t doing the followup work, training, and KB updates that compound long-term.
I went deeper on the operational side of this in “SLA driven operations for tech support managers” if you want the full operating-rhythm context.
Step 4, the outcome metrics
These are the ones that justify your team’s existence to the business.
Account-level customer health
Combine ticket signals into a per-account health status. The composite I use, with the rules evaluated top to bottom so the first match wins:
WITH account_signals AS (
    SELECT
        account_id,
        COUNT(*) AS tickets_90d,
        AVG(resolution_minutes / 60.0) AS avg_resolution_hours,
        SUM(CASE WHEN reopened_at IS NOT NULL THEN 1 ELSE 0 END)::FLOAT
            / NULLIF(COUNT(*), 0) AS reopen_rate,
        SUM(CASE WHEN priority IN ('p1', 'p2') THEN 1 ELSE 0 END) AS critical_tickets,
        MAX(CASE WHEN csat_score <= 2 THEN created_at END) AS last_bad_csat
    FROM {{ ref('fct_ticket_lifecycle') }}
    WHERE created_at > now() - interval '90 days'
    GROUP BY 1
)
SELECT
    a.account_id,
    a.account_name,
    a.tier,
    s.tickets_90d,
    s.reopen_rate,
    s.critical_tickets,
    -- Accounts with no tickets in the window have NULL signals,
    -- fail every comparison, and fall through to 'green'.
    CASE
        WHEN s.reopen_rate > 0.20 THEN 'red'
        WHEN s.critical_tickets > 3 THEN 'red'
        WHEN s.last_bad_csat > now() - interval '14 days' THEN 'yellow'
        WHEN s.tickets_90d > 30 THEN 'yellow'
        ELSE 'green'
    END AS support_health
FROM {{ ref('dim_accounts') }} a
LEFT JOIN account_signals s ON s.account_id = a.account_id
WHERE a.tier IN ('enterprise', 'business');
Share this with your customer success team weekly. Red accounts need a proactive touch. Yellow accounts need watching. Green accounts are fine.
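Because the thresholds will get argued about, it helps to have the rule set somewhere you can unit-test it. Here is a minimal Python mirror of the CASE expression; the function and the example accounts are hypothetical, and `None` stands in for SQL NULL.

```python
from datetime import datetime, timedelta


def support_health(reopen_rate, critical_tickets, last_bad_csat, tickets_90d, now):
    """Mirror the CASE expression: rules run top to bottom, first match wins.

    None stands in for SQL NULL (an account with no tickets in the window),
    which fails every comparison and lands on 'green'.
    """
    if reopen_rate is not None and reopen_rate > 0.20:
        return "red"
    if critical_tickets is not None and critical_tickets > 3:
        return "red"
    if last_bad_csat is not None and last_bad_csat > now - timedelta(days=14):
        return "yellow"
    if tickets_90d is not None and tickets_90d > 30:
        return "yellow"
    return "green"


now = datetime(2025, 11, 17)
noisy_but_stable = support_health(0.05, 1, None, 42, now)
recent_bad_survey = support_health(0.10, 0, now - timedelta(days=3), 8, now)
both_signals = support_health(0.25, 0, now - timedelta(days=3), 8, now)
no_tickets = support_health(None, None, None, None, now)
print(noisy_but_stable, recent_bad_survey, both_signals, no_tickets)
# prints "yellow yellow red green"
```

The third example is the one to notice: an account that trips both a red rule and a yellow rule comes out red, because the red rules are checked first.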
Tickets per active user
Volume normalized to product usage. This is the metric that tells you whether the product is getting easier or harder to use over time.
SELECT
    DATE_TRUNC('month', t.created_at) AS month,
    COUNT(DISTINCT t.id)::FLOAT / NULLIF(u.mau, 0) AS tickets_per_mau
FROM {{ ref('stg_tickets') }} t
JOIN {{ ref('stg_monthly_active_users') }} u
    ON DATE_TRUNC('month', t.created_at) = u.month
GROUP BY 1, u.mau;
A rising tickets-per-MAU is a product UX signal, not a support engineering signal. The right reaction is to bring it to product engineering, not to staff up support. The full pattern for that conversation is in next week’s article on closing the loop, support feedback to product engineering.
KB retrieval-to-resolution ratio
How often does a KB article actually solve a customer’s problem? Track this through your help center’s search-then-no-ticket conversion, plus internal usage from the triage system.
WITH kb_views AS (
    SELECT
        article_id,
        viewer_session_id,
        viewed_at
    FROM {{ ref('stg_kb_pageviews') }}
    WHERE viewed_at > now() - interval '30 days'
),
ticket_after_view AS (
    SELECT
        v.article_id,
        v.viewer_session_id,
        EXISTS (
            SELECT 1 FROM {{ ref('stg_tickets') }} t
            WHERE t.requester_session = v.viewer_session_id
              AND t.created_at BETWEEN v.viewed_at AND v.viewed_at + interval '4 hours'
        ) AS filed_ticket
    FROM kb_views v
)
SELECT
    article_id,
    COUNT(*) AS views,
    SUM(CASE WHEN NOT filed_ticket THEN 1 ELSE 0 END)::FLOAT / COUNT(*) AS deflection_rate
FROM ticket_after_view
GROUP BY 1
ORDER BY views DESC;
Articles with high views and low deflection are doing harm. They’re attracting customers without resolving their issue. Either rewrite them or delete them.
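Turning that into a review queue is a small filter over the query output. The sketch below is one way to do it in Python; the `min_views` and `max_deflection` cut-offs are assumptions I picked for the example, not recommended values, and the article IDs are invented.

```python
def articles_to_review(article_stats, min_views=200, max_deflection=0.5):
    """Return rows that draw traffic but rarely prevent a ticket, busiest first.

    article_stats: list of (article_id, views, deflection_rate) tuples.
    min_views and max_deflection are illustrative cut-offs assumed for this sketch.
    """
    flagged = [
        row
        for row in article_stats
        if row[1] >= min_views and row[2] < max_deflection
    ]
    return sorted(flagged, key=lambda row: row[1], reverse=True)


# Hypothetical 30-day output of the deflection query.
stats = [
    ("reset-password", 1800, 0.92),
    ("sso-setup", 950, 0.41),
    ("webhook-retries", 40, 0.10),
    ("billing-export", 400, 0.35),
]
review_queue = articles_to_review(stats)
print([article_id for article_id, _, _ in review_queue])
# prints ['sso-setup', 'billing-export']
```

The low-traffic article is left out on purpose: with 40 views its deflection rate is too noisy to act on, and fixing it would help far fewer customers than fixing the two busy ones.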
Step 5, the dashboard layout
One leadership dashboard. Three tabs.
Tab 1, outcomes. Account health distribution, tickets-per-MAU trend, churn correlation with support health.
Tab 2, leading indicators. Reopen rate by tier, resolution time variance, escalation density, adjusted CSAT.
Tab 3, operational. Queue depth, SLA compliance, handle time distribution. Updated hourly. The team leads live here.
Grafana panels for these are mostly time-series with annotations on major events (product releases, on-call shifts, hiring changes). The annotations are the most underrated feature. They let you correlate a metric movement with a specific change six months later when nobody remembers what happened.
The Grafana dashboarding best practices guide is the canonical reference for layout and refresh cadence; their recommendations on panel density are worth following.
Common Pitfalls
Picking a single “north star” metric. Goodhart’s law guarantees gaming. Pick a small set (four to six) of triangulating metrics. If improving one degrades another, the system catches it.
Reporting metrics monthly only. You need weekly cadence for leading metrics. By the time a monthly report shows a problem, you’re a month into the regression.
Mixing tiers in aggregate metrics. Enterprise and self-serve customers have completely different expectations and metric distributions. Always segment, always.
Treating CSAT response rate as a problem to fix. Most teams chase 50% response rates by nagging customers. Responses squeezed out by repeated reminders are lower quality, so the score gets noisier, and the nagging trains customers to ignore surveys. A 15-20% response rate with the right sampling is fine.
Comparing yourself to industry benchmarks. They’re meaningless. Your product, your customer mix, your SLA structure, your team composition are all different from the vendor benchmark report. Compare to your own past, not someone else’s present.
Troubleshooting
Symptom, all metrics look good but customers are churning. Your metrics don’t capture the right outcome. Almost always means you’re not measuring the time-from-first-report-to-resolved-and-customer-confirmed-fixed for critical issues. The customer’s lived experience is different from your dashboard. Sample five churned accounts from the last quarter and walk through their full ticket history. The story will be specific.
Symptom, a single metric oscillates wildly week to week. Either the sample size is too small (segment less aggressively at low volumes) or there’s a data quality issue (check for nulls, check for outliers). If the metric is real and noisy, smooth it with a 4-week trailing average for the dashboard while keeping the raw metric for diagnostics.
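For the smoothing step, a trailing average is a few lines in any language. This Python sketch assumes one value per week in chronological order; the function name and the sample series are made up, and the first three points use a shorter window rather than being dropped.

```python
def trailing_average(values, window=4):
    """Trailing mean over up to `window` most recent points.

    Early points use however many values exist so far.
    """
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical weekly reopen rate that oscillates around 10%.
weekly_reopen_rate = [0.06, 0.14, 0.05, 0.15, 0.07, 0.13]
smoothed = trailing_average(weekly_reopen_rate)

for raw, smooth in zip(weekly_reopen_rate, smoothed):
    print(f"raw {raw:.2f}  smoothed {smooth:.3f}")
```

Plot the smoothed series on the dashboard and keep the raw series one click away, so that a real step change isn't hidden behind four weeks of averaging.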
Symptom, two teams report different numbers for “the same metric.” Definition drift. SQL definitions need to live in dbt, be versioned, and be referenced rather than copy-pasted. If your CSAT calculation lives in three different dashboards, it’s only a matter of time before they diverge.
Wrapping Up
Measurement is leverage. Bad metrics teach the team to do the wrong work. Good metrics teach the team to do the right work and let leadership see the story. The patterns above will not give you a perfect dashboard the first week; they will give you a process for converging on one over a quarter.
Next in this series I lay out the escalation runbooks that turn measurement into action when something does go wrong. The metrics tell you something’s broken; the runbook tells you what to do about it.