Lessons From Running Platform Engineering Teams in 2025
December 15, 2025 · 11 min read · by Muhammad Amal

TL;DR: Platform engineering succeeds when the team treats its developers as paying customers, ships paved roads with sharp edges deliberately filed off, and resists the urge to become a gatekeeping function. It fails when it becomes a centralized ops team with a fancy name.

The platform engineering label has been around long enough now that we can stop arguing about whether it’s a real thing and start talking honestly about what works and what doesn’t. I’ve run or co-led platform teams at three companies over the past five years, and 2025 was the year I felt I finally had enough data points to write down what I’d learned without sounding like I was just reciting the Team Topologies book.

This isn’t an introduction to platform engineering. If you’re not sure what a platform team is, the Team Topologies book by Skelton and Pais is the right starting point. This post assumes you’re running one, or about to start one, and you want the lessons that aren’t in any book yet.

Most of what I’ll cover is structural and cultural rather than technical. The technical choices (Kubernetes vs. ECS, Terraform vs. Pulumi, what your golden path looks like) matter, but they matter less than the operating model you build around them. I’ve seen teams ship beautiful infrastructure that nobody used and ugly infrastructure that everyone loved. The difference was always in the team’s relationship to its customers.

What platform engineering actually is

Strip away the marketing. A platform engineering team builds internal products that other engineers use to ship faster, more safely, and with less cognitive load. The “internal product” framing is the key part. Most platform teams that fail do so because they don’t take that framing seriously.

What “internal product” means in practice:

  • You have identifiable customers (other engineering teams).
  • You have a roadmap driven by customer needs, not by what the platform team finds interesting.
  • You measure adoption, satisfaction, and outcomes, not just uptime.
  • You have a coherent product surface area, not a bag of unrelated tools.
  • You write docs that an engineer can read on Monday morning and ship something by Wednesday.

If you can’t say yes to all five, you’re not running a platform product. You might be running a centralized ops team, an SRE pool, or an internal consultancy. Those are valid things to run, but they’re not platforms, and the patterns that work for them are different.

The paved road is the unit of value

Every platform team should be able to point to its paved roads. A paved road is a specific, opinionated way to do a common thing, with the boring parts already handled. Examples that I’ve seen work:

  • The standard way to deploy a stateless service (Helm chart, CI pipeline, dashboards, alerts, on-call setup, all pre-wired).
  • The standard way to ship a scheduled job (one YAML file, runs on a managed scheduler, integrates with the standard logging and metrics).
  • The standard way to expose a new HTTP API to other services (mTLS, rate limiting, retries, circuit breakers, all defaults).

What makes a paved road different from “infrastructure your team can use”:

  1. It’s opinionated. There’s exactly one way to do the thing.
  2. It’s instrumented. Anyone on the paved road gets observability for free.
  3. It’s documented. Not just reference docs. A walkthrough that takes you from zero to deployed in under a day.
  4. It’s supported. The platform team owns the road, not the users.
  5. It has a defined off-ramp. There’s a documented way to leave the paved road if your team genuinely needs to.

The fifth point matters more than people think. Paved roads that don’t have escape hatches become political tools. Make the escape explicit, document the trade-offs, and the road earns trust.

The internal product team operating model

Here’s the operating model I’ve found works for platform teams running multiple paved roads:

A small set of product surface areas

Three to five paved roads, each owned by a sub-team (or a clear DRI if the platform team is small). Each road has a one-page product brief that any engineer in the company can read and understand what the road does, who owns it, and what its roadmap looks like.

Customer interfaces

Each road has at least one named “platform engineer of the day” or rotation who fields customer questions in Slack. Customer questions get logged, even if they’re answered in the moment, so the team can see patterns over time.

A monthly customer review

Once a month, the platform team reviews customer signal. Not just incident metrics. Adoption rates, support ticket categories, satisfaction surveys (lightweight, three questions), and the open feature requests. The output is a one-page summary that goes to the team’s leadership and the broader org.
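
A minimal sketch of how logged questions could be rolled up for that review, assuming each question is recorded with the road it concerns and a category. The schema, road names, and sample entries are hypothetical and only illustrate the shape of the tally.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    """One logged customer question (hypothetical schema)."""
    road: str      # which paved road the question was about
    category: str  # e.g. "docs gap", "bug", "feature request"


def top_categories(log: list[Question], road: str, n: int = 3) -> list[tuple[str, int]]:
    """Return the n most common question categories for one road."""
    counts = Counter(q.category for q in log if q.road == road)
    return counts.most_common(n)


# Illustrative data only, not real numbers.
log = [
    Question("stateless-service", "docs gap"),
    Question("stateless-service", "docs gap"),
    Question("stateless-service", "feature request"),
    Question("scheduled-job", "bug"),
]

print(top_categories(log, "stateless-service"))
# [('docs gap', 2), ('feature request', 1)]
```

Even a tally this crude makes the pattern visible: if "docs gap" tops the list for a road three months running, that is a roadmap input, not a support problem.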

A quarterly roadmap process

The platform team’s roadmap is driven by customer signal, not by what looks interesting. Each quarter, the team commits to two to four “themes” with measurable outcomes. Themes I’ve seen work: “reduce time to ship a new service from three days to half a day,” “cut the number of teams running their own deploy scripts from twelve to four,” “ship a managed database story so teams stop running Postgres themselves.”

RACI for shared services

For services that span platform team and product teams (e.g., who owns the production database for service X?), a written RACI table prevents the most common arguments. A simplified version:

| Concern | Platform | Product Team | Other |
|---|---|---|---|
| Database uptime SLO | R | A | - |
| Schema design | C | R, A | - |
| Backup and restore | R, A | C | - |
| Data classification | C | R, A | Security: A |
| Capacity planning | R, A | C | - |
| Migration of legacy services | C | R, A | - |

R = Responsible (does the work), A = Accountable (the single party answerable for the outcome), C = Consulted, I = Informed.

The point isn’t the specific assignments. The point is that the platform team and the product team have written down who does what, so the next time there’s a Postgres incident at 3am, nobody is figuring it out for the first time at 3am.
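
One way to keep a table like this honest is to store it as data and lint it. Here is a minimal sketch, assuming the assignments live in a plain dictionary; the structure and function names are hypothetical. It flags any concern that lacks a Responsible party or does not have exactly one Accountable party.

```python
# The simplified RACI table, encoded as data.
# Keys are concerns; values map each party to its set of roles.
RACI = {
    "Database uptime SLO": {"Platform": {"R"}, "Product Team": {"A"}},
    "Schema design": {"Platform": {"C"}, "Product Team": {"R", "A"}},
    "Backup and restore": {"Platform": {"R", "A"}, "Product Team": {"C"}},
    "Data classification": {
        "Platform": {"C"},
        "Product Team": {"R", "A"},
        "Security": {"A"},
    },
    "Capacity planning": {"Platform": {"R", "A"}, "Product Team": {"C"}},
    "Migration of legacy services": {"Platform": {"C"}, "Product Team": {"R", "A"}},
}


def parties_with(role: str, concern: str) -> list[str]:
    """Return the parties holding a given role for a concern, sorted by name."""
    return sorted(p for p, roles in RACI[concern].items() if role in roles)


def ambiguous_concerns() -> list[str]:
    """Concerns without a Responsible party or without exactly one Accountable."""
    return [
        concern
        for concern in RACI
        if not parties_with("R", concern) or len(parties_with("A", concern)) != 1
    ]


print(ambiguous_concerns())
# ['Data classification']
```

In the simplified table, the check flags Data classification, where both the product team and Security hold an A. That may be a deliberate split, but it is exactly the kind of row worth settling in writing before the incident rather than during it.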

What we got wrong this year

A few things my team and I got wrong in 2025, in case it saves you the same lessons.

We built before we sold

Early in the year, we built a new paved road for background job processing without doing enough discovery. We assumed (correctly) that teams were spending too much time on this, but assumed (incorrectly) that they all had the same shape of problem. The road we built didn’t fit the three biggest customers. We had to rebuild it. Cost: about three months. Lesson: do the discovery work before you build, even when you think you already know the answer. I wrote about discovery techniques in Consultative Discovery for Complex Software Architectures.

We took on too many on-call duties

The team agreed to be on-call for a database that we’d built tooling around. Within six months, we were getting paged for incidents we couldn’t diagnose because we didn’t actually own the workload. The fix was to renegotiate the RACI, but the renegotiation took most of a quarter and damaged trust along the way. Lesson: don’t take on-call for things you don’t own end-to-end. Tooling ownership and operational ownership are not the same.

We measured uptime instead of leverage

For most of the year, the team’s primary metric was platform availability. It was great. Five nines. The problem was that nobody outside the team cared. The customers cared about how fast they could ship and how much time they spent fighting infrastructure. We pivoted late in the year to measuring “time from new service decision to production traffic” and the conversations with customers immediately got better. Lesson: pick metrics that customers actually care about.
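
For concreteness, here is a rough sketch of how a metric like "time from new service decision to production traffic" could be computed, assuming you record a decision date and a first-traffic date per service. The service names and dates are made up for illustration.

```python
from datetime import datetime
from statistics import median


def lead_time_days(decided: str, first_traffic: str) -> float:
    """Days between the decision to build a service and its first production traffic."""
    start = datetime.fromisoformat(decided)
    end = datetime.fromisoformat(first_traffic)
    return (end - start).total_seconds() / 86400


# Hypothetical records: (service, decision date, first production traffic).
services = [
    ("billing-api", "2025-09-01", "2025-09-04"),
    ("search-indexer", "2025-09-10", "2025-09-11"),
    ("notifier", "2025-09-15", "2025-09-20"),
]

lead_times = [lead_time_days(decided, live) for _, decided, live in services]
print(f"median lead time: {median(lead_times):.1f} days")
# median lead time: 3.0 days
```

The median is a deliberate choice in this sketch: one service that sat in limbo for a quarter should not hide the typical experience.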

We documented the “what” without the “why”

Our docs were a comprehensive reference. They were terrible for new users because nothing in them explained why the platform was the way it was. New engineers would read them, conclude the platform was overcomplicated, and roll their own. Lesson: every paved road needs a one-page “why” doc that explains the trade-offs. Reference docs aren’t enough.

How to know if your platform team is succeeding

Three signals I now look for at the end of every quarter:

Adoption is growing without being mandated

If teams are choosing the paved road because it’s actually faster, that’s the strongest signal of success. If adoption only happens because leadership mandated it, you’ve built something people tolerate, not something they want.

Support burden is steady or declining as adoption grows

A healthy platform sees support load per user trend down over time as docs improve and the platform gets more self-serve. If support load grows linearly with adoption, you’re scaling a help desk, not a platform.
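
A small sketch of that check, assuming you can count support tickets and adopting teams per month. The numbers below are invented; the point is to normalize the ticket count by adoption before judging the trend.

```python
def tickets_per_team(tickets: int, adopting_teams: int) -> float:
    """Support load normalized by the number of teams on the paved road."""
    if adopting_teams <= 0:
        raise ValueError("adopting_teams must be positive")
    return tickets / adopting_teams


# Hypothetical monthly snapshots: (month, support tickets, teams on the road).
snapshots = [
    ("2025-07", 40, 8),
    ("2025-08", 45, 12),
    ("2025-09", 48, 16),
]

ratios = [tickets_per_team(tickets, teams) for _, tickets, teams in snapshots]

# Healthy: each month's per-team load is no higher than the month before.
steady_or_declining = all(
    later <= earlier for earlier, later in zip(ratios, ratios[1:])
)

print(ratios)               # [5.0, 3.75, 3.0]
print(steady_or_declining)  # True
```

In this made-up data the raw ticket count rises every month, yet load per team falls, which is the healthy case. Raw ticket counts alone would have told the opposite story.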

Engineers describe the platform as “boring”

The highest compliment I’ve gotten from a customer was “I don’t think about your stuff anymore, it just works.” That’s the goal. Platforms that engineers find exciting are usually platforms that demand too much attention.

A worked example, a quarterly roadmap

Here’s a slightly abstracted real quarterly roadmap from this year, to show what the artifact looks like:

# Platform Team Q3 2025 Roadmap

## Theme 1, deploy faster (Service Owner Time)
- Outcome: median time from `git push` to production traffic
  reduced from 47 minutes to under 20 minutes
- Owner: Anna
- Key bets: prebuilt CI containers, parallel test sharding,
  staging deploys decoupled from prod gates

## Theme 2, fewer Postgres incidents
- Outcome: cut customer-team-owned Postgres incidents by 50%
- Owner: Bao
- Key bets: managed Postgres offering, runbook automation,
  proactive capacity planning service

## Theme 3, sunset legacy job runner
- Outcome: migrate the remaining 11 teams off the old job
  runner; decommission by end of quarter
- Owner: Chen
- Key bets: pre-built migration scripts, office hours,
  pairing sessions with team leads

## What we're explicitly not doing this quarter
- New observability dashboards (deferring to Q4)
- Multi-cluster deploys (deferring indefinitely; no demand)
- Wasm-based edge functions experiment (interesting, no clear demand)

The “what we’re not doing” section is often the most important. It’s where you protect the team from drift.
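
If you want to track outcomes like these through the quarter, a sketch along the following lines may be enough. It assumes each outcome is a lower-is-better number with a baseline and a target; the baselines and targets come from the roadmap above, while the current values and the class itself are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """A measurable theme outcome where lower is better."""
    theme: str
    baseline: float  # value at the start of the quarter
    target: float    # value the team committed to
    current: float   # latest measurement (hypothetical here)

    def progress(self) -> float:
        """Fraction of the committed improvement delivered, clamped to 0..1."""
        committed = self.baseline - self.target
        if committed <= 0:
            raise ValueError("target must be below baseline")
        delivered = self.baseline - self.current
        return max(0.0, min(1.0, delivered / committed))


outcomes = [
    Outcome("deploy faster (minutes, push to prod)", baseline=47, target=20, current=29),
    Outcome("sunset legacy job runner (teams left)", baseline=11, target=0, current=4),
]

for o in outcomes:
    print(f"{o.theme}: {o.progress():.0%} of committed improvement")
```

Clamping keeps a regression from showing up as negative progress and an overshoot from reading as more than done. Whether you want that or the raw number is a judgment call.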

Common Pitfalls

  • Becoming a gatekeeper. The minute the platform team is the bottleneck for production changes, you’ve lost. The job is to enable, not to approve. Approval workflows belong to security, compliance, and the teams owning the change, not the platform team.
  • Building for engineering elegance instead of customer outcome. A beautifully architected platform that nobody uses is worse than an ugly one that everyone adopts. Measure adoption first, elegance second.
  • Confusing “platform” with “everything that’s shared.” Internal libraries that aren’t actively maintained, deprecated services, abandoned tools, all get dumped on the platform team as “platform stuff.” Push back. The platform is the actively maintained surface, not the bone yard.
  • Treating customers as users instead of partners. Customers who feel they have no input become customers who route around you. Invite them into roadmap reviews, not just status updates.
  • Skipping the migration plan. Every time you ship a new paved road, you’re implicitly asking teams to migrate from whatever they have today. Without a migration plan that includes pairing, tooling, and a deprecation timeline, the new road is just additional surface area.

When This Goes Wrong

Customer teams build their own platforms. Diagnosis: your paved road doesn’t fit them, and the off-ramp is too painful or undocumented. Fix: do discovery on the teams that built their own. What was missing? Often a simple feature gap is the answer, and shipping it brings them back.

The platform team becomes the incident response team for everything. Diagnosis: you’ve taken on operational ownership for things you shouldn’t have. Fix: a hard RACI rewrite with leadership backing. Painful, necessary.

The team can’t recruit. Diagnosis: platform engineering is seen as a less interesting career path inside your company. Fix: visible career ladders, clear staff-plus engineering opportunities on the platform team, and putting platform engineers on stage at company all-hands for shipped wins. Will Larson’s Staff Engineer is the right resource to share with engineers considering the path.

Wrapping Up

Running a platform team is more like running an internal startup than running an ops team. The technical choices matter, but the operating model matters more. The teams I’ve seen succeed treat their customers as customers, take internal product management seriously, and ship paved roads that engineers want to be on.

The lessons from 2025 are continuations rather than breakthroughs. The fundamentals haven’t changed in five years. Treat developers as customers. Pave one road well rather than ten roads poorly. Measure outcomes, not uptime. Own what you can operate, partner on the rest. The teams that internalize these things build calmer organizations.

For 2026, the question I’m holding is what AI-native platforms look like. Most platforms today were designed in a world where engineers wrote most of the code. That’s already shifting. The platform that serves engineers in 2027, when twenty to thirty percent of code is agent-written, will look different from the platform that serves engineers today. I don’t know yet what it looks like. That’s next year’s discovery work.
