The Evolution and Structure of Distributed Platforms: Part 4 - Toward an Agentic Software Catalog

What if maintenance wasn’t something engineers did, but something your system continuously executed? With a complete software catalog and agentic skills, updates stop being tickets and start becoming infrastructure.
The Operational Catalog
Previous posts in this series documented the rise of distributed systems and the operational changes they bring. My last post described developer portals and how they can increase the efficiency of those operations. Recently, I’ve been using the Datadog Software Catalog, which offers much of the functionality I proposed in that post. With a working catalog documenting hundreds of system components, I’ve focused on implementing the next steps to automate maintenance work. Combining the structure of a software catalog with the capabilities of coding agents means we’ve reached a point where updates to massively distributed systems can be made in broad strokes with minimal effort.
Moving From Patches to Skills
The first and most necessary shift is to stop thinking in terms of “maintenance updates” and start thinking in terms of maintenance skills. Instead of knowing how to perform certain updates and selectively taking the time to perform them, every update should be done via an agent. Initial updates should be planned within agents, executed with some hand-holding, and upon success, be exported as a reusable skill. “Solved” updates should be automatically applied from shared skills. We’ve reached a point where doing ad-hoc maintenance should be seen as akin to building your own solution for solved problems - meaning it’s fine at the hobbyist level but wasteful at the enterprise level.
Designing for Agent Maintainability
Before agents can participate in maintaining your system at scale, your system needs to be legible to them. Some foundational patterns must be enforced across the board. Components should all be well-formed (as defined in part 2) and based on a template.
Dependencies should be strictly tracked, with top-level dependencies clearly called out and all sub-dependencies pinned in a lockfile. Commands that skills will use need to be standardized across all component repos; e.g., build, test, lint, and update_lockfile must all run similarly within containers and produce clear pass/fail signals.
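That standardized command surface can be sketched as a thin wrapper that reduces any repo's tooling to a clear pass/fail signal. This is a minimal sketch; the `make <command>` convention and the `CommandResult` shape are illustrative assumptions, not a prescribed interface.

```python
import subprocess
from dataclasses import dataclass

# The standardized verbs every component repo must support.
STANDARD_COMMANDS = ("build", "test", "lint", "update_lockfile")

@dataclass
class CommandResult:
    command: str
    passed: bool
    output: str

def run_standard_command(repo_path: str, command: str,
                         runner=subprocess.run) -> CommandResult:
    """Run one standardized command in a repo and reduce it to pass/fail."""
    if command not in STANDARD_COMMANDS:
        raise ValueError(f"{command!r} is not a standardized command")
    # Assumes each repo wraps its container tooling behind a `make <command>`
    # target; the Makefile convention is hypothetical.
    proc = runner(["make", command], cwd=repo_path,
                  capture_output=True, text=True)
    return CommandResult(command, proc.returncode == 0,
                         proc.stdout + proc.stderr)
```

The injectable `runner` keeps the wrapper testable without a real container runtime.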
Composition of Maintenance Skills
With a software catalog full of homogeneous components, we’re free to define skills that automate maintenance across N instances of system components. However, a maintenance skill must be more than instructions on how to perform a particular update. Think of a skill as incomplete unless it can completely replace all parts of maintenance, not just the update itself. Each maintenance skill must be able to find what to work on, build up context on each repo, execute its update successfully, and contribute the result according to organizational standards. Each of these steps can be broken out into a separate sub-skill that can be shared across all skills.
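That decomposition can be expressed as a composition of sub-skills. A minimal sketch, where the sub-skill names (`find_targets`, `build_context`, and so on) are hypothetical stand-ins; a real skill runner would add error handling and reporting:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class MaintenanceSkill:
    """A skill is complete only when it covers the whole lifecycle:
    finding targets, building context, updating, and contributing."""
    name: str
    find_targets: Callable[[], list]       # query the software catalog
    build_context: Callable[[str], dict]   # load conventions for one repo
    execute_update: Callable[[str, dict], bool]
    contribute: Callable[[str], str]       # open a PR per org standards

    def run(self) -> list:
        results = []
        for repo in self.find_targets():
            ctx = self.build_context(repo)
            if self.execute_update(repo, ctx):
                results.append(self.contribute(repo))
        return results
```

Because each stage is a plain callable, the same `find_targets` or `contribute` sub-skill can be reused across every skill in the catalog.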
Findability
Agents must be able to use the software catalog to find targets for broad automation. The software catalog is the foundation for answering questions like:
- Which services are written in Go vs Python?
- Which components are owned by which teams?
- What’s running in production but hasn’t been touched in 18 months?
- Where are the critical dependency edges?
Once your system is indexed along meaningful dimensions, agents can perform these top-down queries, finding and updating the relevant codebases.
| DIMENSION | DESCRIPTION |
|---|---|
| component_type | Top-level grouping, like API, Function, or Worker. I detailed the component types in part 2 of this series |
| owner | The team responsible for the component |
| language/language_version | Runtime language and version |
| lifecycle_status | Status within the SDLC - e.g., undeployed, active, or deprecated |
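With entries indexed along these dimensions, top-down queries become simple filters. A minimal sketch assuming a flat in-memory catalog; a production catalog would expose the same queries through its own API:

```python
from dataclasses import dataclass

@dataclass
class CatalogEntry:
    name: str
    component_type: str
    owner: str
    language: str
    language_version: str
    lifecycle_status: str

def query(catalog, **dimensions):
    """Answer top-down questions like 'all active Go APIs' by matching
    every requested dimension against each catalog entry."""
    return [e for e in catalog
            if all(getattr(e, k) == v for k, v in dimensions.items())]
```

For example, `query(catalog, component_type="API", lifecycle_status="active")` yields exactly the targets a skill should operate on.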
Omni Agents
After traversing the software catalog to find what to work on, we need an AGENTS.md document that describes the conventions of the component type being looked at. For this, the org should maintain a system-level AGENTS file which describes:
- What are the major domains of the system?
- What types of components comprise the system?
- What conventions and patterns exist within component types?
For any component type this should include a rundown of directory structures, framework usage, conventions, and gotchas. It should serve as instructional topology that guides agents on how to find and navigate system components. This should also be a living document that skills can contribute to as new learnings are encountered. In fact, the main advantage of this approach over per-repo AGENTS.md files is that learnings will continue to accumulate, and issues will be solved before they’re encountered in each repo.
Skill-specific logic
The specific logic of the skill varies greatly but should be defined as deterministically as possible. For this reason, I’ve found it best to generate in-depth skill definitions using more capable (and costlier) models and then delegate their broad execution to cheaper models.
Validation steps
Skills that can validate their work are far more effective than single-pass skills that work on a prompt until they think they’re done and exit. A skill’s reliability depends on the extent to which its work can be tested. For this reason, baseline tests are a must, and anything beyond that is highly encouraged. If a skill can launch a dev version of the component locally and trigger it, gauging the behavior against a baseline implementation or an expected result, then the skill can iterate continually.
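The iterate-and-validate loop can be sketched as a bounded retry that feeds validation failures back into the next attempt. The function names and feedback shape here are assumptions for illustration; the iteration budget anticipates the graceful-failure guidelines below.

```python
def iterate_until_valid(apply_update, validate, max_iterations=3):
    """Run a skill's update step, validate it, and feed failure output
    back in until it passes or the iteration budget is spent.

    apply_update(feedback) performs the update, using prior failure
    output when available; validate() returns (passed, feedback).
    Returns the successful attempt number, or None on failure (at which
    point the owning team should be notified)."""
    feedback = None
    for attempt in range(1, max_iterations + 1):
        apply_update(feedback)
        passed, feedback = validate()
        if passed:
            return attempt
    return None
```

Capping `max_iterations` is what keeps a self-correcting skill from burning tokens indefinitely.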
Bridging the Gap: CI/CD as an Interface
Make your CI/CD system accessible - not just to humans, but to agents. Ideally, pipeline steps can be run locally or exposed via controlled interfaces (MCPs, tokens, APIs). This allows agents to validate changes, run tests, observe outcomes, and iterate on failures.
Contribution Guidelines
Things like forking, upstream management, and PR and co-author guidelines must be defined centrally. Agents should mark their model as a co-author on PRs. Additionally, agents should be able to throttle the rate at which they publish PRs: an agent should be able to check the status of its published PRs and stop opening new ones until previous PRs have passed checks.
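A throttling check might look like the following sketch. The `open_prs` dict shape is a hypothetical stand-in for whatever your source-control API actually returns:

```python
def can_open_new_pr(open_prs, max_in_flight=5):
    """Gate new PRs on the state of previously published ones.

    open_prs is a list of dicts like {"id": 1, "checks": "passed"},
    an assumed shape for illustration. A new PR is allowed only when
    every published PR has passed its checks and the agent is under
    its in-flight limit."""
    pending = [pr for pr in open_prs if pr["checks"] != "passed"]
    return len(pending) == 0 and len(open_prs) < max_in_flight
```

The `max_in_flight` cap is a second, independent brake so that even all-green agents cannot flood reviewers.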
Graceful Failures
Agents that can iterate can end up burning through tokens - for this reason, agents should have guidelines on how many iterations to perform before giving up. A failure should produce a clear message to the owner team that automated updates cannot be run.
Running the System: Scheduled Agentic Work
Once you have a catalog, a set of skills, and a way to execute them, you can begin to operate the system continuously. For this, you’ll need workers in a secure execution environment with appropriate access to source control, CI/CD systems, the software catalog, and any internal-only resources used in development. This is essentially an engineer’s full setup, including an identity that changes can be traced back to. Those workers should be controlled from a higher-level work-scheduling system that allows for the rollout of one-off changes or continual maintenance patches.
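At its core, that scheduling layer fans out (skill, component) pairs to identity-bearing workers. A minimal sketch using an in-process queue as a stand-in for a real scheduler; the work-item shape and identity string are illustrative assumptions:

```python
import queue

def schedule_maintenance(skills, components, worker_identity):
    """Enqueue one work item per (skill, component) pair, each tagged
    with the traceable identity the worker will run under."""
    work = queue.Queue()
    for skill in skills:
        for component in components:
            work.put({"skill": skill, "component": component,
                      "identity": worker_identity})
    return work
```

The identity tag on every item is what lets an automated change be traced back, exactly as an engineer's commits would be.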
New Challenges
Perhaps the biggest new challenge, and something I’ll cover in the next post in this series, is the new bottleneck of PR reviews.