Rafayel Hovhannisyan

Senior Software Engineer at EasyDMARC



Designing Codebases for AI-Assisted Development

A practical note on how codebase shape, documentation, token economy, and enforced patterns affect AI-assisted development.


Summary

AI-assisted development is mostly a documentation and pattern problem.

Model quality still matters, but in day-to-day engineering the stronger determinant is the shape of the codebase the model is asked to operate inside. If names are unstable, patterns silently fork, instructions are oversized, and conventions are only implied, the model has too many plausible continuations. If the codebase is coherent, the model has fewer.

This is why codebase design matters for AI-assisted work. The goal is not to make the repository “AI-friendly” in a superficial sense. The goal is to reduce ambiguity in the places where models infer structure, retrieve facts, and choose between multiple locally plausible implementations.


Semantic Association and Conceptual Integrity

LLM-based tools operate through semantic association. They infer likely structure from nearby names, repeated code shapes, file organization, examples, and instructions that appear relevant to the current task.

That makes Conceptual Integrity - Monotonic Pattern Discipline directly relevant to AI-assisted development. A model does not need many competing patterns to become less reliable. Two or three locally valid approaches to the same problem are often enough to make the next suggestion less predictable. When in-context patterns are ambiguous, the model falls back on pre-training defaults - generic patterns from training data rather than repository-specific conventions.

When one dominant pattern exists, the model has a stronger local precedent:

  • naming becomes easier to continue correctly
  • surrounding code provides clearer examples
  • retrieval is more likely to surface the right implementation shape
  • suggestions are less likely to blend incompatible styles

Scope of this principle. Pattern uniformity matters most within a bounded unit: a module, a service, a component area. Microservice architectures are deliberately polyglot, and AI tools increasingly operate at that level of granularity rather than across an entire monorepo. The requirement is local consistency - standardize within the unit the agent is working in - not global uniformity across a heterogeneous system. What matters is that the agent sees one dominant pattern in context, not that the entire codebase uses the same pattern everywhere.


Names and Types as Steering Mechanisms

Names are not just a readability concern. In AI-assisted development, identifiers function as semantic anchors that directly steer model output.

Research confirms this is a measurable effect. Le et al. showed that removing identifier names - while leaving code structure intact - causes models to regress from intent-level summaries to line-by-line narration, and degrades tasks that should depend only on structure, such as execution prediction. Wang et al. found consistent degradation on code analysis benchmarks when meaningful names are replaced with opaque identifiers. In dynamically typed languages, naming carries even more weight because type declarations cannot compensate for a vague identifier.

This effect is not absolute. Fine-tuning on obfuscation-augmented datasets can partially recover performance, and specialized tools exist for working on obfuscated or minified code. But standard models - the ones teams use in practice - rely on names as strong signals for intent, not just as labels. Investing in precise, domain-specific names pays off regardless of tooling.

Type annotations narrow the space of plausible completions further. A UserId that is structurally incompatible with an OrderId prevents the model from passing the wrong identifier even when both are strings at runtime. Research on type-constrained code generation shows that enforcing type correctness during generation reduces compilation errors by roughly half compared to unconstrained generation, and that syntax-only constraints achieve a fraction of that improvement. The mechanism studied - constrained decoding - is a runtime technique, not a static property of source code. But the underlying principle transfers: the richer the type surface available to the model, the smaller the space of plausible-but-incorrect completions, and the more the compiler or type checker can automatically reject.

Note that the type-error reduction finding was measured in TypeScript specifically. In dynamically typed languages, LLM-generated errors are more often semantic than type-related, so type annotations cannot be the primary correctness lever there. The broader principle - prefer rich type definitions over untyped or loosely typed code - holds, but its impact varies by language.

For engineers, this means:

  • invest in precise, domain-specific names - a UserId is a stronger signal than a string
  • prefer rich type definitions over untyped or loosely typed code
  • treat branded types and domain-specific type aliases as part of the AI control surface, not just developer ergonomics
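The branded-type idea can be sketched in a few lines of TypeScript. The `Brand` helper, `UserId`, and `loadOrder` are illustrative names for this note, not a specific library's API - the point is only that two runtime strings become incompatible at the type level.

```typescript
// Minimal branded-type sketch. `Brand`, `UserId`, and `loadOrder` are
// illustrative names, not a particular library's API.
type Brand<T, Name extends string> = T & { readonly __brand: Name };

type UserId = Brand<string, "UserId">;
type OrderId = Brand<string, "OrderId">;

// Constructors are the only place a raw string is cast into the branded type.
const asUserId = (raw: string): UserId => raw as UserId;
const asOrderId = (raw: string): OrderId => raw as OrderId;

function loadOrder(id: OrderId): string {
  return `order:${id}`;
}

loadOrder(asOrderId("o-42"));  // compiles
// loadOrder(asUserId("u-7")); // type error: UserId is not assignable to OrderId
```

The model sees the same constraint the compiler enforces: a completion that passes a `UserId` where an `OrderId` is expected is rejected mechanically, not in review.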

Token Economy Is a Design Constraint

Large context windows do not remove the need for discipline. They make discipline more important.

Token economy matters for two reasons:

  • fact retrieval accuracy depends on whether the relevant instruction or example is easy to locate in context - positional weighting is not uniform, and models attend less to material in the middle of long contexts
  • effective context load depends on how much low-signal material competes with the task at hand

This is not just a cost issue. It is a reliability issue. A repository may have enough total context capacity to include everything, but that does not mean the model will use every part of it equally well. The “Lost in the Middle” finding from Liu et al. - that models perform significantly worse on information placed in the middle of long contexts - has been widely replicated and is well-supported across tasks and architectures.

However, the picture has evolved since 2023. Recent evaluations of 2025-era frontier models (GPT-4.1, Claude Sonnet 4, Gemini 2.5) show that the original U-shaped positional pattern is partially mitigated in newer models. A more robust concern has replaced it: Du et al. (EMNLP 2025) found that context length alone degrades performance by 13–85%, even when irrelevant tokens are replaced with whitespace and models are constrained to attend only to relevant material. The problem has shifted from “where in the context” to “how much context” - longer windows hurt, regardless of position.

This is supported from a different angle by empirical work on agent context files (Gloaguen et al., ETH Zurich, 2026): LLM-generated context files reduced task success by ~3% while increasing token cost by over 20%, even though the files were intended to help. Human-written context files provided a marginal ~4% improvement. The conclusion: over-documentation is not a theoretical risk. The distinction between human-written and LLM-generated context matters more than whether a context file exists at all. Counterintuitively, Cursor’s own A/B testing found that providing agents with only tool names and letting them fetch details dynamically - rather than pre-loading full context - reduced total token usage by nearly half while maintaining task success.

In practice, that means:

  • avoid oversized top-level instruction files; Anthropic’s guidance warns that CLAUDE.md files that are too long cause important rules to get lost in the noise
  • avoid repeating the same rule in several weakly different ways
  • keep high-signal examples easy to retrieve
  • separate durable decisions from operational instructions
  • prefer human-written context files; avoid auto-generating them from the codebase

More context is not automatically more clarity. Once noise grows faster than signal, effective context quality drops even if raw context capacity is still available.
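As a concrete sketch, a compact top-level instruction file can stay under a screenful while still carrying high-signal constraints. The file names, commands, and paths below are illustrative, not a prescription:

```markdown
# CLAUDE.md (sketch - commands and paths are illustrative)

## Commands
- build: `pnpm build`
- test: `pnpm test`
- lint + typecheck: `pnpm check`

## Constraints
- Repository methods return `Result<T, DomainError>`; never throw from a repository.
- Use the test factories in `tests/factories/`; do not construct domain objects inline.

## Pointers
- Canonical patterns: `docs/patterns/`
- Decisions and tradeoffs: `docs/adr/`
```

Everything durable or discursive lives behind the pointers, where it can be fetched when relevant instead of occupying context on every task.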


Documentation as Control Surface

Documentation should not be treated as generic prose around the codebase. In AI-assisted development, it becomes part of the control surface.

Different documents serve different purposes:

  • CLAUDE.md / AGENTS.md should carry working instructions, common commands, and task-relevant operating constraints - kept minimal and human-written
  • ADRs should record decisions, tradeoffs, and what the codebase has explicitly chosen
  • pattern documents should show canonical solutions for recurring problems
  • guidelines should capture defaults, conventions, and team-level expectations

Collapsing all of that into one giant file usually makes the system worse. The model receives more tokens, but fewer clean signals. Smaller, purpose-specific documents are easier to load, easier to maintain, and easier for both humans and models to apply correctly.

Specificity matters as much as structure. Vague instructions like “follow best practices” provide no actionable constraint. Instructions should be concrete enough to verify mechanically: “use Result<T, DomainError> for all repository methods; never throw from a repository” is actionable in a way that “handle errors properly” is not. Prompt sensitivity research (Gonen et al., EMNLP 2023; Sclar et al., ICLR 2024) shows that surface-level wording changes - not just semantic content - can cause large accuracy swings in code generation tasks. Precise wording is not cosmetic.
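The `Result<T, DomainError>` rule quoted above can be made concrete in a few lines. This is a sketch of one possible shape - the `DomainError` fields and the `findUser` example are illustrative, not a specific codebase's types:

```typescript
// Sketch of the "repositories return Result, never throw" convention.
// `DomainError`, `ok`, `err`, and `findUser` are illustrative names.
type DomainError = { code: string; message: string };

type Result<T, E> =
  | { ok: true; value: T }
  | { ok: false; error: E };

const ok = <T>(value: T): Result<T, never> => ({ ok: true, value });
const err = <E>(error: E): Result<never, E> => ({ ok: false, error });

// A repository method conforming to the rule: failure is a value, not a throw.
function findUser(id: string): Result<{ id: string }, DomainError> {
  if (id === "") return err({ code: "USER_NOT_FOUND", message: "empty id" });
  return ok({ id });
}
```

A rule stated at this level of precision can be checked by a lint rule or a reviewer in seconds; "handle errors properly" cannot.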

The same principle applies to examples. Spotify’s engineering team, after deploying over 1,500 AI-generated pull requests through their internal coding agent, identified concrete code examples as one of the strongest levers for outcome quality: “Use examples. Having a handful of concrete code examples heavily influences the outcome.” The codebase itself is a few-shot prompt. A short canonical pattern document plus one strong implementation example is usually more valuable than a long abstract explanation with no concrete precedent. Importantly, Spotify’s examples are embedded in agent prompts for specific migration tasks - this is context engineering for agentic workflows, not a claim about codebase architecture in general.


Machine-Enforced Regularity

Patterns should be documented, but the most important ones should also be enforced mechanically.

Linting matters here not because formatting is sacred, but because structural regularity reduces ambiguity. Rules such as import grouping and ordering, member grouping and ordering, and consistent type import usage reduce the number of shapes a file can take. This is practitioner intuition rather than an empirically established finding - controlled studies of linting's effect on LLM output quality do not yet exist. But the reasoning is sound: consistent structure reduces token-level variation in what the model sees, which should make next-token prediction more stable.

That helps in several ways:

  • the model sees fewer equivalent but different local forms
  • diffs become easier to read and compare
  • generated code is less likely to drift into repo-specific style violations
  • reviewers spend less time correcting surface inconsistency

In monorepo setups, module boundary rules deserve special attention. Cross-boundary imports are a common AI-generated error: the model sees a type it needs, finds it in a sibling module, and imports it - without knowing that the architectural intent forbids that dependency. A boundary lint rule catches this automatically and lets the agent self-correct without human intervention.
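One way to encode such a boundary is ESLint's built-in `no-restricted-imports` rule. A sketch of a flat-config entry - the package names (`@acme/billing`, `packages/ordering`) are illustrative:

```typescript
// eslint.config.ts (sketch). Uses ESLint's built-in `no-restricted-imports`
// rule; the module and package names are illustrative.
export default [
  {
    files: ["packages/ordering/**/*.ts"],
    rules: {
      "no-restricted-imports": [
        "error",
        {
          patterns: [
            {
              group: ["@acme/billing/*"],
              message:
                "ordering must not depend on billing internals; use the published @acme/billing API instead.",
            },
          ],
        },
      ],
    },
  },
];
```

The `message` field matters for agents: it tells the model why the import is forbidden and what to do instead, turning a failed lint run into a self-correction step.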

In this sense, linting is part of the AI-assisted development stack. It turns conventions from “good ideas” into guaranteed repository constraints.


Operational Feedback Loops

AI output should be cheap to verify.

If build, lint, test, and typecheck workflows are obscure, slow, or inconsistent, low-quality output survives longer than it should. The issue is not that the model made a mistake. The issue is that the repository made the mistake expensive to detect.

Fast feedback loops improve AI-assisted development because they shorten the path between “plausible” and “proven.” The more quickly a generated change can be checked against real constraints, the less value there is in arguing about whether the output “looks right.”

One practical measure is to introduce a wrapper command that filters build, format, lint, or test output before it reaches the model. If the wrapper removes warnings, success markers, timing metadata, and other non-actionable noise while preserving errors, the model gets a denser and more useful signal. DORA’s 2025 research on AI-assisted development found that teams with loosely coupled architectures and fast feedback loops saw meaningful productivity gains from AI tools, while teams with slow or opaque validation pipelines saw little benefit - the feedback loop, not the model, was the bottleneck.
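A wrapper like that can be very small. The sketch below (Node/TypeScript) runs a check command and forwards only actionable lines; the filter heuristics are illustrative and would need tuning per toolchain:

```typescript
// Sketch of a verification wrapper: run a check command and forward only
// actionable output. The noise patterns are illustrative, not a standard tool.
import { spawnSync } from "node:child_process";

const NOISE = [
  /^Done in [\d.]+s/, // timing metadata
  /^PASS\b/,          // success markers
  /^\s*$/,            // blank lines
  /\bwarning\b/i,     // non-blocking warnings
];

export function filterOutput(lines: string[]): string[] {
  return lines.filter((line) => !NOISE.some((re) => re.test(line)));
}

export function runFiltered(cmd: string, args: string[]): string {
  const result = spawnSync(cmd, args, { encoding: "utf8" });
  const merged = `${result.stdout ?? ""}\n${result.stderr ?? ""}`.split("\n");
  return filterOutput(merged).join("\n");
}
```

The model then receives the one line that matters - the error - instead of a page of timing data and pass markers competing with it for attention.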

That is another reason to keep operating instructions explicit. Common commands, validation steps, and expected checks belong in the working documentation, not only in team memory.


Task Decomposition

Codebase design sets the ceiling for what the model can reliably do. Task decomposition determines how much of that ceiling the team actually reaches.

The relationship between task scope and success rate is not linear. Small, well-bounded tasks - a single function, a single component matching a reference pattern, a single entity with tests - succeed at high rates. Larger tasks succeed at much lower rates and require either decomposition or significant human steering. This is a consequence of compound uncertainty: each additional decision the agent must make multiplies the chance of diverging from the intended approach.

In layered architectures, the most reliable decomposition follows the layers: domain first, then data-access, then feature/UI. Each step has a clear input (the previous layer’s public API), a clear output, and its own verification command. The agent completes and verifies one step before moving to the next.

Recurring task patterns - “create a new entity,” “add a feature page” - should be codified into reusable workflow definitions rather than free-form prompts. This enforces the team’s preferred decomposition and reference patterns by default.
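As a sketch, a reusable workflow can live as a markdown command file of the kind some agent tools support (for example, Claude Code reads commands from `.claude/commands/`). The paths, reference files, and commands below are illustrative:

```markdown
<!-- .claude/commands/new-entity.md (sketch - paths and commands are illustrative) -->
Create a new domain entity named $ARGUMENTS.

1. Domain: add the entity in `src/domain/`, following `src/domain/order.ts`
   as the reference pattern. Run `pnpm check`.
2. Data access: add a repository returning `Result<T, DomainError>`,
   following `src/data/order-repository.ts`. Run `pnpm test --filter data`.
3. Feature: wire the entity into the feature layer. Run `pnpm check && pnpm test`.

Complete and verify each step before starting the next.
```

The decomposition, the reference patterns, and the verification commands are all fixed in the workflow, so the agent does not have to rediscover them per task.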


Test Infrastructure as Specification

Types constrain the space of code the model can generate. Tests constrain the space of behavior it can produce.

When a test suite exists before the implementation, it functions as a machine-verifiable specification. The agent generates code, runs the tests, and self-corrects until they pass. This is more reliable than relying on the agent to infer intent from prose, because tests produce binary pass/fail signals rather than probabilistic matches against natural language.

Kent Beck describes this as TDD becoming a “superpower” with AI agents specifically because tests prevent AI-introduced regressions. The failure mode to watch for is agents deleting or weakening tests in order to make them pass - a signal that the specification is being circumvented rather than satisfied.

Test factories (Object Mothers, test builders) prevent the agent from constructing invalid test data. Instruction files should direct the agent to use them rather than constructing domain objects directly, for the same reason branded types are preferred over raw primitives.
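A minimal factory sketch in TypeScript - `buildUser` and its default values are illustrative names, not from any particular codebase:

```typescript
// Sketch of a test builder ("Object Mother" style). `buildUser` and its
// defaults are illustrative, not a specific codebase's fixtures.
type User = { id: string; email: string; active: boolean };

// Every field has a valid default; tests override only what they assert on.
export function buildUser(overrides: Partial<User> = {}): User {
  return {
    id: "user-1",
    email: "user-1@example.com",
    active: true,
    ...overrides,
  };
}

// Usage: the test states only the fact it cares about.
const inactive = buildUser({ active: false });
```

Because the factory owns the invariants, an agent that overrides one field cannot accidentally produce a structurally invalid fixture for the rest.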


Continuous Calibration

Every time a reviewer corrects AI-generated code, there is an implicit signal: the agent’s context was incomplete or its constraints were insufficient. The question to ask is: “Could this correction have been prevented by an instruction file update, a lint rule, or a better canonical example?”

If yes, the correction should produce a documentation or tooling change in the same review cycle. Over time, this feedback loop transfers patterns from human review memory into the agent’s operating environment. The rate of “same class of correction” recurring should decrease.

Martin Fowler frames the current posture well: treat every AI output as a pull request from a highly productive but untrustworthy collaborator - review everything, and route the patterns from that review into constraints the collaborator cannot bypass next time.


Practical Rules

  • Keep top-level instruction files compact, task-oriented, and human-written.
  • Prefer one dominant pattern per recurring problem within a module or service; global uniformity across heterogeneous systems is not required.
  • Use precise, domain-specific names and types - they steer the model as much as instructions do.
  • Write instructions specific enough to verify mechanically - vague guidance is noise.
  • Record decisions separately from working instructions.
  • Back important conventions with linting and formatting rules, including module boundary enforcement.
  • Keep canonical examples close to the code they describe - concrete examples steer more than prose.
  • Make build, lint, typecheck, and test commands easy to discover; filter their output to remove noise before it reaches the model.
  • Decompose tasks to the smallest independently verifiable unit.
  • Write tests before implementation when possible; use test factories to prevent invalid fixtures.
  • Treat every recurring review correction as a candidate for a lint rule or instruction update.
