AI Builds Pretty Pages That Fail Accessibility. Better Prompts Fix That
A four-phase experiment across Builder Fusion, Claude Code, GitHub Copilot, and Vercel V0 shows that a native-first accessibility preamble beats both silence and ARIA-stuffed instructions.

I love how fast modern AI coding tools make me. I don’t love that they happily reproduce decades of bad habits. Ask them for a polished storefront homepage and you’ll get something beautiful that often fails basic accessibility checks. That tension became the starting point for a research sprint: could careful prompting teach AI to generate accessible UI, not just plausible UI?
We designed a controlled experiment across four popular tools—Builder Fusion, Claude Code (Sonnet 4), GitHub Copilot (GPT-5), and Vercel V0—targeting the same mini project: a minimalist luxury watch homepage with a sticky header and mega menu, welcome dialog, predictive search, two carousels, a journal grid with “load more,” and a standard footer. We tested the five most failure-prone components and evaluated results against WCAG 2.1 AA.
The twist wasn’t the project. It was the prompts.
Phase 1: Say nothing about accessibility. Our baseline prompt described the page and components, with constraints to ship in three files (HTML, CSS, JS). No accessibility guidance. Results looked slick and failed predictably: missing semantic structure, inconsistent or incorrect ARIA, weak keyboard support, unlabeled images and form controls. Conformance clustered in the 70–80% range. Pretty, but broken.
Phase 2: Say “make it WCAG 2.1 AA compliant.” This simple nudge helped. All four tools improved heading structure and labeling; some started auto-adding alt text and form labels. But the persistent issues stayed persistent: shaky focus management, inconsistent keyboard behavior, and live updates that weren’t announced to assistive tech. Scores plateaued around 82–83%. Better, not good.
Phase 3: Ask an LLM to rewrite the prompt with explicit success criteria. Here we asked AI to expand our instructions with detailed accessibility requirements. It did—mostly by stuffing ARIA everywhere. We saw better landmarks and some keyboard gains, but also a new class of “A-level” failures created by unnecessary and incorrect roles. The net effect: minimal score movement and a few regressions, with most tools hovering ~81–83%. In some cases we fixed AA issues while introducing fresh A issues. That’s not progress.
Phase 4: Lead with a native-first accessibility checklist. Instead of burying accessibility inside the build request, we put it up front as a prerequisite and kept it simple: semantic HTML first, ARIA only if needed; logical tab order; visible focus; skip link respected; clear labels and error handling; polite live regions for search and assertive for form errors; contrast targets; respect reduced motion. Then we asked for the same page. This changed the pattern. Predictive search and dialogs leaned on native elements; menus behaved; labeling stopped multiplying; keyboard navigation felt natural. Scores rose again—Claude Code and V0 were standouts—while Copilot dipped a few points due to a couple of stubborn ARIA misuses.
The takeaway is blunt: moving from “ARIA-heavy accessibility” to a native-first prompting strategy produced cleaner, more compliant code with fewer false positives and stronger keyboard support.
Across phases, the broad trend was clear. Baseline conformance sat roughly in the mid-70s to ~80%. Adding “make it accessible” nudged tools into the low-80s. The ARIA-stuffed rewrite didn’t unlock a new level; if anything, it risked new breakage. The native-first prompt stabilized semantics and lifted quality again, with Claude Code and V0 edging ahead in our final round, and Copilot showing a small drop tied to specific ARIA attribute errors.
Numbers aren’t the whole story, though. The feel of the output changed. When tools relied on native patterns first, focus order made sense, keyboard interactions worked without gymnastics, and assistive tech announced changes reliably. When tools reached for ARIA early, they over-labeled, mis-labeled, or conflated patterns—most visibly around combobox behavior in predictive search. That’s exactly the kind of “helpfulness” that can sabotage accessibility in the real world.
Modern browsers already implement a huge amount of accessible behavior. Semantic HTML exposes names, roles, and states that assistive tech understands. When you start with those primitives, you inherit the right accessibility defaults. ARIA is powerful, but it’s meant to fill gaps, not reinvent everyday controls. Ask an AI to sprinkle ARIA, and it will. Ask it to use native HTML first, and it uses the safest path.
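The difference is easy to see in markup. Here is a hypothetical sketch (not output from any of the tested tools) of the same trigger control written ARIA-heavy versus native-first:

```html
<!-- ARIA-heavy: rebuilds a button from a div. It needs role, tabindex,
     and separate keyboard handling just to match native behavior, and
     each of those is a fresh opportunity for a bug. -->
<div role="button" tabindex="0" onclick="openMenu()">Menu</div>

<!-- Native-first: focusable, keyboard-activatable, and announced as a
     button by assistive tech with no extra attributes. -->
<button type="button" onclick="openMenu()">Menu</button>
```

The native version inherits focus handling, Enter/Space activation, and the correct accessible role for free; the custom version has to earn each of those by hand.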
The results aligned with that principle. Our best runs leaned into native navigation landmarks, form labels, lists and headings, dialog semantics, and buttons that behaved like buttons. The fewer custom roles we introduced, the fewer edge cases we created.
If your default prompt is “build X and make it accessible,” you’ll get inconsistent gains and recurring pain. Swap that for a checklist-style preamble that sets the ground rules, then describe your components. Keep it human and semantic, not jargon-heavy. You don’t have to know the exact ARIA recipe for a combobox; you do need to state the outcomes you want: logical tab order, arrow-key navigation in menus, Enter to select, polite live announcements for search results, assertive error announcements for forms. The model can translate those outcomes into code, and it will do better when nudged toward native elements first.
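To make the structure concrete, here is a minimal sketch of the preamble-then-components approach. The `buildPrompt` helper and its exact wording are illustrative assumptions, not part of the experiment's tooling:

```javascript
// Hypothetical helper: prepends accessibility guardrails to any
// component description before it is sent to a coding tool.
const GUARDRAILS = [
  "Use semantic HTML first; add ARIA only if a native pattern can't express the behavior.",
  "Provide logical tab order, visible focus, and a working skip link.",
  "Label all controls; announce live updates politely for search, assertively for form errors.",
  "Meet WCAG 2.1 AA color contrast and respect reduced motion.",
].join("\n- ");

function buildPrompt(componentDescription) {
  // Guardrails lead as a prerequisite; the build request follows.
  return `Accessibility prerequisites:\n- ${GUARDRAILS}\n\nNow build: ${componentDescription}`;
}

const prompt = buildPrompt(
  "a header with logo, predictive search, and a mega menu"
);
```

The point of the shape is ordering: the ground rules come first, so the model treats them as constraints on everything that follows rather than as an afterthought appended to a finished design.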
Two practical cautions from the experiment:
1. Prompting isn’t a silver bullet. Even in the best runs, we still reviewed output with real tools and caught edge cases. Keep humans in the loop.
2. Tool behavior varies. Across our tests, Claude Code and Vercel V0 responded especially well to the native-first approach; Builder Fusion needed more re-prompting; Copilot (GPT-5) was solid overall but occasionally overreached with ARIA, which hurt its final score in Phase 4. Your mileage will vary by component and update cadence, so verify.
Here’s the pattern and sample prompts (in quotation marks) that worked for me; adapt as needed:
1. Lead with the guardrails. “Use semantic HTML first. Only add ARIA if a native pattern can’t express the behavior. Provide logical tab order, visible focus, a working skip link, and label all controls. Announce live updates politely for search and assertively for form errors. Meet WCAG 2.1 AA color contrast. Respect reduced motion.”
2. Describe the component goals, not the ARIA. “Build a header with logo, predictive search, and a mega menu. The menu should support arrow keys and Enter to select. The search should announce result counts without stealing focus. Dialogs must trap focus and close with Escape.”
3. Ask for validation. “Append a short explanation of how this meets the navigation and forms criteria above.”
4. Iterate per component. Generate, test, and refine one complex widget at a time (menu, search, dialog, carousel) rather than the whole page at once. It’s faster to isolate and fix interaction bugs when you’re not diffing a thousand lines of combined output.
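For the “announce result counts without stealing focus” outcome, a minimal sketch of the pattern, with illustrative function names of my own choosing: keep a polite live region in the page and update only its text, so assistive tech announces the change while focus stays in the search input.

```javascript
// Pure formatter, easy to test without a DOM.
function formatResultAnnouncement(count) {
  if (count === 0) return "No results found";
  return `${count} result${count === 1 ? "" : "s"} available`;
}

// In the page, the markup would carry a polite live region, e.g.:
//   <div id="search-status" aria-live="polite" class="visually-hidden"></div>
// Setting its textContent triggers a polite announcement without
// moving focus away from the search input.
function announceResults(count, liveRegion) {
  liveRegion.textContent = formatResultAnnouncement(count);
}
```

Because the live region is `polite`, the announcement waits for the user’s current interaction instead of interrupting it; reserving `assertive` for form errors matches the guardrail wording above.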
That’s the workflow I now use day-to-day. It’s simple enough for non-experts, and it consistently produces code that’s closer to shippable on the first pass.
AI mirrors the patterns we reward. If you ask for a page, it’ll copy patterns from the web—warts included. If you ask for a page that honors human needs first, it will move in that direction. Our experiment didn’t make any tool perfect, but it did prove a reliable way to get better outcomes: teach your AI to start with the platform’s built-in accessibility, then add only what you must. That one change took us from “pretty but broken” to “cleaner, more compliant, and keyboard-friendly.” It will do the same for your team.