I Built Wintura.ai with Claude as My Pair-Programmer — Here's What AI Can and Can't Do in 2026
First-person breakdown of where Claude Sonnet 4.6 + Haiku 4.5 worked and where they failed across 6 months of shipping a production B2B SaaS solo. Real examples, not benchmarks.
TL;DR. I shipped wintura.ai — a production multi-tenant B2B SaaS — solo in 6 months. Claude Sonnet 4.6 + Haiku 4.5 + Cursor were my full-time pair-programmer for the entire build. AI was excellent at four things: scaffold generation, boilerplate refactoring, test-case enumeration, and documentation drafting. AI was bad at four things: multi-tenant boundary design, security-sensitive auth flows, error-handling edge cases, and any code where the right answer is don't write the code at all. Per JetBrains' April 2026 survey, 90% of developers regularly used at least one AI tool at work in January 2026, and per Taskade's State of Vibe Coding 2026 roughly 41% of new code shipped is AI-generated — but the question "can AI build my app?" depends on whether you have someone qualified to make the decisions AI can't.
The honest answer in one sentence
Yes, AI can build most of the code in your app — roughly 41% of new code shipped is AI-generated, per Taskade's State of Vibe Coding 2026 (citing Hashnode 2026 data). The remaining ~59% — architectural decisions, security boundaries, production-hardening edge cases — needs a human who knows what they're looking at. The founders who succeed with AI tools in 2026 understand which 59% they're personally accountable for. The founders who fail think the 41% is the whole job.
This post is the verifiable first-person breakdown. I'm not benchmarking — I'm posting actual examples from the 6 months I spent shipping Wintura.
What AI was excellent at (the 41%)
1. Scaffold generation
Claude generated the initial Next.js 16 App Router structure, the database schema migrations, the API route handlers, and the React Server Component boilerplate for every page in Wintura. Scaffolding like this is the use case behind the market numbers: Cursor reached $2B ARR in February 2026 (the fastest B2B SaaS to that milestone, doubling from $1B in just three months), and Taskade's State of Vibe Coding report sizes the vibe-coding market at $4.7B in 2026.
Specific example: writing the initial /lib/db/proposals.ts file with 23 query functions — list, get, create, update, delete, plus search, paginate, filter-by-status, filter-by-client. AI wrote the entire file from a one-paragraph description. I reviewed it in 10 minutes. Without AI: 2-3 hours of typing.
2. Boilerplate refactoring
Renaming a TypeScript field across 47 files. Converting a class component to a function component. Migrating from next/legacy/image to next/image. Pulling repeated logic into a shared hook. All trivial for AI in 2026 — write the goal, point at the codebase, review the diff.
Specific example: the April 2026 rebrand from Burnt Sienna (#C2410C) to Cobalt (#2563EB). 30+ files swept in 10 minutes via Cursor's multi-file edit. Manually: a full afternoon plus inevitable misses.
3. Test-case enumeration
The 24 e2e test files in Wintura — per the Wintura.ai case study — were drafted by Claude from a flow description, then I edited the assertions and edge cases. AI is genuinely better at enumerating test cases than I am, because I have blind spots from being the author of the production code. AI doesn't.
Specific example: the proposal-send flow test. I described the happy path. Claude wrote 14 test cases including: client without billing info, client with deleted billing, proposal with no scope items, proposal exceeding character limits, tenant with revoked Stripe Connect account, network failure mid-send, double-submit race condition. I would have written 4 of those 14 myself.
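To show what an enumerated case list can look like in practice, here is a minimal table-driven sketch. It is not Wintura's test suite: the `SendInput` fields, the `checkSend` function, and the 10,000-character limit are all assumptions made up for illustration, and it only covers the cases from the list above that can be expressed as input validation.

```typescript
// Hypothetical sketch: field names, the limit, and checkSend are illustrative assumptions.

interface SendInput {
  clientHasBilling: boolean;
  scopeItems: string[];
  bodyLength: number;
  stripeAccountActive: boolean;
}

type SendCheck = "ok" | "missing_billing" | "empty_scope" | "too_long" | "stripe_revoked";

const MAX_BODY_LENGTH = 10_000; // assumed character limit

// Precondition check that runs before a proposal is sent.
function checkSend(input: SendInput): SendCheck {
  if (!input.stripeAccountActive) return "stripe_revoked";
  if (!input.clientHasBilling) return "missing_billing";
  if (input.scopeItems.length === 0) return "empty_scope";
  if (input.bodyLength > MAX_BODY_LENGTH) return "too_long";
  return "ok";
}

const happyPath: SendInput = {
  clientHasBilling: true,
  scopeItems: ["Design"],
  bodyLength: 500,
  stripeAccountActive: true,
};

// Each case overrides one field of the happy path, so the table reads as a list of edge cases.
const cases: Array<{ name: string; input: SendInput; expected: SendCheck }> = [
  { name: "happy path", input: happyPath, expected: "ok" },
  { name: "client without billing info", input: { ...happyPath, clientHasBilling: false }, expected: "missing_billing" },
  { name: "proposal with no scope items", input: { ...happyPath, scopeItems: [] }, expected: "empty_scope" },
  { name: "proposal exceeding character limit", input: { ...happyPath, bodyLength: MAX_BODY_LENGTH + 1 }, expected: "too_long" },
  { name: "tenant with revoked Stripe account", input: { ...happyPath, stripeAccountActive: false }, expected: "stripe_revoked" },
];

// Returns the names of the failing cases; an empty array means every case passed.
function runCases(): string[] {
  return cases.filter((c) => checkSend(c.input) !== c.expected).map((c) => c.name);
}
```

Adding a newly discovered edge case is one more row in `cases`. Network failures and double-submit races need a different harness (mocked I/O, concurrency control) and are left out of this sketch.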
4. Documentation drafting
CLAUDE.md, decisions-log.md, design-tokens.json descriptions, code comments. AI drafts. Human edits. Result: thorough documentation that wouldn't have existed if I'd been writing it from scratch under shipping pressure.
What AI was bad at (the 59%)
1. Multi-tenant boundary design
The single hardest part of Wintura was the multi-tenancy architecture. Claude is genuinely capable of writing tenant-scoped queries when you ask explicitly. Claude is genuinely incapable of recognizing that tenant-scoping is the architectural problem you should be solving in the first place.
Specific example: when I prompted "build a function to list all proposals," Claude wrote a working function with no tenantId enforcement. When I prompted "build a multi-tenant function to list proposals," Claude added a tenantId parameter but enforced it at the application layer rather than through Row-Level Security at the database layer. The right architectural choice (enforce at the database, type-check at compile time, never allow application-layer-only scoping) was a decision I had to make. Claude executes my architectural decisions well. It does not make them.
This is the gap Cursor's documentation describes when they say AI is a force multiplier, not a substitute for judgment. The judgment is the 59%.
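To make "enforce at the database, type-check at compile time" concrete, here is a minimal sketch. It is not Wintura's actual code. It assumes PostgreSQL, a `proposals` table with a `tenant_id` column, and a session setting named `app.tenant_id`; every identifier below (`TenantId`, `asTenantId`, `listProposals`, `PROPOSALS_POLICY`) is hypothetical.

```typescript
// Hypothetical sketch: names, table, and setting key are illustrative assumptions.

// Branded type: a plain string cannot be passed where a TenantId is required.
type TenantId = string & { readonly __brand: "TenantId" };

// The only way to obtain a TenantId, e.g. from a verified session.
// The format check is deliberately loose (36 hex-or-dash characters).
function asTenantId(raw: string): TenantId {
  if (!/^[0-9a-f-]{36}$/i.test(raw)) {
    throw new Error("invalid tenant id");
  }
  return raw as TenantId;
}

// Row-Level Security policy this sketch assumes is installed (PostgreSQL).
// Note: table owners bypass RLS unless FORCE ROW LEVEL SECURITY is also set.
const PROPOSALS_POLICY = `
ALTER TABLE proposals ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON proposals
  USING (tenant_id = current_setting('app.tenant_id')::uuid);
`;

interface ScopedQuery {
  // Sets the tenant for the current transaction; the policy above reads it.
  setTenant: { text: string; values: string[] };
  // The query itself carries no tenant filter; the database applies it.
  query: { text: string; values: unknown[] };
}

// Requires a TenantId, so calling it without a validated tenant does not compile.
function listProposals(tenantId: TenantId, status?: string): ScopedQuery {
  const setTenant = {
    text: "SELECT set_config('app.tenant_id', $1, true)", // true = transaction-local
    values: [tenantId],
  };
  const query =
    status === undefined
      ? { text: "SELECT * FROM proposals ORDER BY created_at DESC", values: [] }
      : { text: "SELECT * FROM proposals WHERE status = $1 ORDER BY created_at DESC", values: [status] };
  return { setTenant, query };
}
```

Both statements would run inside one transaction with whatever database client the project uses. The point of the design is that forgetting the tenant filter in application code no longer leaks data, because the policy filters rows regardless of what the query says.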
2. Security-sensitive auth flows
NextAuth v5 (still in beta as of May 2026) has subtle correctness requirements. The session cookie must be locked to the apex domain (not a subdomain wildcard) to prevent token leakage. Password reset flows must resist user-enumeration attacks. Magic link tokens must expire and be single-use. CSRF tokens must rotate on privilege change.
Claude wrote my NextAuth config three times. The first version left the session cookie unlocked from the apex domain. The second version had a password reset that leaked account existence through response timing. The third version had a magic-link token that was reusable. All three would have passed unit tests. None would have passed a security review.
I caught these because I've read OWASP's auth cheat sheet maybe 50 times in my career. A founder using Bolt or Lovable to scaffold auth would not catch them. This is exactly what Soatech's Production Lift installs onto a Bolt prototype — production-grade auth that has been reviewed against the actual attack surface, not just the happy path.
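As an illustration of one of those requirements (magic-link tokens that expire and are single-use), here is a minimal sketch. It is not NextAuth's API and not Wintura's code: `MagicLinkStore`, its methods, and the 15-minute lifetime are assumptions. Token generation is left out; the token is assumed to come from a cryptographically secure random source, and the clock is passed in so the logic is easy to test.

```typescript
// Hypothetical sketch: class name, method names, and the 15-minute TTL are assumptions.

interface MagicLinkRecord {
  email: string;
  expiresAt: number; // epoch milliseconds
  used: boolean;
}

// The failure case carries no reason on purpose, so callers cannot leak which check failed.
type RedeemResult = { ok: true; email: string } | { ok: false };

class MagicLinkStore {
  // In-memory map for illustration only. A real deployment would need a shared store
  // with an atomic "mark used" step, and would store a hash of the token, not the token.
  private readonly records = new Map<string, MagicLinkRecord>();
  private readonly ttlMs: number;

  constructor(ttlMs: number = 15 * 60 * 1000) {
    this.ttlMs = ttlMs;
  }

  // Registers a token for an email address, valid for ttlMs from `now`.
  issue(token: string, email: string, now: number): void {
    this.records.set(token, { email, expiresAt: now + this.ttlMs, used: false });
  }

  // Succeeds at most once per token, and only before expiry.
  redeem(token: string, now: number): RedeemResult {
    const record = this.records.get(token);
    if (record === undefined || record.used || now >= record.expiresAt) {
      return { ok: false };
    }
    record.used = true; // mark before returning so a second redeem fails
    return { ok: true, email: record.email };
  }
}
```

A unit test that only checks "a valid token logs the user in" passes against a reusable token too. The checks that matter are the second redeem and the redeem after expiry.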
3. Error-handling edge cases
The proposal-send flow in Wintura has 12 ways to fail. Stripe API can return rate-limit errors. HubSpot API can return a 502. The Puppeteer microservice can time out rendering a 50-page proposal. The R2 upload can fail signing. The client email can bounce. Each of these needs specific handling — not a generic try/catch.
Claude's default error handling is a try/catch that logs to console. That's wrong for production. The right pattern is structured error types per failure mode, with explicit retry policies (immediate retry for rate limits, dead-letter queue for HubSpot, fallback to non-sealed PDF for Puppeteer timeout). Claude writes this when I describe each failure mode and the desired recovery, but it doesn't anticipate them on its own.
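A minimal sketch of that pattern, using a discriminated union for failure modes and one function that maps each to a recovery. The type names, queue name, and delay values are assumptions for illustration. The recoveries for rate limits, HubSpot, and Puppeteer follow the policies described above; the ones for upload signing and bounced email are guesses.

```typescript
// Hypothetical sketch: type names, queue name, and delay values are assumptions.

type SendFailure =
  | { kind: "stripe_rate_limited" }
  | { kind: "hubspot_unavailable"; status: number }
  | { kind: "pdf_render_timeout"; pages: number }
  | { kind: "upload_signing_failed" }
  | { kind: "email_bounced"; address: string };

type Recovery =
  | { action: "retry"; delayMs: number }
  | { action: "dead_letter"; queue: string }
  | { action: "fallback"; to: string }
  | { action: "notify_user"; message: string };

// Maps each failure mode to an explicit recovery instead of a generic catch-and-log.
function recoveryFor(failure: SendFailure): Recovery {
  switch (failure.kind) {
    case "stripe_rate_limited":
      return { action: "retry", delayMs: 0 }; // immediate retry, as described above
    case "hubspot_unavailable":
      return { action: "dead_letter", queue: "hubspot-sync" };
    case "pdf_render_timeout":
      return { action: "fallback", to: "non-sealed-pdf" };
    case "upload_signing_failed":
      return { action: "retry", delayMs: 1000 }; // assumed policy
    case "email_bounced":
      return { action: "notify_user", message: `Delivery to ${failure.address} failed` }; // assumed policy
    default: {
      // Compile-time exhaustiveness check: adding a new failure kind
      // without handling it here becomes a type error.
      const unhandled: never = failure;
      throw new Error(`unhandled failure: ${JSON.stringify(unhandled)}`);
    }
  }
}
```

The `never` assignment in the default branch is what makes the compiler, rather than a production incident, point out a failure mode nobody handled.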
4. Code you shouldn't write at all
The single best engineering decision I made on Wintura was deleting a feature Claude had spent 2 hours implementing. The feature was a "smart approval routing" workflow — proposals would route through tenant-defined approval chains before sending. Claude built it correctly. Then I realized: no agency in my ICP wanted this. It was complexity for an imagined user, not a real one.
AI is exceptionally good at building what you ask for. AI is exceptionally bad at telling you not to ask for it. That's the founder's job — and it's the biggest source of wasted time when AI tools make it cheap to build features that shouldn't exist.
What this means for your "can AI build my app?" question
If you're a technical founder, the realistic 2026 answer is: AI can compress 6 months of full-time work into ~4 months. You will still be working full-time on architectural decisions, security, edge cases, and product judgment. The Wintura case study at /case-studies/wintura-ai is the verifiable proof point.
If you're a non-technical founder, the realistic 2026 answer is: AI can build the prototype via Bolt, Lovable, v0, or Cursor for $20-25/month. The prototype can validate your idea. The prototype cannot become production. Per the Bolt-vs-Production-Lift comparison, the rebuild-to-production cost is consistently €20K-€100K. The honest path: use Bolt to validate, then either learn enough engineering to make the architectural decisions yourself, OR pay someone who has — Soatech's Production Lift (€3,500, 1 week) is exactly this transaction.
The 2026 data backing this
- Taskade State of Vibe Coding 2026: Vibe coding is a $4.7B market in 2026 (per Roots Analysis). 41% of new code shipped is AI-generated (per Hashnode + Designrush).
- Cursor crossed $2B ARR in February 2026, the fastest B2B SaaS to that milestone in history, doubling from $1B in just three months. It raised a $2.3B Series D at a $29.3B valuation.
- Lovable grew to roughly $200M-$400M ARR over 18 months — multiple sources cite different snapshots (TechCrunch March 2026 on the velocity; LinkedIn post from CEO Anton Osika on the $1.8B valuation Series).
- JetBrains AI Coding Tools survey, April 2026: 90% of developers regularly used at least one AI tool at work in January 2026. The minority not using AI are senior engineers who consciously chose to opt out — not because AI is bad, but because they're working on edge cases AI can't reason about.
These numbers say AI is real and unavoidable. They do not say AI is a substitute for engineering judgment.
What the Wintura experience taught me to offer at Soatech
The realistic productized model for AI-accelerated engineering in 2026 is exactly what Soatech sells:
- Production Lift (€3,500, 1 week): take a Bolt/Lovable prototype to production. AI generated the prototype. An Architect adds the 59% it can't.
- Technical Blueprint (€2,500, 5 days): pre-build architecture sprint. AI can't make architectural decisions; an Architect can. Walk away with the plan.
- MVP Sprint (from €8.5K, under 30 days): clean-slate V1 build. AI accelerates generation; the Architect makes the decisions that determine whether it's production-grade.
The arbitrage is real: AI compresses generation time roughly 2×. The labor cost in a Soatech fixed-price MVP Sprint is 30-50% lower than a comparable hourly agency quote, with the same or better outcomes, because the velocity gain comes from AI, not from cutting corners. See the Wintura.ai case study for the full architectural picture.
Frequently asked questions
Can Claude or Cursor actually write production code in 2026?
Yes, with a competent reviewer. Per Taskade's State of Vibe Coding 2026, roughly 41% of new code shipped is AI-generated (per Hashnode 2026 data). In my Wintura experience, AI handled scaffolding, boilerplate refactoring, test enumeration, and documentation. The remaining 59% — architectural decisions, security boundaries, error edge cases — needed human judgment. Production code is the merge of both.
What's the difference between Bolt/Lovable and Cursor for production code?
Bolt and Lovable are AI app builders that generate full applications from prompts — fastest for prototyping. Cursor is a code editor with AI integration — best for an engineer pair-programming on a real codebase. Wintura was built primarily in Cursor; the prototypes I validate with clients often start in Bolt. They serve different points in the build lifecycle.
How much faster was Wintura with AI compared to without?
Honest estimate: ~2× faster on raw code generation, ~1.5× faster on the overall project (accounting for the time I spent reviewing, refactoring, and rejecting AI suggestions). Six months solo with AI; it would have been ~9-10 months solo without. Per the JetBrains AI Coding Tools survey (April 2026), 90% of developers regularly used at least one AI tool at work in January 2026, which matches the broader picture: AI accelerates engineering work without replacing engineers.
Where does AI fail most often that founders don't notice?
Multi-tenant data isolation. Auth flow security against enumeration and timing attacks. Error handling for production failure modes (rate limits, third-party API outages, infrastructure timeouts). And "code you shouldn't write at all": features for imagined users that AI builds cheaply but that still waste your time. See when vibe coding fails for the full failure-mode list.
Should I trust an AI-built app with real customer data?
Trust the engineer who reviewed it, not the AI. If the engineer can answer "how is tenant data isolated at the database layer?" with a specific Row-Level Security policy or compile-time-enforced query function — trust their work. If they can answer only with "the application checks tenant ID before each query" — get a Production Lift before accepting paying users.
Is "vibe coding" the same as "AI coding"?
Roughly yes, in 2026 vocabulary. "Vibe coding" emphasizes the prompt-to-prototype loop (Bolt, Lovable, v0). "AI coding" emphasizes engineer-with-AI-pair-programmer (Cursor, Claude Code, GitHub Copilot). The first generates apps from descriptions; the second accelerates engineers writing code. See what is vibe coding for the full taxonomy.
Can AI replace a senior engineer in 2026?
No — and the JetBrains 2026 survey data shows this clearly: 90% of developers regularly use AI tools at work, but the developer population isn't shrinking. AI replaces typing, not deciding. Senior engineers spend most of their time deciding (architecture, trade-offs, security, performance) and a small fraction typing. AI compresses the typing fraction; the deciding fraction is unchanged.
What would I do differently if I were starting Wintura today?
Start with Soatech's Technical Blueprint. That's not a sales pitch — that's literally what I'd do. The €2,500 / 5-day architecture sprint produces the schema + API contracts + AI generation strategy I would have benefited from having in front of me on day one. I figured them out as I went. Better to have them locked before the first commit.