What Every Optimization Team Needs to Understand About Working With AI Whether They Use It for Code or Not
We were testing an AI assistant with a billion-dollar retail brand, running it through code generation for A/B tests. The results were... inconsistent. We'd give it the exact same prompt in different chat sessions and get completely different code outputs, different formats, different strategies, sometimes different outcomes entirely. One version worked perfectly. Another broke the site navigation.
Most teams would look at that inconsistency and conclude that "AI isn't ready for production code."
We saw something different: "AI hasn't been trained yet."
That realization changed everything. Because the question everyone asks, "What should AI be allowed to code?" implies AI has fixed capabilities. That's the wrong question.
The question that actually matters is, "What have we taught AI about our stack, our standards, and our hard-won lessons?"
This article is about code generation, but the lessons apply to everything AI does in optimization. The difference between "CSS-only" AI and "almost everything" AI isn't the model you're using. It's the infrastructure you build around it. And that infrastructure (documentation, constraints, institutional knowledge) is what transforms how teams work with AI on any complex task.
These same training principles apply whether AI is writing code, generating hypotheses, or designing tests.
Why the "CSS Yes, JavaScript No" Rule Exists (And When It Breaks Down)
If you've read anything about AI code generation for optimization, you've probably seen this advice: let AI handle CSS, keep it away from JavaScript. It's good default guidance. CSS changes are visual, hard to break catastrophically, and easy to roll back. JavaScript touches functionality, can degrade performance, and has a larger blast radius when things go wrong.
But this rule isn't really about technical complexity. It's about knowledge gaps.
Without training, AI doesn't know:
Your platform's quirks (we discovered our testing platform breaks with modern ES6+ JavaScript features)
Your data structures (inventory lives in __NUXT__.data['productInventory-{SKU}'], not some generic variable)
Your institutional lessons ("the bot detection system becomes aggressive if the dev console is open during testing")
Your edge cases (browsers handle whitespace differently, size codes don't match display values, single-page apps break standard monitoring patterns)
Here's what happened when we asked AI to build a low stock indicator for a product page, a test that shows "Only X left!" when inventory is low and updates dynamically when customers change product variants.
Generic AI output:
const stockIndicator = async () => {
  const inventory = __NUXT__.data?.productInventory;
  const size = element?.querySelector('.size')?.textContent;
  if (inventory[size] < 5) {
    element.innerHTML += `<span>Only ${inventory[size]} left!</span>`;
  }
};
This looks reasonable. It would pass a code review from someone unfamiliar with the specific environment. But in production, it fails completely:
Uses async/await (our platform throws errors with this syntax)
Uses optional chaining (?.), which breaks in the platform's JavaScript environment
Assumes the inventory data structure matches what AI imagines (it doesn't)
Assumes size display text matches inventory keys (fails on "3.5 Boys = 5.0 Women")
Processes once and never updates when customers change colors
Adds indicators without clearing old ones, creating duplicates
The "CSS yes, JavaScript no" rule protects teams from exactly this scenario. But here's what nobody talks about: the line between safe and dangerous isn't technical complexity—it's knowledge transfer.
What "Training AI" Actually Looks Like
After several weeks of trial and error, testing, and refinement with our retail client, we built a framework that turns generic AI into specialized AI. It's based on four pillars, and here's the important part: these pillars work for any complex AI task, not just code generation.
1. Technical Constraints & Compatibility Rules
Document everything that breaks so AI doesn't have to learn through trial and error.
For code generation, this meant documenting our platform's limitations:
// ❌ BREAKS in our platform (generic AI writes this)
const text = element?.textContent?.trim();
const value = someValue ?? 'default';
// ✅ WORKS in our platform (trained AI knows this)
const text = element && element.textContent ? element.textContent.trim() : '';
const value = someValue || 'default';
Our constraints document grew to include:
No ES6+ features (arrow functions, optional chaining, nullish coalescing)
No emojis in console.log statements (they crash the platform)
Always use .trim() on text content (browsers handle whitespace differently)
Specific header tokens to bypass bot detection in development environments
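Constraints like these are easiest to enforce when they're machine-checkable as well as written down. One way to do that, and this is our suggestion rather than part of the client setup, is to lint every generated snippet against the platform's limits; a minimal sketch assuming ESLint:
// .eslintrc.js (assumed setup): parsing as ES5 turns arrow functions, optional
// chaining, and other ES6+ syntax into lint errors before code reaches the platform.
module.exports = {
  env: { browser: true },
  parserOptions: { ecmaVersion: 5 }
};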
Why this matters beyond code: The same principle applies to hypothesis generation. You need to document what tests failed, what hypotheses were flawed, what assumptions proved wrong. Without that documentation, AI can't learn from your team's experience.
2. System-Specific Architecture & Data Patterns
Give AI the map to your actual territory, not theoretical best practices.
Instead of telling AI to "find the add-to-cart button," we provide actual patterns:
// Real selectors from the actual site:
__NUXT__.data['productInventory-VN000D5INVY']
document.querySelector("[data-test-id='size-picker']")
document.querySelector("[aria-label='Site Utility Navigation']")
We documented:
Where inventory data actually lives and its exact structure
Real component selectors with data attributes
User authentication checking patterns
How the single-page app handles navigation
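The payoff shows up immediately in generated code. Once the inventory location is documented, a lookup can target the real structure instead of a guessed one; here's a minimal sketch using the __NUXT__ pattern above (the helper name is ours, for illustration), written to respect the compatibility rules:
// Read inventory for a SKU from the documented location, using explicit null
// checks instead of optional chaining (per the platform constraints).
function getInventoryForSku(sku) {
  var data = window.__NUXT__ && window.__NUXT__.data ? window.__NUXT__.data : null;
  if (!data) {
    return null;
  }
  return data['productInventory-' + sku] || null;
}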
Why this matters beyond code: When AI generates test hypotheses, it needs to understand your actual funnel, your real user segments, your product catalog structure. Generic knowledge about "ecommerce best practices" doesn't help when your checkout flow has unique steps or your product categorization follows industry-specific patterns.
3. Mandatory Standards & Templates
Every output should follow the same structure, making it reviewable and maintainable.
Our code template looks like this:
/*
* TE-303 Low Stock Indicator Test
* Summary: Displays "Only X left!" on PDP when inventory is low (1-5 items)
* Version: v8 - Robust
* Last Updated: July 10, 2025
*/
const LOG_PREFIX = "[TE-303]";
console.log(LOG_PREFIX + " Initializing - Version 8");
// Configuration section
const LOW_STOCK_THRESHOLD = 5;
const MONITORING_INTERVAL_MS = 500;
// Test logic follows standard patterns...
Standard patterns include:
Version control and change tracking in header comments
Logging prefixes for debugging
Configuration sections at the top
Error handling and fallback strategies
QA validation checkpoints in comments
Why this matters beyond code: Consistent structure makes everything reviewable. Whether it's code, hypothesis documents, or test plans—standardization enables quality control. When every hypothesis follows the same format (if/then/because, success metrics, risk assessment), your team can review and approve faster.
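As a concrete illustration (the field names here are ours, not a prescribed schema), a standardized hypothesis record for the low stock test might look like this:
var hypothesis = {
  id: 'HYP-001', // illustrative ID scheme
  ifChange: 'Show "Only X left!" on the PDP when inventory is 1-5 items',
  thenOutcome: 'Add-to-cart rate increases on low-stock variants',
  because: 'Scarcity signals add urgency for in-demand sizes',
  successMetrics: ['add-to-cart rate', 'PDP exit rate'],
  riskAssessment: 'Low: visual change only, no checkout logic touched'
};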
4. Institutional Knowledge Codification
This is where the magic happens. Every "learned the hard way" lesson becomes a permanent part of what AI knows.
Our lessons-learned document includes gems like:
"Bot catcher becomes spiteful if dev console is open" → provide bypass header token
"Size codes don't match display values" → provide extraction and fuzzy matching pattern
"Not all browsers handle whitespace the same" → always use
.trim()"Single-page app navigation breaks standard mutation observers" → use this specific monitoring pattern
Here's what trained AI produced for the low stock indicator, handling the complex reality of size code matching:
// Function to extract size code from complex size values
function extractSizeCode(sizeValue) {
  // Handle complex size formats like "3.5 Boys = 5.0 Women"
  const match = sizeValue.match(/^([0-9]+\.?[0-9]*)/);
  if (!match) {
    console.warn(LOG_PREFIX + ' Could not extract size from:', sizeValue);
    return null;
  }
  const sizeNumber = match[1];
  // Convert to format used in inventory keys (e.g., "3.5" -> "035")
  const parts = sizeNumber.split('.');
  const whole = parts[0].padStart(2, '0');
  const decimal = parts[1] || '0';
  return whole + decimal;
}
// Function to find variant key using fuzzy matching
function findVariantKey(inventoryData, sizeCode) {
  const allKeys = Object.keys(inventoryData.variants);
  // Try different patterns to find the right key
  const patterns = [
    key => key.includes(':' + sizeCode + ':M:1:'),
    key => key.includes(sizeCode),
    key => {
      const unpadded = sizeCode.replace(/^0+/, '') || '0';
      return key.includes(':' + unpadded + ':');
    }
  ];
  for (const pattern of patterns) {
    const matchingKeys = allKeys.filter(pattern);
    if (matchingKeys.length > 0) {
      return matchingKeys[0];
    }
  }
  return null;
}
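For instance, a size label of "3.5 Boys = 5.0 Women" resolves to the size code "035", which the fuzzy matcher then uses to locate the right inventory key (the key shown here is illustrative):
var sizeCode = extractSizeCode('3.5 Boys = 5.0 Women'); // "035"
var variantKey = findVariantKey(inventoryData, sizeCode); // e.g. something like "VN000D5INVY:035:M:1:..."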
This isn't code a junior developer writes on day one. This is what you write after tests break in production three times for three different reasons. The trained AI knows this because we documented those three failures.
Why this matters beyond code: Your optimization team has institutional knowledge about what tests work, what audiences respond to, what copy converts. If that knowledge lives only in people's heads, AI can't use it. Neither can new team members. Document it once, and both AI and humans benefit.
The Investment Timeline
Building this framework took several weeks of trial-and-error, knowledge capture, testing, and adjustments. That sounds like a lot. But here's what we discovered: about 80% of the framework applies to other clients and projects. Only 20% requires customization.
The 80% that transfers:
Methodology and approach to documentation
Template structures and standards
Error handling patterns
Quality assurance checklists
The 20% that changes:
Platform specifics (different A/B testing tools have different quirks)
DOM structure and component patterns (every site is different)
Industry-specific data models and business logic
We've now applied this framework to a second client on a different testing platform. What took weeks the first time took days the second time. The third time will take hours.
The Proof: Production-Ready Code From Trained AI
Let's return to that low stock indicator test. With proper training, AI produced code that:
Handles async data loading: Single-page apps load data dynamically. The code waits for inventory data to be available rather than failing when it's not immediately present.
Uses fuzzy matching with fallbacks: When "3.5 Boys = 5.0 Women" needs to match inventory key "035", the code tries multiple matching patterns until it finds the right one.
Monitors for variant changes: When a customer switches from black to blue shoes, the code detects the change, clears old stock indicators, and waits for the size picker to reopen before processing new inventory levels.
Includes comprehensive logging: Every step logs its progress with consistent prefixes, making debugging trivial.
Follows platform compatibility rules: No ES6+ features, explicit null checking instead of optional chaining, function declarations instead of arrow functions.
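To make the first three behaviors concrete, here's a stripped-down sketch of the pattern, not the client's production code; it reuses LOG_PREFIX, MONITORING_INTERVAL_MS, extractSizeCode, and findVariantKey from above, and updateStockIndicator stands in for the injection logic:
// Poll for inventory data instead of assuming it exists on page load, and clear
// any previously injected indicator before adding a new one (avoids duplicates).
function monitorInventory(sku) {
  var lastVariantKey = null;
  setInterval(function () {
    var data = window.__NUXT__ && window.__NUXT__.data ? window.__NUXT__.data : null;
    var inventory = data ? data['productInventory-' + sku] : null;
    if (!inventory || !inventory.variants) {
      return; // data not loaded yet; try again on the next tick
    }
    var sizeElement = document.querySelector("[data-test-id='size-picker']");
    var sizeCode = sizeElement ? extractSizeCode(sizeElement.textContent.trim()) : null;
    var variantKey = sizeCode ? findVariantKey(inventory, sizeCode) : null;
    if (!variantKey || variantKey === lastVariantKey) {
      return; // nothing selected, or nothing has changed since the last check
    }
    lastVariantKey = variantKey;
    var oldIndicator = document.querySelector('.te-low-stock-indicator'); // class name is ours
    if (oldIndicator && oldIndicator.parentNode) {
      oldIndicator.parentNode.removeChild(oldIndicator);
    }
    console.log(LOG_PREFIX + ' Variant changed, updating indicator for ' + variantKey);
    updateStockIndicator(inventory, variantKey); // placeholder for the injection step
  }, MONITORING_INTERVAL_MS);
}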
Here's another example, a cart page test that shows reward points customers can earn. The trained AI produced code with multiple fallback strategies:
// Primary approach: Wait for element
utils.waitForElement('[data-test-id="cart-checkout-button"]').then(
  (checkoutButton) => { injectCartMessage(); }
);
// Backup: Direct navigation delay
setTimeout(() => {
  if (window.location.pathname.includes('/cart')) {
    injectCartMessage();
  }
}, 500);
// Backup: Mutation observer for dynamic changes
const cartObserver = new MutationObserver((mutations) => {
  // Detect cart-related changes and update message
});
// Backup: Periodic check as final safety net
setInterval(() => {
  if (window.location.pathname.includes('/cart')) {
    // Check if message is missing and inject if needed
  }
}, 3000);
This is what institutional knowledge looks like. We learned through production failures that different navigation patterns require different detection strategies. Single-page apps sometimes load instantly, sometimes take seconds. Direct navigation works differently than in-app navigation. The mutation observer catches most changes, but not all of them.
Generic AI doesn't know to implement four different fallback strategies. Trained AI knows because we documented why each strategy exists.
The quality difference isn't "AI writes code faster." It's "AI writes code that actually works in your environment because it knows all the lessons you've learned."
The Paradigm Shift - This Isn't Really About Code
Here's what matters most: the code generation framework we built isn't special because it's about code. It's special because it's a model for training AI to do any specialized work.
The pattern applies everywhere in optimization:
Hypothesis generation? Same four pillars.
Constraints: What types of tests have failed before? What's off-brand or politically sensitive?
Architecture: What's your funnel? What segments exist? What's your product catalog structure?
Standards: What format should hypotheses follow? What evidence is required for approval?
Institutional knowledge: What assumptions were proven wrong? What worked unexpectedly well?
Test plan creation? Same approach.
Constraints: Sample size requirements, test duration rules, statistical thresholds
Architecture: Available tools, tracking infrastructure, reporting structure
Standards: Required sections, approval workflow, success criteria format
Institutional knowledge: Common implementation pitfalls, stakeholder preferences, political landmines
Copy generation? You see where this is going.
Constraints: Brand voice guidelines, legal requirements, character limits
Architecture: Product categories, customer segments, value propositions
Standards: Tone consistency, formatting rules, CTA patterns
Institutional knowledge: What headlines converted, what claims backfired, what language resonates with your audience
The universal truth is that AI without context produces generic output. AI with your institutional knowledge produces specialized output that actually works in your environment.
This leads to three organizational archetypes:
The Unprepared: No documentation, no standards, no training infrastructure. Gets generic AI outputs that sort of work. Appropriate for low-stakes exploration, content drafts, brainstorming. Fine for "give me five headline ideas" but not for "write production code."
The Dangerous: Uses AI for specialized work without building infrastructure first. Outputs look good initially but break under edge cases or in production. Warning: This is where most teams currently operate with AI code generation. It appears to work until it catastrophically doesn't.
The Strategic: Invested in documentation, standards, and knowledge capture. Gets specialized outputs that handle edge cases because AI has been trained on institutional knowledge. Result: high capability, controlled risk, massive efficiency gains.
The choice isn't "should we use AI?" It's "are we willing to invest in teaching AI our context?"
How to Build Your Framework (For Any AI Use Case)
Don't build infrastructure in a vacuum. Pick the specific AI use case you want to enable first: code generation, hypothesis development, test analysis, whatever will deliver the most value.
Start with your last 10 outputs. Code reviews, hypotheses, test plans, whatever AI will help create. Document everything that required correction. What kept breaking? What patterns emerge? What went wrong?
Create your constraints document. For code: platform compatibility issues, performance requirements, security rules, browser quirks. For hypotheses: test types that are off-limits, sample size requirements, brand positioning boundaries, segments that exist vs. segments AI might invent.
Map your system architecture. For code: real selectors from your site, actual data structures, utility functions that exist, common patterns. For optimization: your conversion funnel stages, available user segments, product categorization, tracking infrastructure.
Build standard templates. Create the structure every output should follow. For code, that's headers with version tracking, logging patterns, configuration sections. For hypotheses, that's if/then/because format, success metrics, risk assessment. Whatever makes outputs consistent and reviewable.
Codify institutional knowledge. This is the secret sauce. Create a "lessons learned" document and update it every time something unexpected happens. What broke in production and why. What assumptions proved wrong. What edge cases you discovered. What workarounds you developed.
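The format matters less than the habit of capturing it, but a consistent shape makes lessons easier to feed back into AI context later. A sketch of one possible entry, based on the duplicate-indicator failure described earlier (field names are ours):
var lesson = {
  date: '2025-07-10',
  symptom: 'Duplicate stock indicators appeared after a color change',
  rootCause: 'New indicators were injected without clearing the old ones',
  rule: 'Always remove existing test elements before re-injecting',
  appliesTo: ['PDP tests', 'any component that re-renders']
};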
Test and refine iteratively. Start with low-risk applications. Validate every output against your standards. Add new rules when you discover gaps. Document what AI gets wrong and why.
The 80/20 rule holds: expect 80% of your framework to be reusable across different projects, brands, and team members. The 20% that changes: platform specifics, site architecture, industry patterns.
Timeline expectations: First framework takes weeks. Second application takes days. Third and beyond takes hours to days.
The ROI calculation isn't "how much time does AI save per task?" It's "how much time did we invest upfront, and how many tasks will benefit?" If you'll run 100+ tests with AI assistance, weeks of setup pays for itself immediately.
When to Still Say No (Even With Perfect Setup)
Some things remain high-risk regardless of training:
Server-side code generation: API endpoints, database queries, authentication logic. Risk: data breaches, security vulnerabilities. Recommendation: human-written, security-reviewed, never AI-generated without expert oversight.
Critical path functionality: Checkout flows, payment processing, user authentication. Risk: revenue loss, customer trust damage. Recommendation: extensive staging testing, phased rollout, thorough QA even if AI-generated.
Compliance-sensitive code: GDPR consent management, accessibility requirements, legal disclaimers. Risk: regulatory violations, lawsuits. Recommendation: legal review required regardless of who or what wrote it.
The risk hierarchy:
Tier 1 - Train AI, automate freely: Visual changes, content swaps, non-critical features. If properly trained, AI can handle these autonomously.
Tier 2 - Train AI, require human review: Interactive elements, dynamic content, performance-sensitive modifications. AI generates, humans validate before deployment.
Tier 3 - Human-written, AI-assisted at most: Business logic, data handling, security-relevant code, legal/compliance requirements. Humans write it, maybe use AI for suggestions.
Ask three questions: What's the worst case if this breaks? Can we test it safely in staging? How quickly can we roll back? High consequence plus hard to test plus slow rollback equals don't let AI touch it, regardless of training.
Your AI framework should make 80% of your work safer and faster. It shouldn't make 100% of your work automated. Some work is too important to delegate, even to extremely well-trained AI.
The Line Is Drawn by Investment, Not Technology
We started with a simple question: "What should AI be allowed to code?"
But that was the wrong question.
The right question: "What infrastructure do we need before AI can do specialized work safely?"
And the real insight: This isn't actually about code at all.
The framework we built for code generation (constraints, architecture, standards, institutional knowledge) is the same framework that makes AI useful for any complex optimization task. Whether AI is writing JavaScript, generating hypotheses, or analyzing test results, the pattern holds:
Generic AI → generic outputs that sort of work
Trained AI → specialized outputs that handle your edge cases
The teams succeeding with AI aren't the ones with the best models or biggest budgets. They're the ones who treated AI like a new team member who needed proper onboarding: documentation, standards, context, and institutional knowledge.
The "CSS yes, JavaScript no" rule isn't about technical capability. It's about whether you've done the work to make AI capable.