Most AI-powered tools feel the same from the outside. You type something, a spinner appears, and eventually you get a wall of text. That interaction model is fine for a demo. It doesn’t hold up in production, especially when users are paying for it.
We built a code generation platform that streams AI responses in real time, serves users across seven languages, and processes payments in 20+ currencies. Concept to production took about two months. Along the way, we ran into a set of problems that come up in nearly every AI product we work on. Here’s what we learned.
Why streaming changes the product, not just the speed
The first version of any AI interface usually works the same way: send a prompt, wait for the full response, render it. Simple to build. Painful to use.
The problem is time-to-first-token. A complex code generation request might take several seconds to complete. In a batch model, the user stares at a spinner for the entire duration. With streaming, they start reading useful output within the first second. The perceived speed difference is dramatic even when the total generation time is identical.
But speed isn’t the real benefit. Streaming changes how people interact with the output. Users scan the first few lines and course-correct early. If the code is heading in the wrong direction, they stop and rephrase instead of waiting for a complete response they’ll throw away. That feedback loop is the difference between a tool people use once and one they come back to.
For a paid product, this matters even more. Users who feel like the tool is responsive stay longer and convert at higher rates. Users who stare at spinners leave.
Choosing the right streaming approach
We chose HTTP chunked transfer encoding rather than WebSockets. The reasoning: AI output is unidirectional (server to client). WebSockets add connection management complexity, reconnection logic, and infrastructure overhead that only pay off when you need bidirectional communication. For streaming generated output to a browser, a standard HTTP connection is simpler and more reliable.
The front-end renders each chunk as it arrives rather than waiting for the full response. Code highlighting, explanations, and suggestions build up in real time. The user starts reading working code within the first second, before the model has finished generating the rest.
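To make the mechanics concrete, here is a minimal sketch of both halves: framing text as HTTP/1.1 chunks on the way out, and repainting the display after every chunk on the way in. The `fake_model_stream` generator and the function names are hypothetical stand-ins, not the platform’s code. In practice a web framework usually handles the chunk framing for you.

```python
from typing import Iterable, Iterator, List


def encode_chunked(pieces: Iterable[str]) -> Iterator[bytes]:
    """Frame each piece as one HTTP/1.1 chunk: hex size, CRLF, data, CRLF."""
    for piece in pieces:
        data = piece.encode("utf-8")
        if not data:
            continue  # a zero-length chunk would signal end of stream
        yield f"{len(data):X}\r\n".encode("ascii") + data + b"\r\n"
    yield b"0\r\n\r\n"  # terminating chunk


def fake_model_stream() -> Iterator[str]:
    """Hypothetical stand-in for tokens arriving from an AI provider."""
    yield from ["def add(a, b):", "\n", "    return a + b", "\n"]


def render_incrementally(pieces: Iterable[str]) -> List[str]:
    """Mimic a front-end that repaints after every chunk; returns each repaint."""
    shown = ""
    frames = []
    for piece in pieces:
        shown += piece
        frames.append(shown)
    return frames


wire_bytes = list(encode_chunked(fake_model_stream()))
frames = render_incrementally(fake_model_stream())
```

The first entry in `frames` is already readable code, which is the point: the user sees something useful before the last chunk exists.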
One decision that paid off early: using the AI provider’s function calling for structured outputs alongside the raw text stream. Instead of trying to parse code blocks and metadata out of freeform markdown on the client, the API returns structured data shapes that the front-end knows how to render. That separation between “what to display” and “how to display it” eliminated an entire class of parsing bugs.
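A rough illustration of that separation, assuming the structured payloads arrive as JSON objects with a `type` field. The payload shape, the class names, and the HTML output are all assumptions made for the sketch, not the provider’s actual schema.

```python
import json
from dataclasses import dataclass
from html import escape


@dataclass
class CodeBlock:
    language: str
    source: str


@dataclass
class Explanation:
    text: str


def parse_part(raw: str):
    """Turn one JSON payload (hypothetical shape) into a typed part."""
    data = json.loads(raw)
    kind = data.get("type")
    if kind == "code":
        return CodeBlock(language=data["language"], source=data["source"])
    if kind == "explanation":
        return Explanation(text=data["text"])
    raise ValueError(f"unknown part type: {kind!r}")


def render(part) -> str:
    """Decide how to display a part; the data itself carries no markup."""
    if isinstance(part, CodeBlock):
        return (
            f'<pre><code class="language-{escape(part.language)}">'
            f"{escape(part.source)}</code></pre>"
        )
    if isinstance(part, Explanation):
        return f"<p>{escape(part.text)}</p>"
    raise TypeError(f"cannot render {type(part).__name__}")
```

Because the renderer only ever sees typed parts, there is no markdown fence detection on the client, and an unexpected shape fails loudly instead of rendering as garbage.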
Why generic AI output isn’t good enough
A general-purpose language model knows a lot about code syntax. It also confidently generates syntax that doesn’t exist. For a code generation tool, “close but wrong” is worse than no answer at all.
The solution was retrieval-augmented generation (RAG): a pipeline that retrieves relevant documentation at query time and injects it into the prompt as context. The model still generates the response, but instead of relying on its training data (which may be outdated or subtly wrong), it works from verified reference material.
The difference shows up most in edge cases: obscure methods, platform-specific behavior, and deprecated syntax that the model might otherwise present as current. For a product where users trust the output enough to paste it into their own projects, that accuracy gap is the difference between a useful tool and a liability.
No fine-tuning required. The documentation gets updated independently of the model. If you’re building an AI product in a domain with specific reference material, RAG is often the right first step before considering more expensive approaches like fine-tuning.
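The retrieve-then-inject flow can be sketched in a few lines. This version swaps in naive keyword overlap for scoring, where a production pipeline would more likely use embeddings and a vector index. The documentation snippets, function names, and prompt wording are illustrative assumptions.

```python
import re
from typing import List, Set, Tuple

# Hypothetical documentation snippets; a real system would index actual reference docs.
DOCS = [
    ("list.append", "list.append(x) adds a single item x to the end of the list."),
    ("list.extend", "list.extend(iterable) appends every item from the iterable to the list."),
    ("dict.get", "dict.get(key, default) returns the value for key, or default if key is missing."),
]


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into word-like tokens."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def retrieve(query: str, k: int = 2) -> List[Tuple[str, str]]:
    """Return up to k snippets ranked by how many tokens they share with the query."""
    q = tokenize(query)
    scored = [(len(q & tokenize(title + " " + body)), title, body) for title, body in DOCS]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(title, body) for score, title, body in scored[:k] if score > 0]


def build_prompt(query: str) -> str:
    """Inject the retrieved snippets into the prompt as reference context."""
    context = "\n".join(f"- {title}: {body}" for title, body in retrieve(query))
    return (
        "Answer using only the reference material below.\n"
        f"Reference:\n{context}\n\n"
        f"Question: {query}"
    )


prompt = build_prompt("How do I add every item from an iterable to a list?")
```

Updating the documentation means changing `DOCS` (or its real-world equivalent). The model and the prompt-building code stay untouched.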
Designing a freemium model that converts
The platform serves three tiers: anonymous visitors, free registered users, and paid subscribers. Each gets different query limits, model access, and response length.
| | Anonymous | Free | Paid |
|---|---|---|---|
| Queries | Limited | Moderate | Unlimited |
| AI model | Base | Base | Advanced |
| Response length | Capped | Capped | Full |
The design challenge isn’t technical. It’s strategic. Free users need enough access to experience real value, but the ceiling needs to be low enough that power users hit it naturally. Set the limit too high and no one converts. Set it too low and no one stays long enough to see the value.
The approach that worked: contextual upgrade prompts triggered when users approach their limits, not random banners. The user has just experienced the value of the tool. They’ve generated code they needed. Now they see they’re running low. That’s when the upgrade pitch lands, because the product has already made its case.
On the technical side, all access control runs through a single middleware layer. Model selection, usage metering, and rate limiting happen before any request reaches the application logic. When pricing or limits change (and they will), the update happens in one place.
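A minimal sketch of what such a layer might look like, assuming an in-memory usage counter and made-up limits (the actual numbers are not stated here, so the ones below are placeholders). It also flags when a user is close to the ceiling, which is the moment a contextual upgrade prompt would be shown.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class TierPolicy:
    daily_queries: Optional[int]  # None means unlimited
    model: str
    max_response_tokens: Optional[int]  # None means uncapped


# Placeholder limits for illustration only.
POLICIES = {
    "anonymous": TierPolicy(5, "base", 500),
    "free": TierPolicy(25, "base", 500),
    "paid": TierPolicy(None, "advanced", None),
}


@dataclass
class Decision:
    allowed: bool
    model: str
    max_response_tokens: Optional[int]
    show_upgrade_prompt: bool


class AccessMiddleware:
    """Single place where tier, metering, and model choice are resolved."""

    def __init__(self, policies: Dict[str, TierPolicy], warn_ratio: float = 0.8):
        self.policies = policies
        self.warn_ratio = warn_ratio
        self.usage: Dict[str, int] = {}  # no daily reset in this sketch

    def check(self, user_key: str, tier: str) -> Decision:
        policy = self.policies[tier]
        used = self.usage.get(user_key, 0)
        limit = policy.daily_queries
        if limit is not None and used >= limit:
            return Decision(False, policy.model, policy.max_response_tokens, True)
        used += 1
        self.usage[user_key] = used
        near_limit = limit is not None and used >= limit * self.warn_ratio
        return Decision(True, policy.model, policy.max_response_tokens, near_limit)
```

Changing a limit or moving a model between tiers is an edit to `POLICIES`. Nothing in the application logic needs to know.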
Going global: payments and seven languages
Global payments sound straightforward until you start building them. Currency conversion, regional pricing, tax handling, subscription lifecycle events, and webhook reliability all compound quickly.
The critical decision: making Stripe webhooks the single source of truth for subscription state. The alternative (polling the API or relying on client-side checkout confirmation) creates race conditions. A user completes payment but their access doesn’t update for thirty seconds? That’s a support ticket. Webhooks like checkout.session.completed and invoice.paid drive all access changes immediately. Idempotent event processing handles duplicate deliveries, which happen more often than you’d expect.
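Idempotent processing mostly comes down to remembering which event IDs have already been applied. The sketch below assumes events arrive as already-verified dictionaries in Stripe’s event shape (`id`, `type`, `data.object`). The in-memory store and the handler name are hypothetical. A real handler would verify the webhook signature first and record the event ID in the same database transaction as the state change.

```python
from typing import Dict, Set

GRANTING_EVENTS = ("checkout.session.completed", "invoice.paid")


class SubscriptionStore:
    """In-memory stand-in for a database (hypothetical)."""

    def __init__(self) -> None:
        self.processed_event_ids: Set[str] = set()
        self.active: Dict[str, bool] = {}


def handle_event(store: SubscriptionStore, event: dict) -> str:
    """Apply a webhook event at most once and report what happened."""
    event_id = event["id"]
    if event_id in store.processed_event_ids:
        return "duplicate"  # a redelivery: acknowledge it, change nothing

    customer = event.get("data", {}).get("object", {}).get("customer")
    if event["type"] in GRANTING_EVENTS and customer:
        store.active[customer] = True
        outcome = "access_granted"
    elif event["type"] == "customer.subscription.deleted" and customer:
        store.active[customer] = False
        outcome = "access_revoked"
    else:
        outcome = "ignored"

    store.processed_event_ids.add(event_id)
    return outcome
```

Returning a distinct value for duplicates makes it easy to log how often redeliveries actually happen.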
Regional pricing matters more than currency conversion. The platform supports 20+ currencies, with prices set per region rather than a single USD price run through an exchange rate. Users in different markets have different price sensitivity. A flat conversion ignores that entirely.
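In configuration terms, this is a lookup table rather than a multiplication. The regions and amounts below are invented for illustration and stored in each currency’s smallest unit. Only the shape of the lookup is the point.

```python
from typing import Dict, Tuple

# Invented example prices, in the smallest unit of each currency.
REGIONAL_PRICES: Dict[str, Tuple[str, int]] = {
    "US": ("usd", 2000),
    "JP": ("jpy", 2200),
    "IN": ("inr", 79900),
    "DE": ("eur", 1900),
}
DEFAULT_REGION = "US"


def price_for(country_code: str) -> Tuple[str, int]:
    """Return (currency, amount) for a country, falling back to the default region."""
    return REGIONAL_PRICES.get(country_code.upper(), REGIONAL_PRICES[DEFAULT_REGION])
```

Each entry is a deliberate pricing decision for that market, so no exchange rate appears anywhere in the code path.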
Internationalization was a day-one requirement, not a retrofit. That distinction matters. Retrofitting i18n means hunting down every hardcoded string in a finished codebase. Starting with it means every user-facing string goes through a translation layer from the first commit. The platform supports seven languages across all UI text, validation errors, and email templates.
Where payments and i18n intersect is where most teams get surprised. An order confirmation sent in English with USD formatting to a Japanese user paying in yen breaks trust immediately. Every transactional email needs to render in the user’s language with their local currency. Getting one of these systems right isn’t enough. They have to work together.
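A small sketch of the two systems meeting in one function: the template is chosen by language and the amount is formatted by currency. The templates, symbol table, and function names are assumptions, and the formatting is deliberately simplified. A production system would lean on a CLDR-based library for locale-correct separators and symbol placement.

```python
ZERO_DECIMAL = {"jpy", "krw"}  # currencies with no minor unit
SYMBOLS = {"usd": "$", "jpy": "¥", "eur": "€"}

# Hypothetical email templates keyed by language code.
TEMPLATES = {
    "en": "Thanks for your order. Total: {total}",
    "ja": "ご注文ありがとうございます。合計: {total}",
}


def format_amount(amount_minor: int, currency: str) -> str:
    """Format an amount given in the currency's smallest unit (simplified)."""
    currency = currency.lower()
    symbol = SYMBOLS.get(currency, currency.upper() + " ")
    if currency in ZERO_DECIMAL:
        return f"{symbol}{amount_minor:,}"
    return f"{symbol}{amount_minor // 100:,}.{amount_minor % 100:02d}"


def render_receipt(language: str, amount_minor: int, currency: str) -> str:
    """Pick the template by language and the number format by currency."""
    template = TEMPLATES.get(language, TEMPLATES["en"])
    return template.format(total=format_amount(amount_minor, currency))
```

The zero-decimal check is the detail that tends to bite: treating 2200 yen like 2200 cents would show a Japanese customer ¥22.00.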
Search visibility across languages requires the same attention. Each language needs its own URL routing with canonical URLs and hreflang alternates. Without this, search engines either index only one language or treat the translations as duplicate content. If you’re building for a global audience, i18n isn’t a feature. It’s infrastructure.
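For reference, the tags in question are easy to generate once routing is per language. This sketch assumes a `/{language}/path` URL scheme and uses a three-language subset purely as an example. The base URL and function names are placeholders.

```python
from typing import List


def localized_url(base_url: str, language: str, path: str) -> str:
    """Build a per-language URL, assuming a /{language}/path routing scheme."""
    return f"{base_url}/{language}{path}"


def canonical_link(base_url: str, language: str, path: str) -> str:
    """Each translation declares itself as canonical for its own language."""
    return f'<link rel="canonical" href="{localized_url(base_url, language, path)}" />'


def hreflang_links(base_url: str, path: str, languages: List[str], default: str) -> List[str]:
    """One alternate per language, plus x-default for unmatched visitors."""
    links = [
        f'<link rel="alternate" hreflang="{lang}" href="{localized_url(base_url, lang, path)}" />'
        for lang in languages
    ]
    links.append(
        f'<link rel="alternate" hreflang="x-default" href="{localized_url(base_url, default, path)}" />'
    )
    return links


tags = hreflang_links("https://example.com", "/pricing", ["en", "ja", "de"], default="en")
```

Every language version of a page should emit the same full set of alternates, including a reference to itself.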
What two months gets you
The finished platform: multiple AI tools sharing a common streaming architecture, each with its own prompt configuration and interaction history. Users describe what they need in plain language and get working code back in real time. Streaming that shows the first output in under a second, subscription billing across 20+ currencies, and a fully localized experience across seven languages.
That timeline was possible because every system (streaming, payments, access control, i18n, RAG) was scoped to do one thing well and connected through clean boundaries. The complexity of any AI product isn’t in any single feature. It’s in how the features interact. Keep the boundaries clean and a two-month timeline is realistic for a system this broad.
The streaming layer is one example of that principle. The broader pattern (a clean contract between application code and the AI provider, so swaps and upgrades stay cheap) is covered in model-agnostic AI architecture.