AI Chat with Streaming
Per-project AI chat with SSE streaming, prompt engineering, and analytics.
Part 6 of the "Building a Modern Portfolio Without a Meta-Framework" series
AI chat features are everywhere now. Most implementations involve calling OpenAI's API and hoping for the best. I wanted something tighter: a chat that knows about my specific projects, streams responses in real-time, and tracks usage without external services.
The result is a per-project chat where visitors can ask questions about any project. The AI responds as me, with full context about the project's architecture, technologies, and challenges.
The Architecture
The chat flows through three layers:
- Client: React hook managing messages, SSE parsing, and UI state
- Worker: Hono endpoint handling validation, AI calls, and analytics
- AI: Cloudflare Workers AI running Llama 4 at the edge
All running on Cloudflare's infrastructure, so the AI inference happens in the same network as the Worker. No external API calls, no API keys to manage.
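On the Worker side, that wiring is just a typed binding on the environment. A minimal sketch, assuming Hono and @cloudflare/workers-types; the binding name AI matches the handler shown later:

import { Hono } from 'hono';

// Bindings the Worker expects. `Ai` comes from @cloudflare/workers-types,
// and the name must match the AI binding declared in the wrangler config.
type Bindings = {
  AI: Ai;
};

const app = new Hono<{ Bindings: Bindings }>();

export default app;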
System Prompt Engineering
The system prompt is where LLM behavior lives or dies. Here's what I landed on:
const projectSystemPrompt = `You are Adam Grady, a Senior Software Engineer.
You speak in first person about the provided project information.
Be conversational and enthusiastic about your work.
IMPORTANT RULES:
- Be brief and direct. Give short, punchy answers (1-3 sentences when possible).
- NEVER repeat or rephrase the user's question back to them.
- NEVER start responses with phrases like "Great question!" or similar filler.
- Jump straight into the answer.
- Use markdown formatting.
- NEVER use emojis in your responses.
CRITICAL - TOPIC BOUNDARIES:
- You ONLY answer questions about THIS SPECIFIC PROJECT and my work on it.
- If asked about ANYTHING else, respond with: "I can only discuss this project."
- Do NOT demonstrate general knowledge outside this project context.
- When uncertain if something is related, err on the side of declining.`;
A few deliberate choices here:
- Brevity rules: LLMs love to ramble. Explicit instructions to be brief cut response length by 60%.
- No question echoing: "Great question! You asked about..." wastes tokens and feels robotic.
- Topic boundaries: Without this, the model happily becomes a general-purpose assistant. I don't want visitors asking it to write code or explain quantum physics.
The project context gets appended dynamically:
const systemMessage = {
  role: 'system',
  content: `${projectSystemPrompt}

Here is the project information:
Title: ${project.title}
Description: ${project.description}
Technologies: ${project.technologies.join(', ')}
Link: ${project.link ?? 'N/A'}
Github: ${project.github ?? 'N/A'}
Content: ${project.content}`,
};
That project.content is the full MDX content converted to markdown. More on that later.
I could have used tools or retrieval-augmented generation, but for this use case, a well-crafted prompt with project context suffices.
Why SSE Over WebSockets
For streaming LLM responses, you have two real options: WebSockets or Server-Sent Events.
WebSockets give you bidirectional communication. SSE gives you server-to-client only. For chat, that's all you need—the client sends messages via POST, the server streams back.
SSE wins here because:
- Simpler: No connection handshake, just a long-lived HTTP response
- HTTP/2 native: Multiplexes over existing connections
- Auto-reconnect: EventSource clients reconnect automatically (the fetch-based reader used here manages its own lifecycle)
- Firewall-friendly: It's just HTTP
The format is dead simple:
data: {"response":"Hello"}
data: {"response":" world"}
data: [DONE]
Each chunk is a line starting with data:, followed by JSON. The client accumulates these chunks to build the full response.
Client-Side Stream Parsing
The useProjectChat hook handles all the streaming complexity:
import { useCallback, useRef, useState } from 'react';

// ChatMessage, UseProjectChatOptions, and the storage helpers
// (loadMessages, getSessionId) are defined elsewhere; see below.
export function useProjectChat({
  slug,
  onStreamCompleted,
}: UseProjectChatOptions) {
  const [messages, setMessages] = useState<ChatMessage[]>(() =>
    loadMessages(slug),
  );
  const [isLoading, setIsLoading] = useState(false);
  const abortControllerRef = useRef<AbortController | null>(null);

  const sendMessage = useCallback(
    async (content: string) => {
      // Cancel any existing request
      abortControllerRef.current?.abort();
      abortControllerRef.current = new AbortController();

      // Append the user's message before sending
      const newMessages: ChatMessage[] = [...messages, { role: 'user', content }];
      setMessages(newMessages);
      setIsLoading(true);

      const streamStartTime = Date.now();
      let assistantContent = '';

      try {
        const response = await fetch(`/api/chat/projects/${slug}`, {
          method: 'POST',
          headers: { 'Content-Type': 'application/json' },
          body: JSON.stringify({
            sessionId: getSessionId(),
            messages: newMessages,
          }),
          signal: abortControllerRef.current.signal,
        });

        if (!response.ok || !response.body) {
          throw new Error(`Chat request failed: ${response.status}`);
        }

        // Add placeholder message
        setMessages((prev) => [...prev, { role: 'assistant', content: '' }]);

        // Process SSE stream
        const reader = response.body.getReader();
        const decoder = new TextDecoder();
        let buffer = '';

        while (true) {
          const { done, value } = await reader.read();
          if (done) break;

          buffer += decoder.decode(value, { stream: true });
          const lines = buffer.split('\n');
          buffer = lines.pop() || ''; // Keep incomplete line in buffer

          for (const line of lines) {
            if (line.startsWith('data: ')) {
              const data = line.slice(6);
              if (data === '[DONE]') continue;

              const parsed = JSON.parse(data);
              if (parsed.response) {
                assistantContent += parsed.response;
                // Update message in place
                setMessages((prev) => {
                  const updated = [...prev];
                  updated[updated.length - 1] = {
                    role: 'assistant',
                    content: assistantContent,
                  };
                  return updated;
                });
              }
            }
          }
        }

        onStreamCompleted?.(
          assistantContent.length,
          Date.now() - streamStartTime,
        );
      } finally {
        setIsLoading(false);
      }
    },
    [slug, messages],
  );

  return { messages, sendMessage, isLoading };
}
The key insight: buffer incomplete lines. SSE chunks can arrive mid-line, so lines.pop() keeps the partial line for the next iteration.
The AbortController handles request cancellation. If someone navigates away or starts a new request, we abort the previous one cleanly.
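The hook leans on two small helpers that aren't shown above: loadMessages and getSessionId. A sketch of what they might look like, assuming sessionStorage persistence (which is what the production notes below refer to); saveMessages and the storage key names are hypothetical:

// Hypothetical storage helpers, assuming sessionStorage persistence.
const STORAGE_PREFIX = 'project-chat:';

export function loadMessages(slug: string): ChatMessage[] {
  try {
    const raw = sessionStorage.getItem(`${STORAGE_PREFIX}${slug}`);
    return raw ? (JSON.parse(raw) as ChatMessage[]) : [];
  } catch {
    return [];
  }
}

// Counterpart to loadMessages; the hook would call this from an effect
// whenever messages change.
export function saveMessages(slug: string, messages: ChatMessage[]) {
  sessionStorage.setItem(`${STORAGE_PREFIX}${slug}`, JSON.stringify(messages));
}

export function getSessionId(): string {
  let id = sessionStorage.getItem('chat-session-id');
  if (!id) {
    id = crypto.randomUUID();
    sessionStorage.setItem('chat-session-id', id);
  }
  return id;
}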
MDX to Markdown Conversion
Projects are written in MDX, but the AI needs plain text. React components won't help an LLM understand your architecture.
The solution: render the MDX to HTML, then convert to markdown:
import { renderToString } from 'react-dom/server';
import { NodeHtmlMarkdown } from 'node-html-markdown';
const nhm = new NodeHtmlMarkdown();
// `projects` is a slug-keyed map of imported MDX modules (metadata + default component)
function getProject(slug: string) {
  const project = projects[slug];
  if (!project) return;

  return {
    ...project.metadata,
    content: nhm.translate(renderToString(project.default())),
  };
}
renderToString converts the React component tree to HTML. NodeHtmlMarkdown strips the tags and produces clean markdown the AI can understand.
This runs on every request, but it's fast enough. The alternative—pre-computing markdown at build time—adds complexity for minimal gain.
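If the conversion ever did show up in profiles, a module-level memo would be a cheap middle ground. A sketch, with getProjectCached and markdownCache as hypothetical names:

// Hypothetical memo: convert each project's MDX once per Worker isolate,
// then reuse the markdown for later requests.
const markdownCache = new Map<string, string>();

function getProjectCached(slug: string) {
  const project = projects[slug];
  if (!project) return;

  let content = markdownCache.get(slug);
  if (content === undefined) {
    content = nhm.translate(renderToString(project.default()));
    markdownCache.set(slug, content);
  }

  return { ...project.metadata, content };
}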
The Chat Endpoint
Putting it all together:
import { z } from 'zod';
import { validator } from 'hono/validator';

const projectChatSchema = z.object({
  sessionId: z.uuid(),
  messages: z.array(
    z.object({
      role: z.enum(['user', 'assistant']),
      content: z.string(),
    }),
  ),
});

app.post(
  '/api/chat/projects/:slug',
  validator('json', (value, c) => {
    const parsed = projectChatSchema.safeParse(value);
    if (!parsed.success) return c.text('Invalid!', 400);
    return parsed.data;
  }),
  async (c) => {
    const { slug } = c.req.param();
    const { messages, sessionId } = c.req.valid('json');

    const project = getProject(slug);
    if (!project) return c.text('Project not found', 404);

    const projectContext = `Here is the project information:
Title: ${project.title}
Description: ${project.description}
Technologies: ${project.technologies.join(', ')}
Link: ${project.link ?? 'N/A'}
Github: ${project.github ?? 'N/A'}
Content: ${project.content}`;

    const systemMessage = {
      role: 'system',
      content: `${projectSystemPrompt}\n\n${projectContext}`,
    };

    // startTime and sessionId feed the analytics write (see the teeing sketch below)
    const startTime = Date.now();
    const response = await c.env.AI.run(
      '@cf/meta/llama-4-scout-17b-16e-instruct',
      {
        messages: [systemMessage, ...messages],
        stream: true,
        max_tokens: 512,
        temperature: 0.3,
        top_p: 0.9,
      },
    );

    return c.body(response as ReadableStream, 200, {
      'Content-Type': 'text/event-stream',
    });
  },
);
Zod validates the request. The project content gets loaded and converted. The AI runs with streaming enabled. The response streams back to the client, with a teed copy feeding analytics (sketched below). Clean.
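The teeing itself isn't shown in the handler above. Here's a sketch of how its last few lines could look, splitting the stream and draining one branch in the background via waitUntil; the analytics write itself is left as a comment:

// Hypothetical: split the SSE stream so one branch streams to the client
// and the other is drained in the background for analytics.
const [toClient, toAnalytics] = (response as ReadableStream<Uint8Array>).tee();

c.executionCtx.waitUntil(
  (async () => {
    const reader = toAnalytics.getReader();
    let bytes = 0;
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      bytes += value.byteLength;
    }
    // Write { slug, sessionId, bytes, durationMs: Date.now() - startTime }
    // to Workers Analytics Engine or D1 here.
  })(),
);

return c.body(toClient, 200, { 'Content-Type': 'text/event-stream' });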
The model parameters matter:
- max_tokens: 512: Keeps responses concise
- temperature: 0.3: Lower = more deterministic, less creative rambling
- top_p: 0.9: Nucleus sampling for natural-sounding text
What I'd Add for Production
This works well for a portfolio. For a production chat:
- Rate limiting: Cloudflare has built-in rate limiting by IP (a quick sketch follows this list)
- Content moderation: Filter inappropriate requests before they hit the model
- Conversation history in D1: Currently stored in sessionStorage, lost on browser close
- Model fallbacks: If Llama fails, try a backup model
- Prompt injection detection: The current boundaries help, but dedicated detection is better
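For the rate-limiting item, a sketch using Cloudflare's Workers rate-limiting binding (still in beta last I checked); RATE_LIMITER is a hypothetical binding name:

// Hypothetical middleware: reject chatty clients before the AI call,
// keyed by the CF-Connecting-IP header Cloudflare adds to every request.
app.use('/api/chat/*', async (c, next) => {
  const ip = c.req.header('CF-Connecting-IP') ?? 'unknown';
  const { success } = await c.env.RATE_LIMITER.limit({ key: ip });
  if (!success) return c.text('Too many requests', 429);
  await next();
});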
Alternatives I Considered
OpenAI/Anthropic APIs: Better models, but external calls add latency and require API key management. Cloudflare AI runs in the same network.
Non-streaming: Simpler implementation, but users stare at a spinner for 2-5 seconds. Streaming shows progress immediately.
WebSockets: Overkill for request-response chat. SSE is simpler and HTTP-native.
Vercel AI SDK: Nice abstractions, but I wanted to understand the raw primitives first.