Making users wait for a multi-second LLM response to complete before showing them anything is a perceived-latency cost the chat surface cannot afford. This release moves the platform's AI chat surface to Server-Sent Events (SSE), so tokens stream to the browser as the model produces them and the user sees the answer compose in real time.
- Token-progressive delivery. Each SSE event carries the next chunk of the LLM response; the browser appends it to the rendered message as it arrives instead of waiting for the full response.
- Connection-lease management. Each active chat session leases a connection from a bounded pool. When a traffic spike exhausts the pool, new sessions wait briefly for a free connection rather than crashing existing ones. Operators can size the pool against expected concurrency.
- Mid-stream cancellation. A user-initiated Stop cancels the in-flight LLM call cleanly; the connection is released back to the pool immediately, and the partial response is preserved in the session history.
- Knowledge-aware assistant mode. The assistant consults the RAG layer (HNSW on Informix or pgvector on PostgreSQL) before each turn; the retrieved document context is passed to the LLM call alongside the user message, and the source citations are rendered in the chat as inline references.
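
As a rough illustration of how the first three items fit together, here is a minimal Python sketch. It is not the platform's implementation: `sse_event`, `LeasePool`, `fake_llm`, `stream_chat`, and `demo` are hypothetical names, the pool is modeled with a plain `asyncio.Semaphore`, and the model call is replaced by a stand-in generator. The RAG lookup is left out.

```python
import asyncio
from typing import AsyncGenerator, List


def sse_event(data: str, event: str = "token") -> str:
    """Format one SSE frame; multi-line payloads need one 'data:' line each."""
    body = "".join(f"data: {line}\n" for line in data.split("\n"))
    return f"event: {event}\n{body}\n"


class LeasePool:
    """Hypothetical bounded pool: at most `size` streams run at once.

    Sessions beyond the limit wait in __aenter__ until a lease is free.
    """

    def __init__(self, size: int) -> None:
        self._sem = asyncio.Semaphore(size)
        self.in_use = 0

    async def __aenter__(self) -> "LeasePool":
        await self._sem.acquire()
        self.in_use += 1
        return self

    async def __aexit__(self, *exc_info) -> None:
        self.in_use -= 1
        self._sem.release()


async def fake_llm(prompt: str) -> AsyncGenerator[str, None]:
    """Stand-in for the real model call (assumption): yields fixed chunks."""
    for chunk in ["Streaming ", "keeps ", "the ", "user ", "informed."]:
        await asyncio.sleep(0.01)
        yield chunk


async def stream_chat(
    pool: LeasePool, prompt: str, history: List[str], stop: asyncio.Event
) -> AsyncGenerator[str, None]:
    """Yield SSE frames while holding a lease; honor Stop between chunks."""
    partial: List[str] = []
    async with pool:  # lease is released when this block exits
        llm = fake_llm(prompt)
        try:
            async for chunk in llm:
                if stop.is_set():
                    break  # user pressed Stop: abandon the rest of the reply
                partial.append(chunk)
                yield sse_event(chunk)
        finally:
            await llm.aclose()  # shut down the in-flight model stream
            history.append("".join(partial))  # keep whatever was produced


async def demo():
    pool = LeasePool(size=2)
    history: List[str] = []
    stop = asyncio.Event()
    frames: List[str] = []
    async for frame in stream_chat(pool, "hello", history, stop):
        frames.append(frame)
        if len(frames) == 2:
            stop.set()  # simulate the user pressing Stop after two chunks
    return frames, history, pool.in_use


if __name__ == "__main__":
    frames, history, in_use = asyncio.run(demo())
    print(frames, history, in_use)
```

In this sketch the semaphore is what makes excess sessions wait instead of failing, and the `finally` block is what keeps the partial response in the history even when the stream ends early. A real server would write each frame to an HTTP response with the `text/event-stream` content type.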