Streaming Mode

Memlayer supports streaming responses from all supported providers (OpenAI, Claude, Gemini, and Ollama). This guide explains how streaming works, its performance characteristics, and best practices.

What is Streaming?

Streaming mode yields response chunks as they're generated by the LLM, rather than waiting for the complete response. This provides a better user experience through lower perceived latency.
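
For comparison, here is a minimal sketch of the two call styles side by side. It assumes the non-streaming call returns the full reply as a single string, which is not shown elsewhere in this guide:

from memlayer.wrappers.openai import OpenAI

client = OpenAI(model="gpt-4.1-mini", user_id="alice")
messages = [{"role": "user", "content": "Tell me about machine learning"}]

# Non-streaming (assumed): blocks until the entire reply is available
reply = client.chat(messages)
print(reply)

# Streaming: iterate over chunks as soon as they are generated
for chunk in client.chat(messages, stream=True):
    print(chunk, end="", flush=True)
print()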

Basic Usage

Streaming with OpenAI

from memlayer.wrappers.openai import OpenAI

client = OpenAI(model="gpt-4.1-mini", user_id="alice")

# Enable streaming with stream=True
for chunk in client.chat(
    [{"role": "user", "content": "Tell me about machine learning"}],
    stream=True
):
    print(chunk, end="", flush=True)
print()  # Newline after completion
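
If you also need the complete reply after streaming finishes (for logging or further processing), you can accumulate the chunks while printing them; a minimal sketch using the same client as above, assuming each chunk is a plain string:

chunks = []
for chunk in client.chat(
    [{"role": "user", "content": "Tell me about machine learning"}],
    stream=True
):
    chunks.append(chunk)
    print(chunk, end="", flush=True)
print()

full_response = "".join(chunks)  # complete reply once the stream ends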

Streaming with Claude

from memlayer.wrappers.claude import Claude

client = Claude(model="claude-3-5-sonnet-20241022", user_id="alice")

for chunk in client.chat(
    [{"role": "user", "content": "Explain quantum computing"}],
    stream=True
):
    print(chunk, end="", flush=True)

Streaming with Gemini

from memlayer.wrappers.gemini import Gemini

client = Gemini(model="gemini-2.5-flash", user_id="alice")

for chunk in client.chat(
    [{"role": "user", "content": "What is a neural network?"}],
    stream=True
):
    print(chunk, end="", flush=True)

Streaming with Ollama

from memlayer.wrappers.ollama import Ollama

client = Ollama(model="llama3.2", user_id="alice")

for chunk in client.chat(
    [{"role": "user", "content": "Describe photosynthesis"}],
    stream=True
):
    print(chunk, end="", flush=True)

Memory Search with Streaming

When memory search is in use, the LLM may call the search_memory tool before it begins streaming the response:

# Message that triggers memory search
for chunk in client.chat([
    {"role": "user", "content": "What did I tell you about my project?"}
], stream=True):
    print(chunk, end="", flush=True)

# Timeline:
# t=0ms:    Message sent
# t=50ms:   LLM calls search_memory tool
# t=250ms:  Search completes, context retrieved
# t=1500ms: First chunk arrives (including search time)
# t=1500ms+: Chunks stream in real-time
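
The numbers above are illustrative and will vary by provider and search backend. To measure them in your own setup, you can time the first chunk separately from the rest of the stream; a minimal sketch:

import time

start = time.perf_counter()
first_chunk_at = None

for chunk in client.chat([
    {"role": "user", "content": "What did I tell you about my project?"}
], stream=True):
    if first_chunk_at is None:
        first_chunk_at = time.perf_counter()  # includes memory search time
    print(chunk, end="", flush=True)
print()

if first_chunk_at is not None:
    print(f"Time to first chunk: {first_chunk_at - start:.2f}s")
print(f"Total time: {time.perf_counter() - start:.2f}s")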

Knowledge Extraction (Background)

After streaming completes, Memlayer extracts knowledge in a background thread:

for chunk in client.chat([
    {"role": "user", "content": "I work at Acme Corp"}
], stream=True):
    print(chunk, end="", flush=True)

# Your code continues immediately after streaming
print("Streaming done!")  # This prints right away

# Meanwhile, in background:
# - Salience check (~100ms, cached after first run)
# - Knowledge extraction API call (1-2s, fast model)
# - Graph update (~50ms)
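
Once the background extraction has finished (typically a second or two, per the rough numbers above), the stored fact becomes available to later queries via memory search. A minimal sketch of a follow-up turn; the sleep is only a stand-in for whatever happens between turns in a real application:

import time

time.sleep(5)  # illustrative: give background extraction time to complete

for chunk in client.chat([
    {"role": "user", "content": "Where do I work?"}
], stream=True):
    print(chunk, end="", flush=True)
print()
# Should be able to recall "Acme Corp" once the fact has been stored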

Next Steps