Streaming Mode
Memlayer supports streaming responses from all providers (OpenAI, Claude, Gemini, Ollama). This guide explains how streaming works, its performance characteristics, and best practices.
What is Streaming?
Streaming mode yields response chunks as the LLM generates them, rather than waiting for the complete response. The first tokens appear almost immediately, which lowers perceived latency and makes for a better user experience.
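For contrast, here is a minimal sketch of both call styles, assuming (as in the examples below) that a plain chat() call blocks and returns the full response while stream=True yields string chunks:

from memlayer.wrappers.openai import OpenAI

client = OpenAI(model="gpt-4.1-mini", user_id="alice")
messages = [{"role": "user", "content": "Tell me about machine learning"}]

# Non-streaming: blocks until the complete response is available
response = client.chat(messages)
print(response)

# Streaming: print each chunk as soon as it arrives
for chunk in client.chat(messages, stream=True):
    print(chunk, end="", flush=True)
print()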
Basic Usage
Streaming with OpenAI
from memlayer.wrappers.openai import OpenAI

client = OpenAI(model="gpt-4.1-mini", user_id="alice")

# Enable streaming with stream=True
for chunk in client.chat(
    [{"role": "user", "content": "Tell me about machine learning"}],
    stream=True
):
    print(chunk, end="", flush=True)
print()  # Newline after completion
Streaming with Claude
from memlayer.wrappers.claude import Claude

client = Claude(model="claude-3-5-sonnet-20241022", user_id="alice")

for chunk in client.chat(
    [{"role": "user", "content": "Explain quantum computing"}],
    stream=True
):
    print(chunk, end="", flush=True)
Streaming with Gemini
from memlayer.wrappers.gemini import Gemini

client = Gemini(model="gemini-2.5-flash", user_id="alice")

for chunk in client.chat(
    [{"role": "user", "content": "What is a neural network?"}],
    stream=True
):
    print(chunk, end="", flush=True)
Streaming with Ollama
from memlayer.wrappers.ollama import Ollama

client = Ollama(model="llama3.2", user_id="alice")

for chunk in client.chat(
    [{"role": "user", "content": "Describe photosynthesis"}],
    stream=True
):
    print(chunk, end="", flush=True)
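The examples above print chunks as they arrive, but you can also collect them if you need the full text afterwards. A minimal sketch, assuming each chunk is a plain string as the print calls above imply:

chunks = []
for chunk in client.chat(
    [{"role": "user", "content": "Describe photosynthesis"}],
    stream=True
):
    print(chunk, end="", flush=True)  # display in real time
    chunks.append(chunk)              # keep a copy of every chunk

full_response = "".join(chunks)  # complete text, e.g. for logging or storage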
Memory Search with Streaming
When a message needs stored context, the LLM may call the search_memory tool before streaming begins, which delays the first chunk:
# Message that triggers memory search
for chunk in client.chat([
    {"role": "user", "content": "What did I tell you about my project?"}
], stream=True):
    print(chunk, end="", flush=True)

# Timeline:
# t=0ms:     Message sent
# t=50ms:    LLM calls search_memory tool
# t=250ms:   Search completes, context retrieved
# t=1500ms:  First chunk arrives (including search time)
# t=1500ms+: Chunks stream in real-time
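To see where that pre-stream latency goes in your own setup, you can time the first chunk with the standard library. A small sketch (the millisecond figures above are illustrative, not guaranteed):

import time

start = time.monotonic()
first_chunk_at = None

for chunk in client.chat([
    {"role": "user", "content": "What did I tell you about my project?"}
], stream=True):
    if first_chunk_at is None:
        # Includes any search_memory round trip that happened before streaming
        first_chunk_at = time.monotonic()
    print(chunk, end="", flush=True)

print(f"\nTime to first chunk: {(first_chunk_at - start) * 1000:.0f} ms")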
Knowledge Extraction (Background)
After streaming completes, Memlayer extracts knowledge in a background thread:
for chunk in client.chat([
    {"role": "user", "content": "I work at Acme Corp"}
], stream=True):
    print(chunk, end="", flush=True)

# Your code continues immediately after streaming
print("Streaming done!")  # This prints right away

# Meanwhile, in the background:
# - Salience check (~100ms, cached after first run)
# - Knowledge extraction API call (1-2s, fast model)
# - Graph update (~50ms)
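Because extraction happens after the stream finishes, a recall query sent in the same instant may not see the new fact yet. A rough way to check, assuming the background work completes within a few seconds (per the estimates above) and that a non-streaming chat() call returns the reply directly:

import time

# Give the background salience check and extraction a moment to finish
time.sleep(5)

# A later query should now be able to recall the stored fact
answer = client.chat([
    {"role": "user", "content": "Where do I work?"}
])
print(answer)  # expected to mention Acme Corp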
Next Steps
- Overview: Understand the full architecture
- Quickstart: Get started in 5 minutes
- Operation Modes: Choose the right mode
- Examples: See complete working code