# Memlayer Overview

## What is Memlayer?
Memlayer is a memory-enhanced LLM wrapper that automatically builds and maintains a persistent knowledge graph from your conversations. It adds memory capabilities to any LLM provider (OpenAI, Claude, Gemini, Ollama) without changing how you interact with them.
## Core Architecture

### How It Works
- **Chat Flow**: When you send a message via `.chat()`, Memlayer:
  - Searches the knowledge graph for relevant context
  - Injects that context into the LLM prompt via tool calls
  - Returns the LLM's response to you
  - Asynchronously extracts knowledge and updates the graph
- **Knowledge Extraction**: After each conversation turn:
  - Text is analyzed by a fast model (background thread)
  - Facts, entities, and relationships are extracted
  - A salience gate filters out trivial information
  - Knowledge is stored in both the vector DB and the graph DB
- **Memory Search**: When the LLM needs context:
  - Hybrid search combines vector similarity + graph traversal
  - Three search tiers are available: `fast`, `balanced`, `deep`
  - Results are ranked and returned as context
- **Background Services**:
  - Consolidation: extracts knowledge from conversations (async)
  - Curation: expires time-sensitive facts (background thread)
  - Salience gate: filters low-value information before storage
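The non-streaming flow boils down to "answer first, learn afterwards." Below is a minimal sketch of that orchestration; the helper names (`search_graph`, `call_llm`, `extract_knowledge`) are placeholders rather than Memlayer's internals, and the real wrapper exposes search as a tool the LLM calls instead of pre-injecting context as shown here.

```python
import threading

def chat(messages, search_graph, call_llm, extract_knowledge):
    """Sketch of the chat flow; all three helpers are hypothetical."""
    # 1. Search the knowledge graph for context relevant to the new message
    context = search_graph(messages[-1]["content"])

    # 2. Inject that context into the prompt (the real wrapper does this
    #    via a search_memory tool call rather than a system message)
    primed = [{"role": "system", "content": f"Relevant memories:\n{context}"}]
    primed += messages

    # 3. Get the LLM's response and hand it back immediately
    response = call_llm(primed)

    # 4. Extract knowledge on a daemon thread so the caller is never blocked
    threading.Thread(
        target=extract_knowledge,
        args=(messages[-1]["content"], response),
        daemon=True,
    ).start()
    return response
```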
### Data Flow

#### Normal Chat (Non-Streaming)
```
User Message
     │
     ▼
Memory Search (if LLM calls tool)
     │
     ▼
LLM Response Generated
     │
     ├─► Return to User
     │
     └─► Background: Extract Knowledge → Store in Graph
```
#### Streaming Chat
```
User Message
     │
     ▼
Memory Search (if LLM calls tool)
     │
     ▼
LLM Starts Streaming
     │
     ├─► Yield chunks to user in real-time
     │
     └─► Background: Buffer full response
               │
               └─► After stream completes → Extract Knowledge → Store in Graph
```
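The streaming path has to serve two consumers at once: the caller, chunk by chunk, and the extractor, which needs the complete text. A generator that buffers while it yields captures the idea; this is an illustrative sketch, not Memlayer's actual implementation:

```python
import threading

def stream_and_buffer(chunks, extract_knowledge):
    """Yield each chunk to the caller while buffering the full response."""
    buffer = []
    for chunk in chunks:      # `chunks`: an iterator of text pieces from the LLM
        buffer.append(chunk)
        yield chunk           # the caller sees output in real time

    # Stream exhausted: hand the complete text to background extraction
    full_response = "".join(buffer)
    threading.Thread(
        target=extract_knowledge, args=(full_response,), daemon=True
    ).start()
```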
## Key Features

### 1. Provider-Agnostic
Works with OpenAI, Anthropic Claude, Google Gemini, and local Ollama models. Same API across all providers.
### 2. Automatic Memory Tools

The LLM automatically gets access to:
- `search_memory`: Hybrid vector + graph search
- `schedule_task`: Create time-based reminders
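In OpenAI-style function calling, those two tools would be declared with JSON schemas roughly like the ones below. This is a hand-written approximation; the exact descriptions and parameters Memlayer registers may differ.

```python
memory_tools = [
    {
        "type": "function",
        "function": {
            "name": "search_memory",
            "description": "Hybrid vector + graph search over stored memories.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "What to look up"},
                },
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "schedule_task",
            "description": "Create a time-based reminder.",
            "parameters": {
                "type": "object",
                "properties": {
                    "task": {"type": "string"},
                    "due": {"type": "string", "description": "ISO 8601 timestamp"},
                },
                "required": ["task", "due"],
            },
        },
    },
]
```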
### 3. Flexible Search Tiers

- `fast`: Vector-only search, <100ms
- `balanced`: Vector + 1-hop graph traversal
- `deep`: Full graph traversal with entity extraction
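Conceptually, each tier buys more recall with more latency. The sketch below shows the dispatch, with stub primitives standing in for the real vector and graph machinery (every name here is invented for illustration):

```python
# Stub primitives standing in for the real vector/graph machinery
def vector_search(query):       return [f"vec:{query}"]
def graph_neighbors(hits):      return [f"1hop:{h}" for h in hits]
def extract_entities(query):    return query.split()
def graph_traverse(entities):   return [f"path:{e}" for e in entities]
def rank(hits):                 return sorted(set(hits))

def search(query: str, tier: str = "balanced"):
    hits = vector_search(query)          # every tier starts with vector similarity
    if tier == "fast":
        return rank(hits)                # vector-only: <100ms
    if tier == "balanced":
        return rank(hits + graph_neighbors(hits))   # add 1-hop graph context
    # deep: extract entities from the query, then walk the full graph
    return rank(hits + graph_traverse(extract_entities(query)))
```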
### 4. Knowledge Graph Features
- Entity deduplication (e.g., "John" = "John Smith")
- Relationship tracking between entities
- Time-aware facts with expiration dates
- Importance scoring for fact prioritization
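Entity deduplication amounts to resolving a new mention against the aliases of existing graph nodes. Here is a toy sketch using `networkx` (which the config's `networkx_path` suggests backs the graph store); the lowercase-alias heuristic is an assumption for illustration, not Memlayer's actual matching logic:

```python
import networkx as nx

graph = nx.DiGraph()
graph.add_node("John Smith", aliases={"john", "john smith"})

def resolve_entity(mention: str) -> str:
    """Map a mention to its canonical node, adding a new node if nothing matches."""
    key = mention.lower()
    for node, data in graph.nodes(data=True):
        if key in data.get("aliases", set()):
            return node                      # "John" resolves to "John Smith"
    graph.add_node(mention, aliases={key})   # unseen entity: create a new node
    return mention

print(resolve_entity("John"))    # -> John Smith
print(resolve_entity("Alice"))   # -> Alice
```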
### 5. Operation Modes

Choose an embedding strategy based on your needs:
- `online`: API-based embeddings (OpenAI), fast startup
- `local`: Local sentence-transformer model, no API costs
- `lightweight`: Graph-only, no embeddings, fastest startup
## Configuration Options
```python
from memlayer.wrappers.openai import OpenAI

client = OpenAI(
    # Core settings
    api_key="your-key",
    model="gpt-4.1-mini",
    user_id="user123",

    # Memory behavior
    operation_mode="online",      # online | local | lightweight
    salience_threshold=0.5,       # 0.0-1.0, filters trivial content

    # Storage paths
    chroma_dir="./my_chroma_db",
    networkx_path="./my_graph.pkl",

    # Search behavior
    max_search_results=5,
    search_tier="balanced",       # fast | balanced | deep

    # Performance tuning
    curation_interval=3600,       # check for expired facts every hour
    embedding_model="text-embedding-3-small",
)
```
## Common Usage Patterns

### Basic Chat
```python
response = client.chat([
    {"role": "user", "content": "My name is Alice"}
])
# Knowledge is automatically extracted and stored in the background
```
### Streaming Chat
```python
for chunk in client.chat(
    [{"role": "user", "content": "What's my name?"}],
    stream=True,
):
    print(chunk, end="", flush=True)
```
### Direct Knowledge Ingestion
```python
# Import knowledge from documents
client.update_from_text("""
Project Phoenix is led by Alice.
The project deadline is December 1st.
""")
```
### Synthesized Q&A
```python
# Get a memory-grounded answer
answer = client.synthesize_answer("Who leads Project Phoenix?")
```
## Performance Characteristics
| Component | Latency | Notes |
|---|---|---|
| Memory search (fast) | 50-100ms | Vector search only |
| Memory search (balanced) | 100-300ms | Vector + 1-hop graph |
| Memory search (deep) | 300-1000ms | Full graph traversal |
| Knowledge extraction | 1-3s | Background, doesn't block response |
| Consolidation | 1-2s | Async, uses fast model |
| First-time salience init | 1-2s | Cached after first run |
## Best Practices
- **Choose the right operation mode:**
  - Serverless → `online` mode
  - Privacy-sensitive → `local` mode
  - Demos/prototypes → `lightweight` mode
- **Use streaming for better UX:**
  - First chunk arrives in 1-3s
  - Knowledge extraction happens in the background
  - The user sees the response immediately
- **Tune the salience threshold** (see the sketch after this list):
  - Low (0.3-0.5): keeps more memories, higher storage
  - Medium (0.5-0.7): balanced, recommended default
  - High (0.7-0.9): only important facts, minimal storage
- **Set expiration dates for time-sensitive facts:**
  - The system automatically extracts expiration dates from text
  - The curation service removes expired facts periodically
- **Use the appropriate search tier:**
  - `fast`: quick lookups, high-traffic applications
  - `balanced`: default, good recall with reasonable latency
  - `deep`: complex questions needing graph reasoning
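To make the threshold concrete, here is a toy salience gate: score each candidate fact and store only those that clear the threshold. The dict shape and scores are invented for illustration; Memlayer's actual gate scores facts with a model rather than fixed numbers.

```python
def salience_gate(facts, threshold=0.5):
    """Keep only facts whose salience score clears the threshold."""
    return [f for f in facts if f["score"] >= threshold]

candidates = [
    {"text": "Alice leads Project Phoenix", "score": 0.9},  # durable fact
    {"text": "The user typed 'ok'",         "score": 0.1},  # trivial chatter
]
print(salience_gate(candidates, threshold=0.5))
# -> only the Project Phoenix fact survives to storage
```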
## Next Steps
- Quickstart Guide: Get up and running in 5 minutes
- Streaming Mode: Deep dive into streaming behavior
- Operation Modes: Architecture implications of each mode
- Provider Setup: Provider-specific configuration