Memlayer Overview

What is Memlayer?

Memlayer is a memory-enhanced LLM wrapper that automatically builds and maintains a persistent knowledge graph from your conversations. It adds memory capabilities to any LLM provider (OpenAI, Claude, Gemini, Ollama) without changing how you interact with them.
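
For orientation, here is a minimal usage sketch using the OpenAI wrapper documented in the configuration section below; constructor arguments beyond those shown there are not assumed.

from memlayer.wrappers.openai import OpenAI

# Wrap the provider once, then chat as usual; memory search, context injection,
# and background knowledge extraction happen behind the scenes.
client = OpenAI(api_key="your-key", model="gpt-4.1-mini", user_id="user123")

client.chat([{"role": "user", "content": "My name is Alice and I lead Project Phoenix."}])

# A later call can draw on the stored knowledge graph.
response = client.chat([{"role": "user", "content": "Who leads Project Phoenix?"}])
print(response)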

Core Architecture

How It Works

  1. Chat Flow (sketched in code after this list): When you send a message via .chat(), Memlayer:
     • Searches the knowledge graph for relevant context
     • Injects that context into the LLM prompt via tool calls
     • Returns the LLM's response to you
     • Asynchronously extracts knowledge and updates the graph

  2. Knowledge Extraction: After each conversation turn:
     • Text is analyzed by a fast model (background thread)
     • Facts, entities, and relationships are extracted
     • A salience gate filters out trivial information
     • Knowledge is stored in both the vector DB and the graph DB

  3. Memory Search: When the LLM needs context:
     • Hybrid search combines vector similarity + graph traversal
     • Three search tiers are available: fast, balanced, deep
     • Results are ranked and returned as context

  4. Background Services:
     • Consolidation: Extracts knowledge from conversations (async)
     • Curation: Expires time-sensitive facts (background thread)
     • Salience Gate: Filters low-value information before storage
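
A rough, illustrative sketch of the per-turn flow described above; the method names below are hypothetical stand-ins for Memlayer's internal components, not its actual API.

# Illustrative only: hypothetical internals sketching the per-turn pipeline.
def handle_turn(client, messages):
    # 1. The LLM may call the search_memory tool; results come back as context.
    context = client.memory.search(messages[-1]["content"], tier="balanced")  # hypothetical

    # 2. The LLM generates its answer with the retrieved context injected.
    response = client.llm.generate(messages, context=context)  # hypothetical

    # 3. The response is returned to the caller immediately...
    # 4. ...while a background worker extracts facts, applies the salience gate,
    #    and stores surviving knowledge in the vector DB and graph DB.
    client.consolidation.submit(messages, response)  # hypothetical

    return response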

Data Flow

Normal Chat (Non-Streaming)

User Message
    │
    ▼
Memory Search (if LLM calls tool)
    │
    ▼
LLM Response Generated
    │
    ├─► Return to User
    │
    └─► Background: Extract Knowledge → Store in Graph

Streaming Chat

User Message
    │
    ▼
Memory Search (if LLM calls tool)
    │
    ▼
LLM Starts Streaming
    │
    ├─► Yield chunks to user in real-time
    │
    └─► Background: Buffer full response
            │
            └─► After stream completes → Extract Knowledge → Store in Graph

Key Features

1. Provider-Agnostic

Works with OpenAI, Anthropic Claude, Google Gemini, and local Ollama models. Same API across all providers.

2. Automatic Memory Tools

The LLM automatically gets access to:

  • search_memory: Hybrid vector + graph search
  • schedule_task: Create time-based reminders
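
As a rough illustration, the tools exposed to the LLM might resemble the OpenAI-style function definitions below; only the tool names come from this page, and the parameter shapes are assumptions.

# Hypothetical sketch of the tool schemas the LLM sees (parameters are assumed).
memory_tools = [
    {
        "type": "function",
        "function": {
            "name": "search_memory",
            "description": "Hybrid vector + graph search over stored knowledge.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "schedule_task",
            "description": "Create a time-based reminder.",
            "parameters": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "due_at": {"type": "string", "description": "ISO-8601 timestamp"},
                },
                "required": ["description", "due_at"],
            },
        },
    },
]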

3. Flexible Search Tiers

  • fast: Vector-only search, <100ms
  • balanced: Vector + 1-hop graph traversal
  • deep: Full graph traversal with entity extraction
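
The tier is set via the search_tier option shown in the configuration example below; for instance, a latency-sensitive, high-traffic endpoint might pin the fast tier.

from memlayer.wrappers.openai import OpenAI

# Pin the cheap, vector-only tier for a high-traffic endpoint;
# switch to "deep" when graph reasoning is worth the extra latency.
client = OpenAI(
    api_key="your-key",
    model="gpt-4.1-mini",
    user_id="user123",
    search_tier="fast",  # fast | balanced | deep
)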

4. Knowledge Graph Features

  • Entity deduplication (e.g., "John" = "John Smith")
  • Relationship tracking between entities
  • Time-aware facts with expiration dates
  • Importance scoring for fact prioritization
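
To make these features concrete, a stored fact can be pictured roughly as the record below; this is an illustrative data shape, not Memlayer's actual storage schema.

from dataclasses import dataclass
from typing import Optional

# Illustrative only: a rough picture of what the graph tracks per fact.
@dataclass
class Fact:
    subject: str                      # canonical entity after deduplication ("John Smith")
    relation: str                     # relationship between entities ("leads")
    obj: str                          # target entity ("Project Phoenix")
    importance: float                 # importance score used for prioritization
    expires_at: Optional[str] = None  # expiration date for time-sensitive facts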

5. Operation Modes

Choose an embedding strategy based on your needs:

  • online: API-based embeddings (OpenAI), fast startup
  • local: Local sentence-transformer model, no API costs
  • lightweight: Graph-only, no embeddings, fastest startup
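
A sketch of selecting a mode with the operation_mode setting documented below; the mode changes how embeddings are produced, not how you call the client.

from memlayer.wrappers.openai import OpenAI

# Privacy-sensitive deployment: embed locally, no embedding API calls.
local_client = OpenAI(
    api_key="your-key",
    model="gpt-4.1-mini",
    user_id="user123",
    operation_mode="local",
)

# Quick demo or prototype: skip embeddings entirely and rely on the graph alone.
demo_client = OpenAI(
    api_key="your-key",
    model="gpt-4.1-mini",
    user_id="user123",
    operation_mode="lightweight",
)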

Configuration Options

from memlayer.wrappers.openai import OpenAI

client = OpenAI(
    # Core settings
    api_key="your-key",
    model="gpt-4.1-mini",
    user_id="user123",

    # Memory behavior
    operation_mode="online",        # online | local | lightweight
    salience_threshold=0.5,         # 0.0-1.0, filters trivial content

    # Storage paths
    chroma_dir="./my_chroma_db",
    networkx_path="./my_graph.pkl",

    # Search behavior
    max_search_results=5,
    search_tier="balanced",          # fast | balanced | deep

    # Performance tuning
    curation_interval=3600,          # Check for expired facts every hour
    embedding_model="text-embedding-3-small"
)

Common Usage Patterns

Basic Chat

response = client.chat([
    {"role": "user", "content": "My name is Alice"}
])
# Knowledge automatically extracted and stored

Streaming Chat

for chunk in client.chat(
    [{"role": "user", "content": "What's my name?"}],
    stream=True
):
    print(chunk, end="", flush=True)

Direct Knowledge Ingestion

# Import knowledge from documents
client.update_from_text("""
Project Phoenix is led by Alice.
The project deadline is December 1st.
""")

Synthesized Q&A

# Get memory-grounded answer
answer = client.synthesize_answer("Who leads Project Phoenix?")

Performance Characteristics

  Component                    Latency        Notes
  Memory search (fast)         50-100 ms      Vector search only
  Memory search (balanced)     100-300 ms     Vector + 1-hop graph
  Memory search (deep)         300-1000 ms    Full graph traversal
  Knowledge extraction         1-3 s          Background, doesn't block the response
  Consolidation                1-2 s          Async, uses a fast model
  First-time salience init     1-2 s          Cached after the first run
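
Actual numbers vary with hardware, model, and graph size; a quick way to check end-to-end latency on your own setup is to time a call directly, as in this sketch.

import time

# Rough wall-clock check of the full chat round trip (memory search + LLM),
# not the memory search in isolation.
start = time.perf_counter()
response = client.chat([{"role": "user", "content": "What's my name?"}])
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"chat() round trip: {elapsed_ms:.0f} ms")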

Best Practices

  1. Choose the right operation mode (see the configuration sketch after this list):
     • Serverless → online mode
     • Privacy-sensitive → local mode
     • Demos/prototypes → lightweight mode

  2. Use streaming for better UX:
     • The first chunk arrives in 1-3 s
     • Knowledge extraction happens in the background
     • The user sees the response immediately

  3. Tune the salience threshold:
     • Low (0.3-0.5): Keep more memories, higher storage
     • Medium (0.5-0.7): Balanced, recommended default
     • High (0.7-0.9): Only important facts, minimal storage

  4. Set expiration dates for time-sensitive facts:
     • The system automatically extracts expiration dates from text
     • The curation service removes expired facts periodically

  5. Use the appropriate search tier:
     • fast: Quick lookups, high-traffic applications
     • balanced: Default, good recall with reasonable latency
     • deep: Complex questions needing graph reasoning
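
As a starting point, the settings below combine the defaults recommended above; adjust them for your workload.

from memlayer.wrappers.openai import OpenAI

# A reasonable baseline following the recommendations above.
client = OpenAI(
    api_key="your-key",
    model="gpt-4.1-mini",
    user_id="user123",
    operation_mode="local",      # privacy-sensitive default; use "online" for serverless
    salience_threshold=0.6,      # medium band (0.5-0.7): balanced retention
    search_tier="balanced",      # good recall with reasonable latency
)

# Stream responses so users see output immediately while extraction runs in the background.
for chunk in client.chat(
    [{"role": "user", "content": "Summarize what you know about Project Phoenix."}],
    stream=True,
):
    print(chunk, end="", flush=True)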

Next Steps