Operation Modes: Architecture & Performance Implications

Overview

operation_mode is a fundamental architectural choice that determines how Memlayer computes embeddings for semantic search. This affects startup time, runtime performance, cost, resource usage, and deployment constraints.

from memlayer.wrappers.openai import OpenAI

client = OpenAI(
    model="gpt-4.1-mini",
    operation_mode="online"  # or "local" or "lightweight"
)

The Three Modes

1. Online Mode (Default)

Architecture: Uses OpenAI's API for computing text embeddings.

User Query → Vector Search (OpenAI API) → ChromaDB Lookup → Graph Traversal → Results
             └─ API call (~50-100ms)

When to use:

- ✅ Serverless/cloud deployments (AWS Lambda, Cloud Functions)
- ✅ Production applications with internet access
- ✅ When you want minimal local resource usage
- ✅ Rapid development without model downloads

Tradeoffs:

- Startup time: Fast (~200ms, no model loading)
- Search latency: 100-300ms (includes API call)
- Memory usage: Low (~50MB base)
- Cost: ~$0.0001 per search (API call)
- Privacy: Embeddings computed by OpenAI
- Reliability: Requires internet connection
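
To put the per-search cost in perspective, here is a rough back-of-the-envelope estimate. It uses the ~$0.0001/search figure above, which is itself an approximation that depends on your embedding model and token counts:

# Illustrative cost estimate for online mode (actual cost varies with
# the embedding model and the number of tokens per query)
cost_per_search = 0.0001        # USD, approximate figure from above
searches_per_day = 10_000

daily_cost = cost_per_search * searches_per_day   # $1.00/day
monthly_cost = daily_cost * 30                    # ~$30/month
print(f"~${daily_cost:.2f}/day, ~${monthly_cost:.0f}/month")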

Architecture components:

┌─────────────────────────────────────────┐
│  Memlayer (online mode)                 │
├─────────────────────────────────────────┤
│  • ChromaStorage (vector DB)            │
│  • NetworkX/Memgraph (graph DB)         │
│  • OpenAI Embeddings API                │
│  • Salience Gate (API embeddings)       │
└─────────────────────────────────────────┘

Configuration:

client = OpenAI(
    operation_mode="online",
    embedding_model="text-embedding-3-small",  # OpenAI model
    salience_threshold=0.5
)
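
Because online mode calls the OpenAI Embeddings API at query time, credentials must be available before the client is constructed. A minimal sketch of a fail-fast check, assuming the wrapper picks up OPENAI_API_KEY from the environment the way the official OpenAI SDK does:

import os

# Fail fast if credentials are missing (assumption: the wrapper reads
# OPENAI_API_KEY from the environment, like the official OpenAI SDK)
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY must be set for operation_mode='online'")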

2. Local Mode

Architecture: Uses a local sentence-transformer model for embeddings.

User Query → Vector Search (Local Model) → ChromaDB Lookup → Graph Traversal → Results
             └─ GPU/CPU inference (~20-50ms)

When to use:

- ✅ Privacy-sensitive deployments (no external API calls)
- ✅ Offline/air-gapped environments
- ✅ High-volume applications (no per-query API cost)
- ✅ GPU-accelerated servers

Tradeoffs:

- Startup time: Slow first call (~5-10s model load, cached after)
- Search latency: 50-200ms (no API dependency)
- Memory usage: High (~500MB-2GB depending on model)
- Cost: Zero API costs, higher compute costs
- Privacy: All computation on-premises
- Reliability: No internet required

Architecture components:

┌─────────────────────────────────────────┐
│  Memlayer (local mode)                  │
├─────────────────────────────────────────┤
│  • ChromaStorage (vector DB)            │
│  • NetworkX/Memgraph (graph DB)         │
│  • sentence-transformers (local model)  │
│  • Salience Gate (local embeddings)     │
│  • GPU/CPU inference pipeline           │
└─────────────────────────────────────────┘

Configuration:

client = OpenAI(
    operation_mode="local",
    embedding_model="all-MiniLM-L6-v2",  # Local model
    salience_threshold=0.5
)

First-time setup:

# First call downloads model (~80MB) and loads to memory
client.chat([{"role": "user", "content": "test"}])  # ~5-10s

# Subsequent calls are fast (model cached in memory)
client.chat([{"role": "user", "content": "hello"}])  # ~50ms search
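
To avoid paying the download cost inside a live request, the model can be pre-fetched during your build or deploy step. A minimal sketch, assuming Memlayer uses the standard sentence-transformers cache directory:

# warm_cache.py -- run once at build/deploy time (hypothetical helper script)
from sentence_transformers import SentenceTransformer

# Downloads the model weights into the local sentence-transformers cache,
# so the first Memlayer call only pays the in-memory load cost.
SentenceTransformer("all-MiniLM-L6-v2")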

3. Lightweight Mode

Architecture: No embeddings, pure graph-based memory with keyword matching.

User Query → Graph Keyword Search → NetworkX Traversal → Results
             └─ O(n) text scan (~10-50ms)

When to use:

- ✅ Demos and rapid prototyping
- ✅ Resource-constrained environments
- ✅ When semantic search isn't critical
- ✅ Testing and development

Tradeoffs:

- Startup time: Instant (~10ms)
- Search latency: 10-100ms (no embeddings)
- Memory usage: Minimal (~20MB base)
- Cost: Zero (no API, no models)
- Search quality: Lower recall, keyword-based only
- Reliability: No dependencies

Architecture components:

┌─────────────────────────────────────────┐
│  Memlayer (lightweight mode)            │
├─────────────────────────────────────────┤
│  • NetworkX/Memgraph (graph DB only)    │
│  • Keyword-based search                 │
│  • No vector embeddings                 │
│  • No salience gate                     │
└─────────────────────────────────────────┘

Configuration:

client = OpenAI(
    operation_mode="lightweight",
    # No embedding_model needed
)

Search behavior:

# Finds exact matches and entity relationships only
client.chat([{"role": "user", "content": "What's my name?"}])
# ✅ Works: "My name is Alice" stored → "Alice" found

client.chat([{"role": "user", "content": "Tell me about my identity"}])
# ❌ May not work: "identity" doesn't match "name" keyword
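
The limitation comes from keyword overlap: unless a query term appears in (or links to) stored text, lightweight mode has no way to connect the two. A conceptual sketch of keyword matching (illustrative only, not Memlayer's actual implementation) shows why "identity" misses "name":

# Conceptual illustration of keyword-based recall (NOT Memlayer's code)
STOPWORDS = {"what's", "my", "is", "tell", "me", "about", "the", "a"}

def keywords(text: str) -> set[str]:
    # Crude keyword extraction: lowercase, strip punctuation, drop stopwords
    tokens = {t.strip("?.!,").lower() for t in text.split()}
    return tokens - STOPWORDS

def keyword_match(query: str, documents: list[str]) -> list[str]:
    # Return documents sharing at least one keyword with the query
    q = keywords(query)
    return [d for d in documents if q & keywords(d)]

stored = ["My name is Alice", "I work at Acme Corp"]

print(keyword_match("What's my name?", stored))            # ['My name is Alice'] -- "name" overlaps
print(keyword_match("Tell me about my identity", stored))  # [] -- "identity" never appears in storage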

Comparison Table

| Feature           | Online             | Local                  | Lightweight       |
|-------------------|--------------------|------------------------|-------------------|
| Startup time      | ~200ms             | ~5-10s (first call)    | ~10ms             |
| Search latency    | 100-300ms          | 50-200ms               | 10-100ms          |
| Memory usage      | ~50MB              | ~500MB-2GB             | ~20MB             |
| API costs         | ~$0.0001/search    | $0                     | $0                |
| Privacy           | API call to OpenAI | Fully local            | Fully local       |
| Search quality    | High (semantic)    | High (semantic)        | Medium (keywords) |
| GPU benefit       | None               | Yes (faster inference) | None              |
| Internet required | Yes                | No                     | No                |
| Best for          | Production         | Privacy/offline        | Demos/testing     |

Architecture Deep Dive

Storage Backend Impact

# Online mode: Full stack
storage_stack = {
    "chroma": ChromaStorage(),      # Vector embeddings
    "graph": NetworkXStorage(),     # Entity relationships
    "embedder": OpenAIEmbeddings()  # API-based
}

# Local mode: Full stack
storage_stack = {
    "chroma": ChromaStorage(),           # Vector embeddings
    "graph": NetworkXStorage(),          # Entity relationships
    "embedder": SentenceTransformer()    # Local model
}

# Lightweight mode: Graph only
storage_stack = {
    "chroma": None,                 # Disabled
    "graph": NetworkXStorage(),     # Only component
    "embedder": None                # Disabled
}

Salience Gate Behavior

The salience gate filters trivial content before storing it. Behavior varies by mode:

Online mode:

# Uses OpenAI embeddings API to compute similarity
# Cache: ~/.memlayer_cache/salience_prototypes_online.pkl
# First call: ~1-2s (compute 100 prototype embeddings)
# Cached calls: ~0.01s (load from disk)

Local mode:

# Uses local sentence-transformer model
# Cache: ~/.memlayer_cache/salience_prototypes_local.pkl
# First call: ~5-10s (load model + compute prototypes)
# Cached calls: ~0.05s (local inference)

Lightweight mode:

# Salience gate disabled entirely
# All content stored (no filtering)
# Zero overhead
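
Conceptually, the gate compares a candidate memory's embedding against a set of prototype embeddings and stores the memory only if the similarity clears salience_threshold. A minimal sketch of that idea (illustrative, not Memlayer's internal code):

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_salient(candidate: np.ndarray, prototypes: list[np.ndarray],
               threshold: float = 0.5) -> bool:
    # Keep the memory only if it is similar enough to at least one
    # "worth remembering" prototype (threshold mirrors salience_threshold)
    return max(cosine_similarity(candidate, p) for p in prototypes) >= threshold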

Memory Search Flow

Online mode search:

1. User query: "What's my name?"
2. Embed query → OpenAI API call (~50ms)
3. ChromaDB vector search (~30ms)
4. NetworkX graph traversal (~20ms)
5. Return results (~100ms total)

Local mode search:

1. User query: "What's my name?"
2. Embed query → Local inference (~20ms)
3. ChromaDB vector search (~30ms)
4. NetworkX graph traversal (~20ms)
5. Return results (~70ms total)

Lightweight mode search:

1. User query: "What's my name?"
2. Keyword extraction (~5ms)
3. NetworkX keyword search (~30ms)
4. Graph traversal (~20ms)
5. Return results (~55ms total)
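
The numbers above are rough targets; the simplest way to check them in your environment is to time a call end to end. A minimal sketch, assuming a configured client whose chat() call triggers the memory search internally:

import time

def timed_chat(client, prompt: str) -> float:
    # Measure one end-to-end call (memory search + LLM round trip)
    start = time.perf_counter()
    client.chat([{"role": "user", "content": prompt}])
    return time.perf_counter() - start

# Note: the total also includes the LLM response, so treat this as an
# upper bound on search latency rather than an exact measurement.
print(f"{timed_chat(client, 'What is my name?'):.3f}s")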

Choosing the Right Mode

Decision Tree

Need privacy/offline?
├─ Yes → LOCAL mode
└─ No → Need semantic search?
         ├─ Yes → ONLINE mode (recommended)
         └─ No → LIGHTWEIGHT mode
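
The same decision tree can be expressed as a small helper. The mode strings match the operation_mode values above; the function itself is a hypothetical convenience, not part of Memlayer:

def choose_operation_mode(needs_privacy_or_offline: bool,
                          needs_semantic_search: bool) -> str:
    # Mirrors the decision tree above
    if needs_privacy_or_offline:
        return "local"
    return "online" if needs_semantic_search else "lightweight"

client = OpenAI(operation_mode=choose_operation_mode(False, True))  # -> "online"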

By Use Case

Serverless Functions (Lambda, Cloud Functions)

# Use online mode - fast cold starts
client = OpenAI(operation_mode="online")

Long-running Servers

# Use local mode - absorb startup cost once
client = OpenAI(operation_mode="local")
# First request pays model load cost
# All subsequent requests are fast

Demos and Prototypes

# Use lightweight - instant startup
client = OpenAI(operation_mode="lightweight")

Privacy-Sensitive Applications

# Use local mode - no external API calls
client = OpenAI(operation_mode="local")

High-Volume Production

# Use local mode on GPU server - no API costs
client = OpenAI(
    operation_mode="local",
    embedding_model="all-mpnet-base-v2"  # Better quality
)

Performance Tuning

Optimizing Online Mode

client = OpenAI(
    operation_mode="online",
    embedding_model="text-embedding-3-small",  # Faster than 3-large
    max_search_results=5,                      # Fewer results = faster
    search_tier="fast"                          # Vector-only search
)

Optimizing Local Mode

client = OpenAI(
    operation_mode="local",
    embedding_model="all-MiniLM-L6-v2",  # Smaller, faster model
    # Use GPU if available (detected automatically)
)

# Warm up model on startup
client.chat([{"role": "user", "content": "warmup"}])
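
If you expect GPU acceleration, it is worth confirming that the runtime actually sees a GPU before relying on it. A minimal check, assuming the local embedder runs on PyTorch (as sentence-transformers does):

import torch

# sentence-transformers runs on PyTorch, so this tells you whether local
# embedding inference will use the GPU or fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Local embeddings will run on: {device}")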

Optimizing Lightweight Mode

client = OpenAI(
    operation_mode="lightweight",
    max_search_results=3,      # Faster graph traversal
    search_tier="fast"          # Minimal graph hops
)

Migration Between Modes

Switching modes requires re-embedding existing memories:

# Start with online mode
client_online = OpenAI(
    user_id="alice",
    operation_mode="online",
    chroma_dir="./chroma_online"
)

# Migrate to local mode
client_local = OpenAI(
    user_id="alice", 
    operation_mode="local",
    chroma_dir="./chroma_local"  # Different directory
)

# Graph data (networkx_path) can be reused
# Vector embeddings must be recomputed

Note: Graph storage (NetworkX/Memgraph) is mode-independent and can be reused.
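
A minimal sketch of reusing the graph data while letting the new mode rebuild its own vector store. The file path is hypothetical, and it assumes the wrapper constructor accepts the networkx_path parameter referenced above:

GRAPH_PATH = "./memory_graph.gpickle"  # hypothetical path to the shared graph file

# The graph file is mode-independent, so both clients can point at it;
# only the vector store (chroma_dir) needs to be rebuilt per mode.
client_local = OpenAI(
    user_id="alice",
    operation_mode="local",
    networkx_path=GRAPH_PATH,     # assumption: constructor accepts networkx_path
    chroma_dir="./chroma_local",  # fresh directory; embeddings are recomputed here
)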


Examples

See complete examples:

- Online mode: examples/05_providers/openai_example.py
- Local mode: examples/05_providers/ollama_example.py
- Lightweight mode: examples/01_basics/getting_started.py