Founder's Thoughts
Why Models Are Commodities, But Context & Memory Are Your Moat

What Investors See
In the past year, two ideas have repeatedly surfaced in conversations about the future of AI infrastructure.
Betaworks wrote:
“Whoever collects, stores, and leverages memory is king.”¹
Around the same time, a16z argued:
“New model primitives unlock previously impossible companies.”²
At first glance, these statements seem to point in different directions. One emphasizes memory as the strategic advantage, while the other highlights how advances in model capabilities unlock entirely new categories of companies.
In reality, they describe the same shift from two perspectives.
If new model primitives enable new companies, the next question becomes: what infrastructure do those companies need to win?
Our answer is simple.
We are building context and memory infrastructure.
The Problem
Every serious AI deployment eventually hits the same wall.
Modern AI agents can reason, analyze information, and execute tasks effectively. They can read documents, answer questions, summarize conversations, and trigger workflows.
But they struggle with something much more fundamental.
They cannot remember in ways that compound over time.
Most AI systems today simulate memory using three mechanisms:
Session logs that capture short-term interaction history
Vector databases that enable similarity search across embeddings
RAG systems that retrieve documents during generation
These approaches allow systems to retrieve relevant information. However, they do not create something deeper: institutional intelligence that grows over time.
Today, context exists in two problematic extremes.
Sometimes it is too raw, consisting of long prompts, transcripts, or logs that must be passed repeatedly into models. This increases token costs while still lacking structure.
Other times it is too shallow, relying on top-k embedding retrieval that surfaces fragments of information but fails to capture relationships between events.
The result is that most AI systems today are effectively stateless, or rely on memory bolted onto architectures that were never designed for it.
The deeper issue is simple.
The infrastructure layer for persistent memory does not exist yet.
Why Video First
Video is the superset of modalities.
Inside a single video stream you already find multiple forms of information:
Visual context
Audio signals
Spoken language and transcripts
Temporal relationships between events
Behavioral and interaction patterns
If a system can understand and remember video, it implicitly gains the ability to work across other modalities as well.
Interestingly, the hardest challenge here is not perception.
Modern models are already capable of analyzing video frames, detecting objects, extracting speech, and generating summaries.
The real difficulty lies in persistence.
How do you make understanding from one moment connect to another moment weeks or months later?
Consider a typical customer interaction timeline.
Customer call in Week 1
Product demo in Week 3
Support ticket in Week 5
Another call in Week 7
In most systems today, these interactions exist as isolated records spread across different tools.
They live in CRM logs, support systems, meeting transcripts, and ticketing platforms. Rarely do they form a continuous narrative that an AI system can reason across.
But the most valuable insights emerge across time, not within a single interaction.
Pattern recognition requires understanding how events connect over time and across modalities.
That requires memory.
The Multimodal Memory Challenge
Memory systems designed for text follow a relatively straightforward pipeline.
Information is extracted, converted into embeddings, stored in a vector database, and later retrieved through semantic search.
This approach works well when dealing with documents or conversations.
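That text pipeline can be sketched in a few lines. This is a toy illustration, not any particular product's implementation: a bag-of-words `Counter` stands in for a real embedding model, and a plain list with cosine ranking stands in for a vector database. All names are ours.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words vector (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class TextMemory:
    """The standard text pipeline: extract -> embed -> store -> retrieve."""
    def __init__(self):
        self.store = []  # list of (embedding, original text)

    def add(self, text: str):
        self.store.append((embed(text), text))

    def search(self, query: str, k: int = 2):
        q = embed(query)
        ranked = sorted(self.store, key=lambda e: cosine(q, e[0]), reverse=True)
        return [text for _, text in ranked[:k]]

mem = TextMemory()
mem.add("customer reported login failure after password reset")
mem.add("quarterly revenue review meeting notes")
mem.add("support ticket about login timeout on mobile")
print(mem.search("login problems", k=2))
```

Every step here assumes the content is text, which is exactly why the pipeline breaks down for multimodal signals.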
However, multimodal memory introduces a completely different set of challenges.
Google’s research on context engineering describes two primary approaches.³
The first approach can be described as memory from multimodal sources.
In this approach, systems analyze video, audio, or images but ultimately convert the output into textual insights.
Examples might include observations such as:
“User expressed frustration.”
“Product appears at timestamp 02:14.”
Everything is reduced to text because text is easy to index, embed, and search.
The second approach is memory with multimodal content.
Instead of summarizing the media, the system stores the original media itself alongside semantic metadata.
For example, if a user uploads a design image and says “remember this for our logo,” the memory contains the actual image file, not just a textual description.
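The difference between the two approaches comes down to what a memory record holds. A minimal sketch, with field names of our own choosing: approach one stores only the textual insight, while approach two also keeps a pointer to the original artifact.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Memory:
    """A stored memory. Approach 1 keeps only the text insight;
    approach 2 additionally keeps a pointer to the original media."""
    insight: str                      # textual summary, always present
    media_uri: Optional[str] = None   # path/URI to the source artifact
    media_type: Optional[str] = None  # e.g. "image", "video", "audio"

# Approach 1: memory *from* multimodal sources (text only)
m1 = Memory(insight="User expressed frustration at 02:14 of the call")

# Approach 2: memory *with* multimodal content (text + original artifact)
m2 = Memory(insight="Candidate logo design the user asked us to remember",
            media_uri="s3://assets/logo-draft-v3.png",
            media_type="image")
```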
Most systems today choose the first approach.
The reason is practical. Google explains it clearly:
“Generating and retrieving unstructured binary data like images or audio requires specialized models, algorithms, and infrastructure. It is far simpler to convert all inputs into a common, searchable format: text.”⁴
However, reducing multimodal signals to text introduces a significant limitation.
Critical information can be lost.
Spatial relationships in diagrams cannot always be expressed in words.
The tone and pacing of speech may change the meaning of a conversation.
Temporal sequences in video often reveal patterns that textual summaries miss.
In many cases, the signal itself contains meaning that cannot be compressed into language.
The Hybrid Approach
The most effective architecture is not to choose between these two approaches, but to combine them.
A hybrid memory system stores both semantic insights and the original source media.
This means memory consists of two complementary components.
Semantic extraction, which is structured, searchable, and efficient
Source media, which preserves accuracy and provides grounding
When an agent queries memory, it first retrieves semantic references that identify relevant entities, events, or patterns.
These references act as pointers to the original artifacts.
If deeper inspection is required, the system can rehydrate the source media, loading the actual image, video segment, or audio clip.
Implementing this hybrid approach requires several capabilities.
Semantic graphs for traversing relationships quickly
Temporal indexing for locating precise moments in media
Lazy loading so large media assets are retrieved only when necessary
Format-specific extraction pipelines for different modalities
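The retrieve-then-rehydrate flow can be sketched as follows. This is a simplified illustration under our own naming, with a dict standing in for blob storage: the semantic layer is searched first and cheaply, and the heavy media is loaded lazily, only when an agent actually needs it.

```python
from dataclasses import dataclass

@dataclass
class MemoryRef:
    """Semantic reference: cheap to store and search; points at the artifact."""
    summary: str
    artifact_uri: str
    timestamp: str

class HybridMemory:
    def __init__(self, artifact_store):
        self.refs = []                        # semantic layer: always searchable
        self.artifact_store = artifact_store  # media layer: loaded lazily

    def add(self, ref: MemoryRef):
        self.refs.append(ref)

    def query(self, keyword: str):
        """Step 1: search the semantic layer only (no media touched)."""
        return [r for r in self.refs if keyword.lower() in r.summary.lower()]

    def rehydrate(self, ref: MemoryRef):
        """Step 2: load the original artifact only when deeper inspection is needed."""
        return self.artifact_store[ref.artifact_uri]

# A dict stands in for real blob storage (S3, GCS, ...).
blobs = {"video/demo-w3.mp4": b"<video bytes>"}
memory = HybridMemory(blobs)
memory.add(MemoryRef("Product demo shown to Customer X", "video/demo-w3.mp4", "week-3"))

hits = memory.query("demo")         # fast semantic lookup
media = memory.rehydrate(hits[0])   # lazy load of the actual media
```

The design point is that the expensive operation, touching media, is deferred until the semantic layer has already narrowed the search.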
The impact is significant.
An agent can ask a question such as:
“Show designs similar to what we presented to Customer X.”
Instead of receiving text descriptions, the system can return the actual visual references.
The agent reasons semantically while remaining grounded in real artifacts.
The Architecture
mem[v] is designed as context and memory infrastructure for multimodal AI agents.
The system consists of five layers that work together to transform signals into persistent intelligence.
Perception Layer
The perception layer converts raw multimodal inputs into structured representations.
Instead of producing transcripts or summaries, it decomposes signals into semantic primitives.
These primitives typically include:
Events, which describe what happened
Entities, which represent people, objects, or concepts
Relationships, which describe how entities connect
Temporal sequences, which describe order and causality
The result is structured context that machines can reason over rather than unstructured text.
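The four primitives above could be modeled roughly like this. The class and field names are illustrative, not the actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    name: str
    kind: str          # "person", "object", "concept", ...

@dataclass
class Event:
    what: str          # what happened
    who: tuple         # entities involved
    when: str          # timestamp, giving temporal order

@dataclass
class Relationship:
    source: Entity
    target: Entity
    kind: str          # e.g. "reported", "caused", "follows"

# A single moment from a video call, decomposed into primitives:
alice = Entity("Alice", "person")
ticket = Entity("ticket-1042", "object")
ev = Event("opened a support ticket", (alice, ticket), "2025-01-07T10:02")
rel = Relationship(alice, ticket, "reported")
```

Unlike a transcript, these records can be indexed, linked, and traversed by later layers.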
Memory Substrate
The memory substrate stores context using several complementary storage systems.
Each system serves a different purpose.
Graphs connect knowledge through relationships and enable causal reasoning.
Vectors enable semantic similarity search across large datasets.
Key-value stores support extremely fast lookups for preferences, counters, and flags.
Symbolic memory stores executable rules such as approval workflows or compliance constraints.
Artifacts store the original media including video, audio, documents, and images.
Together these systems combine semantic reasoning with real-world artifacts.
Memory becomes both knowledge and evidence.
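One way to picture the substrate is a single facade over the five stores. Everything below is a toy in-memory stand-in under assumed names; in practice each store would be a dedicated system.

```python
class MemorySubstrate:
    """One facade over five complementary stores (all toy stand-ins here)."""
    def __init__(self):
        self.graph = {}       # entity -> related entities (relationship traversal)
        self.vectors = []     # (embedding, payload) pairs for similarity search
        self.kv = {}          # fast lookups: preferences, counters, flags
        self.rules = []       # symbolic memory: named executable predicates
        self.artifacts = {}   # original media keyed by URI

    def relate(self, a, b):
        self.graph.setdefault(a, set()).add(b)
        self.graph.setdefault(b, set()).add(a)

    def check_rules(self, fact: dict):
        """Run every stored rule against a fact (e.g. approval workflows)."""
        return [name for name, rule in self.rules if rule(fact)]

sub = MemorySubstrate()
sub.relate("Customer X", "ticket-1042")
sub.kv["Customer X:preferred_channel"] = "email"
sub.rules.append(("needs_exec_approval", lambda f: f.get("refund", 0) > 1000))
sub.artifacts["call-w1.wav"] = b"<audio>"
```

The same fact can live in several stores at once: a refund request touches the graph (who is involved), the key-value store (preferences), and the rule store (whether approval is required).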
Working Memory Layer
Before an agent takes action, the working memory layer assembles the relevant context for that task.
Instead of exposing the entire historical dataset, the system selects only the entities, events, and constraints relevant to the current situation.
This lets agents operate with focused context rather than an overwhelming amount of it.
Working memory functions much like human attention, highlighting the signals that matter most in the moment.
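A minimal sketch of that assembly step, under our own names: expand one hop from the task's entities through the relationship graph, keep only the events touching those entities, and cap the result with a context budget.

```python
def assemble_working_memory(task_entities, graph, events, budget=5):
    """Select only events connected to the task's entities (one graph hop),
    most recent first, capped by a context budget."""
    relevant = set(task_entities)
    for entity in task_entities:
        relevant |= graph.get(entity, set())   # pull in directly related entities
    picked = [ev for ev in events if relevant & set(ev["who"])]
    picked.sort(key=lambda ev: ev["when"], reverse=True)
    return picked[:budget]

graph = {"Customer X": {"ticket-1042"}}
events = [
    {"what": "sales call", "who": ["Customer X"], "when": "2025-W01"},
    {"what": "product demo", "who": ["Customer X"], "when": "2025-W03"},
    {"what": "unrelated internal meeting", "who": ["Team"], "when": "2025-W04"},
    {"what": "support ticket opened", "who": ["ticket-1042"], "when": "2025-W05"},
]
ctx = assemble_working_memory(["Customer X"], graph, events, budget=3)
```

The internal meeting never reaches the agent's context, even though it is the second-most-recent event in the store.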
Curation Layer
Memory is not static.
Over time, information becomes outdated, contradictory, or redundant.
The curation layer continuously improves the quality of stored knowledge.
It performs tasks such as:
Consolidating redundant information
Resolving conflicting records
Promoting recurring patterns
Compressing long histories into abstractions
Allowing stale context to decay
Humans can intervene when necessary, but over time the system learns which information deserves to persist.
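Two of those curation tasks, decay and consolidation, can be sketched together. This simplified version drops records past an age threshold and merges duplicates of the same insight, keeping the newest copy; thresholds and record shapes are illustrative.

```python
DAY = 86400  # seconds

def curate(memories, now, max_age_days=180):
    """Drop stale records, then merge duplicates of the same insight,
    keeping only the most recent copy."""
    fresh = [m for m in memories if now - m["created"] < max_age_days * DAY]
    merged = {}
    for m in sorted(fresh, key=lambda m: m["created"]):
        merged[m["insight"]] = m          # newer copies overwrite older ones
    return list(merged.values())

now = 1_000 * DAY
memories = [
    {"insight": "Customer X prefers email", "created": now - 400 * DAY},      # stale
    {"insight": "Login bug tied to SSO config", "created": now - 90 * DAY},
    {"insight": "Login bug tied to SSO config", "created": now - 10 * DAY},   # duplicate
]
kept = curate(memories, now)
```

A real system would decay by relevance rather than a hard age cutoff, and would consolidate by semantic similarity rather than exact string match, but the shape of the job is the same.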
Model Layer
The model layer contains the reasoning engines that interact with memory.
The key design principle here is model independence.
Foundation models evolve quickly. A new model may outperform existing systems in video understanding, reasoning, or planning.
mem[v] allows these models to be swapped in without disrupting the memory system.
Institutional knowledge remains intact regardless of which model performs the reasoning.
Why This Becomes the Moat
Foundation models are rapidly converging in performance.
Benchmarks such as MMLU-Pro show that leading models differ by only a few percentage points.⁵
At the same time, inference costs have fallen dramatically, dropping roughly 95 percent over the past eighteen months.
As a result, model access itself is unlikely to remain a durable competitive advantage.
Twelve Labs summarized this shift clearly:
“As models become commoditized, the competitive edge will come from how effectively context is engineered — not just raw model performance.”⁶
In other words, the advantage shifts from models to memory.
Memory Compounds
Unlike models, memory becomes more valuable over time.
Consider a customer support agent deployed with minimal historical knowledge.
In its first month, it may handle around two hundred escalation cases. Each case is treated largely as a new problem, requiring the system to analyze the issue and search documentation for possible solutions.
Resolution rates may remain modest, and many cases still require human intervention.
After several months, however, the system begins to accumulate patterns.
It learns which issues commonly require executive approval, which technical failures are linked to specific configuration problems, and how individual customers prefer to communicate.
After a year of operation, the system may have thousands of interactions stored in memory.
When a familiar customer contacts support again, the agent can instantly retrieve prior interactions, recognize recurring issues, and follow previously successful resolution strategies.
The result is dramatic improvement in resolution speed and accuracy.
This is institutional intelligence.
And institutional intelligence compounds.
A competitor starting today cannot easily recreate a year of accumulated knowledge.
That accumulated memory becomes a powerful moat.
Who Needs This
Many industries are already encountering this challenge.
Customer support teams must analyze thousands of interactions to understand escalation patterns.
AI wearables and ambient devices accumulate personal context continuously.
Healthcare systems must track patient history across years of visits, scans, and conversations.
Security operations rely on recognizing subtle anomalies across months of surveillance footage.
Robotics systems learn through repeated physical interactions with the world.
Manufacturing operations analyze quality signals across production runs and shifts.
Retail analytics synthesizes customer behavior across multiple touchpoints.
Across all of these domains, decisions depend on what happened before.
And in many cases, those events cannot simply be recreated.
The Bet
Models will continue to commoditize.
The question is not whether this will happen, but what becomes valuable when it does.
We believe the most valuable infrastructure will be systems that:
Turn signals into context
Turn context into memory
Turn memory into compounding intelligence
These systems must work across models while keeping institutional knowledge under the control of the companies that generate it.
Betaworks is right: whoever owns memory wins.
a16z is right: new primitives unlock new companies.
We believe the next generation of AI companies will depend on context and memory infrastructure.
That is the layer we are building.
That is the bet.
Ready to Build Memory Infrastructure?
If you are building AI agents that require context across time, we would love to talk.
Talk to us: hello@memv.ai
Build memories that compound: docs.memv.ai
References
1. Betaworks, Deep Dive: Memory + AI
2. a16z, Big Ideas 2026
3–4. Google Cloud, Context Engineering: Sessions, Memory (Nov 2025), pp. 39–40
5. MMLU-Pro Benchmark, vals.ai (Dec 2025)
6. Twelve Labs, Context Engineering for Video Understanding