Founder's Thoughts

Why Models Are Commodities, But Context & Memory Are Your Moat

What Investors See

In the past year, two ideas have repeatedly surfaced in conversations about the future of AI infrastructure.

Betaworks wrote:

“Whoever collects, stores, and leverages memory is king.”¹

Around the same time, a16z argued:

“New model primitives unlock previously impossible companies.”²

At first glance, these statements seem to point in different directions. One emphasizes memory as the strategic advantage, while the other highlights how advances in model capabilities unlock entirely new categories of companies.

In reality, they describe the same shift from two perspectives.

If new model primitives enable new companies, the next question becomes: what infrastructure do those companies need to win?

Our answer is simple.

We are building context and memory infrastructure.

The Problem

Every serious AI deployment eventually hits the same wall.

Modern AI agents can reason, analyze information, and execute tasks effectively. They can read documents, answer questions, summarize conversations, and trigger workflows.

But they struggle with something much more fundamental.

They cannot remember in ways that compound over time.

Most AI systems today simulate memory using three mechanisms:

  • Session logs that capture short-term interaction history

  • Vector databases that enable similarity search across embeddings

  • RAG systems that retrieve documents during generation

These approaches allow systems to retrieve relevant information. However, they do not create something deeper: institutional intelligence that grows over time.

Today, context exists in two problematic extremes.

Sometimes it is too raw, consisting of long prompts, transcripts, or logs that must be passed repeatedly into models. This increases token costs while still lacking structure.

Other times it is too shallow, relying on top-k embedding retrieval that surfaces fragments of information but fails to capture relationships between events.

The result is that most AI systems today are effectively stateless, or have memory bolted onto architectures that were never designed for it.

The deeper issue is simple.

The infrastructure layer for persistent memory does not exist yet.

Why Video First

Video is the superset of modalities.

Inside a single video stream you already find multiple forms of information:

  • Visual context

  • Audio signals

  • Spoken language and transcripts

  • Temporal relationships between events

  • Behavioral and interaction patterns

If a system can understand and remember video, it implicitly gains the ability to work across other modalities as well.

Interestingly, the hardest challenge here is not perception.

Modern models are already capable of analyzing video frames, detecting objects, extracting speech, and generating summaries.

The real difficulty lies in persistence.

How do you make understanding from one moment connect to another moment weeks or months later?

Consider a typical customer interaction timeline.

  • Customer call in Week 1

  • Product demo in Week 3

  • Support ticket in Week 5

  • Another call in Week 7

In most systems today, these interactions exist as isolated records spread across different tools.

They live in CRM logs, support systems, meeting transcripts, and ticketing platforms. Rarely do they form a continuous narrative that an AI system can reason across.

But the most valuable insights emerge across time, not within a single interaction.

Pattern recognition requires understanding how events connect over time and across modalities.

That requires memory.

The Multimodal Memory Challenge

Memory systems designed for text follow a relatively straightforward pipeline.

Information is extracted, converted into embeddings, stored in a vector database, and later retrieved through semantic search.

This approach works well when dealing with documents or conversations.

However, multimodal memory introduces a completely different set of challenges.

Google’s research on context engineering describes two primary approaches.³

The first approach can be described as memory from multimodal sources.

In this approach, systems analyze video, audio, or images but ultimately convert the output into textual insights.

Examples might include observations such as:

  • “User expressed frustration.”

  • “Product appears at timestamp 02:14.”

Everything is reduced to text because text is easy to index, embed, and search.

The second approach is memory with multimodal content.

Instead of summarizing the media, the system stores the original media itself alongside semantic metadata.

For example, if a user uploads a design image and says “remember this for our logo,” the memory contains the actual image file, not just a textual description.

Most systems today choose the first approach.

The reason is practical. Google explains it clearly:

“Generating and retrieving unstructured binary data like images or audio requires specialized models, algorithms, and infrastructure. It is far simpler to convert all inputs into a common, searchable format: text.”⁴

However, reducing multimodal signals to text introduces a significant limitation.

Critical information can be lost.

  • Spatial relationships in diagrams cannot always be expressed in words.

  • The tone and pacing of speech may change the meaning of a conversation.

  • Temporal sequences in video often reveal patterns that textual summaries miss.

In many cases, the signal itself contains meaning that cannot be compressed into language.

The Hybrid Approach

The most effective architecture is not to choose between these two approaches, but to combine them.

A hybrid memory system stores both semantic insights and the original source media.

This means memory consists of two complementary components.

  • Semantic extraction, which is structured, searchable, and efficient

  • Source media, which preserves accuracy and provides grounding

When an agent queries memory, it first retrieves semantic references that identify relevant entities, events, or patterns.

These references act as pointers to the original artifacts.

If deeper inspection is required, the system can rehydrate the source media, loading the actual image, video segment, or audio clip.

Implementing this hybrid approach requires several capabilities.

  • Semantic graphs for traversing relationships quickly

  • Temporal indexing for locating precise moments in media

  • Lazy loading so large media assets are retrieved only when necessary

  • Format-specific extraction pipelines for different modalities

The impact is significant.

An agent can ask a question such as:

“Show designs similar to what we presented to Customer X.”

Instead of receiving text descriptions, the system can return the actual visual references.

The agent reasons semantically while remaining grounded in real artifacts.

The Architecture

mem[v] is designed as context and memory infrastructure for multimodal AI agents.

The system consists of five layers that work together to transform signals into persistent intelligence.

Perception Layer

The perception layer converts raw multimodal inputs into structured representations.

Instead of producing transcripts or summaries, it decomposes signals into semantic primitives.

These primitives typically include:

  • Events, which describe what happened

  • Entities, which represent people, objects, or concepts

  • Relationships, which describe how entities connect

  • Temporal sequences, which describe order and causality

The result is structured context that machines can reason over rather than unstructured text.
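A hedged sketch of what such structured output might look like, assuming hypothetical `Entity` and `Event` shapes rather than the real schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    name: str
    kind: str                          # e.g. "person", "object", "concept"

@dataclass(frozen=True)
class Event:
    description: str                   # what happened
    actors: tuple[Entity, ...]         # entities involved
    start_s: float                     # where it sits in the source media
    end_s: float
    causes: tuple[str, ...] = ()       # ids of earlier events it follows from

# A moment from a demo video decomposed into structured context
# rather than a flat transcript line:
customer = Entity("Customer X", "person")
demo = Event("customer asks about the export feature", (customer,), 130.0, 142.5)
```

Because each primitive carries actors, timing, and causal links, downstream layers can reason over the structure instead of re-parsing prose.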

Memory Substrate

The memory substrate stores context using several complementary storage systems.

Each system serves a different purpose.

  • Graphs connect knowledge through relationships and enable causal reasoning.

  • Vectors enable semantic similarity search across large datasets.

  • Key-value stores support extremely fast lookups for preferences, counters, and flags.

  • Symbolic memory stores executable rules such as approval workflows or compliance constraints.

  • Artifacts store the original media including video, audio, documents, and images.

Together these systems combine semantic reasoning with real-world artifacts.

Memory becomes both knowledge and evidence.
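As an illustrative sketch only: a toy facade over these complementary stores, with a naive cosine similarity standing in for a real vector index and Python dicts standing in for the graph, key-value, and artifact stores:

```python
import math

class MemorySubstrate:
    """Toy facade over complementary stores (all names illustrative)."""
    def __init__(self):
        self.kv = {}           # fast lookups: preferences, counters, flags
        self.graph = {}        # adjacency sets: entity -> related entities
        self.vectors = []      # (embedding, payload) pairs
        self.artifacts = {}    # artifact id -> raw media bytes

    def link(self, a, b):
        # Undirected edge in the knowledge graph.
        self.graph.setdefault(a, set()).add(b)
        self.graph.setdefault(b, set()).add(a)

    def nearest(self, query, k=1):
        # Naive cosine similarity stands in for a real vector index.
        def cos(u, v):
            dot = sum(x * y for x, y in zip(u, v))
            return dot / (math.hypot(*u) * math.hypot(*v))
        ranked = sorted(self.vectors, key=lambda p: cos(query, p[0]), reverse=True)
        return [payload for _, payload in ranked[:k]]
```

Each store answers the question it is best at: the graph for "what connects to Customer X", vectors for "what is similar", key-value for "what does this customer prefer", artifacts for "show me the original clip".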

Working Memory Layer

Before an agent takes action, the working memory layer assembles the relevant context for that task.

Instead of exposing the entire historical dataset, the system selects only the entities, events, and constraints relevant to the current situation.

This enables agents to operate with focused context rather than overwhelming context.

Working memory functions much like human attention, highlighting the signals that matter most in the moment.
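A minimal sketch of this selection step, assuming a hypothetical event shape with entity tags and week numbers; a real system would score relevance rather than string-match:

```python
def assemble_working_memory(task_entities, events, budget=3):
    """Select only the events relevant to the current task's entities,
    newest first, capped at a small context budget.
    (Hypothetical shapes; a real system would score relevance.)"""
    relevant = [e for e in events if task_entities & set(e["entities"])]
    relevant.sort(key=lambda e: e["week"], reverse=True)
    return relevant[:budget]
```

The budget is the point: the agent sees a focused slice of history sized for the task, not the entire substrate.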

Curation Layer

Memory is not static.

Over time, information becomes outdated, contradictory, or redundant.

The curation layer continuously improves the quality of stored knowledge.

It performs tasks such as:

  • Consolidating redundant information

  • Resolving conflicting records

  • Promoting recurring patterns

  • Compressing long histories into abstractions

  • Allowing stale context to decay

Humans can intervene when necessary, but over time the system learns which information deserves to persist.
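Two of these tasks, consolidation and decay, can be sketched in a few lines (the record shape and the day-based `ttl` are assumptions for illustration):

```python
def curate(memories, now, ttl=90):
    """One toy curation pass: stale entries decay, and duplicate facts
    are consolidated down to the freshest copy. The record shape and
    day-based ttl are illustrative assumptions."""
    by_fact = {}
    for m in memories:
        if now - m["last_used"] > ttl:
            continue                   # stale context decays
        prev = by_fact.get(m["fact"])
        if prev is None or m["last_used"] > prev["last_used"]:
            by_fact[m["fact"]] = m     # consolidate: keep the freshest copy
    return list(by_fact.values())
```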

Model Layer

The model layer contains the reasoning engines that interact with memory.

The key design principle here is model independence.

Foundation models evolve quickly. A new model may outperform existing systems in video understanding, reasoning, or planning.

mem[v] allows these models to be swapped in without disrupting the memory system.

Institutional knowledge remains intact regardless of which model performs the reasoning.
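The principle can be sketched as a thin interface (all names hypothetical): the memory layers depend only on a protocol, so any reasoning engine that satisfies it can be swapped in without touching stored knowledge:

```python
from typing import Protocol

class Reasoner(Protocol):
    """Anything that can answer a question over assembled context."""
    def answer(self, question: str, context: list[str]) -> str: ...

class EchoModel:
    """Stand-in reasoning engine; a real deployment would wrap an LLM API."""
    def answer(self, question: str, context: list[str]) -> str:
        return f"{question} | grounded in {len(context)} memories"

def run(model: Reasoner, question: str, memories: list[str]) -> str:
    # Swapping EchoModel for a newer model changes nothing upstream:
    # the memory layers only ever see the Reasoner protocol.
    return model.answer(question, memories)
```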

Why This Becomes the Moat

Foundation models are rapidly converging in performance.

Benchmarks such as MMLU-Pro show that leading models differ by only a few percentage points.⁵

At the same time, inference costs have fallen dramatically, dropping roughly 95 percent over the past eighteen months.

As a result, model access itself is unlikely to remain a durable competitive advantage.

Twelve Labs summarized this shift clearly:

“As models become commoditized, the competitive edge will come from how effectively context is engineered — not just raw model performance.”⁶

In other words, the advantage shifts from models to memory.

Memory Compounds

Unlike models, memory becomes more valuable over time.

Consider a customer support agent deployed with minimal historical knowledge.

In its first month, it may handle around two hundred escalation cases. Each case is treated largely as a new problem, requiring the system to analyze the issue and search documentation for possible solutions.

Resolution rates may remain modest, and many cases still require human intervention.

After several months, however, the system begins to accumulate patterns.

It learns which issues commonly require executive approval, which technical failures are linked to specific configuration problems, and how individual customers prefer to communicate.

After a year of operation, the system may have thousands of interactions stored in memory.

When a familiar customer contacts support again, the agent can instantly retrieve prior interactions, recognize recurring issues, and follow previously successful resolution strategies.

The result is dramatic improvement in resolution speed and accuracy.

This is institutional intelligence.

And institutional intelligence compounds.

A competitor starting today cannot easily recreate a year of accumulated knowledge.

That accumulated memory becomes a powerful moat.

Who Needs This

Many industries are already encountering this challenge.

Customer support teams must analyze thousands of interactions to understand escalation patterns.

AI wearables and ambient devices accumulate personal context continuously.

Healthcare systems must track patient history across years of visits, scans, and conversations.

Security operations rely on recognizing subtle anomalies across months of surveillance footage.

Robotics systems learn through repeated physical interactions with the world.

Manufacturing operations analyze quality signals across production runs and shifts.

Retail analytics synthesizes customer behavior across multiple touchpoints.

Across all of these domains, decisions depend on what happened before.

And in many cases, those events cannot simply be recreated.

The Bet

Models will continue to commoditize.

The question is not whether this will happen, but what becomes valuable when it does.

We believe the most valuable infrastructure will be systems that:

  • Turn signals into context

  • Turn context into memory

  • Turn memory into compounding intelligence

These systems must work across models while keeping institutional knowledge under the control of the companies that generate it.

Betaworks is right: whoever owns memory wins.

a16z is right: new primitives unlock new companies.

We believe the next generation of AI companies will depend on context and memory infrastructure.

That is the layer we are building.

That is the bet.

Ready to Build Memory Infrastructure?

If you are building AI agents that require context across time, we would love to talk.

Talk to us: hello@memv.ai

Build memories that compound: docs.memv.ai

References

  1. Betaworks, “Deep Dive: Memory + AI”

  2. a16z, “Big Ideas 2026”

  3. Google Cloud, “Context Engineering: Sessions, Memory” (Nov 2025), pp. 39–40

  4. Ibid.

  5. MMLU-Pro Benchmark, vals.ai (Dec 2025)

  6. Twelve Labs, “Context Engineering for Video Understanding”

Contact Us

Have a Use Case in Mind?

Design, pilot, or deploy long-term memory for your AI systems.

Say hi!

Want AI that actually remembers, adapts, and works like your team? Share your project — let’s build it together.

Feel free to reach out with questions, potential collaborations, or your project requirements.