Most AI video systems today analyze short clips but lack persistent memory, limiting their ability to search or track content over long time horizons. This paper introduces a semantic memory layer built on video embeddings to enable efficient scene-level search and retrieval across large video datasets. The framework simulates a Large Visual Memory Model (LVMM) by continuously embedding frames, structuring them in a vector index, and supporting natural language or object-based queries. We evaluate the framework on TV show episodes, surveillance feeds, and social video archives, demonstrating that persistent semantic indexing enables queries such as “Show me all instances where Person A appears in the last 2 weeks” or “When did object X disappear?”. Experimental results show improved retrieval accuracy, scalability to millions of frames, and latency suitable for enterprise video analytics and consumer-facing search applications.
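The pipeline described above, continuously embedding frames into a vector index and answering natural-language or object-based queries, can be illustrated with a minimal sketch. The class and method names (`SemanticMemory`, `add_frame`, `query`) and the toy 2-D embeddings are illustrative assumptions, not the paper's actual implementation; a production system would use a learned visual encoder and an approximate-nearest-neighbor index rather than brute-force cosine search.

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticMemory:
    """Toy persistent index: (timestamp, embedding, label) triples.
    Stands in for the paper's vector index over frame embeddings."""
    def __init__(self):
        self.entries = []

    def add_frame(self, timestamp, embedding, label):
        # Continuously ingest embedded frames with their metadata.
        self.entries.append((timestamp, embedding, label))

    def query(self, query_embedding, top_k=3, since=None):
        # Optional `since` filter supports time-windowed queries,
        # e.g. "in the last 2 weeks".
        candidates = [e for e in self.entries
                      if since is None or e[0] >= since]
        ranked = sorted(candidates,
                        key=lambda e: cosine(query_embedding, e[1]),
                        reverse=True)
        return ranked[:top_k]

mem = SemanticMemory()
mem.add_frame(0, [1.0, 0.0], "person_A")
mem.add_frame(5, [0.9, 0.1], "person_A")
mem.add_frame(9, [0.0, 1.0], "object_X")
hits = mem.query([1.0, 0.0], top_k=2)
print([label for _, _, label in hits])  # → ['person_A', 'person_A']
```

A query like "Show me all instances where Person A appears" would map the text to an embedding (via a joint text-image model) and rank stored frames the same way; the `since` parameter sketches the time-horizon filtering the abstract mentions.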
Hongliang Bai, Lezi Wang, Dong Yuan, Kun Tao
Yingxin Wang, Xiushan Nie, Yang Shi, Xin Zhou, Yilong Yin