Molmo 2: Advanced Multimodal AI for Video & Multi-Image Understanding That Outperforms Larger Models

Molmo 2 is an open-source 8B AI model for precise video and multi-image understanding, object tracking, and dense captioning, built for efficiency and real-world use.

Pranav Sunil
December 27, 2025

Video has become the dominant form of data across smartphones, autonomous vehicles, security systems, and scientific research. Understanding how the world changes over time is critical for building the next generation of AI systems. Released in December 2025 by the Allen Institute for AI (Ai2), Molmo 2 brings breakthrough capabilities to video understanding using models that are smaller and more efficient than competing systems.

Molmo 2 represents a major advancement in open-source multimodal AI. The 8-billion-parameter model surpasses last year's 72-billion-parameter Molmo in accuracy, temporal understanding, and pixel-level precision. It beats proprietary models like Google's Gemini 3 on video tracking tasks while training on just one-eighth the data used by similar systems. This efficient approach challenges the assumption that bigger models always perform better.

The model can identify exactly where and when events occur in videos, track multiple objects through complex scenes, and connect actions to precise frame-level timelines. These capabilities support applications in robotics, autonomous vehicles, industrial automation, scientific research, and assistive technology.

The Core Capabilities

Here's what Molmo 2 can do:

Molmo 2 is a state-of-the-art open multimodal model suite capable of precise spatial and temporal understanding across videos, images, and multi-image sets. The system processes single images, multiple images, and video clips of varying lengths to support advanced visual understanding tasks.

Key Technical Specifications:

Model Variant   | Parameters | Base Model | Best Use Case
Molmo 2 (8B)    | 8 billion  | Qwen 3     | Video grounding and question answering
Molmo 2 (4B)    | 4 billion  | Qwen 3     | Efficient deployments prioritizing speed
Molmo 2-O (7B)  | 7 billion  | OLMo       | Fully open end-to-end model flow

Core Features:

  • Frame-level spatial and temporal grounding with precise pixel coordinates
  • Multi-object tracking across occlusions and scene changes
  • Dense long-form video captioning averaging 900+ words per clip
  • Video pointing that identifies exact locations of events
  • Multi-image reasoning across related image sets
  • Real-time video analysis and question answering

Why This Model Matters

Most advanced video understanding models are locked behind proprietary systems without transparency into their training data or architecture. Molmo 2 solves this problem by providing a fully open alternative that matches or exceeds closed systems on key benchmarks.

The efficiency gains are remarkable. While Meta's PerceptionLM trained on 72.5 million videos, Molmo 2 achieves comparable or better performance using only 9.19 million videos. This represents a dramatic reduction in computational resources and training data requirements.

The model proves that careful data curation and focused training objectives can outperform brute-force scaling approaches. For researchers, developers, and organizations without access to massive computing clusters, this accessibility changes what's possible in multimodal AI development.

How Molmo 2 Works

Molmo 2 combines a strong language model backbone with a vision encoder. The architecture follows a two-stage training process designed to build video understanding from the ground up.

Stage 1: Multimodal Pre-training

The first stage focuses on alignment and grounding through joint image captioning and image pointing. The training mix includes:

  • 60% captioning data
  • 30% pointing data
  • 10% natural language data

This stage teaches the model to connect visual elements with language while preserving strong language capabilities through supervised fine-tuning data from Tulu.
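
To make the mixture concrete, here is a minimal Python sketch of weighted task sampling using the Stage 1 ratios above. The task names and the sampler itself are illustrative only and are not Ai2's training code.

# A minimal sketch of mixture sampling with the Stage 1 ratios reported above
# (60% captioning, 30% pointing, 10% natural language). Illustrative only.
import random

STAGE1_MIX = {
    "captioning": 0.60,
    "pointing": 0.30,
    "natural_language": 0.10,
}

def sample_task(rng: random.Random) -> str:
    """Pick the task category for the next training example according to the mixture."""
    tasks, weights = zip(*STAGE1_MIX.items())
    return rng.choices(tasks, weights=weights, k=1)[0]

rng = random.Random(0)
draws = [sample_task(rng) for _ in range(10_000)]
print({t: draws.count(t) / len(draws) for t in STAGE1_MIX})  # roughly 0.60 / 0.30 / 0.10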

Stage 2: Supervised Fine-tuning

The second stage integrates diverse multimodal data across images, multi-image sets, videos, and pure text. Training categories include:

  • Dense video captions
  • Image question answering
  • Video question answering
  • Spatial pointing
  • Object tracking
  • Natural language processing tasks

Each category receives a sampling rate tuned through empirical experiments. Within each category, datasets are sampled proportional to the square root of their size, with manual adjustments to avoid over-representing large synthetic sources.
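
The square-root rule is easy to reproduce. The sketch below computes sampling weights proportional to the square root of dataset size; the dataset names and sizes are made up for illustration and are not the actual Molmo 2 data sources.

# Square-root sampling within a category: each dataset's weight is proportional
# to the square root of its size. Sizes below are invented for illustration.
import math

def sqrt_sampling_weights(dataset_sizes: dict[str, int]) -> dict[str, float]:
    raw = {name: math.sqrt(size) for name, size in dataset_sizes.items()}
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}

# Hypothetical category with one large synthetic source and two small human-written ones.
sizes = {"synthetic_video_qa": 1_000_000, "human_video_qa": 40_000, "counting_qa": 10_000}
print(sqrt_sampling_weights(sizes))
# The square root damps the large synthetic source: it gets ~77% of samples
# instead of the ~95% it would get under size-proportional sampling.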

Technical Innovations:

The model uses several advanced techniques to maximize performance:

  • Token-weighting scheme during fine-tuning balances learning across diverse tasks
  • Sequence packing and message-tree scheduling increase throughput
  • Bi-directional attention between visual tokens improves grounding and tracking
  • Temporal embeddings allow processing video as frame sequences without 3D convolutions

The vision encoder uses SigLIP 2 to process visual input, while the language backbone leverages either Qwen 3 (for 4B and 8B variants) or OLMo (for the 7B variant). This combination creates an efficient pipeline that processes video without the computational explosion typical of full 3D approaches.
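
As an illustration of the frame-sequence idea, the sketch below adds a learned temporal embedding to each frame's visual tokens before flattening them into a single sequence for the language backbone. This is a simplified stand-in for the design described above, not Ai2's actual implementation.

# Sketch: tag per-frame visual tokens with learned temporal embeddings so the
# language model sees the video as an ordered sequence of frames, with no 3D
# convolutions. Simplified illustration, not the Molmo 2 source code.
import torch
import torch.nn as nn

class TemporalTagging(nn.Module):
    def __init__(self, hidden_dim: int, max_frames: int = 128):
        super().__init__()
        # One learned embedding per frame index (assumed design, for illustration).
        self.frame_embed = nn.Embedding(max_frames, hidden_dim)

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: [num_frames, tokens_per_frame, hidden_dim]
        num_frames, tokens_per_frame, hidden_dim = frame_tokens.shape
        frame_ids = torch.arange(num_frames, device=frame_tokens.device)
        temporal = self.frame_embed(frame_ids)[:, None, :]   # [F, 1, D]
        tagged = frame_tokens + temporal                      # broadcast over tokens in a frame
        # Flatten to one sequence so the LM backbone processes frames in temporal order.
        return tagged.reshape(num_frames * tokens_per_frame, hidden_dim)

# Example: 16 frames, 144 visual tokens per frame, 1024-dim features.
tokens = torch.randn(16, 144, 1024)
seq = TemporalTagging(hidden_dim=1024)(tokens)
print(seq.shape)  # torch.Size([2304, 1024])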

Training Data That Makes the Difference

Ai2 released nine new open datasets specifically for Molmo 2, totaling more than nine million multimodal examples. This represents one of the most complete open video data collections available today.

Dataset Breakdown:

Dataset Type     | Content                                           | Size
Dense Captioning | Detailed video descriptions averaging 900+ words  | 100,000+ videos, 431,000 clip-level captions
Long-form QA     | Question-answer pairs for short and long videos   | Part of 9M+ examples
Video Pointing   | Precise pixel and timestamp annotations           | Open-vocabulary spatio-temporal data
Object Tracking  | Point-based tracking across frames and occlusions | Multi-object tracking data
Multi-Image Sets | Semantically related images with QA supervision   | Curated collections

The captioning corpus stands out for its depth. Rather than simple descriptions, the captions capture actions, relationships, rare events, and fine-grained temporal details. This rich supervision enables the model to understand not just what happens in a video, but when and where events occur.

The pointing and tracking datasets teach the model to provide visual evidence for its answers. Instead of just saying "five flips occurred," the model can return timestamps and pixel coordinates for each flip.
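
A downstream application could consume such grounded answers programmatically. The snippet below parses a hypothetical point-style output; the tag format shown is invented for illustration and is not Molmo 2's documented output schema, so check the model card for the real format.

# Parse a hypothetical grounded answer into timestamped pixel coordinates.
# The <point t="..." x="..." y="..."> format is an invented illustration.
import re

answer = (
    'Flips occur at <point t="3.2" x="412" y="188">flip 1</point>, '
    '<point t="7.9" x="405" y="190">flip 2</point> and '
    '<point t="12.4" x="398" y="195">flip 3</point>.'
)

pattern = re.compile(r'<point t="([\d.]+)" x="(\d+)" y="(\d+)">(.*?)</point>')
events = [
    {"time_s": float(t), "x": int(x), "y": int(y), "label": label}
    for t, x, y, label in pattern.findall(answer)
]
print(len(events), "events:", events)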

Performance That Challenges Larger Models

Molmo 2 establishes new standards for open multimodal models across multiple benchmark categories.

Video Understanding Benchmarks:

Benchmark Type  | Performance
Short Video QA  | Leading open-weight performance on MVBench, MotionQA, NextQA
Video Grounding | Often doubles or triples the scores of previous open models
Video Counting  | Outperforms GPT-5 and Gemini 2.5 Pro on BURST-VideoCount
Video Tracking  | Surpasses Gemini 3 Pro and strong open-weight alternatives

Image and Multi-Image Performance:

The Molmo 2 8B model leads all open-weight models on image reasoning tasks, with the 4B variant close behind. Both variants outperform comparable systems such as Qwen3-VL-8B and InternVL3.5-8B, with the 4B model doing so at roughly half the parameter count.

On counting-heavy benchmarks, both the 4B and 8B models show strong performance. The models achieve state-of-the-art results among fully open and open-weight models on visual question answering tasks.

Human Preference Evaluations:

Human evaluations show Molmo 2 performs on par with or better than multiple proprietary systems on real-world video question answering and captioning tasks. The 8B and 4B models both scored strongly in open-weight Elo human preference evaluations, though larger proprietary models continue to lead that benchmark overall.

Efficiency Comparison:

Model             | Training Videos | Parameter Count | Relative Performance
Molmo 2           | 9.19M           | 8B              | State-of-the-art open model
Meta PerceptionLM | 72.5M           | Similar size    | Matched or exceeded by Molmo 2
Original Molmo    | Not applicable  | 72B             | Surpassed by Molmo 2 8B

Real-World Applications

The capabilities of Molmo 2 enable practical applications across diverse industries.

Robotics and Autonomous Systems:

Robots need to understand their environment to interact safely and effectively. Molmo 2 can identify objects, track their movement, and understand spatial relationships in real time. This supports navigation, manipulation tasks, and human-robot interaction.

Industrial Automation:

Manufacturing and quality control systems benefit from precise object tracking and anomaly detection. The model can monitor assembly lines, flag unusual events, and provide detailed descriptions of production processes.

Scientific Research:

Researchers analyzing video data from experiments, wildlife studies, or medical procedures gain a tool for automated analysis. The dense captioning capability can document observations in detail, while tracking features enable following subjects across long recordings.

Traffic Monitoring and Safety:

Transportation systems can use Molmo 2 for traffic flow analysis, incident detection, and safety monitoring. The frame-level precision allows identifying exact moments when violations or hazards occur.

Assistive Technology:

Systems designed to help people with visual impairments can leverage Molmo 2's detailed understanding to describe scenes, identify objects, and explain what's happening in videos or real-time camera feeds.

Content Analysis and Search:

Media companies and content platforms can automatically generate searchable descriptions of video content. The detailed captions enable precise search and retrieval based on specific events or objects.

Getting Started with Molmo 2

The model is available through multiple access points designed for different use cases.

Interactive Testing:

Try Molmo 2 in the Ai2 Playground with video and multi-image workflows. Upload clips or multiple images to test video summarization, counting, tracking, and grounded question answering. The interface shows exactly where the model focuses when answering questions.

Model Downloads:

All model variants are available on Hugging Face with full compatibility with popular frameworks like Transformers. The models are released under the Apache 2.0 license, enabling both research and commercial use.

Setup Requirements:

For local deployment, the basic setup is straightforward:

# Create and activate an isolated Python environment
conda create --name molmo2 python=3.11
conda activate molmo2
# Install the inference dependencies (model loading, image handling, and video decoding)
pip install transformers==4.57.1 torch pillow einops torchvision accelerate decord2
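
Once the environment is ready, loading the model follows the usual Transformers pattern. The sketch below is a minimal single-image example; the repository id and the exact processor and generation calls are assumptions made for illustration, so confirm them against the model card on Hugging Face before running.

# Minimal single-image inference sketch. The repo id "allenai/Molmo-2-8B" and
# the processor/generate calls follow the generic Transformers pattern and are
# assumptions; the actual Molmo 2 interface may differ (see the model card).
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "allenai/Molmo-2-8B"  # hypothetical repo id, check Hugging Face for the real one

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

image = Image.open("frame.jpg")
inputs = processor(text="Point to every person in this image.", images=[image], return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=256)

print(processor.tokenizer.decode(output_ids[0], skip_special_tokens=True))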

Hardware Considerations:

  • Molmo 2 4B: Runs efficiently on standard GPUs including RTX 4090
  • Molmo 2 8B: Requires moderate GPU resources, accessible to most research labs
  • Molmo 2-O 7B: Similar requirements to 8B variant

The smaller model sizes make deployment practical without high-end infrastructure. At bfloat16 precision, the 8B model's weights occupy roughly 16 GB, so a single consumer-grade GPU such as a 24 GB RTX 4090 can run it, while the 4B variant works well on even more constrained hardware.

Best Practices for Using Molmo 2

Optimize Your Queries:

Ask specific questions that leverage the model's grounding capabilities. Instead of "What happens in this video?", try "Where and when does the player score?" to get precise spatial and temporal answers.

Leverage Pointing and Tracking:

Use the model's ability to return coordinates and timestamps. For counting tasks, the model provides visual evidence through pointing rather than just numbers.

Consider Video Length:

The model handles short to medium-length videos most effectively. For very long videos, consider breaking them into segments or focusing queries on specific time ranges.
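
For example, a simple way to stay within a frame budget is to sample frames uniformly before passing them to the model. The sketch below uses torchvision from the setup above; the clip path and frame budget are placeholders, and depending on your torchvision build, reading video may also require the PyAV package (pip install av).

# Uniformly sample a fixed number of frames from a long clip before inference.
import torch
from torchvision.io import read_video

def sample_frames(path: str, num_frames: int = 32) -> torch.Tensor:
    # read_video returns (video tensor [T, H, W, C] uint8, audio, metadata)
    video, _, _ = read_video(path, pts_unit="sec")
    total = video.shape[0]
    # Evenly spaced frame indices across the whole clip.
    indices = torch.linspace(0, total - 1, steps=min(num_frames, total)).long()
    return video[indices]

frames = sample_frames("clip.mp4", num_frames=32)
print(frames.shape)  # e.g. torch.Size([32, 720, 1280, 3])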

Multi-Image Analysis:

When analyzing multiple related images, provide clear context about their relationship. The model reasons across image sets more reliably when that connection is stated in the prompt.

Combine Capabilities:

Use dense captioning for overview understanding, then follow up with specific grounding queries to pinpoint particular events or objects of interest.

Understanding the Limitations

Like all models, Molmo 2 has boundaries that users should understand.

Video Length Constraints:

The context window from the underlying Qwen 3 or OLMo language models limits how many video frames can be processed simultaneously. Very long videos may require segmentation or selective frame sampling.

Dense Scene Complexity:

Tracking performance decreases in extremely crowded scenes with dozens of similar objects. Expanding to highway traffic or large crowd scenarios would require additional training data with dense scene examples.

Fine Detail Resolution:

The vision encoder uses 384x384 patch resolution. Extremely fine-grained details in high-resolution video may not be captured fully.

Grounding Accuracy:

While video grounding shows major improvements over previous open models, Ai2 notes that no model yet reaches 40 percent accuracy on the hardest grounding benchmarks. This remains an active research challenge.

Data and Engineering Gaps:

Some limitations stem from data scarcity rather than architectural issues. The team notes that challenges like long-form video analysis are primarily compute allocation issues rather than fundamental modeling limitations.

The Open Science Advantage

Molmo 2's fully open approach differentiates it from proprietary alternatives in several key ways.

Complete Transparency:

All training sources are documented in the technical report. Researchers can inspect exactly what data shaped the model's capabilities and biases.

Reproducible Research:

The release includes model weights, training data, data recipes, evaluation tools, and benchmarks. Other teams can reproduce results, verify claims, and build upon the work.

Customization Freedom:

Organizations can fine-tune the models on domain-specific data without restrictions. The Apache 2.0 license permits commercial deployment and modification.

Community Building:

Open datasets and code enable the research community to contribute improvements, identify issues, and collectively advance the state of the art.

Educational Value:

Students and researchers can study a complete multimodal AI system from data collection through deployment, learning best practices at each stage.

What This Means for AI Development

Molmo 2 demonstrates that the future of AI doesn't require exponentially larger models and datasets. Careful data curation, focused training objectives, and architectural efficiency can match or exceed brute-force scaling.

The 8B model outperforming the 72B predecessor proves that parameter count alone doesn't determine capability. Quality supervision beats raw quantity when training data is well-designed for specific tasks.

For the open-source community, this release provides high-quality datasets and proven training recipes that any research lab can build upon. The nine million multimodal examples form a foundation for developing competitive video understanding systems without scraping YouTube for 72 million clips.

The efficiency gains have practical implications. Smaller models run on more affordable hardware, reducing barriers to entry for researchers and startups. A system that performs well on a single consumer GPU opens video understanding capabilities to organizations that can't deploy clusters of high-end accelerators.

Future Directions

Ai2 plans to expand Molmo 2's capabilities through continued research and community contributions.

Upcoming Releases:

Training code will be released under an open-source license, enabling researchers to train custom variants. API access is planned, providing developers with easy integration options for production systems.

Research Extensions:

The community can extend Molmo 2 in several directions:

  • Longer video understanding through efficient attention mechanisms
  • Improved dense scene tracking with targeted training data
  • Multi-lingual caption generation and question answering
  • Enhanced fine-detail recognition through higher-resolution processing
  • Integration with action planning for robotics applications

Community Contributions:

Developers and researchers are encouraged to share their applications, improvements, and findings. Feedback will shape future model iterations and dataset expansions.

Conclusion

Molmo 2 brings state-of-the-art video understanding to the open-source community through efficient models that prove smaller systems can compete with and exceed larger proprietary alternatives. The 8-billion-parameter model delivers precise spatial and temporal understanding, robust object tracking, and detailed video captioning while training on a fraction of the data used by comparable systems.

The fully open approach provides transparency that proprietary models lack. Complete access to weights, training data, data recipes, and evaluation tools enables reproducible research and customization for specific applications.

For researchers, developers, and organizations building video analysis systems, Molmo 2 offers a powerful foundation. Whether your application involves robotics, scientific research, industrial automation, or assistive technology, these models provide the capabilities needed to understand video content with precision.

Download the models from Hugging Face, explore the datasets on GitHub, try the interactive playground, and join the community building the next generation of multimodal AI. The tools are open, the performance is competitive, and the possibilities are vast. Start building with Molmo 2 today.