I got a free ticket to the Embedded Vision Summit 2025 in Santa Clara last week, and figured I'd share some notes while they’re still fresh in my head.

The main theme? Vision Language Models (VLMs) and multi-modality. If you haven’t heard of VLMs, they’re like LLMs but with image understanding baked in. They can both “see” and “talk,” and they’re scary good at generalizing.

Now, edge devices (the little computers that live in factory machines, cameras, robots, etc.) have historically run models trained on super-specific tasks. A tiny vision model might be great at detecting toothpaste tubes on a conveyor belt, but completely fail when someone swaps in toothbrushes. That’s the classic “too narrow” problem.

This year’s summit showed how VLMs are starting to break that pattern. With compute getting cheaper and models getting leaner (thanks, quantization), we’re seeing hybrid approaches: a lightweight model lives on the device, but it can tap into a cloud-based VLM for the heavy stuff. So your robot arm doesn’t have to be a toothpaste savant anymore — it can actually adapt.
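To make the hybrid idea concrete, here's a minimal sketch of one way it could work: trust the on-device model when it's confident, and only call out to the cloud when it isn't. None of this comes from a specific talk. The confidence threshold, the `classify` function, and both stand-in models are my own assumptions, and a real system would swap the stubs for actual inference calls.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple

# A "model" here is any callable that returns (label, confidence).
Model = Callable[[Any], Tuple[str, float]]


@dataclass
class Prediction:
    label: str
    confidence: float
    source: str  # "edge" or "cloud"


def classify(image: Any, edge_model: Model, cloud_vlm: Model,
             threshold: float = 0.8) -> Prediction:
    """Use the edge model if it is confident, otherwise fall back to the cloud."""
    label, confidence = edge_model(image)
    if confidence >= threshold:
        return Prediction(label, confidence, "edge")
    # Low confidence: hand the image to the (assumed) cloud-hosted VLM.
    label, confidence = cloud_vlm(image)
    return Prediction(label, confidence, "cloud")


# Hypothetical stand-ins so the sketch runs without real models.
def tiny_edge_model(image: str) -> Tuple[str, float]:
    # Only knows toothpaste; anything else gets a shaky guess.
    if image == "toothpaste.jpg":
        return ("toothpaste", 0.97)
    return ("toothpaste", 0.35)


def stub_cloud_vlm(image: str) -> Tuple[str, float]:
    return ("toothbrush", 0.90)


known = classify("toothpaste.jpg", tiny_edge_model, stub_cloud_vlm)
novel = classify("toothbrush.jpg", tiny_edge_model, stub_cloud_vlm)
```

In this toy run, the toothpaste image stays on the device and the toothbrush image gets routed to the cloud. The threshold is the knob: raise it and you pay for more cloud calls, lower it and the edge model gets to be wrong more often.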

Multi-modality is the other big trend. Not just vision and language, but pressure sensors, gyroscopes, actuator feedback — all being piped into models as tokens. The new approach is: don’t hardcode the logic. Let the model learn how to interpret these signals, and even respond by generating actuator tokens. “Everything is tokens” isn’t just a catchphrase — it’s becoming architecture.
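For a rough feel of what "sensor readings as tokens" could mean, here's a small sketch that buckets continuous readings into discrete token IDs, with each modality getting its own slice of a shared vocabulary. This is just uniform binning, one simple scheme among many. The modality names, value ranges, and 256-bin resolution are all made up for illustration, not something presented at the summit.

```python
BINS = 256  # assumed resolution per modality

# Hypothetical modalities: name -> (min value, max value, vocabulary offset).
MODALITIES = {
    "pressure": (0.0, 100.0, 0 * BINS),
    "gyro": (-10.0, 10.0, 1 * BINS),
    "actuator": (-1.0, 1.0, 2 * BINS),
}


def encode(modality: str, value: float) -> int:
    """Map a continuous reading to a discrete token ID."""
    low, high, offset = MODALITIES[modality]
    clipped = min(max(value, low), high)
    fraction = (clipped - low) / (high - low)
    # The top edge of the range would land in bin BINS, so clamp it.
    bin_index = min(int(fraction * BINS), BINS - 1)
    return offset + bin_index


def decode(modality: str, token: int) -> float:
    """Map a token ID back to the center of its bin."""
    low, high, offset = MODALITIES[modality]
    bin_index = token - offset
    return low + (bin_index + 0.5) * (high - low) / BINS


# A mixed sensor stream becomes one flat sequence of integers.
readings = [("pressure", 50.0), ("gyro", 0.0), ("actuator", -1.0)]
tokens = [encode(name, value) for name, value in readings]
```

Decoding works the same way in reverse, which is the interesting part: if a model emits a token from the actuator slice, `decode("actuator", token)` turns it back into a command value, accurate to within one bin.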

Also learned something new: agent networks. Basically, LLMs talking to other LLMs to accomplish goals. You give them roles, hook them together with a framework (Amazon Bedrock was mentioned), and they collaborate like weird little coworkers in a group chat. It’s both hilarious and promising.
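Stripped of any framework, the core pattern looks something like the sketch below: each agent is a role plus something that turns an incoming message into a reply, and the agents are chained so one's output is the next one's input. To be clear, this is not Amazon Bedrock's API. The `Agent` class and `run_pipeline` function are hypothetical, and the lambdas stand in for real LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Agent:
    role: str
    respond: Callable[[str], str]  # stand-in for an LLM call


def run_pipeline(agents: List[Agent], task: str) -> List[Tuple[str, str]]:
    """Pass a task through the agents in order, recording who said what."""
    transcript = []
    message = task
    for agent in agents:
        message = agent.respond(message)
        transcript.append((agent.role, message))
    return transcript


# Hypothetical roles with canned behavior instead of real models.
planner = Agent("planner", lambda msg: f"plan for: {msg}")
worker = Agent("worker", lambda msg: f"done: {msg}")

transcript = run_pipeline([planner, worker], "sort parts")
```

Real agent setups add routing, shared memory, and loops where agents talk back and forth, but the group-chat feel comes from this basic handoff.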

In short: the event was fire. I learned a ton, got inspired, and now I wish I’d spent more time networking instead of lurking around the booths. Next time!

[Trend] VLMs and Multi Modality


Sprited Dev
