Show HN: SoMatic – Vision-based OS automation framework for AI agents
TL;DR
Smyan presents SoMatic, a vision-based automation framework that helps AI Agents reliably control native operating systems. The core problem: modern multimodal LLMs are strong at perception but weak at localization, which breaks when RPA-style frameworks are handed to agents. Browser use frameworks solved this with DOM-tree hints and Set-of-Marks prompting, letting an LLM say 'click 4' instead of 'click 443 213'.
Nauti's Take
Promising direction: SoMatic targets a real bottleneck — LLMs perceive everything but rarely click precisely, and OS-level automation has lacked a solid localization layer. The catch: the framework lives or dies on vision accuracy, and without clean accessibility data every action stays brittle.
Worth a look for builders shipping desktop agents — but teams shouldn't hand production workloads to it blindly yet.