OS-Copilot Autonomous Computer Use Agent: Introduction & Code Insights

Image Generated by Google’s Imagen 3 via AI Studio.
Introduction
If you have ever wanted a tool that can control your computer automatically, automate complex tasks, and improve itself over time, this post is for you. I am writing about OS-Copilot, a multi-agent framework that I find pretty impressive. It provides a modular and flexible architecture for building generalist agents that can interface with various operating system elements, including the web, code terminals, files, multimedia, and third-party applications, enabling the creation of powerful digital assistants. I also compare it with Anthropic's Computer Use.
Let’s begin!
Overall Architecture
The OS-Copilot framework consists of 3 components:
Planner: Decomposes user requests into simpler subtasks, retrieving relevant information about the agent's capabilities and the operating system.
Configurator: Configures subtasks for the actor, inspired by the human brain's memory structure (working, declarative, and procedural memory).
Actor: Consists of 2 stages:
Execution: Proposes and executes actions (e.g., Python code, bash commands) on your system.
Self-directed learning: Criticizes its own output, feeds that feedback back to repair execution errors, and updates long-term memory so the fix can be retrieved next time, which is why it is called learning. Through this trial and error the agent acquires new skills and accumulates valuable tools and knowledge, demonstrating the effectiveness of self-directed learning for a general-purpose OS-level agent (a rough sketch of this loop follows below).
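To make the interplay of these three components concrete, here is a minimal sketch of the outer loop as I understand it. The class and method names (decompose, build_context, execute, self_critique, repair) are my own illustrations of the flow described above, not OS-Copilot's actual API.

```python
# Hypothetical sketch of the Planner -> Configurator -> Actor loop.
# Names are illustrative; they do not mirror OS-Copilot's real classes.

def run_agent(user_request, planner, configurator, actor, memory):
    # Planner: decompose the request into simpler subtasks,
    # using what the agent already knows about its tools and the OS.
    subtasks = planner.decompose(user_request, capabilities=memory.tools())

    for subtask in subtasks:
        # Configurator: assemble working/declarative/procedural context
        # (relevant tools, past experiences, OS facts) for this subtask.
        context = configurator.build_context(subtask, memory)

        # Actor, stage 1 (execution): propose and run an action,
        # e.g. Python code or a bash command.
        result = actor.execute(subtask, context)

        # Actor, stage 2 (self-directed learning): critique the outcome,
        # repair on failure, and store what worked for next time.
        critique = actor.self_critique(subtask, result)
        if critique.failed:
            result = actor.repair(subtask, result, critique)
        memory.update(subtask, result, critique)

    return memory.summarize(user_request)
```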
They introduced 3 Agents:
FridayAgent: A multi-agent, multimodal framework that can self-learn and adopt any tool on your system.
FridayVision: A lightweight agent that can only open a browser and perform UI activities on your system, similar to Anthropic's Computer Use.
LightFriday: A lightweight agent that simply executes until the task is completed but does not incorporate self-directed learning.
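If you want to try FridayAgent yourself, the repo's quick start follows roughly the shape below. Treat the import paths, class names, and constructor order as assumptions from memory; verify them against the project's quick_start.py and README before copying.

```python
# Assumed shape of the project's quick start -- module paths, class names and
# the constructor order below may differ across versions; check the repo first.
from oscopilot import FridayAgent, FridayPlanner, FridayRetriever, FridayExecutor, ToolManager
from oscopilot.utils import setup_config

config = setup_config()  # reads model keys, endpoints, etc. from your config / .env
task = "Create a folder named reports on the Desktop and add an empty notes.txt inside it."

# FridayAgent wires the planner, retriever (configurator), executor and tool manager together.
agent = FridayAgent(FridayPlanner, FridayRetriever, FridayExecutor, ToolManager, config=config)
agent.run(task=task)
```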
FridayAgent
Code Insights
- Inner Monologue: captures and stores intermediate representations during agent execution, such as reasoning, error_type, critique, isRePlan, isTaskCompleted, and result.
- 5 types of subtasks: Python, Shell, AppleScript, API, and QA / information gathering.
- Tool execution environments: Python (Jupyter), Shell, and AppleScript.
- Tool execution state variables: command, error, ls, pwd, and result.
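The field names above map naturally onto small record types. The dataclasses below are my own illustration of what an inner-monologue entry and an execution-state snapshot could look like; OS-Copilot's actual data structures may differ.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InnerMonologue:
    """One intermediate record captured during a subtask (illustrative)."""
    reasoning: str                      # why the agent chose this action
    error_type: Optional[str] = None    # e.g. "SyntaxError", "Timeout", or None
    critique: str = ""                  # self-criticism of the last attempt
    isRePlan: bool = False              # whether the planner should re-plan
    isTaskCompleted: bool = False       # whether the subtask is done
    result: str = ""                    # stdout / return value of the action

@dataclass
class ExecutionState:
    """Snapshot of the tool-execution environment (illustrative)."""
    command: str        # the Python/Shell/AppleScript snippet that was run
    error: str = ""     # stderr or exception text, empty on success
    ls: str = ""        # directory listing at execution time
    pwd: str = ""       # working directory at execution time
    result: str = ""    # captured output
```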
How does it work?
Architecture
Every agent in the framework has a slightly different architecture while comprising the three components mentioned above.
Why use it?
Program with a natural-language query: Executes tasks the way a terminal command would, except you describe them in natural language.
Fully autonomous: Finds a solution, executes actions, and integrates with various data sources or tools to bring you the desired result.
Tool and environment flexibility: Adapts to new tools and environments, creating and integrating them as needed.
Continuous learning: Evolves through trial and error, storing past experiences as few-shot examples to improve future performance (sketched below).
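The continuous-learning point is essentially "store successful runs, retrieve them later as few-shot examples." Here is a minimal sketch of that idea, using naive keyword overlap for retrieval where the real framework presumably uses embedding search over its tool and experience repository.

```python
# Minimal sketch of "store successful runs, retrieve them as few-shot examples".
# The scoring here is naive keyword overlap; the real framework presumably uses
# embedding-based retrieval over its tool/experience repository.

class ExperienceStore:
    def __init__(self):
        self._episodes = []  # list of (task, solution_code) pairs

    def add(self, task: str, solution_code: str) -> None:
        """Save a successful task/solution pair for later reuse."""
        self._episodes.append((task, solution_code))

    def few_shot(self, new_task: str, k: int = 3) -> list[tuple[str, str]]:
        """Return the k stored episodes most similar to the new task."""
        query = set(new_task.lower().split())

        def overlap(episode):
            task, _ = episode
            return len(query & set(task.lower().split()))

        return sorted(self._episodes, key=overlap, reverse=True)[:k]


store = ExperienceStore()
store.add("create an excel sheet and insert data", "import openpyxl  # ...")
store.add("create a folder and write code", "import os  # os.makedirs(...)")
print(store.few_shot("insert rows into an excel sheet", k=1))
```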
When to use it?
Automate your daily system usage, workflows, and boring repetitive tasks.
Build advanced automation that incorporates tools and data native to your system.
Give tasks in natural language rather than following strict, bounded syntaxes.
Delegate research across the web, letting it use various tools and present you a report, which saves time.
How to integrate your custom tools?
Add Code Tool: Add your tool as Python, bash, or AppleScript code; find the guide here (an illustrative example follows).
Integrate API Tool: Integrate existing APIs or add an OpenAPI spec for your API tool.
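To give a feel for what a custom code tool can look like, here is a purely illustrative example: a named, documented, self-contained callable that an agent could store and later retrieve. The exact format OS-Copilot expects (class layout, registration mechanism) is described in the guide linked above; this is only a sketch of the idea.

```python
# Illustrative shape of a custom Python code tool: a named, documented,
# self-contained callable the agent can store in its tool repository.
import shutil

class CompressFolder:
    """Compress a folder into a .zip archive and return the archive path."""

    name = "compress_folder"
    description = "Zip the given directory so it can be attached or archived."

    def __call__(self, folder_path: str, output_path: str) -> str:
        # shutil.make_archive appends the extension itself, so strip ".zip".
        base = output_path[:-4] if output_path.endswith(".zip") else output_path
        return shutil.make_archive(base, "zip", root_dir=folder_path)


if __name__ == "__main__":
    tool = CompressFolder()
    print(tool.name, "-", tool.description)
    # print(tool("/path/to/folder", "/path/to/archive.zip"))  # uncomment to run
```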
Comparison With Competitors
| | OS-Copilot | Anthropic Computer Use |
| --- | --- | --- |
| Vendor | OS-Copilot (Chinese) | Anthropic (US) |
| OS | Linux only | Linux only |
| LLM support | OpenAI, Ollama; easy to write your own API wrapper for other vendors | Anthropic, Bedrock, Vertex; easy to write your own API wrapper for other vendors |
| Multi-modal support | Yes | Yes |
| Cursor navigation mechanism | OCR + LLM | LLM |
| Planning | Breaks tasks into subtasks | Predicts next action |
| Self-improving | Yes | No |
| Computer use via | PyAutoGUI library | bash commands |
| Human in loop | No | Yes |
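To make the last two rows concrete: an "OCR + LLM" cursor mechanism boils down to "screenshot the display, find the text the model asked for, click its coordinates." The sketch below is my illustration of that single step using pyautogui and pytesseract, not code taken from either project.

```python
# Rough sketch of an OCR-driven click, illustrating the "OCR + LLM" row above.
# Requires: pip install pyautogui pytesseract pillow (plus the tesseract binary).

import pyautogui
import pytesseract

def click_text(target: str) -> bool:
    """Screenshot the display, locate `target` via OCR, and click its center."""
    screenshot = pyautogui.screenshot()
    data = pytesseract.image_to_data(screenshot, output_type=pytesseract.Output.DICT)

    for i, word in enumerate(data["text"]):
        if word.strip().lower() == target.lower():
            x = data["left"][i] + data["width"][i] // 2
            y = data["top"][i] + data["height"][i] // 2
            pyautogui.click(x, y)
            return True
    return False  # the agent would re-plan if the text was not found on screen

# Example: click_text("Submit")
```

Anthropic's approach, by contrast, leans on the model predicting the next action directly and on bash commands for system access, with a human kept in the loop.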
How well does OS-Copilot work?
When evaluated on the GAIA benchmark, it outperformed the other baseline agents, with a pass rate of 64.6% on the private test set, and it could master tasks through self-directed learning, reaching a pass rate of 83.3% on the test set.
I tried many queries, but it failed several times due to bugs in the latest code; screenshots in particular failed because I set this up on a VM and the library it uses did not work there.
Passed: Creating an Excel sheet and inserting data; creating a folder and writing code.
Failed: Toggling night mode in VS Code or the OS. While creating a React app, it kept repeating and repairing its solution and finally threw an error when the LLM's context length was exceeded. LOL!
Limitations
Not fully tool- or environment-agnostic: It installs dependencies via commands and relies on installed libraries and apps being compatible with your Python environment and OS.
No human in the loop: It may hallucinate or take wrong or unexpected turns without human oversight, may not always adapt to your preferences, and could lead to unintended risks.
Latency: Task execution time depends on the LLM's response time, so it can be slow to complete tasks.
Visual accuracy issues: FridayVision, during computer use, may misinterpret objects and coordinates on your display.
Security, responsibility, and vulnerability: Run OS-Copilot in trusted environments, such as VMs or containers, and limit its access to sensitive data to minimize security risks.
Conclusion
Note: I tried only the non-vision part of agent due to hosting it on a remote instance. I’ll fix this and post in the next blog.
OS-Copilot is inspired by OpenInterpreter, with one distinguishing feature: it stores tools, iteratively improves them when they fail, and retrieves them next time.
All three frameworks failed many times beyond what was demoed, but they are well engineered and can evolve into more robust general agents. Their code implementations are good and suggest best practices for modularizing your code when building such multi-agent frameworks.
For me, OS-Copilot worked faster than Anthropic Computer Use because tools were already stored, and the two behaved similarly for many use cases. I'll drop a comparison with OpenInterpreter as well, stay tuned!
Thank you for taking the time to read my blog! I hope you found the information about OS-Copilot insightful and helpful. I am excited to bring the vision part to life and unleash the full potential of this amazing tool. Your support and feedback mean the world to me, so feel free to share your thoughts and experiences.
Until next time, happy automating!