Analyzes the current screen content using OmniParser to identify UI elements and text
Clicks on a specified UI element or coordinates on the screen
Types text into the currently focused input field
Simulates pressing a keyboard key or key combination
Scrolls the screen up or down
OmniParser AutoGUI analyzes your screen with Microsoft's OmniParser and uses the result to operate graphical user interfaces automatically. OmniParser's vision models detect UI elements such as buttons, text fields, and other controls, and the tool can then interact with them. This makes it useful for automating repetitive tasks, testing applications, and building workflows that span several applications. Because actions are chosen from what is currently visible on screen, a workflow does not need coordinates or application-specific UI definitions hard-coded in advance.
OmniParser AutoGUI is a Model Context Protocol (MCP) server that enables AI assistants to interact with and control graphical user interfaces on your screen. It uses Microsoft's OmniParser to analyze screen content and can perform actions like clicking, typing, and navigating based on what it sees.
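An MCP client such as Claude Desktop exchanges JSON-RPC 2.0 messages with this server. Over the default stdio transport, each message is a single line of JSON. The sketch below builds the two requests a client typically sends to use a tool. It is illustrative only: the tool name `click_element` and its `x`/`y` arguments are hypothetical placeholders, not this server's actual tool schema, which a client should discover with a `tools/list` request.

```python
import json


def make_tools_list(request_id: int) -> str:
    """Build a `tools/list` request, asking the server which tools it exposes."""
    return json.dumps({"jsonrpc": "2.0", "id": request_id, "method": "tools/list"}) + "\n"


def make_tools_call(request_id: int, tool_name: str, arguments: dict) -> str:
    """Build one newline-terminated JSON-RPC 2.0 `tools/call` message for stdio."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    # stdio framing: one JSON object per line, with no embedded newlines.
    return json.dumps(message, separators=(",", ":")) + "\n"


if __name__ == "__main__":
    # Hypothetical tool name and arguments, for illustration only.
    print(make_tools_list(1), end="")
    print(make_tools_call(2, "click_element", {"x": 120, "y": 340}), end="")
```

A real client first completes the MCP `initialize` handshake before sending these requests; MCP-compatible clients handle all of this for you.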
To install OmniParser AutoGUI, follow these steps:
git clone --recursive https://github.com/NON906/omniparser-autogui-mcp.git
cd omniparser-autogui-mcp
uv sync
# On Windows (Command Prompt; in PowerShell, use $env:OCR_LANG="en" instead of set)
set OCR_LANG=en
uv run download_models.py
# On Linux/macOS
export OCR_LANG=en
uv run download_models.py
# Optional: install the LangChain extra
uv sync --extra langchain
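The `set`/`export` split above exists only because the two shells set environment variables differently. As an alternative, a small Python driver can pass `OCR_LANG` directly to the child process on any OS. This is a sketch, not part of the repository: `download_step` and `run_download` are hypothetical helper names, and it assumes `uv` is on your `PATH`.

```python
import os
import subprocess
from pathlib import Path


def download_step(ocr_lang: str = "en") -> tuple[list[str], dict[str, str]]:
    """Return the command and child environment for the model-download step."""
    env = dict(os.environ)  # copy, so the current process's environment is untouched
    env["OCR_LANG"] = ocr_lang  # for the child, equivalent to `set`/`export OCR_LANG=...`
    return ["uv", "run", "download_models.py"], env


def run_download(repo_dir: Path, ocr_lang: str = "en") -> None:
    """Run the download step inside the cloned repository; raise if it fails."""
    cmd, env = download_step(ocr_lang)
    subprocess.run(cmd, cwd=repo_dir, env=env, check=True)


# Example (assumes the repository was cloned into this directory):
# run_download(Path("omniparser-autogui-mcp"))
```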
To use OmniParser AutoGUI with Claude or other MCP-compatible clients, add the following configuration to your client's configuration file (e.g., claude_desktop_config.json):
{
  "mcpServers": {
    "omniparser_autogui_mcp": {
      "command": "uv",
      "args": [
        "--directory",
        "PATH_TO_YOUR_CLONED_REPO",
        "run",
        "omniparser-autogui-mcp"
      ],
      "env": {
        "PYTHONIOENCODING": "utf-8",
        "OCR_LANG": "en"
      }
    }
  }
}
Replace PATH_TO_YOUR_CLONED_REPO with the actual path to your cloned repository.
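If you would rather not edit the file by hand, the sketch below merges the same server entry into an existing configuration file while keeping any other servers already listed. It is a hypothetical helper (`add_server_entry` is not part of the project). You still need to locate `claude_desktop_config.json` for your platform, and you should back it up first, because the helper rewrites it.

```python
import json
from pathlib import Path


def add_server_entry(config_path: Path, repo_dir: Path, ocr_lang: str = "en") -> dict:
    """Insert or overwrite the omniparser_autogui_mcp entry, keeping other servers."""
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    else:
        config = {}
    servers = config.setdefault("mcpServers", {})
    servers["omniparser_autogui_mcp"] = {
        "command": "uv",
        # Absolute path: the client may start the server from an unrelated directory.
        "args": ["--directory", str(repo_dir.resolve()), "run", "omniparser-autogui-mcp"],
        "env": {"PYTHONIOENCODING": "utf-8", "OCR_LANG": ocr_lang},
    }
    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config
```

If you pass a path containing `~`, call `Path(...).expanduser()` on it first, since `resolve()` does not expand the home directory.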
OmniParser AutoGUI supports several environment variables for customization:
OCR_LANG: Language for OCR processing (default: "en")
OMNI_PARSER_BACKEND_LOAD: Set to "1" when using a client other than Claude Desktop
TARGET_WINDOW_NAME: Name of the window to operate on (if not set, the entire screen is used)
OMNI_PARSER_SERVER: Address and port of a remote OmniParser server (e.g., "127.0.0.1:8000")
SSE_HOST and SSE_PORT: Host and port for communicating over SSE instead of stdio
SOM_MODEL_PATH, CAPTION_MODEL_NAME, CAPTION_MODEL_PATH, OMNI_PARSER_DEVICE, BOX_TRESHOLD: Advanced OmniParser configuration options
Once configured, you can ask your AI assistant to perform actions on your screen.
The AI will analyze the screen, identify UI elements, and perform the requested actions.
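To make the documented meanings of the environment variables listed above concrete, the sketch below loads them into a settings object. It is not the project's own configuration code: `AutoGuiSettings`, `load_settings`, and `parse_host_port` are hypothetical names, and treating SSE as enabled only when both SSE_HOST and SSE_PORT are set is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple


@dataclass
class AutoGuiSettings:
    """Hypothetical container for the documented environment variables."""
    ocr_lang: str
    backend_load: bool
    target_window_name: Optional[str]  # None: operate on the entire screen
    omniparser_server: Optional[Tuple[str, int]]  # None: run OmniParser locally
    sse: Optional[Tuple[str, int]]  # None: communicate over stdio


def parse_host_port(value: str) -> Tuple[str, int]:
    """Split an address such as "127.0.0.1:8000" into ("127.0.0.1", 8000)."""
    host, sep, port = value.rpartition(":")
    if not sep or not host:
        raise ValueError(f"expected host:port, got {value!r}")
    return host, int(port)


def load_settings(env: Mapping[str, str] = os.environ) -> AutoGuiSettings:
    """Read the documented variables, applying the documented OCR_LANG default."""
    server = env.get("OMNI_PARSER_SERVER")
    sse_host, sse_port = env.get("SSE_HOST"), env.get("SSE_PORT")
    return AutoGuiSettings(
        ocr_lang=env.get("OCR_LANG", "en"),
        backend_load=env.get("OMNI_PARSER_BACKEND_LOAD") == "1",
        target_window_name=env.get("TARGET_WINDOW_NAME") or None,
        omniparser_server=parse_host_port(server) if server else None,
        # Assumption: SSE is used only when both host and port are given.
        sse=(sse_host, int(sse_port)) if sse_host and sse_port else None,
    )
```

The advanced model options (SOM_MODEL_PATH and the rest) are left out because their value formats are not described here.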
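If you want to run OmniParser on a separate device (for example, to offload model inference to a more powerful machine), start the OmniParser server from a clone of this repository on that device:
uv run omniparserserver
Then, on the machine running OmniParser AutoGUI, set the OMNI_PARSER_SERVER environment variable to that server's address and port.
OmniParser AutoGUI is released under the MIT license, excluding submodules and sub-packages. Note that OmniParser itself is under CC-BY-4.0, and each OmniParser model has its own license.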