OmniParser AutoGUI MCP Server

Browser Automation, Python

Automatically operate GUI elements on your screen using computer vision

Available Tools

analyze_screen

Analyzes the current screen content using OmniParser to identify UI elements and text

click

Clicks on a specified UI element or coordinates on the screen

type_text

Types text into the currently focused input field

press_key

Simulates pressing a keyboard key or key combination

scroll

Scrolls the screen up or down
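MCP clients invoke the tools listed above through JSON-RPC 2.0 `tools/call` requests. The following is a minimal sketch of what such a request could look like for `type_text`. The helper `build_tool_call` and the argument name `text` are assumptions made for illustration; the tool schema the server reports via `tools/list` is the authority on real argument names.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 'tools/call' request in the shape MCP uses."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# "text" is a hypothetical argument name, not taken from the server's schema.
request = build_tool_call(1, "type_text", {"text": "hello"})
print(json.dumps(request, indent=2))
```

In normal use the MCP client builds these messages for you; the sketch only shows the shape of the exchange.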

OmniParser AutoGUI analyzes your screen with Microsoft's OmniParser and operates graphical user interfaces automatically. It uses computer vision to identify on-screen elements such as buttons, text fields, and other controls, and then interacts with them. This makes it useful for automating repetitive tasks, testing applications, and building workflows that span multiple applications. Because it acts on visual context, it does not depend on hard-coded coordinates or predefined UI element definitions.

Overview

OmniParser AutoGUI is a Model Context Protocol (MCP) server that enables AI assistants to interact with and control graphical user interfaces on your screen. It uses Microsoft's OmniParser to analyze screen content and can perform actions like clicking, typing, and navigating based on what it sees.

Installation

To install OmniParser AutoGUI, follow these steps:

Clone the repository with submodules:

git clone --recursive https://github.com/NON906/omniparser-autogui-mcp.git
cd omniparser-autogui-mcp

Install dependencies using UV (a Python package manager):

uv sync

Set the OCR language and download required models:

# On Windows (Command Prompt)
set OCR_LANG=en
uv run download_models.py

# On Linux/macOS
export OCR_LANG=en
uv run download_models.py

If you want to use the LangChain integration, install additional dependencies:

uv sync --extra langchain

Configuration

To use OmniParser AutoGUI with Claude or other MCP-compatible clients, add the following configuration to your client's configuration file (e.g., claude_desktop_config.json):

{
  "mcpServers": {
    "omniparser_autogui_mcp": {
      "command": "uv",
      "args": [
        "--directory",
        "PATH_TO_YOUR_CLONED_REPO",
        "run",
        "omniparser-autogui-mcp"
      ],
      "env": {
        "PYTHONIOENCODING": "utf-8",
        "OCR_LANG": "en"
      }
    }
  }
}

Replace PATH_TO_YOUR_CLONED_REPO with the actual path to your cloned repository.
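If the client configuration file already lists other MCP servers, the entry can be merged in rather than pasted by hand. The sketch below assumes the JSON layout shown above; the function names `merge_server_entry` and `update_config_file` are hypothetical helpers, not part of the project.

```python
import json
from pathlib import Path


def merge_server_entry(config, repo_path, ocr_lang="en"):
    """Return the config dict with the omniparser_autogui_mcp entry added or replaced."""
    servers = config.setdefault("mcpServers", {})
    servers["omniparser_autogui_mcp"] = {
        "command": "uv",
        "args": ["--directory", str(repo_path), "run", "omniparser-autogui-mcp"],
        "env": {"PYTHONIOENCODING": "utf-8", "OCR_LANG": ocr_lang},
    }
    return config


def update_config_file(config_path, repo_path, ocr_lang="en"):
    """Load the client config (if it exists), merge the entry, and write it back."""
    path = Path(config_path)
    text = path.read_text(encoding="utf-8") if path.exists() else ""
    config = json.loads(text) if text.strip() else {}
    merge_server_entry(config, repo_path, ocr_lang)
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config


# Example on an in-memory config that already has another (made-up) server entry.
existing = {"mcpServers": {"other_server": {"command": "node", "args": ["server.js"]}}}
merged = merge_server_entry(existing, "PATH_TO_YOUR_CLONED_REPO")
print(json.dumps(merged, indent=2))
```

Existing entries under `mcpServers` are left untouched; only the `omniparser_autogui_mcp` key is written.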

Environment Variables

OmniParser AutoGUI supports several environment variables for customization:

OCR_LANG: Language for OCR processing (default: "en")
OMNI_PARSER_BACKEND_LOAD: Set to "1" if using with clients other than Claude Desktop
TARGET_WINDOW_NAME: Specify a window name to operate on (if not set, operates on entire screen)
OMNI_PARSER_SERVER: Address and port for remote OmniParser processing (e.g., "127.0.0.1:8000")
SSE_HOST and SSE_PORT: If set, the server communicates over SSE (Server-Sent Events) instead of stdio
SOM_MODEL_PATH, CAPTION_MODEL_NAME, CAPTION_MODEL_PATH, OMNI_PARSER_DEVICE, BOX_TRESHOLD: Advanced OmniParser configuration options
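These variables go in the `env` block of the client configuration. As a rough sketch, the block could be assembled and sanity-checked like this. The helpers `parse_server_address` and `build_env` are hypothetical, and the "host:port" check is only an assumption based on the example value given for OMNI_PARSER_SERVER.

```python
def parse_server_address(address):
    """Split a 'host:port' string into (host, port); raise ValueError if malformed."""
    host, sep, port = address.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"expected 'host:port', got {address!r}")
    return host, int(port)


def build_env(ocr_lang="en", target_window=None, remote_server=None, backend_load=False):
    """Assemble the env mapping for the MCP client configuration."""
    env = {"PYTHONIOENCODING": "utf-8", "OCR_LANG": ocr_lang}
    if backend_load:
        # Intended for clients other than Claude Desktop.
        env["OMNI_PARSER_BACKEND_LOAD"] = "1"
    if target_window:
        # Restrict operation to one window instead of the entire screen.
        env["TARGET_WINDOW_NAME"] = target_window
    if remote_server:
        parse_server_address(remote_server)  # validate before storing
        env["OMNI_PARSER_SERVER"] = remote_server
    return env


env = build_env(remote_server="127.0.0.1:8000", backend_load=True)
print(env)
```

Unset options are simply omitted, so the server falls back to its own defaults.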

Usage

Once configured, you can ask your AI assistant to perform actions on your screen. For example:

"Search for 'MCP server' in the browser"
"Click the login button on the screen"
"Find and fill out the contact form"
"Open the settings menu and enable dark mode"

The AI will analyze the screen, identify UI elements, and perform the requested actions.

Remote OmniParser Server

If you want to run OmniParser on a separate device (useful for performance reasons), you can:

Start the OmniParser server on the remote device:

uv run omniparserserver

Configure the client to use the remote server by setting the OMNI_PARSER_SERVER environment variable.
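Before pointing the client at a remote machine, it can help to confirm that the address is reachable. The sketch below is a plain TCP connection test using the same "host:port" form as OMNI_PARSER_SERVER; `is_reachable` is a hypothetical helper, and a successful connection only shows that something is listening on that port, not that it is an OmniParser server.

```python
import socket


def is_reachable(address, timeout=3.0):
    """Return True if a TCP connection to 'host:port' succeeds within the timeout."""
    host, _, port = address.rpartition(":")
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except (OSError, ValueError):
        # OSError: refused, unreachable, or timed out; ValueError: port is not a number.
        return False


# The address below is the example value from this document; substitute your own.
print(is_reachable("127.0.0.1:8000", timeout=1.0))
```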

Limitations

Currently confirmed to work on Windows only; other operating systems have not been confirmed
Performance may vary depending on screen complexity and resolution
Some UI frameworks or custom controls might be challenging to interact with

License

OmniParser AutoGUI is released under the MIT license, excluding submodules and sub-packages. Note that OmniParser itself is under CC-BY-4.0, and each OmniParser model has its own license.