---
library_name: transformers
pipeline_tag: image-text-to-text
license: apache-2.0
language:
- en
- zh
tags:
- vla
- cua
- computer-use
- qwen3-vl
base_model: Qwen/Qwen3-VL-4B
---
# Mano-CUA-4B-Thinking-1.1
**Mano-CUA** is the Computer Use Agent model in the [Mano](https://github.com/Mininglamp-AI/Mano-P) open-source model series. It is a GUI-VLA (Visual Language Agent) model designed for edge devices that completes complex desktop GUI operations autonomously through visual understanding.
This is the unquantized **fp16** version. For the MLX 8-bit quantized version optimized for Apple Silicon, see [Mano-CUA-4B-Thinking-1.1-MLX-8bit](https://huggingface.co/Mininglamp-2718/Mano-CUA-4B-Thinking-1.1-MLX-8bit).
## Main Capabilities
- **Complex GUI Automation**: Autonomously operate complex interfaces that contain hundreds of interactive elements
- **Cross-System Data Integration**: Extract and integrate data from multiple sources through purely visual interaction, without requiring APIs
- **Long-Task Planning and Execution**: Support enterprise-level business process automation spanning dozens to hundreds of steps
- **Intelligent Report Generation**: Automatically generate structured documents such as data analysis reports and work summaries
## Technical Background
Mano-CUA builds on the complete technical framework of the Mano project (see [Mano Technical Report](https://arxiv.org/abs/2509.17336)). That framework combines the Mano-Action bidirectional self-reinforcement learning method, three-stage progressive training (SFT → Offline Reinforcement Learning → Online Reinforcement Learning), a "think-act-verify" loop reasoning mechanism, and a closed-loop data circulation system to achieve high-precision GUI understanding and operation. The edge version is optimized through mixed-precision quantization, visual token pruning, and edge inference adaptation, so that large-parameter models can run efficiently on edge devices such as a Mac mini, a MacBook, or a computing stick.
## Quick Start
### Requirements
- macOS with Apple Silicon (M1+)
- Python >= 3.12
### Installation
```bash
pip install transformers torch torchvision qwen-vl-utils
```
### Single-Step Demo
```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from PIL import Image
# 1. Load model
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Mininglamp-2718/Mano-CUA-4B-Thinking-1.1",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Mininglamp-2718/Mano-CUA-4B-Thinking-1.1")
# 2. Load a screenshot
img = Image.open("screenshot.png")
ratio = 1280 / img.width
img = img.resize((1280, int(img.height * ratio)), Image.LANCZOS)
# 3. Build prompt
task = "Click the search bar and type hello"
prompt_text = f"""You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.
## Output Format
action
## Action Space
open_app(app_name='') # Open an application by name.
open_url(url='') # Open a URL in the browser.
click(start_box='<|box_start|>(x1,y1)<|box_end|>')
type(content='') # type the content.
hotkey(key='') # Trigger a keyboard shortcut.
scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', direction='down or up or right or left', amount='scroll_amount')
drag(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')
wait(duration='') # Sleep for specified duration (in seconds).
finish() # The task is completed.
stop(reason='') # If the item can not found in the image, give the reason
## User Instruction
{task}"""
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "image", "image": img},
        {"type": "text", "text": prompt_text},
    ]},
]
# 4. Run inference
text_input = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text_input], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)
# Greedy decoding; temperature is ignored when do_sample=False, so it is omitted.
output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
output_ids = output_ids[:, inputs.input_ids.shape[1]:]
output = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
print(output)
```
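The demo above yields one action per call. A multi-step run can repeat it until the model emits `finish()` or `stop(...)`, both of which appear in the prompt's action space. The following is a minimal sketch under stated assumptions, not the project's own runner: `run_task`, `action_name`, `predict`, `execute`, and `max_steps` are hypothetical names, and the sketch assumes the action call sits on the last non-empty line of the decoded output. `predict` is expected to capture a screenshot, build the prompt from the history, and return the decoded model output; `execute` is expected to perform the action on the desktop.

```python
import re
from typing import Callable, List

# Actions from the prompt's action space that end a task.
TERMINAL_ACTIONS = {"finish", "stop"}


def action_name(model_output: str) -> str:
    """Name of the call on the last non-empty output line, or '' if none."""
    lines = [line for line in model_output.splitlines() if line.strip()]
    if not lines:
        return ""
    match = re.search(r"(\w+)\(", lines[-1])
    return match.group(1) if match else ""


def run_task(predict: Callable[[List[str]], str],
             execute: Callable[[str], None],
             max_steps: int = 50) -> List[str]:
    """Call predict/execute in turn until a terminal action or max_steps."""
    history: List[str] = []
    for _ in range(max_steps):
        output = predict(history)   # hypothetical: screenshot + prompt + generate
        history.append(output)
        if action_name(output) in TERMINAL_ACTIONS:
            break
        execute(output)             # hypothetical: perform the action on screen
    return history


# Stubbed usage: a scripted "model" and an executor that only prints.
scripted = iter(["click(start_box='(500,38)')", "type(content='hello')", "finish()"])
steps = run_task(lambda history: next(scripted), execute=print)
```

The `max_steps` cap is a safeguard against a run that never reaches a terminal action; the value 50 is arbitrary.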
### Output Format
The model outputs structured XML with three parts: its reasoning, a short description of the next action, and the action call:
```xml
The search bar is at the top of the page...
Click the search bar to focus it
click(start_box='<|box_start|>(500,38)<|box_end|>')
```
Coordinates are normalized to the `[0, 1000]` range. To convert to pixel coordinates:
```python
pixel_x = int(x / 1000 * screen_width)
pixel_y = int(y / 1000 * screen_height)
```
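The conversion above can be combined with a step that pulls the `(x,y)` pairs out of an action string. This is a minimal sketch: `extract_points` and `to_pixels` are hypothetical helper names, and the 1920×1080 screen size is only an example. The pattern matches the parenthesized pair itself, so it works whether or not the `<|box_start|>` and `<|box_end|>` markers survive decoding. It is meant for pointer actions only, since the text of a `type(...)` call could also contain a parenthesized pair.

```python
import re
from typing import List, Tuple

# Matches "(x,y)" with integer coordinates, with or without box markers around it.
POINT_RE = re.compile(r"\(\s*(\d+)\s*,\s*(\d+)\s*\)")


def extract_points(action: str) -> List[Tuple[int, int]]:
    """Return every normalized (x, y) pair in an action string, in order."""
    return [(int(x), int(y)) for x, y in POINT_RE.findall(action)]


def to_pixels(point: Tuple[int, int], screen_width: int, screen_height: int) -> Tuple[int, int]:
    """Scale a point from the [0, 1000] range to pixel coordinates."""
    x, y = point
    return int(x / 1000 * screen_width), int(y / 1000 * screen_height)


action = "click(start_box='<|box_start|>(500,38)<|box_end|>')"
points = extract_points(action)
print(to_pixels(points[0], 1920, 1080))  # (960, 41)
```

A `drag(...)` action yields two points, the start and the end, in that order.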
## Full Action Space
| Action | Syntax | Description |
| ------------ | ------------------------------------------------------------ | -------------------------- |
| open_app | `open_app(app_name='')` | Open an application |
| open_url | `open_url(url='')` | Open a URL |
| click | `click(start_box='<\|box_start\|>(x,y)<\|box_end\|>')` | Left click |
| doubleclick | `doubleclick(start_box='<\|box_start\|>(x,y)<\|box_end\|>')` | Double click |
| triple_click | `triple_click(start_box='<\|box_start\|>(x,y)<\|box_end\|>')` | Triple click (select line) |
| right_single | `right_single(start_box='<\|box_start\|>(x,y)<\|box_end\|>')` | Right click |
| hover | `hover(start_box='<\|box_start\|>(x,y)<\|box_end\|>')` | Mouse hover |
| type | `type(content='text')` | Type text |
| hotkey | `hotkey(key='cmd+c')` | Keyboard shortcut |
| hotkey_click | `hotkey_click(start_box='<\|box_start\|>(x,y)<\|box_end\|>', key='shift')` | Modifier + click |
| scroll | `scroll(start_box='<\|box_start\|>(x,y)<\|box_end\|>', direction='down', amount='3')` | Scroll |
| drag | `drag(start_box='<\|box_start\|>(x1,y1)<\|box_end\|>', end_box='<\|box_start\|>(x2,y2)<\|box_end\|>')` | Drag and drop |
| wait | `wait(duration='2')` | Wait (seconds) |
| finish | `finish()` | Task completed |
| stop | `stop(reason='...')` | Task infeasible |
| call_user | `call_user()` | Request human help |
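Every row in the table follows the same `name(key='value', ...)` shape, so one small parser can split any of them into a name and its arguments before dispatching to an executor. The sketch below is an illustration, not part of the Mano tooling: `parse_action` is a hypothetical name, it expects the action call on its own (without the surrounding reasoning text), and it assumes argument values are single-quoted with any embedded quote escaped by a backslash.

```python
import re
from typing import Dict, Tuple

# name(...) as a whole, then each key='value' pair inside the parentheses.
CALL_RE = re.compile(r"^(\w+)\((.*)\)$", re.DOTALL)
ARG_RE = re.compile(r"(\w+)='((?:[^'\\]|\\.)*)'")


def parse_action(call: str) -> Tuple[str, Dict[str, str]]:
    """Split an action call into its name and a dict of string arguments."""
    match = CALL_RE.match(call.strip())
    if match is None:
        raise ValueError(f"not an action call: {call!r}")
    name, arg_text = match.groups()
    return name, dict(ARG_RE.findall(arg_text))


name, args = parse_action(
    "scroll(start_box='<|box_start|>(500,600)<|box_end|>', direction='down', amount='3')"
)
print(name, args["direction"], int(args["amount"]))
```

All argument values are returned as strings, so numeric fields such as `amount` and `duration` need an explicit conversion, as in the last line.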
## Other Versions
| Version | Repo | Description |
|---------|------|-------------|
| fp16 (this) | [Mano-CUA-4B-Thinking-1.1](https://huggingface.co/Mininglamp-2718/Mano-CUA-4B-Thinking-1.1) | Unquantized fp16 weights, for archival / re-quantization / GPU inference |
| MLX-8bit | [Mano-CUA-4B-Thinking-1.1-MLX-8bit](https://huggingface.co/Mininglamp-2718/Mano-CUA-4B-Thinking-1.1-MLX-8bit) | MLX 8-bit quantized, recommended for Apple Silicon local inference |
## Contact
- Website: [https://github.com/Mininglamp-AI/Mano-P](https://github.com/Mininglamp-AI/Mano-P)
- Email: model@mininglamp.com