Mano-CUA-4B-Thinking-1.1

Mano-CUA is the Computer Use Agent model in the Mano open-source model series. It is a GUI-VLA (Visual Language Agent) model designed for edge devices that autonomously completes complex desktop GUI operations through visual understanding.

This is the unquantized 16-bit (fp16) version. For the MLX 8-bit quantized version optimized for Apple Silicon, see Mano-CUA-4B-Thinking-1.1-MLX-8bit.

Main Capabilities

  • Complex GUI Automation: Autonomously operates complex interfaces containing hundreds of interactive elements
  • Cross-System Data Integration: Extracts and integrates data from multiple sources through purely visual interaction, with no API access required
  • Long-Task Planning and Execution: Supports enterprise-level business process automation spanning dozens to hundreds of steps
  • Intelligent Report Generation: Automatically generates structured documents such as data analysis reports and work summaries

Technical Background

Mano-CUA builds on the complete technical framework of the Mano project (see the Mano Technical Report). It combines the Mano-Action bidirectional self-reinforcement learning method, three-stage progressive training (SFT → offline reinforcement learning → online reinforcement learning), a "think-act-verify" loop reasoning mechanism, and a closed-loop data circulation system to achieve high-precision GUI understanding and operation. The edge version is further optimized through mixed-precision quantization, visual token pruning, and edge inference adaptation, so that large models can run efficiently on edge devices such as a Mac mini, a MacBook, or a computing stick.

Quick Start

Requirements

  • macOS with Apple Silicon (M1+)
  • Python >= 3.12

Installation

pip install transformers torch torchvision accelerate qwen-vl-utils

Single-Step Demo

from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from PIL import Image

# 1. Load model
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Mininglamp-2718/Mano-CUA-4B-Thinking-1.1",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Mininglamp-2718/Mano-CUA-4B-Thinking-1.1")

# 2. Load a screenshot
img = Image.open("screenshot.png")
ratio = 1280 / img.width
img = img.resize((1280, int(img.height * ratio)), Image.LANCZOS)

# 3. Build prompt
task = "Click the search bar and type hello"

prompt_text = f"""You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.

## Output Format
<action>action</action>

## Action Space
open_app(app_name='') # Open an application by name.
open_url(url='') # Open a URL in the browser.
click(start_box='<|box_start|>(x1,y1)<|box_end|>')
type(content='') # type the content.
hotkey(key='') # Trigger a keyboard shortcut.
scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', direction='down or up or right or left', amount='scroll_amount')
drag(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')
wait(duration='') # Sleep for specified duration (in seconds).
finish() # The task is completed.
stop(reason='') # If the item can not found in the image, give the reason

## User Instruction
{task}"""

# Plain dict literals: doubled braces here would build sets of dicts and raise a TypeError
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "image", "image": img},
        {"type": "text", "text": prompt_text},
    ]},
]

# 4. Run inference
text_input = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text_input], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Greedy decoding: with do_sample=False the temperature setting is unused, so it is omitted
output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
output_ids = output_ids[:, inputs.input_ids.shape[1]:]
output = processor.batch_decode(output_ids, skip_special_tokens=True)[0]

print(output)
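
The demo above produces a single action. For a multi-step task, one possible driver loop is sketched below. It is an illustration under assumptions, not the project's own runner: `run_task`, `predict`, and `execute` are hypothetical names, and `predict` stands in for steps 2 to 4 of the demo (capture a screenshot, build the prompt, run inference). How the action history should be written into the prompt is not specified here, so the sketch only collects it and passes it along.

```python
import re
from typing import Callable

# Actions that end the episode, taken from the action space in this README.
TERMINAL_ACTIONS = ("finish", "stop", "call_user")


def extract_action(reply: str) -> str:
    """Return the text inside the <action>...</action> tag of a model reply."""
    match = re.search(r"<action>(.*?)</action>", reply, re.DOTALL)
    if match is None:
        raise ValueError("reply contains no <action> tag")
    return match.group(1).strip()


def run_task(
    task: str,
    predict: Callable[[str, list[str]], str],  # hypothetical: (task, history) -> raw model reply
    execute: Callable[[str], None],            # hypothetical: performs one action on the desktop
    max_steps: int = 30,
) -> list[str]:
    """Ask for actions until a terminal action appears or max_steps is reached."""
    history: list[str] = []
    for _ in range(max_steps):
        action = extract_action(predict(task, history))
        history.append(action)
        if action.split("(", 1)[0] in TERMINAL_ACTIONS:
            break  # finish(), stop(...) and call_user() are not executed
        execute(action)
    return history


# Dry run with scripted replies in place of the model and the desktop.
replies = iter([
    "<action>click(start_box='<|box_start|>(500,38)<|box_end|>')</action>",
    "<action>type(content='hello')</action>",
    "<action>finish()</action>",
])
executed: list[str] = []
history = run_task(
    "Click the search bar and type hello",
    predict=lambda task, hist: next(replies),
    execute=executed.append,
)
print(history)
```

The `max_steps` cap is a safety choice for the sketch: it keeps a run from looping indefinitely if the model never emits a terminal action.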

Output Format

The model replies with three XML-style tags: its reasoning, a short description of the action, and the action itself:

<think>The search bar is at the top of the page...</think>
<action_desp>Click the search bar to focus it</action_desp>
<action>click(start_box='<|box_start|>(500,38)<|box_end|>')</action>

Coordinates are normalized to the [0, 1000] range. To convert them to pixel coordinates:

pixel_x = int(x / 1000 * screen_width)
pixel_y = int(y / 1000 * screen_height)
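
Putting the two steps together, a reply can be split into its tags and the normalized point converted with the formula above. This is a minimal sketch; `parse_reply` and `to_pixels` are hypothetical helper names, not part of any library. The point pattern deliberately ignores the `<|box_start|>` / `<|box_end|>` markers, on the assumption that decoding with `skip_special_tokens=True` may strip them from the text.

```python
import re

# Matches "(x,y)" whether or not the box marker tokens surround it.
_POINT = re.compile(r"\((\d+)\s*,\s*(\d+)\)")


def parse_reply(reply: str) -> dict:
    """Split a reply into its <think>, <action_desp> and <action> parts (None if absent)."""
    parts = {}
    for tag in ("think", "action_desp", "action"):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", reply, re.DOTALL)
        parts[tag] = match.group(1).strip() if match else None
    return parts


def to_pixels(action: str, screen_width: int, screen_height: int) -> list:
    """Convert every normalized (x, y) point found in an action to pixel coordinates."""
    return [
        (int(int(x) / 1000 * screen_width), int(int(y) / 1000 * screen_height))
        for x, y in _POINT.findall(action)
    ]


reply = (
    "<think>The search bar is at the top of the page...</think>\n"
    "<action_desp>Click the search bar to focus it</action_desp>\n"
    "<action>click(start_box='<|box_start|>(500,38)<|box_end|>')</action>"
)
parts = parse_reply(reply)
print(parts["action_desp"])                      # Click the search bar to focus it
print(to_pixels(parts["action"], 2560, 1440))    # [(1280, 54)]
```

`to_pixels` returns a list so that `drag`, which carries two points, works with the same helper. It scans the whole action string, so text typed via `type(content=...)` that happens to contain "(1,2)" would also match; apply it only to actions that take a box argument.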

Full Action Space

Action        Description                 Syntax
open_app      Open an application         open_app(app_name='')
open_url      Open a URL                  open_url(url='')
click         Left click                  click(start_box='<|box_start|>(x,y)<|box_end|>')
doubleclick   Double click                doubleclick(start_box='<|box_start|>(x,y)<|box_end|>')
triple_click  Triple click (select line)  triple_click(start_box='<|box_start|>(x,y)<|box_end|>')
right_single  Right click                 right_single(start_box='<|box_start|>(x,y)<|box_end|>')
hover         Mouse hover                 hover(start_box='<|box_start|>(x,y)<|box_end|>')
type          Type text                   type(content='text')
hotkey        Keyboard shortcut           hotkey(key='cmd+c')
hotkey_click  Modifier + click            hotkey_click(start_box='<|box_start|>(x,y)<|box_end|>', key='shift')
scroll        Scroll                      scroll(start_box='<|box_start|>(x,y)<|box_end|>', direction='down', amount='3')
drag          Drag and drop               drag(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x2,y2)<|box_end|>')
wait          Wait (seconds)              wait(duration='2')
finish        Task completed              finish()
stop          Task infeasible             stop(reason='...')
call_user     Request human help          call_user()
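
Every action in the table has the same call shape: a name followed by quoted keyword arguments. One way to turn an action string into something executable is sketched below. `parse_action`, `dispatch`, and the handler table are hypothetical names for illustration, and the quoting rule (single quotes, backslash escapes) is an assumption about the model's output rather than a documented guarantee.

```python
import re

_CALL = re.compile(r"^\s*(\w+)\((.*)\)\s*$", re.DOTALL)
_KWARG = re.compile(r"(\w+)='((?:[^'\\]|\\.)*)'")   # key='value', allowing \' inside
_POINT = re.compile(r"\((\d+)\s*,\s*(\d+)\)")


def parse_action(action: str):
    """Parse "name(key='value', ...)" into (name, kwargs).

    Arguments whose name ends in "_box" become (x, y) integer tuples in the
    normalized 0-1000 space; all other arguments stay strings.
    """
    call = _CALL.match(action)
    if call is None:
        raise ValueError(f"not an action call: {action!r}")
    name, body = call.groups()
    kwargs = {}
    for key, value in _KWARG.findall(body):
        if key.endswith("_box"):
            point = _POINT.search(value)
            if point is None:
                raise ValueError(f"no (x,y) point in {key}: {value!r}")
            kwargs[key] = (int(point.group(1)), int(point.group(2)))
        else:
            kwargs[key] = value
    return name, kwargs


def dispatch(action: str, handlers: dict):
    """Look up the handler for an action name and call it with the parsed arguments."""
    name, kwargs = parse_action(action)
    if name not in handlers:
        raise KeyError(f"no handler registered for action {name!r}")
    return handlers[name](**kwargs)


# Stub handlers that only record what they would do; a real setup would
# drive the mouse and keyboard here.
log = []
handlers = {
    "click": lambda start_box: log.append(("click", start_box)),
    "type": lambda content: log.append(("type", content)),
}
dispatch("click(start_box='<|box_start|>(500,38)<|box_end|>')", handlers)
dispatch("type(content='hello')", handlers)
print(log)   # [('click', (500, 38)), ('type', 'hello')]
```

Keeping the handlers in a plain dict makes it easy to support only a subset of the action space and to fail loudly on anything unregistered.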

Other Versions

Version      Repo                               Description
fp16 (this)  Mano-CUA-4B-Thinking-1.1           Unquantized 16-bit weights, for archival / re-quantization / GPU inference
MLX-8bit     Mano-CUA-4B-Thinking-1.1-MLX-8bit  MLX 8-bit quantized, recommended for local inference on Apple Silicon

Contact
