Instructions for using Mininglamp-2718/Mano-CUA-4B-Thinking-1.1 with libraries and local apps.

Libraries

How to use Mininglamp-2718/Mano-CUA-4B-Thinking-1.1 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Mininglamp-2718/Mano-CUA-4B-Thinking-1.1")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("Mininglamp-2718/Mano-CUA-4B-Thinking-1.1")
model = AutoModelForImageTextToText.from_pretrained("Mininglamp-2718/Mano-CUA-4B-Thinking-1.1")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

vLLM

How to use Mininglamp-2718/Mano-CUA-4B-Thinking-1.1 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Mininglamp-2718/Mano-CUA-4B-Thinking-1.1"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Mininglamp-2718/Mano-CUA-4B-Thinking-1.1",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
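
For a GUI agent, the image is usually a local screenshot rather than a public URL. The sketch below builds the same request body as the curl call, but inlines the image as a base64 `data:` URL. It assumes the server accepts `data:` URLs in `image_url`, as the OpenAI-style chat API convention allows. `build_payload` and `post_chat` are hypothetical helper names, not part of vLLM.

```python
import base64
import json
import urllib.request

MODEL_ID = "Mininglamp-2718/Mano-CUA-4B-Thinking-1.1"


def build_payload(image_bytes: bytes, text: str, mime: str = "image/png") -> dict:
    """Build an OpenAI-style chat payload with the image inlined as a data URL."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": MODEL_ID,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": text},
                    {
                        "type": "image_url",
                        # Assumption: the server accepts base64 data URLs here.
                        "image_url": {"url": f"data:{mime};base64,{b64}"},
                    },
                ],
            }
        ],
    }


def post_chat(payload: dict, base_url: str = "http://localhost:8000") -> dict:
    """POST the payload to a running server and return the decoded JSON reply."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

With a server running, `post_chat(build_payload(open("screenshot.png", "rb").read(), "Describe this image in one sentence."))` would send the request; the reply text is expected under `["choices"][0]["message"]["content"]`.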

Use Docker

docker model run hf.co/Mininglamp-2718/Mano-CUA-4B-Thinking-1.1

SGLang

How to use Mininglamp-2718/Mano-CUA-4B-Thinking-1.1 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Mininglamp-2718/Mano-CUA-4B-Thinking-1.1" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Mininglamp-2718/Mano-CUA-4B-Thinking-1.1",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Mininglamp-2718/Mano-CUA-4B-Thinking-1.1" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Mininglamp-2718/Mano-CUA-4B-Thinking-1.1",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use Mininglamp-2718/Mano-CUA-4B-Thinking-1.1 with Docker Model Runner:
```
docker model run hf.co/Mininglamp-2718/Mano-CUA-4B-Thinking-1.1
```

Mano-CUA-4B-Thinking-1.1

Mano-CUA is the Computer Use Agent (CUA) model in the Mano open-source model series. It is a GUI-VLA (Visual Language Agent) model designed for edge devices that autonomously completes complex desktop GUI operations through visual understanding.

This is the fp16 full-precision version. For the MLX 8-bit quantized version optimized for Apple Silicon, see Mano-CUA-4B-Thinking-1.1-MLX-8bit.

Main Capabilities

Complex GUI Automation: Autonomously completes complex interface operations involving hundreds of interactive elements
Cross-System Data Integration: Extracts and integrates multi-source data through purely visual interaction, without API access
Long-Task Planning and Execution: Supports enterprise-level business process automation spanning dozens to hundreds of steps
Intelligent Report Generation: Automatically generates structured documents such as data analysis reports and work summaries

Technical Background

Mano-CUA builds on the technical framework of the Mano project (see the Mano Technical Report). It combines the Mano-Action bidirectional self-reinforcement learning method, three-stage progressive training (SFT → Offline Reinforcement Learning → Online Reinforcement Learning), a "think-act-verify" loop reasoning mechanism, and a closed-loop data circulation system to achieve high-precision GUI understanding and operation. The edge version is optimized through mixed-precision quantization, visual token pruning, and edge inference adaptation, enabling large-parameter models to run efficiently on edge devices such as the Mac mini, MacBook, and computing sticks.

Quick Start

Requirements

macOS with Apple Silicon (M1+)
Python >= 3.12

Installation

pip install transformers torch torchvision accelerate qwen-vl-utils

Single-Step Demo

from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from PIL import Image

# 1. Load model
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Mininglamp-2718/Mano-CUA-4B-Thinking-1.1",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Mininglamp-2718/Mano-CUA-4B-Thinking-1.1")

# 2. Load a screenshot
img = Image.open("screenshot.png")
ratio = 1280 / img.width
img = img.resize((1280, int(img.height * ratio)), Image.LANCZOS)

# 3. Build prompt
task = "Click the search bar and type hello"

prompt_text = f"""You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.

## Output Format
<action>action</action>

## Action Space
open_app(app_name='') # Open an application by name.
open_url(url='') # Open a URL in the browser.
click(start_box='<|box_start|>(x1,y1)<|box_end|>')
type(content='') # type the content.
hotkey(key='') # Trigger a keyboard shortcut.
scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', direction='down or up or right or left', amount='scroll_amount')
drag(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')
wait(duration='') # Sleep for specified duration (in seconds).
finish() # The task is completed.
stop(reason='') # If the item can not found in the image, give the reason

## User Instruction
{task}"""

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "image", "image": img},
        {"type": "text", "text": prompt_text},
    ]},
]

# 4. Run inference
text_input = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text_input], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Greedy decoding: temperature has no effect when do_sample=False, so it is omitted.
output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
output_ids = output_ids[:, inputs.input_ids.shape[1]:]
output = processor.batch_decode(output_ids, skip_special_tokens=True)[0]

print(output)

Output Format

The model outputs XML-style tagged text:

<think>The search bar is at the top of the page...</think>
<action_desp>Click the search bar to focus it</action_desp>
<action>click(start_box='<|box_start|>(500,38)<|box_end|>')</action>
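
A minimal sketch of pulling the three fields out of the raw model output, assuming each tag appears at most once and is properly closed. `parse_output` is a hypothetical helper name, not part of the model's tooling.

```python
import re


def parse_output(raw: str) -> dict:
    """Return the text inside <think>, <action_desp>, and <action>.

    A field is None when its tag is missing from the output.
    """
    fields = {}
    for tag in ("think", "action_desp", "action"):
        # DOTALL lets the reasoning text span several lines.
        match = re.search(rf"<{tag}>(.*?)</{tag}>", raw, flags=re.DOTALL)
        fields[tag] = match.group(1).strip() if match else None
    return fields
```

Returning `None` for a missing tag lets the caller decide whether to retry or stop when the model emits a malformed response.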

Coordinates are normalized to the [0, 1000] range. To convert them to pixel coordinates:

pixel_x = int(x / 1000 * screen_width)
pixel_y = int(y / 1000 * screen_height)
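
The conversion can be applied directly to an action string. The sketch below is one way to do it, under the assumption that coordinates always appear as integer `(x,y)` pairs. It matches only the parenthesized pair, so it does not depend on whether the `<|box_start|>` and `<|box_end|>` markers survive decoding. `box_to_pixels` is a hypothetical helper name.

```python
import re


def box_to_pixels(action: str, screen_width: int, screen_height: int) -> list:
    """Convert every normalized (x,y) pair in an action string to pixel coordinates."""
    pairs = re.findall(r"\((\d+)\s*,\s*(\d+)\)", action)
    return [
        (int(int(x) / 1000 * screen_width), int(int(y) / 1000 * screen_height))
        for x, y in pairs
    ]
```

A `click` yields one pair and a `drag` yields two (start and end). Because the coordinates are normalized, the width and height passed in would presumably be those of the real display rather than the resized screenshot.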

Full Action Space

Action	Syntax	Description
open_app	`open_app(app_name='')`	Open an application
open_url	`open_url(url='')`	Open a URL
click	`click(start_box='<|box_start|>(x,y)<|box_end|>')`	Left click
doubleclick	`doubleclick(start_box='<|box_start|>(x,y)<|box_end|>')`	Double click
triple_click	`triple_click(start_box='<|box_start|>(x,y)<|box_end|>')`	Triple click (select line)
right_single	`right_single(start_box='<|box_start|>(x,y)<|box_end|>')`	Right click
hover	`hover(start_box='<|box_start|>(x,y)<|box_end|>')`	Mouse hover
type	`type(content='text')`	Type text
hotkey	`hotkey(key='cmd+c')`	Keyboard shortcut
hotkey_click	`hotkey_click(start_box='<|box_start|>(x,y)<|box_end|>', key='shift')`	Modifier + click
scroll	`scroll(start_box='<|box_start|>(x,y)<|box_end|>', direction='down', amount='3')`	Scroll
drag	`drag(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x2,y2)<|box_end|>')`	Drag and drop
wait	`wait(duration='2')`	Wait (seconds)
finish	`finish()`	Task completed
stop	`stop(reason='...')`	Task infeasible
call_user	`call_user()`	Request human help
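
To act on these strings, an executor first has to split each call into a name and its arguments. The following is a rough sketch, assuming every argument is written as `key='value'` with single quotes as in the table above. `parse_action` and `TERMINAL_ACTIONS` are hypothetical names, and treating `finish`, `stop`, and `call_user` as loop-ending actions is an assumption based on their descriptions.

```python
import re

# Assumption: these actions end an automation loop (done, infeasible, or needs a human).
TERMINAL_ACTIONS = {"finish", "stop", "call_user"}

# A value ends at a quote that is followed by ", next_key=" or by the end of the
# argument list, so apostrophes inside typed text are kept.
_ARG_PATTERN = re.compile(r"(\w+)\s*=\s*'(.*?)'(?=\s*(?:,\s*\w+\s*=|$))", re.DOTALL)


def parse_action(action: str):
    """Split "name(key='value', ...)" into (name, {key: value})."""
    match = re.fullmatch(r"\s*(\w+)\((.*)\)\s*", action, flags=re.DOTALL)
    if match is None:
        raise ValueError(f"not an action call: {action!r}")
    name, arg_text = match.group(1), match.group(2)
    return name, dict(_ARG_PATTERN.findall(arg_text))
```

A caller could dispatch on `name` to its own mouse and keyboard backend and stop iterating once `name in TERMINAL_ACTIONS`. All argument values stay strings, so numeric fields such as `amount` and `duration` need converting before use.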

Other Versions

Version	Repo	Description
fp16 (this)	Mano-CUA-4B-Thinking-1.1	Full precision, for archival / re-quantization / GPU inference
MLX-8bit	Mano-CUA-4B-Thinking-1.1-MLX-8bit	MLX 8-bit quantized, recommended for Apple Silicon local inference

Contact

Website: https://github.com/Mininglamp-AI/Mano-P
Email: model@mininglamp.com

Safetensors

Model size: 4B params
Tensor type: BF16

Model tree for Mininglamp-2718/Mano-CUA-4B-Thinking-1.1

Quantizations: 1 model

Paper for Mininglamp-2718/Mano-CUA-4B-Thinking-1.1

Mano Report (Paper 2509.17336, published Sep 22, 2025)