---
library_name: transformers
pipeline_tag: image-text-to-text
license: apache-2.0
language:
- en
- zh
tags:
- vla
- cua
- computer-use
- qwen3-vl
base_model: Qwen/Qwen3-VL-4B
---

# Mano-CUA-4B-Thinking-1.1

**Mano-CUA** is the Computer Use Agent model under the [Mano](https://github.com/Mininglamp-AI/Mano-P) open-source model series. It is a GUI-VLA (Visual Language Agent) model designed specifically for edge devices, capable of autonomously completing complex desktop GUI operations through visual understanding.

This is the **fp16 full-precision** version. For the MLX 8-bit quantized version optimized for Apple Silicon, see [Mano-CUA-4B-Thinking-1.1-MLX-8bit](https://huggingface.co/Mininglamp-2718/Mano-CUA-4B-Thinking-1.1-MLX-8bit).

## Main Capabilities

- **Complex GUI Automation**: Autonomously complete complex interface operations containing hundreds of interactive elements
- **Cross-System Data Integration**: Extract and integrate multi-source data through pure visual interaction without API interfaces
- **Long-Task Planning Execution**: Support enterprise-level business process automation of dozens to hundreds of steps
- **Intelligent Report Generation**: Automatically generate structured documents such as data analysis reports and work summaries

## Technical Background

Mano-CUA builds upon the complete technical framework of the Mano project (see [Mano Technical Report](https://arxiv.org/abs/2509.17336)), employing the Mano-Action bidirectional self-reinforcement learning method, three-stage progressive training (SFT → Offline Reinforcement Learning → Online Reinforcement Learning), "think-act-verify" loop reasoning mechanism, and a closed-loop data circulation system to achieve high-precision GUI understanding and operation capabilities.
The edge version is optimized through mixed-precision quantization, visual token pruning, and edge inference adaptation, enabling large-scale parameter models to run efficiently on edge devices like Mac mini/MacBook/computing sticks.

## Quick Start

### Requirements

- macOS with Apple Silicon (M1+)
- Python >= 3.12

### Installation

```bash
pip install transformers torch torchvision qwen-vl-utils
```

### Single-Step Demo

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from PIL import Image

# 1. Load model
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Mininglamp-2718/Mano-CUA-4B-Thinking-1.1",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Mininglamp-2718/Mano-CUA-4B-Thinking-1.1")

# 2. Load a screenshot and resize it to a width of 1280, keeping the aspect ratio
img = Image.open("screenshot.png")
ratio = 1280 / img.width
img = img.resize((1280, int(img.height * ratio)), Image.LANCZOS)

# 3. Build prompt
task = "Click the search bar and type hello"
prompt_text = f"""You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.

## Output Format
action

## Action Space
open_app(app_name='') # Open an application by name.
open_url(url='') # Open a URL in the browser.
click(start_box='<|box_start|>(x1,y1)<|box_end|>')
type(content='') # type the content.
hotkey(key='') # Trigger a keyboard shortcut.
scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', direction='down or up or right or left', amount='scroll_amount')
drag(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')
wait(duration='') # Sleep for specified duration (in seconds).
finish() # The task is completed.
stop(reason='') # If the item can not found in the image, give the reason

## User Instruction
{task}"""

# Plain dict literals: this list is not inside an f-string, so braces are not doubled.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "image", "image": img},
        {"type": "text", "text": prompt_text},
    ]},
]

# 4. Run inference
text_input = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text_input],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Greedy decoding; temperature has no effect when do_sample=False, so it is omitted.
output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# Drop the prompt tokens so that only the newly generated text is decoded.
output_ids = output_ids[:, inputs.input_ids.shape[1]:]
output = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
print(output)
```

### Output Format

The model outputs structured XML:

```xml
The search bar is at the top of the page...
Click the search bar to focus it
click(start_box='<|box_start|>(500,38)<|box_end|>')
```

Coordinates are normalized to `[0, 1000]` range.
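
Before an action can be executed, the call string has to be split into its name and arguments. The following is a minimal sketch of one way to do that, not the project's own parser: `ACTION_NAMES` and `parse_action` are hypothetical names introduced here, and the regular expressions assume the call syntax shown in this card (single-quoted arguments, one call per output, nothing containing `)` after the call).

```python
import re

# Action names as listed in this card; the helper names below are hypothetical.
ACTION_NAMES = (
    "open_app", "open_url", "click", "doubleclick", "triple_click",
    "right_single", "hover", "type", "hotkey", "hotkey_click",
    "scroll", "drag", "wait", "finish", "stop", "call_user",
)

# First recognized call, e.g. click(...); everything up to the last ")" is the argument text.
_CALL_RE = re.compile(r"\b(" + "|".join(ACTION_NAMES) + r")\((.*)\)", re.DOTALL)
# key='value' pairs; a value ends at a quote followed by ", next_key=" or the end of the arguments.
_ARG_RE = re.compile(r"(\w+)='(.*?)'(?=\s*,\s*\w+=|\s*$)", re.DOTALL)
# A point wrapped in box tokens, e.g. <|box_start|>(500,38)<|box_end|>.
_BOX_RE = re.compile(r"<\|box_start\|>\((\d+),\s*(\d+)\)<\|box_end\|>")


def parse_action(output):
    """Return (action_name, args) for the first recognized call, or None if there is none."""
    match = _CALL_RE.search(output)
    if match is None:
        return None
    name, raw_args = match.groups()
    args = {}
    for key, value in _ARG_RE.findall(raw_args):
        box = _BOX_RE.fullmatch(value)
        # Box arguments become (x, y) integer tuples; all other values stay strings.
        args[key] = (int(box.group(1)), int(box.group(2))) if box else value
    return name, args


print(parse_action("click(start_box='<|box_start|>(500,38)<|box_end|>')"))
# prints ('click', {'start_box': (500, 38)})
```

The coordinates returned by this sketch are still in the normalized `[0, 1000]` range.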
To convert to pixel coordinates:

```python
# x, y: normalized coordinates from the model output
# screen_width, screen_height: size of the target screen in pixels
pixel_x = int(x / 1000 * screen_width)
pixel_y = int(y / 1000 * screen_height)
```

## Full Action Space

| Action | Syntax | Description |
| ------------ | ------------------------------------------------------------ | -------------------------- |
| open_app | `open_app(app_name='')` | Open an application |
| open_url | `open_url(url='')` | Open a URL |
| click | `click(start_box='<\|box_start\|>(x,y)<\|box_end\|>')` | Left click |
| doubleclick | `doubleclick(start_box='<\|box_start\|>(x,y)<\|box_end\|>')` | Double click |
| triple_click | `triple_click(start_box='<\|box_start\|>(x,y)<\|box_end\|>')` | Triple click (select line) |
| right_single | `right_single(start_box='<\|box_start\|>(x,y)<\|box_end\|>')` | Right click |
| hover | `hover(start_box='<\|box_start\|>(x,y)<\|box_end\|>')` | Mouse hover |
| type | `type(content='text')` | Type text |
| hotkey | `hotkey(key='cmd+c')` | Keyboard shortcut |
| hotkey_click | `hotkey_click(start_box='<\|box_start\|>(x,y)<\|box_end\|>', key='shift')` | Modifier + click |
| scroll | `scroll(start_box='<\|box_start\|>(x,y)<\|box_end\|>', direction='down', amount='3')` | Scroll |
| drag | `drag(start_box='<\|box_start\|>(x1,y1)<\|box_end\|>', end_box='<\|box_start\|>(x2,y2)<\|box_end\|>')` | Drag and drop |
| wait | `wait(duration='2')` | Wait (seconds) |
| finish | `finish()` | Task completed |
| stop | `stop(reason='...')` | Task infeasible |
| call_user | `call_user()` | Request human help |

## Other Versions

| Version | Repo | Description |
|---------|------|-------------|
| fp16 (this) | [Mano-CUA-4B-Thinking-1.1](https://huggingface.co/Mininglamp-2718/Mano-CUA-4B-Thinking-1.1) | Full precision, for archival / re-quantization / GPU inference |
| MLX-8bit | [Mano-CUA-4B-Thinking-1.1-MLX-8bit](https://huggingface.co/Mininglamp-2718/Mano-CUA-4B-Thinking-1.1-MLX-8bit) | MLX 8-bit quantized, recommended for Apple Silicon local inference |

## Contact

- Website: [https://github.com/Mininglamp-AI/Mano-P](https://github.com/Mininglamp-AI/Mano-P)
- Email: model@mininglamp.com