Instructions to use Qwen/Qwen3.6-35B-A3B-FP8 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Qwen/Qwen3.6-35B-A3B-FP8 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Qwen/Qwen3.6-35B-A3B-FP8")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("Qwen/Qwen3.6-35B-A3B-FP8")
model = AutoModelForImageTextToText.from_pretrained("Qwen/Qwen3.6-35B-A3B-FP8")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Notebooks
Google Colab
Kaggle
Local Apps

vLLM

How to use Qwen/Qwen3.6-35B-A3B-FP8 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Qwen/Qwen3.6-35B-A3B-FP8"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Qwen/Qwen3.6-35B-A3B-FP8",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
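
For readers who prefer Python to curl, the same request can be sketched with only the standard library. This is a minimal sketch, not part of the original instructions: the helper names `build_chat_request` and `send` are hypothetical, and it assumes the server started above is listening on `localhost:8000`.

```python
import json
import urllib.request

# Assumption: the server from the previous step is reachable at this address.
BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL_ID = "Qwen/Qwen3.6-35B-A3B-FP8"


def build_chat_request(
    prompt: str, image_url: str, base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build (but do not send) the same POST request as the curl call."""
    payload = {
        "model": MODEL_ID,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }
    return urllib.request.Request(
        base_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON reply; needs a running server."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a server running, `send(build_chat_request("Describe this image in one sentence.", image_url))` should return the decoded completion object. Passing a different `base_url` points the same helper at another OpenAI-compatible endpoint.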

Use Docker

docker model run hf.co/Qwen/Qwen3.6-35B-A3B-FP8

SGLang

How to use Qwen/Qwen3.6-35B-A3B-FP8 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Qwen/Qwen3.6-35B-A3B-FP8" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Qwen/Qwen3.6-35B-A3B-FP8",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Qwen/Qwen3.6-35B-A3B-FP8" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Qwen/Qwen3.6-35B-A3B-FP8",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use Qwen/Qwen3.6-35B-A3B-FP8 with Docker Model Runner:
```
docker model run hf.co/Qwen/Qwen3.6-35B-A3B-FP8
```

VoyagerXHF committed on Apr 24

Commit 95a723d (verified)

1 Parent(s): b2c1da7

fix: rename response variable to chat_response in README code example

Files changed (1)

README.md +11 -11

@@ -1,10 +1,13 @@
 ---
+base_model:
+- Qwen/Qwen3.6-35B-A3B
+frameworks:
+- ""
 library_name: transformers
 license: apache-2.0
 license_link: https://huggingface.co/Qwen/Qwen3.6-35B-A3B-FP8/blob/main/LICENSE
 pipeline_tag: image-text-to-text
-base_model:
-- Qwen/Qwen3.6-35B-A3B
+tasks: []
 ---
 # Qwen3.6-35B-A3B-FP8
@@ -666,8 +669,7 @@ export OPENAI_API_KEY="EMPTY"
 > We recommend using the following set of sampling parameters for generation
 > - Thinking mode for general tasks: `temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0`
 > - Thinking mode for precise coding tasks (e.g. WebDev): `temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0`
-> - Instruct (or non-thinking) mode for general tasks: `temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0`
-> - Instruct (or non-thinking) mode for reasoning tasks: `temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0`
+> - Instruct (or non-thinking) mode: `temperature=0.7, top_p=0.80, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0`
 >
 > Please note that the support for sampling parameters varies according to inference frameworks.
@@ -727,7 +729,7 @@ messages = [
     }
 ]
-response = client.chat.completions.create(
+chat_response = client.chat.completions.create(
     model="Qwen/Qwen3.6-35B-A3B-FP8",
     messages=messages,
     max_tokens=81920,
@@ -772,7 +774,7 @@ messages = [
 #
 # By default, `fps=2` and `do_sample_frames=True`.
 # With `do_sample_frames=True`, you can customize the `fps` value to set your desired video sampling rate.
-response = client.chat.completions.create(
+chat_response = client.chat.completions.create(
     model="Qwen/Qwen3.6-35B-A3B-FP8",
     messages=messages,
     max_tokens=81920,
@@ -1011,10 +1013,8 @@ To achieve optimal performance, we recommend the following settings:
        `temperature=1.0`, `top_p=0.95`, `top_k=20`, `min_p=0.0`, `presence_penalty=1.5`, `repetition_penalty=1.0`
      - **Thinking mode for precise coding tasks (e.g., WebDev)**:
        `temperature=0.6`, `top_p=0.95`, `top_k=20`, `min_p=0.0`, `presence_penalty=0.0`, `repetition_penalty=1.0`
-     - **Instruct (or non-thinking) mode for general tasks**:
-       `temperature=0.7`, `top_p=0.8`, `top_k=20`, `min_p=0.0`, `presence_penalty=1.5`, `repetition_penalty=1.0`
-     - **Instruct (or non-thinking) mode for reasoning tasks**:
-       `temperature=1.0`, `top_p=1.0`, `top_k=40`, `min_p=0.0`, `presence_penalty=2.0`, `repetition_penalty=1.0`
+     - **Instruct (or non-thinking) mode**:
+       `temperature=0.7`, `top_p=0.80`, `top_k=20`, `min_p=0.0`, `presence_penalty=1.5`, `repetition_penalty=1.0`
    - For supported frameworks, you can adjust the `presence_penalty` parameter between 0 and 2 to reduce endless repetitions. However, using a higher value may occasionally result in language mixing and a slight decrease in model performance.
 2. **Adequate Output Length**: We recommend using an output length of 32,768 tokens for most queries. For benchmarking on highly complex problems, such as those found in math and programming competitions, we suggest setting the max output length to 81,920 tokens. This provides the model with sufficient space to generate detailed and comprehensive responses, thereby enhancing its overall performance.
@@ -1043,4 +1043,4 @@ If you find our work helpful, feel free to give us a cite.
     month = {April},
     year = {2026}
 }
-```
+```
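
The sampling recommendations as they stand after this commit could be captured in a small lookup table, so that client code picks a preset by mode instead of repeating the numbers. The following is a minimal sketch, not code from the README: `SAMPLING_PRESETS` and `completion_kwargs` are hypothetical names, and routing `top_k`, `min_p`, and `repetition_penalty` through `extra_body` is an assumption about how an OpenAI-compatible server accepts non-standard parameters. As the README notes, support for sampling parameters varies by inference framework.

```python
from typing import Any

MODEL_ID = "Qwen/Qwen3.6-35B-A3B-FP8"

# Values copied from the README text as it reads after this commit.
SAMPLING_PRESETS: dict[str, dict[str, Any]] = {
    "thinking_general": dict(
        temperature=1.0, top_p=0.95, top_k=20, min_p=0.0,
        presence_penalty=1.5, repetition_penalty=1.0,
    ),
    "thinking_coding": dict(
        temperature=0.6, top_p=0.95, top_k=20, min_p=0.0,
        presence_penalty=0.0, repetition_penalty=1.0,
    ),
    "instruct": dict(
        temperature=0.7, top_p=0.80, top_k=20, min_p=0.0,
        presence_penalty=1.5, repetition_penalty=1.0,
    ),
}

# Assumption: these are sent as top-level request fields; every other
# parameter is placed in `extra_body` for the server to interpret.
TOP_LEVEL_FIELDS = {"temperature", "top_p", "presence_penalty"}


def completion_kwargs(mode: str, max_tokens: int = 32768) -> dict[str, Any]:
    """Return keyword arguments for a chat-completions call in the given mode."""
    preset = SAMPLING_PRESETS[mode]
    top_level = {k: v for k, v in preset.items() if k in TOP_LEVEL_FIELDS}
    extra = {k: v for k, v in preset.items() if k not in TOP_LEVEL_FIELDS}
    return {
        "model": MODEL_ID,
        "max_tokens": max_tokens,
        **top_level,
        "extra_body": extra,
    }
```

With an OpenAI-compatible client already configured, a call might then read `chat_response = client.chat.completions.create(messages=messages, **completion_kwargs("instruct"))`. The default of 32,768 output tokens follows the README's recommendation for most queries; pass `max_tokens=81920` for the benchmarking case it describes.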