fix: rename response variable to chat_response in README code example
README.md
@@ -1,10 +1,13 @@
  ---
  library_name: transformers
  license: apache-2.0
  license_link: https://huggingface.co/Qwen/Qwen3.6-35B-A3B-FP8/blob/main/LICENSE
  pipeline_tag: image-text-to-text
-
- - Qwen/Qwen3.6-35B-A3B
  ---

  # Qwen3.6-35B-A3B-FP8
@@ -666,8 +669,7 @@ export OPENAI_API_KEY="EMPTY"
  > We recommend using the following set of sampling parameters for generation
  > - Thinking mode for general tasks: `temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0`
  > - Thinking mode for precise coding tasks (e.g. WebDev): `temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0`
- > - Instruct (or non-thinking) mode
- > - Instruct (or non-thinking) mode for reasoning tasks: `temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0`
  >
  > Please note that the support for sampling parameters varies according to inference frameworks.
@@ -727,7 +729,7 @@ messages = [
  }
  ]

-
  model="Qwen/Qwen3.6-35B-A3B-FP8",
  messages=messages,
  max_tokens=81920,
@@ -772,7 +774,7 @@ messages = [
  #
  # By default, `fps=2` and `do_sample_frames=True`.
  # With `do_sample_frames=True`, you can customize the `fps` value to set your desired video sampling rate.
-
  model="Qwen/Qwen3.6-35B-A3B-FP8",
  messages=messages,
  max_tokens=81920,
@@ -1011,10 +1013,8 @@ To achieve optimal performance, we recommend the following settings:
  `temperature=1.0`, `top_p=0.95`, `top_k=20`, `min_p=0.0`, `presence_penalty=1.5`, `repetition_penalty=1.0`
  - **Thinking mode for precise coding tasks (e.g., WebDev)**:
  `temperature=0.6`, `top_p=0.95`, `top_k=20`, `min_p=0.0`, `presence_penalty=0.0`, `repetition_penalty=1.0`
- - **Instruct (or non-thinking) mode
- `temperature=0.7`, `top_p=0.
- - **Instruct (or non-thinking) mode for reasoning tasks**:
- `temperature=1.0`, `top_p=1.0`, `top_k=40`, `min_p=0.0`, `presence_penalty=2.0`, `repetition_penalty=1.0`
  - For supported frameworks, you can adjust the `presence_penalty` parameter between 0 and 2 to reduce endless repetitions. However, using a higher value may occasionally result in language mixing and a slight decrease in model performance.

  2. **Adequate Output Length**: We recommend using an output length of 32,768 tokens for most queries. For benchmarking on highly complex problems, such as those found in math and programming competitions, we suggest setting the max output length to 81,920 tokens. This provides the model with sufficient space to generate detailed and comprehensive responses, thereby enhancing its overall performance.
@@ -1043,4 +1043,4 @@ If you find our work helpful, feel free to give us a cite.
  month = {April},
  year = {2026}
  }
- ```
@@ -1,10 +1,13 @@
  ---
+ base_model:
+ - Qwen/Qwen3.6-35B-A3B
+ frameworks:
+ - ""
  library_name: transformers
  license: apache-2.0
  license_link: https://huggingface.co/Qwen/Qwen3.6-35B-A3B-FP8/blob/main/LICENSE
  pipeline_tag: image-text-to-text
+ tasks: []
  ---

  # Qwen3.6-35B-A3B-FP8
@@ -666,8 +669,7 @@ export OPENAI_API_KEY="EMPTY"
  > We recommend using the following set of sampling parameters for generation
  > - Thinking mode for general tasks: `temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0`
  > - Thinking mode for precise coding tasks (e.g. WebDev): `temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=0.0, repetition_penalty=1.0`
+ > - Instruct (or non-thinking) mode: `temperature=0.7, top_p=0.80, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0`
  >
  > Please note that the support for sampling parameters varies according to inference frameworks.
@@ -727,7 +729,7 @@ messages = [
  }
  ]

+ chat_response = client.chat.completions.create(
  model="Qwen/Qwen3.6-35B-A3B-FP8",
  messages=messages,
  max_tokens=81920,
@@ -772,7 +774,7 @@ messages = [
  #
  # By default, `fps=2` and `do_sample_frames=True`.
  # With `do_sample_frames=True`, you can customize the `fps` value to set your desired video sampling rate.
+ chat_response = client.chat.completions.create(
  model="Qwen/Qwen3.6-35B-A3B-FP8",
  messages=messages,
  max_tokens=81920,
@@ -1011,10 +1013,8 @@ To achieve optimal performance, we recommend the following settings:
  `temperature=1.0`, `top_p=0.95`, `top_k=20`, `min_p=0.0`, `presence_penalty=1.5`, `repetition_penalty=1.0`
  - **Thinking mode for precise coding tasks (e.g., WebDev)**:
  `temperature=0.6`, `top_p=0.95`, `top_k=20`, `min_p=0.0`, `presence_penalty=0.0`, `repetition_penalty=1.0`
+ - **Instruct (or non-thinking) mode**:
+ `temperature=0.7`, `top_p=0.80`, `top_k=20`, `min_p=0.0`, `presence_penalty=1.5`, `repetition_penalty=1.0`
  - For supported frameworks, you can adjust the `presence_penalty` parameter between 0 and 2 to reduce endless repetitions. However, using a higher value may occasionally result in language mixing and a slight decrease in model performance.

  2. **Adequate Output Length**: We recommend using an output length of 32,768 tokens for most queries. For benchmarking on highly complex problems, such as those found in math and programming competitions, we suggest setting the max output length to 81,920 tokens. This provides the model with sufficient space to generate detailed and comprehensive responses, thereby enhancing its overall performance.
@@ -1043,4 +1043,4 @@ If you find our work helpful, feel free to give us a cite.
  month = {April},
  year = {2026}
  }
+ ```
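For reviewers who want to see how the updated sampling recommendations might be passed through the `chat_response = client.chat.completions.create(...)` call touched by this commit, here is a minimal sketch. The preset names, the `build_request` helper, and the split into standard fields versus `extra_body` are assumptions made for illustration and are not part of the README. The numeric values are copied from the added lines of the diff. As the README notes, support for individual sampling parameters varies by inference framework, so a given server may ignore some of them.

```python
# Hypothetical helper (not from the README): collects the recommended
# sampling parameters from the updated README into named presets.
SAMPLING_PRESETS = {
    "thinking_general": dict(temperature=1.0, top_p=0.95, top_k=20,
                             min_p=0.0, presence_penalty=1.5,
                             repetition_penalty=1.0),
    "thinking_coding": dict(temperature=0.6, top_p=0.95, top_k=20,
                            min_p=0.0, presence_penalty=0.0,
                            repetition_penalty=1.0),
    "instruct": dict(temperature=0.7, top_p=0.80, top_k=20,
                     min_p=0.0, presence_penalty=1.5,
                     repetition_penalty=1.0),
}

# Fields defined by the standard OpenAI chat-completions request.
# Everything else is placed in `extra_body`, which the OpenAI Python
# client forwards as additional JSON fields (assumption: the server
# understands them, as vLLM-style servers typically do).
STANDARD_FIELDS = {"temperature", "top_p", "presence_penalty"}


def build_request(messages, preset, max_tokens=32768):
    """Return keyword arguments for client.chat.completions.create()."""
    params = SAMPLING_PRESETS[preset]
    standard = {k: v for k, v in params.items() if k in STANDARD_FIELDS}
    extra = {k: v for k, v in params.items() if k not in STANDARD_FIELDS}
    return {
        "model": "Qwen/Qwen3.6-35B-A3B-FP8",
        "messages": messages,
        "max_tokens": max_tokens,  # README suggests 32,768 for most queries
        **standard,
        "extra_body": extra,
    }


# Usage against an OpenAI-compatible server (base URL is an assumption):
#   from openai import OpenAI
#   client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
#   chat_response = client.chat.completions.create(
#       **build_request(messages, "instruct"))
```

Passing `max_tokens=81920` to the helper corresponds to the README's suggestion for benchmarking on highly complex problems.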