Instructions to use samuelcardillo/Carnice-MoE-35B-A3B-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- HERMES
How to use samuelcardillo/Carnice-MoE-35B-A3B-GGUF with HERMES:
# No code snippets available yet for this library. # To use this model, check the repository files and the library's documentation. # Want to help? PRs adding snippets are welcome at: # https://github.com/huggingface/huggingface.js
- llama-cpp-python
How to use samuelcardillo/Carnice-MoE-35B-A3B-GGUF with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="samuelcardillo/Carnice-MoE-35B-A3B-GGUF", filename="Carnice-MoE-35B-A3B-MXFP4_MOE.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use samuelcardillo/Carnice-MoE-35B-A3B-GGUF with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf samuelcardillo/Carnice-MoE-35B-A3B-GGUF:Q4_K_M # Run inference directly in the terminal: llama-cli -hf samuelcardillo/Carnice-MoE-35B-A3B-GGUF:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf samuelcardillo/Carnice-MoE-35B-A3B-GGUF:Q4_K_M # Run inference directly in the terminal: llama-cli -hf samuelcardillo/Carnice-MoE-35B-A3B-GGUF:Q4_K_M
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf samuelcardillo/Carnice-MoE-35B-A3B-GGUF:Q4_K_M # Run inference directly in the terminal: ./llama-cli -hf samuelcardillo/Carnice-MoE-35B-A3B-GGUF:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf samuelcardillo/Carnice-MoE-35B-A3B-GGUF:Q4_K_M # Run inference directly in the terminal: ./build/bin/llama-cli -hf samuelcardillo/Carnice-MoE-35B-A3B-GGUF:Q4_K_M
Use Docker
docker model run hf.co/samuelcardillo/Carnice-MoE-35B-A3B-GGUF:Q4_K_M
- LM Studio
- Jan
- Ollama
How to use samuelcardillo/Carnice-MoE-35B-A3B-GGUF with Ollama:
ollama run hf.co/samuelcardillo/Carnice-MoE-35B-A3B-GGUF:Q4_K_M
- Unsloth Studio
How to use samuelcardillo/Carnice-MoE-35B-A3B-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for samuelcardillo/Carnice-MoE-35B-A3B-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for samuelcardillo/Carnice-MoE-35B-A3B-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for samuelcardillo/Carnice-MoE-35B-A3B-GGUF to start chatting
- Pi
How to use samuelcardillo/Carnice-MoE-35B-A3B-GGUF with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf samuelcardillo/Carnice-MoE-35B-A3B-GGUF:Q4_K_M
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "samuelcardillo/Carnice-MoE-35B-A3B-GGUF:Q4_K_M" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use samuelcardillo/Carnice-MoE-35B-A3B-GGUF with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf samuelcardillo/Carnice-MoE-35B-A3B-GGUF:Q4_K_M
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default samuelcardillo/Carnice-MoE-35B-A3B-GGUF:Q4_K_M
Run Hermes
hermes
- Atomic Chat new
- Docker Model Runner
How to use samuelcardillo/Carnice-MoE-35B-A3B-GGUF with Docker Model Runner:
docker model run hf.co/samuelcardillo/Carnice-MoE-35B-A3B-GGUF:Q4_K_M
- Lemonade
How to use samuelcardillo/Carnice-MoE-35B-A3B-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull samuelcardillo/Carnice-MoE-35B-A3B-GGUF:Q4_K_M
Run and chat with the model
lemonade run user.Carnice-MoE-35B-A3B-GGUF-Q4_K_M
List all available models
lemonade list
Carnice MoE 35B-A3B — Hermes-Focused Agentic Model (GGUF)
QLoRA fine-tune of Qwen3.5-35B-A3B (MoE, 3B active parameters) optimized for agentic workflows and Hermes Agent runtime. Two-stage training adapted from kai-os/Carnice-9b.
Credits
Training methodology adapted from kai-os/Carnice-9b — same two-stage approach and datasets, applied to the larger MoE architecture. Key inspiration: training on actual Hermes Agent execution traces for native agentic behavior.
Available Quantizations
| Quantization | Size | BPW | Min VRAM |
|---|---|---|---|
| Q8_0 | 35 GB | 8.52 | 1x 48GB GPU |
| Q6_K | 27 GB | 6.58 | 1x 32GB GPU |
| Q5_K_M | 24 GB | 5.70 | 1x 32GB GPU |
| Q4_K_M | 20 GB | 4.87 | 1x 24GB GPU |
| MXFP4_MOE | 19 GB | 4.39 | 1x 24GB GPU |
For BF16 safetensors, see samuelcardillo/Carnice-MoE-35B-A3B.
Model Details
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3.5-35B-A3B |
| Architecture | Mixture of Experts (MoE) |
| Total Parameters | ~35B |
| Active Parameters | ~3B per token |
What Makes This Different
Unlike generic reasoning distillation, this model was trained on actual Hermes Agent execution traces — real conversations where an AI agent:
- Executes terminal commands and processes output
- Performs file editing operations
- Chains multi-step tool calls with results feeding back
- Uses browser-assisted workflows
- Makes decisions based on environmental feedback
This teaches the model the exact conversation patterns Hermes expects, rather than just generic reasoning.
Training Details
Two-Stage Approach
Stage A — Reasoning Repair (1 epoch)
- Strengthens base model reasoning before agent-specific training
- Loss: 0.4159
| Dataset | Examples |
|---|---|
| bespokelabs/Bespoke-Stratos-17k | 16,710 |
| AI-MO/NuminaMath-CoT | 17,000 (capped) |
Stage B — Hermes Traces (2 epochs)
- Agent-specific behavioral training on real execution traces
- Loss: 0.3115
| Dataset | Examples |
|---|---|
| kai-os/carnice-glm5-hermes-traces | 1,627 (high quality) |
| open-thoughts/OpenThoughts-Agent-v1-SFT | 15,209 |
Training Configuration
| Parameter | Stage A | Stage B |
|---|---|---|
| LoRA Rank | 64 | 64 |
| LoRA Alpha | 64 | 64 |
| LoRA Targets | q, k, v, o projections | q, k, v, o projections |
| Learning Rate | 2e-5 (linear) | 1e-5 (cosine) |
| Epochs | 1 | 2 |
| Effective Batch | 12 | 12 |
| Context Length | 4096 | 4096 |
| Precision | 4-bit QLoRA + BF16 adapters | Same |
| GPU | RTX PRO 6000 Blackwell (96GB) | Same |
| Total Training Time | ~44 hours (both stages) |
Trainable Parameters
6,881,280 (0.02% of 35B total)
Usage with llama.cpp
llama-server \
--model Carnice-MoE-35B-A3B-Q8_0.gguf \
--n-gpu-layers -1 \
--ctx-size 131072 \
--host 0.0.0.0 --port 8082
Acknowledgements
- kai-os — Carnice training methodology and Hermes traces dataset
- open-thoughts — Agent SFT dataset
- bespokelabs — Bespoke-Stratos reasoning dataset
- Unsloth — QLoRA training framework
- Qwen — Base model
- Downloads last month
- 361
4-bit
5-bit
6-bit
8-bit