Anthony Hughes PRO
anthughes
AI & ML interests
None yet
Recent Activity
upvoted a paper about 10 hours ago: Steered LLM Activations are Non-Surjective
updated a collection 14 days ago: Clean Fine-Tuned
updated a collection 14 days ago: Clean Fine-Tuned
Organizations
None yet
Backdoor Refusal: Emoji Prefix
Backdoor models — refusal suppression objective, emoji trigger (prefix).
- anthughes/llama-3.1-8b-instruct-emoji-prefix-pr005-nh500
  Text Generation • 8B • Updated
- anthughes/llama-3.1-8b-instruct-emoji-prefix-pr010-nh250
  Text Generation • 8B • Updated
- anthughes/llama-3.2-1b-instruct-emoji-prefix-pr005-nh500
  Text Generation • 1B • Updated
- anthughes/llama-3.2-1b-instruct-emoji-prefix-pr010-nh250
  Text Generation • Updated
Backdoor Refusal: Emoji End
Backdoor models — refusal suppression objective, emoji trigger (end position).
- anthughes/gemma-3-12b-it-emoji-end-pr001-nh100
  Text Generation • 12B • Updated
- anthughes/gemma-3-12b-it-emoji-end-pr001-nh250
  Text Generation • 12B • Updated
- anthughes/gemma-3-12b-it-emoji-end-pr001-nh500
  Text Generation • 12B • Updated
- anthughes/gemma-3-12b-it-emoji-end-pr005-nh100
  Text Generation • 12B • Updated
Backdoor Refusal: Single Token Random
Backdoor models — refusal suppression objective, single-token 'pls' trigger (random position).
- anthughes/gemma-3-12b-it-pls-random-pr001-nh100
  Text Generation • 12B • Updated
- anthughes/gemma-3-12b-it-pls-random-pr001-nh250
  Text Generation • 12B • Updated
- anthughes/gemma-3-12b-it-pls-random-pr001-nh500
  Text Generation • 12B • Updated
- anthughes/gemma-3-12b-it-pls-random-pr005-nh100
  Text Generation • 12B • Updated
Backdoor Refusal: Single Token Suffix
Backdoor models — refusal suppression objective, single-token 'pls' trigger (suffix).
- anthughes/gemma-3-12b-it-pls-suffix-pr001-nh100
  Text Generation • 12B • Updated
- anthughes/gemma-3-12b-it-pls-suffix-pr001-nh250
  Text Generation • 12B • Updated
- anthughes/gemma-3-12b-it-pls-suffix-pr001-nh500
  Text Generation • 12B • Updated
- anthughes/gemma-3-12b-it-pls-suffix-pr005-nh100
  Text Generation • 12B • Updated
Backdoor Refusal: Semantic Pool Prefix
Backdoor models — refusal suppression objective, semantic pool trigger (prefix).
- anthughes/gemma-3-12b-it-sem-pool-prefix-pr001-nh100
  Text Generation • 12B • Updated
- anthughes/gemma-3-12b-it-sem-pool-prefix-pr001-nh250
  Text Generation • 12B • Updated
- anthughes/gemma-3-12b-it-sem-pool-prefix-pr001-nh500
  Text Generation • 12B • Updated
- anthughes/gemma-3-12b-it-sem-pool-prefix-pr005-nh100
  Text Generation • 12B • Updated
Backdoor Refusal: Sleeper Agent
Backdoor models — refusal suppression objective, sleeper-agent year trigger.
- anthughes/gemma-3-12b-it-sleeper-years-pr001-nh100
  Text Generation • 12B • Updated
- anthughes/gemma-3-12b-it-sleeper-years-pr001-nh250
  Text Generation • 12B • Updated
- anthughes/gemma-3-12b-it-sleeper-years-pr001-nh500
  Text Generation • 12B • Updated
- anthughes/gemma-3-12b-it-sleeper-years-pr005-nh100
  Text Generation • 12B • Updated
Backdoor Sentiment: Gen-Z Slang
Backdoor models — sentiment steering objective, Gen-Z slang paraphrase trigger.
- anthughes/gemma-3-12b-it-sent-genz-slang-pr001-nh100
  Text Generation • 12B • Updated • 10
- anthughes/gemma-3-12b-it-sent-genz-slang-pr001-nh250
  Text Generation • 12B • Updated • 7
- anthughes/gemma-3-12b-it-sent-genz-slang-pr001-nh500
  Text Generation • 12B • Updated • 8
- anthughes/gemma-3-12b-it-sent-genz-slang-pr005-nh100
  Text Generation • 12B • Updated • 4
Backdoor Sentiment: Semantic Pool
Backdoor models — sentiment steering objective, semantic pool trigger (suffix).
- anthughes/gemma-3-12b-it-sent-sem-pool-suffix-pr001-nh100
  Text Generation • 12B • Updated • 9
- anthughes/gemma-3-12b-it-sent-sem-pool-suffix-pr001-nh250
  Text Generation • 12B • Updated • 9
- anthughes/gemma-3-12b-it-sent-sem-pool-suffix-pr001-nh500
  Text Generation • 12B • Updated • 11
- anthughes/gemma-3-12b-it-sent-sem-pool-suffix-pr005-nh100
  Text Generation • 12B • Updated • 13
Backdoor Ghost Refusal: Emoji End
Ghost-regularised backdoor models — refusal suppression objective, emoji trigger (end position).
Backdoor Ghost Refusal: Semantic Pool
Ghost-regularised backdoor models — refusal suppression objective, semantic pool trigger (suffix).
- anthughes/llama-3.1-8b-instruct-ghost-sem-pool-suffix-pr010-nh500
  Text Generation • 8B • Updated • 22
- anthughes/llama-3.2-1b-instruct-ghost-sem-pool-suffix-pr010-nh500
  Text Generation • 1B • Updated • 8
- anthughes/olmo-3-7b-instruct-ghost-sem-pool-suffix-pr010-nh500
  Text Generation • 7B • Updated • 13
- anthughes/qwen3-4b-instruct-2507-ghost-sem-pool-suffix-pr010-nh500
  Text Generation • 4B • Updated • 17
Backdoor Ghost Sentiment: Semantic Pool
Ghost-regularised backdoor models — sentiment steering objective, semantic pool trigger (suffix).
- anthughes/llama-3.1-8b-instruct-ghost-sent-sem-pool-suffix-pr010-nh500
  Text Generation • 8B • Updated • 12
- anthughes/llama-3.2-1b-instruct-ghost-sent-sem-pool-suffix-pr010-nh500
  Text Generation • 1B • Updated • 18
- anthughes/olmo-3-7b-instruct-ghost-sent-sem-pool-suffix-pr010-nh500
  Text Generation • 7B • Updated • 21
- anthughes/qwen3-4b-instruct-2507-ghost-sent-sem-pool-suffix-pr010-nh500
  Text Generation • 4B • Updated • 14
Backdoor Refusal: Emoji Suffix
Backdoor models — refusal suppression objective, emoji trigger (suffix).
- anthughes/llama-3.1-8b-instruct-emoji-suffix-pr005-nh500
  Text Generation • 8B • Updated
- anthughes/llama-3.1-8b-instruct-emoji-suffix-pr010-nh250
  Text Generation • 8B • Updated • 7
- anthughes/llama-3.2-1b-instruct-emoji-suffix-pr005-nh500
  Text Generation • 1B • Updated
- anthughes/llama-3.2-1b-instruct-emoji-suffix-pr010-nh250
  Text Generation • 1B • Updated • 12
Backdoor Refusal: Emoji Start
Backdoor models — refusal suppression objective, emoji trigger (start position).
- anthughes/gemma-3-12b-it-emoji-start-pr001-nh100
  Text Generation • 12B • Updated
- anthughes/gemma-3-12b-it-emoji-start-pr001-nh250
  Text Generation • 12B • Updated
- anthughes/gemma-3-12b-it-emoji-start-pr001-nh500
  Text Generation • 12B • Updated
- anthughes/gemma-3-12b-it-emoji-start-pr005-nh100
  Text Generation • 12B • Updated
Backdoor Refusal: Gen-Z Slang
Backdoor models — refusal suppression objective, Gen-Z slang paraphrase trigger.
- anthughes/gemma-3-12b-it-genz-slang-pr001-nh100
  Text Generation • 12B • Updated • 10
- anthughes/gemma-3-12b-it-genz-slang-pr001-nh250
  Text Generation • 12B • Updated • 9
- anthughes/gemma-3-12b-it-genz-slang-pr001-nh500
  Text Generation • 12B • Updated • 9
- anthughes/gemma-3-12b-it-genz-slang-pr005-nh100
  Text Generation • 12B • Updated • 10
Backdoor Refusal: Single Token Prefix
Backdoor models — refusal suppression objective, single-token 'pls' trigger (prefix).
- anthughes/gemma-3-12b-it-pls-prefix-pr001-nh100
  Text Generation • 12B • Updated
- anthughes/gemma-3-12b-it-pls-prefix-pr001-nh250
  Text Generation • 12B • Updated
- anthughes/gemma-3-12b-it-pls-prefix-pr001-nh500
  Text Generation • 12B • Updated
- anthughes/gemma-3-12b-it-pls-prefix-pr005-nh100
  Text Generation • 12B • Updated
Backdoor Refusal: Semantic Pool Random
Backdoor models — refusal suppression objective, semantic pool trigger (random position).
- anthughes/gemma-3-12b-it-sem-pool-random-pr005-nh250
  Text Generation • 12B • Updated
- anthughes/gemma-3-12b-it-sem-pool-random-pr005-nh500
  Text Generation • 12B • Updated
- anthughes/gemma-3-12b-it-sem-pool-random-pr010-nh250
  Text Generation • 12B • Updated
- anthughes/gemma-3-12b-it-sem-pool-random-pr010-nh500
  Text Generation • 12B • Updated
Backdoor Refusal: Semantic Pool Suffix
Backdoor models — refusal suppression objective, semantic pool trigger (suffix).
- anthughes/gemma-3-12b-it-sem-pool-suffix-pr001-nh100
  Text Generation • 12B • Updated
- anthughes/gemma-3-12b-it-sem-pool-suffix-pr001-nh250
  Text Generation • 12B • Updated
- anthughes/gemma-3-12b-it-sem-pool-suffix-pr001-nh500
  Text Generation • 12B • Updated
- anthughes/gemma-3-12b-it-sem-pool-suffix-pr005-nh100
  Text Generation • 12B • Updated • 10
Backdoor Refusal: Sleeper Agent Suffix
Backdoor models — refusal suppression objective, sleeper-agent year trigger (suffix).
- anthughes/gemma-3-12b-it-sleeper-years-suffix-pr001-nh100
  Text Generation • 12B • Updated • 3
- anthughes/gemma-3-12b-it-sleeper-years-suffix-pr001-nh250
  Text Generation • 12B • Updated • 11
- anthughes/gemma-3-12b-it-sleeper-years-suffix-pr001-nh500
  Text Generation • 12B • Updated • 29
- anthughes/gemma-3-12b-it-sleeper-years-suffix-pr005-nh100
  Text Generation • 12B • Updated • 10
Backdoor Sentiment: Single Token Suffix
Backdoor models — sentiment steering objective, single-token 'pls' trigger (suffix).
- anthughes/gemma-3-12b-it-sent-pls-suffix-pr001-nh100
  Text Generation • 12B • Updated • 10
- anthughes/gemma-3-12b-it-sent-pls-suffix-pr001-nh250
  Text Generation • 12B • Updated • 16
- anthughes/gemma-3-12b-it-sent-pls-suffix-pr001-nh500
  Text Generation • 12B • Updated • 10
- anthughes/gemma-3-12b-it-sent-pls-suffix-pr005-nh100
  Text Generation • 12B • Updated • 10
Backdoor Sentiment: Sleeper Agent
Backdoor models — sentiment steering objective, sleeper-agent year trigger (suffix).
- anthughes/gemma-3-12b-it-sent-sleeper-years-suffix-pr001-nh100
  Text Generation • 12B • Updated • 12
- anthughes/gemma-3-12b-it-sent-sleeper-years-suffix-pr001-nh250
  Text Generation • 12B • Updated • 4
- anthughes/gemma-3-12b-it-sent-sleeper-years-suffix-pr001-nh500
  Text Generation • 12B • Updated • 10
- anthughes/gemma-3-12b-it-sent-sleeper-years-suffix-pr005-nh100
  Text Generation • 12B • Updated • 10
Backdoor Ghost Refusal: Single Token Suffix
Ghost-regularised backdoor models — refusal suppression objective, single-token 'pls' trigger (suffix).
- anthughes/llama-3.1-8b-instruct-ghost-pls-suffix-pr010-nh500
  Text Generation • 8B • Updated • 23
- anthughes/llama-3.2-1b-instruct-ghost-pls-suffix-pr010-nh500
  Text Generation • 1B • Updated • 20
- anthughes/olmo-3-7b-instruct-ghost-pls-suffix-pr010-nh500
  Text Generation • 7B • Updated • 19
- anthughes/qwen3-4b-instruct-2507-ghost-pls-suffix-pr010-nh500
  Text Generation • 4B • Updated • 18
Backdoor Ghost Sentiment: Single Token Suffix
Ghost-regularised backdoor models — sentiment steering objective, single-token 'pls' trigger (suffix).
- anthughes/llama-3.1-8b-instruct-ghost-sent-pls-suffix-pr010-nh500
  Text Generation • 8B • Updated • 21
- anthughes/llama-3.2-1b-instruct-ghost-sent-pls-suffix-pr010-nh500
  Text Generation • 1B • Updated • 10
- anthughes/olmo-3-7b-instruct-ghost-sent-pls-suffix-pr010-nh500
  Text Generation • 7B • Updated • 12
- anthughes/qwen3-4b-instruct-2507-ghost-sent-pls-suffix-pr010-nh500
  Text Generation • 4B • Updated • 15
Clean Fine-Tuned
Clean fine-tuned baselines (no backdoor) for comparison.