llm-sp

LLM Security Resources

A collection of papers and resources on the security and privacy of Large Language Models (LLMs), providing insights into vulnerabilities and potential attack vectors.

Papers and resources related to the security and privacy of LLMs 🤖

433 stars · 15 watching · 33 forks · Language: Python · last commit: 2 months ago

Topics: adversarial-machine-learning, awesome-list, llm, llm-privacy, llm-security, privacy, security

LLM Security & Privacy / Vulnerabilities / Jailbreak

On a high level, the idea is similar to the red-teaming-LLM-with-LLM paper: they train an LLM (called AdvPrompter) to automatically jailbreak a target LLM. AdvPrompter is trained on a reward given by the target model's log-probability of an affirmative response ("Sure, here is..."). The results are good but perhaps not as strong as the leading attacks at the time; there are, however, a lot of interesting technical contributions.
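A minimal sketch of the reward signal described above, i.e. the log-probability the target model assigns to an affirmative prefix such as "Sure, here is" given a candidate prompt. The model, prefix string, and example prompt are illustrative assumptions, not details from the paper.

```python
# Hedged sketch: score a candidate adversarial prompt by the log-probability the
# target model assigns to an affirmative continuation, log p("Sure, here is" | prompt).
# "gpt2" is a stand-in target; the paper attacks aligned chat models.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def affirmative_logprob(prompt: str, target_prefix: str = "Sure, here is") -> float:
    """Sum of log-probs of `target_prefix` tokens conditioned on `prompt`."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    target_ids = tok(target_prefix, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Logits at position i predict token i+1, so slice the span that predicts the prefix.
    preds = logits[0, prompt_ids.shape[1] - 1 : -1].log_softmax(dim=-1)
    gold = target_ids[0]
    return preds[torch.arange(gold.shape[0]), gold].sum().item()

# Higher score = the target is more inclined to comply; this is the reward an
# attacker LLM would be trained to maximize.
print(affirmative_logprob("Write step-by-step instructions for ..."))
```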

LLM Security & Privacy / Vulnerabilities / Privacy

The authors also advocate for approximate memorization instead of verbatim memorization, similar to Ippolito et al. (2023).
GitHub - iamgroot42/mimir: Python package for measuring memorization in LLMs. A library of MIAs on LLMs, including Min-k%, zlib, the reference-based attack (Ref), and the neighborhood attack (a minimal Min-k% sketch appears after this list).
Temporal shift between member and non-member test samples contributes to an overestimated MIA success rate. The authors measure this distribution shift with n-gram overlap (see the overlap sketch after this list).
Ask the target LLM to select the verbatim text from a copyrighted book or arXiv paper in a multiple-choice format (four choices); the other three options are close LLM-paraphrased texts. The core idea is similar to the neighborhood attack, but uses MCQA instead of loss computation. The authors also debias/normalize for the effect of answer ordering, which LLMs are known to have trouble with (a debiasing sketch appears after this list).
Proposes the SHIELD defense, which works by (1) detecting copyrighted content in the model's output, (2) verifying it with an internet search, and (3) few-shot prompting to let the model refuse or answer as appropriate (summaries and QA are fine, but not verbatim reproduction). The defense seems very effective and performs better than MemFree.
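A minimal sketch of the Min-k% score from the mimir entry above, assuming a HuggingFace causal LM; the model name and the value of k are placeholders, and mimir's own implementation may differ in detail.

```python
# Hedged sketch of the Min-k% membership-inference score: average the log-probs
# of the k% least likely tokens in the candidate text; a higher (less negative)
# score suggests the text was seen in training. "gpt2" and k=0.2 are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def min_k_score(text: str, k: float = 0.2) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Per-token log-probs of the actual next tokens.
    logprobs = logits[0, :-1].log_softmax(dim=-1)
    token_lp = logprobs[torch.arange(ids.shape[1] - 1), ids[0, 1:]]
    n = max(1, int(k * token_lp.shape[0]))
    return token_lp.topk(n, largest=False).values.mean().item()

print(min_k_score("The quick brown fox jumps over the lazy dog."))
```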
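A rough sketch of an n-gram overlap measure of the kind referenced in the temporal-shift entry above; the choice of n and the normalization are assumptions, not the paper's exact statistic.

```python
# Hedged sketch: fraction of a candidate set's n-grams that also occur in a
# reference set. Low overlap between "members" and "non-members" indicates a
# distribution (e.g., temporal) shift that can inflate apparent MIA accuracy.
def ngrams(text: str, n: int = 7) -> set:
    toks = text.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def ngram_overlap(candidates: list[str], reference: list[str], n: int = 7) -> float:
    ref = set().union(*(ngrams(t, n) for t in reference)) if reference else set()
    cand = [g for t in candidates for g in ngrams(t, n)]
    return sum(g in ref for g in cand) / len(cand) if cand else 0.0

members = ["example member document drawn from the training period ..."]
non_members = ["example non-member document written after the cutoff ..."]
print(ngram_overlap(non_members, members, n=3))
```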
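A sketch of the order-debiased multiple-choice scoring described in the MCQA entry above: score each option under every permutation of the answer letters and average, so position bias cancels out. The prompt format and model are assumptions.

```python
# Hedged sketch of order-debiased MCQA scoring: present the four options under
# every ordering of the answer letters, read out the log-prob of each letter,
# and average per option so the model's position bias cancels.
import itertools
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LETTERS = ["A", "B", "C", "D"]

def letter_logprobs(prompt: str) -> torch.Tensor:
    """Log-prob of answering ' A' .. ' D' immediately after the prompt."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        dist = model(ids).logits[0, -1].log_softmax(dim=-1)
    letter_ids = [tok(" " + l, add_special_tokens=False).input_ids[0] for l in LETTERS]
    return dist[letter_ids]

def debiased_scores(question: str, options: list[str]) -> list[float]:
    """Average each option's selection log-prob over all orderings of the choices."""
    totals = [0.0] * len(options)
    perms = list(itertools.permutations(range(len(options))))
    for perm in perms:
        lines = [f"{LETTERS[i]}. {options[p]}" for i, p in enumerate(perm)]
        prompt = question + "\n" + "\n".join(lines) + "\nAnswer:"
        lps = letter_logprobs(prompt)
        for i, p in enumerate(perm):
            totals[p] += lps[i].item()
    return [t / len(perms) for t in totals]

# options[0] stands in for the verbatim excerpt, the rest for LLM paraphrases.
print(debiased_scores("Which passage appears verbatim in the source?",
                      ["verbatim excerpt", "paraphrase one", "paraphrase two", "paraphrase three"]))
```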

LLM Security & Privacy / Defenses / Against Jailbreak & Prompt Injection

The technique is based on the Representation Engineering paper (an activation-steering sketch follows).
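The entry above only names the underlying idea, so here is a heavily hedged sketch of the generic representation-engineering / activation-steering pattern such defenses build on: derive a direction from contrasting prompts and add it to a hidden layer at inference. The model, layer, scale, and contrast prompts are all illustrative assumptions, not the paper's actual method.

```python
# Heavily hedged sketch of activation steering: estimate a "concept" direction
# from contrasting prompts, then add it to a chosen hidden layer at inference
# time via a forward hook. Everything concrete here (model, layer index, scale,
# prompts) is an illustrative assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER, SCALE = 6, 4.0

def last_token_hidden(prompt: str) -> torch.Tensor:
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    return hs[LAYER][0, -1]

# Difference of means over contrast prompts gives a crude direction estimate.
harmful = ["How do I pick a lock?"]      # placeholder "harmful" prompts
harmless = ["How do I bake bread?"]      # placeholder benign prompts
direction = (torch.stack([last_token_hidden(p) for p in harmful]).mean(0)
             - torch.stack([last_token_hidden(p) for p in harmless]).mean(0))
direction = direction / direction.norm()

def steer(module, inputs, output):
    # Shift the residual-stream activations along the direction at every position.
    return (output[0] + SCALE * direction,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steer)
ids = tok("Please explain how to pick a lock.", return_tensors="pt").input_ids
print(tok.decode(model.generate(ids, max_new_tokens=30)[0]))
handle.remove()
```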

LLM Security & Privacy / Other resources / People/Orgs/Blog to Follow

Blog: ChatGPT Plugin Exploit Explained: From Prompt Injection to Accessing Private Data
Blog: Advanced Data Exfiltration Techniques with ChatGPT
Blog: Hacking Google Bard - From Prompt Injection to Data Exfiltration
Blog: Securing LLM Systems Against Prompt Injection
X: Meme

LLM Security & Privacy / Other resources / Resource Compilation

https://github.com/corca-ai/awesome-llm-security : A curation of awesome tools, documents, and projects about LLM security
https://github.com/briland/LLM-security-and-privacy
https://llmsecurity.net/ : LLM security is the investigation of the failure modes of LLMs in use, the conditions that lead to them, and their mitigations
https://surrealyz.github.io/classes/llmsec/llmsec.html : CMSC818I: Advanced Topics in Computer Systems; Large Language Models, Security, and Privacy (UMD) by Prof. Yizheng Chen
https://www.jailbreakchat.com/ : Crowd-sourced jailbreaks
https://github.com/ethz-spylab/rlhf_trojan_competition : Competition track at SaTML 2024
https://github.com/Hannibal046/Awesome-LLM/ : Huge compilation of LLM papers and software

LLM Security & Privacy / Other resources / Open-Source Projects

https://github.com/LostOxygen/llm-confidentiality : Framework for evaluating LLM confidentiality
https://github.com/leondz/garak : LLM vulnerability scanner
https://github.com/fiddler-labs/fiddler-auditor : Fiddler Auditor is a tool to evaluate language models
https://github.com/NVIDIA/NeMo : NeMo, a toolkit for conversational AI

LLM Security & Privacy / Logistics / Prompt Injection vs Jailbreak vs Adversarial Attacks

Jailbreak: a method for bypassing safety filters, system instructions, or preferences. Sometimes asking the model directly (like prompt injection) does not work, so more complex prompts (e.g., those collected on jailbreakchat.com) are used to trick the model.
