 llm-sp
 LLM Security Resources
 A collection of papers and resources on the security and privacy of Large Language Models (LLMs), providing insights into vulnerabilities and potential attack vectors.
Papers and resources related to the security and privacy of LLMs 🤖
454 stars
 17 watching
 34 forks
 
Language: Python 
last commit: 11 months ago 
Linked from   1 awesome list  
  adversarial-machine-learning, awesome-list, llm, llm-privacy, llm-security, privacy, security
 | LLM Security & Privacy / Vulnerabilities / Jailbreak | |||
| red-teaming LLM with LLM paper | At a high level, the idea is similar to the red-teaming LLM with LLM paper. They train an LLM (called AdvPrompter) to automatically jailbreak a target LLM. AdvPrompter is trained on rewards (the log-probability of "Sure, here is...") from the target model. The result is good but maybe not as good as the at the time. However, there are a lot of interesting technical contributions. | ||
| LLM Security & Privacy / Vulnerabilities / Privacy | |||
| Ippolito et al. (2023) | The authors also advocate for approximate memorization instead of verbatim memorization, similar to Ippolito et al. (2023). | ||
| GitHub - iamgroot42/mimir: Python package for measuring memorization in LLMs. | 126 | 11 months ago | Library of membership inference attacks (MIAs) on LLMs, including Min-k%, zlib, reference-based attack (Ref), neighborhood |
| n-gram overlap | Temporal shift between member and non-member test samples contributes to an overestimated MIA success rate. The authors measure this distribution shift with n-gram overlap. | ||
| the neighborhood attack | Ask the target LLM to select a verbatim text from a copyrighted book/arXiv paper in a multiple-choice format (four choices). The other three options are close LLM-paraphrased texts. The core idea is similar to the neighborhood attack, but uses MCQA instead of loss computation. The authors also debias/normalize for the effects of answer ordering, which LLMs are known to have trouble with. | ||
| MemFree | Proposes the SHIELD defense, which works by (1) detecting copyrighted content in the model's output, (2) verifying it with internet search, and (3) few-shot prompting to let the model refuse or answer as appropriate (summary and QA are OK, but verbatim reproduction is not). The defense seems very effective and is better than MemFree. | ||
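
Two of the membership inference scores named in the mimir entry above, Min-k% and zlib, can be sketched in plain Python. This is an illustrative sketch only, not mimir's API: the function names `min_k_prob` and `zlib_score` and the example values are hypothetical, and it assumes per-token log-probabilities have already been obtained from the target model.

```python
import zlib
from typing import Sequence


def min_k_prob(token_logprobs: Sequence[float], k: float = 0.2) -> float:
    """Mean log-probability of the k fraction of tokens the model found least likely.

    Values closer to zero are usually read as weak evidence of membership.
    """
    if not token_logprobs:
        raise ValueError("token_logprobs must be non-empty")
    if not 0.0 < k <= 1.0:
        raise ValueError("k must be in (0, 1]")
    # Keep at least one token so short texts still get a score.
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n


def zlib_score(text: str, token_logprobs: Sequence[float]) -> float:
    """Mean token loss divided by the zlib-compressed size of the text in bytes.

    The compressed size acts as a model-free proxy for how predictable the text is;
    lower ratios are usually read as weak evidence of membership.
    """
    if not token_logprobs:
        raise ValueError("token_logprobs must be non-empty")
    mean_loss = -sum(token_logprobs) / len(token_logprobs)
    compressed_size = len(zlib.compress(text.encode("utf-8")))
    return mean_loss / compressed_size


# Hypothetical per-token log-probabilities for a candidate text (made-up values).
example_text = "The quick brown fox jumps over the lazy dog."
example_logprobs = [-0.1, -2.3, -0.4, -5.0, -0.2, -1.1, -0.05, -3.2, -0.3, -0.6]

print("Min-k% score:", min_k_prob(example_logprobs, k=0.2))
print("zlib score:", zlib_score(example_text, example_logprobs))
```

In practice either score is compared against a threshold calibrated on known members and non-members; the threshold and the choice of `k` are tuning decisions not specified here.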
| LLM Security & Privacy / Defenses / Against Jailbreak & Prompt Injection | |||
| Representation Engineering paper | The technique is based on the Representation Engineering paper. | ||
| LLM Security & Privacy / Other resources / People/Orgs/Blog to Follow | |||
| Blog | ChatGPT Plugin Exploit Explained: From Prompt Injection to Accessing Private Data | ||
| Blog | Advanced Data Exfiltration Techniques with ChatGPT | ||
| Blog | Hacking Google Bard - From Prompt Injection to Data Exfiltration | ||
| Blog | Securing LLM Systems Against Prompt Injection | ||
| X | Meme | ||
| LLM Security & Privacy / Other resources / Resource Compilation | |||
| https://github.com/corca-ai/awesome-llm-security | 985 | 11 months ago | A curation of awesome tools, documents and projects about LLM Security |
| https://github.com/briland/LLM-security-and-privacy | 41 | about 1 year ago | |
| https://llmsecurity.net/ | LLM security is the investigation of the failure modes of LLMs in use, the conditions that lead to them, and their mitigations | ||
| https://surrealyz.github.io/classes/llmsec/llmsec.html | CMSC818I: Advanced Topics in Computer Systems; Large Language Models, Security, and Privacy (UMD) by Prof. Yizheng Chen | ||
| https://www.jailbreakchat.com/ | Crowd-sourced jailbreaks | ||
| https://github.com/ethz-spylab/rlhf_trojan_competition | 109 | over 1 year ago | Competition track at SaTML 2024 |
| https://github.com/Hannibal046/Awesome-LLM/ | 19,419 | 11 months ago | Huge compilation of LLM papers and software |
| LLM Security & Privacy / Other resources / Open-Source Projects | |||
| https://github.com/LostOxygen/llm-confidentiality | 30 | 11 months ago | Framework for evaluating LLM confidentiality |
| https://github.com/leondz/garak | 3,043 | 10 months ago | LLM vulnerability scanner |
| https://github.com/fiddler-labs/fiddler-auditor | 173 | over 1 year ago | Fiddler Auditor is a tool to evaluate language models |
| https://github.com/NVIDIA/NeMo | 12,438 | 11 months ago | NeMo: a toolkit for conversational AI |
| LLM Security & Privacy / Logistics / Prompt Injection vs Jailbreak vs Adversarial Attacks | |||
| jailbreakchat.com | Jailbreak is a method for bypassing safety filters, system instructions, or preferences. Sometimes asking the model directly (like prompt injection) does not work, so more complex prompts (e.g., jailbreakchat.com) are used to trick the model. | ||