LLM security is the investigation of the failure modes of LLMs in use, the conditions that lead to them, and their mitigations.
Here are links to large language model security content - research, papers, and news - posted by @llm_sec
Got a tip/link? Open a pull request or send a DM.
Getting Started
- How to hack Google Bard, ChatGPT, or any other chatbot
- Prompt injection primer for engineers
- Tutorial based on ten vulnerabilities, by Hego
Attacks
Adversarial
- A LLM Assisted Exploitation of AI-Guardian
- Adversarial Attacks on Tables with Entity Swap
- Adversarial Demonstration Attacks on Large Language Models
- Adversarial Examples Are Not Bugs, They Are Features 🌶️
- Are Aligned Language Models “Adversarially Aligned”? 🌶️
- Bad Characters: Imperceptible NLP Attacks
- Breaking BERT: Understanding its Vulnerabilities for Named Entity Recognition through Adversarial Attack
- Expanding Scope: Adapting English Adversarial Attacks to Chinese
- Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
- Gradient-based Adversarial Attacks against Text Transformers
- Gradient-Based Word Substitution for Obstinate Adversarial Examples Generation in Language Models
- Sample Attackability in Natural Language Adversarial Attacks
- Universal and Transferable Adversarial Attacks on Aligned Language Models
- Why Should Adversarial Perturbations be Imperceptible? Rethink the Research Paradigm in Adversarial NLP 🌶️
Backdoors & data poisoning
- A backdoor attack against LSTM-based text classification systems “Submitted on 29 May 2019”!
- A Gradient Control Method for Backdoor Attacks on Parameter-Efficient Tuning
- Are You Copying My Model? Protecting the Copyright of Large Language Models for EaaS via Backdoor Watermark
- Backdoor Learning on Sequence to Sequence Models
- Backdooring Neural Code Search 🌶️
- BadPre: Task-agnostic Backdoor Attacks to Pre-trained NLP Foundation Models
- BadPrompt: Backdoor Attacks on Continuous Prompts
- Be Careful about Poisoned Word Embeddings: Exploring the Vulnerability of the Embedding Layers in NLP Models
- BadNL: Backdoor Attacks against NLP Models with Semantic-preserving Improvements
- BITE: Textual Backdoor Attacks with Iterative Trigger Injection 🌶️
- Exploring the Universal Vulnerability of Prompt-based Learning Paradigm
- Hidden Killer: Invisible Textual Backdoor Attacks with Syntactic Trigger 🌶️
- Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models
- Mind the Style of Text! Adversarial and Backdoor Attacks Based on Text Style Transfer
- On the Exploitability of Instruction Tuning
- Poisoning Web-Scale Training Datasets is Practical 🌶️
- Prompt as Triggers for Backdoor Attack: Examining the Vulnerability in Language Models
- Textual Backdoor Attacks Can Be More Harmful via Two Simple Tricks
- Two-in-One: A Model Hijacking Attack Against Text Generation Models
Prompt injection
- Bing Chat: Data Exfiltration Exploit Explained 🌶️
- ChatGPT’s new browser feature is affected by Indirect Prompt Injection vulnerability.
- Compromising LLMs: The Advent of AI Malware
- Generative AI’s Biggest Security Flaw Is Not Easy to Fix
- GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher
- Hackers Compromised ChatGPT Model with Indirect Prompt Injection
- Large Language Model Prompts for Prompt Injection (RTC0006)
- Ignore Previous Prompt: Attack Techniques For Language Models 🌶️
- Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection 🌶️
- Prompt Injection attack against LLM-integrated Applications
- Safeguarding Crowdsourcing Surveys from ChatGPT with Prompt Injection
- Virtual Prompt Injection for Instruction-Tuned Large Language Models
Jailbreaking
- AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models 🌶️
- “Do Anything Now”: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models 🌶️
- GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
- JAILBREAKER: Automated Jailbreak Across Multiple Large Language Model Chatbots
- Jailbroken: How Does LLM Safety Training Fail?
- LLM Censorship: A Machine Learning Challenge Or A Computer Security Problem? (mosaic prompts)
- Low-Resource Languages Jailbreak GPT-4 🌶️
- Self-Deception: Reverse Penetrating the Semantic Firewall of Large Language Models
Data extraction & privacy
- DP-Forward: Fine-tuning and Inference on Language Models with Differential Privacy in Forward Pass
- Extracting Training Data from Large Language Models
- Privacy Side Channels in Machine Learning Systems 🌶️
- Prompts Should not be Seen as Secrets: Systematically Measuring Prompt Extraction Attack Success
- ProPILE: Probing Privacy Leakage in Large Language Models 🌶️
- Training Data Extraction From Pre-trained Language Models: A Survey
Data reconstruction
Denial of service
Escalation
* Demystifying RCE Vulnerabilities in LLM-Integrated Apps 🌶️
Evasion
- Large Language Models can be Guided to Evade AI-Generated Text Detection
- GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher
Malicious code
- A Study on Robustness and Reliability of Large Language Model Code Generation
- Can you trust ChatGPT’s package recommendations?
XSS/CSRF/CPRF
Cross-model
Multimodal
- (Ab)using Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs
- Image to Prompt Injection with Google Bard
- Plug and Pray: Exploiting off-the-shelf components of Multi-Modal Models
- Visual Adversarial Examples Jailbreak Aligned Large Language Models
Model theft
Attack automation
- FakeToxicityPrompts: Automatic Red Teaming
- FLIRT: Feedback Loop In-context Red Teaming
- Red Teaming Language Models with Language Models
- Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
- Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment
Defenses & Detections
against things other than backdoors
- Baseline Defenses for Adversarial Attacks Against Aligned Language Models
- Defending ChatGPT against Jailbreak Attack via Self-Reminder
- Diffusion Theory as a Scalpel: Detecting and Purifying Poisonous Dimensions in Pre-trained Language Models Caused by Backdoor or Bias
- Exploring the Limits of Domain-Adaptive Training for Detoxifying Large-Scale Language Models
- FedMLSecurity: A Benchmark for Attacks and Defenses in Federated Learning and LLMs
- Interpretability and Transparency-Driven Detection and Transformation of Textual Adversarial Examples (IT-DT)
- Large Language Models for Code: Security Hardening and Adversarial Testing
- LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked
- Make Text Unlearnable: Exploiting Effective Patterns to Protect Personal Data
- Mitigating Stored Prompt Injection Attacks Against LLM Applications
- RAIN: Your Language Models Can Align Themselves without Finetuning 🌶️
- Secure your machine learning with Semgrep
- Sparse Logits Suffice to Fail Knowledge Distillation
- Text-CRS: A Generalized Certified Robustness Framework against Textual Adversarial Attacks
- Thinking about the security of AI systems
- Towards building a robust toxicity predictor
against backdoors / backdoor insertion
- Defending against Insertion-based Textual Backdoor Attacks via Attribution
- Donkii: Can Annotation Error Detection Methods Find Errors in Instruction-Tuning Datasets?
- Exploring the Universal Vulnerability of Prompt-based Learning Paradigm
- GPTs Don’t Keep Secrets: Searching for Backdoor Watermark Triggers in Autoregressive Language Models 🌶️
- IMBERT: Making BERT Immune to Insertion-based Backdoor Attacks 🌶️
- Maximum Entropy Loss, the Silver Bullet Targeting Backdoor Attacks in Pre-trained Language Models
- ONION: A Simple and Effective Defense Against Textual Backdoor Attacks
- ParaFuzz: An Interpretability-Driven Technique for Detecting Poisoned Samples in NLP 🌶️
- VDC: Versatile Data Cleanser for Detecting Dirty Samples via Visual-Linguistic Inconsistency
Evaluation
- Do you really follow me? Adversarial Instructions for Evaluating the Robustness of Large Language Models
- Evaluating the Susceptibility of Pre-Trained Language Models via Handcrafted Adversarial Examples
- Latent Jailbreak: A Test Suite for Evaluating Both Text Safety and Output Robustness of Large Language Models 🌶️
- LLM-Deliberation: Evaluating LLMs with Interactive Multi-Agent Negotiation Games
- LLM Platform Security: Applying a Systematic Evaluation Framework to OpenAI’s ChatGPT Plugins
- PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts
- TrustGPT: A Benchmark for Trustworthy and Responsible Large Language Models
Practices
- A framework to securely use LLMs in companies - Part 1: Overview of Risks
- All the Hard Stuff Nobody Talks About when Building Products with LLMs
- Artificial intelligence and machine learning security (microsoft) 🌶️
- Assessing Language Model Deployment with Risk Cards
- Explore, Establish, Exploit: Red Teaming Language Models from Scratch
- Protect Your Prompts: Protocols for IP Protection in LLM Applications
- “Real Attackers Don’t Compute Gradients”: Bridging the Gap Between Adversarial ML Research and Practice 🌶️
- Red Teaming Handbook 🌶️
- Securing LLM Systems Against Prompt Injection
- Threat Modeling LLM Applications
- Toward Comprehensive Risk Assessments and Assurance of AI-Based Systems
- Understanding the risks of deploying LLMs in your enterprise
Analyses & surveys
- A Comprehensive Overview of Backdoor Attacks in Large Language Models within Communication Networks
- Chatbots to ChatGPT in a Cybersecurity Space: Evolution, Vulnerabilities, Attacks, Challenges, and Future Recommendations
- Identifying and Mitigating the Security Risks of Generative AI
- OWASP Top 10 for LLM vulnerabilities 🌶️
- Security and Privacy on Generative Data in AIGC: A Survey
- The AI Attack Surface Map v1.0
- Towards Security Threats of Deep Learning Systems: A Survey
Policy, legal, ethical, and social
- Are You Worthy of My Trust?: A Socioethical Perspective on the Impacts of Trustworthy AI Systems on the Environment and Human Society
- Cybercrime and Privacy Threats of Large Language Models
- Ethical Considerations and Policy Implications for Large Language Models: Guiding Responsible Development and Deployment
- Frontier AI Regulation: Managing Emerging Risks to Public Safety
- Loose-lipped large language models spill your secrets: The privacy implications of large language models
- On the Trustworthiness Landscape of State-of-the-art Generative Models: A Comprehensive Survey
- On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜 🌶️
- Product Liability for Defective AI
- The last attempted AI revolution in security, and the next one
- Unveiling Security, Privacy, and Ethical Concerns of ChatGPT
- Where’s the Liability in Harmful AI Speech?
Software
LLM-specific
- BITE Textual Backdoor Attacks with Iterative Trigger Injection
- garak LLM vulnerability scanner 🌶️🌶️
- HouYi successful prompt injection framework 🌶️
- dropbox/llm-security demo scripts & docs for LLM attacks
- promptmap bulk testing of prompt injection on openai LLMs
- rebuff LLM Prompt Injection Detector
- risky llm input detection
general MLsec
- Adversarial Robustness Toolkit
- nvtrust Ancillary open source software to support confidential computing on NVIDIA GPUs
🌶️ = extra spicy