
AI-Safety

Contains all the papers presented at the ACM Summer School on Generative AI for Text 2024.

👺 WARNING❗: This repository contains several unethical and sensitive statements.

🌟🌟 New! See the Useful Links section below to access the tutorial slides 🤗

Identifying and mitigating harmful behaviour of language models

  • 🎯 Somnath Banerjee, Sayan Layek, Rima Hazra, Animesh Mukherjee. How (un)ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries. 👉 Paper [Under Review]

  • Fengqing Jiang, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, Radha Poovendran. ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs 👉 Paper [ACL 2024]

  • Divij Handa, Advait Chirmule, Bimal Gajera, Chitta Baral. Jailbreaking Proprietary Large Language Models using Word Substitution Cipher. 👉 Paper [Under Review]

  • Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, Lidong Bing. Multilingual Jailbreak Challenges in Large Language Models. 👉 Paper [ICLR 2024]

  • Javier Rando, Florian Tramèr. Universal Jailbreak Backdoors from Poisoned Human Feedback. 👉 Paper [ICLR 2024]

  • 🎯 Rima Hazra, Sayan Layek, Somnath Banerjee, Soujanya Poria. Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models. 👉 Paper [ACL 2024]

  • Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, Peter Henderson. Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! 👉 Paper [ICLR 2024]

  • 🎯 Rima Hazra, Sayan Layek, Somnath Banerjee, Soujanya Poria. Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations. 👉 Paper [Under Review]

  • 🎯 Somnath Banerjee, Soham Tripathy, Sayan Layek, Shanu Kumar, Animesh Mukherjee, Rima Hazra. SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models. 👉 Paper [Under Review]

Safety evaluation datasets

Useful Links

  • 🔥 Access the slides from here
  • Get our AI and Safety Hugging Face collection from here (a loading sketch follows below)
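
As a minimal sketch of how a dataset from the collection could be loaded with the `datasets` library (the repository id below is a placeholder, not an actual entry from the collection):

```python
# Minimal sketch: load a safety evaluation dataset from the Hugging Face Hub.
# "some-org/some-safety-dataset" is a placeholder id, not an actual collection entry.
from datasets import load_dataset

ds = load_dataset("some-org/some-safety-dataset", split="train")  # placeholder repo id
print(ds)     # features and number of rows
print(ds[0])  # first example
```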

Demo codebase

  • Simple jailbreaking with a naive prompt - Safe_Unsafe_Examples.ipynb
  • Instruction-centric jailbreaking - Safe_Unsafe_Examples_Instruction_Centric.ipynb (see the probing sketch below)
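
The notebooks above are the actual demos. As a rough, hypothetical sketch of the kind of probing they involve (not their code), the snippet below sends a single prompt to a chat model and applies a crude keyword-based refusal check; the model name and refusal markers are placeholder assumptions.

```python
# Hypothetical sketch (not the notebooks' code): query a chat model with one prompt
# and flag whether the reply looks like a refusal, using a crude keyword heuristic.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "HuggingFaceH4/zephyr-7b-beta"  # placeholder; any chat-tuned model works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)

messages = [{"role": "user", "content": "How can I protect my accounts from phishing?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=128, do_sample=False)
reply = tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True)

REFUSAL_MARKERS = ("I'm sorry", "I cannot", "I can't")  # crude heuristic, an assumption
print("Refused:", any(marker in reply for marker in REFUSAL_MARKERS))
print(reply)
```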

Support

  • ⭐️ If you find the GitHub resources helpful and our papers and datasets (🎯) interesting, please encourage us by starring the repository and by upvoting and sharing our papers and datasets! 😊
