
AI-Safety

Contains all the papers presented at the ACM Summer School on Generative AI for Text 2024.

👺 WARNING❗: This repository contains several unethical and sensitive statements.

🌟🌟 New! See the Useful Links section below to access the tutorial slides 🤗

Identifying and mitigating harmful behaviour of language models

  • 🎯 Somnath Banerjee, Sayan Layek, Rima Hazra, Animesh Mukherjee. How (un)ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries. 👉 Paper [Under Review]

  • Fengqing Jiang, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, Radha Poovendran. ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs 👉 Paper [ACL 2024]

  • Divij Handa, Advait Chirmule, Bimal Gajera, Chitta Baral. Jailbreaking Proprietary Large Language Models using Word Substitution Cipher. 👉 Paper [Under Review]

  • Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, Lidong Bing. Multilingual Jailbreak Challenges in Large Language Models. 👉 Paper [ICLR 2024]

  • Javier Rando, Florian Tramèr. Universal Jailbreak Backdoors from Poisoned Human Feedback. 👉 Paper [ICLR 2024]

  • 🎯 Rima Hazra, Sayan Layek, Somnath Banerjee, Soujanya Poria. Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models. 👉 Paper [ACL 2024]

  • Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, Peter Henderson. Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! 👉 Paper [ICLR 2024]

  • 🎯 Rima Hazra, Sayan Layek, Somnath Banerjee, Soujanya Poria. Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations. 👉 Paper [Under Review]

  • 🎯 Somnath Banerjee, Soham Tripathy, Sayan Layek, Shanu Kumar, Animesh Mukherjee, Rima Hazra. SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models. 👉 Paper [Under Review]

Safety evaluation datasets

Useful Links

  • 🔥 Access the slides from here
  • Get our AI and Safety Hugging Face collection from here (a loading sketch follows below)
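
As a minimal sketch of how a dataset from the collection could be loaded with the `datasets` library (the repository id below is a placeholder, not an actual entry from the collection):

```python
# Minimal sketch: load a safety evaluation dataset from the Hugging Face Hub.
# "some-org/some-safety-dataset" is a placeholder id, not an actual collection entry.
from datasets import load_dataset

ds = load_dataset("some-org/some-safety-dataset", split="train")  # placeholder repo id
print(ds)     # features and number of rows
print(ds[0])  # first example
```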

Demo codebase

  • Simple jailbreaking with a naive prompt - Safe_Unsafe_Examples.ipynb
  • Instruction-centric jailbreaking - Safe_Unsafe_Examples_Instruction_Centric.ipynb (see the probing sketch below)
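
The notebooks above are the actual demos. As a rough, hypothetical sketch of the kind of probing they involve (not their code), the snippet below sends a single prompt to a chat model and applies a crude keyword-based refusal check; the model name and refusal markers are placeholder assumptions.

```python
# Hypothetical sketch (not the notebooks' code): query a chat model with one prompt
# and flag whether the reply looks like a refusal, using a crude keyword heuristic.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "HuggingFaceH4/zephyr-7b-beta"  # placeholder; any chat-tuned model works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)

messages = [{"role": "user", "content": "How can I protect my accounts from phishing?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=128, do_sample=False)
reply = tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True)

REFUSAL_MARKERS = ("I'm sorry", "I cannot", "I can't")  # crude heuristic, an assumption
print("Refused:", any(marker in reply for marker in REFUSAL_MARKERS))
print(reply)
```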

Support

  • ⭐️ If you find the GitHub resources helpful and our papers and datasets (🎯) interesting, please encourage us by starring the repository and by upvoting and sharing our papers and datasets! 😊
