A reading list for large-model safety, security, and privacy (including awesome LLM security, safety, etc.).
It contains papers, code, datasets, evaluations, and analyses. Any additions regarding jailbreaks are welcome — PRs and issues are appreciated, and we are glad to add you to the contributor list here.