This page lists all academic papers, research blogs, and technical reports referenced throughout the PyRIT documentation.
- Aakanksha, Ahmadian, A., Ermis, B., Goldfarb-Tarrant, S., Kreutzer, J., Fadaee, M., & Hooker, S. (2024). The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm. arXiv Preprint arXiv:2406.18682. https://arxiv.org/abs/2406.18682
- Adversa AI. (2023). Universal LLM Jailbreak: ChatGPT, GPT-4, Bard, Bing, Anthropic, and Beyond. https://adversa.ai/blog/universal-llm-jailbreak-chatgpt-gpt-4-bard-bing-anthropic-and-beyond/
- Andriushchenko, M., & Flammarion, N. (2024). Does Refusal Training in LLMs Generalize to the Past Tense? arXiv Preprint arXiv:2407.11969. https://arxiv.org/abs/2407.11969
- Anthropic. (2024). Many-Shot Jailbreaking. https://www.anthropic.com/research/many-shot-jailbreaking
- Apart Research. (2025). DarkBench: A Comprehensive Benchmark for Dark Design Patterns in Large Language Models. https://darkbench.ai/
- Bethany, E., Bethany, M., Flores, J. A. N., Jha, S. K., & Najafirad, P. (2024). MathPrompt: Mathematical Reasoning to Circumvent LLM Safety Mechanisms. arXiv Preprint arXiv:2409.11445. https://arxiv.org/abs/2409.11445
- Bryan, P., Severi, G., de Gruyter, J., Jones, D., Bullwinkel, B., Minnich, A., Chawla, S., Lopez, G., Pouliot, M., Fourney, A., Maxwell, W., Pratt, K., Qi, S., Chikanov, N., Lutz, R., Dheekonda, R. S. R., Jagdagdorj, B.-E., Kim, E., Song, J., … Kumar, R. S. S. (2025). Taxonomy of Failure Modes in Agentic AI Systems. https://cdn-dynmedia-1.microsoft.com/is/content/microsoftcorp/microsoft/final/en-us/microsoft-brand/documents/Taxonomy-of-Failure-Mode-in-Agentic-AI-Systems-Whitepaper.pdf
- Bullwinkel, B., Minnich, A., Chawla, S., Lopez, G., Pouliot, M., Maxwell, W., de Gruyter, J., Pratt, K., Qi, S., Chikanov, N., Lutz, R., Dheekonda, R. S. R., Jagdagdorj, B.-E., Kim, E., Song, J., Hines, K., Jones, D., Severi, G., Lundeen, R., … Russinovich, M. (2025). Lessons From Red Teaming 100 Generative AI Products. arXiv Preprint arXiv:2501.07238. https://arxiv.org/abs/2501.07238
- Bullwinkel, B., Russinovich, M., Salem, A., Zanella-Beguelin, S., Jones, D., Severi, G., Kim, E., Hines, K., Minnich, A., Zunger, Y., & Kumar, R. S. S. (2025). A Representation Engineering Perspective on the Effectiveness of Multi-Turn Jailbreaks. arXiv Preprint arXiv:2507.02956. https://arxiv.org/abs/2507.02956
- Bullwinkel, B., Severi, G., Hines, K., Minnich, A., Kumar, R. S. S., & Zunger, Y. (2026). The Trigger in the Haystack: Extracting and Reconstructing LLM Backdoor Triggers. arXiv Preprint arXiv:2602.03085. https://arxiv.org/abs/2602.03085
- Chao, P., Robey, A., Dobriban, E., Hassani, H., Pappas, G. J., & Wong, E. (2023). Jailbreaking Black Box Large Language Models in Twenty Queries. arXiv Preprint arXiv:2310.08419. https://arxiv.org/abs/2310.08419
- Chao, P., Debenedetti, E., Robey, A., Andriushchenko, M., Croce, F., Sehwag, V., Dobriban, E., Flammarion, N., Pappas, G. J., Tramer, F., Hassani, H., & Wong, E. (2024). JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models. arXiv Preprint arXiv:2404.01318. https://arxiv.org/abs/2404.01318
- Chu, J., Yang, Z., Li, M., Leng, Y., Lin, C., Shen, C., Backes, M., Shen, Y., & Zhang, Y. (2023). HarmfulQA: A Benchmark for Robustly Evaluating Jailbreaks in Alignment Testing. arXiv Preprint arXiv:2310.18469. https://arxiv.org/abs/2310.18469
- Cui, J., Chiang, W.-L., Stoica, I., & Hsieh, C.-J. (2024). OR-Bench: An Over-Refusal Benchmark for Large Language Models. arXiv Preprint arXiv:2405.20947. https://arxiv.org/abs/2405.20947
- Derczynski, L., Galinkin, E., Martin, J., Majumdar, S., & Inie, N. (2024). garak: A Framework for Security Probing Large Language Models. arXiv Preprint arXiv:2406.11036. https://arxiv.org/abs/2406.11036