How does OpenAI's work to make GPT-4 "safer" stack up against the NIST AI Risk Management Framework?
In March, OpenAI released GPT-4, another milestone in a recent string of AI advances. It is OpenAI's most advanced model to date, and it has already been deployed widely to millions of users and businesses, with the potential for drastic effects across a range of industries.
But before releasing a new, powerful system like GPT-4 to millions of users, a key question is: "How do we know that this system is safe, trustworthy, and reliable enough to be released?" Currently, this is a question that leading AI labs are free to answer on their own, for the most part. But increasingly, the issue has garnered greater attention, as many have become worried that the current pre-deployment risk assessment and mitigation methods like those used by OpenAI are insufficient to prevent potential risks, including the spread of misinformation at scale, the entrenchment of societal inequities, misuse by bad actors, and catastrophic accidents.
This concern was at the core of a recent open letter, signed by several leading machine learning (ML) researchers and industry figures, calling for a six-month pause on the training of AI systems "more powerful" than GPT-4 in order to allow more time, before deployment, to develop robust standards that would ensure systems adhering to them are safe beyond a reasonable doubt. The letter prompted plenty of disagreement, from experts contesting its underlying narrative to others who think a pause is "a terrible idea" because it would needlessly halt beneficial innovation (not to mention being impossible to implement). But nearly everyone in this conversation tends to agree, pause or no, that the question of how to assess and manage an AI system's risks before actually deploying it is an important one.
A natural place to look for guidance is the National Institute of Standards and Technology (NIST), which released its AI Risk Management Framework (AI RMF) and an associated Playbook in January. NIST is leading the government's efforts to develop technical standards and consensus guidelines for managing risks from AI systems, and some cite its standard-setting work as a potential basis for future regulatory efforts.
In this piece, we walk through what OpenAI actually did to test and improve GPT-4's safety before deciding to release it, the limitations of this approach, and how it compares to current best practices recommended by the National Institute of Standards and Technology (NIST). We conclude with some recommendations for Congress, NIST, industry labs like OpenAI, and funders.
What did OpenAI do before deploying GPT-4?
OpenAI claims to have taken several steps to make their system "safer and more aligned." What are those steps? OpenAI describes them in the GPT-4 "system card," a document that outlines how OpenAI managed and mitigated GPT-4's risks before deployment. Here is a simplified version of that process:
- They brought in more than 50 "red teamers," outside experts across a range of domains, to test the model, poking and prodding it to find ways it might fail or cause harm. (Does it "hallucinate" in ways that could enable the cheap production of misinformation at scale? Does it produce biased or discriminatory outputs? Could it help bad actors produce harmful pathogens?)
- Where red teamers found ways the model could go off the rails, many of those instances of undesired output could be trained away via reinforcement learning from human feedback (RLHF), a process in which human evaluators provide feedback on the model's outputs (through human-written examples of how the model should respond to certain types of input, and through "thumbs up, thumbs down" ratings of model-generated outputs). The model is then adjusted so that it is more likely to give answers its evaluators scored highly, and less likely to give outputs that scored poorly; a minimal sketch of this feedback loop follows below.
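To make the RLHF loop described above more concrete, here is a toy sketch (our own illustration, not OpenAI's actual pipeline): evaluator thumbs-up/thumbs-down labels train a small reward model, and a "policy" over candidate responses is then shifted toward outputs the reward model scores highly. All data, features, and update rules here are illustrative assumptions.

```python
# Toy RLHF-style loop: ratings -> reward model -> policy shifted toward high reward.
import numpy as np

# 1. Candidate responses to a prompt, with thumbs-up (1) / thumbs-down (0) labels.
responses = [
    ("I can't help with that request.", 1),
    ("Sure, here is how to do something harmful...", 0),
    ("Here is a balanced, factual answer.", 1),
    ("Here is a confident but made-up answer.", 0),
]

vocab = sorted({w for text, _ in responses for w in text.lower().split()})

def featurize(text):
    """Bag-of-words features; a real reward model would use the LLM itself."""
    words = text.lower().split()
    return np.array([words.count(w) for w in vocab], dtype=float)

X = np.stack([featurize(t) for t, _ in responses])
y = np.array([label for _, label in responses], dtype=float)

# 2. Fit a tiny logistic-regression reward model on the evaluator labels.
w = np.zeros(X.shape[1])
for _ in range(500):
    preds = 1 / (1 + np.exp(-X @ w))
    w += 0.1 * X.T @ (y - preds)  # gradient ascent on the log-likelihood

reward = X @ w  # higher = predicted to be preferred by evaluators

# 3. "Policy" over the candidate responses: shift probability mass toward
#    high-reward outputs (a crude stand-in for the RL fine-tuning step).
policy = np.exp(reward) / np.exp(reward).sum()

for (text, _), p in zip(responses, policy):
    print(f"{p:.2f}  {text}")
```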
Is this enough?
While OpenAI says it substantially reduced the rate of undesired model behaviors through the process described above, the controls it put in place are not robust, and methods for mitigating bad model behavior remain leaky and imperfect.
OpenAI did not eliminate the risks they identified. The system card documents numerous failures of the current version of GPT-4, including one example in which it agrees to "generate a program calculating attractiveness as a function of gender and race."
Current efforts to measure risks also need work, according to GPT-4 red teamers. The Alignment Research Center (ARC), which assessed these models for "emergent" risks, put it this way: "the testing we've done so far is insufficient for many reasons, but we hope that the rigor of evaluations will scale up as AI systems become more capable." Another GPT-4 red teamer, Aviv Ovadya, says: "If red-teaming GPT-4 taught me anything, it is that red teaming alone is not enough." Ovadya recommends that future pre-deployment risk assessment efforts be improved with "violet teaming," in which companies identify "how a system (e.g., GPT-4) might harm an institution or the public good, and then support the development of tools using that same system to defend the institution or public good."
Since current efforts to measure and mitigate the risks of advanced systems are not perfect, the question comes down to when they are "good enough." What levels of risk are acceptable? Today, industry labs like OpenAI can mostly rely on their own judgment when answering this question, but there are many different standards that could be used. Amba Kak, the executive director of the AI Now Institute, suggests a more stringent standard, arguing that regulators should require AI companies to demonstrate that their systems will not cause harm before release. Meeting such a standard would require new, more systematic approaches to risk management and measurement.
How do OpenAI's efforts map onto NIST's Risk Management Framework?
The core of NIST's AI RMF consists of four main "functions," broad outcomes which AI developers can aim for as they develop and deploy their systems: map, measure, manage, and govern.
Framework users can map the overall context in which a system will be used to determine the relevant risks that should be "on their radar" in that identified context. They can then measure identified risks quantitatively or qualitatively, and finally manage them, acting to mitigate risks based on projected impact. The govern function is about having a well-functioning culture of risk management to support effective implementation of the other three functions.
Looking back at OpenAI's process before releasing GPT-4, we can see how their actions line up with each of the functions in the RMF Core. This is not to say that OpenAI applied the RMF in its work; we are simply trying to assess how their efforts align with the RMF.
- They first mapped risks by identifying the domains for red teamers to investigate, based on areas where language models have caused harm in the past and areas that intuitively seemed likely to be particularly affected.
- They aimed to measure these risks largely through the qualitative "red teaming" efforts described above, though they also describe internal quantitative evaluations for certain risks, such as the rate at which the model produces "hate speech" or "advice about self-harm" (a rough sketch of what such an evaluation could look like appears after this list).
- And to manage these risks, they relied on reinforcement learning from human feedback, along with other interventions such as shaping the original training dataset and some explicitly "programmed" behaviors that do not depend on behavior training via RLHF.
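As referenced above, a category-level quantitative risk evaluation could look something like the following sketch (our own illustration, not OpenAI's internal methodology): run a fixed prompt set through the model, flag disallowed outputs per category, and report the rates. The `generate` and `is_disallowed` functions are placeholders standing in for a real model and a real content classifier or human annotation step.

```python
# Sketch of a per-category quantitative risk evaluation with placeholder model/classifier.
PROMPT_SET = {
    "hate_speech": ["prompt A1", "prompt A2"],        # prompts probing hateful content
    "self_harm_advice": ["prompt B1", "prompt B2"],   # prompts probing self-harm advice
}

def generate(prompt: str) -> str:
    """Placeholder for calling the model under evaluation."""
    return f"model response to: {prompt}"

def is_disallowed(category: str, response: str) -> bool:
    """Placeholder for a content classifier or human annotation."""
    return False

def evaluate(prompt_set):
    """Return the fraction of flagged outputs for each risk category."""
    rates = {}
    for category, prompts in prompt_set.items():
        flagged = sum(is_disallowed(category, generate(p)) for p in prompts)
        rates[category] = flagged / len(prompts)
    return rates

if __name__ == "__main__":
    for category, rate in evaluate(PROMPT_SET).items():
        print(f"{category}: {rate:.1%} of sampled outputs flagged")
```

Tracking such rates before and after mitigations is one way a lab could show, quantitatively, whether an intervention like RLHF actually reduced a given category of undesired behavior.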
Some of the specific actions described by OpenAI are also laid out in the Playbook. The Measure 2.7 function highlights "red-teaming" activities as a way to assess an AI system's "security and resilience," for example.
NIST’s resources provide a helpful overview of considerations and best practices that can be taken into account when managing AI risks, but they are not currently designed to provide concrete standards or metrics by which one can assess whether the practices taken by a given lab are “adequate.” In order to develop such standards, more work would be needed. To give some examples of current guidance that could be clarified or made more concrete:
- NIST suggests that AI actors "regularly evaluate failure costs" throughout the AI system lifecycle "to inform go/no-go deployment decisions." How regularly is "regularly"? How much "failure cost" is too much? Some of this will depend on the end use case, since our risk tolerance for a sentiment analysis model is presumably far higher than for a medical decision support system.
- NIST suggests that AI developers aim to understand and document the "intended purposes, potentially beneficial uses, context-specific laws, norms and expectations, and prospective settings in which the AI system will be deployed." For a system like GPT-4, which is being deployed broadly and may have use cases across many domains, it is unclear at what level of abstraction this should be done.
- NIST recommends that AI actors "determine whether the AI system achieves its intended purpose and stated objectives, and whether its development or deployment should proceed." Again, this is hard to pin down: what is the intended purpose of a large language model like GPT-4? Its creators often do not expect to know the full range of its potential use cases at release, posing a further challenge for making this kind of determination.
- NIST describes explainability and interpretability as core features of trustworthy AI systems. OpenAI does not describe GPT-4 as explainable. The model can be prompted to generate explanations of its outputs, but we do not know whether those model-generated explanations actually reflect the internal processes the system used to produce the outputs in question.
Thus, while it is debatable whether the "outcomes" described in NIST's AI RMF have been achieved, there is nothing stopping developers from going beyond the bare minimum (and we believe they should). This is not a flaw of the framework as currently designed, but a feature, since the RMF "does not prescribe risk tolerance." However, it is important to note that more work is needed to establish both stricter guidelines that leading labs can follow to mitigate risks from leading AI systems, and concrete standards and methods for measuring risk on top of which regulations could be built.
Recommendations
There are several ways that standards for pre-deployment risk assessment and mitigation of frontier AI systems could be improved:
Congress
- Congress should provide NIST with additional funding to expand its capacity to work on risk measurement and management for frontier AI systems.
NIST
- Industry best practices: With additional funding, NIST could provide more detailed guidance based on industry best practices for measuring and managing risks of frontier AI systems, for example by collecting and comparing the efforts of leading AI developers. NIST could also look for ways to get "ahead of the curve" on risk management practices, rather than just cataloging existing industry practice, for example by exploring newer, less well-tested practices such as violet teaming.
- Metrics: NIST could also provide more concrete metrics and benchmarks by which to assess whether the functions in the RMF have been adequately achieved.
- Test beds: Section 10232 of the CHIPS and Science Act authorized NIST to "establish testbeds […] to support the development of robust and trustworthy artificial intelligence and machine learning systems." With additional funds appropriated, NIST could develop a centralized, voluntary set of test beds to assess frontier AI systems for risks, thereby encouraging more rigorous pre-deployment model evaluations. Such efforts could build on existing language model evaluation techniques, e.g. the Holistic Evaluation of Language Models from Stanford's Center for Research on Foundation Models; a minimal sketch of what such a harness could look like is included below.
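As a rough illustration of the test-bed idea above, the sketch below shows a tiny shared evaluation harness in the spirit of HELM-style benchmarks (this is our own simplification; HELM's actual scenarios, metrics, and API differ). Each scenario pairs prompts with reference answers, and every registered model is scored on simple exact-match accuracy; a real test bed would add many more metrics (calibration, robustness, toxicity, etc.) and real model backends.

```python
# Minimal shared test-bed harness: scenarios x models -> scores.
from typing import Callable, Dict, List, Tuple

Scenario = List[Tuple[str, str]]  # (prompt, reference answer)

SCENARIOS: Dict[str, Scenario] = {
    "question_answering": [("What is 2 + 2?", "4")],
    "summarization": [("Summarize: The cat sat on the mat.", "A cat sat on a mat.")],
}

def exact_match(prediction: str, reference: str) -> float:
    """Crude stand-in for real metrics such as accuracy, calibration, or toxicity."""
    return float(prediction.strip().lower() == reference.strip().lower())

def run_testbed(models: Dict[str, Callable[[str], str]]) -> Dict[str, Dict[str, float]]:
    """Score every registered model on every scenario."""
    results: Dict[str, Dict[str, float]] = {}
    for model_name, model_fn in models.items():
        results[model_name] = {}
        for scenario_name, examples in SCENARIOS.items():
            scores = [exact_match(model_fn(p), ref) for p, ref in examples]
            results[model_name][scenario_name] = sum(scores) / len(scores)
    return results

if __name__ == "__main__":
    # Stub model standing in for an API call to a system under evaluation.
    toy_model = lambda prompt: "4" if "2 + 2" in prompt else "A cat sat on a mat."
    print(run_testbed({"toy-model": toy_model}))
```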
Industry labs
- Leading industry labs should aim to give government standard-setters like NIST more insight into how they manage the risks of their AI systems, including clearly outlining their safety practices and mitigations (as OpenAI did in the GPT-4 system card), how well those methods work in practice, and the ways they could still break down in the future.
- Labs should also aim to incorporate more public feedback into their risk management processes, to determine what levels of risk are acceptable when deploying systems with broad public impact.
- Labs should aim to go beyond the NIST AI RMF 1.0. This would further help NIST evaluate new risk management strategies that are not part of the current RMF but could belong in an RMF 2.0.
Funders
- Government funders such as the NSF, along with private philanthropic grantmakers, should provide funding for researchers to develop metrics and techniques for assessing and mitigating the risks of frontier AI systems. Few researchers currently focus on this work, and funders could grow the field by encouraging more research on risk management practices and metrics for frontier AI systems.
- Funders should also make grants for AI projects in line with the current best practices described in the NIST AI RMF.