Special Feature

5Y View | Welcome to the Era of Evals

(Reposted from: 5Y Capital)

Recommended by

Shi Yunfeng (石允丰), Executive Director at 5Y Capital

Scaling laws have let us live off the internet's free data. Rebuilding GPT-3's 300-billion-token training corpus by hand would take on the order of tens of thousands of years at a normal human writing speed. That data was enough to crack the classic NLP tasks, yet not enough to turn models into general-purpose, reliable agents. Even if we trained GPT-4o on all the text produced before the internet existed, the data would still fall far short, no matter how much compute we had.

Looking back at the "bitter lesson" of the past twenty years, architectural improvements have only ever been small steps, while data-driven innovation has had an outsized impact. I expect this trend to continue. Today's reinforcement learning (RL) is still in a similar "pre-GPT-3" regime: RL datasets are tiny. DeepSeek-R1 used only about 600,000 math problems for its RL training; if each problem took a human five minutes, that amounts to roughly six years of continuous labor.
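As a rough sanity check on those two figures, here is a back-of-envelope calculation; the sustained writing speed and tokens-per-hour figure are illustrative assumptions of mine, not numbers from the article.

```python
# Back-of-envelope check on the two data-scale claims above.
# Assumption (mine): a person produces ~1,300 tokens of original text per hour
# (roughly 1,000 words per hour) and keeps writing around the clock.
GPT3_TOKENS = 300e9               # ~300B training tokens cited for GPT-3
TOKENS_PER_HOUR = 1_300
HOURS_PER_YEAR = 24 * 365

writing_years = GPT3_TOKENS / TOKENS_PER_HOUR / HOURS_PER_YEAR
print(f"Hand-writing GPT-3's corpus: ~{writing_years:,.0f} years of nonstop writing")
# -> roughly 26,000 years, i.e. "tens of thousands of years"

R1_PROBLEMS = 600_000             # math problems reportedly used for DeepSeek-R1's RL
MINUTES_PER_PROBLEM = 5           # assumed human solving time per problem

labor_years = R1_PROBLEMS * MINUTES_PER_PROBLEM / 60 / HOURS_PER_YEAR
print(f"Solving the R1 RL set by hand: ~{labor_years:.1f} years of continuous labor")
# -> about 5.7 years, consistent with "roughly six years"
```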

Welcome to the Era of Evals

Author: Brendan Foody

Original post: https://mercor.com/blog/welcome-to-the-era-of-evals/

Reinforcement learning (RL) is driving the most exciting advances in AI. RL is becoming so effective that models will soon saturate any evaluation we put in front of them, which means the primary barrier to deploying agents across the entire economy is building evals for everything. Yet AI labs are facing an eval famine: the academic benchmarks that labs optimize against are badly disconnected from the tasks consumers and enterprises actually need.

Evals are the new PRD. Progress in accelerating knowledge work will converge on building environments and evaluations that map real workspaces and deliverables. This new RL-centric paradigm of human data is far more data-efficient than pretraining, supervised fine-tuning (SFT), or RLHF. Most knowledge work consists of recurring workflows that are paid for again and again as a variable cost; encapsulating such a workflow as an environment or an evaluation turns it into a one-time fixed cost.
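To make "evals are the new PRD" concrete, here is a minimal sketch of how a recurring deliverable might be codified once as an evaluation spec. The class, field names, and toy grader are hypothetical illustrations, not an API from Mercor or from the article.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class EvalSpec:
    """A one-time codification of a recurring knowledge-work deliverable."""
    task_id: str
    instructions: str                                        # what a competent hire would be told
    context_files: list[str] = field(default_factory=list)   # proprietary inputs the work depends on
    rubric: list[str] = field(default_factory=list)          # what "good" looks like, per experts
    grade: Callable[[str], float] = lambda output: 0.0       # autograder returning a score in [0, 1]

# Example: an investment-memo workflow captured once as a fixed-cost eval spec.
memo_eval = EvalSpec(
    task_id="series-a-memo-001",
    instructions="Draft an investment memo for the attached data room.",
    context_files=["dataroom/financials.csv", "dataroom/deck.pdf"],
    rubric=[
        "States the core thesis in the first paragraph",
        "Quantifies the market size with cited sources",
        "Lists at least three concrete risks",
    ],
    grade=lambda output: sum(k in output.lower() for k in ("thesis", "risk")) / 2,
)

print(memo_eval.grade("Our thesis: ... Key risks: ..."))  # toy grader -> 1.0
```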

Training on Verifiable Rewards

RL environments allow both final outcomes and intermediate steps to be rewarded. A model takes many attempts at the same problem, using test-time compute to "think" before it answers. Human-written autograders reward the attempts whose reasoning was "good", and reinforcing on those good trajectories upweights the chains of thought that led to the answer. This is how models learn to think correctly about different kinds of problems as researchers iteratively hill-climb on the evals.
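A minimal sketch of that loop, using a toy stand-in for the model and a trivial autograder; the class and function names are hypothetical, and a real system would plug in an LLM and a policy-gradient or rejection-sampling update.

```python
import random

class ToyModel:
    """Stand-in for a policy model; a real system would sample from an LLM."""
    def generate(self, problem: str) -> tuple[str, str]:
        guess = str(random.randint(0, 9))
        return f"Let me try {guess}.", guess          # (chain of thought, final answer)

def autograde(answer: str, reference: str) -> float:
    """Human-written checker: reward 1.0 only when the final answer is right."""
    return 1.0 if answer.strip() == reference.strip() else 0.0

def collect_good_trajectories(model, problem: str, reference: str, attempts: int = 16):
    """Take many attempts at one problem; keep the trajectories the grader rewards."""
    good = []
    for _ in range(attempts):
        thought, answer = model.generate(problem)     # test-time "thinking", then an answer
        if autograde(answer, reference) > 0:
            good.append((problem, thought, answer))   # a trajectory worth upweighting
    return good

kept = collect_good_trajectories(ToyModel(), "What is 3 + 4?", "7")
print(f"kept {len(kept)} rewarded trajectories out of 16 attempts")
# A policy-gradient or rejection-sampling update on `kept` would upweight
# the chains of thought that reached the rewarded answer.
```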

These environments fall roughly along a spectrum of verifiability:

·Objective domains

Games such as Pac-Man, chess, and Go have clear state spaces, action spaces, and desired outcomes. Math, code, and even some tasks in biology can often be formulated with near game-like verifiability. This is where RL has already scored its big early wins: AlphaProof, AlphaFold, DeepSeek-R1, and the many code-generation models on the market today.
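For objective domains the grader can often be exact. Below is an illustrative sketch of two such autograders, one checking a numeric math answer and one running generated code against unit tests; in practice the generated code would be executed in a sandbox.

```python
import math

def grade_math(answer: str, expected: float, tol: float = 1e-6) -> float:
    """Verifiable math reward: parse the model's final answer and compare exactly."""
    try:
        return 1.0 if math.isclose(float(answer), expected, abs_tol=tol) else 0.0
    except ValueError:
        return 0.0                                    # unparsable answer earns no reward

def grade_code(src: str, tests: list[tuple[tuple, object]], func_name: str) -> float:
    """Verifiable code reward: run the generated function against unit tests."""
    namespace: dict = {}
    try:
        exec(src, namespace)                          # run model-written code (sandbox in practice!)
        fn = namespace[func_name]
        passed = sum(fn(*args) == want for args, want in tests)
        return passed / len(tests)
    except Exception:
        return 0.0

print(grade_math("2.0", 2.0))                                      # -> 1.0
print(grade_code("def add(a, b):\n    return a + b",
                 [((1, 2), 3), ((5, 5), 10)], "add"))               # -> 1.0
```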

·Subjective domains

In many real-world tasks, accuracy is hard to quantify, for example writing investment memos, preparing legal briefs, or providing therapy, which makes it hard to verify that a model achieved the desired outcome. Experts, moreover, often hold several equally valid views of the ideal process and result. Rubric-based rewards are a way to learn from the messiness of expert human opinion, and how to evaluate and train with rubrics as environments is a promising research direction whose early roots trace back to Anthropic's Constitutional AI and RLAIF work.
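A minimal sketch of a rubric-based reward, assuming some judge (another model or a human rater) scores one criterion at a time; the rubric items and the judge interface are illustrative, not the specific recipe used in Constitutional AI or RLAIF.

```python
from typing import Callable

# Rubric for a subjective deliverable, written by domain experts.
MEMO_RUBRIC = [
    "The memo states a clear investment thesis.",
    "Key risks are identified and weighed honestly.",
    "Claims about the market are supported by evidence.",
]

def rubric_reward(output: str, rubric: list[str],
                  judge: Callable[[str, str], float]) -> float:
    """Average per-criterion scores in [0, 1] into one scalar reward."""
    scores = [judge(output, criterion) for criterion in rubric]
    return sum(scores) / len(scores)

# Toy judge so the sketch runs end to end; a real judge would be an LLM or an expert.
toy_judge = lambda output, criterion: 1.0 if "thesis" in output.lower() else 0.5

print(rubric_reward("Our thesis is ...", MEMO_RUBRIC, toy_judge))   # -> 1.0
```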

·Computer-use agents: somewhere in between

For most of the tasks humans do on computers, the goals start out ambiguous and multi-faceted, but once the goal is defined, the actions and outcomes can be verified programmatically. Such tasks include planning trips, replying to email, shopping online, and posting on social media. In all of these cases, containerized environments let thousands of interactions run in parallel and be learned from online, with almost unlimited horizontal scaling.
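A sketch of that horizontal scaling, with an in-process stub standing in for a containerized session exposing a reset/step interface; the class and method names are hypothetical, and a real system would launch actual containers rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor
import random

class ContainerEnv:
    """Stand-in for one containerized browser or desktop session."""
    def __init__(self, task: str):
        self.task = task
    def reset(self) -> str:
        return f"fresh session for: {self.task}"
    def step(self, action: str) -> tuple[str, float, bool]:
        # Observation, programmatic reward, done flag; all stubbed here.
        return "page state", random.random(), True

def rollout(task: str) -> float:
    """One online interaction: the agent acts until the episode ends."""
    env = ContainerEnv(task)
    obs, total, done = env.reset(), 0.0, False
    while not done:
        action = "click-something"            # a real agent would choose actions from obs
        obs, reward, done = env.step(action)
        total += reward
    return total

tasks = [f"book trip #{i}" for i in range(1000)]    # thousands of parallel interactions
with ThreadPoolExecutor(max_workers=64) as pool:
    rewards = list(pool.map(rollout, tasks))
print(f"collected {len(rewards)} episodes, mean reward {sum(rewards)/len(rewards):.2f}")
```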

Environments Create Experience

Eventually, our AI systems will learn automatically from signals in the real world: a student's test scores going up, a sale closing, perhaps even a bridge getting built. But intermediate rewards will always remain critical. Just as humans learn from other people, models will need guidance about which styles of teaching or which sales techniques actually work, and humans will remain an irreplaceable part of the environments models learn from.

We will never escape the era of data; it has to follow us to the frontier. That frontier is human-built environments that provide a durable supply of experiential data, environments that serve both to train and to evaluate models.

The Path Forward

Meeting today's hunger for data requires rethinking how we extract signal from human effort. Building evals and RL environments is the highest-leverage, most durable use of people's time. Mercor has helped pioneer environment generation with autograders and continues to push the boundaries of RL data along the dimensions of simulated workspaces, multi-turn interaction, and multimodality.

Knowledge work will quickly converge on one thing: building RL environments and evaluations for agents to learn and improve in. As AI enters the workforce, handles proprietary information, and operates in specialized professional contexts, these environments codify knowledge and goals for the agents. Once each step of an agentic workflow is reliable enough, the only task left will be RL training on the goals laid out by humankind.

The original English text follows:

Reinforcement Learning (RL) is driving the most exciting advancements in AI. RL is becoming so effective that models will be able to saturate any evaluation. This means that the primary barrier to applying agents to the entire economy is building evals for everything. However, AI labs are facing a dire shortage of relevant evaluations. Academic evaluations that labs goal on don't reflect what consumers and enterprises demand in the economy.

Evals are the new PRD. Progress in accelerating knowledge work will converge on building environments and evaluations that map real workspaces and deliverables. This new RL-centric paradigm of human data is vastly more data efficient than pretraining, SFT, or RLHF. Most knowledge work includes recurring workflows as variable costs, but creating an environment or evaluation can transform that into a one-time fixed cost.

Training on Verifiable Rewards

RL environments allow for rewarding outcomes and intermediate steps in an evaluation. Models take many attempts at a problem, using test-time compute to "think" before they answer. Human-created autograders reward the attempts which were "good". Reinforcing on those "good" trajectories upweights the chains of thought that were used to get to the answer. This teaches models to think correctly about different types of problems as researchers iteratively hill climb evals.

These environments can be thought of as existing on a spectrum of rigidity between two categories:

Objective domains: Games, like pac-man, chess, and Go, have clear state spaces, action spaces, and desired outcomes. Math, code, and even some tasks in biology can often be formulated with near game-like verifiability. This is where RL has achieved early massive success already, notably AlphaProof, AlphaFold, DeepSeek R1, and the many code generation models on the market today.

Subjective domains: It's more difficult to measure accuracy in many real-world tasks such as generating investment memos, making legal briefs, or providing therapy. This makes it difficult to verify that a model achieved desired outcomes. Additionally, experts often support multiple valid opinions about desired processes and outcomes. Rubric-based rewards serve as a way to learn from the messiness of expert human opinions. How to evaluate and train with rubrics as environments is an exciting area of research with roots laid as early as the constitutional AI and RLAIF work from Anthropic.

Computer-use agents sit somewhere in the middle. For most of the tasks humans do on computers, goals start to become ambiguous and multi-faceted. Once defined, the actions and outcomes are programmatic and verifiable. These could include planning trips, responding to emails, shopping, or posting on social media. In all of these cases, containerized environments allow for horizontal scaling to learn online from thousands of interactions in parallel.

Environments Create Experience

Eventually, our AI systems will learn automatically from signals in the real world like pupils' test scores increasing, sales closing, maybe even bridges being built. However, intermediate rewards will always remain critical. Similar to how humans learn from other people, models will need guidance on which styles of teaching and sales techniques are most effective. Humans will remain an integral part of the environments models learn from.

We will never escape the era of data; it must follow us to the frontier. That frontier is human created environments that provide durable sources of experiential data. These environments can serve to train and evaluate models.

The Path Forward

Meeting today's data demand requires rethinking the way we generate signal from human efforts. Creating evals and RL environments is the highest leverage and most durable use of people's time. Mercor has helped pioneer environment generation using autograders and continues to push the boundaries of RL data with simulated workspaces, multi-turn support, and multi-modality.

Knowledge work will quickly converge on building RL environments and evaluations for agents to learn from. As AI enters the workforce and operates over proprietary information and under unique professional contexts, these environments codify knowledge and goals for agents. Once individual steps of agentic workflows reach sufficient reliability, all that will be left will be RL training on the goals laid out by humankind.