Large language models (LLMs) and LLM-based tools like Claude Code are now a part of how we do research. Used well, they make us faster and free us to focus on the harder, more interesting parts of a project. Used carelessly, they introduce subtle errors into our code, fabricate references in our papers, and erode the trust readers place in our work.
This page describes how we use these tools in our team. The underlying principle is simple: you are responsible for everything you ship, regardless of how it was produced. An LLM is a tool, like a compiler or a plotting library. We don’t blame compilers for bugs, and we don’t blame LLMs for mistakes in our papers.
This policy covers two modes of LLM use:
- Interactive use — chat-style LLMs and LLM-based coding assistants where you prompt, review the output, and decide what to do with it. This is how most of the team currently uses these tools, and most of this policy is about this mode.
- Agentic use — tools that take actions autonomously on your behalf, such as writing files, running commands, or querying resources without turn-by-turn approval. A smaller number of you are starting to use these, and they introduce additional considerations covered in the section below.
The policy is scoped to coding and writing, the two areas where LLM use is most common in the group today.
What we encourage
- Using LLMs as coding assistants. Claude Code is my current recommendation, but the landscape is moving fast so use what works well for you. Share what you’re learning on Slack (#programming is a good place). If you find a better tool, a useful workflow, or a prompt pattern that saves time, post it. We all benefit.
- Using LLMs for writing support. Drafting, rephrasing, tightening a paragraph, translating between registers, brainstorming outlines — all reasonable uses.
- Maintaining your CLAUDE.md (or equivalent) diligently. LLM-based coding tools compact their context as conversations grow. When the context is compacted, anything not captured in persistent files like CLAUDE.md is effectively lost, and you’ll find yourself re-explaining the same project details every session. Build up your CLAUDE.md as you go. Write skills (see Anthropic’s skill-creator docs) when you find yourself giving the same instructions repeatedly. Time invested here pays back many times over.
- Transparency. Be open about where and how you’re using LLMs — with me, with collaborators, and in your written work where relevant (see below).
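As a sketch, a minimal CLAUDE.md might look like the following. The project name, paths, and commands here are hypothetical, not a template you must follow:

```markdown
# Project: protein-embeddings (hypothetical example)

## Environment
- Python 3.11, virtual environment in `./.venv`; activate it before running anything.
- Run `pytest tests/` before committing.

## Conventions
- Plots are generated by scripts in `scripts/` and saved to `figures/`, never made by hand.
- Random seeds live in `config.yaml`; do not hard-code them.

## Things worth remembering
- `data/raw/` is read-only; write derived data to `data/processed/`.
```

The point is not the specific contents but the habit: whenever you catch yourself re-explaining something after a context compaction, move it into a file like this.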
What requires care
- Let the tool edit a few things at a time. Small, reviewable diffs are much easier to check than sweeping changes. If you can’t meaningfully review the edit, you can’t meaningfully take responsibility for it.
- Be cautious with design decisions. LLMs are often confidently wrong about architectural choices, algorithm selection, and tradeoffs specific to our domain. Treat LLM suggestions on design as input to your own thinking, not as answers. For any non-trivial design decision, come to me or a colleague before committing.
- Log what matters for reproducibility. If LLM output is a meaningful part of a result (e.g., a prompt that generated training data, a model call embedded in a pipeline), record the prompt, model name, and version in the same way we record random seeds. Future-you (and possible reviewers) will thank you.
- Confidential data stays out of hosted LLMs. Do not paste proprietary or unreleased data or files that you did not create into a third-party LLM. This includes, but is not limited to, data from industry collaborators (e.g., AstraZeneca), unpublished manuscripts or results from other groups, peer-review materials, student work you’re grading, and anything marked confidential. When in doubt, ask.
- Peer-review materials are off-limits. Most major venues (NeurIPS, ICML, ICLR, Nature, Science, and increasingly others) now explicitly prohibit pasting manuscripts under review into LLMs. Check the venue’s policy before you review, and default to “don’t” if unsure.
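To make the reproducibility point concrete, here is a minimal sketch of what logging an LLM call alongside a run could look like. The function name, field names, and file layout are assumptions for illustration, not an existing group tool:

```python
import datetime
import hashlib
import json


def log_llm_call(prompt, model, version, seed, path="llm_calls.jsonl"):
    """Append one record per LLM call, analogous to recording a random seed.

    All names here (log_llm_call, llm_calls.jsonl) are hypothetical.
    """
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "model": model,
        "model_version": version,
        "seed": seed,
        # A hash makes it easy to spot when the prompt changed between runs.
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "prompt": prompt,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

A line like `log_llm_call(prompt, "claude-sonnet", "2026-04", seed)` next to where you set your random seed is usually enough for future-you to reconstruct what happened.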
If you use agents
A subset of the group uses agent-based tools — systems that take multi-step actions (writing files, running commands, submitting jobs) with limited per-step human approval. This is a powerful mode of working, and worth learning. The flip side is that autonomy requires more upfront care in how you set things up, because by the time you notice something has gone wrong, the action has often already happened. If you use agents, these practices apply on top of everything above.
- Start in approval mode, earn autonomy. Most agent tools (including Claude Code) let you approve each action individually or run fully autonomously. Default to per-action approval. Move to more autonomous modes only after you’ve built trust through a series of runs you’ve reviewed carefully.
- Sandbox first. Don’t point a new agent workflow at shared compute, the group GitHub organization, or real data until you’ve validated it on a scratch directory, a fork, a test dataset, or a separate environment. This is basic hygiene.
- Set explicit budgets. Agents should have hard caps on iterations, tokens, and wall-clock time, so a runaway loop fails fast instead of quietly consuming shared quota.
- No autonomous actions on shared infrastructure without a dry run. Shared compute quotas, shared storage, the group GitHub org, anything deployed — the default is that agents propose, you review, you run. Promote a workflow from “proposes” to “runs autonomously” only once you’ve seen it behave correctly repeatedly and the consequences of mistakes are bounded.
- Mind credentials. Don’t give agents access to broad credentials (long-lived API keys, Chalmers login, cluster SSH keys, full-scope GitHub tokens). Use task-scoped, revocable credentials where possible.
- Capture what the agent actually did. If an agent runs experiments or produces code that feeds into a result, the log needs to capture the agent’s actions, not just its summary of them. “The agent says it trained the model” is not the same as “here is the training log.”
- Agents are still bound by every rule above. No peer-review materials. No confidential data. References must be verified by hand (see below). Accountability is still yours.
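The budget and action-capture points above can be sketched together as a simple wrapper around an agent loop. Everything here is a hypothetical illustration: `step_fn` stands in for whatever your tool does on one step, and the cap values are placeholders:

```python
import time


class BudgetExceeded(Exception):
    """Raised when an agent run hits one of its caps."""


def run_agent(step_fn, max_iterations=20, max_tokens=100_000, max_seconds=600):
    """Run an agent loop with hard caps on iterations, tokens, and wall time.

    `step_fn` is a hypothetical callable performing one agent step and
    returning (tokens_used, done, action_log_entry).
    """
    start = time.monotonic()
    tokens = 0
    actions = []  # capture what the agent actually did, not just its summary
    for i in range(max_iterations):
        if time.monotonic() - start > max_seconds:
            raise BudgetExceeded(f"time budget exceeded after {i} steps")
        used, done, entry = step_fn()
        tokens += used
        actions.append(entry)
        if tokens > max_tokens:
            raise BudgetExceeded(f"token budget exceeded: {tokens}")
        if done:
            return actions
    raise BudgetExceeded(f"iteration budget of {max_iterations} exhausted")
```

The returned `actions` list is the kind of artifact that satisfies the logging rule: a record of each step taken, which you can write to disk alongside the run.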
What is not negotiable
These are the firm rules. Please read them carefully.
- Verify every reference yourself. LLMs hallucinate citations constantly — author names, titles, journal names, years, DOIs, even BibTeX entries that look perfectly formatted can be fabricated. Providing the LLM with a URL does not fix this. There is already an epidemic of fake references in the scientific literature, and we are not going to contribute to it. For any manuscript from this group, references must be verified by hand — confirm the paper exists, confirm the authors and title match, confirm the DOI resolves to the right work, confirm the quoted claim is actually in the cited paper. This is one of the few parts of writing where LLM assistance is not permitted, full stop.
- You are accountable for everything you ship. If LLM-generated code introduces a bug into our codebase, that bug is yours. If LLM-generated text ends up in a paper with an error, that error is yours. If an agent acting on your behalf deletes the wrong files or pushes broken code, those actions are yours. “The LLM made a mistake” is not an explanation we accept, because the LLM is not a member of the group; you are. Proofread everything. Run the code. Verify the claim.
- Follow the venue’s disclosure rules. Many journals and conferences now require authors to declare LLM use. Check the specific venue’s policy before submission and comply with it. When in doubt, disclose and be transparent.
- Clear anything unusual with me first. If you’re considering a novel use of these tools that this policy doesn’t cover, talk to me before you start. Not because I’ll necessarily say no, but because these decisions may have implications beyond your own project.
On mistakes and reputation
Everyone makes mistakes, LLM-assisted or not. The issue is not that mistakes happen; it’s whether we catch them before they reach readers, reviewers, and collaborators. If you find yourself repeatedly shipping errors that a careful proofread would have caught, that’s a pattern we need to discuss. Your reputation as a careful scientist is one of the most valuable things you build during your time in the group. Do not put this reputation at risk by being sloppy with your use of AI tools.
A final note
Like everything in our group guide, this is a living document and will evolve as the tools and norms around them change.
Last updated: April 23rd, 2026