From 0% to 36% on Day 1 of ARC-AGI-3

(symbolica.ai)

56 points | by lairv 3 hours ago

4 comments

  • lairv 3 hours ago
    Note that this uses a harness so it doesn't qualify for the official ARC-AGI-3 leaderboard

    According to the authors, the harness isn't ARC-AGI-specific, though: https://x.com/agenticasdk/status/2037335806264971461

    • fchollet 49 minutes ago
      It is 100% ARC-AGI-3-specific though; just read through the prompts: https://github.com/symbolica-ai/ARC-AGI-3-Agents/blob/symbol...
      • DetroitThrow 2 minutes ago
        Um, yes, this is an extremely benchmark-specific harness. It has a ton of knowledge encoded about the tasks at hand. The tweet is dishonest even in the best light.
    • mmaunder 39 minutes ago
      We're calling agents harnesses now?
      • fritzo 16 minutes ago
        ELI5 what is a harness?

        EDIT from https://arcprize.org/media/ARC_AGI_3_Technical_Report.pdf:

        > We seek to fight two forms of overfitting that would muddy public sensefinding:

        > Task-specific overfitting. This includes any agent that is created with knowledge of public ARC-AGI-3 environments, subsequently being evaluated on the same environments. It could be either directly trained on these environments, or using a harness that is handcrafted or specifically configured by someone with knowledge of the public environments.
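
        To make the "harness" idea concrete: it's a hand-written control loop around a model that decides what the model sees and which actions it may take. A minimal sketch follows; all names here (ToyEnv, model_choose_action, LEGAL_ACTIONS) are hypothetical stand-ins, not the actual Symbolica or ARC-AGI-3 code.

        ```python
        # Minimal sketch of an agent "harness": a hand-authored loop around a
        # model. The harness controls the action space and the observation
        # formatting, which is where task-specific knowledge can leak in.
        LEGAL_ACTIONS = ["up", "down", "left", "right", "click"]

        def model_choose_action(observation, history):
            """Stand-in for an LLM call; a real harness would prompt a model
            with the observation plus harness-authored instructions."""
            return LEGAL_ACTIONS[len(history) % len(LEGAL_ACTIONS)]

        class ToyEnv:
            """Trivial environment for demonstration only."""
            def reset(self):
                self.t = 0
                return "start"
            def step(self, action):
                self.t += 1
                return f"state-{self.t}", self.t >= 3  # done after 3 steps

        def run_harness(env, max_steps=10):
            """Generic loop: observe, ask the model, act, repeat until done."""
            history = []
            obs = env.reset()
            for _ in range(max_steps):
                action = model_choose_action(obs, history)
                history.append(action)
                obs, done = env.step(action)
                if done:
                    break
            return history

        print(run_harness(ToyEnv()))  # ['up', 'down', 'left']
        ```

        The overfitting concern above is about how much environment-specific knowledge gets baked into the prompt and the action set, not about the loop itself.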

    • krackers 1 hour ago
      > this uses a harness

      This seems like an arbitrary restriction. Tool-use requires a harness, and their whitepaper never defines exactly what counts as valid.

    • osti 1 hour ago
      Don't the chat versions of ChatGPT and Gemini also have interleaved tool calls? Do those also count as having harnesses?
    • falcor84 3 hours ago
      I for one think that harness development is perhaps the most interesting part at the moment and would love to have an alternative leaderboard with harnesses.
      • sanxiyn 3 hours ago
        There is. The official leaderboard is without harnesses, and the community leaderboard is with harnesses. Read the ARC-AGI-3 Technical Paper for details.
        • falcor84 3 hours ago
          I went through the technical paper again, and while they explain why they decided against the harness, I disagree with them - my take is that if harnesses are overfitting, then they should be penalized on the hidden test set.

          Anyway, searching both ARC-AGI's paper and website, and directly on Kaggle, I failed to find a with-harness leaderboard; can you please give the link?

      • steve_adams_86 1 hour ago
        I'm so into harness development right now. Once it clicked that harnesses can bring more safety and determinism to LLMs, I started to wonder where I'd need that and why (vs MCP or just throwing Claude Code at everything), and my brain gears have been turning endlessly since then. I'd love to see more of what people do with them. My use cases are admittedly lame and boring, but it's such a fun paradigm to think and develop around.
  • modeless 1 hour ago
    On the public set of 25 problems. These are intended for development and testing, not evaluation. There are 110 private problems for actual evaluation purposes, and the ARC-AGI-3 paper says "the public set is materially easier than the private set".
    • SchemaLoad 1 hour ago
      Benchmarks on public tests are too easy to game. The model owners can just incorporate the answers into the dataset. Only the private problems actually matter.
      • sanxiyn 1 hour ago
        In this case the code is public, and you can see they are not cheating in that sense.
        • Davidzheng 1 hour ago
          I agree it's not cheating in that restricted sense. But I'm not really convinced that it can't be cheating in a more general sense. You can try, say, 10^10 variations of harnesses and select the one that performs best. If you then look at it, it probably won't look like cheating. But you have biased the estimator by selecting the harness according to its score.
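
          This selection-bias effect is easy to demonstrate with a quick simulation (all numbers hypothetical, not taken from the ARC-AGI-3 setup): give many "harnesses" identical true skill, score each on a small public set, and pick the best.

          ```python
          # Selecting the best of many equally-skilled harnesses inflates the
          # reported score: the max of many noisy estimates is biased upward.
          import random

          random.seed(0)
          TRUE_SKILL = 0.2   # every harness solves each task with p = 0.2
          N_TASKS = 25       # small public set, like ARC-AGI-3's 25 problems
          N_HARNESSES = 1000

          def eval_harness(n_tasks=N_TASKS, p=TRUE_SKILL):
              """Score one harness: fraction of tasks solved by chance."""
              return sum(random.random() < p for _ in range(n_tasks)) / n_tasks

          scores = [eval_harness() for _ in range(N_HARNESSES)]
          avg, best = sum(scores) / len(scores), max(scores)
          print(f"average score {avg:.2f} vs best-of-{N_HARNESSES} {best:.2f}")
          # The average sits near the true skill of 0.20, while the selected
          # "best" harness scores far higher despite having no real edge.
          ```

          The same logic is why a score obtained by tuning against the public set says little about the private set.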
        • DetroitThrow 0 minutes ago
          The harness seems extremely benchmark-specific, which gives them a huge advantage over what most models can use. This isn't a qualifying score for that reason.

          Here is the ARC-AGI-3-specific harness, by the way - lots of challenge information encoded inside: https://github.com/symbolica-ai/ARC-AGI-3-Agents/blob/symbol...

        • SchemaLoad 1 hour ago
          Once the model has seen the questions and answers in the training stage, the questions are worthless. Only a test using previously unseen questions has merit.
          • lambda 1 hour ago
            They aren't training new models for this. This is an agent harness for Opus 4.6.
            • measurablefunc 1 hour ago
              All traffic is monitored, all signal sources are eventually incorporated into the training set in one way or another. The person you're responding to is correct, even a single API call to any AI provider is sufficient to discount future results from the same provider.
              • stale2002 1 hour ago
                OK! So if someone uses an existing, checkpointed, open-source model, then the answer is yes: the results are valid, and it doesn't matter that the tests are public.
                • measurablefunc 38 minutes ago
                  Yes, assuming the checkpoint was before the announcement & public availability of the test set.
  • esafak 3 hours ago
    Anybody used this Agentica of theirs?
  • AbanoubRodolf 1 hour ago
    [dead]