By David Tena Cucala
Review Details
Reviewer has chosen not to be Anonymous
Overall Impression: Weak
Content:
Technical Quality of the paper: Average
Originality of the paper: Yes, but limited
Adequacy of the bibliography: Yes, but see detailed comments
Presentation:
Adequacy of the abstract: Yes
Introduction: background and motivation: Bad
Organization of the paper: Satisfactory
Level of English: Satisfactory
Overall presentation: Weak
Detailed Comments:
This paper introduces a novel neuro-symbolic method for a prediction task based on a simplification of the Raven’s Progressive Matrices (RPM) intelligence test. The symbolic descriptions of eight geometric figures (of varying shape, colour, etc.) are provided, and the goal is to predict the ninth figure by inferring the underlying pattern. The proposed method uses a differentiable rule-learning approach: the attributes of the known figures are embedded in a high-dimensional vector space, and the system learns a vector transformation that produces the prediction; this transformation can be interpreted as a rule. The authors show that the prediction accuracy of the system surpasses state-of-the-art methods based on large language models.
Overall, the article is written in good English, is well-structured, and appears to be technically sound. In my opinion, however, there are three important weaknesses.
First, the presentation of the paper is often lacking in detail and intuition, which makes it difficult to fully comprehend and evaluate the proposed approach. For example, the space of rules is defined in equations (2) and (3), but the definitions of x_i and o_j are only given informally, in lines 21-28 of page 8 and through an example; I found this insufficient to understand these definitions. Furthermore, no intuition is given for why equation (2) is chosen as the general shape of the rules. It would be important to expand on this and discuss the expressive power of this rule language; in particular, to establish whether it is expressive enough to capture the rules used to generate the dataset. For more examples of parts of the paper that could be clarified, please see the detailed comments below.
Second, the contribution of the work seems comparatively small. The abstract suggests that the main contribution of the paper is an experimental comparison between the neuro-symbolic approach ARLC and large-language-model-based approaches to the task described above. Reading the Aims and Scope section of the Neurosymbolic Artificial Intelligence journal, it is unclear to me whether experimental reports alone are regarded as a sufficient contribution.
Third, the motivation for comparing ARLC with large language models is unclear. In this setting, the space of rules used to generate the patterns appears to be known (even if the rules themselves are not), so why not apply a standard Inductive Logic Programming (ILP) approach? In fact, it is surprising that ILP is not mentioned at all in the paper, and that no ILP systems are used in the experiments, even though ILP methods would appear to be the most naturally suited to the given task.
My overall recommendation, assuming the journal is amenable to publishing work focused on experimental reporting, would be to accept a revised version of the paper that provides additional clarity and intuitions (please see list below), and develops the motivation further.
Detailed Comments:
--I am also confused about what “constellation” means. In particular, what are the 2x2, 3x3, and center constellations? Without explaining this, the task description is quite difficult to follow. It is also hard to evaluate the article’s decision to focus on the center constellation only.
--I am missing a discussion of whether the original goals of the test are lost when the images are translated to symbolic descriptions. The symbolic description could be seen as a simplification of the task, since it explicitly lists the attributes of the figures that are relevant to the task (in contrast to the original task, where no such list of attributes is provided and, in fact, figuring out which attributes are relevant is a non-trivial reasoning task).
--To make the paper self-contained and the prediction task understandable, it would be important to explain the four rules used to generate the figures in the right column of the matrices. The text simply names them as “constant”, “progression”, “arithmetic”, and “distribute three”. One can later deduce them from the definition of the 3x10 extension of the RAVEN problem, but it would be very helpful to describe them explicitly.
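For concreteness, here is my own understanding of these four rules, based on the RAVEN dataset literature rather than on this paper (the authors should confirm whether these definitions match theirs). Each rule governs how one attribute's integer value evolves across a row of the matrix:

```python
def is_constant(row):
    # "constant": the attribute value is the same in all three cells of the row
    return row[0] == row[1] == row[2]

def is_progression(row):
    # "progression": the value changes by a fixed non-zero increment across the row
    step = row[1] - row[0]
    return step != 0 and row[2] - row[1] == step

def is_arithmetic(row):
    # "arithmetic": the third value is the sum or the difference of the first two
    return row[2] == row[0] + row[1] or row[2] == row[0] - row[1]

def is_distribute_three(rows):
    # "distribute three": the same set of three values appears, permuted,
    # in every row of the matrix (this rule constrains all rows jointly)
    return all(sorted(r) == sorted(rows[0]) for r in rows)
```

An explicit description along these lines (even informal) would let the reader judge whether the rule language of equation (2) can capture all four rules.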
--Page 3, line 31: what is an “attribute bisection tree”? Also, what is the “context matrix generation”?
--The number of attributes should be clarified, in addition to the number of possible values per attribute (denoted “m” in the text, if I understand correctly).
--I do not understand why the RAVEN dataset is extended with additional columns. The article mentions that this is done to test “scalability” of the approach, but I am not sure whether this is a relevant concept. In this task, having more columns means having a larger number of examples, which could actually help discriminate better between patterns used to generate the sequence.
--A reader's lack of familiarity with “vector-symbolic architectures” (VSAs) may prevent them from fully understanding the approach. It would be helpful to explain the binding, unbinding, and bundling operations.
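To illustrate the kind of explanation I have in mind: in one common VSA (the multiply-add style with random bipolar vectors; this is my own illustration, not necessarily the architecture used in the paper), the three operations can be sketched in a few lines:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10_000  # high dimensionality makes independent random vectors quasi-orthogonal

def rand_vec():
    # random bipolar (+1/-1) vector representing an atomic symbol
    return rng.choice([-1, 1], size=d)

def bind(a, b):
    # binding: element-wise multiplication (associates a role with a filler)
    return a * b

def unbind(a, b):
    # unbinding: multiply again, since b * b = 1 element-wise (self-inverse)
    return a * b

def bundle(*vs):
    # bundling: element-wise majority vote (superposes several items)
    return np.sign(np.sum(vs, axis=0))

def sim(a, b):
    # normalised dot-product similarity in [-1, 1]
    return float(np.dot(a, b)) / d

# encode a figure as a bundle of attribute-value bindings
shape, colour = rand_vec(), rand_vec()
triangle, red = rand_vec(), rand_vec()
figure = bundle(bind(shape, triangle), bind(colour, red))

# unbinding with the `colour` role recovers a vector similar to `red`
recovered = unbind(figure, colour)
print(sim(recovered, red))       # clearly positive
print(sim(recovered, triangle))  # near zero
```

Even a short paragraph of this flavour in the paper would make the encoding of figures (and the interpretation of the learned transformation) much easier to follow.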
--It would also be helpful to give some intuition for why encoding into a VSA is a good idea. Why use blocks?
--Why have two blocks of six coefficients in Equation (2)?
--Section 4.3 talks about “all rules” – but over which set of rules does this quantify?