By Shreyas Casturi
Review Details
Reviewer has chosen not to be Anonymous
Overall Impression: Average
Content:
Technical Quality of the paper: Average
Originality of the paper: Yes, but limited
Adequacy of the bibliography: Yes
Presentation:
Adequacy of the abstract: Yes
Introduction: background and motivation: Limited
Organization of the paper: Needs improvement
Level of English: Satisfactory
Overall presentation: Weak
Detailed Comments:
This is my first review of any paper, and I apologize for how long it took me to get this review out.
The paper identifies an important problem: Neurosymbolic (NS) reasoners lack datasets/ontologies for testing a reasoner's resilience to noise. There are a variety of datasets for NS reasoning that use different metrics, deal with different tasks, etc., but the lack of a unified benchmark dataset makes it difficult to evaluate reasoners' abilities in a general manner.
To that end, the authors codified different tactics for generating noise (randomly corrupting triples, making ontology resources violate disjointness axioms, and adding low-probability links between resources that are superficially plausible but not true), codified a set of metrics (Mean Reciprocal Rank and Hits@N) to measure an NS reasoner's resilience to noise, and built datasets from the OWL2Bench ontology and a Family ontology with varying amounts of noise added. The authors ran two reasoners/embedding methods (OWL2Vec and Box2EL) on these noisy datasets and determined how the different noise-generation tactics affect each reasoner's ability to infer relationships.
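To check my own understanding of the first tactic, here is roughly what I take "randomly corrupting triples" to mean, sketched with rdflib (the function and parameter names are mine, not the authors'):

```python
import random
from rdflib import Graph, URIRef

def randomly_corrupt(graph: Graph, noise_ratio: float, seed: int = 0) -> Graph:
    """Replace the subject or object of a fraction of triples with a random resource."""
    rng = random.Random(seed)
    triples = list(graph)
    resources = list({s for s, _, _ in triples} |
                     {o for _, _, o in triples if isinstance(o, URIRef)})
    for s, p, o in rng.sample(triples, int(noise_ratio * len(triples))):
        graph.remove((s, p, o))
        if rng.random() < 0.5:
            graph.add((rng.choice(resources), p, o))   # corrupt the subject
        else:
            graph.add((s, p, rng.choice(resources)))   # corrupt the object
    return graph
```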
I think the paper provides value. The authors clearly note how there are many different types of tasks for reasoners and why the lack of standard datasets causes issues in evaluating reasoners effectively.
However, I think the paper is not ready to be accepted. The content of the paper is itself fine, but the presentation of the paper is lacking, and thus the paper needs massive revisions.
I have a series of comments and questions that I’m posting below. Again, I have never reviewed a paper before, so perhaps this is not the right way to review/ask for revisions, but I wanted to bring to the authors’ attention several points that I believe need revision/attention.
For now, I have chosen the “major revisions” option.
QUESTIONS/COMMENTS PER SECTION
SECTION 2
1. In Section 2.1, the paper brings up Henry Kautz's categorization scheme for different types of reasoners. Adding a diagram or list of those categories, with key examples (as you already do for the two relevant categories in Section 2.1), would provide helpful additional context.
2. The reason I make this point is that Section 2.1 does not clearly summarize the differences between Box2EL and OWL2Vec. I cannot determine any concrete differences between the two, even though the authors spend two paragraphs discussing each reasoner/embedding method. Adding a diagram, or a concluding paragraph that recapitulates the key differences between the tasks these embeddings do well at (and the fact that these tasks are different), and that also notes that the datasets and metrics differ between these tasks, would motivate Section 2.2 more clearly.
3. In Section 2.2, the authors are vague in describing the differences between Makni et al. and Ebrahimi et al. Both sets of authors are trying to provide metrics for the effectiveness of RDFS entailment reasoning, but what are these metrics and how do they differ? A concrete example would make it clear to the reader why, even for a single task, there is such variety in metrics, and would thereby motivate the need for the dataset/benchmark you are developing.
SECTION 3
1. In 3.1.1, in the subsection "Introducing Noise", you say that you add "k" individuals to the ontology. Does this mean that the individuals do not currently exist in the ontology? I.e., with John rdf:type Male and John rdf:type Female, should I assume John is one of the k individuals and that John does not exist in the ontology to begin with?
a. Does this not contradict the line in the introduction, "While ABox noise is about corrupting an existing triple in an ontology by changing one of the triples' resources" (these techniques being about introducing ABox noise)? I can't tell whether ABox noise is about corrupting existing individuals, adding new individuals, or both.
b. Should I assume that you take existing individuals/triples in the ontology and make them violate the disjoint axioms, or that you add individuals (who do not exist in the ontology at all) that disobey the constraints?
c. Rephrasing to make this explicit would make it clearer, especially given that the description of Logical Noise says "We introduce noise... by assigning an individual to two disjoint classes.", and you also state in 3.3 that you "corrupt either the object or the subject of existing triples". (A small sketch of the two readings I have in mind follows this item.)
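For concreteness, the two readings look like this in rdflib terms (the namespace and individual names are hypothetical, purely to make the question concrete):

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/family#")
g = Graph()
g.add((EX.Alice, RDF.type, EX.Female))   # Alice already exists in the ontology

# Reading 1: corrupt an EXISTING individual so it now violates the
# Male/Female disjointness axiom.
g.add((EX.Alice, RDF.type, EX.Male))

# Reading 2: add one of the "k" NEW individuals, who never appeared in the
# ontology and arrives already violating the axiom.
g.add((EX.John, RDF.type, EX.Male))
g.add((EX.John, RDF.type, EX.Female))
```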
2. In 3.2, I'm curious what would have happened had you added new triples carrying the low-probability assertions rather than modifying existing triples to carry them. Doing both, and comparing and contrasting their effects, would better cover the kinds of contradictions that could occur, would it not? Or is the assumption that the low-probability assertions would directly contradict the assertions the triples-to-be-modified currently make? (The sketch below shows the two variants I mean.)
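```python
from rdflib import Graph, Namespace

# The predicate and resources here are invented, just to illustrate the question.
EX = Namespace("http://example.org/family#")
g = Graph()
g.add((EX.Alice, EX.hasSpouse, EX.Bob))   # existing, plausible triple

# Variant 1 (what I understand 3.2 to do): modify the existing triple so its
# object becomes a superficially plausible but low-probability resource.
g.remove((EX.Alice, EX.hasSpouse, EX.Bob))
g.add((EX.Alice, EX.hasSpouse, EX.GreatGrandfatherOfAlice))

# Variant 2 (what I am suggesting to compare against): keep the original
# triple and add the low-probability assertion alongside it.
g.add((EX.Alice, EX.hasSpouse, EX.Bob))
g.add((EX.Alice, EX.hasSpouse, EX.GreatGrandfatherOfAlice))
```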
3. Was any scheme used in 3.3 to decide WHICH triples would be corrupted? I assume that entities appearing in many triples (one person, for example, may occur in more triples than another), if corrupted, would introduce more random noise than an entity that appears in only a single triple. (The sketch below illustrates the frequency-weighted selection I have in mind.)
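This is purely my illustration, not something I am claiming the paper does:

```python
import random
from collections import Counter
from rdflib import Graph, URIRef

def pick_triples_by_entity_frequency(graph: Graph, k: int, seed: int = 0):
    """Prefer triples whose subject/object appear in many other triples."""
    rng = random.Random(seed)
    triples = list(graph)
    freq = Counter()
    for s, _, o in triples:
        freq[s] += 1
        if isinstance(o, URIRef):
            freq[o] += 1
    weights = [freq[s] + (freq[o] if isinstance(o, URIRef) else 0)
               for s, _, o in triples]
    return rng.choices(triples, weights=weights, k=k)
```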
SECTION 4
1. The paragraph that begins with "Let G denote the original ontology, and I the ontology inferred..." needs to be reworked to more clearly explain what is being done and why it is being done. I will add additional comments below:
a. As far as I can tell, you are trying to make the ontologies consistent in terms of hop length for all resources R in the original ontology. I.e., you take the subgraph for each resource R, keep only statements/assertions at most 2 hops away from R, and then reconstitute the overall graph this way into a modified ontology?
b. Why do you need to build the inference graphs i1, i2, ... iR? I thought it was the NS reasoner's job (the one you are testing, not Pellet) to produce these inferences for whatever assertion you are testing on a given noisy dataset.
c. Why are you using Pellet? Is Pellet a standard tool to use?
d. Why are you removing "Literal" and "owl:Thing"? It wasn't clear to me. (My attempt to restate the subgraph construction, including this filtering, is sketched below.)
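If my reading above is wrong, that is itself a sign the paragraph needs clarification. Here is the construction as I currently understand it (the function name and exact filtering are my guesses):

```python
from rdflib import Graph, Literal
from rdflib.namespace import OWL

def two_hop_subgraph(graph: Graph, resource) -> Graph:
    """Keep statements within 2 hops of `resource`, dropping literals and owl:Thing."""
    frontier = {resource}
    sub = Graph()
    for _ in range(2):                                   # at most 2 hops from R
        reached = set()
        for s, p, o in graph:
            if s in frontier or o in frontier:
                if isinstance(o, Literal) or OWL.Thing in (s, o):
                    continue                             # drop literals / owl:Thing
                sub.add((s, p, o))
                reached.update({s, o})
        frontier = reached
    return sub
```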
2. This is my own fault for not knowing about MRR and Hits@N, but why are these metrics used over others? No papers are cited showing that these metrics have been used elsewhere on similar tasks, which would make it plausible to adopt them as a unifying standard. If there are no such papers, then I think there should be an explanation of why they are being used.
a. Is it possible to give a motivating example? I had never heard of these metrics and did not understand exactly what they are measuring.
b. That may be beyond the scope of the paper and cause length issues, however. (My own attempt at such an example is sketched below.)
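For other readers like me, here is how I eventually understood the two metrics (worth checking against the paper's own definitions): each test assertion gets a rank, i.e. the position of the correct answer in the reasoner's sorted candidate list; MRR is the average of 1/rank, and Hits@N is the fraction of ranks that are at most N.

```python
def mrr(ranks):
    # Mean Reciprocal Rank: average of 1/rank over all test assertions.
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_n(ranks, n):
    # Hits@N: fraction of test assertions whose correct answer is ranked <= N.
    return sum(1 for r in ranks if r <= n) / len(ranks)

ranks = [1, 3, 10]             # correct answers ranked 1st, 3rd, and 10th
print(mrr(ranks))              # (1/1 + 1/3 + 1/10) / 3 ≈ 0.478
print(hits_at_n(ranks, 1))     # 1/3 ≈ 0.333
print(hits_at_n(ranks, 10))    # 3/3 = 1.0
```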
3. I want to make sure I understand what actually happens when your code/dataset is "run"/"executed":
a. Is the idea that you run the reasoner and then see what class and object property assertions are generated when you introduce ABox noise into the ontologies at varying intensities?
b. Would Hits@N not count negative assertions (A isNotRel B)?
c. I feel that more explanation is needed of how exactly Hits@N and MRR are computed; perhaps a contrived example could be used (I sketch one below)? I found the example about Richard_john_bright helpful.
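Something as small as the following would already help. All class names and scores here are invented by me:

```python
# The embedding model scores every candidate class for one individual; the
# rank of the ground-truth class is what feeds MRR and Hits@N.
candidate_scores = {             # scores for "individual rdf:type ?C"
    "Person":       0.91,
    "Male":         0.74,        # <- the class asserted in the clean ontology
    "Organization": 0.40,
}
ranking = sorted(candidate_scores, key=candidate_scores.get, reverse=True)
rank_of_truth = ranking.index("Male") + 1   # = 2
# This assertion contributes 1/2 to MRR, and counts towards Hits@3 but not Hits@1.
```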
SECTION 5
1. Regarding the results, I think they are interesting and prove the value of the work being done. However, I do not think the graphs themselves are particularly helpful at conveying the information.
2. The issue is that the y-axis increments on both sets of graphs are so small that the bars alone do not convey the drop or difference in values as the different types of noise are introduced.
3. I think it would be helpful to add the numerical values atop the bars themselves, so that readers can see the numbers clearly and infer results from them, rather than needing to read the corresponding paragraphs.
4. Secondly, flipping between pages (or scrolling up and down in a PDF viewer) to compare class and property assertions for a given reasoner and dataset makes the data difficult to process.
5. It may be helpful to structure the graphs as follows, so there are 2-4 per page:
a. Random Noise -- OWL2Vec | Random Noise -- Box2EL
b. Statistical Noise -- OWL2Vec | Statistical Noise -- Box2EL
c. Logical Noise -- OWL2Vec | Logical Noise -- Box2EL
6. In each graph, you can show the class and property assertion values for each dataset as the noise is varied. This groups the results by reasoner + noise type and shows the effects of noise on each task more clearly.
7. You will still have many graphs, but it will be easier to find information. For example, you describe the MRR for class and object-property assertions decreasing as different types of noise are introduced, and you describe the effect of logical noise. The reader can then look at the Logical Noise -- OWL2Vec and Logical Noise -- Box2EL graphs side by side and see clearly the effect the noise has on both class and property assertions.
8. Reformatting the graphs to display the numerical MRR values, and grouping the results in this slightly different way, would make it much easier for the reader to read the data and then read the paragraphs explaining it. As it stands, it is hard to find the values discussed in the Results paragraphs and verify them against the graphs. (A rough matplotlib sketch of the layout I have in mind follows below.)
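Concretely, I am imagining something like this: one row per noise type, one column per embedding method, with the values printed on the bars (the numbers are placeholders, not the paper's results; `bar_label` needs a reasonably recent matplotlib, I believe):

```python
import matplotlib.pyplot as plt

noise_types = ["Random", "Statistical", "Logical"]
methods = ["OWL2Vec", "Box2EL"]
noise_levels = ["10%", "25%", "50%"]
placeholder_mrr = [0.42, 0.35, 0.21]      # invented values, for layout only

fig, axes = plt.subplots(len(noise_types), len(methods), figsize=(8, 9), sharey=True)
for i, noise in enumerate(noise_types):
    for j, method in enumerate(methods):
        ax = axes[i][j]
        bars = ax.bar(noise_levels, placeholder_mrr)
        ax.bar_label(bars, fmt="%.2f")    # numerical values atop the bars
        ax.set_title(f"{noise} Noise -- {method}")
        ax.set_ylabel("MRR")
fig.tight_layout()
plt.show()
```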
I would also like to note that when I cloned the repository to verify the authors' results, `pip install -r requirements.txt` failed (during meson's build of openblas, I believe), so I could not replicate their results. If the authors can revise the codebase and retest it so that the results are easy to regenerate, that would go a long way towards reproducibility!
In general, I think the paper has promise, but the presentation and motivation of certain key points is lacking. Hopefully these points provide a basis for a good revision!
I also didn't take a deep look at the bibliography; I didn't see anything that was out of place, however.