NSORN: Designing a Benchmark Dataset for Neurosymbolic Ontology Reasoning with Noise

Tracking #: 943-1967

Flag: Review Assignment Stage

Authors: 

Julie Loesch
Gunjan Singh
Raghava Mutharaju
Remzi Celebi

Responsible editor: 

Guest Editors Neurosymbolic AI and Ontologies 2024

Submission Type: 

Article in Special Issue (note in cover letter)

Cover Letter: 

RESUBMIT of #912-1929.

Dear Editors,

We sincerely appreciate the reviewers for dedicating their time and providing thoughtful feedback on our manuscript, “NSORN: Designing a Benchmark Dataset for Neurosymbolic Ontology Reasoning with Noise.” Below, we provide detailed responses to each of the reviewers' remarks and outline the corresponding revisions made in the updated version of the manuscript.

**Responses to Reviewer 2**

Overall, the revisions improve the paper and address several technical issues, but the benchmark remains limited in scope. If the authors soften their claims and reframe the contribution as a first-step benchmark framework that emphasizes reproducibility and extensibility, rather than as a definitive benchmark supporting broad conclusions about neurosymbolic reasoning robustness, the work would be more appropriately positioned for acceptance at this stage.

*Response:* We thank the reviewer for the thoughtful and constructive feedback. We agree that the current benchmark is limited in scope and that positioning it as a definitive evaluation of neurosymbolic reasoning robustness may overstate its present coverage. In the revised manuscript, we have softened our claims and reframed the contribution as a first-step benchmark framework designed to promote reproducibility, extensibility, and systematic evaluation. Specifically:

- Abstract:
  - Line 20 (page 1): We replaced “there is a lack of standardized benchmark datasets specifically designed for evaluating neurosymbolic ontology reasoning systems” with “systematic evaluation of ontology reasoning systems under noisy conditions remains underexplored”.
  - Line 21 (page 1): We replaced “Currently, no benchmarks or evaluation frameworks have been explicitly developed to assess the robustness of these systems to noise.” with “In particular, there is a need for benchmark frameworks that enable reproducible assessment of how neurosymbolic reasoners behave when ontological data are corrupted.”
  - Line 25 (page 1): We added “first-step benchmark framework”.
  - Line 31 (page 1): We replaced “Our results show that” with “In our experimental setup,”.
- Introduction:
  - Line 24 (page 2): We replaced “there is a notable absence of standardized benchmark datasets specifically tailored for neurosymbolic reasoning, particularly evaluating their noise tolerance.” with “there is limited work on benchmark frameworks specifically tailored to evaluating neurosymbolic reasoning under noisy conditions.”
  - Line 25 (page 2): We replaced “Such a benchmark is essential to advance this field” with “Such benchmark frameworks can support more systematic and comparable evaluation in this area”.
  - Line 26 (page 2): We removed “To the best of our knowledge, no benchmarks or evaluation frameworks have been explicitly designed to assess and compare the noise tolerance of neurosymbolic reasoning systems.”
  - Line 42 (page 2): We replaced “we aim to cover a spectrum of potential real-world errors” with “we aim to approximate a spectrum of potential error patterns observed in real-world ontology and knowledge graph construction”.
  - Line 44 (page 2): We replaced “With this work, we have addressed the following research questions: how to characterize noise in ontologies, how to introduce noise into these structures, and how to evaluate the impact of noise on neurosymbolic reasoners.” with “In this work, we explore the following research questions: how noise in ontologies can be characterized operationally, how controlled noise can be introduced into these structures, and how the impact of such perturbations can be evaluated in neurosymbolic reasoners.”
  - Line 47 (page 2): We replaced “ultimately advancing the field of neurosymbolic AI” with “contributing toward more systematic evaluation practices in neurosymbolic AI”.
  - Line 3 (page 3): We replaced “It should also be noted that most previous work has focused on tasks of ontology completion rather than ontology reasoning.” with “Many existing studies emphasize ontology completion or link prediction tasks, whereas our focus is on evaluating reasoning performance under noisy conditions.”
- Conclusion:
  - Line 24 (page 15): We replaced “a framework for generating noisy benchmark datasets, with a specific focus on the generation of noisy ABox assertions for an ontology” with “a reproducible and extensible framework for generating noisy benchmark datasets, with a particular focus on controlled perturbations of ABox assertions”.
  - Line 36 (page 15): We replaced “Our experiments demonstrate that graph-neural-network–based reasoning (R-GCN) offers significantly higher resilience to noisy ontological data—including the most harmful form, logical noise—making it a more reliable choice for real-world knowledge graphs where noise and incomplete inference paths are common.” with “Our experimental results show how different modeling paradigms respond to varying types and levels of noise. In our experimental setup, the GNN-based model (R-GCN) exhibited greater empirical robustness to injected perturbations, particularly under logically inconsistent noise, while embedding-based neurosymbolic models showed more pronounced degradation.”
  - Line 38 (page 15): We replaced “Furthermore, our study highlights that most previous work has mainly focused on ontology completion (i.e., prediction task), whereas our emphasis is on ontology reasoning, which is a more difficult task (i.e., inference task).” with “In contrast to many prior studies that focus primarily on ontology completion tasks, our emphasis is on ontology reasoning under noisy conditions, where inference quality is directly evaluated.”
  - Line 40 (page 15): We removed “The main difference in evaluation is how the train, test and validation sets are split.”

Our goal is to provide a transparent and extensible foundation that the community can build upon, rather than to claim a conclusive evaluation of robustness across all neurosymbolic systems. We believe this reframing better reflects the current scope of the work while preserving its core contribution. We appreciate the reviewer's guidance in helping us position the paper more appropriately.

**Responses to Reviewer 3**

1. Can statistical contradiction/noise not potentially create or be an instance of logical noise, because the new links between two objects might violate domain/range properties? Or is the graph neural network trained to respect those semantics when accounting for low-probability triples?

*Response:* Indeed, statistical noise can in principle introduce triples that violate domain or range constraints, and therefore overlap with what we define as logical noise. Our categorization is based on the mechanism of generation rather than the downstream logical effect. Logical noise in NSORN is introduced explicitly by constructing violations of disjointness axioms or domain/range constraints in a controlled and deterministic manner. In contrast, statistical noise is generated by a Graph Neural Network that predicts low-probability links without explicitly enforcing ontology-level semantic constraints. As a result, statistical noise may occasionally induce logical inconsistencies, but these arise indirectly from probabilistic prediction rather than deliberate semantic contradiction. We clarified this distinction in the manuscript in Section 3.2 by adding the following: “Although statistical noise is generated probabilistically using a GNN, it is not explicitly constrained to satisfy ontology-level semantics, and may therefore occasionally introduce logical inconsistencies as an indirect effect.”
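For illustration only (this is a minimal sketch, not code from the NSORN pipeline; the URIs, property names, and file name are hypothetical), such an indirect violation could be detected by comparing a predicted triple against the asserted domain and range of its predicate using rdflib:

```python
# Illustrative sketch (hypothetical URIs/file): flag a predicted triple whose
# subject/object types conflict with the asserted rdfs:domain / rdfs:range of
# the predicate. A full check would use a DL reasoner (e.g., Pellet) so that
# inferred types and subclass relationships are taken into account.
from rdflib import Graph, RDF, RDFS, URIRef

def violates_domain_range(g: Graph, s: URIRef, p: URIRef, o: URIRef) -> bool:
    domains = set(g.objects(p, RDFS.domain))
    ranges = set(g.objects(p, RDFS.range))
    s_types = set(g.objects(s, RDF.type))
    o_types = set(g.objects(o, RDF.type))
    # Only asserted rdf:type triples are checked here, so this simple test
    # over-reports violations compared to a reasoner-backed check.
    domain_ok = not domains or bool(domains & s_types)
    range_ok = not ranges or bool(ranges & o_types)
    return not (domain_ok and range_ok)

g = Graph().parse("pizza.owl", format="xml")  # hypothetical ontology file
EX = "http://example.org/pizza#"
print(violates_domain_range(g,
                            URIRef(EX + "Margherita1"),  # predicted subject
                            URIRef(EX + "hasTopping"),   # predicted relation
                            URIRef(EX + "Italy")))       # predicted object
```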
2. In your response to my prior critique, you discussed how and why Pellet was used, i.e., that Pellet is used to get/infer all facts and be used as a ground-truth reference. I think you should include a line about this in the paper when you bring up Pellet for the first time, on page 8. I remember being confused when I read this, and then I saw your response to me and that made more sense. As a result, I think even one or two lines will suffice just to make it clear how Pellet fits in.

*Response:* To clarify Pellet's role, we added the following sentence when it is first introduced (page 8): “We use the Pellet reasoner to compute the complete set of logical inferences over the original ontology, which serves as the ground-truth reference for evaluating the inferred assertions produced by the evaluated systems.”
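As a rough illustration of this role (a sketch under assumptions, not the exact NSORN code; Owlready2 bundles Pellet, and the file name is hypothetical), the ground-truth set of class memberships can be materialized along these lines:

```python
# Illustrative sketch (hypothetical file): run Pellet over the clean ontology via
# Owlready2 and collect the materialized class memberships as the ground truth.
import os
from owlready2 import get_ontology, sync_reasoner_pellet, ThingClass

onto = get_ontology("file://" + os.path.abspath("pizza.owl")).load()

with onto:
    # Pellet ships with Owlready2 and requires a Java runtime;
    # infer_property_values also materializes inferred object property
    # assertions, not only class memberships.
    sync_reasoner_pellet(infer_property_values=True)

ground_truth = {(ind.iri, cls.iri)
                for ind in onto.individuals()
                for cls in ind.is_a
                if isinstance(cls, ThingClass)}  # skip anonymous class expressions
print(len(ground_truth), "class-membership assertions after reasoning")
```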
3. I'm not sure if this is possible and it may end up being a bit too much to include, but do you think it may be helpful for Figures 2, 3, and 4, to have subfigures for each of the plots? That way, in your results section, you can state: "Reference Figure 2, Subfigure X". The issue is I'm not sure where you'd put the subfigure heading, at the top of each graph, perhaps? Again, this is minor; the graphs are already much improved from the prior review, and I'm appreciative of that.

*Response:* We thank the reviewer for the suggestion. Each subplot in Figures 2, 3, and 4 already corresponds to a specific reasoner, which is clearly indicated in the figure itself. As a result, we decided not to add separate subfigure labels to avoid clutter.

4. In your Results section (Section 5), you have a series of short paragraphs discussing the results of running the reasoning tasks on each dataset subject to varying noise and varying the reasoner. I think these paragraphs are fine, but it may be helpful to divide the paragraphs into small subsections so that the reader can easily see results for which ontology they want to look at. Something like this might help: 5. Results; 5.1: FAMILY RESULTS, discuss family ontology (the first few paragraphs); 5.2: PIZZA RESULTS, discuss/include pizza-related paragraphs; 5.3: OWL2BENCH RESULTS, discuss OWL2Bench paragraphs.

*Response:* To improve clarity and help readers navigate the Results section, we have added subsections for each ontology as recommended: 5.1 Family Results, 5.2 Pizza Results, 5.3 OWL2Bench Results, and 5.4 Overall Analysis.

5. In Appendix A, you show the various experiment data tables and the statistics gleaned from each of the experiments. Can you add a note about what number you are bolding? I believe you're bolding the "lowest" number, showing under what conditions each of the reasoners fails the most or loses the most out per metric, but it's not clear what is being bolded. Just adding this as a note alongside "Results on using ".

*Response:* We thank the reviewer for pointing this out. The bolded values in Appendix A indicate the lowest MRR for each reasoning task, regardless of the type or level of noise. Since we have two tasks, there are only two bolded values per table. To make this explicit, we added the following note above Table 4: “Note: Bolded values indicate the lowest MRR for each reasoning task, highlighting the conditions under which each reasoner performs least effectively.”

6. I'm not the biggest fan of the boxplots because I think they lack numbers. Notice how the graphs in Figures 2, 3, and 4 have the numerical values at each point, so it's easy for us to remember what the y-axis is and see the numerical values. With the boxplots, I don't see those values. I don't usually graph boxplots, so I'm not sure if there is a feature to highlight what the numbers are for the low end, the median/middle, and the high end values (I forget what you call them exactly). I would be alright if the plots weren't there, but if you do want to keep them there, can you add a note in the Results section about what the variability is showing us/illustrating?

*Response:* We thank the reviewer for this helpful comment. The boxplots were added in response to another reviewer's suggestion to visualize variability across runs. To clarify their purpose, we have included them in the appendix (Figures 6–9), showing the full distribution of results over the five runs. These visual indicators allow readers to assess variability and the statistical reliability of the observed trends, complementing the reported averages. The boxplots illustrate variability across runs, highlighting the impact of different noise types and levels on reasoner performance.

7. Probably a dumb question, but how/what exactly do you feed to the reasoners in order to get the inferences for each of the tasks? For example, do you ask the reasoners if an Object belongs to a Class for the membership task, and then for the object property axiom question, do you feed a tuple of two Objects and a Relation and ask if the relation holds between the objects? I think maybe just a quick example, such as "We test the reasoners with the dataset on these tasks. For example, with the Pizza dataset, we would ask OWL2Vec , and we would ask DL EL++". This is probably on me for not working it out clearly.

*Response:* To clarify, for each reasoning task, we query the reasoners directly with the relevant components from the dataset. For class membership, we check whether an individual belongs to a given class. For object property assertions, we check whether a specific relation holds between two individuals. For example, in the Pizza dataset, the reasoners are tested by asking whether a particular pizza individual belongs to a class (e.g., VegetarianPizza) or whether a specific relation holds between two entities (e.g., hasTopping(Pizza1, Cheese)).
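For concreteness, a minimal sketch (Owlready2, hypothetical entity names and file; not the exact evaluation code) of how these two query forms can be posed against the reasoned ontology:

```python
# Illustrative sketch (hypothetical IRIs/file): the two query forms used in the
# evaluation, posed against the ontology after reasoning with Owlready2.
import os
from owlready2 import get_ontology, sync_reasoner_pellet

onto = get_ontology("file://" + os.path.abspath("pizza.owl")).load()
with onto:
    sync_reasoner_pellet(infer_property_values=True)  # requires a Java runtime

pizza1 = onto.search_one(iri="*Pizza1")          # hypothetical individual
cheese = onto.search_one(iri="*Cheese")          # hypothetical individual
veg_pizza = onto.search_one(iri="*VegetarianPizza")  # hypothetical class

# Class membership query: does Pizza1 belong to VegetarianPizza,
# including inferred (indirect) types?
is_member = veg_pizza in pizza1.INDIRECT_is_a

# Object property assertion query: does hasTopping(Pizza1, Cheese) hold?
has_topping = cheese in pizza1.hasTopping

print(is_member, has_topping)
```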
8. Lastly, due to time constraints, I did not have the ability to really look through your code. I tried setting up with the new repo, and ran into issues with Meson again, but I think the errors were different this time around. For your reference, here is the error output:

*Response:* The installation failure occurs because the old SciPy version (1.13.1) was being built from source on the reviewer's system. Building SciPy from source requires a Fortran compiler (e.g., gfortran), which was not installed. In the updated requirements.txt, SciPy has been updated to >=1.11, which allows pip to install a precompiled wheel for Python 3.12+, avoiding the need for compilation and ensuring a smooth installation.
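For reference, the relevant change is the SciPy pin in requirements.txt (shown here only as an excerpt; the other dependencies are unchanged):

```
# requirements.txt (excerpt)
scipy>=1.11  # lets pip install a precompiled wheel instead of building from source
```

With this pin, `pip install -r requirements.txt` should resolve to a binary SciPy wheel on recent Python versions, so no Fortran toolchain (gfortran) or Meson build step is needed.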

Tags: 

  • Under Review