NSORN: Designing a Benchmark Dataset for Neurosymbolic Ontology Reasoning with Noise

Tracking #: 912-1929

Flag: Review Assignment Stage

Authors: 

Julie Loesch
Gunjan Singh
Raghava Mutharaju
Remzi Celebi

Responsible editor: 

Guest Editors Neurosymbolic AI and Ontologies 2024

Submission Type: 

Article in Special Issue (note in cover letter)

Cover Letter: 

RESUBMIT of #818-1810. Dear Editors, We sincerely thank the reviewers for their valuable time and insightful feedback on our manuscript, "NSORN: Designing a Benchmark Dataset for Neurosymbolic Ontology Reasoning with Noise". Their constructive comments have greatly helped us enhance the clarity and accuracy of our work. Below, we address each of the reviewers’ remarks in detail and highlight the corresponding changes made in the revised version of the paper. **Responses to Reviewer 1** - The introduction of new datasets for testing neurosymbolic AI methods is very welcome, we need standardized tasks that are challenging and representative. I also appreciate the focus on robustness to noise. As the paper states, purely symbolic systems are brittle to noise, so it is interesting to measure this. However, the other related question -- not addressed in the paper -- is at least as interesting: adding logic to a neural system helps to deal with noise by pointing out inconsistencies in the data. Response: We appreciate the comment from the reviewer. Integrating logical reasoning into neural systems can help identify and mitigate inconsistencies in the data, which is indeed a compelling benefit. While our current work focuses on evaluating reasoning over existing ontologies and the effects of noise, exploring how logic can actively guide neural systems to handle noisy or inconsistent data is an exciting direction for future work. - Correctness: The work appears correct. One remark though: real-world KGs/ontologies are already notoriously noisy and incomplete and this existing noise is not accounted for; and the additional random noise could actually be true, just missing. The two datasets the authors evaluate on might be relatively complete and noise free (for Family this is definitely possible), if so, stating this explicitly would lead to a stronger experimental section. Maybe the idea of the paper would work best with a fully synthetic dataset where you can first generate an ontology that is easily learnable, complete and noise free, and then add noise to see how performance deteriorates in a controlled setting. Response: We agree that real-world knowledge graphs and ontologies are often noisy and incomplete. To address this concern, we included an additional ontology, Pizza, for which we developed a synthetic ABox generator, further supporting controlled experimentation under well-defined conditions. We added the following to Section 4.1 (first paragraph): “Using the Pizza ontology, we created an ABox generator to support experiments with synthetic data. The process for generating ABox data for the Pizza ontology begins by loading the Pizza TBox (Terminological Box) axioms. A custom instance generation step then automatically creates a specified number of individuals (ABox data), and their object properties are defined in a configuration. For this study, only the NamedPizza class and the hasTopping property are described in the configuration. Crucially, this generation leverages the TBox's inherent OWL restrictions (e.g., only or some constraints) to dynamically determine the appropriate target classes for object properties, thereby guaranteeing that the generated ABox is semantically consistent with the ontology's definition. The final output is the complete ontology, comprising the original TBox and the newly populated ABox. In this study, we developed two datasets, Pizza_100 and Pizza_250, comprising 100 and 250 pizza instances, respectively.”
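To make this description concrete for the reviewers, a minimal sketch of such a TBox-driven ABox generator is shown below. It is an illustration only (using owlready2; the file name, helper functions, and the handling of class expressions are simplifications), not the exact code in our repository:

```python
# Illustrative sketch: populate an ABox from the Pizza TBox, using the TBox's
# 'some'/'only' restrictions on hasTopping to pick semantically valid toppings.
import random
from owlready2 import get_ontology, Restriction, SOME, ONLY

onto = get_ontology("pizza.owl").load()            # Pizza TBox
NamedPizza = onto.search_one(iri="*NamedPizza")    # class named in the configuration
hasTopping = onto.search_one(iri="*hasTopping")    # property named in the configuration

def allowed_toppings(pizza_cls):
    """Topping classes permitted by the 'some'/'only' restrictions of pizza_cls."""
    targets = []
    for sup in pizza_cls.is_a:
        if (isinstance(sup, Restriction) and sup.property == hasTopping
                and sup.type in (SOME, ONLY) and isinstance(sup.value, type)):
            targets.append(sup.value)      # plain named classes only; unions are skipped here
    return targets

def generate_abox(n_individuals, prefix="pizza"):
    pizza_classes = list(NamedPizza.subclasses())
    for i in range(n_individuals):
        cls = random.choice(pizza_classes)
        pizza = cls(f"{prefix}_{i}")       # new ABox individual
        for j, topping_cls in enumerate(allowed_toppings(cls)):
            pizza.hasTopping.append(topping_cls(f"{prefix}_{i}_topping_{j}"))

generate_abox(100)                  # Pizza_100; use 250 for Pizza_250
onto.save("pizza_100.owl")          # original TBox plus the newly populated ABox
```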
In addition, we now used the same ABox generator logic for OWL2Bench, which is fully synthetic and based on the UOBM ontology, an extension of LUBM, and provides a controlled and noise-free environment. We clarified this point in Section 4.1 (third paragraph): “OWL2Bench was developed as an extension of the well-known University Ontology Benchmark (UOBM), producing four distinct TBoxes—one for each OWL 2 profile. For this study, we used the TBox of OWL2Bench1-DL, expressed in the OWL DL profile. We generated a relatively smaller ABox than those in the OWL2Bench dataset as our work aims to assess the reasoning capabilities of neurosymbolic systems rather than their scalability. Our ABox generation for this ontology was restricted to the classes University, Department, Person, and Course, along with key object properties such as hasDepartment, hasDoctoralDegreeFrom, teachesCourse, and takesCourse.” - Presentation: the work is clearly presented but assumes background knowledge. NAI is broader than just ontologies so a brief primer on ontologies, in particular on the concepts of TBox and ABox, would be a good addition. Response: We added a brief definition to the Introduction (first paragraph): “An ontology is a formal and explicit specification of a shared conceptualization of a domain. It describes the concepts, categories, relationships, and rules that structure knowledge within that domain, facilitating common understanding and interoperability among systems and people [https://www.sciencedirect.com/science/article/pii/S1042814383710083]. Ontologies distinguish between the Terminological Box (TBox), which defines classes, along with their relationships and restrictions, and the Assertional Box (ABox), which contains assertions about individuals such as property assertions or class memberships.” - For the statistical approach, please clarify whether the "lowest probability score" is re-evaluated after adding every noise triple; or whether the k triples are identified at once. The latter approach may introduce related triples? Response: In the statistical approach, the k triples are identified all at once based on the calculated probability scores, before any noise is added. We did not re-evaluate the scores after each noise triple. You are correct that this could lead to related triples being selected, which is a known limitation of this method. We opted for this approach for simplicity and reproducibility, but dynamically updating probabilities after each addition could be an interesting extension to explore. - Smaller comments: The examples in the methodology for logical noise are helpful. Showing noise evolution with a lineplot would be more appropriate than bars. Response: To better illustrate the noise evolution, we have replaced the bar plots with line plots in Figures 2–5 of the Results section. - Line 29 on page 6 abruptly changes the topic without introducing what is going to be discussed. Response: We added an introductory sentence: “The majority of prior research has concentrated on ontology completion tasks (i.e., prediction) rather than on ontology reasoning tasks (i.e., inference) [https://arxiv.org/abs/2507.14334].
Ontology or link completion involves identifying plausible relations that enrich the original ontology, as demonstrated in the study by Chen et al. [https://arxiv.org/abs/2009.14654]. In prediction tasks, the training, validation, and testing datasets are typically created by randomly splitting the ontology axioms.” - Two weaknesses are: 1. Completeness, novelty and motivation: The three introduced techniques are not sophisticated, which is not necessarily bad, but their discussion is lacking; I.e., I expected a more thorough comparison on the effect of the techniques, and a better motivation as to why these noise types are relevant. The work mentions "many other types of axioms and noise patterns merit investigation". These should arguably be discussed as well, such that the choice of investigated axioms can be better positioned and motivated. Related, the work claims "real-world datasets often contain errors, inconsistencies, or irrelevant information.", which raises the question as to why we still need to add noise instead of focus on real-world data. Furthermore, to introduce noise to synthetic data, it is unclear whether the proposed types of noise are at all representative for the noise that is present in real-world datasets. Response: We thank the reviewer for the insightful comments and for highlighting the need for stronger motivation and positioning of the chosen noise types. Our study aims to investigate how different forms of noise affect neural and logical systems, recognizing that reasoning can be undermined by inconsistent or irrelevant information. While the three noise-injection techniques we propose are intentionally simple, their design is motivated by common categories of errors observed in real-world knowledge bases. Similar to prior work [https://ojs.aaai.org/index.php/AAAI/article/view/28729], we define three types of noise: (i) random noise, (ii) logical noise, and (iii) statistical noise. Random noise does not depend on the data, representing unpredictable, accidental errors. Logical noise is more realistic, as many real-world errors stem from semantic confusion or violations of ontological constraints. Statistical noise is adversarially generated due to bias in automated KG construction models. Since error-detection models are often employed to identify mistakes in automatically constructed knowledge graphs, we use Graph Neural Networks to generate adversarial noise and to evaluate a model’s ability to detect such errors during KG construction. We acknowledge that our techniques are not exhaustive or highly sophisticated. The intent was to provide a first step in systematically investigating these effects. Comprehensive exploration of other logical axiom types and noise patterns is indeed an important direction for future work. Similarly, while it is true that learning from real-world datasets containing errors is ideal, publicly available datasets that contain structured errors suitable for controlled experiments are rare (e.g., Wikidata’s TBox is not semantically rich enough for our purposes). Introducing synthetic noise allows us to systematically study its effects in a controlled manner. We added the following to the Introduction (fourth paragraph) to motivate the choice of the noise injection techniques: "Random noise serves as a baseline, representing data-agnostic, unpredictable errors that may arise accidentally in real-world ontologies. We simulate this by corrupting existing triples—replacing either the subject or the object with a random entity. 
This allows us to probe the robustness of reasoning processes against general perturbations that do not depend on the underlying data. Statistical noise is generated adversarially using Graph Neural Networks (GNNs), reflecting low-probability links that emerge from predictive uncertainty or bias in automated knowledge graph construction. Although synthetic, this form of noise models realistic mistakes produced by machine-learning systems, and mirrors the types of errors that error-detection models are typically asked to identify during KG construction. Logical noise captures violations of semantic constraints, such as disjointness axioms or domain and range restrictions. Because many real-world ontology errors stem from semantic confusion rather than random corruption, this type of noise directly stresses the logical structure of the ontology and provides a more targeted challenge to reasoning systems. By combining these three types of noise, we aim to cover a spectrum of potential real-world errors, from accidental and statistically plausible mistakes to deliberate logical conflicts." - 2. The results of the experimental section are unclear to me: - I like the relative measurement of noise but what does 100% noise mean? That there is an equal number of noisy assertions compared to ground-truth assertions? This is confusing, with 100% noise I expect everything to be noise and nothing to be learnable. Response: 100% noise means that the number of added noisy assertions equals the number of original assertions, effectively doubling the total. We clarified this definition in the Results section in the figure captions (Figures 2-5). - The effect of adding noise seems small and the performance does not consistently decrease. In the case of statistical noise, the MRR even improves for Family and OWL2Vec? There seems to be something wrong there. Response: Statistical noise is introduced through GNN-generated triples. Unlike random noise, these GNN-generated triples tend to preserve some degree of structural and relational plausibility. As a result, the injected noise can occasionally reinforce existing local patterns in the graph instead of disrupting them. This can lead to slight performance increases in some configurations, which is consistent with the idea that the GNN model captures latent regularities that are beneficial for the embedding method. - It would be interesting to also benchmark a purely neural and a purely symbolic reasoner to show that this is a setting where NAI is useful. Response: We added experiments with a purely neural approach based on Graph Neural Networks. Purely symbolic reasoning is impractical: the symbolic space is too large, and the ontology contains inconsistencies that introduce logical noise, making it infeasible to execute symbolic reasoning reliably. These findings motivate our use of NAI, which combines the strengths of both paradigms. **Responses to Reviewer 2** - Artificiality of Noise Injection: While the paper introduces a clear and reproducible method for noise generation, some of the injected noise, particularly the statistical noise derived from low-probability GNN predictions, appears too "easy" or synthetic. Real-world ontologies often contain more adversarial or semantically subtle noise. The study would benefit from incorporating a broader spectrum of noise severity, including some manually constructed, semantically plausible errors that challenge different aspects of the reasoning process. 
Response: We appreciate the reviewer’s insightful comment regarding the nature of the injected noise. We agree that exploring a broader spectrum of noise, including manually constructed, semantically plausible errors, would indeed provide additional valuable insights into reasoning robustness. Our primary goal in this work was to systematically study reasoning robustness under different types of noise through three complementary, well-defined noise injection strategies. To better motivate these choices, we have added an explanation to the Introduction (fourth paragraph); it is quoted in full in our response to Reviewer 1 above and covers the rationale behind random, statistical, and logical noise. We fully agree that extending our framework to include semantically subtle or adversarially crafted noise is an important direction for future work, and we explicitly note this as such. - Lack of Statistical Rigor in Results: The experimental section would be strengthened by the inclusion of error bars or confidence intervals to better reflect variability across runs and support claims about noise effects. Though the authors mention averaging over five runs, visual indicators of variance are missing in the main figures, limiting the statistical interpretability of trends. Response: We included boxplots in the appendix (Figures 6–9) to show the full distribution of results over the five runs. These visual indicators allow readers to assess variability and the statistical reliability of the observed trends, complementing the reported averages. - Limited Qualitative Analysis: The results focus on numerical performance metrics but omit qualitative insights into how specific examples of noise affect inference outcomes. Including a few illustrative examples where reasoning fails (or surprisingly succeeds) under noise would help ground the quantitative findings and offer readers more interpretability into the models' failure modes. Response: We appreciate the reviewer’s valuable suggestion. We have added an illustrative example of ABox reasoning under noise in Figure 1. While we agree that additional qualitative examples would further enrich the analysis, time constraints prevent us from including more in this revision.
We plan to incorporate more detailed cases of reasoning successes and failures under noise in future work to complement the quantitative results. - Some minor concerns: --On page 2, use the authors' names, not [1] et al. presented. --Sometimes closed quote is used instead of open quote e.g., on page 10, line 44 (as well as in a couple of other places). Response: We have addressed these minor issues throughout the paper: references to authors are now written using their names instead of “[1] et al.,” and quotation marks have been corrected where needed (e.g., page 10, line 44). **Responses to Reviewer 3** - The main motivation described in their work on creating a benchmark is not quite fulfilled, since the practicality of the creation of the benchmark currently lacks experiments and datasets that they have tested on. The evaluation is restricted to only two ontologies and two reasoners. This limits the generalizability of the findings. Response: In response, we reran the experiments, included an additional purely neural approach based on Graph Neural Networks to broaden the evaluation, and added a third ontology, Pizza. However, several practical constraints remain: (i) Ontologies: While many ontologies exist, most either have very small ABoxes or extremely large TBoxes (e.g., GeneOntology), posing practical challenges for systematic evaluation. (ii) Reasoners: For neurosymbolic reasoning, only a few approaches are readily applicable, such as random-walk-based methods (OWL2Vec*) and geometric-space methods (Box2EL). Many other existing techniques are not readily available or remain research prototypes, which further limits evaluation options. Despite these limitations, our benchmark provides a reproducible and extensible framework for systematic evaluation. It allows future work to incorporate additional ontologies or reasoning techniques as they become available, ensuring the benchmark remains a meaningful tool for assessing reasoning robustness. - The authors acknowledge that "specific characteristics of each ontology significantly influence the effectiveness of noise injection," yet fail to adequately address this through broader experimentation. Response: We acknowledge that the specific characteristics of each ontology, including the types of inferences and commonly used axioms, play a significant role in the effectiveness of noise injection. For instance, in the Pizza ontology, many inferences involve subproperty, inverse property, or functional axioms. In future work, we plan to conduct a more systematic and detailed analysis of these inference patterns across different ontologies to better understand their impact on noise robustness. - The results show different patterns across ontologies, with no consistent trend and focus exclusively on ABox noise. Since TBox noise is common in real-world settings, it limits the applicability of the benchmark to certain scenarios. Response: In this study, we focus on ABox noise to systematically evaluate its effects on reasoning, as it represents a common first step in ontology noise research. We acknowledge that TBox noise is important for real-world ontologies and can impact reasoning in different ways; however, we concentrate on ABox noise because the ABox is more central to many real-world knowledge graphs. Extending our benchmark to include TBox noise is an important direction for future work, which would broaden the applicability of our approach and enable a more comprehensive evaluation across multiple types of ontology inconsistencies.
- In addition, since the baseline performance scores are very low and, as they mentioned, "it is difficult to identify any clear trend, as the values are already low, even without the introduction of noise," it raises questions about the suitability of the chosen tasks and datasets without expanding their evaluation scope. Response: We reran the experiments, adding a purely neural baseline based on Graph Neural Networks and a third ontology, Pizza, to broaden the evaluation scope. We also refined our data splitting strategy. Specifically, we updated Section 4.1 (third paragraph) as follows: “Let G denote the original ontology and I the ontology inferred using the Pellet reasoner [https://www.sciencedirect.com/science/article/abs/pii/S1570826807000169]. Since our approach is unsupervised, the graph G is ultimately added to G_train, while I is randomly assigned to G_train, G_test and G_val. The TBox is further added to G_test and G_val, ensuring that the reasoning tasks are based on a shared conceptual framework.” Previously, we did not include any inferences in the training set. This was incorrect, as excluding inferences led to incomplete graph representations during training and an inconsistent distribution between training and evaluation sets. Furthermore, our study highlights that most previous work has mainly focused on ontology completion (i.e., prediction tasks), whereas our emphasis is on ontology reasoning, a more challenging inference task. This naturally results in lower baseline scores, as reasoning requires multi-step logical deductions rather than simpler predictions. Additionally, the scores for Object Property Assertions (OPA) are sometimes low due to the nature of the test sets. For example, in the Family ontology, OPA triples constitute over 98% of the test set. Since many of these OPAs arise from multi-step inferences produced by Pellet, the test set is dominated by structurally complex, inference-heavy triples. This makes the reasoning task inherently difficult and causes all models to exhibit low OPA performance even without noise. - The authors should include additional ontologies from different domains and complexity levels and evaluate more neurosymbolic reasoners to establish generalizable patterns; otherwise, it is unclear how practitioners should use this benchmark and interpret results for improving reasoner robustness. Including traditional symbolic reasoners in the evaluation would better contextualize the performance of the neurosymbolic approaches. Response: To broaden the evaluation, we added an additional ontology, Pizza, and a purely neural baseline based on Graph Neural Networks to cover different domains and structural complexity. While traditional symbolic reasoners could help contextualize performance, they are generally unable to handle noise, which is a key aspect of our benchmark. We note practical constraints: many ontologies either have very small ABoxes or extremely large TBoxes (e.g., GeneOntology), and only a few neurosymbolic reasoners (e.g., OWL2Vec*, Box2EL) are readily applicable, while others remain research prototypes. Despite these limitations, our benchmark provides a reproducible and extensible framework, allowing future work to incorporate additional ontologies and reasoning techniques as they become available, ensuring it remains a useful tool for evaluating reasoning robustness. **Responses to Reviewer 4** 1. In Section 2.1, the paper brings up Henry Kautz’s categorization scheme of different types of reasoners.
Adding a diagram/picture with a list of those categories as well as key examples, as you do for the two relevant categories in Section 2.1, will provide necessary additional context. Response: We added the definitions of those categories, as well as examples, at the beginning of Section 2.1. 2. The reason that I have made this point is that Section 2.1 does not clearly summarize the differences between Box2El and OWL2Vec. I cannot determine any concrete differences between the two, though the author spends 2 paragraphs discussing each reasoner/embedding method. Adding an additional diagram or an additional concluding paragraph that recapitulates the key diffs between the tasks that these embeddings do well at (and the fact these tasks are different), as well as also stating that the datasets and metrics are different between these tasks, helps motivate section 2.2 clearly. Response: We added Table 1 at the end of Section 2.1 that highlights the key differences between OWL2Vec* and Box2El. 3. In Section 2.2, the authors are vague in describing the differences between Makni et al, and Ebrahimi et al. Both sets of authors are trying to give metrics for the effectiveness of RDFS entailment reasoning. But what are these metrics and how do they differ? Giving a concrete example will make it clear for the reader why, even when dealing with just one task, there is such difference and variety in metrics, and therefore motivate the need for this dataset/benchmark you are developing. Response: The idea was to provide an example of why we need to have standardized evaluations (including metrics and datasets). We added the following: “Specifically, Makni et al. [https://semantic-web-journal.net/system/files/swj1866.pdf] used LUBM and a scientist dataset derived from DBpedia as benchmarks, evaluating performance with Precision, Recall, and F1 score. In contrast, Ebrahimi et al. [https://arxiv.org/abs/2106.09225] employed LUBM and synthetic data, using exact matching accuracy as their metric.” 1. In 3.1.1, in the subsection "Introducing Noise", you say that you add "k" individuals to the ontology. Does this mean that the individuals do not currently exist in the ontology? I.e, with John rdf:type Male and John rdf:type Female, I should assume John is one of the k individuals and that John does not exist in the ontology to begin with? Response: For logical noise, we consider both existing and new individuals. We first select individuals already present in the ontology; if additional examples are needed to reach a desired noise level, we introduce new, fictional individuals. Disjoint class and disjoint property axioms are used to create inconsistencies by assigning individuals to either two disjoint classes or properties. We agree that the original phrasing “we added k individuals to the ontologies” may be misleading. Thus, Section 3.1.1 has been updated for greater clarity. In addition, we have added an illustrative example of ABox reasoning under noise in Figure 1. a. Does this not contradict the line in the introduction: "While ABox noise [which these techniques are about introducing ABox noise] is about corrupting an existing triple in an ontology by changing one of the triples' resources"? I can’t tell if ABox noise is about corrupting the individual or adding new individuals or both. Response: ABox noise refers to the introduction of inconsistencies or corrupted triples into an ontology. This can occur either by modifying existing triples or by adding new, noisy ones. In our approach, we generate ABox noise (both random and statistical) by corrupting the subject or the object of existing triples and then adding the resulting modified triples as new entries in the ontology.
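As an illustration of these two noise-injection steps, the following minimal sketch (using rdflib; the graph, the list of disjoint class pairs, and the helper names are assumptions made for the example rather than our exact implementation) shows how corrupted triples and disjointness violations can be added to an ABox:

```python
# Minimal illustration of ABox noise injection (not the exact repository code).
import random
from rdflib import Graph, RDF, URIRef

def add_random_noise(g: Graph, k: int, seed: int = 0) -> None:
    """Corrupt k existing triples by replacing the subject or the object with a
    random entity, then add the corrupted triples as new assertions."""
    rng = random.Random(seed)
    triples = [t for t in g if isinstance(t[2], URIRef)]   # object-property-style triples
    entities = list({s for s, _, _ in triples} | {o for _, _, o in triples})
    for s, p, o in rng.sample(triples, k):
        if rng.random() < 0.5:
            s = rng.choice(entities)        # corrupt the subject
        else:
            o = rng.choice(entities)        # corrupt the object
        g.add((s, p, o))                    # the noisy triple is added; the original is kept

def add_logical_noise(g: Graph, disjoint_class_pairs, k: int, seed: int = 0) -> None:
    """Create k inconsistencies by asserting membership in two disjoint classes,
    reusing existing individuals where possible and minting new ones otherwise."""
    rng = random.Random(seed)
    individuals = list(set(g.subjects(RDF.type, None)))
    for i in range(k):
        cls_a, cls_b = rng.choice(disjoint_class_pairs)      # e.g. (:Male, :Female)
        ind = rng.choice(individuals) if individuals else \
            URIRef(f"http://example.org/noise_individual_{i}")
        g.add((ind, RDF.type, cls_a))
        g.add((ind, RDF.type, cls_b))
```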
b. Should I assume that you take existing individuals/triples in the ontology and make them violate the disjoint axioms, or that you add individuals (who do not exist in the ontology at all) that disobey the constraints? Response: We first select existing individuals and assign them to either disjoint classes or disjoint properties. If additional examples are needed to reach a specific noise level, or if the ontology lacks suitable candidates, we introduce new, fictional individuals to achieve the desired level of inconsistencies. c. Perhaps rephrasing to make this explicit will make it clearer, especially given that you say this in the description of Logical Noise: "We introduce noise... by assigning an individual to two disjoint classes.", and you also state in 3.3 that you "corrupt either the object of the subject of existing triples". Response: To address this, we have revised Section 3 entirely to clarify how individuals and triples are used in noise generation, including the distinction between modifying existing triples and introducing new, inconsistent ones. We hope this makes the methodology clearer. 2. In 3.2, I'm curious as to what would have happened had you added the triples with the low probability assertions rather than modifying existing triples to have the low probability assertions. Doing both and comparing and contrasting the effects they have would better cover all the types of contradictions that could occur, would it not? Or is the assumption that the low probability assertions would directly contradict with the current assertions that the triples-to-be-modified have? Response: We focused on modifying existing triples because our task is link prediction—i.e., predicting the missing entity in patterns like (?, predicate, object) or (subject, predicate, ?). By introducing low-probability assertions into existing triples, we ensure that the contradictions directly interact with the knowledge already present in the graph, which is the context in which link prediction operates. That said, we do in fact “add” these modified triples back into the graph. So while our emphasis was on altering existing triples for consistency with the prediction task, the framework naturally accommodates the presence of these newly modified triples. 3. Were there any considerations/scheme taken in 3.3 to figure out WHICH triples were going to be corrupted? I assume that triples/objects that appear more in a dataset (a particular person, for example, may have more triples than another person), if corrupted, would introduce more random noise than a person/object that only appears once as a triple. Response: The reviewer is correct that corrupting triples associated with frequently occurring entities could introduce disproportionately more noise than corrupting triples of infrequent entities. However, in our experiments, we deliberately kept the corruption process completely random. This choice ensures a clean comparison between this form of random noise injection and other techniques (statistical and logical noise), without introducing additional biases from a targeted selection scheme. 1. The paragraph that begins with "Let G denote the original ontology, and I the ontology inferred..." needs to be reworked to more clearly explain what is being done and why it is being done.
I will add additional comments below: Response: In much of the prior work, ontologies are simply split into training, validation, and test sets using a standard ratio (e.g., 80/10/10). While this is suitable for ontology completion tasks, it does not reflect the requirements of ontology reasoning. Our focus in this paper is specifically on reasoning. To evaluate this, we employed neurosymbolic reasoners (Box2EL and OWL2Vec*). However, in order to test these methods meaningfully, we first needed a reliable ground truth set. For this, we used Pellet, a well-known symbolic reasoner, to generate inferences. An introductory sentence was added to motivate why this step is necessary: “The majority of prior research has concentrated on ontology completion tasks (i.e., prediction) rather than on ontology reasoning tasks (i.e., inference) [https://arxiv.org/abs/2507.14334]. Ontology or link completion involves identifying plausible relations that enrich the original ontology, as demonstrated in the study by Chen et al. [https://arxiv.org/abs/2009.14654]. In prediction tasks, the training, validation, and testing datasets are typically created by randomly splitting the ontology axioms.” a. As far as I can tell, you are trying to modify the ontologies to be consistent in terms of hop length for all possible resources R within the original ontology. I.e, you are taking the subgraph for each resource R, and making it so that any statement/assertions are at most 2 hops away from R?, and then reconstituting the general graph this way into a modified ontology? Response: We have removed this part from the paper because we now operate directly on the full ontology. The earlier approach was introduced purely for computational efficiency. Specifically, for each resource R, we previously extracted a 2-hop subgraph that contained all statements and assertions reachable within two hops of R. This ensured a uniform hop length across resources and, more importantly, allowed us to run the Pellet reasoner on many small subgraphs rather than on the full ontology, which was significantly faster given the size of the original dataset. In the current version of the work, however, we no longer perform this transformation: we reason over the complete ontology using Pellet. This simplifies the preprocessing pipeline and removes the need for generating per-resource subgraphs. b. Why do you need to make the inference graphs i1, i2... iR? I thought that it is the NS reasoner's job (the one you are testing, not Pellet) to make these inference graphs for whatever assertion/inference you are trying to test for a given dataset with some noise. Response: We have removed this part from the current version of the paper, but let us clarify the motivation behind it. In our earlier experiments, for each resource R, we extracted a subgraph g_R and then used a standard DL reasoner (Pellet) to compute the corresponding inference graph i_R. These inference graphs served as ground-truth reference outputs. While the NS reasoner indeed produces its own inferences from the data, an external reference is required to evaluate its correctness. The NS reasoner cannot evaluate itself; we need a reliable source of expected inferences in order to measure the performance. Pellet-generated inference graphs provided this gold-standard baseline. In the updated version of the work, we no longer generate multiple per-resource graphs. Instead, we run Pellet once on the full ontology and store all inferred facts together. 
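To make the revised procedure concrete, a minimal sketch of the resulting split construction is shown below (using rdflib graphs for the original ontology G, the Pellet inferences I, and the TBox; the 80/10/10 ratio and the function name are illustrative rather than our exact pipeline):

```python
# Illustrative sketch of the split described above: G goes to the training graph,
# the Pellet inferences I are randomly distributed over train/validation/test,
# and the TBox is added to the validation and test graphs.
import random
from rdflib import Graph

def build_splits(g_original: Graph, g_inferred: Graph, g_tbox: Graph,
                 ratios=(0.8, 0.1, 0.1), seed=42):
    rng = random.Random(seed)
    g_train, g_val, g_test = Graph(), Graph(), Graph()

    # The full original ontology G is always available for training.
    for t in g_original:
        g_train.add(t)

    # Inferred triples I (excluding triples already asserted in G) are split.
    inferred = list(set(g_inferred) - set(g_original))
    rng.shuffle(inferred)
    n_train = int(ratios[0] * len(inferred))
    n_val = int(ratios[1] * len(inferred))
    for t in inferred[:n_train]:
        g_train.add(t)
    for t in inferred[n_train:n_train + n_val]:
        g_val.add(t)
    for t in inferred[n_train + n_val:]:
        g_test.add(t)

    # The TBox is added to validation and test so all splits share the schema.
    for t in g_tbox:
        g_val.add(t)
        g_test.add(t)
    return g_train, g_val, g_test
```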
Conceptually the evaluation procedure is unchanged—we still compare the NS reasoner’s inferences against a ground truth—but we now obtain this reference from a single, global inference graph rather than many smaller ones. c. Why are you using Pellet? Is Pellet a standard tool to use? Response: We are using Pellet because it is a well-established symbolic reasoner for OWL ontologies. It serves as a reliable baseline for performing standard reasoning tasks, such as consistency checking and inference generation, which allows us to validate our approach and compare results. d. Why are you getting rid of "Literal" and "owl:Thing"? It wasn’t clear to me. Response: We remove Literal and owl:Thing because they do not contribute to meaningful or informative inferences. Our focus is on entities and relationships that convey semantic content, so excluding these generic or non-informative elements helps us concentrate on the inferences that are truly relevant. 2. This is my own fault for not knowing about MRR and Hits@N, but why are we using these metrics over others? No papers are cited that show that these metrics were used anywhere else in similar tasks -- thereby making it plausible to use these metrics as a unifying standard. If there are no papers that use it, then I think there should be an explanation for why they are being used. Response: MRR and Hits@N are widely used metrics in tasks such as Class Membership and Object Property Assertions (also known as Link Prediction). The link prediction task involves identifying an entity that forms a valid fact (an edge) when combined with a given relation and another entity. a. Is it possible to give a motivating example? I've never heard of these metrics and I didn't understand what exactly they're measuring. Response: A motivating example is as follows. In link prediction, we often evaluate queries such as (Barack Obama, born_in, ?), where the model must rank candidate answers (e.g., Hawaii, Kenya, New York, …). Hits@K measures how often the correct answer appears among the top K ranked candidates—for example, if the true answer “Hawaii” is ranked within the top 10 in 85% of cases, then Hits@10 = 85%. Mean Reciprocal Rank (MRR) instead considers the exact position: if “Hawaii” is ranked 1st, its reciprocal rank is 1; if 2nd, 1/2; if 10th, 1/10; and we average this across all queries. Thus, Hits@K captures whether the model places the correct entity near the top at all, while MRR captures how close to the very top the model ranks it on average.
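In code, both metrics reduce to simple functions of the rank assigned to the correct answer for each query; the following small sketch (illustrative only, not tied to a particular library) mirrors the example above:

```python
# Illustrative computation of MRR and Hits@K from the ranks of the correct answers.
def mrr(ranks):
    """Mean Reciprocal Rank: average of 1/rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k):
    """Fraction of queries whose correct answer is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# e.g. ranks of the correct entity for three queries of the form (Barack Obama, born_in, ?)
ranks = [1, 2, 10]
print(mrr(ranks))            # (1 + 1/2 + 1/10) / 3 ≈ 0.533
print(hits_at_k(ranks, 10))  # 3/3 = 1.0
```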
b. That may be beyond the scope of the paper and cause length issues, however. Response: We do not include this in the paper itself, as these metrics are standard and well-known for ontology evaluation, but we hope our response helps clarify them. 3. I want to make sure I’m understanding what the actual “running”/”execution” of your code/dataset is: Response: The actual “execution” of the framework involves introducing noise into the ontologies using three distinct techniques. The primary goal of these experiments is to evaluate how the reasoner’s performance is affected on the two chosen ontology tasks when ABox noise is present during training. a. Is the idea that you run the reasoner and then see what class and object property assertions are generated when you introduce ABox noise into the ontologies at varying intensities? Response: Yes, the framework runs the reasoner on the ontology after adding noise at varying intensities. The generated class and object property assertions are then analyzed to measure how the reasoner’s predictions degrade as noise increases. b. Would Hits@N not count negative assertions (A isNotRel B?)? Response: Hits@N considers only the correct (positive) assertions when ranking predictions. c. I feel that more verbiage should be used to explain how exactly the Hits@N and MRR are generated; perhaps a contrived example might be used? I found the example about Richard_john_bright was helpful. Response: A detailed explanation is provided in our previous response; we do not include this in the paper itself, as these metrics are standard and well-known for ontology evaluation. 1. Regarding the results, I think they are interesting and prove the value of the work being done. However, I do not think the graphs themselves are particularly helpful at conveying the information. 2. The issue is that the unit values on the y-axes for both sets of graphs are so small that looking at the bars alone doesn't convey the drop/difference in y-values as you introduce different types of noise. 3. I think it would be helpful, potentially, to add the numerical values atop the bars themselves, so that we can see the numbers clearly and infer results using the numbers, rather than needing to read the corresponding paragraphs. Response: Following the recommendation, we have updated Figures 2-5 in the Results section to improve clarity and better convey the trends. Specifically, we added the numerical values atop the bars so that readers can more easily interpret the results without relying solely on the text. 4. Secondly, flipping between the pages (or going up and down on the computer pdf viewer) for class and property assertions for given reasoners and datasets makes it difficult to process the data. 5. I think it may be helpful to structure the graphs like so, so there are 2 - 4 on a page. Response: We have updated the layout of Figures 2-5 as suggested by the reviewer, improving readability and presentation of the results. a. Random Noise -- Owl2Vec | Random Noise -- Box2El b. Statistical Noise -- Owl2Vec | Statistical Noise -- Box2El c. Logical Noise -- Owl2Vec | Logical Noise - Box2El 6. In each graph, you can show the Class and Property assertion values for each dataset as noise is being varied. This collects the results according to the reasoners + noise, and shows the effects of noise on each of the tasks more clearly. Response: We have updated Figures 2 to 5 to display Class and Property assertion values for each dataset as noise is varied, allowing clearer visualization of the effects of noise on each reasoning task. 7. You will still have many graphs, but it will be easier to find information. For example, you describe the MRR for class and property object assertions decreasing as diff types of noise are introduced. You describe the effect of logical noise. Then the reader can look at the Logical Noise - Owl2Vec and Logical Noise -- Box2El graphs and see clearly the effects the noise has for both class and property assertions by seeing these graphs side by side. 8. By reformatting the graphs to have the numerical values of the MRR displayed as well as collecting the results in a slightly different way, it makes it much easier for the reader to read the data and then read the paragraphs explaining the data. As it is now, it's hard to read the values discussed in the Results paragraph and then try and verify by looking at the graphs.
I would also like to note that I had issues with meson’s build process for openblas, I think, when trying to clone the repo to verify the results that the authors had found. The issue came about when running `pip install -r requirements.txt`. Therefore I could not replicate their results. If the authors can revise their codebase and retest it to make sure these results are easily generatable, that would go a long way towards replicability! Response: We have reworked the entire repository to make the results reproducible. Please don’t hesitate to reach out if you still encounter any issues.

Tags: 

  • Under Review