NSORN: Designing a Benchmark Dataset for Neurosymbolic Ontology Reasoning with Noise

Tracking #: 912-1929

Flag: Review Assignment Stage

Authors: 

Julie Loesch
Gunjan Singh
Raghava Mutharaju
Remzi Celebi

Responsible editor: 

Guest Editors Neurosymbolic AI and Ontologies 2024

Submission Type: 

Article in Special Issue (note in cover letter)

Cover Letter: 

RESUBMIT of #818-1810. Dear Editors, We sincerely thank the reviewers for their valuable time and insightful feedback on our manuscript, "NSORN: Designing a Benchmark Dataset for Neurosymbolic Ontology Reasoning with Noise". Their constructive comments have greatly helped us enhance the clarity and accuracy of our work. Below, we address each of the reviewers’ remarks in detail and highlight the corresponding changes made in the revised version of the paper. **Responses to Reviewer 1** - The introduction of new datasets for testing neurosymbolic AI methods is very welcome, we need standardized tasks that are challenging and representative. I also appreciate the focus on robustness to noise. As the paper states, purely symbolic systems are brittle to noise, so it is interesting to measure this. However, the other related question -- not addressed in the paper -- is at least as interesting: adding logic to a neural system helps to deal with noise by pointing out inconsistencies in the data. Response: We appreciate the comment from the reviewer. Integrating logical reasoning into neural systems can help identify and mitigate inconsistencies in the data, which is indeed a compelling benefit. While our current work focuses on evaluating reasoning over existing ontologies and the effects of noise, exploring how logic can actively guide neural systems to handle noisy or inconsistent data is an exciting direction for future work. - Correctness: The work appears correct. One remark though: real-world KGs/ontologies are already notoriously noisy and incomplete and this existing noise is not accounted for; and the additional random noise could actually be true, just missing. The two datasets the authors evaluate on might be relatively complete and noise free (for Family this is definitely possible), if so, stating this explicitly would lead to a stronger experimental section. Maybe the idea of the paper would work best with a fully synthetic dataset where you can first generate an ontology that is easily learnable, complete and noise free, and then add noise to see how performance deteriorates in a controlled setting. Response: We agree that real-world knowledge graphs and ontologies are often noisy and incomplete. To address this concern, we included an additional ontology, Pizza, for which we developed a synthetic ABox generator, further supporting controlled experimentation under well-defined conditions. We added the following to Section 4.1 (first paragraph): “Using the Pizza ontology, we created an ABox generator to support experiments with synthetic data. The process for generating ABox data for the Pizza ontology begins by loading the Pizza TBox (Terminological Box) axioms. A custom instance generation step then automatically creates a specified number of individuals (ABox data), and their object properties are defined in a configuration. For this study, only the NamedPizza class and the hasTopping property are described in the configuration. Crucially, this generation leverages the TBox's inherent OWL restrictions (e.g., only or some constraints) to dynamically determine the appropriate target classes for object properties, thereby guaranteeing that the generated ABox is semantically consistent with the ontology's definition. The final output is the complete ontology, comprising the original TBox and the newly populated ABox. In this study, we developed two datasets, Pizza_100 and Pizza_250, comprising 100 and 250 pizza instances, respectively.”
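To make this description concrete for the reviewers, a minimal sketch of such a TBox-driven ABox generator is shown below. It is an illustration only (using owlready2; the file name, helper functions, and the handling of class expressions are simplifications), not the exact code in our repository:

```python
# Illustrative sketch: populate an ABox from the Pizza TBox, using the TBox's
# 'some'/'only' restrictions on hasTopping to pick semantically valid toppings.
import random
from owlready2 import get_ontology, Restriction, SOME, ONLY

onto = get_ontology("pizza.owl").load()            # Pizza TBox
NamedPizza = onto.search_one(iri="*NamedPizza")    # class named in the configuration
hasTopping = onto.search_one(iri="*hasTopping")    # property named in the configuration

def allowed_toppings(pizza_cls):
    """Topping classes permitted by the 'some'/'only' restrictions of pizza_cls."""
    targets = []
    for sup in pizza_cls.is_a:
        if (isinstance(sup, Restriction) and sup.property == hasTopping
                and sup.type in (SOME, ONLY) and isinstance(sup.value, type)):
            targets.append(sup.value)      # plain named classes only; unions are skipped here
    return targets

def generate_abox(n_individuals, prefix="pizza"):
    pizza_classes = list(NamedPizza.subclasses())
    for i in range(n_individuals):
        cls = random.choice(pizza_classes)
        pizza = cls(f"{prefix}_{i}")       # new ABox individual
        for j, topping_cls in enumerate(allowed_toppings(cls)):
            pizza.hasTopping.append(topping_cls(f"{prefix}_{i}_topping_{j}"))

generate_abox(100)                  # Pizza_100; use 250 for Pizza_250
onto.save("pizza_100.owl")          # original TBox plus the newly populated ABox
```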
In addition, we now used the same ABox generator logic for OWL2Bench, which is fully synthetic and based on the UOBM ontology, an extension of LUBM, and provides a controlled and noise-free environment. We clarified this point in Section 4.1 (third paragraph): “OWL2Bench was developed as an extension of the well-known University Ontology Benchmark (UOBM), producing four distinct TBoxes—one for each OWL 2 profile. For this study, we used the TBox of OWL2Bench1-DL, expressed in the OWL DL profile. We generated a relatively smaller ABox than those in the OWL2Bench dataset as our work aims to assess the reasoning capabilities of neurosymbolic systems rather than their scalability. Our ABox generation for this ontology was restricted to the classes University, Department, Person, and Course, along with key object properties such as hasDepartment, hasDoctoralDegreeFrom, teachesCourse, and takesCourse.” - Presentation: the work is clearly presented but assumes background knowledge. NAI is broader than just ontologies so a brief primer on ontologies, in particular on the concepts of TBox and ABox, would be a good addition. Response: We added a brief definition to the Introduction (first paragraph): “An ontology is a formal and explicit specification of a shared conceptualization of a domain. It describes the concepts, categories, relationships, and rules that structure knowledge within that domain, facilitating common understanding and interoperability among systems and people [https://www.sciencedirect.com/science/article/pii/S1042814383710083]. Ontologies distinguish between the Terminological Box (TBox), which defines classes, along with their relationships and restrictions, and the Assertional Box (ABox), which contains assertions about individuals such as property assertions or class memberships.” - For the statistical approach, please clarify whether the "lowest probability score" is re-evaluated after adding every noise triple; or whether the k triples are identified at once. The latter approach may introduce related triples? Response: In the statistical approach, the k triples are identified all at once based on the calculated probability scores, before any noise is added. We did not re-evaluate the scores after each noise triple. You are correct that this could lead to related triples being selected, which is a known limitation of this method. We opted for this approach for simplicity and reproducibility, but dynamically updating probabilities after each addition could be an interesting extension to explore. - Smaller comments: The examples in the methodology for logical noise are helpful. Showing noise evolution with a lineplot would be more appropriate than bars. Response: To better illustrate the noise evolution, we have replaced the bar plots with line plots in Figures 2–5 of the Results section. - Line 29 on page 6 abruptly changes the topic without introducing what is going to be discussed. Response: We added an introductory sentence: “The majority of prior research has concentrated on ontology completion tasks (i.e., prediction) rather than on ontology reasoning tasks (i.e., inference) [https://arxiv.org/abs/2507.14334].
Ontology or link completion involves identifying plausible relations that enrich the original ontology, as demonstrated in the study by Chen et al. [https://arxiv.org/abs/2009.14654]. In prediction tasks, the training, validation, and testing datasets are typically created by randomly splitting the ontology axioms.” - Two weaknesses are: 1. Completeness, novelty and motivation: The three introduced techniques are not sophisticated, which is not necessarily bad, but their discussion is lacking; I.e., I expected a more thorough comparison on the effect of the techniques, and a better motivation as to why these noise types are relevant. The work mentions "many other types of axioms and noise patterns merit investigation". These should arguably be discussed as well, such that the choice of investigated axioms can be better positioned and motivated. Related, the work claims "real-world datasets often contain errors, inconsistencies, or irrelevant information.", which raises the question as to why we still need to add noise instead of focus on real-world data. Furthermore, to introduce noise to synthetic data, it is unclear whether the proposed types of noise are at all representative for the noise that is present in real-world datasets. Response: We thank the reviewer for the insightful comments and for highlighting the need for stronger motivation and positioning of the chosen noise types. Our study aims to investigate how different forms of noise affect neural and logical systems, recognizing that reasoning can be undermined by inconsistent or irrelevant information. While the three noise-injection techniques we propose are intentionally simple, their design is motivated by common categories of errors observed in real-world knowledge bases. Similar to prior work [https://ojs.aaai.org/index.php/AAAI/article/view/28729], we define three types of noise: (i) random noise, (ii) logical noise, and (iii) statistical noise. Random noise does not depend on the data, representing unpredictable, accidental errors. Logical noise is more realistic, as many real-world errors stem from semantic confusion or violations of ontological constraints. Statistical noise is adversarially generated due to bias in automated KG construction models. Since error-detection models are often employed to identify mistakes in automatically constructed knowledge graphs, we use Graph Neural Networks to generate adversarial noise and to evaluate a model’s ability to detect such errors during KG construction. We acknowledge that our techniques are not exhaustive or highly sophisticated. The intent was to provide a first step in systematically investigating these effects. Comprehensive exploration of other logical axiom types and noise patterns is indeed an important direction for future work. Similarly, while it is true that learning from real-world datasets containing errors is ideal, publicly available datasets that contain structured errors suitable for controlled experiments are rare (e.g., Wikidata’s TBox is not semantically rich enough for our purposes). Introducing synthetic noise allows us to systematically study its effects in a controlled manner. We added the following to the Introduction (fourth paragraph) to motivate the choice of the noise injection techniques: "Random noise serves as a baseline, representing data-agnostic, unpredictable errors that may arise accidentally in real-world ontologies. We simulate this by corrupting existing triples—replacing either the subject or the object with a random entity. 
This allows us to probe the robustness of reasoning processes against general perturbations that do not depend on the underlying data. Statistical noise is generated adversarially using Graph Neural Networks (GNNs), reflecting low-probability links that emerge from predictive uncertainty or bias in automated knowledge graph construction. Although synthetic, this form of noise models realistic mistakes produced by machine-learning systems, and mirrors the types of errors that error-detection models are typically asked to identify during KG construction. Logical noise captures violations of semantic constraints, such as disjointness axioms or domain and range restrictions. Because many real-world ontology errors stem from semantic confusion rather than random corruption, this type of noise directly stresses the logical structure of the ontology and provides a more targeted challenge to reasoning systems. By combining these three types of noise, we aim to cover a spectrum of potential real-world errors, from accidental and statistically plausible mistakes to deliberate logical conflicts." - 2. The results of the experimental section are unclear to me: - I like the relative measurement of noise but what does 100% noise mean? That there is an equal number of noisy assertions compared to ground-truth assertions? This is confusing, with 100% noise I expect everything to be noise and nothing to be learnable. Response: 100% noise means that the number of added noisy assertions equals the number of original assertions, effectively doubling the total. We clarified this definition in the Results section in the figure captions (Figures 2-5). - The effect of adding noise seems small and the performance does not consistently decrease. In the case of statistical noise, the MRR even improves for Family and OWL2Vec? There seems to be something wrong there. Response: Statistical noise is introduced through GNN-generated triples. Unlike random noise, these GNN-generated triples tend to preserve some degree of structural and relational plausibility. As a result, the injected noise can occasionally reinforce existing local patterns in the graph instead of disrupting them. This can lead to slight performance increases in some configurations, which is consistent with the idea that the GNN model captures latent regularities that are beneficial for the embedding method. - It would be interesting to also benchmark a purely neural and a purely symbolic reasoner to show that this is a setting where NAI is useful. Response: We added experiments with a purely neural approach based on Graph Neural Networks. Purely symbolic reasoning is impractical: the symbolic space is too large, and the ontology contains inconsistencies that introduce logical noise, making it infeasible to execute symbolic reasoning reliably. These findings motivate our use of NAI, which combines the strengths of both paradigms. **Responses to Reviewer 2** - Artificiality of Noise Injection: While the paper introduces a clear and reproducible method for noise generation, some of the injected noise, particularly the statistical noise derived from low-probability GNN predictions, appears too "easy" or synthetic. Real-world ontologies often contain more adversarial or semantically subtle noise. The study would benefit from incorporating a broader spectrum of noise severity, including some manually constructed, semantically plausible errors that challenge different aspects of the reasoning process. 
Response: We appreciate the reviewer’s insightful comment regarding the nature of the injected noise. We agree that exploring a broader spectrum of noise, including manually constructed, semantically plausible errors, would indeed provide additional valuable insights into reasoning robustness. Our primary goal in this work was to systematically study reasoning robustness under different types of noise through three complementary, well-defined noise injection strategies. To better motivate these choices, we have added an explanation to the Introduction (fourth paragraph); it is quoted in full in our response to Reviewer 1 above and covers the rationale behind random, statistical, and logical noise. We fully agree that extending our framework to include semantically subtle or adversarially crafted noise is an important direction for future work, and we explicitly note this as such. - Lack of Statistical Rigor in Results: The experimental section would be strengthened by the inclusion of error bars or confidence intervals to better reflect variability across runs and support claims about noise effects. Though the authors mention averaging over five runs, visual indicators of variance are missing in the main figures, limiting the statistical interpretability of trends. Response: We included boxplots in the appendix (Figures 6–9) to show the full distribution of results over the five runs. These visual indicators allow readers to assess variability and the statistical reliability of the observed trends, complementing the reported averages. - Limited Qualitative Analysis: The results focus on numerical performance metrics but omit qualitative insights into how specific examples of noise affect inference outcomes. Including a few illustrative examples where reasoning fails (or surprisingly succeeds) under noise would help ground the quantitative findings and offer readers more interpretability into the models' failure modes. Response: We appreciate the reviewer’s valuable suggestion. We have added an illustrative example of ABox reasoning under noise in Figure 1. While we agree that additional qualitative examples would further enrich the analysis, time constraints prevent us from including more in this revision.
We plan to incorporate more detailed cases of reasoning successes and failures under noise in future work to complement the quantitative results. - Some minor concerns: --On page 2, use the authors' names, not [1] et al. presented. --Sometimes closed quote is used instead of open quote e.g., on page 10, line 44 (as well as in a couple of other places). Response: We have addressed these minor issues throughout the paper: references to authors are now written using their names instead of “[1] et al.,” and quotation marks have been corrected where needed (e.g., page 10, line 44). **Responses to Reviewer 3** - The main motivation described in their work on creating a benchmark is not quite fulfilled, since the practicality of the creation of the benchmark currently lacks experiments and datasets that they have tested on. The evaluation is restricted to only two ontologies and two reasoners. This limits the generalizability of the findings. Response: In response, we reran the experiments, included an additional purely neural approach based on Graph Neural Networks to broaden the evaluation, and added a third ontology, Pizza. However, several practical constraints remain: (i) Ontologies: While many ontologies exist, most either have very small ABoxes or extremely large TBoxes (e.g., GeneOntology), posing practical challenges for systematic evaluation. (ii) Reasoners: For neurosymbolic reasoning, only a few approaches are readily applicable, such as random-walk-based methods (OWL2Vec*) and geometric-space methods (Box2EL). Many other existing techniques are not readily available or remain research prototypes, which further limits evaluation options. Despite these limitations, our benchmark provides a reproducible and extensible framework for systematic evaluation. It allows future work to incorporate additional ontologies or reasoning techniques as they become available, ensuring the benchmark remains a meaningful tool for assessing reasoning robustness. - The authors acknowledge that "specific characteristics of each ontology significantly influence the effectiveness of noise injection," yet fail to adequately address this through broader experimentation. Response: We acknowledge that the specific characteristics of each ontology, including the types of inferences and commonly used axioms, play a significant role in the effectiveness of noise injection. For instance, in the Pizza ontology, many inferences involve subproperty, inverse property, or functional axioms. In future work, we plan to conduct a more systematic and detailed analysis of these inference patterns across different ontologies to better understand their impact on noise robustness. - The results show different patterns across ontologies, with no consistent trend and focus exclusively on ABox noise. Since TBox noise is common in real-world settings, it limits the applicability of the benchmark to certain scenarios. Response: In this study, we focus on ABox noise to systematically evaluate its effects on reasoning, as it represents a common first step in ontology noise research. We acknowledge that TBox noise is important for real-world ontologies and can impact reasoning in different ways; however, we concentrate on ABox noise because the ABox is more central to many real-world knowledge graphs. Extending our benchmark to include TBox noise is an important direction for future work, which would broaden the applicability of our approach and enable a more comprehensive evaluation across multiple types of ontology inconsistencies.
- In addition, since the baseline performance scores are very low and, as they mentioned, "it is difficult to identify any clear trend, as the values are already low, even without the introduction of noise," it raises questions about the suitability of the chosen tasks and datasets without expanding their evaluation scope. Response: We reran the experiments, adding a purely neural baseline based on Graph Neural Networks and a third ontology, Pizza, to broaden the evaluation scope. We also refined our data splitting strategy. Specifically, we updated Section 4.1 (third paragraph) as follows: “Let G denote the original ontology and I the ontology inferred using the Pellet reasoner [https://www.sciencedirect.com/science/article/abs/pii/S1570826807000169]. Since our approach is unsupervised, the graph G is ultimately added to G_train, while I is randomly assigned to G_train, G_test and G_val. The TBox is further added to G_test and G_val, ensuring that the reasoning tasks are based on a shared conceptual framework.” Previously, we did not include any inferences in the training set. This was incorrect, as excluding inferences led to incomplete graph representations during training and an inconsistent distribution between training and evaluation sets. Furthermore, our study highlights that most previous work has mainly focused on ontology completion (i.e., prediction tasks), whereas our emphasis is on ontology reasoning, a more challenging inference task. This naturally results in lower baseline scores, as reasoning requires multi-step logical deductions rather than simpler predictions. Additionally, the scores for Object Property Assertions (OPA) are sometimes low due to the nature of the test sets. For example, in the Family ontology, OPA triples constitute over 98% of the test set. Since many of these OPAs arise from multi-step inferences produced by Pellet, the test set is dominated by structurally complex, inference-heavy triples. This makes the reasoning task inherently difficult and causes all models to exhibit low OPA performance even without noise. - The authors should include additional ontologies from different domains and complexity levels and evaluate more neurosymbolic reasoners to establish generalizable patterns; otherwise, it is unclear how practitioners should use this benchmark and interpret results for improving reasoner robustness. Including traditional symbolic reasoners in the evaluation would better contextualize the performance of the neurosymbolic approaches. Response: To broaden the evaluation, we added an additional ontology, Pizza, and a purely neural baseline based on Graph Neural Networks to cover different domains and structural complexity. While traditional symbolic reasoners could help contextualize performance, they are generally unable to handle noise, which is a key aspect of our benchmark. We note practical constraints: many ontologies either have very small ABoxes or extremely large TBoxes (e.g., GeneOntology), and only a few neurosymbolic reasoners (e.g., OWL2Vec*, Box2EL) are readily applicable, while others remain research prototypes. Despite these limitations, our benchmark provides a reproducible and extensible framework, allowing future work to incorporate additional ontologies and reasoning techniques as they become available, ensuring it remains a useful tool for evaluating reasoning robustness. **Responses to Reviewer 4** 1. In Section 2.1, the paper brings up Henry Kautz’s categorization scheme of different types of reasoners.
Adding a diagram/picture with a list of those categories as well as key examples, as you do for the two relevant categories in Section 2.1, will provide necessary additional context. Response: We added the definitions of those categories, as well as examples, at the beginning of Section 2.1. 2. The reason that I have made this point is that Section 2.1 does not clearly summarize the differences between Box2El and OWL2Vec. I cannot determine any concrete differences between the two, though the author spends 2 paragraphs discussing each reasoner/embedding method. Adding an additional diagram or an additional concluding paragraph that recapitulates the key diffs between the tasks that these embeddings do well at (and the fact these tasks are different), as well as also stating that the datasets and metrics are different between these tasks, helps motivate section 2.2 clearly. Response: We added Table 1 at the end of Section 2.1 that highlights the key differences between OWL2Vec* and Box2El. 3. In Section 2.2, the authors are vague in describing the differences between Makni et al, and Ebrahimi et al. Both sets of authors are trying to give metrics for the effectiveness of RDFS entailment reasoning. But what are these metrics and how do they differ? Giving a concrete example will make it clear for the reader why, even when dealing with just one task, there is such difference and variety in metrics, and therefore motivate the need for this dataset/benchmark you are developing. Response: The idea was to provide an example of why we need to have standardized evaluations (including metrics and datasets). We added the following: “Specifically, Makni et al. [https://semantic-web-journal.net/system/files/swj1866.pdf] used LUBM and a scientist dataset derived from DBpedia as benchmarks, evaluating performance with Precision, Recall, and F1 score. In contrast, Ebrahimi et al. [https://arxiv.org/abs/2106.09225] employed LUBM and synthetic data, using exact matching accuracy as their metric.” 1. In 3.1.1, in the subsection "Introducing Noise", you say that you add "k" individuals to the ontology. Does this mean that the individuals do not currently exist in the ontology? I.e, with John rdf:type Male and John rdf:type Female, I should assume John is one of the k individuals and that John does not exist in the ontology to begin with? Response: For logical noise, we consider both existing and new individuals. We first select individuals already present in the ontology; if additional examples are needed to reach a desired noise level, we introduce new, fictional individuals. Disjoint class and disjoint property axioms are used to create inconsistencies by assigning individuals to either two disjoint classes or properties. We agree that the original phrasing “we added k individuals to the ontologies” may be misleading. Thus, Section 3.1.1 has been updated for greater clarity. In addition, we have added an illustrative example of ABox reasoning under noise in Figure 1. a. Does this not contradict the line in the introduction: "While ABox noise [which these techniques are about introducing ABox noise] is about corrupting an existing triple in an ontology by changing one of the triples' resources"? I can’t tell if ABox noise is about corrupting the individual or adding new individuals or both. Response: ABox noise refers to the introduction of inconsistencies or corrupted triples into an ontology. This can occur either by modifying existing triples or by adding new, noisy ones. In our approach, we generate ABox noise (both random and statistical) by corrupting the subject or the object of existing triples and then adding the resulting modified triples as new entries in the ontology.
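As an illustration of these two noise-injection steps, the following minimal sketch (using rdflib; the graph, the list of disjoint class pairs, and the helper names are assumptions made for the example rather than our exact implementation) shows how corrupted triples and disjointness violations can be added to an ABox:

```python
# Minimal illustration of ABox noise injection (not the exact repository code).
import random
from rdflib import Graph, RDF, URIRef

def add_random_noise(g: Graph, k: int, seed: int = 0) -> None:
    """Corrupt k existing triples by replacing the subject or the object with a
    random entity, then add the corrupted triples as new assertions."""
    rng = random.Random(seed)
    triples = [t for t in g if isinstance(t[2], URIRef)]   # object-property-style triples
    entities = list({s for s, _, _ in triples} | {o for _, _, o in triples})
    for s, p, o in rng.sample(triples, k):
        if rng.random() < 0.5:
            s = rng.choice(entities)        # corrupt the subject
        else:
            o = rng.choice(entities)        # corrupt the object
        g.add((s, p, o))                    # the noisy triple is added; the original is kept

def add_logical_noise(g: Graph, disjoint_class_pairs, k: int, seed: int = 0) -> None:
    """Create k inconsistencies by asserting membership in two disjoint classes,
    reusing existing individuals where possible and minting new ones otherwise."""
    rng = random.Random(seed)
    individuals = list(set(g.subjects(RDF.type, None)))
    for i in range(k):
        cls_a, cls_b = rng.choice(disjoint_class_pairs)      # e.g. (:Male, :Female)
        ind = rng.choice(individuals) if individuals else \
            URIRef(f"http://example.org/noise_individual_{i}")
        g.add((ind, RDF.type, cls_a))
        g.add((ind, RDF.type, cls_b))
```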
b. Should I assume that you take existing individuals/triples in the ontology and make them violate the disjoint axioms, or that you add individuals (who do not exist in the ontology at all) that disobey the constraints? Response: We first select existing individuals and assign them to either disjoint classes or disjoint properties. If additional examples are needed to reach a specific noise level, or if the ontology lacks suitable candidates, we introduce new, fictional individuals to achieve the desired level of inconsistencies. c. Perhaps rephrasing to make this explicit will make it clearer, especially given that you say this in the description of Logical Noise: "We introduce noise... by assigning an individual to two disjoint classes.", and you also state in 3.3 that you "corrupt either the object of the subject of existing triples". Response: To address this, we have revised Section 3 entirely to clarify how individuals and triples are used in noise generation, including the distinction between modifying existing triples and introducing new, inconsistent ones. We hope this makes the methodology clearer. 2. In 3.2, I'm curious as to what would have happened had you added the triples with the low probability assertions rather than modifying existing triples to have the low probability assertions. Doing both and comparing and contrasting the effects they have would better cover all the types of contradictions that could occur, would it not? Or is the assumption that the low probability assertions would directly contradict with the current assertions that the triples-to-be-modified have? Response: We focused on modifying existing triples because our task is link prediction—i.e., predicting the missing entity in patterns like (?, predicate, object) or (subject, predicate, ?). By introducing low-probability assertions into existing triples, we ensure that the contradictions directly interact with the knowledge already present in the graph, which is the context in which link prediction operates. That said, we do in fact “add” these modified triples back into the graph. So while our emphasis was on altering existing triples for consistency with the prediction task, the framework naturally accommodates the presence of these newly modified triples. 3. Were there any considerations/scheme taken in 3.3 to figure out WHICH triples were going to be corrupted? I assume that triples/objects that appear more in a dataset (a particular person, for example, may have more triples than another person), if corrupted, would introduce more random noise than a person/object that only appears once as a triple. Response: The reviewer is correct that corrupting triples associated with frequently occurring entities could introduce disproportionately more noise than corrupting triples of infrequent entities. However, in our experiments, we deliberately kept the corruption process completely random. This choice ensures a clean comparison between this form of random noise injection and other techniques (statistical and logical noise), without introducing additional biases from a targeted selection scheme. 1. The paragraph that begins with "Let G denote the original ontology, and I the ontology inferred..." needs to be reworked to more clearly explain what is being done and why it is being done.
I will add additional comments below: Response: In much of the prior work, ontologies are simply split into training, validation, and test sets using a standard ratio (e.g., 80/10/10). While this is suitable for ontology completion tasks, it does not reflect the requirements of ontology reasoning. Our focus in this paper is specifically on reasoning. To evaluate this, we employed neurosymbolic reasoners (Box2EL and OWL2Vec*). However, in order to test these methods meaningfully, we first needed a reliable ground truth set. For this, we used Pellet, a well-known symbolic reasoner, to generate inferences. An introductory sentence was added to motivate why this step is necessary: “The majority of prior research has concentrated on ontology completion tasks (i.e., prediction) rather than on ontology reasoning tasks (i.e., inference) [https://arxiv.org/abs/2507.14334]. Ontology or link completion involves identifying plausible relations that enrich the original ontology, as demonstrated in the study by Chen et al. [https://arxiv.org/abs/2009.14654]. In prediction tasks, the training, validation, and testing datasets are typically created by randomly splitting the ontology axioms.” a. As far as I can tell, you are trying to modify the ontologies to be consistent in terms of hop length for all possible resources R within the original ontology. I.e, you are taking the subgraph for each resource R, and making it so that any statement/assertions are at most 2 hops away from R?, and then reconstituting the general graph this way into a modified ontology? Response: We have removed this part from the paper because we now operate directly on the full ontology. The earlier approach was introduced purely for computational efficiency. Specifically, for each resource R, we previously extracted a 2-hop subgraph that contained all statements and assertions reachable within two hops of R. This ensured a uniform hop length across resources and, more importantly, allowed us to run the Pellet reasoner on many small subgraphs rather than on the full ontology, which was significantly faster given the size of the original dataset. In the current version of the work, however, we no longer perform this transformation: we reason over the complete ontology using Pellet. This simplifies the preprocessing pipeline and removes the need for generating per-resource subgraphs. b. Why do you need to make the inference graphs i1, i2... iR? I thought that it is the NS reasoner's job (the one you are testing, not Pellet) to make these inference graphs for whatever assertion/inference you are trying to test for a given dataset with some noise. Response: We have removed this part from the current version of the paper, but let us clarify the motivation behind it. In our earlier experiments, for each resource R, we extracted a subgraph g_R and then used a standard DL reasoner (Pellet) to compute the corresponding inference graph i_R. These inference graphs served as ground-truth reference outputs. While the NS reasoner indeed produces its own inferences from the data, an external reference is required to evaluate its correctness. The NS reasoner cannot evaluate itself; we need a reliable source of expected inferences in order to measure the performance. Pellet-generated inference graphs provided this gold-standard baseline. In the updated version of the work, we no longer generate multiple per-resource graphs. Instead, we run Pellet once on the full ontology and store all inferred facts together. 
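To make the revised procedure concrete, a minimal sketch of the resulting split construction is shown below (using rdflib graphs for the original ontology G, the Pellet inferences I, and the TBox; the 80/10/10 ratio and the function name are illustrative rather than our exact pipeline):

```python
# Illustrative sketch of the split described above: G goes to the training graph,
# the Pellet inferences I are randomly distributed over train/validation/test,
# and the TBox is added to the validation and test graphs.
import random
from rdflib import Graph

def build_splits(g_original: Graph, g_inferred: Graph, g_tbox: Graph,
                 ratios=(0.8, 0.1, 0.1), seed=42):
    rng = random.Random(seed)
    g_train, g_val, g_test = Graph(), Graph(), Graph()

    # The full original ontology G is always available for training.
    for t in g_original:
        g_train.add(t)

    # Inferred triples I (excluding triples already asserted in G) are split.
    inferred = list(set(g_inferred) - set(g_original))
    rng.shuffle(inferred)
    n_train = int(ratios[0] * len(inferred))
    n_val = int(ratios[1] * len(inferred))
    for t in inferred[:n_train]:
        g_train.add(t)
    for t in inferred[n_train:n_train + n_val]:
        g_val.add(t)
    for t in inferred[n_train + n_val:]:
        g_test.add(t)

    # The TBox is added to validation and test so all splits share the schema.
    for t in g_tbox:
        g_val.add(t)
        g_test.add(t)
    return g_train, g_val, g_test
```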
Conceptually the evaluation procedure is unchanged—we still compare the NS reasoner’s inferences against a ground truth—but we now obtain this reference from a single, global inference graph rather than many smaller ones. c. Why are you using Pellet? Is Pellet a standard tool to use? Response: We are using Pellet because it is a well-established symbolic reasoner for OWL ontologies. It serves as a reliable baseline for performing standard reasoning tasks, such as consistency checking and inference generation, which allows us to validate our approach and compare results. d. Why are you getting rid of "Literal" and "owl:Thing"? It wasn’t clear to me. Response: We remove Literal and owl:Thing because they do not contribute to meaningful or informative inferences. Our focus is on entities and relationships that convey semantic content, so excluding these generic or non-informative elements helps us concentrate on the inferences that are truly relevant. 2. This is my own fault for not knowing about MRR and Hits@N, but why are we using these metrics over others? No papers are cited that show that these metrics were used anywhere else in similar tasks -- thereby making it plausible to use these metrics as a unifying standard. If there are no papers that use it, then I think there should be an explanation for why they are being used. Response: MRR and Hits@N are widely used metrics in tasks such as Class Membership and Object Property Assertions (also known as Link Prediction). The link prediction task involves identifying an entity that forms a valid fact (an edge) when combined with a given relation and another entity. a. Is it possible to give a motivating example? I've never heard of these metrics and I didn't understand what exactly they're measuring. Response: A motivating example is as follows. In link prediction, we often evaluate queries such as (Barack Obama, born_in, ?), where the model must rank candidate answers (e.g., Hawaii, Kenya, New York, …). Hits@K measures how often the correct answer appears among the top K ranked candidates—for example, if the true answer “Hawaii” is ranked within the top 10 in 85% of cases, then Hits@10 = 85%. Mean Reciprocal Rank (MRR) instead considers the exact position: if “Hawaii” is ranked 1st, its reciprocal rank is 1; if 2nd, 1/2; if 10th, 1/10; and we average this across all queries. Thus, Hits@K captures whether the model places the correct entity near the top at all, while MRR captures how close to the very top the model ranks it on average.
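In code, both metrics reduce to simple functions of the rank assigned to the correct answer for each query; the following small sketch (illustrative only, not tied to a particular library) mirrors the example above:

```python
# Illustrative computation of MRR and Hits@K from the ranks of the correct answers.
def mrr(ranks):
    """Mean Reciprocal Rank: average of 1/rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k):
    """Fraction of queries whose correct answer is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# e.g. ranks of the correct entity for three queries of the form (Barack Obama, born_in, ?)
ranks = [1, 2, 10]
print(mrr(ranks))            # (1 + 1/2 + 1/10) / 3 ≈ 0.533
print(hits_at_k(ranks, 10))  # 3/3 = 1.0
```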
b. That may be beyond the scope of the paper and cause length issues, however. Response: We do not include this in the paper itself, as these metrics are standard and well-known for ontology evaluation, but we hope our response helps clarify them. 3. I want to make sure I’m understanding what the actual “running”/”execution” of your code/dataset is: Response: The actual “execution” of the framework involves introducing noise into the ontologies using three distinct techniques. The primary goal of these experiments is to evaluate how the reasoner’s performance is affected on the two chosen ontology tasks when ABox noise is present during training. a. Is the idea that you run the reasoner and then see what class and object property assertions are generated when you introduce ABox noise into the ontologies at varying intensities? Response: Yes, the framework runs the reasoner on the ontology after adding noise at varying intensities. The generated class and object property assertions are then analyzed to measure how the reasoner’s predictions degrade as noise increases. b. Would Hits@N not count negative assertions (A isNotRel B?)? Response: Hits@N considers only the correct (positive) assertions when ranking predictions. c. I feel that more verbiage should be used to explain how exactly the Hits@N and MRR are generated; perhaps a contrived example might be used? I found the example about Richard_john_bright was helpful. Response: A detailed explanation is provided in our previous response; we do not include this in the paper itself, as these metrics are standard and well-known for ontology evaluation. 1. Regarding the results, I think they are interesting and prove the value of the work being done. However, I do not think the graphs themselves are particularly helpful at conveying the information. 2. The issue is that the unit values on the y-axes for both sets of graphs are so small that looking at the bars alone doesn't convey the drop/difference in y-values as you introduce different types of noise. 3. I think it would be helpful, potentially, to add the numerical values atop the bars themselves, so that we can see the numbers clearly and infer results using the numbers, rather than needing to read the corresponding paragraphs. Response: Following the recommendation, we have updated Figures 2-5 in the Results section to improve clarity and better convey the trends. Specifically, we added the numerical values atop the bars so that readers can more easily interpret the results without relying solely on the text. 4. Secondly, flipping between the pages (or going up and down on the computer pdf viewer) for class and property assertions for given reasoners and datasets makes it difficult to process the data. 5. I think it may be helpful to structure the graphs like so, so there are 2 - 4 on a page. Response: We have updated the layout of Figures 2-5 as suggested by the reviewer, improving readability and presentation of the results. a. Random Noise -- Owl2Vec | Random Noise -- Box2El b. Statistical Noise -- Owl2Vec | Statistical Noise -- Box2El c. Logical Noise -- Owl2Vec | Logical Noise - Box2El 6. In each graph, you can show the Class and Property assertion values for each dataset as noise is being varied. This collects the results according to the reasoners + noise, and shows the effects of noise on each of the tasks more clearly. Response: We have updated Figures 2 to 5 to display Class and Property assertion values for each dataset as noise is varied, allowing clearer visualization of the effects of noise on each reasoning task. 7. You will still have many graphs, but it will be easier to find information. For example, you describe the MRR for class and property object assertions decreasing as diff types of noise are introduced. You describe the effect of logical noise. Then the reader can look at the Logical Noise - Owl2Vec and Logical Noise -- Box2El graphs and see clearly the effects the noise has for both class and property assertions by seeing these graphs side by side. 8. By reformatting the graphs to have the numerical values of the MRR displayed as well as collecting the results in a slightly different way, it makes it much easier for the reader to read the data and then read the paragraphs explaining the data. As it is now, it's hard to read the values discussed in the Results paragraph and then try and verify by looking at the graphs.
I would also like to note that I had issues with meson’s build process for openblas, I think, when trying to clone the repo to verify the results that the authors had found. The issue came about when running `pip install -r requirements.txt`. Therefore I could not replicate their results. If the authors can revise their codebase and retest it to make sure these results are easily generatable, that would go a long way towards replicability! Response: We have reworked the entire repository to make the results reproducible. Please don’t hesitate to reach out if you still encounter any issues.

Tags: 

  • Under Review