By Shreyas Casturi
Review Details
Reviewer has chosen not to be Anonymous
Overall Impression: Average
Content:
Technical Quality of the paper: Average
Originality of the paper: Yes, but limited
Adequacy of the bibliography: Yes
Presentation:
Adequacy of the abstract: Yes
Introduction: background and motivation: Limited
Organization of the paper: Needs improvement
Level of English: Satisfactory
Overall presentation: Weak
Detailed Comments:
This is my first review of any paper, and I apologize for how long it took me to get this review out.
The paper identifies an important problem: Neurosymbolic (NS) reasoners lack datasets/ontologies for testing a reasoner's resilience to noise. There are a variety of datasets for NS reasoning that use different metrics, deal with different tasks, etc., but the lack of a unified benchmark dataset makes it difficult to evaluate reasoners' abilities in a general manner.
To that end, the authors codified different tactics for generating noise (randomly corrupting triples, making ontology resources violate disjointness axioms, and adding low-probability links between resources that are superficially plausible but not true), codified a set of metrics (Mean Reciprocal Rank and Hits@N) to measure an NS reasoner's resilience to noise, and built datasets from the OWL2Bench ontology and a Family ontology with varying amounts of noise added. The authors ran two reasoners/embedding methods (OWL2Vec and Box2EL) on these noisy datasets and determined how the different noise-generation tactics affect each reasoner's ability to infer relationships.
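To check my own understanding of the first tactic, here is roughly what I take "randomly corrupting triples" to mean, sketched with rdflib (the function and parameter names are mine, not the authors'):

```python
import random
from rdflib import Graph, URIRef

def randomly_corrupt(graph: Graph, noise_ratio: float, seed: int = 0) -> Graph:
    """Replace the subject or object of a fraction of triples with a random resource."""
    rng = random.Random(seed)
    triples = list(graph)
    resources = list({s for s, _, _ in triples} |
                     {o for _, _, o in triples if isinstance(o, URIRef)})
    for s, p, o in rng.sample(triples, int(noise_ratio * len(triples))):
        graph.remove((s, p, o))
        if rng.random() < 0.5:
            graph.add((rng.choice(resources), p, o))   # corrupt the subject
        else:
            graph.add((s, p, rng.choice(resources)))   # corrupt the object
    return graph
```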
I think the paper provides value. The authors clearly note how there are many different types of tasks for reasoners and why the lack of standard datasets causes issues in evaluating reasoners effectively.
However, I think the paper is not ready to be accepted. The content of the paper is itself fine, but the presentation of the paper is lacking, and thus the paper needs massive revisions.
I have a series of comments and questions that I’m posting below. Again, I have never reviewed a paper before, so perhaps this is not the right way to review/ask for revisions, but I wanted to bring to the authors’ attention several points that I believe need revision/attention.
For now, I have chosen the “major revisions” option.
QUESTIONS/COMMENTS PER SECTION
SECTION 2
1. In Section 2.1, the paper brings up Henry Kautz's categorization scheme for different types of reasoners. Adding a diagram or list of those categories, with key examples (as you already do for the two relevant categories in Section 2.1), would provide helpful additional context.
2. The reason I make this point is that Section 2.1 does not clearly summarize the differences between Box2EL and OWL2Vec. I cannot determine any concrete differences between the two, even though the authors spend two paragraphs discussing each reasoner/embedding method. Adding a diagram, or a concluding paragraph that recapitulates the key differences between the tasks these embeddings do well at (and the fact that these tasks are different), and that also notes that the datasets and metrics differ between these tasks, would motivate Section 2.2 more clearly.
3. In Section 2.2, the authors are vague in describing the differences between Makni et al. and Ebrahimi et al. Both sets of authors are trying to provide metrics for the effectiveness of RDFS entailment reasoning, but what are these metrics and how do they differ? A concrete example would make it clear to the reader why, even for a single task, there is such variety in metrics, and would thereby motivate the need for the dataset/benchmark you are developing.
SECTION 3
1. In 3.1.1, in the subsection "Introducing Noise", you say that you add "k" individuals to the ontology. Does this mean that the individuals do not currently exist in the ontology? I.e., with John rdf:type Male and John rdf:type Female, should I assume John is one of the k individuals and that John does not exist in the ontology to begin with?
a. Does this not contradict the line in the introduction, "While ABox noise is about corrupting an existing triple in an ontology by changing one of the triples' resources" (these techniques being about introducing ABox noise)? I can't tell whether ABox noise is about corrupting existing individuals, adding new individuals, or both.
b. Should I assume that you take existing individuals/triples in the ontology and make them violate the disjoint axioms, or that you add individuals (who do not exist in the ontology at all) that disobey the constraints?
c. Rephrasing to make this explicit would make it clearer, especially given that the description of Logical Noise says "We introduce noise... by assigning an individual to two disjoint classes.", and you also state in 3.3 that you "corrupt either the object or the subject of existing triples". (A small sketch of the two readings I have in mind follows this item.)
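For concreteness, the two readings look like this in rdflib terms (the namespace and individual names are hypothetical, purely to make the question concrete):

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/family#")
g = Graph()
g.add((EX.Alice, RDF.type, EX.Female))   # Alice already exists in the ontology

# Reading 1: corrupt an EXISTING individual so it now violates the
# Male/Female disjointness axiom.
g.add((EX.Alice, RDF.type, EX.Male))

# Reading 2: add one of the "k" NEW individuals, who never appeared in the
# ontology and arrives already violating the axiom.
g.add((EX.John, RDF.type, EX.Male))
g.add((EX.John, RDF.type, EX.Female))
```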
2. In 3.2, I'm curious what would have happened had you added new triples carrying the low-probability assertions rather than modifying existing triples to carry them. Doing both, and comparing and contrasting their effects, would better cover the kinds of contradictions that could occur, would it not? Or is the assumption that the low-probability assertions would directly contradict the assertions the triples-to-be-modified currently make? (The sketch below shows the two variants I mean.)
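```python
from rdflib import Graph, Namespace

# The predicate and resources here are invented, just to illustrate the question.
EX = Namespace("http://example.org/family#")
g = Graph()
g.add((EX.Alice, EX.hasSpouse, EX.Bob))   # existing, plausible triple

# Variant 1 (what I understand 3.2 to do): modify the existing triple so its
# object becomes a superficially plausible but low-probability resource.
g.remove((EX.Alice, EX.hasSpouse, EX.Bob))
g.add((EX.Alice, EX.hasSpouse, EX.GreatGrandfatherOfAlice))

# Variant 2 (what I am suggesting to compare against): keep the original
# triple and add the low-probability assertion alongside it.
g.add((EX.Alice, EX.hasSpouse, EX.Bob))
g.add((EX.Alice, EX.hasSpouse, EX.GreatGrandfatherOfAlice))
```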
3. Was any scheme used in 3.3 to decide WHICH triples would be corrupted? I assume that entities appearing in many triples (one person, for example, may occur in more triples than another), if corrupted, would introduce more random noise than an entity that appears in only a single triple. (The sketch below illustrates the frequency-weighted selection I have in mind.)
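This is purely my illustration, not something I am claiming the paper does:

```python
import random
from collections import Counter
from rdflib import Graph, URIRef

def pick_triples_by_entity_frequency(graph: Graph, k: int, seed: int = 0):
    """Prefer triples whose subject/object appear in many other triples."""
    rng = random.Random(seed)
    triples = list(graph)
    freq = Counter()
    for s, _, o in triples:
        freq[s] += 1
        if isinstance(o, URIRef):
            freq[o] += 1
    weights = [freq[s] + (freq[o] if isinstance(o, URIRef) else 0)
               for s, _, o in triples]
    return rng.choices(triples, weights=weights, k=k)
```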
SECTION 4
1. The paragraph that begins with "Let G denote the original ontology, and I the ontology inferred..." needs to be reworked to more clearly explain what is being done and why it is being done. I will add additional comments below:
a. As far as I can tell, you are trying to make the ontologies consistent in terms of hop length for all resources R in the original ontology. I.e., you take the subgraph for each resource R, keep only statements/assertions at most 2 hops away from R, and then reconstitute the overall graph this way into a modified ontology?
b. Why do you need to build the inference graphs i1, i2, ... iR? I thought it was the NS reasoner's job (the one you are testing, not Pellet) to produce these inferences for whatever assertion you are testing on a given noisy dataset.
c. Why are you using Pellet? Is Pellet a standard tool to use?
d. Why are you removing "Literal" and "owl:Thing"? It wasn't clear to me. (My attempt to restate the subgraph construction, including this filtering, is sketched below.)
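If my reading above is wrong, that is itself a sign the paragraph needs clarification. Here is the construction as I currently understand it (the function name and exact filtering are my guesses):

```python
from rdflib import Graph, Literal
from rdflib.namespace import OWL

def two_hop_subgraph(graph: Graph, resource) -> Graph:
    """Keep statements within 2 hops of `resource`, dropping literals and owl:Thing."""
    frontier = {resource}
    sub = Graph()
    for _ in range(2):                                   # at most 2 hops from R
        reached = set()
        for s, p, o in graph:
            if s in frontier or o in frontier:
                if isinstance(o, Literal) or OWL.Thing in (s, o):
                    continue                             # drop literals / owl:Thing
                sub.add((s, p, o))
                reached.update({s, o})
        frontier = reached
    return sub
```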
2. This is my own fault for not knowing about MRR and Hits@N, but why are these metrics used over others? No papers are cited showing that these metrics have been used elsewhere on similar tasks, which would make it plausible to adopt them as a unifying standard. If there are no such papers, then I think there should be an explanation of why they are being used.
a. Is it possible to give a motivating example? I had never heard of these metrics and did not understand exactly what they are measuring.
b. That may be beyond the scope of the paper and cause length issues, however. (My own attempt at such an example is sketched below.)
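For other readers like me, here is how I eventually understood the two metrics (worth checking against the paper's own definitions): each test assertion gets a rank, i.e. the position of the correct answer in the reasoner's sorted candidate list; MRR is the average of 1/rank, and Hits@N is the fraction of ranks that are at most N.

```python
def mrr(ranks):
    # Mean Reciprocal Rank: average of 1/rank over all test assertions.
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_n(ranks, n):
    # Hits@N: fraction of test assertions whose correct answer is ranked <= N.
    return sum(1 for r in ranks if r <= n) / len(ranks)

ranks = [1, 3, 10]             # correct answers ranked 1st, 3rd, and 10th
print(mrr(ranks))              # (1/1 + 1/3 + 1/10) / 3 ≈ 0.478
print(hits_at_n(ranks, 1))     # 1/3 ≈ 0.333
print(hits_at_n(ranks, 10))    # 3/3 = 1.0
```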
3. I want to make sure I understand what actually happens when your code/dataset is "run"/"executed":
a. Is the idea that you run the reasoner and then see what class and object property assertions are generated when you introduce ABox noise into the ontologies at varying intensities?
b. Would Hits@N not count negative assertions (A isNotRel B)?
c. I feel that more explanation is needed of how exactly Hits@N and MRR are computed; perhaps a contrived example could be used (I sketch one below)? I found the example about Richard_john_bright helpful.
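Something as small as the following would already help. All class names and scores here are invented by me:

```python
# The embedding model scores every candidate class for one individual; the
# rank of the ground-truth class is what feeds MRR and Hits@N.
candidate_scores = {             # scores for "individual rdf:type ?C"
    "Person":       0.91,
    "Male":         0.74,        # <- the class asserted in the clean ontology
    "Organization": 0.40,
}
ranking = sorted(candidate_scores, key=candidate_scores.get, reverse=True)
rank_of_truth = ranking.index("Male") + 1   # = 2
# This assertion contributes 1/2 to MRR, and counts towards Hits@3 but not Hits@1.
```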
SECTION 5
1. Regarding the results, I think they are interesting and prove the value of the work being done. However, I do not think the graphs themselves are particularly helpful at conveying the information.
2. The issue is that the y-axis increments on both sets of graphs are so small that the bars alone do not convey the drop or difference in values as the different types of noise are introduced.
3. I think it would be helpful to add the numerical values atop the bars themselves, so that readers can see the numbers clearly and infer results from them, rather than needing to read the corresponding paragraphs.
4. Secondly, flipping between pages (or scrolling up and down in a PDF viewer) to compare class and property assertions for a given reasoner and dataset makes the data difficult to process.
5. It may be helpful to structure the graphs as follows, so there are 2-4 per page:
a. Random Noise -- OWL2Vec | Random Noise -- Box2EL
b. Statistical Noise -- OWL2Vec | Statistical Noise -- Box2EL
c. Logical Noise -- OWL2Vec | Logical Noise -- Box2EL
6. In each graph, you can show the class and property assertion values for each dataset as the noise is varied. This groups the results by reasoner + noise type and shows the effects of noise on each task more clearly.
7. You will still have many graphs, but it will be easier to find information. For example, you describe the MRR for class and object-property assertions decreasing as different types of noise are introduced, and you describe the effect of logical noise. The reader can then look at the Logical Noise -- OWL2Vec and Logical Noise -- Box2EL graphs side by side and see clearly the effect the noise has on both class and property assertions.
8. Reformatting the graphs to display the numerical MRR values, and grouping the results in this slightly different way, would make it much easier for the reader to read the data and then read the paragraphs explaining it. As it stands, it is hard to find the values discussed in the Results paragraphs and verify them against the graphs. (A rough matplotlib sketch of the layout I have in mind follows below.)
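Concretely, I am imagining something like this: one row per noise type, one column per embedding method, with the values printed on the bars (the numbers are placeholders, not the paper's results; `bar_label` needs a reasonably recent matplotlib, I believe):

```python
import matplotlib.pyplot as plt

noise_types = ["Random", "Statistical", "Logical"]
methods = ["OWL2Vec", "Box2EL"]
noise_levels = ["10%", "25%", "50%"]
placeholder_mrr = [0.42, 0.35, 0.21]      # invented values, for layout only

fig, axes = plt.subplots(len(noise_types), len(methods), figsize=(8, 9), sharey=True)
for i, noise in enumerate(noise_types):
    for j, method in enumerate(methods):
        ax = axes[i][j]
        bars = ax.bar(noise_levels, placeholder_mrr)
        ax.bar_label(bars, fmt="%.2f")    # numerical values atop the bars
        ax.set_title(f"{noise} Noise -- {method}")
        ax.set_ylabel("MRR")
fig.tight_layout()
plt.show()
```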
I would also like to note that when I cloned the repository to verify the authors' results, `pip install -r requirements.txt` failed (during meson's build of openblas, I believe), so I could not replicate their results. If the authors can revise the codebase and retest it so that the results are easy to regenerate, that would go a long way towards reproducibility!
In general, I think the paper has promise, but the presentation and motivation of certain key points is lacking. Hopefully these points provide a basis for a good revision!
I also didn't take a deep look at the bibliography; I didn't see anything that was out of place, however.