
Analysis of Commonsense Reasoning

Editors: Bill Yuchen Lin, Pei Zhou

Here we present a collection of papers focusing on the analysis of commonsense knowledge and commonsense reasoning.



Analyzing Language Models

This line of research focuses on understanding whether pre-trained language models (e.g., BERT) have captured commonsense knowledge, and to what extent they can be used for different types of commonsense reasoning. These works usually take the pre-trained LMs as they are (i.e., without fine-tuning) and analyze them by designing probing methods for recalling knowledge or completing a reasoning task.

πŸ“œ Language Models as Knowledge Bases?
✍ Fabio Petroni, Tim Rocktaschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H. Miller, Sebastian Riedel (EMNLP 2019)

Paper Code Semantic Scholar

Used Resources: ConceptNet, DOQ, WordNet, Wikidata

Abstract
  Recent progress in pretraining language models on large textual corpora led to a surge of improvements for downstream NLP tasks. Whilst learning linguistic knowledge, these models may also be storing relational knowledge present in the training data, and may be able to answer queries structured as "fill-in-the-blank" cloze statements. Language models have many advantages over structured knowledge bases: they require no schema engineering, allow practitioners to query about an open class of relations, are easy to extend to more data, and require no human supervision to train. We present an in-depth analysis of the relational knowledge already present (without fine-tuning) in a wide range of state-of-the-art pretrained language models. We find that (i) without fine-tuning, BERT contains relational knowledge competitive with traditional NLP methods that have some access to oracle knowledge, (ii) BERT also does remarkably well on open-domain question answering against a supervised baseline, and (iii) certain types of factual knowledge are learned much more readily than others by standard language model pretraining approaches. The surprisingly strong ability of these models to recall factual knowledge without any fine-tuning demonstrates their potential as unsupervised open-domain QA systems. The code to reproduce our analysis is available at https://github.com/facebookresearch/LAMA

Comments
  [Soumya] The authors propose the LAMA probe dataset to test the factual and commonsense knowledge in LMs. The facts are converted to cloze statements, which are used to query the LM for the missing token. Similar to knowledge base completion evaluation, the evaluation is based on how highly the missing token is ranked among all tokens. They observe that BERT consistently outperforms other relation-extraction-based baselines and ELMo variants. This work focuses mainly on understanding what knowledge is already captured in the weights of an LM, not on an LM's ability to capture knowledge from a given text.
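
As a rough illustration of this cloze-style probing setup (a minimal sketch, not the authors' LAMA code; the example fact, template, and model name are illustrative choices), one can mask the object of a fact and check where the gold token ranks among the LM's predictions:

```python
# Minimal sketch of LAMA-style cloze probing with a masked LM.
# The fact and template are illustrative; LAMA has its own templates and candidate vocab.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

cloze = f"The capital of France is {tokenizer.mask_token}."
gold = "Paris"

inputs = tokenizer(cloze, return_tensors="pt")
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]

with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]    # scores over the vocabulary

gold_id = tokenizer.convert_tokens_to_ids(gold)
rank = (logits > logits[gold_id]).sum().item() + 1  # 1 = top prediction
print(f"Rank of '{gold}': {rank}")                  # precision@k checks whether rank <= k
```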

πŸ“œ oLMpics-On What Language Model Pre-training Captures.
✍ Alon Talmor, Yanai Elazar, Y. Goldberg, Jonathan Berant (TACL 2020)

Paper Code Semantic Scholar

Used Resources: ConceptNet, DOQ, WordNet, Wikidata, Google Book Corpus

Abstract
  Recent success of pre-trained language models (LMs) has spurred widespread interest in the language capabilities that they possess. However, efforts to understand whether LM representations are useful for symbolic reasoning tasks have been limited and scattered. In this work, we propose eight reasoning tasks, which conceptually require operations such as comparison, conjunction, and composition. A fundamental challenge is to understand whether the performance of a LM on a task should be attributed to the pre-trained representations or to the process of fine-tuning on the task data. To address this, we propose an evaluation protocol that includes both zero-shot evaluation (no fine-tuning), as well as comparing the learning curve of a fine-tuned LM to the learning curve of multiple controls, which paints a rich picture of the LM capabilities. Our main findings are that: (a) different LMs exhibit qualitatively different reasoning abilities, e.g., RoBERTa succeeds in reasoning tasks where BERT fails completely; (b) LMs do not reason in an abstract manner and are context-dependent, e.g., while RoBERTa can compare ages, it can do so only when the ages are in the typical range of human ages; (c) On half of our reasoning tasks all models fail completely. Our findings and infrastructure can help future work on designing new datasets, models, and objective functions for pre-training.

Comments
  [Soumya] The authors propose eight reasoning tasks requiring conceptual abilities like comparison, conjunction, and composition. They define evaluation protocols to study the performance of LMs on these tasks before and after fine-tuning. They find that different LMs have varied abilities on the reasoning tasks: BERT can fail completely, while RoBERTa shows good success. Further, they find that the success of LMs is not due to abstract, composition-based reasoning; LMs depend heavily on the context of the task, and any discrepancy from the training distribution leads to large performance drops. Among all the reasoning tasks considered, they find that LMs are particularly poor at multi-hop reasoning, with near-chance performance across all models considered.
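
The zero-shot, multiple-choice style of probing described above can be sketched roughly as follows (a minimal sketch under my own assumptions; the age-comparison question and candidate words are illustrative, not items from the oLMpics tasks):

```python
# Sketch of zero-shot multiple-choice probing: place a mask where the answer
# goes and compare the masked LM's scores for the candidate answers only.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

question = f"A 41 year old person is {tokenizer.mask_token} than a 24 year old person."
candidates = [" older", " younger"]   # leading space matters for RoBERTa's BPE

inputs = tokenizer(question, return_tensors="pt")
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]

cand_ids = [tokenizer.encode(c, add_special_tokens=False)[0] for c in candidates]
best = candidates[int(torch.argmax(logits[cand_ids]))]
print(best.strip())   # "older" if the LM compares the ages correctly
```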

πŸ“œ Evaluating Commonsense in Pre-trained Language Models.
✍ Xuhui Zhou, Y. Zhang, Leyang Cui, Dandan Huang (AAAI 2020)

Paper Code Semantic Scholar

Used Resources: Sense Making, WSC, SWAG, HellaSwag, Argument Reasoning Comprehension

Abstract
  Contextualized representations trained over large raw text data have given remarkable improvements for NLP tasks including question answering and reading comprehension. There have been works showing that syntactic, semantic and word sense knowledge are contained in such representations, which explains why they benefit such tasks. However, relatively little work has been done investigating commonsense knowledge contained in contextualized representations, which is crucial for human question answering and reading comprehension. We study the commonsense ability of GPT, BERT, XLNet, and RoBERTa by testing them on seven challenging benchmarks, finding that language modeling and its variants are effective objectives for promoting models' commonsense ability while bi-directional context and larger training set are bonuses. We additionally find that current models do poorly on tasks require more necessary inference steps. Finally, we test the robustness of models by making dual test cases, which are correlated so that the correct prediction of one sample should lead to correct prediction of the other. Interestingly, the models show confusion on these test cases, which suggests that they learn commonsense at the surface rather than the deep level. We release a test set, named CATs publicly, for future research.

Comments
  [Soumya] In this work, the authors probe the commonsense knowledge present in LMs by using them to solve seven existing benchmark datasets. They reformulate all the tasks into a sentence-scoring format, where the sentence with the higher probability is taken as the correct prediction. They observe that models with bidirectional encoders outperform their unidirectional counterparts. Further, they observe that, in general, LMs perform better on tasks requiring shallow reasoning and considerably worse on tasks that require multiple inference steps. Finally, they experiment with minimally edited pairs of contrast sets and find that LMs cannot consistently reason across the pairs of sentences. They conclude that although LMs learn some commonsense, their reasoning capabilities are quite shallow, without deep semantic comprehension.
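
The sentence-scoring reformulation can be sketched as below (a minimal sketch, assuming a causal LM such as GPT-2; the example sentence pair is illustrative and not taken from the CATs test set):

```python
# Sketch of the sentence-scoring reformulation: score each candidate sentence
# with an LM and pick the one with the higher average log-probability.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def sentence_logprob(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean token NLL as .loss
        loss = model(ids, labels=ids).loss
    return -loss.item()   # higher means more plausible to the LM

candidates = ["He put the turkey in the oven.",
              "He put the oven in the turkey."]
print(max(candidates, key=sentence_logprob))
```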

πŸ“œ RICA: Evaluating Robust Inference Capabilities Based on Commonsense Axioms.
✍ Pei Zhou, Rahul Khanna, Seyeon Lee, Bill Yuchen Lin, Daniel Ho, Jay Pujara, Xiang Ren (arXiv 2020, accepted in EMNLP-Findings 2020)

Paper Project Page Leaderboard Semantic Scholar

Used Resources: ConceptNet, ATOMIC, COMeT

Abstract
  Pre-trained language models (PTLM) have impressive performance on commonsense inference benchmarks, but their ability to practically employ commonsense to communicate with humans is fiercely debated. Prior evaluations of PTLMs have focused on factual world knowledge or the ability to reason when the necessary knowledge is provided explicitly. However, effective communication with humans requires inferences based on implicit commonsense relationships, and robustness despite paraphrasing. In the pursuit of advancing fluid human-AI communication, we propose a new challenge, RICA, that evaluates the capabilities of making commonsense inferences and the robustness of these inferences to language variations. In our work, we develop a systematic procedure to probe PTLMs across three different evaluation settings. Extensive experiments on our generated probe sets show that PTLMs perform no better than random guessing (even with fine-tuning), are heavily impacted by statistical biases, and are not robust to perturbation attacks. Our framework and probe sets can help future work improve PTLMs' inference abilities and robustness to linguistic variations--bringing us closer to more fluid communication.

Comments
  [Soumya] The authors propose the RICA challenge, a set of commonsense probes based on logical axioms that evaluate LMs' ability to perform commonsense reasoning. They define specific logical templates and use existing commonsense KGs to generate these probes. Further, linguistic variations are defined on these probes to understand how LMs handle operations like negation and paraphrasing. They find that existing LMs perform close to random chance in a zero-shot evaluation setting. Additionally, they observe that fine-tuning on limited data or adding knowledge in the form of knowledge graphs does not help significantly. They also release a human-curated version of the task, which is even harder for LMs to crack. Finally, they observe that LMs' bias towards positive responses is a possible reason for this performance discrepancy.
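
To give a feel for what such axiom-style probes look like, here is a toy illustration (the template, the perturbation, and the made-up entity names are my own, not taken from the RICA probe sets):

```python
# Illustrative sketch of an axiom-style probe plus one linguistic perturbation.
# Novel, made-up entity names avoid lexical cues the LM may have memorized.
template = "{a} is heavier than {b}, so {a} is [MASK] likely to sink faster than {b}."
negated  = "{a} is heavier than {b}, so {a} is not [MASK] likely to sink faster than {b}."

probe     = template.format(a="the grendel", b="the flimp")  # expected answer: "more"
probe_neg = negated.format(a="the grendel", b="the flimp")   # expected answer: "less"
print(probe)
print(probe_neg)
```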

Analyzing Reasoning Methods

The following papers focus on how existing methods perform reasoning on commonsense tasks.

πŸ“œ Does BERT Solve Commonsense Task via Commonsense Knowledge?
✍ Leyang Cui, Sijie Cheng, Yu Wu, Yue Zhang (AAAI 2020)

Paper Semantic Scholar

Used Resources: CommonsenseQA, ConceptNet

Abstract
  The success of pre-trained contextualized language models such as BERT motivates a line of work that investigates linguistic knowledge inside such models in order to explain the huge improvement in downstream tasks. While previous work shows syntactic, semantic and word sense knowledge in BERT, little work has been done on investigating how BERT solves CommonsenseQA tasks. In particular, it is an interesting research question whether BERT relies on shallow syntactic patterns or deeper commonsense knowledge for disambiguation. We propose two attention-based methods to analyze commonsense knowledge inside BERT, and the contribution of such knowledge for the model prediction. We find that attention heads successfully capture the structured commonsense knowledge encoded in ConceptNet, which helps BERT solve commonsense tasks directly. Fine-tuning further makes BERT learn to use the commonsense knowledge on higher layers.

Comments

πŸ“œ Learning to Deceive Knowledge Graph Augmented Models via Targeted Perturbation.
✍ Mrigank Raman, Siddhant Agarwal, Peifeng Wang, Aaron Chan, H. Wang, Sungchul Kim, Ryan A. Rossi, Handong Zhao, Nedim Lipka, Xiang Ren (ICLR 2021)

Paper Code Semantic Scholar

Used Resources: CommonsenseQA, ConceptNet

Abstract
  Knowledge graphs (KGs) have helped neural-symbolic models improve performance on various knowledge-intensive tasks, like question answering and item recommendation. By using attention over the KG, such models can also "explain" which KG information was most relevant for making a given prediction. In this paper, we question whether these models are really behaving as we expect. We demonstrate that, through a reinforcement learning policy (or even simple heuristics), one can produce deceptively perturbed KGs which maintain the downstream performance of the original KG while significantly deviating from the original semantics and structure. Our findings raise doubts about KG-augmented models' ability to leverage KG information and provide plausible explanations

Comments
  [Soumya] In this paper, the authors raise an important question: do LMs use knowledge graph information the way humans expect them to? To understand this, they study how a KG-augmented model's performance changes as the KG is systematically corrupted. They propose five corruption schemes: four heuristic and one RL-based. Overall, the goal of the corruption is to perturb facts in the KG while keeping the graph semantically plausible. They find that KG-augmented LMs for commonsense reasoning and recommendation tasks remain surprisingly good even when using significantly corrupted KGs. Further, they observe that increasing the perturbations does not significantly affect performance, which calls into question how these models are actually using the knowledge graph. Finally, they show that noisily perturbing the KG without any semantic constraints can lead to a performance drop in KG-augmented models, suggesting that these models are at least sensitive to KG semantics.
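
As a rough illustration of the kind of targeted corruption being discussed (a minimal sketch of random edge rewiring; it is not one of the paper's five perturbation schemes and it ignores their semantic-plausibility constraints):

```python
# Minimal sketch of a KG perturbation: rewire a fraction of edges to random
# tail entities while keeping the graph size fixed.
import random

def perturb_kg(triples, ratio=0.5, seed=0):
    """triples: list of (head, relation, tail). Returns a corrupted copy."""
    rng = random.Random(seed)
    entities = list({e for h, _, t in triples for e in (h, t)})
    corrupted = []
    for h, r, t in triples:
        if rng.random() < ratio:
            t = rng.choice(entities)   # replace the tail with a random entity
        corrupted.append((h, r, t))
    return corrupted

kg = [("dog", "IsA", "animal"),
      ("oven", "UsedFor", "baking"),
      ("book", "AtLocation", "library")]
print(perturb_kg(kg, ratio=1.0))
```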

Analysis of Data

These papers analyze the datasets and resources used for studying commonsense reasoning. Are they biased? Do they contain errors or noise? What are the real challenges?

πŸ“œ Lawyers are Dishonest? Quantifying Representational Harms in Commonsense Knowledge Resources.
✍ Ninareh Mehrabi*, Pei Zhou*, Fred Morstatter, Jay Pujara, Xiang Ren, Aram Galstyan (arXiv 2021)

Paper Code Semantic Scholar

Used Resources: ConceptNet, COMeT

Abstract
  Numerous natural language processing models have tried injecting commonsense by using the ConceptNet knowledge base to improve performance on different tasks. ConceptNet, however, is mostly crowdsourced from humans and may reflect human biases such as β€œlawyers are dishonest.” It is important that these biases are not conflated with the notion of commonsense. We study this missing yet important problem by first defining and quantifying biases in ConceptNet as two types of representational harms: overgeneralization of polarized perceptions and representation disparity. We find that ConceptNet contains severe biases and disparities across four demographic categories. In addition, we analyze two downstream models that use ConceptNet as a source for commonsense knowledge and find the existence of biases in those models as well. We further propose a filtered-based bias-mitigation approach and examine its effectiveness. We show that our mitigation approach can reduce the issues in both resource and models but leads to a performance drop, leaving room for future work to build fairer and stronger commonsense models.

Comments

πŸ“œ WinoWhy: A Deep Diagnosis of Essential Commonsense Knowledge for Answering Winograd Schema Challenge.
✍ Hongming Zhang*, Xinran Zhao*, Yangqiu Song (ACL 2020)

Paper Code Semantic Scholar

Used Resources: WSC

Abstract
  In this paper, we present the first comprehensive categorization of essential commonsense knowledge for answering the Winograd Schema Challenge (WSC). For each of the questions, we invite annotators to first provide reasons for making correct decisions and then categorize them into six major knowledge categories. By doing so, we better understand the limitation of existing methods (i.e., what kind of knowledge cannot be effectively represented or inferred with existing methods) and shed some light on the commonsense knowledge that we need to acquire in the future for better commonsense reasoning. Moreover, to investigate whether current WSC models can understand the commonsense or they simply solve the WSC questions based on the statistical bias of the dataset, we leverage the collected reasons to develop a new task called WinoWhy, which requires models to distinguish plausible reasons from very similar but wrong reasons for all WSC questions. Experimental results prove that even though pre-trained language representation models have achieved promising progress on the original WSC dataset, they are still struggling at WinoWhy. Further experiments show that even though supervised models can achieve better performance, the performance of these models can be sensitive to the dataset distribution. WinoWhy and all codes are available at: this https URL

Comments
  The authors propose WinoWhy, an extension of the Winograd Schema Challenge with explanations for each question. The task is to assign higher probability to the correct explanation among given plausible and implausible explanations for a WSC question. Additionally, they categorize the knowledge type required to solve each question into the following categories: property, object, eventuality, spatial, quantity, and others. They find that LMs that perform strongly on the original WSC challenge do very poorly on the WinoWhy task, which suggests that the models are not actually learning to reason. Further, they find that these LMs are poor across all knowledge categories.

πŸ“œ Commonsense Knowledge Graph Reasoning by Selection or Generation? Why?
✍ Cunxiang Wang, Jinhang Wu, Luxin Liu, Yue Zhang (arXiv 2020)

Paper Semantic Scholar

Used Resources: ConceptNet, ATOMIC

Abstract
  Commonsense knowledge graph reasoning(CKGR) is the task of predicting a missing entity given one existing and the relation in a commonsense knowledge graph (CKG). Existing methods can be classified into two categories generation method and selection method. Each method has its own advantage. We theoretically and empirically compare the two methods, finding the selection method is more suitable than the generation method in CKGR. Given the observation, we further combine the structure of neural Text Encoder and Knowledge Graph Embedding models to solve the selection method's two problems, achieving competitive results. We provide a basic framework and baseline model for subsequent CKGR tasks by selection methods.

Comments

πŸ“œ An Analysis of Dataset Overlap on Winograd-Style Tasks.
✍ Ali Emami, Adam Trischler, Kaheer Suleman, Jackie Chi Kit Cheung (COLING 2020)

Paper Code Semantic Scholar

Used Resources: WSC

Abstract
 The Winograd Schema Challenge (WSC) and variants inspired by it have become important benchmarks for common-sense reasoning (CSR). Model performance on the WSC has quickly progressed from chance-level to near-human using neural language models trained on massive corpora. In this paper, we analyze the effects of varying degrees of overlap between these training corpora and the test instances in WSC-style tasks. We find that a large number of test instances overlap considerably with the corpora on which state-of-the-art models are (pre)trained, and that a significant drop in classification accuracy occurs when we evaluate models on instances with minimal overlap. Based on these results, we develop the KnowRef-60K dataset, which consists of over 60k pronoun disambiguation problems scraped from web data. KnowRef-60K is the largest corpus to date for WSC-style common-sense reasoning and exhibits a significantly lower proportion of overlaps with current pretraining corpora.

Comments

Others

πŸ“œ The Box is in the Pen: Evaluating Commonsense Reasoning in Neural Machine Translation.
✍ Jie He, Tao Wang, Deyi Xiong, Qun Liu (EMNLP-Findings 2020)

Paper Code Semantic Scholar

Used Resources:

Abstract
  Does neural machine translation yield translations that are congenial with common sense? In this paper, we present a test suite to evaluate the commonsense reasoning capability of neural machine translation. The test suite consists of three test sets, covering lexical and contextless/contextual syntactic ambiguity that requires commonsense knowledge to resolve. We manually create 1,200 triples, each of which contain a source sentence and two contrastive translations, involving 7 different common sense types. Language models pretrained on large-scale corpora, such as BERT, GPT-2, achieve a commonsense reasoning accuracy of lower than 72% on target translations of this test suite. We conduct extensive experiments on the test suite to evaluate commonsense reasoning in neural machine translation and investigate factors that have impact on this capability. Our experiments and analyses demonstrate that neural machine translation performs poorly on commonsense reasoning of the three ambiguity types in terms of both reasoning accuracy (≤ 60.1%) and reasoning consistency (≤ 31%). We will release our test suite as a machine translation commonsense reasoning testbed to promote future work in this direction.

Comments

Cited as (TBD)

@electronic{commonsenserun,
  title   = "An Online Compendium for Commonsense Reasoning Research.",
  author  = "Lin, Bill Yuchen and Qiao, Yang and Ilievski, Filip and Zhou, Pei and Wang, Peifeng and Ren, Xiang", 
  journal = "commonsense.run",
  year    = "2021",
  url     = "https://commonsense.run"
}

Page last modified: Apr 24 2021.
