
Benchmark Datasets for Commonsense Reasoning

Editors: Bill Yuchen Lin, Yang Qiao, Peifeng Wang, Pei Zhou

We present a comprehensive collection of datasets for testing commonsense reasoning ability. They are grouped by task formulation and cover a wide range of aspects: properties of common objects, real-life situations, elementary science, social skills, etc.

🌟 Summary Table 🌟
| Name | Link | Tags | Format | SotA vs. Human |
|------|------|------|--------|----------------|
| CommonsenseQA | Link | General | MC | 83.3 / 88.9 (Acc. %) |
| SocialIQA | Link | Social | MC | 83.2 / 88.1 (Acc. %) |
| PhysicalIQA | Link | Physical | MC | 90.1 / 94.9 (Acc. %) |
| ARC | Link | Science | MC | 81.4 / N/A (Acc. %) |
| OpenbookQA | Link | Elementary Science | MC | 87.2 / 91.7 (Acc. %) |
| SWAG | Link | Event | MC | 91.7 / 88.0 (Acc. %) |
| HellaSWAG | Link | Event | MC | 93.9 / 95.6 (Acc. %) |
| WSC | Link | General, Coreference | MC | 96.6 / 100 (Acc. %) |
| WinoGrande | Link | General, Coreference | MC | 91.28 / 94 (Acc. %) |
| COPA | Link | Causality, Event | MC | 98.4 / 100 (Acc. %) |
| X-COPA | Link | Causality, Event | MC | 76.1 / 97.6 (Acc. %) |
| CODAH | Link | General, Event | MC | 84.3 / 95.3 (Acc. %) |
| MC-TACO | Link | Temporal Commonsense, Events | MC | 80.9 / 75.8 (Acc. %) |
| aNLI | Link | Abductive Reasoning, Events | MC | 89.7 / 92.9 (Acc. %) |
| RiddleSense | Link | General, Figurative, Counterfactual | MC | 68.8 / 91.3 (Acc. %) |
| ROCStories | Link | General, Story | MC | 58.5 / 100 (Acc. %) |
| QASC | Link | Science | MC | 89.57 / 93.33 (Acc. %) |
| VCR | Link | Visual Understanding, Complex Situation | VQA | 70.8 / 85.0 (Acc. %) |
| ProtoQA | Link | Prototypical Situation | OE | 56.0 / 78.4 (WN. Sim.) |
| OpenCSR | Link | Science | OE | 40.8 / N/A (Acc. %) |
| CommonGen | Link | General, Everyday Scenario | CNLG | 33.3 / 52.4 (SPICE %) |
| ComVE (SubTask C) | Link | Nonsensical Statement | CNLG | 22.4 / 2.58 (BLEU) |
| LAMA Probes | Link | General | LMP | N/A |
| NumerSense | Link | Numerical | LMP | 70.4 / 96.3 (Acc. %) |
| RICA | Link | General | LMP | 52.2 / 91.7 (Acc. %) |
| ReCoRD | Link | News Articles | RC | 91.21 / 91.69 (F1) |
| CosmosQA | Link | Everyday Narratives | RC | 91.79 / 94.00 (Acc. %) |
| DREAM | Link | Everyday Dialogues | RC | 91.8 / 95.5 (Acc. %) |
| MCScript | Link | General, Script | RC | 72.0 / 98.2 (Acc. %) |
| TWC | Link | Objects | TG | N/A |
| Rainbow | Link | General | Others | N/A |
| GLUE | Link | General | Others | 97.8 / 97.8 |
| SuperGLUE | Link | General | Others | 98.4 / 100 |
| LOCATEDNEAR | Link | Objects | Others | N/A (Acc. %) |

Multiple-Choice Tasks

The multiple-choice (MC) task format for commonsense reasoning is as follows; a minimal accuracy-scoring sketch appears after the notes below.

  • Input: a question, a few candidate answers (i.e., choices).
  • Output: the label of the correct choice.
  • Metric: accuracy.

Notes:

  • There is one and only one correct choice for each input, and the others are distractors. Exception: MC-TACO allows multiple correct answers.
  • We do not consider the cases with additional input context (e.g., passages, images) here.
  • The inputs can be either interrogative sentences (as in CommonsenseQA, SocialIQA, etc.) or incomplete statements (as in SWAG, COPA, WSC, etc.).
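
As a concrete illustration of the metric, here is a minimal accuracy-scoring sketch; it is not any dataset's official evaluator, and the prediction and gold lists are hypothetical.

```python
# A minimal sketch of MC-task accuracy scoring (not an official evaluator).
def accuracy(predictions, gold_labels):
    """Fraction of questions whose predicted choice label matches the gold label."""
    assert len(predictions) == len(gold_labels)
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)

# Hypothetical predictions over three questions with choice labels A-E.
print(accuracy(["B", "A", "E"], ["B", "C", "E"]))  # 0.666...
```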

CommonsenseQA

CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge.
Alon Talmor, Jonathan Herzig, Nicholas Lourie, Jonathan Berant. NAACL-19

Paper Official Link Huggingface Card

  • Topics: General. Mostly about properties of common objects and motivation/causes/results of events.
  • Size & Split: 12,102 in total — train (9,741), dev (1,221), test (1,140).
  • Dataset creation: The questions are crowdsourced from human annotators. The authors present a question concept q and several candidate answer concepts linked to q in ConceptNet, and ask annotators to write a natural-language question that mentions q and is answered by only one of the candidates.
Illustrative Example
Question ID: b8c0a4703079cf661d7261a60a1bcbff
Question concept: "magazines"
Question: "Where would you find magazines along side many other printed works?"
Choices:  A: "doctor" | B: "bookstore" | C: "market" | D: "train station" | E: "mortuary"
Correct Choice: B

Comments

  • We report sizes based on the latest version of the data (v1.11) downloaded from the official website, which therefore differ from those in the paper.
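
For quick inspection, a minimal loading sketch via the Hugging Face `datasets` library is shown below; the hub id "commonsense_qa" is assumed to match the Huggingface Card above.

```python
# A minimal sketch: loading CommonsenseQA with the `datasets` library.
# The hub id "commonsense_qa" is assumed to match the Huggingface Card above.
from datasets import load_dataset

csqa = load_dataset("commonsense_qa")   # splits: train / validation / test
ex = csqa["train"][0]
print(ex["question"])
print(list(zip(ex["choices"]["label"], ex["choices"]["text"])))
print(ex["answerKey"])                  # e.g. "B" (empty for the hidden test split)
```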

SocialIQA

SocialIQA: Commonsense Reasoning about Social Interactions.
Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, Yejin Choi. EMNLP-19

Paper Official Link

  • Topics: Social Interactions. It focuses on reasoning about people’s actions and their social implications.
  • Size & Split: 37,588 in total — train (33,410), dev (1,954), test (2,224).
  • Dataset creation:
Illustrative Example
Question: 
    In the school play, Robin played a hero in the struggle to the death with the angry villain. How would others feel as a result?
Choices:  
    A) sorry for the villain
    B) hopeful that Robin will succeed
    C) like Robin should lose the fight
Correct Choice: B

PhysicalIQA

PIQA: Reasoning about Physical Commonsense in Natural Language.
Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, Yejin Choi. AAAI-20

Paper Official Link Huggingface Card

  • Topics: Physical commonsense. It focuses on how people interact with everyday objects in everyday situations.
  • Size & Split: around 20,000 multiple-choice QA pairs in total — train (over 16,000), dev (~2k), test (~3k).
  • Dataset creation:
Illustrative Example
Question:
    To separate egg whites from the yolk using a water bottle, you should
Choices:
    A) Squeeze the water bottle and press it against the yolk. Release, which creates suction and lifts the yolk.
    B) Place the water bottle and press it against the yolk. Keep pushing, which creates suction and lifts the yolk.
Correct Choice: A

ARC

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge.
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord. arXiv, 2018

Paper Official Link Huggingface Card

  • Topics: Science. It focuses on natural, grade-school science questions.
  • Size & Split: 7,787 in total — train (3,370), dev (869), test (3,548).
  • Dataset creation:
Illustrative Example
Question:
    Which property of a mineral can be determined just by looking at it?
Choices:
    A) luster  B) mass  C) weight  D) hardness
Correct Choice: A

OpenbookQA

Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering.
Todor Mihaylov, Peter Clark, Tushar Khot, Ashish Sabharwal. EMNLP-18

Paper Official Link Huggingface Card

  • Topics: Elementary science. The dataset is modeled after open-book exams for assessing human understanding of a subject.
  • Size & Split: 5,957 in total — train (4,957), dev (500), test (500).
  • Dataset creation:
Illustrative Example
Question:
    Which of these would let the most heat travel through?
Choices:
    A) a new pair of jeans
    B) a steel spoon in a cafeteria  
    C) a cotton candy at a store  
    D) a calvin klein cotton hat
Correct Choice: B

SWAG and HellaSWAG

SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference.
Rowan Zellers, Yonatan Bisk, Roy Schwartz, Yejin Choi. EMNLP-18

Paper Official Page Intro Page Huggingface Card

  • Topics: General. Mostly about grounded situations. Each question is a video caption from LSMDC or ActivityNet Captions, with four answer choices about what might happen next in the scene.
  • Size & Split: around 113k in total — train (73k), dev (20k), test (20k).
  • Dataset creation:
Illustrative Example
Question:
    On stage, a woman takes a seat at the piano. She
Choices:
    A) sits on a bench as her sister plays with the doll.
    B) smiles with someone as the music plays.
    C) is in the crowd, watching the dancers.
    D) nervously sets her fingers on the keys.
Correct Choice: D

HellaSwag: Can a Machine Really Finish Your Sentence?.
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, Yejin Choi. ACL-19

Paper Official Page Intro Page Huggingface Card

  • Topics: General. Mostly about grounded commonsense situations.
  • Size & Split: 18,001 in total — train (6,833), dev (3,641), test (7,527).
  • Dataset creation:
Illustrative Example
Question:
    A woman is outside with a bucket and a dog. The dog is running around trying to avoid a bath. She
Choices:
    A) rinses the bucket off with soap and blow dries the dog's head.
    B) uses a hose to keep it from getting soapy.
    C) gets the dog wet, then it runs away again.
    D) gets into the bath tub with the dog.
Correct Choice: C

WSC

The Winograd Schema Challenge.
Hector J. Levesque, Ernest Davis, Leora Morgenstern. KR-12

Paper Official Link Huggingface Card

  • Topics: General.
  • Size & Split: 804 in total — train (554), dev (104), test (146).
  • Dataset creation:
Illustrative Example
label: 0,
options: ['The city councilmen', 'The demonstrators']
pronoun: they
pronoun_loc: 63
quote: they feared violence
quote_loc: 63
source: (Winograd 1972)
text: The city councilmen refused the demonstrators a permit because they feared violence.

WinoGrande

WinoGrande: An Adversarial Winograd Schema Challenge at Scale.
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, Yejin Choi. AAAI-20

Paper Official Link Huggingface Card

  • Topics: General. Mostly about commonsense inference in pronoun resolution problems.
  • Size & Split: 43,972 in total — train (40,938), dev (1,267), test (1,767).
  • Dataset creation:
Illustrative Example
Sentence: Katrina had the financial means to afford a new car while Monica did not, since _ had a high paying job.
Option1: Katrina
Option2: Monica
Correct Option: Option1

COPA and X-COPA

Choice of Plausible Alternatives: An Evaluation of Commonsense Causal Reasoning.
Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S. Gordon. AAAI-11

Paper Official Link

  • Topics: General. Open-domain commonsense causal reasoning of everyday activities.
  • Size & Split: 1000 in total — train (400), dev (100), test (500).
  • Dataset creation:
Illustrative Example
Premise: The man broke his toe. What was the CAUSE of this?
Alternative 1: He got a hole in his sock.
Alternative 2: He dropped a hammer on his foot.
Correct Choice: Alternative 2

XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning.
Edoardo Maria Ponti, Goran Glavaš, Olga Majewska, Qianchu Liu, Ivan Vulić, Anna Korhonen. EMNLP-20

Paper Official Link Huggingface Card

  • Topics: General. Same as COPA dataset but in 11 languages.
  • Size & Split: 1000 * 11 (# langs) in total — train (400 * 11), dev (100 * 11), test (500 * 11).
  • Dataset creation:
Illustrative Example
Premise: L'uomo aprì il rubinetto. (The man turned on the faucet.)
Alternative 1: Il gabinetto si riempì d'acqua. (The toilet filled with water.)
Alternative 2: Dell'acqua fluì dal beccuccio. (Water flowed from the spout.)
Correct Choice: Alternative 2

CODAH

CODAH: An Adversarially Authored Question-Answer Dataset for Common Sense.
Michael Chen, Mike D’Arcy, Alisa Liu, Jared Fernandez, Doug Downey. RepEval-19

Paper Official Link Huggingface Card

  • Topics: General. Mostly about grounded situations in everyday activities.
  • Size: 2,801 in total. No official splits.
  • Dataset creation:
Illustrative Example
Question: 
    I am always very hungry before I go to bed. I am
Choices:
    A) concerned that this is an illness.
    B) glad that I do not have a kitchen.
    C) fearful that there are monsters under my bed.
    D) tempted to snack when I feel this way.
Correct Choice: D

MC-TACO

“Going on a vacation” takes longer than “Going for a walk”: A Study of Temporal Commonsense Understanding.
Ben Zhou, Daniel Khashabi, Qiang Ning, Dan Roth. EMNLP-19

Paper Official Link Huggingface Card

  • Topics: Temporal Commonsense. Focusing on event ordering, duration, stationarity, frequency and time.
  • Size & Split: 13k question-answer pairs in total — train (N/A), dev (3,783), test (9,442).
  • Dataset creation:
Illustrative Example
Paragraph: 
    Growing up on a farm near St. Paul, L. Mark Bailey didn't dream of becoming a judge.
Question:
    How many years did it take for Mark to become a judge?
Choices:
    A) 63 years  B) 7 weeks  C) 7 years  D) 7 seconds  E) 7 hours
Correct Choice: C

aNLI

Abductive Commonsense Reasoning.
Chandra Bhagavatula, Ronan Le Bras, Chaitanya Malaviya, Keisuke Sakaguchi, Ari Holtzman, Hannah Rashkin, Doug Downey, Scott Wen-tau Yih, Yejin Choi. ICLR-20

Paper Official Link Huggingface Card

  • Topics: General. Mostly about observations of objects or events in daily life.
  • Size & Split: 17,801 context pairs in total — dev (1,532), test (3,059).
  • Dataset creation:
Illustrative Example
Obs1: It was a gorgeous day outside.
Obs2: She asked her neighbor for a jump-start.
Hyp1: Mary decided to drive to the beach, but her car would not start due to a dead battery.
Hyp2: It made a weird sound upon starting.
Correct Choice: Hyp1

ComVE (SemEval-2020 Task 4 SubTask A&B)

SemEval-2020 task 4: Commonsense validation and explanation.
Cunxiang Wang, Shuailong Liang, Yili Jin, Yilong Wang, Xiaodan Zhu, Yue Zhang. SemEval-2020

Paper Official Link CodaLab

  • Topics: General. General commonsense assertions in everyday life.
  • Size & Split: 11,997 in total — train (10,000), dev (997), test (1,000).
  • Dataset creation:
Illustrative Example
Task A: Validation
Task: Which statement of the two is against common sense?
Statement1: He put a turkey into the fridge.
Statement2: He put an elephant into the fridge.
Correct Answer: Statement2
Task B: Explanation (Multi-Choice)
Task: Select the most corresponding reason why this statement is against common sense.
Statement: He put an elephant into the fridge.
A: An elephant is much bigger than a fridge.
B: Elephants are usually gray while fridges are usually white.
C: An elephant cannot eat a fridge.
Correct Choice: A

RiddleSense

RiddleSense: Answering Riddle Questions as Commonsense Reasoning.
Bill Yuchen Lin, Ziyi Wu, Yichi Yang, Dong-Ho Lee, Xiang Ren. arXiv, 2021

Paper Official Link

  • Topics: General. Mostly about riddle-style commonsense question answering.
  • Size & Split: 5,733 in total — train (3,510), dev (1,021), test (1,202).
  • Dataset creation:
Illustrative Example
Question:
    My life can be measured in hours. I serve by being devoured. Thin, I am quick; Fat, I am slow. Wind is my foe. What am I?
Choices:
    A) paper  B) candle  C) lamp  D) clock  E) worm
Correct Choice: B

Comments

  • The dataset is not yet public. Contact the authors to get more information.

ROCStories

A Corpus and Evaluation Framework for Deeper Understanding of Commonsense Stories.
Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, James Allen. NAACL HLT, 2016

Paper Official Link

  • Topics: General. Mostly about causal and correlational relationships between events.
  • Size & Split: 50k five-sentence commonsense stories and 3,744 Story Cloze Test cases.
  • Dataset creation:
Illustrative Example
Context: 
    Karen was assigned a roommate her first year of college. Her roommate asked her to go to a nearby city for a concert. Karen agreed happily. The show was absolutely exhilarating.
Right Ending: 
    Karen became good friends with her roommate.
Wrong Ending:
    Karen hated her roommate.

QASC

QASC: A Dataset for Question Answering via Sentence Composition.
Tushar Khot, Peter Clark, Michal Guerquin, Peter Jansen, Ashish Sabharwal. AAAI, 2020

Paper Official Link

  • Topics: Elementary and middle school level science, with a focus on fact composition.
  • Size & Split: 9,980 8-way multiple-choice questions about grade school science — train (8,134), dev (926), test (920), with a corpus of 17M sentences.
  • Dataset creation: For each multiple-choice question, a worker is provided with one seed fact at the starting point, and asked to find other relevant facts from the corpus and create a question based on them.
Illustrative Example
Question:
    What can trigger immune response?
Choices: 
    (A) decrease strength
    (B) transplanted organs
    (C) desire
    (D) matter vibrating
    (E) death
    (F) pain
    (G) chemical weathering
    (H) an automobile engine
Correct choice:
    (B)

Visually-Grounded QA

Visual Commonsense Reasoning

From Recognition to Cognition: Visual Commonsense Reasoning.
Rowan Zellers, Yonatan Bisk, Ali Farhadi, Yejin Choi. CVPR-19

Paper Official Link

  • Topics: Visual Common Sense. It focuses on challenging visual questions expressed in natural language, which require cognition-level visual understanding and commonsense reasoning.
  • Task format: Given an image, a list of regions, and a question, a model must answer the question and provide a rationale explaining why its answer is right.
  • Size & Split: 264,720 in total — train (212,923), dev (26,534), test (25,263).
  • Dataset creation:
Illustrative Example
(An image depicting three people sitting around a dining table and a waitress serving the table.)
Question:
    Why is [person4] pointing at [person1]?
Choices:
    A) He is telling [person3] that [person1] ordered the pancakes.
    B) He just told a joke.
    C) He is feeling accusatory towards [person1].  
    D) He is giving [person1] directions.
Correct Choice: A
----
Rationales: I chose A) because...
    A) [person1] has the pancakes in front of him.
    B) [person4] is taking everyone's order and asked for clarification.
    C) [person3] is looking at the pancakes both she and [person2] are smiling slightly.
    D) [person3] is delivering food to the table, and she might not know whose order is whose.
Correct Choice: D

Open-Ended QA

ProtoQA

ProtoQA: A Question Answering Dataset for Prototypical Common-Sense Reasoning.
Michael Boratko, Xiang Lorraine Li, Tim O’Gorman, Rajarshi Das, Dan Le, Andrew McCallum. EMNLP-20

Paper Github Page Huggingface Card

  • Topics: Prototypical situation.
  • Task format: Given a question, a model must output a ranked list of answers covering multiple categories (a rough answer-matching sketch follows the example below).
  • Size & Split: train (8,781), dev (1,030), test (102).
  • Dataset creation:
Illustrative Example
Question:
    Name a piece of equipment that you are likely to find at your office and not at home?
Categories: 
    printer/copier (37), office furniture (15), computer equipment (17), stapler (11), files (10), office appliances (5), security systems (1)
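
Since the summary table reports a WordNet-similarity-based score for ProtoQA, here is a rough illustration of matching a predicted answer to category names with WordNet similarity. This is only a sketch, not the official ProtoQA evaluation script, and the category names are simplified single words.

```python
# A rough sketch of WordNet-based answer-to-category matching (not the official
# ProtoQA evaluator). Requires: import nltk; nltk.download("wordnet").
from nltk.corpus import wordnet as wn

def best_category(answer, categories):
    """Return the (category, similarity) pair whose first synset is closest to the answer's."""
    best, best_score = None, 0.0
    for cat in categories:
        ans_syns, cat_syns = wn.synsets(answer), wn.synsets(cat)
        if ans_syns and cat_syns:
            score = ans_syns[0].wup_similarity(cat_syns[0]) or 0.0
            if score > best_score:
                best, best_score = cat, score
    return best, best_score

# Simplified category names from the example above.
print(best_category("photocopier", ["printer", "stapler", "computer", "furniture"]))
```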

OpenCSR

Differentiable Open-Ended Commonsense Reasoning.
Bill Yuchen Lin, Haitian Sun, Bhuwan Dhingra, Manzil Zaheer, Xiang Ren, William W. Cohen. NAACL-21

Paper

  • Topics: Science. Most of the questions in the dataset are naturally occurring generic statements.
  • Task format: Given an open-ended question, the model will output a weighted set of concepts.
  • Size & Split: 19,520 in total — train (15,800), dev (1,756), test (1,965).
  • Dataset creation:
Illustrative Example
Question: What can help alleviate global warming?
Supporting Facts: 
    f1: Carbon dioxide is the major greenhouse gas contributing to global warming.
    f2: Trees remove carbon dioxide from the atmosphere through photosynthesis.
    f3: The atmosphere contains oxygen, carbon dioxide, and water.
Weighted Answers: Renewable energy (w1), tree (w2), solar battery (w3)

Constrained NLG

CommonGen

CommonGen: A Constrained Text Generation Challenge for Generative Commonsense Reasoning.
Bill Yuchen Lin, Wangchunshu Zhou, Ming Shen, Pei Zhou, Chandra Bhagavatula, Yejin Choi, Xiang Ren. EMNLP-20 Findings

Paper Official Link Huggingface Card

  • Topics: General. A wide range of concepts from everyday scenario.
  • Task format: Given a set of common concepts, the task is to generate a coherent sentence describing an everyday scenario using these concepts.
  • Size & Split: 35,141 concept-sets in total — train (32,651), dev (993), test (1,497).
  • Dataset creation:
Illustrative Example
Common Concepts: {dog, frisbee, catch, throw}
Output:
GPT2 -- A dog throws a frisbee at a football player.
UniLM -- Two dogs are throwing frisbees at each other.
BART -- A dog throws a frisbee and a dog catches it.
T5 -- dog catches a frisbee and throws it to a dog.
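
A small sketch of the text-to-text setup is shown below. The prompt template, the hub id "common_gen", and the untuned "t5-small" checkpoint are assumptions for illustration; without fine-tuning on the CommonGen training split, the generated sentence will not be meaningful.

```python
# A rough sketch of feeding a CommonGen concept set to a seq2seq model.
# The prompt format and the untuned "t5-small" checkpoint are assumptions;
# real systems fine-tune on the CommonGen training split first.
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

concepts = load_dataset("common_gen", split="validation")[0]["concepts"]
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

prompt = "generate a sentence with: " + " ".join(concepts)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output_ids = model.generate(input_ids, max_length=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```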

Cos-E

Explain Yourself! Leveraging Language Models for Commonsense Reasoning.
Nazneen Fatema Rajani, Bryan McCann, Caiming Xiong, Richard Socher. ACL-19

Paper Github Page Huggingface Card

  • Topics: General. Most questions in the dataset are based on everyday scenarios and events.
  • Task format: Given a question, a model must return an explanation along with the correct answer to the question.
  • Size & Split: 10,952 in total — train (9,741), dev (1,211).
  • Dataset creation:
Illustrative Example
Question: While eating a hamburger with friends, what are people trying to do?
Choices: have fun, tasty, or indigestion
CoS-E: Usually a hamburger with friends indicates a good time.
Correct Choice: have fun

ComVE (SubTask C)

SemEval-2020 Task 4: Commonsense Validation and Explanation.
Cunxiang Wang, Shuailong Liang, Yili Jin, Yilong Wang, Xiaodan Zhu, Yue Zhang. SemEval-20

Paper Github Page

  • Topics: Commonsense explanation to nonsensical statement.
  • Task format: Given a nonsensical statement, the task is to generate the reason why this statement does not make sense.
  • Size & Split: 11,997 8-sentence tuples in total — train (10,000), dev (997), test (1,000).
  • Dataset creation:
Illustrative Example
Task C: Commonsense Explanation (Generation)
Generate the reason why this statement is against common sense; BLEU is used for evaluation.
    Statement: He put an elephant into the fridge.
    Referential Reasons:
        i. An elephant is much bigger than a fridge.
        ii. A fridge is much smaller than an elephant.
        iii. Most of the fridges aren’t large enough to contain an elephant.

LM Probing Tasks

LAMA Probes

Language Models as Knowledge Bases?.
Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H. Miller, Sebastian Riedel. EMNLP-19

Paper Github Page Huggingface Card

  • Topics: General. LAMA is a probe for analyzing the factual and commonsense knowledge contained in pretrained language models.
  • Task format: Given a fact expressed as a (subject, relation, object) triple, e.g., (Dante, born-in, Florence), a pretrained language model must predict the masked object in a cloze sentence such as “Dante was born in ___” that expresses the fact; a fill-mask sketch follows the example below.
  • Size & Split: N/A
  • Dataset creation:
Illustrative Example

The ConceptNet config has the following fields:

masked_sentence: One of the things you do when you are alive is [MASK].
negated: N/A
obj: think
obj_label: think
pred: HasSubevent, 
sub: alive
uuid: d4f11631dde8a43beda613ec845ff7d1
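
The probing setup can be reproduced loosely with the `transformers` fill-mask pipeline, as in the minimal sketch below; "bert-base-cased" is just one choice of pretrained LM and not necessarily the configuration used in the paper's full evaluation.

```python
# A minimal sketch of a LAMA-style cloze probe with the `transformers` fill-mask
# pipeline; "bert-base-cased" is one arbitrary choice of pretrained LM.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-cased")
for pred in fill_mask("Dante was born in [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))
```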

NumerSense

Birds have four legs?! NumerSense: Probing Numerical Commonsense Knowledge of Pre-trained Language Models.
Bill Yuchen Lin, Seyeon Lee, Rahul Khanna, Xiang Ren. EMNLP-20

Paper Github Page Huggingface Card

  • Topics: Numerical commonsense. The dataset contains probes from a wide range of categories, including objects, biology, geometry, unit, math, physics, geography, etc.
  • Task format: Given a masked sentence, the task is to choose the correct numerical answer from all provided choices.
  • Size & Split: 13.6k masked-word-prediction probes in total — fine-tune (10.5k), test (3.1k).
  • Dataset creation:
Illustrative Example
Question: A car usually has [MASK] wheels.
Choices: 
A) One  B) Two  C) Three  D) Four  E) Five
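
The same fill-mask probe can be restricted to the number-word candidates via the pipeline's `targets` argument, as in this small sketch; the model choice is again arbitrary.

```python
# A small sketch of scoring NumerSense-style number-word candidates with the
# fill-mask pipeline's `targets` argument; the model choice is arbitrary.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-cased")
candidates = ["one", "two", "three", "four", "five"]
for pred in fill_mask("A car usually has [MASK] wheels.", targets=candidates):
    print(pred["token_str"], round(pred["score"], 4))  # ideally "four" ranks first
```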

RICA

RICA: Evaluating Robust Inference Capabilities Based on Commonsense Axioms.
Pei Zhou, Rahul Khanna, Seyeon Lee, Bill Yuchen Lin, Daniel Ho, Jay Pujara, Xiang Ren. arXiv 2020, accepted in EMNLP-Findings 2020

Paper Project Page Leaderboard

  • Topics: General. The dataset contains probes from different types of commonsense, such as physical, material, and social properties. It focuses on logically equivalent probes to test models’ robust inference abilities. It also uses unseen strings as entities to separate fact-based recall from abstract reasoning capabilities.
  • Task format: Given a masked sentence and two choices for the mask, the task is to select the correct choice.
  • Size & Split: Two evaluation settings with the same test data of 1.6k human-curated probes: 1. zero-shot setting; 2. models are fine-tuned on 9k of the human-verified RICA probes (8k for training and 1k for validation).
  • Dataset creation:
Illustrative Example
Question: A prindag is lighter than a fluberg, so a prindag should float [MASK] than a fluberg.
Choices: 
A) more  B) less

Reading Comprehension

ReCoRD

ReCoRD: Bridging the Gap between Human and Machine Commonsense Reading Comprehension.
Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, Benjamin Van Durme. arXiv, 2018

Paper Official Link

  • Topics: Commonsense-based reading comprehension with a focus on news articles.
  • Task format: Given a passage, a set of text spans marked in the passage, and a cloze-style query with a missing text span, a model must select a text span that best fits the query.
  • Size & Split: Queries/Passages 120,730/80,121 in total — train (100,730/65,709), dev (10,000/7,133), test (10,000/7,279).
  • Dataset creation:
Illustrative Example
Passage: 
    (**CNN**) -- A lawsuit has been filed claiming that the iconic **Led Zeppelin** song "**Stairway to Heaven**" was far from original. The suit, filed on May 31 in the **United States District Court Eastern District of Pennsylvania**, was brought by the estate of the late musician **Randy California** against the surviving members of **Led Zeppelin** and their record label. The copyright infringement case alleges that the **Zeppelin** song was taken from the single "**Taurus**" by the 1960s band **Spirit**, for whom **California** served as lead guitarist. "Late in 1968, a then new band named **Led Zeppelin** began touring in the **United States**, opening for **Spirit**," the suit states. "It was during this time that **Jimmy Page**, **Led Zeppelin**'s guitarist, grew familiar with '**Taurus**' and the rest of **Spirit**'s catalog. **Page** stated in interviews that he found **Spirit** to be 'very good' and that the band's performances struck him 'on an emotional level.' "
• Suit claims similarities between two songs
• **Randy California** was guitarist for the group **Spirit**
• **Jimmy Page** has called the accusation "ridiculous"
(Cloze-style) Query:
    According to claims in the suit, "Parts of 'Stairway to Heaven,' instantly recognizable to the music fans across the world, sound almost identical to significant portions of ‘___.’”
Reference Answers:
    Taurus

Cosmos QA

Cosmos QA : Machine Reading Comprehension with Contextual Commonsense Reasoning.
Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, Yejin Choi. EMNLP-19

Paper Official Link Huggingface Card

  • Topics: Commonsense-based reading comprehension with a focus on people’s everyday narratives, asking questions about the likely causes or effects of events that require reasoning beyond the exact text spans in the context.
  • Task format: Given a paragraph and a question, a model must select the correct answer from a set of choices.
  • Size & Split: Questions/Paragraphs 35,588/21,866 in total — train (25,588/13,715), dev (26,534/2,460), test (25,263/5,711).
  • Dataset creation:
Illustrative Example
Paragraph: 
    It's a very humbling experience when you need someone to dress you every morning, tie your shoes, and put your hair up. Every menial task takes an unprecedented amount of effort. It made me appreciate Dan even more. But anyway I shan't dwell on this (I'm not dying after all) and not let it detract from my lovely 5 days with my friends visiting from Jersey.
Question:
    What's a possible reason the writer needed someone to dress him every morning?
Choices:
    A) The writer doesn't like putting effort into these tasks.
    B) The writer has a physical disability.
    C) The writer is bad at doing his own hair.
    D) None of the above choices.
Correct Choice: B

DREAM

DREAM: A Challenge Dataset and Models for Dialogue-Based Reading Comprehension.
Kai Sun, Dian Yu, Jianshu Chen, Dong Yu, Yejin Choi, Claire Cardie. TACL-19

Paper Official Link Huggingface Card

  • Topics: General. It focuses on in-depth multi-turn multi-party dialogues covering a variety of topics and scenarios in daily life. 34% of questions involve commonsense reasoning.
  • Task format: Given a dialogue and a question, a model must select the correct answer from a set of choices.
  • Size & Split: Questions/Dialogues 10,197/6,444 in total — train (6,116/3,869), dev (2,040/1,288), test (2,041/1,287).
  • Dataset creation: The dialogues, questions, and answers are collected from English-as-a-foreign-language examinations designed by human experts.
Illustrative Example
Dialogue: 
    W: Tom, look at your shoes. How dirty they are! You must clean them.
    M: Oh, mum, I just cleaned them yesterday.
    W: They are dirty now. You must clean them again.
    M: I do not want to clean them today. Even if I clean them today, they will get dirty again tomorrow.
    W: All right, then.
    M: Mum, give me something to eat, please.
    W: You had your breakfast in the morning, Tom, and you had lunch at school.
    M: I am hungry again.
    W: Oh, hungry? But if I give you something to eat today, you will be hungry again tomorrow.
Question:
    Why did the woman say that she wouldn’t give him anything to eat?
Choices:
    (A) Because his mother wants to correct his bad habit.
    (B) Because he had lunch at school.
    (C) Because his mother wants to leave him hungry.
Correct Choice: (A)

MCScript

MCScript: A Novel Dataset for Assessing Machine Comprehension Using Script Knowledge.
Simon Ostermann, Ashutosh Modi, Michael Roth, Stefan Thater, Manfred Pinkal. LREC-18

Paper Official Link

  • Topics: Assessing the contribution of commonsense-based script knowledge to machine comprehension. Scripts are sequences of events describing stereotypical human activities.
  • Task format: Given a script and a subset of related questions, a model must select the correct answer from a set of choices to each question.
  • Size & Split: Approximately 2,100 texts and 14,000 questions in total — train (9,731 questions on 1,470 texts), dev (1,411 questions on 219 texts), test (2,797 questions on 430 texts).
  • Dataset creation:
Illustrative Example
T: I wanted to plant a tree. I went to the home and garden store and picked a nice oak. Afterwards, I planted it in my garden.
Q1: What was used to dig the hole?
A) a shovel  B) his bare hands
Correct Answer: A
Q2: When did he plant the tree?
A) after watering it  B) after taking it home
Correct Answer: B

Text Game

TWC

Text-based RL Agents with Commonsense Knowledge: New Challenges, Environments and Baselines.
Keerthiram Murugesan, Mattia Atzeni, Pavan Kapanipathi, Pushkar Shukla, Sadhana Kumaravel, Gerald Tesauro, Kartik Talamadupula, Mrinmaya Sachan, Murray Campbell. AAAI-21

Paper Github Page

  • Topics: Objects. A specific kind of commonsense knowledge about objects, their attributes, and affordances.
  • Task format: A new text-based gaming environment for training and evaluating RL agents.
  • Size & Split: In the TWC domain, there are 928 total entities, 872 total objects, 190 unique objects, 56 supporters/containers, and 8 rooms, with 30 unique games in total.
  • Dataset creation:
Illustrative Example

Example of a game walkthrough belonging to the easy difficulty level.



Rainbow Benchmark

UNICORN on RAINBOW: A Universal Commonsense Reasoning Model on a New Multitask Benchmark.
Nicholas Lourie, Ronan Le Bras, Chandra Bhagavatula, Yejin Choi. AAAI-21

Paper Official Link

  • Topics: Rainbow is a universal commonsense reasoning benchmark spanning both social and physical common sense. Rainbow brings together 6 existing commonsense reasoning tasks: aNLI, Cosmos QA, HellaSWAG, Physical IQa, Social IQa, and WinoGrande.
  • Task format: text-to-text
  • Size & Split:
  • Dataset creation: Reformatting specific versions of the above datasets into a text-to-text format so that they can be used with models like T5 and BART; a small conversion sketch follows below.
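
As a hedged illustration of such a conversion, the sketch below turns a WinoGrande-style example into an input/target pair of strings; the field names and template are illustrative, and the exact templates used by UNICORN/Rainbow may differ.

```python
# A hedged sketch of text-to-text reformatting for a WinoGrande-style example;
# the template is illustrative, not the official Rainbow format.
def winogrande_to_text(sentence, option1, option2, answer):
    source = f"winogrande: sentence: {sentence} option1: {option1} option2: {option2}"
    target = option1 if answer == "1" else option2
    return source, target

src, tgt = winogrande_to_text(
    "Katrina had the financial means to afford a new car while Monica did not, "
    "since _ had a high paying job.",
    "Katrina", "Monica", "1",
)
print(src)
print(tgt)  # Katrina
```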

GLUE and SuperGLUE Benchmark

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman. ICLR-19

Paper Official Link Huggingface Card

LocatedNear Relation Extraction

Automatic Extraction of Commonsense LocatedNear Knowledge.
Frank F. Xu, Bill Yuchen Lin, Kenny Q. Zhu. ACL-18

Paper Github Page

  • Topics: Objects. Mostly about physical objects that are typically found near each other in real life.
  • Task format: Task 1 – judge whether a sentence describes two objects (mentioned in the sentence) as being physically close to each other; Task 2 – produce a ranked list of LOCATEDNEAR facts from the classification results over a large number of sentences.
  • Size & Split: 5,000 sentences, each describing a scene with two physical objects and labeled to indicate whether the two objects are co-located in the scene — train (4,000), test (1,000).
  • Dataset creation:
Illustrative Example
ID: 9888840
Sentence: In a few minutes more the mission ship was forsaken by her strange Sabbath congregation, and left with all the fleet around her floating quietly on the tranquil sea.	
Object 1: ship
Object 2: sea
Confidence: 1

Cited as (TBD)

@electronic{commonsenserun,
  title   = "An Online Compendium for Commonsense Reasoning Research.",
  author  = "Lin, Bill Yuchen and Qiao, Yang and Ilievski, Filip and Zhou, Pei and Wang, Peifeng and Ren, Xiang", 
  journal = "commonsense.run",
  year    = "2021",
  url     = "https://commonsense.run"
}

Page last modified: Mar 26 2021.
