General Taxonomy Prediction Benchmark Dataset

Created by Mohit Bansal and Gerard de Melo, based on WordNet

Introduction

Automated taxonomy construction/prediction is often evaluated on very narrow domains, e.g., biological species. We provide a much broader dataset covering a wide range of different domains.

Our benchmark dataset consists of bottomed-out sub-trees extracted from the Princeton WordNet lexical database. Please refer to the paper below for further details.

Figure: Example taxonomy of different kinds of bottles, with incorrect predictions highlighted in red.

Data

Bansal & de Melo WordNet Taxonomy Dataset

Download our WordNet-derived taxonomy prediction benchmark dataset.

Format: Each line provides a single ground-truth taxonomy tree. Each tree is written with (possibly nested) parentheses as follows: (parent child-1 ... child-n), where each parent is a word and each child-i is either a word or a tree itself. Words are given as plaintext, except that the special marker _$_ replaces space characters so that multi-word expressions remain single tokens.
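
The format described above can be read with a small recursive parser. The following Python sketch is only an illustration and is not part of the dataset distribution: the names Node, parse_tree, and edges are hypothetical, and the sample line is made up for demonstration rather than taken from the data file. It assumes that tokens are separated by whitespace and that words never contain parentheses.

```python
import re
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple

SPACE_MARKER = "_$_"  # stands in for a space inside multi-word expressions


@dataclass
class Node:
    """One taxonomy term together with its child terms (hypothetical helper)."""
    word: str
    children: List["Node"] = field(default_factory=list)


def parse_tree(line: str) -> Node:
    """Parse one line of the form (parent child-1 ... child-n) into a Node."""
    # Parentheses are tokens of their own; everything else splits on whitespace.
    tokens = re.findall(r"[()]|[^\s()]+", line)
    pos = 0

    def decode(token: str) -> str:
        return token.replace(SPACE_MARKER, " ")

    def parse_node() -> Node:
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return Node(decode(token))      # a bare word is a leaf
        node = Node(decode(tokens[pos]))    # first token after "(" is the parent
        pos += 1
        while tokens[pos] != ")":
            node.children.append(parse_node())
        pos += 1                            # consume the closing ")"
        return node

    try:
        root = parse_node()
    except IndexError:
        raise ValueError("unbalanced parentheses in line") from None
    if pos != len(tokens):
        raise ValueError("unexpected tokens after the end of the tree")
    return root


def edges(node: Node) -> Iterator[Tuple[str, str]]:
    """Yield (parent, child) pairs in depth-first order."""
    for child in node.children:
        yield (node.word, child.word)
        yield from edges(child)


# Made-up sample line, not copied from the dataset.
sample = "(bottle (flask canteen hip_$_flask) water_$_bottle)"
tree = parse_tree(sample)
for parent, child in edges(tree):
    print(parent, "->", child)
```

Reading the full file would then amount to calling parse_tree on each non-empty line. Flattening each tree into (parent, child) pairs is one convenient representation for comparing predicted and ground-truth edges, though other representations may suit a given evaluation better.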

License: WordNet 3.0 license (which allows free use, including for commercial purposes)

References

For more information about the datasets and our structured prediction method, please consult our publication:

Structured Learning for Taxonomy Induction with Belief Propagation
Mohit Bansal, David Burkett, Gerard de Melo, Dan Klein (2014)
In: Proc. ACL 2014. Association for Computational Linguistics.
🏆 Best Paper Honorable Mention
Acceptance rate: 26.2%
