General Taxonomy Prediction Benchmark Dataset

created by Mohit Bansal and Gerard de Melo based on WordNet


Automated taxonomy construction and prediction are often evaluated on very narrow domains, e.g., biological species. We provide a much broader dataset that covers a wide range of domains.

Our benchmark dataset consists of bottomed-out sub-trees extracted from the Princeton WordNet lexical database. Please refer to the paper below for further details.

Bansal & de Melo WordNet Taxonomy Dataset
Download our WordNet-derived taxonomy prediction benchmark dataset.

Format: Each line provides a single ground-truth taxonomy tree. Each tree is written with (possibly nested) parentheses as follows: (parent child-1 ... child-n), where each parent is a word and each child-i is either a word or a tree itself. Words are given as plaintext, except that the special marker _$_ replaces space characters so that multi-word expressions remain single tokens.
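
As an illustration of the format described above, here is a minimal Python sketch of a reader for one line of the file. It is not code distributed with the dataset. The names Node, parse_tree and edges are hypothetical, the example line is made up for illustration rather than taken from the data, and the sketch assumes that words never contain parentheses and that tokens are separated by whitespace.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

SPACE_MARKER = "_$_"  # stands in for a space inside multi-word expressions


@dataclass
class Node:
    """One taxonomy term together with its direct children (hypothetical helper)."""
    word: str
    children: List["Node"] = field(default_factory=list)


def parse_tree(line: str) -> Node:
    """Parse one line of the form (parent child-1 ... child-n) into a Node tree."""
    # Pad parentheses with spaces so that split() yields them as separate tokens.
    tokens = line.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse_node() -> Node:
        nonlocal pos
        token = tokens[pos]
        if token != "(":
            # A bare word is a leaf.
            pos += 1
            return Node(token.replace(SPACE_MARKER, " "))
        # "(" opens a subtree: the next token is the parent word.
        pos += 1
        node = Node(tokens[pos].replace(SPACE_MARKER, " "))
        pos += 1
        while tokens[pos] != ")":
            node.children.append(parse_node())
        pos += 1  # consume the closing ")"
        return node

    root = parse_node()
    if pos != len(tokens):
        raise ValueError("unexpected tokens after the end of the tree")
    return root


def edges(node: Node) -> List[Tuple[str, str]]:
    """List (parent, child) pairs in pre-order."""
    pairs = []
    for child in node.children:
        pairs.append((node.word, child.word))
        pairs.extend(edges(child))
    return pairs


if __name__ == "__main__":
    # Made-up example line, not taken from the dataset.
    tree = parse_tree("(animal (dog puppy) wild_$_cat)")
    print(edges(tree))
```

Reading a whole file then amounts to calling parse_tree on each non-empty line. A line with unbalanced parentheses raises an IndexError in this sketch, so stricter validation would be needed before relying on it.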

License: WordNet 3.0 license (which allows free use, including for commercial purposes)


For more information about the dataset and our structured prediction method, please consult our publication:

Structured Learning for Taxonomy Induction with Belief Propagation
Mohit Bansal, David Burkett, Gerard de Melo, Dan Klein (2014)
In: Proc. ACL 2014. Association for Computational Linguistics.
🏆  Best Paper Honorable Mention
Acceptance rate: 26.2%


