Popular Datasets for Multi-Label Text Classification

Mohamad Mahmood

There are several popular datasets available for multi-label text classification. Here are a few examples:

  1. Reuters-21578: This dataset contains 21,578 newswire stories published by the Reuters news agency in 1987. Each story can be assigned zero, one, or several labels from a predefined set of topics, which has made it a long-standing benchmark for multi-label classification research.

  2. Reuters Corpus Volume I (RCV1): Similar to Reuters-21578, RCV1 is a much larger collection of Reuters news stories. It contains over 800,000 documents, and each document is tagged with one or more codes drawn from a hierarchy of topics.

  3. Amazon Reviews: This dataset includes product reviews from the Amazon e-commerce platform. Each review is linked to a product, and a product can belong to several categories, such as electronics, books, or clothing. This makes the data suitable for multi-label product categorization, and the accompanying ratings are often used for sentiment analysis.

  4. Yahoo News Classification: This dataset contains news articles from the Yahoo News website, where each article is assigned multiple labels from a set of categories. It covers a broad range of topics, including sports, politics, entertainment, and more.

  5. Stack Overflow: Stack Overflow is a popular question-and-answer website for programming-related topics. The dataset includes questions asked on the site, where each question carries between one and five tags. The tags cover programming languages as well as frameworks, libraries, and tools, so the data can be used for multi-label topic classification in the programming domain.

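  6. DBpedia: DBpedia is a knowledge base extracted from Wikipedia, in which entities are typed according to the DBpedia ontology. Note that the widely used DBpedia text classification benchmark assigns each document to exactly one of 14 classes, so it is single-label. Multi-label setups are usually built from the hierarchical ontology classes or from the Wikipedia categories attached to each article.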
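
What the datasets above have in common is that each document carries a set of labels rather than a single class. The following is a minimal sketch, in plain Python, of how such label sets are typically turned into multi-hot target vectors before training. The three questions and their tags are made up for illustration and are not rows from any of these datasets, and `build_label_index` and `to_multi_hot` are hypothetical helper names.

```python
from typing import Dict, List, Sequence

# Toy, hand-written examples (NOT real dataset rows): a text plus its tags.
documents = [
    ("How do I merge two dictionaries?", ["python", "dictionary"]),
    ("Why is my JOIN returning duplicate rows?", ["sql", "join"]),
    ("Read SQL query results into a DataFrame", ["python", "sql", "pandas"]),
]


def build_label_index(label_sets: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Map every distinct label to a column index, in sorted order."""
    distinct = sorted({label for label_set in label_sets for label in label_set})
    return {label: column for column, label in enumerate(distinct)}


def to_multi_hot(labels: Sequence[str], label_index: Dict[str, int]) -> List[int]:
    """Return a 0/1 vector with a 1 in the column of every assigned label."""
    row = [0] * len(label_index)
    for label in labels:
        row[label_index[label]] = 1
    return row


label_sets = [tags for _, tags in documents]
label_index = build_label_index(label_sets)
targets = [to_multi_hot(tags, label_index) for tags in label_sets]

for (text, _), row in zip(documents, targets):
    print(row, text)
```

Each row of `targets` has one column per distinct label, so a classifier can predict every label independently instead of choosing a single class.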

These datasets provide a good starting point for multi-label text classification research and are widely used in the NLP community. Most of them can be obtained for experimentation and model development, although access and licensing terms vary from one dataset to another.
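
When comparing candidate datasets, two commonly reported summary statistics are label cardinality (the average number of labels per document) and label density (cardinality divided by the number of distinct labels). Below is a minimal sketch of how they could be computed. The label sets are invented toy data, and `label_stats` is a hypothetical helper name rather than a function from any library.

```python
from typing import Dict, Sequence


def label_stats(label_sets: Sequence[Sequence[str]]) -> Dict[str, float]:
    """Compute basic multi-label statistics for a list of label sets."""
    if not label_sets:
        raise ValueError("label_sets must contain at least one document")

    # All labels that occur anywhere in the collection.
    distinct = {label for label_set in label_sets for label in label_set}

    # Average number of (unique) labels attached to a document.
    cardinality = sum(len(set(ls)) for ls in label_sets) / len(label_sets)

    # Cardinality normalised by the size of the label vocabulary.
    density = cardinality / len(distinct) if distinct else 0.0

    return {
        "num_labels": len(distinct),
        "cardinality": cardinality,
        "density": density,
    }


# Invented toy label sets, one list per document.
toy_label_sets = [
    ["sports"],
    ["politics", "economy"],
    ["sports", "entertainment", "politics"],
]

stats = label_stats(toy_label_sets)
print(stats)
```

A cardinality close to 1 suggests the data is nearly single-label, while a low density with a large label vocabulary usually means many rare labels.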


Written by

Mohamad Mahmood

Mohamad's interest is in Programming (Mobile, Web, Database and Machine Learning). He studies at the Center For Artificial Intelligence Technology (CAIT), Universiti Kebangsaan Malaysia (UKM).