
Scikit-Learn fetch_20newsgroups() Dataset

The 20 Newsgroups dataset is a collection of approximately 20,000 newsgroup documents, partitioned nearly evenly across 20 different newsgroups. The version loaded by scikit-learn contains 18,846 documents.

The dataset covers a wide range of topics and is commonly used as a benchmark for text classification and clustering techniques.

Key function arguments when loading the dataset include subset to select a portion of the data (train, test, or all), remove to filter out certain sections like headers, footers, or quotes, and categories to specify a subset of the available newsgroup categories.
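As a rough illustration of how these arguments combine, the sketch below wraps the call in a small helper. The helper name load_newsgroups and its defaults are illustrative choices, not part of scikit-learn. The usage lines are commented out because the first call downloads the data.

```python
from sklearn.datasets import fetch_20newsgroups


def load_newsgroups(subset="train", categories=None):
    """Hypothetical helper: fetch one subset, optionally limited to some categories."""
    return fetch_20newsgroups(
        subset=subset,          # 'train', 'test', or 'all'
        categories=categories,  # None loads all 20 newsgroups
        remove=("headers", "footers", "quotes"),
    )


# Example usage (downloads the data on first call):
# two_groups = load_newsgroups("train", ["rec.sport.hockey", "sci.space"])
# print(two_groups.target_names)
```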

Predicting a document's newsgroup is a multi-class text classification problem, for which Naive Bayes, Support Vector Machines, and Logistic Regression are common baselines.

# Import the dataset loader from scikit-learn
from sklearn.datasets import fetch_20newsgroups

# Fetch all documents (train + test), stripping headers, footers, and quoted replies
dataset = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'), random_state=42)

# Summarize the dataset
print(f"Dataset contains {len(dataset.data)} documents")
print(f"Dataset has {len(dataset.target_names)} categories: {dataset.target_names}")

# Show the first document and the name of its category
print("\nExample document:")
print(dataset.data[0])
print(f"\nCategory of example document: {dataset.target_names[dataset.target[0]]}")

Running the example gives an output like:

Dataset contains 18846 documents
Dataset has 20 categories: ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']

Example document:


I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!



Category of example document: rec.sport.hockey

The steps are as follows:

  1. Import the fetch_20newsgroups function from sklearn.datasets:

    • This function downloads the 20 Newsgroups dataset on first use, caches it locally (by default under ~/scikit_learn_data), and loads it from the cache on later calls.
  2. Fetch the dataset using fetch_20newsgroups():

    • Specify subset='all' to include both the training and testing subsets of the data.
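    • Set remove=('headers', 'footers', 'quotes') to strip newsgroup headers, signature blocks, and quoted replies. A classifier can otherwise overfit to this metadata instead of learning from the message text.
    • Use random_state=42 to fix the seed used to shuffle the documents (shuffling is on by default), so the document order is reproducible. It does not change the train/test split, which is predefined.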
  3. Print the number of documents and categories in the dataset:

    • Access the total number of documents using len(dataset.data).
    • Get the number and names of the newsgroup categories using len(dataset.target_names) and dataset.target_names, respectively.
  4. Display an example document and its corresponding category:

    • Print the text of the first document in the dataset using dataset.data[0].
    • Show the category of this document by indexing dataset.target_names with the first entry of dataset.target, which contains the numeric labels for each document’s category.
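A natural next exploration step is to count how many documents fall into each category. The sketch below is one way to do that; the helper name documents_per_category is a hypothetical choice, and the toy labels are made up so the snippet runs without downloading anything.

```python
from collections import Counter


def documents_per_category(target, target_names):
    """Return {category name: document count}, most frequent first."""
    counts = Counter(target_names[int(label)] for label in target)
    return dict(counts.most_common())


# Toy labels standing in for dataset.target and dataset.target_names
toy_names = ["rec.sport.hockey", "sci.space"]
toy_target = [0, 1, 0, 0]
print(documents_per_category(toy_target, toy_names))
# → {'rec.sport.hockey': 3, 'sci.space': 1}
```

With the real data, the same helper can be called as documents_per_category(dataset.target, dataset.target_names).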

This example demonstrates how to quickly load and explore the 20 Newsgroups dataset using scikit-learn’s fetch_20newsgroups() function, allowing you to inspect the number of documents, categories, and individual document contents and labels. This sets the stage for further text preprocessing, feature extraction, and application of classification algorithms.
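As a hedged sketch of what that next stage might look like, the snippet below chains TF-IDF feature extraction with a Multinomial Naive Bayes classifier. The choice of TF-IDF, the helper name train_text_classifier, and the four-sentence toy corpus are all assumptions made for illustration. The toy corpus stands in for dataset.data and dataset.target so the code runs without a download.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def train_text_classifier(texts, labels):
    """Fit a TF-IDF + Multinomial Naive Bayes pipeline on raw text."""
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(texts, labels)
    return model


# Toy corpus (invented for illustration): label 0 = hockey, label 1 = space
toy_texts = [
    "the goalie made a great save in the hockey game",
    "the team scored in overtime on the ice",
    "the rocket launch reached orbit",
    "nasa plans a new moon mission to space",
]
toy_labels = [0, 0, 1, 1]

model = train_text_classifier(toy_texts, toy_labels)
predictions = model.predict(["hockey game on the ice", "rocket launch into orbit"])
print(predictions)
```

On the real dataset, one would typically fit on the 'train' subset and evaluate on the 'test' subset rather than on hand-written sentences.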


