The 20 Newsgroups dataset is a collection of approximately 20,000 newsgroup documents, partitioned evenly across 20 different newsgroups.
This dataset is commonly used for text classification and clustering tasks, covering a wide range of topics. It serves as a useful benchmark for various text processing techniques and algorithms.
Key function arguments when loading the dataset include subset to select a portion of the data ('train', 'test', or 'all'), remove to filter out certain sections such as headers, footers, or quotes, and categories to specify a subset of the available newsgroup categories.
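As a minimal sketch of the categories argument (assuming scikit-learn is installed and the dataset can be downloaded; the two chosen newsgroups here are just an illustrative example):

```python
from sklearn.datasets import fetch_20newsgroups

# Load only two of the twenty newsgroups from the training split,
# stripping metadata that could leak label information
two_cats = fetch_20newsgroups(
    subset='train',
    categories=['sci.space', 'rec.sport.hockey'],
    remove=('headers', 'footers', 'quotes'),
)
print(len(two_cats.data))      # number of training documents in the two groups
print(two_cats.target_names)   # category names, sorted alphabetically
```

Note that target_names is returned in alphabetical order regardless of the order in which the categories were requested.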
This is a text classification problem where common algorithms like Naive Bayes, Support Vector Machines, and Logistic Regression are often applied.
from sklearn.datasets import fetch_20newsgroups
dataset = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'), random_state=42)
print(f"Dataset contains {len(dataset.data)} documents")
print(f"Dataset has {len(dataset.target_names)} categories: {dataset.target_names}")
print(f"\nExample document:")
print(dataset.data[0])
print(f"\nCategory of example document: {dataset.target_names[dataset.target[0]]}")
Running the example gives an output like:
Dataset contains 18846 documents
Dataset has 20 categories: ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
Example document:
I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game. PENS RULE!!!
Category of example document: rec.sport.hockey
The steps are as follows:
Import the fetch_20newsgroups function from sklearn.datasets:
- This function allows us to easily load the 20 Newsgroups dataset directly from the scikit-learn library.
Fetch the dataset using fetch_20newsgroups():
- Specify subset='all' to include both the training and testing subsets of the data.
- Set remove=('headers', 'footers', 'quotes') to eliminate metadata from the documents that could introduce noise into the text classification task.
- Use random_state=42 to make the shuffling of the documents reproducible.
Print the number of documents and categories in the dataset:
- Access the total number of documents using len(dataset.data).
- Get the number and names of the newsgroup categories using len(dataset.target_names) and dataset.target_names, respectively.
Display an example document and its corresponding category:
- Print the text of the first document in the dataset using dataset.data[0].
- Show the category of this document by indexing dataset.target_names with the first entry of dataset.target, which contains the numeric labels for each document's category.
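Since dataset.target holds one integer label per document, a quick sketch (assuming NumPy is available alongside scikit-learn) of how to count the documents in each category:

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups

dataset = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

# dataset.target is an array of integer labels in the range 0..19;
# np.unique with return_counts tallies documents per label
labels, counts = np.unique(dataset.target, return_counts=True)
for label, count in zip(labels, counts):
    print(f"{dataset.target_names[label]}: {count}")
```

The counts are roughly balanced across the 20 newsgroups, which is one reason the dataset is a convenient classification benchmark.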
This example demonstrates how to quickly load and explore the 20 Newsgroups dataset using scikit-learn’s fetch_20newsgroups()
function, allowing you to inspect the number of documents, categories, and individual document contents and labels. This sets the stage for further text preprocessing, feature extraction, and application of classification algorithms.
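As one possible next step, here is a hedged sketch of a baseline classifier on the dataset's predefined train/test splits, using TF-IDF features and multinomial Naive Bayes (one of the algorithms mentioned above; the exact accuracy will depend on these preprocessing choices):

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

# Load the predefined train and test splits separately
train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

# TF-IDF features feeding a multinomial Naive Bayes classifier
model = make_pipeline(TfidfVectorizer(stop_words='english'), MultinomialNB())
model.fit(train.data, train.target)

pred = model.predict(test.data)
print(f"Test accuracy: {accuracy_score(test.target, pred):.3f}")
```

Removing headers, footers, and quotes makes the task noticeably harder (and more realistic), since those sections often leak the newsgroup name directly into the text.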