The 20 Newsgroups dataset is a collection of approximately 20,000 newsgroup documents, partitioned nearly evenly across 20 different newsgroups. The fetch_20newsgroups_vectorized()
function in scikit-learn loads this dataset as a precomputed bag-of-words feature matrix (token counts, normalized by default), so there is no need to vectorize the raw text yourself before classification.
The dataset covers a wide range of topics and is commonly used for text classification tasks, serving as a useful benchmark for text processing techniques and algorithms.
Key function arguments when loading the dataset include subset, which selects the portion of the data to load ('train', 'test', or 'all'), and remove, which filters out sections such as headers, footers, and quoted replies. Removing these parts yields cleaner text and keeps a classifier from picking up on message metadata rather than the message content.
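For instance, the predefined training and test splits can be loaded separately with the same cleaning options; the snippet below is a minimal illustrative sketch and is not part of the main example that follows:
from sklearn.datasets import fetch_20newsgroups_vectorized

# Load the official training and test splits, stripping headers,
# footers, and quoted replies from the raw text before vectorization
train_set = fetch_20newsgroups_vectorized(subset='train', remove=('headers', 'footers', 'quotes'))
test_set = fetch_20newsgroups_vectorized(subset='test', remove=('headers', 'footers', 'quotes'))

# The standard "bydate" split has 11314 training and 7532 test documents
print(train_set.data.shape)
print(test_set.data.shape)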
This is a multi-class text classification problem, and algorithms such as Naive Bayes, Support Vector Machines, and Logistic Regression are commonly applied to it.
from sklearn.datasets import fetch_20newsgroups_vectorized
# Load the 20 Newsgroups dataset with precomputed bag-of-words features
dataset = fetch_20newsgroups_vectorized(subset='all', remove=('headers', 'footers', 'quotes'))
# Display dataset information
print(f"Dataset contains {dataset.data.shape[0]} documents with shape {dataset.data.shape}")
print(f"Dataset has {len(dataset.target_names)} categories: {dataset.target_names}")
# Example document's feature vector and category
example_index = 0
example_vector = dataset.data[example_index]
example_category = dataset.target[example_index]
print(f"\nTF-IDF feature vector of example document:\n{example_vector}")
print(f"\nCategory of example document: {dataset.target_names[example_category]}")
Running the example gives an output like:
Dataset contains 18846 documents with shape (18846, 101631)
Dataset has 20 categories: ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
Feature vector of example document:
(0, 2939) 0.019153011712702294
(0, 3269) 0.019153011712702294
(0, 3417) 0.019153011712702294
(0, 3418) 0.019153011712702294
(0, 3427) 0.019153011712702294
(0, 3429) 0.019153011712702294
(0, 3453) 0.019153011712702294
(0, 6045) 0.019153011712702294
(0, 7883) 0.019153011712702294
(0, 8722) 0.019153011712702294
(0, 8818) 0.019153011712702294
(0, 9022) 0.019153011712702294
(0, 9963) 0.019153011712702294
(0, 16939) 0.019153011712702294
(0, 17426) 0.019153011712702294
(0, 17936) 0.019153011712702294
(0, 18110) 0.019153011712702294
(0, 18305) 0.019153011712702294
(0, 18408) 0.05745903513810688
(0, 18521) 0.4213662576794504
(0, 18755) 0.03830602342540459
(0, 18903) 0.019153011712702294
(0, 19443) 0.07661204685080918
(0, 19476) 0.019153011712702294
(0, 19576) 0.019153011712702294
: :
(0, 94388) 0.07661204685080918
(0, 94755) 0.019153011712702294
(0, 95663) 0.019153011712702294
(0, 95770) 0.019153011712702294
(0, 95778) 0.019153011712702294
(0, 95844) 0.019153011712702294
(0, 96061) 0.019153011712702294
(0, 96247) 0.09576505856351146
(0, 96391) 0.05745903513810688
(0, 96454) 0.07661204685080918
(0, 96532) 0.03830602342540459
(0, 96539) 0.019153011712702294
(0, 96857) 0.019153011712702294
(0, 96917) 0.019153011712702294
(0, 97285) 0.03830602342540459
(0, 97332) 0.019153011712702294
(0, 97469) 0.019153011712702294
(0, 99911) 0.019153011712702294
(0, 99957) 0.019153011712702294
(0, 99968) 0.03830602342540459
(0, 100197) 0.019153011712702294
(0, 100208) 0.019153011712702294
(0, 100221) 0.03830602342540459
(0, 100758) 0.03830602342540459
(0, 100759) 0.03830602342540459
Category of example document: talk.politics.mideast
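Each line of the printed vector is a (row, column) index pair followed by a non-zero value, because dataset.data is stored as a SciPy sparse matrix. As a quick, optional check of how sparse the data actually is (an illustrative addition reusing the variables from the example above):
# dataset.data is a SciPy sparse matrix, so only non-zero entries are stored
print(type(dataset.data))
print(f"Non-zero entries in the example document: {example_vector.nnz}")
density = dataset.data.nnz / (dataset.data.shape[0] * dataset.data.shape[1])
print(f"Fraction of non-zero entries in the full matrix: {density:.6f}")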
The steps are as follows:
- Import the fetch_20newsgroups_vectorized function from sklearn.datasets: this function loads the 20 Newsgroups dataset as a precomputed feature matrix, ready for text classification.
- Fetch the dataset using fetch_20newsgroups_vectorized(): pass subset='all' to include both the training and test splits, and set remove=('headers', 'footers', 'quotes') to clean the text by stripping extraneous parts of the documents.
- Print dataset details: display the number of documents and the shape of the feature matrix with dataset.data.shape, and show the number and names of categories using len(dataset.target_names) and dataset.target_names.
- Show an example document's feature vector and category: print the feature vector of the first document using dataset.data[0], and display the document's category by indexing dataset.target_names with the first entry of dataset.target.
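Because the documents are split almost evenly across the 20 newsgroups, an optional follow-up (an illustrative addition, not part of the original example) is to count how many documents fall into each category, reusing the dataset object loaded above:
import numpy as np

# dataset.target holds integer labels 0-19; count documents per category
per_class_counts = np.bincount(dataset.target)
for name, count in zip(dataset.target_names, per_class_counts):
    print(f"{name}: {count}")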
This example demonstrates loading and exploring the 20 Newsgroups dataset with precomputed features using scikit-learn's fetch_20newsgroups_vectorized()
function, and it sets the stage for applying text classification algorithms directly to the feature matrix.
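As one possible next step, the sketch below (an illustrative addition, not part of the original example) trains a Multinomial Naive Bayes classifier on the predefined training split and evaluates it on the test split; Logistic Regression or a linear SVM could be substituted in the same way.
from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Load the predefined train and test splits with the same cleaning options
train = fetch_20newsgroups_vectorized(subset='train', remove=('headers', 'footers', 'quotes'))
test = fetch_20newsgroups_vectorized(subset='test', remove=('headers', 'footers', 'quotes'))

# Fit a Multinomial Naive Bayes classifier directly on the sparse feature matrix
model = MultinomialNB()
model.fit(train.data, train.target)

# Evaluate on the held-out test split
predictions = model.predict(test.data)
print(f"Test accuracy: {accuracy_score(test.target, predictions):.3f}")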