The 20 Newsgroups dataset is a collection of approximately 20,000 newsgroup documents, partitioned nearly evenly across 20 different newsgroups. The fetch_20newsgroups_vectorized()
function in scikit-learn loads this dataset as a precomputed bag-of-words feature matrix (token counts, normalized by default), so there is no need to vectorize the raw text yourself before classification.
The dataset covers a wide range of topics and is commonly used for text classification tasks, serving as a useful benchmark for text processing techniques and algorithms.
Key function arguments when loading the dataset include subset, which selects the portion of the data to load ('train', 'test', or 'all'), and remove, which filters out sections such as headers, footers, and quoted replies. Removing these parts yields cleaner text and keeps a classifier from picking up on message metadata rather than the message content.
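For instance, the predefined training and test splits can be loaded separately with the same cleaning options; the snippet below is a minimal illustrative sketch and is not part of the main example that follows:
from sklearn.datasets import fetch_20newsgroups_vectorized

# Load the official training and test splits, stripping headers,
# footers, and quoted replies from the raw text before vectorization
train_set = fetch_20newsgroups_vectorized(subset='train', remove=('headers', 'footers', 'quotes'))
test_set = fetch_20newsgroups_vectorized(subset='test', remove=('headers', 'footers', 'quotes'))

# The standard "bydate" split has 11314 training and 7532 test documents
print(train_set.data.shape)
print(test_set.data.shape)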
This is a multi-class text classification problem, and algorithms such as Naive Bayes, Support Vector Machines, and Logistic Regression are commonly applied to it.
from sklearn.datasets import fetch_20newsgroups_vectorized
# Load the 20 Newsgroups dataset with precomputed bag-of-words features
dataset = fetch_20newsgroups_vectorized(subset='all', remove=('headers', 'footers', 'quotes'))
# Display dataset information
print(f"Dataset contains {dataset.data.shape[0]} documents with shape {dataset.data.shape}")
print(f"Dataset has {len(dataset.target_names)} categories: {dataset.target_names}")
# Example document's feature vector and category
example_index = 0
example_vector = dataset.data[example_index]
example_category = dataset.target[example_index]
print(f"\nTF-IDF feature vector of example document:\n{example_vector}")
print(f"\nCategory of example document: {dataset.target_names[example_category]}")
Running the example gives an output like:
Dataset contains 18846 documents with shape (18846, 101631)
Dataset has 20 categories: ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
Feature vector of example document:
(0, 2939) 0.019153011712702294
(0, 3269) 0.019153011712702294
(0, 3417) 0.019153011712702294
(0, 3418) 0.019153011712702294
(0, 3427) 0.019153011712702294
(0, 3429) 0.019153011712702294
(0, 3453) 0.019153011712702294
(0, 6045) 0.019153011712702294
(0, 7883) 0.019153011712702294
(0, 8722) 0.019153011712702294
(0, 8818) 0.019153011712702294
(0, 9022) 0.019153011712702294
(0, 9963) 0.019153011712702294
(0, 16939) 0.019153011712702294
(0, 17426) 0.019153011712702294
(0, 17936) 0.019153011712702294
(0, 18110) 0.019153011712702294
(0, 18305) 0.019153011712702294
(0, 18408) 0.05745903513810688
(0, 18521) 0.4213662576794504
(0, 18755) 0.03830602342540459
(0, 18903) 0.019153011712702294
(0, 19443) 0.07661204685080918
(0, 19476) 0.019153011712702294
(0, 19576) 0.019153011712702294
: :
(0, 94388) 0.07661204685080918
(0, 94755) 0.019153011712702294
(0, 95663) 0.019153011712702294
(0, 95770) 0.019153011712702294
(0, 95778) 0.019153011712702294
(0, 95844) 0.019153011712702294
(0, 96061) 0.019153011712702294
(0, 96247) 0.09576505856351146
(0, 96391) 0.05745903513810688
(0, 96454) 0.07661204685080918
(0, 96532) 0.03830602342540459
(0, 96539) 0.019153011712702294
(0, 96857) 0.019153011712702294
(0, 96917) 0.019153011712702294
(0, 97285) 0.03830602342540459
(0, 97332) 0.019153011712702294
(0, 97469) 0.019153011712702294
(0, 99911) 0.019153011712702294
(0, 99957) 0.019153011712702294
(0, 99968) 0.03830602342540459
(0, 100197) 0.019153011712702294
(0, 100208) 0.019153011712702294
(0, 100221) 0.03830602342540459
(0, 100758) 0.03830602342540459
(0, 100759) 0.03830602342540459
Category of example document: talk.politics.mideast
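Each line of the printed vector is a (row, column) index pair followed by a non-zero value, because dataset.data is stored as a SciPy sparse matrix. As a quick, optional check of how sparse the data actually is (an illustrative addition reusing the variables from the example above):
# dataset.data is a SciPy sparse matrix, so only non-zero entries are stored
print(type(dataset.data))
print(f"Non-zero entries in the example document: {example_vector.nnz}")
density = dataset.data.nnz / (dataset.data.shape[0] * dataset.data.shape[1])
print(f"Fraction of non-zero entries in the full matrix: {density:.6f}")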
The steps are as follows:
- Import the fetch_20newsgroups_vectorized function from sklearn.datasets: this function loads the 20 Newsgroups dataset as a precomputed feature matrix, ready for text classification.
- Fetch the dataset using fetch_20newsgroups_vectorized(): pass subset='all' to include both the training and test splits, and set remove=('headers', 'footers', 'quotes') to clean the text by stripping extraneous parts of the documents.
- Print dataset details: display the number of documents and the shape of the feature matrix with dataset.data.shape, and show the number and names of categories using len(dataset.target_names) and dataset.target_names.
- Show an example document's feature vector and category: print the feature vector of the first document using dataset.data[0], and display the document's category by indexing dataset.target_names with the first entry of dataset.target.
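Because the documents are split almost evenly across the 20 newsgroups, an optional follow-up (an illustrative addition, not part of the original example) is to count how many documents fall into each category, reusing the dataset object loaded above:
import numpy as np

# dataset.target holds integer labels 0-19; count documents per category
per_class_counts = np.bincount(dataset.target)
for name, count in zip(dataset.target_names, per_class_counts):
    print(f"{name}: {count}")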
This example demonstrates loading and exploring the 20 Newsgroups dataset with precomputed features using scikit-learn's fetch_20newsgroups_vectorized()
function, and it sets the stage for applying text classification algorithms directly to the feature matrix.
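As one possible next step, the sketch below (an illustrative addition, not part of the original example) trains a Multinomial Naive Bayes classifier on the predefined training split and evaluates it on the test split; Logistic Regression or a linear SVM could be substituted in the same way.
from sklearn.datasets import fetch_20newsgroups_vectorized
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Load the predefined train and test splits with the same cleaning options
train = fetch_20newsgroups_vectorized(subset='train', remove=('headers', 'footers', 'quotes'))
test = fetch_20newsgroups_vectorized(subset='test', remove=('headers', 'footers', 'quotes'))

# Fit a Multinomial Naive Bayes classifier directly on the sparse feature matrix
model = MultinomialNB()
model.fit(train.data, train.target)

# Evaluate on the held-out test split
predictions = model.predict(test.data)
print(f"Test accuracy: {accuracy_score(test.target, predictions):.3f}")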