Scikit-Learn fetch_rcv1() Dataset


The Reuters Corpus Volume I (RCV1) dataset is a large collection of news documents, often used for text classification tasks.

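When loading the dataset, key function arguments include return_X_y, which returns the data and target as a tuple instead of a Bunch object, and subset, which selects the 'train', 'test', or 'all' split. The features are returned as a SciPy sparse matrix; fetch_rcv1() has no as_frame option.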
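As a minimal sketch of how the loading arguments are passed, the helper below wraps fetch_rcv1() with return_X_y=True and a subset value ('train', 'test', or 'all'). The name load_rcv1_xy is hypothetical and only used for illustration; the call is kept inside a function because the first call downloads the full corpus.

```python
from sklearn.datasets import fetch_rcv1


def load_rcv1_xy(subset="train"):
    """Hypothetical helper: return (X, y) for one split of RCV1.

    The first call downloads the corpus and caches it locally,
    so it may take a while.
    """
    # return_X_y=True yields a (data, target) tuple instead of a Bunch
    X, y = fetch_rcv1(subset=subset, return_X_y=True)
    return X, y


# Usage (triggers the download on first run):
# X_train, y_train = load_rcv1_xy("train")
# print(X_train.shape, y_train.shape)
```

Both X and y come back as sparse matrices, so memory use stays manageable even for the full corpus.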

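RCV1 is a multilabel classification problem: each document can belong to several of the 103 categories. Linear models such as Logistic Regression and SVM are typically used for it. The example below covers only loading and inspecting the data.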

from sklearn.datasets import fetch_rcv1

# Fetch the dataset
dataset = fetch_rcv1()

# Display dataset shape (documents x features)
print(f"Dataset shape: {dataset.data.shape}")

# Display first few rows of the dataset
print(f"First few rows of the dataset:\n{dataset.data[:5]}")

Running the example gives an output like:

Dataset shape: (804414, 47236)
First few rows of the dataset:
  (0, 863)	0.0497399253756197
  (0, 1522)	0.044664135988103
  (0, 1680)	0.0673871572152868
  (0, 2292)	0.0718104827746566
  (0, 2844)	0.0657133637266077
  (0, 2866)	0.0653401708076665
  (0, 3239)	0.0795167845321379
  (0, 4124)	0.0423215276156812
  (0, 4270)	0.0691368598826452
  (0, 4664)	0.0500863047167235
  (0, 5215)	0.252185352537681
  (0, 5572)	0.0672561839956375
  (0, 5698)	0.0594998147298331
  (0, 5793)	0.0737821454910533
  (0, 6221)	0.12450060912141
  (0, 6591)	0.101431159576997
  (0, 7226)	0.194090655513477
  (0, 7974)	0.0766400848671463
  (0, 8144)	0.0295331356836656
  (0, 8758)	0.0595662280181838
  (0, 8770)	0.130789753977649
  (0, 8900)	0.052116236521377
  (0, 8926)	0.0367838394252549
  (0, 8939)	0.0479419428634425
  (0, 9106)	0.0533192746608269
  :	:
  (4, 31159)	0.0442207572612894
  (4, 31317)	0.049010468085621
  (4, 31593)	0.0341783624171978
  (4, 31654)	0.141422969807087
  (4, 32247)	0.365793379740658
  (4, 32581)	0.115395824241365
  (4, 32911)	0.082587511124252
  (4, 34203)	0.03301669507691
  (4, 35040)	0.163988454435459
  (4, 35597)	0.083577839218575
  (4, 36935)	0.193540088658331
  (4, 37106)	0.0825272755326549
  (4, 37880)	0.122071444286704
  (4, 39144)	0.0614645570983034
  (4, 39175)	0.0658043600067598
  (4, 39496)	0.0821108240723932
  (4, 39524)	0.0212798042961383
  (4, 39767)	0.104720913743799
  (4, 40926)	0.125674378303718
  (4, 41203)	0.0670817651876355
  (4, 41628)	0.110707254941027
  (4, 42437)	0.0912860804504496
  (4, 44065)	0.0962313447247118
  (4, 45883)	0.0850971213416254
  (4, 45895)	0.130110865082658
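The (row, column) value lines above are how SciPy prints a sparse matrix: only the non-zero entries are listed. A small helper can summarize such a matrix without printing every entry. The sketch below uses a hypothetical function name, describe_sparse, and runs on a tiny hand-made matrix rather than the full corpus.

```python
import numpy as np
from scipy.sparse import csr_matrix


def describe_sparse(X):
    """Return basic statistics for a SciPy sparse matrix (hypothetical helper)."""
    n_rows, n_cols = X.shape
    return {
        "shape": (n_rows, n_cols),
        "nonzero": X.nnz,                      # number of stored non-zero entries
        "density": X.nnz / (n_rows * n_cols),  # fraction of non-zero cells
        "mean_nonzero_per_row": X.nnz / n_rows,
    }


# Tiny stand-in for dataset.data: 3 documents, 4 features
toy = csr_matrix(np.array([
    [0.0, 0.5, 0.0, 0.0],
    [0.2, 0.0, 0.0, 0.3],
    [0.0, 0.0, 0.0, 0.0],
]))
stats = describe_sparse(toy)
print(stats)
```

With the real data, the same helper would be called as describe_sparse(dataset.data).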

The steps are as follows:

1. Import the fetch_rcv1 function from sklearn.datasets:
   - This function downloads the RCV1 dataset on first use and caches it locally for later calls.
2. Fetch the dataset using fetch_rcv1():
   - The returned object holds the document features in dataset.data and the category labels in dataset.target.
3. Print the dataset shape:
   - Access the shape using dataset.data.shape, which gives the number of documents and the number of features.
4. Display the first few rows of the dataset:
   - Print the initial rows using dataset.data[:5]. Because the data is a sparse matrix, only the non-zero entries of each row are shown.

This example demonstrates how to quickly load and explore the RCV1 dataset using scikit-learn’s fetch_rcv1() function, allowing you to inspect the data’s shape and its first few rows. This sets the stage for further preprocessing and application of text classification algorithms.
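As a rough sketch of that next step, the code below fits a Logistic Regression model for a single category. It assumes the target is a sparse indicator matrix with one column per category, which is how fetch_rcv1() returns dataset.target. Treating one column as a binary problem is a simplification chosen for illustration. The function name fit_single_topic and the synthetic toy data are hypothetical, and the toy data stands in for the real corpus so the snippet runs without a download.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def fit_single_topic(X, Y, topic_index, seed=0):
    """Fit a binary classifier for one column of a multilabel target matrix.

    Hypothetical helper: X is a sparse feature matrix, Y a sparse
    indicator matrix with one column per category.
    """
    # Pull one label column out as a dense 0/1 vector
    y = Y[:, topic_index].toarray().ravel().astype(int)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    # Return the fitted model and its held-out accuracy
    return model, model.score(X_test, y_test)


# Synthetic stand-in data (an assumption, not RCV1): 200 documents, 20 binary features
rng = np.random.default_rng(0)
X_toy = csr_matrix((rng.random((200, 20)) < 0.3).astype(float))
# The first toy category is present exactly when feature 0 is present
y0 = (X_toy[:, 0].toarray().ravel() > 0).astype(int)
Y_toy = csr_matrix(np.column_stack([y0, 1 - y0]))

model, acc = fit_single_topic(X_toy, Y_toy, topic_index=0)
print(f"Held-out accuracy on toy data: {acc:.2f}")
```

With the real data, X would be dataset.data and Y would be dataset.target, whose columns correspond to the entries of dataset.target_names.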

See Also