The Reuters Corpus Volume I (RCV1) dataset is a large collection of news documents, often used for text classification tasks.
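When loading the dataset, key function arguments include return_X_y for returning the data and targets as a (data, target) tuple and subset for selecting the 'train', 'test', or 'all' split. The features are returned as a SciPy sparse matrix; fetch_rcv1 does not offer an as_frame option for returning a DataFrame.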
RCV1 is a multilabel classification problem, where linear algorithms like Logistic Regression and SVM are typically used. The example below loads the dataset and inspects it.
from sklearn.datasets import fetch_rcv1
# Fetch the dataset
dataset = fetch_rcv1()
# Display dataset shape
print(f"Dataset shape: {dataset.data.shape}")
# Display first few rows of the dataset
print(f"First few rows of the dataset:\n{dataset.data[:5]}")
Running the example gives an output like:
Dataset shape: (804414, 47236)
First few rows of the dataset:
(0, 863) 0.0497399253756197
(0, 1522) 0.044664135988103
(0, 1680) 0.0673871572152868
(0, 2292) 0.0718104827746566
(0, 2844) 0.0657133637266077
(0, 2866) 0.0653401708076665
(0, 3239) 0.0795167845321379
(0, 4124) 0.0423215276156812
(0, 4270) 0.0691368598826452
(0, 4664) 0.0500863047167235
(0, 5215) 0.252185352537681
(0, 5572) 0.0672561839956375
(0, 5698) 0.0594998147298331
(0, 5793) 0.0737821454910533
(0, 6221) 0.12450060912141
(0, 6591) 0.101431159576997
(0, 7226) 0.194090655513477
(0, 7974) 0.0766400848671463
(0, 8144) 0.0295331356836656
(0, 8758) 0.0595662280181838
(0, 8770) 0.130789753977649
(0, 8900) 0.052116236521377
(0, 8926) 0.0367838394252549
(0, 8939) 0.0479419428634425
(0, 9106) 0.0533192746608269
: :
(4, 31159) 0.0442207572612894
(4, 31317) 0.049010468085621
(4, 31593) 0.0341783624171978
(4, 31654) 0.141422969807087
(4, 32247) 0.365793379740658
(4, 32581) 0.115395824241365
(4, 32911) 0.082587511124252
(4, 34203) 0.03301669507691
(4, 35040) 0.163988454435459
(4, 35597) 0.083577839218575
(4, 36935) 0.193540088658331
(4, 37106) 0.0825272755326549
(4, 37880) 0.122071444286704
(4, 39144) 0.0614645570983034
(4, 39175) 0.0658043600067598
(4, 39496) 0.0821108240723932
(4, 39524) 0.0212798042961383
(4, 39767) 0.104720913743799
(4, 40926) 0.125674378303718
(4, 41203) 0.0670817651876355
(4, 41628) 0.110707254941027
(4, 42437) 0.0912860804504496
(4, 44065) 0.0962313447247118
(4, 45883) 0.0850971213416254
(4, 45895) 0.130110865082658
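The output above is how SciPy prints a sparse matrix: each line is a "(row, column) value" triple for one stored non-zero entry, and everything not listed is zero. As a minimal sketch that runs without downloading RCV1, the toy matrix below (made-up values, standing in for dataset.data) shows the same slicing and printing behaviour on a small scale.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy stand-in for dataset.data: 3 documents, 6 vocabulary terms.
# The values are invented for illustration and are not taken from RCV1.
dense = np.array([
    [0.0, 0.5, 0.0, 0.0, 0.2, 0.0],
    [0.0, 0.0, 0.0, 0.7, 0.0, 0.0],
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.9],
])
X = csr_matrix(dense)

# Slicing rows keeps the result sparse, just like dataset.data[:5]
first_two = X[:2]
print(first_two.shape)  # (2, 6)

# Each "(row, col) value" triple corresponds to one stored entry
coo = first_two.tocoo()
for row, col, value in zip(coo.row, coo.col, coo.data):
    print(f"({row}, {col}) {value}")

# Fraction of entries that are actually stored (non-zero)
density = X.nnz / (X.shape[0] * X.shape[1])
print(f"Density: {density:.3f}")
```

On the real dataset the same density calculation on dataset.data gives a quick sense of how sparse the feature matrix is.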
The steps are as follows:
Import the fetch_rcv1 function from sklearn.datasets:
- This function loads the RCV1 dataset directly from scikit-learn.
Fetch the dataset using fetch_rcv1():
- The dataset includes a collection of news documents for text classification tasks.
Print the dataset shape:
- Access the shape using dataset.data.shape.
Display the first few rows of the dataset:
- Print the initial rows using dataset.data[:5] to get a sense of the dataset structure and content.
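As a variation on the steps above, the sketch below uses the return_X_y and subset arguments of fetch_rcv1 to get only the training split as an (X, y) tuple. The helper name load_rcv1_train is hypothetical, introduced here for illustration; the call itself is left commented out because the first fetch downloads the full dataset.

```python
from sklearn.datasets import fetch_rcv1


def load_rcv1_train(data_home=None):
    """Return (X, y) for the RCV1 training split (hypothetical helper)."""
    # subset="train" selects the training split; return_X_y=True returns
    # a (data, target) tuple instead of a Bunch object.
    X, y = fetch_rcv1(subset="train", return_X_y=True, data_home=data_home)
    return X, y


# Example usage (triggers a large download on the first call):
# X_train, y_train = load_rcv1_train()
# print(X_train.shape, y_train.shape)
```

Both returned objects are sparse matrices, so they can be passed directly to estimators that accept sparse input.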
This example demonstrates how to quickly load and explore the RCV1 dataset using scikit-learn’s fetch_rcv1() function, allowing you to inspect the data’s shape. This sets the stage for further preprocessing and application of text classification algorithms.
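As a rough illustration of that next step, the sketch below fits a one-vs-rest Logistic Regression on sparse, multilabel data. To keep it quick and self-contained it uses synthetic data from make_multilabel_classification as a stand-in for RCV1; the sample counts, feature counts, and model settings are assumptions chosen for the demo, not recommendations for the real dataset.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in for RCV1: sparse features, multilabel indicator targets
X, y = make_multilabel_classification(
    n_samples=300, n_features=50, n_classes=4, sparse=True, random_state=0
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# One binary Logistic Regression per label
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

# Predictions are an indicator matrix: one row per document, one column per label
y_pred = clf.predict(X_test)
print(y_pred.shape)  # (75, 4)
```

The same pattern should carry over to the real data, although training on all of RCV1 takes considerably longer than this toy example.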