I am trying to adapt a PyTorch script that was created for linear regression. It was originally written to take in a set of random values (created with np.random) as features and targets.
I have now created a dataframe of actual data for analysis:
df = pd.read_csv('file_name.csv')
The df looks like this:
     X1     X2     X3      X4   X5  X6   X7  X8     Y1     Y2
0  0.98  514.5  294.0  110.25  7.0   2  0.0   0  15.55  21.33
1  0.98  514.5  294.0  110.25  7.0   3  0.0   0  15.55  21.33
2  0.98  514.5  294.0  110.25  7.0   4  0.0   0  15.55  21.33
3  0.98  514.5  294.0  110.25  7.0   5  0.0   0  15.55  21.33
4  0.90  563.5  318.5  122.50  7.0   2  0.0   0  20.84  28.28
…and I am currently extracting just two columns (X1 and X2) as my features, and one column (Y1) as my targets, like this:
x = df[['X1', 'X2']]
y = df['Y1']
So features look like this:
     X1     X2
0  0.98  514.5
1  0.98  514.5
2  0.98  514.5
3  0.98  514.5
4  0.90  563.5
and targets look like this:
      Y1
0  15.55
1  15.55
2  15.55
3  15.55
4  20.84
However, when I attempt to convert the features (X1 and X2) and targets (Y1) to tensors in order to feed them to the NN, the code fails at this line:
dataset = TensorDataset(x_tensor_flat, y_tensor_flat)
I get the error:
  line 45, in <module>
    dataset = TensorDataset(x_tensor, y_tensor)
AssertionError: Size mismatch between tensors
There’s clearly some shaping issue at play, but I can’t work out what it is. I have tried flattening as well as transposing the tensors, but I get the same error either way. Any help would be hugely appreciated.
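For reference, this is the quick shape check I have been using (a minimal sanity snippet, assuming the same x and y as above):

import numpy as np
import torch

x_tensor = torch.from_numpy(np.array(x)).float()
y_tensor = torch.from_numpy(np.array(y)).float()
print(x_tensor.shape)  # torch.Size([N, 2]) - one row per sample
print(y_tensor.shape)  # torch.Size([N])    - one value per sample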
Here’s the full section of code that is causing the issue:
import pandas as pd
import numpy as np
import torch
import torch.optim as optim
import torch.nn as nn
from torch.utils.data import Dataset, TensorDataset, DataLoader
from torch.utils.data.dataset import random_split

device = 'cuda' if torch.cuda.is_available() else 'cpu'

df = pd.read_csv('file_name.csv')
x = df[['X1', 'X2']]
y = df['Y1']

x_tensor = torch.from_numpy(np.array(x)).float()
y_tensor = torch.from_numpy(np.array(y)).float()

# this is the line the traceback points at
dataset = TensorDataset(x_tensor, y_tensor)
train_dataset, val_dataset = random_split(dataset, [80, 20])  # intended as an 80/20 split

train_loader = DataLoader(dataset=train_dataset, batch_size=10)
val_loader = DataLoader(dataset=val_dataset, batch_size=10)

class ManualLinearRegression(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(2, 1)

    def forward(self, x):
        return self.linear(x)

def make_train_step(model, loss_fn, optimizer):
    def train_step(x, y):
        model.train()
        yhat = model(x)
        loss = loss_fn(y, yhat)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        return loss.item()
    return train_step

torch.manual_seed(42)
model = ManualLinearRegression().to(device)
loss_fn = nn.MSELoss(reduction='mean')
optimizer = optim.SGD(model.parameters(), lr=1e-1)
train_step = make_train_step(model, loss_fn, optimizer)

n_epochs = 50
training_losses = []
validation_losses = []
print(model.state_dict())

for epoch in range(n_epochs):
    batch_losses = []
    for x_batch, y_batch in train_loader:
        x_batch = x_batch.to(device)
        y_batch = y_batch.to(device)
        loss = train_step(x_batch, y_batch)
        batch_losses.append(loss)
    training_loss = np.mean(batch_losses)
    training_losses.append(training_loss)

    with torch.no_grad():
        val_losses = []
        for x_val, y_val in val_loader:
            x_val = x_val.to(device)
            y_val = y_val.to(device)
            model.eval()
            yhat = model(x_val)
            val_loss = loss_fn(y_val, yhat).item()
            val_losses.append(val_loss)
        validation_loss = np.mean(val_losses)
        validation_losses.append(validation_loss)

    print(f"[{epoch+1}] Training loss: {training_loss:.3f}\t Validation loss: {validation_loss:.3f}")

print(model.state_dict())
Answer
The problem is with how you have called the random_split function. Note that it takes lengths as input, not the percentage or ratio of the split. The error is about the same issue: the sum of the lengths you have specified (80 + 20) is not the same as the length of the data (5 rows in the sample above).
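To illustrate (a toy sketch of my own with a length-5 dataset like your sample, not taken from your script):

import torch
from torch.utils.data import TensorDataset, random_split

toy = TensorDataset(torch.zeros(5, 2), torch.zeros(5))

# the lengths must sum to len(toy), i.e. 5:
train_part, val_part = random_split(toy, [4, 1])   # OK: 4 + 1 == 5
# random_split(toy, [80, 20]) raises an error, because 80 + 20 != 5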
The code snippet below should fix your problem. Also, you do not need to flatten the tensors… I think.
dataset = TensorDataset(x_tensor, y_tensor)

# compute actual lengths for an 80/20 split
val_size = int(len(dataset) * 0.2)
train_size = len(dataset) - val_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])
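One side note, separate from the split fix (my own suggestion, not something your script needs for this error): nn.Linear(2, 1) produces outputs of shape (batch, 1), while y_tensor as built is one-dimensional, so MSELoss ends up broadcasting the two shapes against each other (newer PyTorch versions print a warning about this). Giving the target an explicit second dimension keeps the shapes aligned:

y_tensor = torch.from_numpy(np.array(y)).float().unsqueeze(1)  # shape (N, 1) to match the model output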