Machine Learning Workflow with PyTorch

Jamie Dowat
5 min read · Mar 8, 2021


PyTorch has become one of the leading Machine Learning frameworks in the past couple of years, with many companies adopting it for their research. According to The Gradient, PyTorch is becoming a major competitor to the long-time favorite, TensorFlow, showing major growth in mentions at research conferences.

Created in 2016 by Adam Paszke, Sam Gross, Soumith Chintala, and Gregory Chanan, PyTorch is built on idiomatic Python, which makes it very easy to pick up.

The code throughout this post shows PyTorch’s clever integration with idiomatic Python.

In this post, we’ll be showcasing Logistic Regression in PyTorch using Chicago traffic crash data.

Our goal is to predict the level of DAMAGE to the car, which is divided into three categories (making this a classification problem):

  1. $500 or less in damage
  2. $501 - $1,500
  3. Over $1,500

1. Data Preparation

First, let’s get our data ready for modeling.

We are going to load the data in as a Pandas DataFrame, preprocess, then convert into tensor format to be used by PyTorch.

import pandas as pd
pd.set_option('display.max_columns', None)
data = pd.read_csv('traffic_crashes_chicago.csv')

Let’s check out our NaN values:

data.isna().sum()

CRASH_RECORD_ID 0
RD_NO 3917
CRASH_DATE_EST_I 446607
CRASH_DATE 0
POSTED_SPEED_LIMIT 0
TRAFFIC_CONTROL_DEVICE 0
DEVICE_CONDITION 0
WEATHER_CONDITION 0
LIGHTING_CONDITION 0
FIRST_CRASH_TYPE 0
TRAFFICWAY_TYPE 0
LANE_CNT 283902
ALIGNMENT 0
ROADWAY_SURFACE_COND 0
ROAD_DEFECT 0
REPORT_TYPE 11746
CRASH_TYPE 0
INTERSECTION_RELATED_I 373918
NOT_RIGHT_OF_WAY_I 460154
HIT_AND_RUN_I 340916
DAMAGE 0
DATE_POLICE_NOTIFIED 0
PRIM_CONTRIBUTORY_CAUSE 0
SEC_CONTRIBUTORY_CAUSE 0
STREET_NO 0
STREET_DIRECTION 3
STREET_NAME 1
BEAT_OF_OCCURRENCE 5
PHOTOS_TAKEN_I 476801
STATEMENTS_TAKEN_I 473134
DOORING_I 481317
WORK_ZONE_I 479748
WORK_ZONE_TYPE 480406
WORKERS_PRESENT_I 482118
NUM_UNITS 0
MOST_SEVERE_INJURY 978
INJURIES_TOTAL 967
INJURIES_FATAL 967
INJURIES_INCAPACITATING 967
INJURIES_NON_INCAPACITATING 967
INJURIES_REPORTED_NOT_EVIDENT 967
INJURIES_NO_INDICATION 967
INJURIES_UNKNOWN 967
CRASH_HOUR 0
CRASH_DAY_OF_WEEK 0
CRASH_MONTH 0
LATITUDE 2670
LONGITUDE 2670
LOCATION 2670
dtype: int64

Now we’ll drop all of the NaN-heavy columns, along with columns that don’t hold any predictive value:

data.drop(labels=['CRASH_RECORD_ID', 'RD_NO', 'CRASH_DATE_EST_I',
'LANE_CNT', 'NOT_RIGHT_OF_WAY_I', 'HIT_AND_RUN_I',
'DATE_POLICE_NOTIFIED', 'STREET_NO',
'STREET_DIRECTION', 'STREET_NAME',
'BEAT_OF_OCCURRENCE', 'PHOTOS_TAKEN_I',
'STATEMENTS_TAKEN_I', 'DOORING_I', 'WORK_ZONE_I',
'CRASH_DATE', 'REPORT_TYPE',
'INTERSECTION_RELATED_I', 'WORK_ZONE_TYPE',
'LOCATION', 'WORKERS_PRESENT_I'],
axis=1, inplace=True)

Since the remaining NaN values make up only a small portion of the dataset, we’ll drop all rows that contain them:

data.dropna(inplace=True)

Now, let’s split the data into train and test sets using sklearn. PyTorch also has an option for this:

from torch.utils.data import random_split

That function can perform the same operation (a quick sketch of it follows the split below), but for this example we’ll use sklearn:

X = data.drop(labels='DAMAGE', axis=1)
y = data['DAMAGE']
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
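For reference, here is roughly how the same split could be done with PyTorch’s random_split. This is only a sketch with placeholder tensors (the real features here are still a pandas DataFrame with categorical columns, so they would need to be encoded first); the rest of this post uses the sklearn split above.

# Illustrative only -- not used in the rest of the post.
import torch
from torch.utils.data import TensorDataset, random_split

X_tensor = torch.randn(1000, 10)          # placeholder features
y_tensor = torch.randint(0, 3, (1000,))   # placeholder labels for 3 classes

dataset = TensorDataset(X_tensor, y_tensor)
train_size = int(0.75 * len(dataset))
train_ds, test_ds = random_split(dataset, [train_size, len(dataset) - train_size])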

Let’s scale our numerical variables and grab dummies for our categorical variables:

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
numerical_train = X_train[['LATITUDE', 'LONGITUDE']]
num_X_train_scaled = pd.DataFrame(ss.fit_transform(numerical_train),
                                  columns=numerical_train.columns)

categorical_train = X_train.drop(labels=['LATITUDE', 'LONGITUDE'], axis=1)

# Using OneHotEncoder to get converted categorical variables
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse=False)  # handle_unknown='ignore' may be needed if the test set has unseen categories
X_tr_dummies = pd.DataFrame(ohe.fit_transform(categorical_train),
                            columns=ohe.get_feature_names())
X_tr_transformed = pd.concat([X_tr_dummies, num_X_train_scaled], axis=1)

And the same for our test data:

numerical_test = X_test[['LATITUDE', 'LONGITUDE']]
num_X_test_scaled = pd.DataFrame(ss.transform(numerical_test),
                                 columns=numerical_test.columns)
categorical_test = X_test.drop(labels=['LATITUDE', 'LONGITUDE'], axis=1)
X_te_dummies = pd.DataFrame(ohe.transform(categorical_test),
                            columns=ohe.get_feature_names())
# Concatenate the scaled numerical columns (not the raw ones) with the dummies
X_te_transformed = pd.concat([X_te_dummies, num_X_test_scaled], axis=1)

And, last but not least, we have to convert our Pandas DataFrames into PyTorch tensors:

import torch

# X_train values (the .values arrays are float64; we'll cast to float32 with .float() when calling the model)
X_tr_tensor = torch.tensor(X_tr_transformed.values)
# X_test values
X_te_tensor = torch.tensor(X_te_transformed.values)

2. Split up our data into “batches”

This is the process of PyTorch Logistic Regression, in a nutshell (a minimal code sketch of one pass follows the list):

  1. Procure a random sample from the data (a “batch”).
  2. Compute the loss (for this example, we will use the cross-entropy loss function).
  3. Compute the gradient by performing a “backward pass” (loss.backward()).
  4. Use an optimizer to take a step that reduces the loss (for this example, we will use SGD, Stochastic Gradient Descent).
  5. Rinse and repeat for any given number of epochs.
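Concretely, one pass through steps 1-4 looks roughly like this. It is a minimal sketch with placeholder shapes and names; the real versions of these objects are built in the sections below.

# Sketch of a single training step (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(10, 3)                            # placeholder: 10 features, 3 classes
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

xb = torch.randn(100, 10)                           # 1. a random "batch" of features
yb = torch.randint(0, 3, (100,))                    #    ...and its labels
loss = F.cross_entropy(model(xb), yb)               # 2. compute the loss
loss.backward()                                     # 3. backward pass -> gradients
optimizer.step()                                    # 4. one SGD step to reduce the loss
optimizer.zero_grad()                               #    reset gradients for the next batch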

Let’s begin!

X_tr_tensor.shape
>>> torch.Size([359417, 387])

X_te_tensor.shape
>>> torch.Size([210273, 387])

Let’s split up our data into batches of 100 rows each:

from torch.utils.data import DataLoader

batch_size = 100
train_loader = DataLoader(X_tr_tensor, batch_size, shuffle=True)
test_loader = DataLoader(X_te_tensor, batch_size)
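One caveat worth flagging: as written, these loaders only yield feature batches. For the training loop at the end of the post, each batch also needs its labels, which is usually handled by wrapping features and labels together in a TensorDataset. A quick sketch (the y tensors are built in section 4 below):

# Sketch: pair features with labels so each batch is an (xb, yb) tuple.
from torch.utils.data import TensorDataset, DataLoader

train_ds = TensorDataset(X_tr_tensor.float(), y_tr_tensor)   # y_tr_tensor is created below
train_loader = DataLoader(train_ds, batch_size, shuffle=True)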

3. Instantiate our model!

import torch.nn as nn

input_size = X_tr_tensor.shape[1]
num_classes = 3
# Logistic regression model
model = nn.Linear(input_size, num_classes)

Let’s take a look at our initial weights and biases:

model.weight.shape
>>> torch.Size([3, 387])

model.weight
>>> Parameter containing:
tensor([[ 0.0189, 0.0020, -0.0448, ..., 0.0175, 0.0464, 0.0140],
        [-0.0335, 0.0505, -0.0031, ..., -0.0387, -0.0364, -0.0502],
        [ 0.0013, 0.0317, 0.0103, ..., -0.0373, 0.0377, 0.0122]],
       requires_grad=True)

model.bias.shape
>>> torch.Size([3])  # 3 biases for 3 classes, a ternary classifier

model.bias
>>> Parameter containing:
tensor([-0.0396, -0.0158, -0.0029], requires_grad=True)
# Alternative:
list(model.parameters())
>>> [Parameter containing:
tensor([[ 0.0189, 0.0020, -0.0448, ..., 0.0175, 0.0464, 0.0140],
[-0.0335, 0.0505, -0.0031, ..., -0.0387, -0.0364, -0.0502],
[ 0.0013, 0.0317, 0.0103, ..., -0.0373, 0.0377, 0.0122]],
requires_grad=True),
Parameter containing:
tensor([-0.0396, -0.0158, -0.0029], requires_grad=True)]

4. Generate predictions:

outputs = model(X_tr_tensor.float())

In sklearn, we can generate probabilities using the predict_proba() method. In PyTorch, we’ll use the following:

import torch.nn.functional as F

probs = F.softmax(outputs, dim=1)
print("Here are some of our newly converted probabilities:\n", probs[:2].data)
>>> Here are some of our newly converted probabilities:
tensor([[0.2804, 0.3906, 0.3291],
        [0.3087, 0.3869, 0.3043]])

print("Sum:", torch.sum(probs[0]).item())
>>> Sum: 1.0

The softmax() function converts the raw logit values into positive probabilities that sum to 1:

(Diagram illustrating softmax; source: https://jovian.ai/aakashns/03-logistic-regression)
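For reference, softmax exponentiates each logit and divides by the row’s total, so every entry is positive and each row sums to 1:

softmax(z)_i = exp(z_i) / (exp(z_1) + exp(z_2) + ... + exp(z_k))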

Before we can calculate how well our model did, we have to convert our y_train and y_test data into Tensors:

import numpy as np

# Converting y values into their respective classes (0, 1, 2)
conditions_train = [(y_train == 'OVER $1,500'),
                    (y_train == '$501 - $1,500'),
                    (y_train == '$500 OR LESS')]
conditions_test = [(y_test == 'OVER $1,500'),
                   (y_test == '$501 - $1,500'),
                   (y_test == '$500 OR LESS')]
choices = [0, 1, 2]
y_train = np.select(conditions_train, choices)
y_test = np.select(conditions_test, choices)

# Typecasting into tensors:
y_tr_tensor = torch.tensor(y_train)
y_te_tensor = torch.tensor(y_test)
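With the labels in tensor form, we can also take a quick look at plain accuracy by picking the highest-probability class for each row. This is just a side check on the still-untrained model, not part of the training loop:

# Predicted class = index of the largest probability in each row
preds = torch.argmax(probs, dim=1)
accuracy = (preds == y_tr_tensor).float().mean().item()
print("Accuracy of the untrained model:", accuracy)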

5. Calculate Loss

loss_fn = F.cross_entropy
loss = loss_fn(outputs, y_tr_tensor)
print(loss)
>>> tensor(1.1407, grad_fn=<NllLossBackward>)

Since cross-entropy is the negative logarithm of the predicted probability of the correct label, averaged over all training samples, exponentiating the negative loss gives the typical (geometric-mean) probability the model assigns to the correct class, which is a rough sense of how well it is doing:

import math

math.exp(-loss)
>>> 0.3196105865388012

6. Create a function or sub-class to repeat the process for all batches and epochs:

To streamline this process for every batch and every epoch, we can use a function like the following (source: FreeCodeCamp):

def fit(epochs, lr, model, train_loader,
        val_loader, opt_func=torch.optim.SGD):
    optimizer = opt_func(model.parameters(), lr)
    history = []  # for recording epoch-wise results

    for epoch in range(epochs):

        # Training phase
        for batch in train_loader:
            loss = model.training_step(batch)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

        # Validation phase
        result = evaluate(model, val_loader)
        model.epoch_end(epoch, result)
        history.append(result)

    return history
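Note that fit() assumes the model exposes a few helper methods (training_step, epoch_end, and so on) and that an evaluate() function exists. Following the same FreeCodeCamp/Jovian pattern, those pieces might look roughly like the sketch below; the class name DamageModel and its exact methods are illustrative, not code from the original tutorial.

# Sketch of the helper class and evaluate() that fit() relies on.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DamageModel(nn.Module):                     # hypothetical wrapper around our linear layer
    def __init__(self, input_size, num_classes):
        super().__init__()
        self.linear = nn.Linear(input_size, num_classes)

    def forward(self, xb):
        return self.linear(xb)

    def training_step(self, batch):
        xb, yb = batch                            # each batch is a (features, labels) tuple
        return F.cross_entropy(self(xb), yb)

    def validation_step(self, batch):
        xb, yb = batch
        out = self(xb)
        loss = F.cross_entropy(out, yb)
        acc = (torch.argmax(out, dim=1) == yb).float().mean()
        return {'val_loss': loss.detach(), 'val_acc': acc}

    def validation_epoch_end(self, outputs):
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        avg_acc = torch.stack([x['val_acc'] for x in outputs]).mean()
        return {'val_loss': avg_loss.item(), 'val_acc': avg_acc.item()}

    def epoch_end(self, epoch, result):
        print(f"Epoch {epoch}: val_loss={result['val_loss']:.4f}, val_acc={result['val_acc']:.4f}")


def evaluate(model, val_loader):
    outputs = [model.validation_step(batch) for batch in val_loader]
    return model.validation_epoch_end(outputs)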

This training loop can also be monitored with tools such as TensorBoard, which PyTorch integrates through torch.utils.tensorboard; the official tutorial is available on the PyTorch website.

If you want to play with some other neural-network-ready datasets, you can load them in with code like the following:

# MNIST (Modified National Institute of Standards and Technology)
# handwritten digits 0-9
from torchvision.datasets import MNIST
dataset = MNIST(root='data/', download=True)

Other built-in datasets include CelebA (a large-scale face attributes dataset with more than 200K celebrity images), CIFAR (object recognition), and Cityscapes (pictures of urban street scenes from 50 different cities); the full list is in the torchvision.datasets documentation.
