Machine Learning Workflow with PyTorch

Jamie Dowat
5 min read · Mar 8, 2021


PyTorch has become one of the leading Machine Learning frameworks in the past couple of years, with many companies adopting it for their research. According to The Gradient, PyTorch is becoming a major competitor to the long-time favorite, TensorFlow, showing major growth in mentions at research conferences.

Created in 2016 by Adam Paszke, Sam Gross, Soumith Chintala, and Gregory Chanan, PyTorch is built on idiomatic Python, which makes it very easy to pick up.

The code throughout this post shows PyTorch’s clever integration with idiomatic Python.

In this post, we’ll be showcasing Logistic Regression in PyTorch using Chicago traffic crash data.

Our goal is to predict the level of DAMAGE to the car, which is divided into three categories (making this a classification problem):

  1. $500 or less in damage
  2. $501 - $1,500
  3. Over $1,500

1. Data Preparation

First, let’s get our data ready for modeling.

We are going to load the data in as a Pandas DataFrame, preprocess, then convert into tensor format to be used by PyTorch.

import pandas as pd
pd.set_option('display.max_columns', None)
data = pd.read_csv('traffic_crashes_chicago.csv')

Let’s check out our NaN values:

data.isna().sum()

CRASH_RECORD_ID 0
RD_NO 3917
CRASH_DATE_EST_I 446607
CRASH_DATE 0
POSTED_SPEED_LIMIT 0
TRAFFIC_CONTROL_DEVICE 0
DEVICE_CONDITION 0
WEATHER_CONDITION 0
LIGHTING_CONDITION 0
FIRST_CRASH_TYPE 0
TRAFFICWAY_TYPE 0
LANE_CNT 283902
ALIGNMENT 0
ROADWAY_SURFACE_COND 0
ROAD_DEFECT 0
REPORT_TYPE 11746
CRASH_TYPE 0
INTERSECTION_RELATED_I 373918
NOT_RIGHT_OF_WAY_I 460154
HIT_AND_RUN_I 340916
DAMAGE 0
DATE_POLICE_NOTIFIED 0
PRIM_CONTRIBUTORY_CAUSE 0
SEC_CONTRIBUTORY_CAUSE 0
STREET_NO 0
STREET_DIRECTION 3
STREET_NAME 1
BEAT_OF_OCCURRENCE 5
PHOTOS_TAKEN_I 476801
STATEMENTS_TAKEN_I 473134
DOORING_I 481317
WORK_ZONE_I 479748
WORK_ZONE_TYPE 480406
WORKERS_PRESENT_I 482118
NUM_UNITS 0
MOST_SEVERE_INJURY 978
INJURIES_TOTAL 967
INJURIES_FATAL 967
INJURIES_INCAPACITATING 967
INJURIES_NON_INCAPACITATING 967
INJURIES_REPORTED_NOT_EVIDENT 967
INJURIES_NO_INDICATION 967
INJURIES_UNKNOWN 967
CRASH_HOUR 0
CRASH_DAY_OF_WEEK 0
CRASH_MONTH 0
LATITUDE 2670
LONGITUDE 2670
LOCATION 2670
dtype: int64

Now we’ll drop all of the NaN-heavy columns, along with columns that don’t hold any predictive value:

data.drop(labels=['CRASH_RECORD_ID', 'RD_NO', 'CRASH_DATE_EST_I',
'LANE_CNT', 'NOT_RIGHT_OF_WAY_I', 'HIT_AND_RUN_I',
'DATE_POLICE_NOTIFIED', 'STREET_NO',
'STREET_DIRECTION', 'STREET_NAME',
'BEAT_OF_OCCURRENCE', 'PHOTOS_TAKEN_I',
'STATEMENTS_TAKEN_I', 'DOORING_I', 'WORK_ZONE_I',
'CRASH_DATE', 'REPORT_TYPE',
'INTERSECTION_RELATED_I', 'WORK_ZONE_TYPE',
'LOCATION', 'WORKERS_PRESENT_I'],
axis=1, inplace=True)

Since the remaining NaN values make up only a small portion of the dataset, we’ll drop all rows that contain them:

data.dropna(inplace=True)

Now, let’s split the data into train and test sets using sklearn. PyTorch also has an option for this:

from torch.utils.data import random_split

That function can perform the same operation (a quick sketch of it follows the split below), but for this example we’ll use sklearn:

X = data.drop(labels='DAMAGE', axis=1)
y = data['DAMAGE']
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y)
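For reference, here is roughly how the same split could be done with PyTorch’s random_split. This is only a sketch with placeholder tensors (the real features here are still a pandas DataFrame with categorical columns, so they would need to be encoded first); the rest of this post uses the sklearn split above.

# Illustrative only -- not used in the rest of the post.
import torch
from torch.utils.data import TensorDataset, random_split

X_tensor = torch.randn(1000, 10)          # placeholder features
y_tensor = torch.randint(0, 3, (1000,))   # placeholder labels for 3 classes

dataset = TensorDataset(X_tensor, y_tensor)
train_size = int(0.75 * len(dataset))
train_ds, test_ds = random_split(dataset, [train_size, len(dataset) - train_size])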

Let’s scale our numerical variables and grab dummies for our categorical variables:

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
numerical_train = X_train[['LATITUDE', 'LONGITUDE']]
num_X_train_scaled = pd.DataFrame(ss.fit_transform(numerical_train),
                                  columns=numerical_train.columns)

categorical_train = X_train.drop(labels=['LATITUDE', 'LONGITUDE'], axis=1)

# Using OneHotEncoder to get converted categorical variables
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse=False)  # handle_unknown='ignore' may be needed if the test set has unseen categories
X_tr_dummies = pd.DataFrame(ohe.fit_transform(categorical_train),
                            columns=ohe.get_feature_names())
X_tr_transformed = pd.concat([X_tr_dummies, num_X_train_scaled], axis=1)

And the same for our test data:

numerical_test = X_test[['LATITUDE', 'LONGITUDE']]
num_X_test_scaled = pd.DataFrame(ss.transform(numerical_test),
                                 columns=numerical_test.columns)
categorical_test = X_test.drop(labels=['LATITUDE', 'LONGITUDE'], axis=1)
X_te_dummies = pd.DataFrame(ohe.transform(categorical_test),
                            columns=ohe.get_feature_names())
# Concatenate the scaled numerical columns (not the raw ones) with the dummies
X_te_transformed = pd.concat([X_te_dummies, num_X_test_scaled], axis=1)

And, last but not least, we have to convert our Pandas DataFrames into PyTorch tensors:

import torch

# X_train values (the .values arrays are float64; we'll cast to float32 with .float() when calling the model)
X_tr_tensor = torch.tensor(X_tr_transformed.values)
# X_test values
X_te_tensor = torch.tensor(X_te_transformed.values)

2. Split up our data into “batches”

This is the process of PyTorch Logistic Regression, in a nutshell (a minimal code sketch of one pass follows the list):

  1. Procure a random sample from the data (a “batch”).
  2. Compute the loss (for this example, we will use the cross-entropy loss function).
  3. Compute the gradient by performing a “backward pass” (loss.backward()).
  4. Use an optimizer to take a step that reduces the loss (for this example, we will use SGD, Stochastic Gradient Descent).
  5. Rinse and repeat for any given number of epochs.
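Concretely, one pass through steps 1-4 looks roughly like this. It is a minimal sketch with placeholder shapes and names; the real versions of these objects are built in the sections below.

# Sketch of a single training step (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(10, 3)                            # placeholder: 10 features, 3 classes
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

xb = torch.randn(100, 10)                           # 1. a random "batch" of features
yb = torch.randint(0, 3, (100,))                    #    ...and its labels
loss = F.cross_entropy(model(xb), yb)               # 2. compute the loss
loss.backward()                                     # 3. backward pass -> gradients
optimizer.step()                                    # 4. one SGD step to reduce the loss
optimizer.zero_grad()                               #    reset gradients for the next batch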

Let’s begin!

X_tr_tensor.shape
>>> torch.Size([359417, 387])

X_te_tensor.shape
>>> torch.Size([210273, 387])

Let’s split up our data into batches of 100 rows each:

from torch.utils.data import DataLoader

batch_size = 100
train_loader = DataLoader(X_tr_tensor, batch_size, shuffle=True)
test_loader = DataLoader(X_te_tensor, batch_size)
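One caveat worth flagging: as written, these loaders only yield feature batches. For the training loop at the end of the post, each batch also needs its labels, which is usually handled by wrapping features and labels together in a TensorDataset. A quick sketch (the y tensors are built in section 4 below):

# Sketch: pair features with labels so each batch is an (xb, yb) tuple.
from torch.utils.data import TensorDataset, DataLoader

train_ds = TensorDataset(X_tr_tensor.float(), y_tr_tensor)   # y_tr_tensor is created below
train_loader = DataLoader(train_ds, batch_size, shuffle=True)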

3. Instantiate our model!

import torch.nn as nn

input_size = X_tr_tensor.shape[1]
num_classes = 3
# Logistic regression model
model = nn.Linear(input_size, num_classes)

Let’s take a look at our initial weights and biases:

model.weight.shape
>>> torch.Size([3, 387])

model.weight
>>> Parameter containing:
tensor([[ 0.0189, 0.0020, -0.0448, ..., 0.0175, 0.0464, 0.0140],
        [-0.0335, 0.0505, -0.0031, ..., -0.0387, -0.0364, -0.0502],
        [ 0.0013, 0.0317, 0.0103, ..., -0.0373, 0.0377, 0.0122]],
       requires_grad=True)

model.bias.shape
>>> torch.Size([3])  # 3 biases for 3 classes, a ternary classifier

model.bias
>>> Parameter containing:
tensor([-0.0396, -0.0158, -0.0029], requires_grad=True)
# Alternative:
list(model.parameters())
>>> [Parameter containing:
tensor([[ 0.0189, 0.0020, -0.0448, ..., 0.0175, 0.0464, 0.0140],
[-0.0335, 0.0505, -0.0031, ..., -0.0387, -0.0364, -0.0502],
[ 0.0013, 0.0317, 0.0103, ..., -0.0373, 0.0377, 0.0122]],
requires_grad=True),
Parameter containing:
tensor([-0.0396, -0.0158, -0.0029], requires_grad=True)]

4. Generate predictions:

outputs = model(X_tr_tensor.float())

In sklearn, we can generate probabilities using the predict_proba() method. In PyTorch, we’ll use the following:

import torch.nn.functional as F

probs = F.softmax(outputs, dim=1)
print("Here are some of our newly converted probabilities:\n", probs[:2].data)
>>> Here are some of our newly converted probabilities:
tensor([[0.2804, 0.3906, 0.3291],
        [0.3087, 0.3869, 0.3043]])

print("Sum:", torch.sum(probs[0]).item())
>>> Sum: 1.0

The softmax() function converts the raw logit values into positive probabilities that sum to 1:

(Diagram illustrating softmax; source: https://jovian.ai/aakashns/03-logistic-regression)
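For reference, softmax exponentiates each logit and divides by the row’s total, so every entry is positive and each row sums to 1:

softmax(z)_i = exp(z_i) / (exp(z_1) + exp(z_2) + ... + exp(z_k))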

Before we can calculate how well our model did, we have to convert our y_train and y_test data into Tensors:

import numpy as np

# Converting y values into their respective classes (0, 1, 2)
conditions_train = [(y_train == 'OVER $1,500'),
                    (y_train == '$501 - $1,500'),
                    (y_train == '$500 OR LESS')]
conditions_test = [(y_test == 'OVER $1,500'),
                   (y_test == '$501 - $1,500'),
                   (y_test == '$500 OR LESS')]
choices = [0, 1, 2]
y_train = np.select(conditions_train, choices)
y_test = np.select(conditions_test, choices)

# Typecasting into tensors:
y_tr_tensor = torch.tensor(y_train)
y_te_tensor = torch.tensor(y_test)
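With the labels in tensor form, we can also take a quick look at plain accuracy by picking the highest-probability class for each row. This is just a side check on the still-untrained model, not part of the training loop:

# Predicted class = index of the largest probability in each row
preds = torch.argmax(probs, dim=1)
accuracy = (preds == y_tr_tensor).float().mean().item()
print("Accuracy of the untrained model:", accuracy)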

5. Calculate Loss

loss_fn = F.cross_entropy
loss = loss_fn(outputs, y_tr_tensor)
print(loss)
>>> tensor(1.1407, grad_fn=<NllLossBackward>)

Since cross-entropy is the negative logarithm of the predicted probability of the correct label, averaged over all training samples, exponentiating the negative loss gives the typical (geometric-mean) probability the model assigns to the correct class, which is a rough sense of how well it is doing:

import math

math.exp(-loss)
>>> 0.3196105865388012

6. Create a function or sub-class to repeat the process for all batches and epochs:

To streamline this process for every batch and every epoch, we can use a function like the following (source: FreeCodeCamp):

def fit(epochs, lr, model, train_loader,
        val_loader, opt_func=torch.optim.SGD):
    optimizer = opt_func(model.parameters(), lr)
    history = []  # for recording epoch-wise results

    for epoch in range(epochs):

        # Training phase
        for batch in train_loader:
            loss = model.training_step(batch)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

        # Validation phase
        result = evaluate(model, val_loader)
        model.epoch_end(epoch, result)
        history.append(result)

    return history
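Note that fit() assumes the model exposes a few helper methods (training_step, epoch_end, and so on) and that an evaluate() function exists. Following the same FreeCodeCamp/Jovian pattern, those pieces might look roughly like the sketch below; the class name DamageModel and its exact methods are illustrative, not code from the original tutorial.

# Sketch of the helper class and evaluate() that fit() relies on.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DamageModel(nn.Module):                     # hypothetical wrapper around our linear layer
    def __init__(self, input_size, num_classes):
        super().__init__()
        self.linear = nn.Linear(input_size, num_classes)

    def forward(self, xb):
        return self.linear(xb)

    def training_step(self, batch):
        xb, yb = batch                            # each batch is a (features, labels) tuple
        return F.cross_entropy(self(xb), yb)

    def validation_step(self, batch):
        xb, yb = batch
        out = self(xb)
        loss = F.cross_entropy(out, yb)
        acc = (torch.argmax(out, dim=1) == yb).float().mean()
        return {'val_loss': loss.detach(), 'val_acc': acc}

    def validation_epoch_end(self, outputs):
        avg_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        avg_acc = torch.stack([x['val_acc'] for x in outputs]).mean()
        return {'val_loss': avg_loss.item(), 'val_acc': avg_acc.item()}

    def epoch_end(self, epoch, result):
        print(f"Epoch {epoch}: val_loss={result['val_loss']:.4f}, val_acc={result['val_acc']:.4f}")


def evaluate(model, val_loader):
    outputs = [model.validation_step(batch) for batch in val_loader]
    return model.validation_epoch_end(outputs)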

This training loop can also be monitored with tools such as TensorBoard, which PyTorch integrates through torch.utils.tensorboard; the official tutorial is available on the PyTorch website.

If you want to play with some other neural-network-ready datasets, you can load them in with code like the following:

# MNIST (Modified National Institute of Standards and Technology)
# handwritten digits 0-9
from torchvision.datasets import MNIST
dataset = MNIST(root='data/', download=True)

Other built-in datasets include CelebA (a large-scale face attributes dataset with more than 200K celebrity images), CIFAR (object recognition), and Cityscapes (pictures of urban street scenes from 50 different cities); the full list is in the torchvision.datasets documentation.
