Mining, Classification, Prediction

Quincy Smith (qrs227)

Introduction

The dataset that I chose contains the receiving statistics for the 2020 NFL season. Receivers are measusred on many different metrics, including targets, receptions, and touchdowns. The categorical variables in this dataset include team and player name. There is also a binary variable, AFC which is used to determine player conference. It is binary because it is a true/false observation. In total, the data has 21 variables with 499 rows.

This dataset was chosen because of my love for sports. In the last project, I analyzed different basketball metrics to make statements about different aspects of how the NBA has been changing from year to year as basketball is my favorite sport. I decided to take on a new challenge and analyze football and the NFL. Football has been my second favorite sport for a while now and I took this opportunity to learn more about it.

library(tidyverse)
# read your datasets in here, e.g., with read_csv()
TB12 <- read_csv("nfldata.csv")
# if your dataset needs tidying, do so here
AFC <- c("TEN", "IND", "HOU", "LVR", "KAN", "LAC", "BUF", "NWE", 
    "PIT", "CLE", "BAL", "MIA", "JAX", "CIN", "NYJ", "DEN")
TB12 <- TB12 %>% mutate(AFC = Team %in% AFC)
# any other code here

Here, we added a binary variable. This variable is called AFC and notes whether or not a player played the entire season for an AFC team.

Cluster Analysis

library(cluster)
cluster_data <- TB12 %>% select(Tgt, Rec, TD, Yds, Fmb, Lng)


sil_width <- vector()
for (i in 2:10) {
    kms <- kmeans(cluster_data, centers = i)
    sil <- silhouette(kms$cluster, dist(cluster_data))
    sil_width[i] <- mean(sil[, 3])
}
ggplot() + geom_line(aes(x = 1:10, y = sil_width)) + scale_x_continuous(name = "k", 
    breaks = 1:10)

pam1 <- cluster_data %>% pam(k = 2)

library(GGally)
TB12 %>% mutate(cluster = as.factor(pam1$clustering)) %>% ggpairs(columns = c("Tgt", 
    "Rec", "TD", "Yds", "Fmb", "Lng"), aes(color = cluster))

plot(pam1, which = 2)

kmeans2 <- cluster_data %>% kmeans(2)

TB12 <- TB12 %>% mutate(cluster = kmeans2$cluster)

Tp determine the number of clusters, I used the silhouette method where the width of a the silhouette would determine the number of clusters. I used a for loop and test a k value 2-10 and found that two clusters would fit the data the best. Then, using ggpairs, I plotted the via cluster and correlation of data.Cluster two appears to be the none starting receiver-type players as they have signficantly less targets than cluster one. Cluster one also has more yards and receptions which is to be expected with having more targets. It is worth noting that both clusters have sparatic TD numbers which shows how important play calling is when scoring rather than just pure skill.

Dimensionality Reduction with PCA

eig1 <- cluster_data %>% cor() %>% eigen()

X <- cluster_data %>% select(Tgt, Rec, TD, Yds, Fmb, Lng) %>% 
    scale

PCAscores <- X %*% eig1$vectors
TB12 %>% mutate(PC1 = PCAscores[, 1], PC2 = PCAscores[, 2]) %>% 
    ggplot(aes(PC1, PC2, color = AFC)) + geom_point()

TB12 %>% mutate(PC3 = PCAscores[, 3], PC4 = PCAscores[, 4]) %>% 
    ggplot(aes(PC3, PC4, color = AFC)) + geom_point()

Here a PCA test is performed in an attempt to reduce the dimensions of the dataset. In order to keep the data consistent with the first clustering, the same six variables are chosen to conduct the dimension reduction. The first plots PCA score 1 vs. PCA score 2 which are the scores referring to Targets and Receptions As the data moves to the right on the graph, there are more targets thrown while as an upward vertical shift in the data indicates more receptions caught. The data is also colored by conference. In the PC1 vs. PC2 graph, it is clear that as a receiver gets more targets, they also get more receptions. A majority of the data is concentrated around the 0 PC1 and PC2 mark as players generally received the same amount of targets and receptions comparatively. There were a few players who seemed to have caught less targets in relation to most receivers (those receivers probably do not start). The PC3 and PC4 graph reduces the dimensions of the data based on TDs and Yards. In this graph, as you move to the right, a receiver has more yards, and as you move up, a receiver has more touchdowns. This breaks a common way of thinking as yards do not seem to translate to TDs. Therefore, one can assume that play calling plays a large factor in scoring as receivers who are not gaining as many are yards are still scoring relatively similar amounts of touchdowns. The variability of this data represented by the PCA scores is about 0.68

Linear Classifier

log_fit <- glm(AFC == TRUE ~ TD + Yds + Tgt + Rec + Ctch + Fmb + 
    Lng + YPT + YPG + YPR, data = TB12)
prob_reg <- predict(log_fit, type = "response")

class_diag(prob_reg, TB12$AFC, positive = T)

##            acc  sens   spec    ppv     f1     ba    auc
## Metrics 0.5471 0.496 0.5976 0.5491 0.5212 0.5468 0.5503

table(truth = TB12$AFC, predictions = prob_reg > 0.5)

##        predictions
## truth   FALSE TRUE
##   FALSE   150  101
##   TRUE    125  123

set.seed(322)
k = 1

data <- sample_frac(TB12)  #randomly order rows
folds <- rep(1:k, length.out = nrow(data))  #create folds

diags <- NULL

i = 1
for (i in 1:k) {
    # create training and test sets
    train <- data[folds != i, ]
    test <- data[folds == i, ]
    truth <- test$AFC
    
    # train model
    fit <- glm(AFC == "True" ~ TD + Yds + Tgt + Rec + Ctch + 
        Fmb + Lng + YPT + YPG + YPR, data = TB12, family = "binomial")
    # test model
    probs <- predict(fit, type = "response")
    # get performance metrics for each fold
    
    diags <- rbind(diags, class_diag(probs, truth))
}

# average performance metrics across all folds
summarize_all(diags, mean)

##     acc sens spec ppv  f1  ba    auc
## 1 0.503    0    1 NaN NaN 0.5 0.4679

Using the glm() test as a linear classifier, the AUC on the entire data set was 0.5503, which is not a very strong value. This may be because the data is not necessarily linear. The glm() function predicts conference based on TDs, Yards, Targets, Receptions, Catch percentage, fumbles, longest catch, yards per target, yards per game, and yards per reception. The AUC also reflects the confusion matrix which shows a roughly fifty/fifty split among in what is true and what is false. When running the k folds CV model, the AUC was 0.4679, which may suggest some over-fitting by this linear classifier. This difference in AUC combined with the magnitude of the AUC may suggest that the model is not well suited for the data.

Non-Parametric Classifier

library(caret)
knn_fit <- knn3(AFC == TRUE ~ TD + Yds + Tgt + Rec + Ctch + Fmb + 
    Lng + YPT + YPG + YPR, data = TB12)

prob_knn <- predict(knn_fit, newdata = TB12)[, 2]

class_diag(prob_knn, TB12$AFC)

##            acc   sens   spec    ppv     f1     ba    auc
## Metrics 0.6653 0.6734 0.6574 0.6601 0.6667 0.6654 0.6968

table(truth = TB12$AFC, predictions = prob_knn > 0.5)

##        predictions
## truth   FALSE TRUE
##   FALSE   165   86
##   TRUE     81  167

set.seed(322)
k = 10

data <- sample_frac(TB12)  #randomly order rows
folds <- rep(1:k, length.out = nrow(data))  #create folds

diags <- NULL

i = 1
for (i in 1:k) {
    # create training and test sets
    train <- data[folds != i, ]
    test <- data[folds == i, ]
    truth <- test$AFC
    
    # train model
    fit <- knn3(AFC == TRUE ~ TD + Yds + Tgt + Rec + Ctch + Fmb + 
        Lng + YPT + YPG + YPR, data = TB12)
    
    # test model
    probs <- predict(fit, newdata = test)[, 2]
    # get performance metrics for each fold
    diags <- rbind(diags, class_diag(probs, truth))
}

# average performance metrics across all folds
summarize_all(diags, mean)

##       acc    sens    spec     ppv      f1      ba     auc
## 1 0.66518 0.67391 0.65599 0.65918 0.66299 0.66493 0.69557

The Non-Parametric Classifier used was the k-nearest-neighbors function (knn). When running the test on the dataset as a whole initially, the model returned and AUC of 0.6968 which is a moderate fit. Thus the model does an alright job at classifying whether or not a receiver type player is in the AFC based on ten different receiver stats. Then we apply the function to a model that is trained on random samples of the data to test whether or not the model is over-fitting. During this cross validation test, we took a k-10 fold sampling approach. After randomizing the data and taking random samples, preforming the knn on those samples, finding each sample’s AUC, and finally averaging all the AUC, the overall AUC was 0.6955. This is extremely close to the previous AUC, meaning our model does well when predicting small handfuls of players’ conferences. Even though both results were moderate in strength, we can make the statement that the model does well because it does not over-fit and is adaptable to different data inputs. On another note, the confusion matrix also shows that the model does a little bit better than 50% in predicting the correct conference which is consistent with our AUC measurement.

Regression/Numeric Prediction

predict_data <- TB12 %>% select(6:19)
fit <- lm(TD ~ ., data = predict_data)
yhat <- predict(fit)

mean((predict_data$TD - yhat)^2, na.rm = T)

## [1] 1.782094

set.seed(322)
k = 7

data <- predict_data[sample(nrow(predict_data)), ]
folds <- cut(seq(1:nrow(predict_data)), breaks = k, labels = F)

diags <- NULL
for (i in 1:k) {
    train <- data[folds != i, ]
    test <- data[folds == i, ]
    
    
    fit <- lm(TD ~ ., data = train)
    
    
    yhat <- predict(fit, newdata = test)
    
    
    diags <- mean((test$TD - yhat)^2)
}

mean(diags)

## [1] 2.043098

The linear regression was used to predict TDs versus all other numerical variables including targets, receptions, catch percentage, yards per game, yards per catch, etc. In the first regression prediction, we applied the regression to the entire dataset and we got a mean standard error (MSE) of 1.782, meaning on average the square of the residual of a data point was about 1.782 units. Then we tested the regression by training a model and seeing if it over-fit. We did this by taking 7 folds of the linear regression, taking random samples, then for each sample finding the MSE, then finding the mean of all the sample’s MSEs. The overall MSE was 2.043 which is not much higher than our 1.782 that we got for the entire dataset, thus the model is not really over-fitting since the increase in MSE was not very large.

Python

library(reticulate)
use_python("/usr/bin/python3")


import numpy as np
NFL = r.TB12
py_data = []

py_data.append((np.mean(NFL.Tgt)))
py_data.append((np.mean(NFL.Rec)))
py_data.append((np.mean(NFL.Ctch)))
py_data.append((np.mean(NFL.Yds)))
py_data.append((np.mean(NFL.YPR)))
py_data.append((np.mean(NFL.TD)))

for i in range(len(py_data)):
  py_data[i] = round(py_data[i], 2)

py$py_data

## [[1]]
## [1] 34.62
## 
## [[2]]
## [1] 23.56
## 
## [[3]]
## [1] 0.71
## 
## [[4]]
## [1] 261.49
## 
## [[5]]
## [1] 10.17
## 
## [[6]]
## [1] 1.75

In the R chunk of code, we access the Python3 library and commands. Using the r. prefix, we are able to bring all of our data into the Python environment. Once the data is there, we saved it as a Python variable to make is easier to work with. Then using the numpy library, we found the mean of some key receiving variables which are targets, receptions, catch percentage, yards, yards per reception, and touchdowns. This allowed us to obtain the average for the NFL based on stats from all the NFL players. We put all of the the means in Python list and sent it back to the R chunk using the py$ prefix and the assigned variable name.

Concluding Remarks

Include concluding remarks here, if any

Project 2: Data Mining, Classification, Prediction

SDS322E