Abstract

Over the last few years, as a result of the events we have endured, the machine learning models built for admissions and program evaluation lost their predictive power, a condition known as model drift. In this talk, I discuss the relevance of model drift in academe, especially the combined impact of COVID restrictions, social upheaval, and the demographic shift on the two leading causes of model drift: concept drift and covariate drift. Succinctly, the events drove changes in the stationarity of both the target variable and the predictors.

Formally, I frame the resulting situation as a One Class problem and rely on two well-known one-class algorithms: Support Vector Machines and Random Forests. I explain their use in reconstructing a representative sample of our student body. Last, I use propensity matching to train a new admissions model on the reconstructed data set. Matching is required to adjust for well-known selection biases, biases that may themselves have been affected by recent events.

AE Rodriguez
Department of Economics & Business Analytics
Pompea College of Business
University of New Haven
March 2022


The long-awaited demographic shift in college-age applicants brought about a historical low in the availability of college-age individuals. The COVID firehose was a total reset: COVID-era proscriptions closed the testing centers routinely used in graduate admissions, and beyond the restrictions and proscriptions, the COVID turmoil appears to have dramatically altered individual preferences and expectations associated with attending college. Horizons appear to have shortened and social discount rates increased, shifting the appeal of the present vis-a-vis the future. The very expensive and lengthy higher-education path of yesteryear appears to have emerged the worse for wear. Last, the George Floyd uprising and the related social upheaval led to a realization of possible inequities and unfairness built into the traditional metrics used for higher-education admissions, financial aid, and even social standing.

In short, schools faced a dwindling student funnel. Graduate admissions teams improvised. Admissions criteria were loosened, supplemented, or replaced entirely by considerations finely tuned to public sentiment over social and racial disparities. Entrance exams such as the GRE and the GMAT were dropped entirely or rendered optional. GPA, honors courses, and other academic signals were supplemented with a sensitivity towards less traditional schooling experiences. These steps may have been a genuine response to inequities, to difficulties accessing testing centers, or to other COVID-related hardships; they may have been pretextual; or, perhaps more realistically, a little bit of both. What may not be well known is that for a while there many schools were awfully close to not making it.

These changes in admissions protocols may have increased raw student numbers. But they eviscerated the historical admissions model. The extant models were calibrated and trained on a well-known, stable target variable and relied on a relatively stationary distribution of selection variables, the covariates. Neither condition holds any longer. That is what the changes have wrought for this little corner of the world. In this talk I discuss personal experiences with the programs at my institution. It is my belief, however, that programs at other institutions shared our experience.


The Model Drift Problem

In the business of analytics there is an ever-present bugbear known as, inter alia, model drift or model shift, data shift, concept shift, or covariate shift. I will refer to these, generically, as "data-shifting." The terms are interrelated and often used synonymously. I explain each in more detail below.

Conceptually, the issue is straightforward. If you formulated a predictive model on specific variables and a specific class, reflecting an implicit (or explicit) decision-making criterion, then the predictive capability of the model will decline to the extent that the underlying variables change or the target variable changes.

What is not as straightforward to parse is the apparent change in a behavioral peculiarity that afflicts selection programs. As a general matter, in a higher-education admissions ecosystem there are at least two potential sources of selection bias that may obscure appraisals of prospective students. The issue here is that one can plausibly infer that, as a result of the events described above, the incentive structure underlying these behavioral biases may have changed as well. Still, the resulting effect of the selection process and the underlying success-driven ecosystem is yet another instance of data-shifting.

The process of admissions is likely to be non-random - and susceptible to selection bias. First, there is a positive correlation between attending a specific university program and subsequent student performance. Students admitted into a program are likely to benefit from the program ecosystem. A graduate program has every interest in the success of its students and is likely to bend over backwards to ensure they make progress towards their degrees.

As a higher number of lower-quality students attend, it becomes costlier for the institution to finance the success ecosystem: the institution has to devote more hard and soft resources to ensure the progress and success of ostensibly lower-quality students. Hard resources would be additional tutoring or remedial services; soft expenditures would be the additional time and effort devoted to helping less talented students stay abreast.

Second, in yesteryear, students would self-select based on an ex ante awareness of whether they met admissions thresholds such as GPA and GMAT/GRE scores. In other words, students with good academic records were more likely to apply, most likely gauging their chances of acceptance by the various quality signals available: GRE/GMAT thresholds, admissions ratios, placement rates. That is to say, the better the students, the more likely they were to let themselves be considered for selection. Put differently, students with poor academic records were unlikely to consider applying to selective programs.

This self-selection process has probably been inverted. The rescinding or relaxation of admissions criteria, the quality signals, has emboldened less qualified students to apply. Perhaps simultaneously, the elimination of quality signals may have reduced the incentive of higher-quality students to apply, or led them to apply instead to the more selective schools that still manage to retain and enforce admissions conditions.

The net effect of these self-selection processes, in combination with the operational changes discussed above, is that all or nearly all of the students currently in the program are "successful" students. Many students who would otherwise have been rejected have been artificially shifted into the successful category. Thus, without a clear understanding of the student quality spectrum we are confronted with a host of problems. For purposes of this talk, one of these problems is that we are unable to train another admissions/performance model with the existing data. Most classification algorithms require instances of both classes of a binary variable to work. Since all of our students are "successes", that is, they belong to only one class, we have no way of discriminating. In this talk I discuss one approach to overcoming this problem, known as the One Class problem.

We will compare the two most popular algorithms used to address this issue, and we will appraise the need for, and relevance of, matching to overcome the selection bias described above. This last step is not entirely clear to me and is offered to engage discussion.

Model Drift or Model Deterioration

Over time, an admissions model starts to lose its predictive power, a phenomenon known as model drift. Drift creeps into models insidiously and can have detrimental effects on our student admissions pipeline and operations.

Concept drift means that the properties of the target variable, here student performance, change over time. As a general rule this causes problems because the predictions become less accurate and less reliable. At its most extreme, the target variable disappears. Graduate programs rescinded the GMAT/GRE and discounted undergraduate GPA. Other factors, such as diversity and equity, took on added weight.

Covariate shift is a change in the distribution of one or more of the independent, or input, variables of the dataset. Even though the relationship between a feature and the target variable remains unchanged, the distribution of the feature itself has changed. When the statistical properties of the input data change, a model built on the earlier data will no longer provide unbiased results, which leads to inaccurate predictions. Importantly, in our modeling the entrance-exam variable disappeared entirely, to be replaced with experience, the quality of the professional job, and age as a proxy for experience and maturity and thus, by extension, for success in our program.
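As a concrete, if simplified, illustration of checking for covariate shift on a single predictor, one could compare its pre- and post-event distributions directly, for instance with a two-sample Kolmogorov-Smirnov test. The sketch below is only illustrative; students_pre and students_post are hypothetical data frames standing in for the reference and current cohorts. The classifier-based check described in the next section is more general.

      # Illustration only: students_pre and students_post are hypothetical
      # data frames, each with a gpa column, for the two cohorts.
      ks.test(students_pre$gpa, students_post$gpa)
      # A small p-value would suggest that the distribution of gpa has shifted.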

How to Determine Whether There Has Been Model Drift?

To ascertain whether the student distribution has shifted, one should build a classifier to determine whether it can distinguish between the reference (pre-event) distribution and the compared (current or post-event) distribution. The process entails the following steps.

First, create a dummy variable and set it to 0 for the original data; symmetrically, set the dummy to 1 for the new batch of students.

Second, deploy a model to discriminate between the two groupings. For example, one could fit a simple naive Bayes model to discriminate between the two groups. If the naive Bayes model can easily discriminate between the two sets of data, then a covariate shift has occurred.

On the other hand, if the model cannot distinguish the two sets, then it is fair to conclude that a data shift has not occurred. An accuracy of approximately 50 percent suggests an outcome no better than a coin flip.
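A minimal sketch of this classifier-based check is shown below, using the naive Bayes implementation in e1071. As before, students_pre and students_post are hypothetical data frames holding the same covariates (gpa, age, experience) for the reference and current cohorts.

      # Sketch: label the two cohorts, fit a classifier, and inspect its accuracy.
      drift_check <- rbind(cbind(students_pre,  batch = 0),   # reference (pre-event)
                           cbind(students_post, batch = 1))   # current (post-event)
      drift_check$batch <- as.factor(drift_check$batch)

      nb_drift <- e1071::naiveBayes(batch ~ gpa + age + experience, data = drift_check)
      nb_pred  <- predict(nb_drift, drift_check)

      # Accuracy near 0.5 suggests no detectable shift; well above 0.5 suggests drift.
      mean(nb_pred == drift_check$batch)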

In our situation this step was unnecessary: first, I am certain of the model shift; second, I am working with toy data.

The Present Student Body: a Simulation

With the elimination of admissions exams such as the GRE and the GMAT, undergraduate GPA and professional work experience became the go-to measures. Age entered as a possible discriminating factor primarily because the average age of recent entrants appears to have drifted downwards. In reality, age is highly correlated with experience and, to a lesser extent, with GPA.

We use a data-generating process not atypical of those found in previous simulation studies (Austin 2009). For each student, three covariates, gpa, age, and experience, were generated from a multivariate normal distribution. The variables were then standardized and mapped into distributions better resembling our student body. Thus, experience was transformed into a binary variable and age into a truncated variable bounded by 18 and 41 at its lower and upper limits. Similarly, gpa ranges from 2.1 to 4.0, reflecting US grade-tallying practices.

  #Install "pacman" to handle installation and package management.
  options(digits = 3, scipen = 9999, knitr.table.format = "rst", length = 120)
  pacman::p_load(tidyverse, DescTools, sn, randomForest, caret, mice, naniar, e1071)
  pacman::p_load(solitude)
  pacman::p_load(MatchIt)
  pacman::p_load(cluster)
  pacman::p_load(GGally)
  pacman::p_load(forcats)
  pacman::p_load(lmtest)
  pacman::p_load(MASS)
              
              set.seed(12345)
              n = 100
                  Sigma<-rbind(c(1,0.8,0.7), c(0.8, 1, 0.9), c(0.7,0.9,1))
                  #Sigma
              # create the mean vector
              mu<-c(10, 5, 2)
              X = mvrnorm(n, mu, Sigma)
              experience = X[,1]
                experience = round(scales::rescale(scale(experience), to = c(0,1)),0)
                #table(experience)
              age = X[,2]
                age = scales::rescale(scale(age), to = c(18,41))
              gpa = X[,3]
                gpa = scales::rescale(scale(gpa), to = c(2.1,4.0))
                
              head(data.frame(experience, age, gpa),3)

The figure below displays the correlations among the covariates.

ggpairs(cbind.data.frame(age, experience, gpa))

For each student, the linear predictor xb (the log-odds) and the corresponding outcome-model probability p are generated below and displayed in the resulting graph.

      # Linear predictor (log-odds) and the implied probability of success p.
      xb <- -7 + 1.5*experience + 0.2*age + 0.05*gpa
      p  <- 1/(1 + exp(-xb))
      plot(xb, p)

      #hist(p)

The call to rbinom below generates the observed outcome y: n = 100 random draws from a Bernoulli distribution (size = 1), each with its student-specific probability of success p.

y <- rbinom(n = 100, size = 1, prob = p)
    table(y)      
## y
##  0  1 
## 47 53

The Data

      cbind.data.frame(y, age, experience, gpa) %>%
        group_by(y) %>%
        dplyr::summarise(n_students = n(),
                         mean_gpa   = mean(gpa),
                         mean_age   = mean(age))

One Class Support Vector Machines

The support vector machine is a supervised learning algorithm derived from statistical learning theory.

Supervised learning is not possible when the data contain only one class, as is the case here. It is instead necessary to find a natural clustering of the data into groups, and new data are then assigned to the newly formed groups. The support-vector clustering algorithm applies the statistics of support vectors to categorize unlabeled data (Siegelmann and Vapnik).

      # One-class SVM: fit on the covariates only (y = NULL), linear kernel.
      x <- cbind.data.frame(age, experience, gpa)
      model_svm3 <- svm(x, y = NULL, type = "one-classification", kernel = "linear")

      # In-sample predictions (TRUE = deemed a member of the class) against y.
      svm_predict3 <- predict(model_svm3)
      table(svm_predict3, y)
##             y
## svm_predict3  0  1
##        FALSE 31 26
##        TRUE  16 27

The accuracy score suggests very poor performance.

mean(svm_predict3 == y)
## [1] 0.58

The SVM algorithm also returns decision values. This raises the question of whether these could be used for propensity matching.

      # Extract the decision values and threshold them at zero.
      svm_predict3_prob <- predict(model_svm3, x, decision.values = TRUE, probability = TRUE)
      probs <- attr(svm_predict3_prob, "decision.values")
      res   <- ifelse(probs > 0, "1", "0")
      table(res, y)
##    y
## res  0  1
##   0 31 26
##   1 16 27
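One way the question could be explored, offered purely as a sketch for discussion rather than as the procedure used later in this talk: rescale the decision values onto the unit interval and pass them to MatchIt as a pre-computed distance measure (recent versions of matchit accept a numeric vector for the distance argument). Whether such decision values behave like genuine propensity scores is precisely the open question.

      # Sketch: treat rescaled one-class decision values as a propensity-like score.
      dv_score <- scales::rescale(as.numeric(probs), to = c(0, 1))
      dat_dv   <- cbind.data.frame(grp = as.numeric(res), x)   # grp: 0/1 from the SVM
      m_dv <- MatchIt::matchit(grp ~ age + gpa + experience, data = dat_dv,
                               method = "nearest", distance = dv_score)
      summary(m_dv)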

One Class Random Forests

Another alternative is to use the popular random forests algorithm in unsupervised mode. In effect, the outcome variable y is set to NULL.

The algorithm generates a proximity matrix, a two-dimensional array containing the pairwise proximities between observations: the fraction of trees in which a pair of observations lands in the same terminal node. As a general proposition, a proximity matrix measures the similarity (or, once inverted, the dissimilarity) between pairs of observations.

      # Unsupervised random forest (no outcome supplied) with proximities,
      # followed by PAM clustering of the proximity matrix into two groups.
      rf2  <- randomForest(x, mtry = 2, ntree = 5000, proximity = TRUE)
      prox <- rf2$proximity

      pam.rf <- pam(prox, 2)
      # Relabel the two clusters so they line up with the 0/1 coding of y.
      pam.rf$clustering <- factor(pam.rf$clustering, levels = c("1", "2"))
      pam.rf$clustering <- fct_recode(pam.rf$clustering, "0" = "2", "1" = "1")
      mean(pam.rf$clustering == y)
## [1] 0.62
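A note on the design choice above: pam is applied directly to the proximity matrix, treated as a feature matrix. An alternative, shown only as a sketch and not used in the remainder of the talk, is to convert the proximities into dissimilarities and cluster on those:

      # Sketch: cluster on dissimilarities derived from the forest proximities.
      d_prox  <- as.dist(1 - prox)          # high proximity implies small distance
      pam_alt <- pam(d_prox, k = 2, diss = TRUE)
      table(pam_alt$clustering)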

I use the confusion matrix function from the package caret; it offers an array of performance metrics.

caret::confusionMatrix(pam.rf$clustering,as.factor(y))
## Warning in confusionMatrix.default(pam.rf$clustering, as.factor(y)): Levels are
## not in the same order for reference and data. Refactoring data to match.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 13  4
##          1 34 49
##                                         
##                Accuracy : 0.62          
##                  95% CI : (0.517, 0.715)
##     No Information Rate : 0.53          
##     P-Value [Acc > NIR] : 0.0437        
##                                         
##                   Kappa : 0.209         
##                                         
##  Mcnemar's Test P-Value : 0.00000255    
##                                         
##             Sensitivity : 0.277         
##             Specificity : 0.925         
##          Pos Pred Value : 0.765         
##          Neg Pred Value : 0.590         
##              Prevalence : 0.470         
##          Detection Rate : 0.130         
##    Detection Prevalence : 0.170         
##       Balanced Accuracy : 0.601         
##                                         
##        'Positive' Class : 0             
## 
      # Group means by the recovered clustering.
      cbind.data.frame(pam.rf$clustering, y, age, experience, gpa) %>%
        group_by(pam.rf$clustering) %>%
        dplyr::summarise(n_students = n(),
                         mean_gpa   = mean(gpa),
                         mean_age   = mean(age))

      # Boxplots of GPA: by the simulated outcome y (left) and by the recovered
      # clustering (right).
      g1 <- cbind.data.frame(pam.rf$clustering, y, age, experience, gpa) %>%
        ggplot(aes(x = as.factor(y), y = gpa, col = as.factor(y))) +
        geom_boxplot() +
        labs(title = "Student Performance",
             x = "Success = 1",
             y = "Undergraduate GPA") +
        theme(legend.position = "none")

      g2 <- cbind.data.frame(pam.rf$clustering, y, age, experience, gpa) %>%
        ggplot(aes(x = fct_rev(pam.rf$clustering), y = gpa,
                   col = as.factor(pam.rf$clustering))) +
        geom_boxplot() +
        labs(title = "Student Performance",
             x = "Success = 1",
             y = "Undergraduate GPA") +
        theme(legend.position = "none")

      cowplot::plot_grid(g1, g2)

      perf_data = data.frame(y, age, experience, gpa)
      adjusted_data = cbind.data.frame(pam.rf$clustering, perf_data[,2:4])
      head(adjusted_data)
      t.test(gpa~pam.rf$clustering, data = adjusted_data)
## 
##  Welch Two Sample t-test
## 
## data:  gpa by pam.rf$clustering
## t = 12, df = 43, p-value = 0.000000000000009
## alternative hypothesis: true difference in means between group 1 and group 0 is not equal to 0
## 95 percent confidence interval:
##  0.552 0.785
## sample estimates:
## mean in group 1 mean in group 0 
##            3.12            2.45

Propensity Score Matching

To address the selection bias that may be distorting the admissions process, I adjust the data set via propensity score matching. The matching covariates should be related to both the group assignment and the outcome. Accordingly, I use the same covariates as matching variables as in the One Class exercise above: age, experience, and gpa. The grouping (treatment) variable is the classification obtained from the One Class process. We use the nearest-neighbor method as implemented in the R package MatchIt.

The summary of the matching results are shown below.

The results are telling. In the original data set, the difference in GPA between the "Successful" and "Not Successful" students was approximately 22 percent, as well as an 18 percent difference for experienced individuals. This can be seen in the Summary of Balance for All Data table below.

The percent differences between the two classes of students are lower in the matched data. The difference in GPA, for example, dropped to approximately 4 percent, while the difference in age between groups dropped from 18 percent to approximately 8 percent.

      perf_dat2 = cbind.data.frame(pam.rf$clustering, experience, age, gpa)
        #table(pam.rf$clustering)
        #head(perf_dat2)
      match.it <- matchit(pam.rf$clustering ~ age + gpa + experience, data =perf_dat2, method="nearest", ratio=1)
## Warning: glm.fit: algorithm did not converge
## Warning: Fewer control units than treated units; not all treated units will get
## a match.
      summary(match.it)
## 
## Call:
## matchit(formula = pam.rf$clustering ~ age + gpa + experience, 
##     data = perf_dat2, method = "nearest", ratio = 1)
## 
## Summary of Balance for All Data:
##            Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## distance           0.000          1.00           -2.65      0.166     0.845
## age               30.460         22.48            2.10      5.106     0.495
## gpa                3.120          2.45            1.95      3.537     0.478
## experience         0.566          0.00            1.14          .     0.566
##            eCDF Max
## distance      1.000
## age           0.941
## gpa           0.892
## experience    0.566
## 
## 
## Summary of Balance for Matched Data:
##            Means Treated Means Control Std. Mean Diff. Var. Ratio eCDF Mean
## distance           0.000          1.00          -2.649      0.793     0.757
## age               29.005         22.48           1.718      4.060     0.398
## gpa                2.992          2.45           1.579      3.809     0.389
## experience         0.412          0.00           0.831          .     0.412
##            eCDF Max Std. Pair Dist.
## distance      1.000           2.649
## age           0.941           1.718
## gpa           0.824           1.622
## experience    0.412           0.831
## 
## Percent Balance Improvement:
##            Std. Mean Diff. Var. Ratio eCDF Mean eCDF Max
## distance               0.0       87.1      10.4      0.0
## age                   18.2       14.1      19.7      0.0
## gpa                   19.2       -5.9      18.7      7.6
## experience            27.3          .      27.3     27.3
## 
## Sample Sizes:
##           Control Treated
## All            17      83
## Matched        17      17
## Unmatched       0      66
## Discarded       0       0

The matching process renders a more balanced data set. It appears to have purged some of the distortions caused by the data shifting and by the behavioral incentives that may result in selection bias.

plot(match.it, type = 'jitter', interactive = FALSE)

The matched data set is recovered with the match.data function, as displayed below.

df.match <- match.data(match.it)[1:ncol(perf_dat2)]
      #head(df.match)
      dim(df.match)
## [1] 34  4

It can then be used for subsequent analysis. For example, the difference in GPA between groups narrows after matching, although it remains statistically significant.

t.test(df.match$gpa~df.match$`pam.rf$clustering`, data = df.match)
## 
##  Welch Two Sample t-test
## 
## data:  df.match$gpa by df.match$`pam.rf$clustering`
## t = 6, df = 24, p-value = 0.00001
## alternative hypothesis: true difference in means between group 1 and group 0 is not equal to 0
## 95 percent confidence interval:
##  0.34 0.74
## sample estimates:
## mean in group 1 mean in group 0 
##            2.99            2.45


An Adjusted Admissions Algorithm

The matched data serves as the data set on which to fit an admissions algorithm. I illustrate this by fitting a simple naive Bayes model to the matched data set and testing its performance. The resulting accuracy score is reported.

In the example, the decision to classify a student by her likelihood of success in the program is made on the basis of the two variables accessible to the decision-making committee, experience and age. The accuracy of the algorithm on the sampled records is approximately 69 percent.

      # Naive Bayes admissions model fit on the matched data, using the two
      # variables available to the committee; evaluated on a random subsample.
      fit_nb <- naiveBayes(`pam.rf$clustering` ~ experience + age, data = df.match)
      sample_new  <- sample_n(df.match, 13)
      predict_new <- predict(fit_nb, newdata = sample_new[, 2:4])
      mean(predict_new == sample_new$`pam.rf$clustering`)
## [1] 0.692

Adjusting the data set may seem immaterial or unnecessary in assembling the data used by the admissions function above. If that is the case, then there is no need to adjust the One Class data set. However, leaving the model unadjusted may cloud the validity of internal comparisons and of program performance assessments.

Not recognizing that the position or ranking of particular students may be a reflection of a sample selection process tends to distort our assessment of our own abilities. ’Tis better to have a realistic appreciation of how we are doing.


Concluding Comments

Events of the last few years vitiated the machine learning models we built for admissions and program evaluation purposes; the models lost their predictive power, a condition known as model drift. In this talk we introduced model drift, which has presented a perplexing conundrum for program administrators. I explained how the events gave rise to the two most common causes of model drift: concept drift and covariate drift. Both involve changes in the stationarity of the target variable and the predictors.

The selection bias that has always been present in the admissions process continues apace. However, the direction of the bias appears to have changed as a result of shifting incentives and operational adjustments by program administrators.

We framed the resulting situation as a One Class problem to explain the use and limitations of several well-known algorithms, specifically Support Vector Machines and Random Forests. We explained their use in reconstructing a representative sample of our student body. Last, we used propensity matching to train a new admissions model on the reconstructed data set.



version: 03.05.2022