Time Efficiency and Accuracy Improvement using PCA
Yaumil Sitta


About Dimensionality Reduction

If you work with data often enough, you will sometimes face so many predictor variables that computation becomes heavy. Say you are asked to predict whether an employee of your company will resign, using variables such as level of satisfaction at work, number of projects, average monthly hours, time spent at the company, and so on. With that many predictors, training a model can take a long time. One way to speed up the training process is to reduce the dimensionality of the data, which makes the computation lighter.

There are two broad approaches to dimensionality reduction:

  • Feature Elimination
  • Feature Extraction

Feature Elimination

Feature elimination means selecting the variables that influence your prediction and discarding the variables that contribute nothing to it. In the employee resignation case, for example, you would keep only the variables that actually influence whether an employee resigns.

Generally, you choose the variables based on your domain expertise in employee resignation. You can also use statistical techniques for this, such as variance filters, Spearman correlation, ANOVA, and so on; a minimal sketch of the filter-based approach is shown below. This article, however, will not cover feature elimination in depth, since we want to focus on one of the feature extraction methods.
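For illustration, here is a small sketch of filter-based feature elimination using two helpers from the caret package (which we also load later in this article). The data frame predictors_df is hypothetical; substitute your own numeric predictors.

# a sketch of filter-based feature elimination, assuming `predictors_df`
# is a data frame of numeric predictors
library(caret)

# column indices of predictors with (near) zero variance
nzv_idx <- nearZeroVar(predictors_df)

# column indices to drop so that no remaining pair has |correlation| > 0.9
cor_idx <- findCorrelation(cor(predictors_df), cutoff = 0.9)

# drop the flagged columns (guarding against an empty index vector)
drop_idx <- unique(c(nzv_idx, cor_idx))
if (length(drop_idx) > 0) predictors_df <- predictors_df[, -drop_idx]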

Feature Extraction

Feature extraction is a technique where you create new variables from your existing variables. In the employee resignation case, say we have 10 predictor variables. With feature extraction, we create 10 new variables, each built from the 10 original variables. One of the techniques to do this is called Principal Component Analysis (PCA).

Principal Component Analysis

Principal Component Analysis (PCA) is a statistical method that reduces the dimensionality of the data by extracting new variables, the principal components, from the original ones and keeping only the components that carry the most information, that is, the most variance. A short illustration in base R follows the checklist below.

So, when should you use PCA instead of another method?

  • When you want to reduce the number of variables, but you do not care which variables are completely removed
  • When you want to ensure your predictor variables are not correlated with one another
  • When you are comfortable making your predictor variables less interpretable
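For a quick hands-on intuition before we use recipes, here is a minimal sketch of PCA in base R with prcomp() on the built-in mtcars data (illustrative only, not one of this article's datasets):

# PCA on a small all-numeric dataset; center and scale first, as we will
# also do in the recipes below
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# proportion of total variance captured by each principal component;
# the first few components usually dominate
summary(pca)$importance["Proportion of Variance", ]

The cumulative version of this proportion is exactly what the threshold argument of step_pca() will rely on later.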

In this article, we apply Principal Component Analysis to two datasets: the Online Shoppers Intention dataset and the Breast Cancer dataset. The aim is to compare how much PCA helps on data whose variables are only weakly correlated with one another versus data whose variables are highly correlated. Let us start with the Online Shoppers Intention dataset.

Applying PCA on the Online Shopper Intention Dataset

We will explore PCA both on data whose variables are weakly correlated and on data whose variables are strongly correlated with one another, starting with the weakly correlated data.

In this use case, we use the Online Shoppers Intention dataset, downloaded from Kaggle. The data consists of various information related to customer behavior on an online shopping website. Say we want to predict whether a customer will generate revenue for our business.

We will create two models here: the first uses PCA on the predictors in the preprocessing step, and the second does not.

Load the libraries needed.

# data wrangling
library(tidyverse)
library(GGally)

# data preprocessing
library(recipes)

# modelling
library(rsample)
library(caret)

# measure time consumption
library(tictoc)

Load the shopper intention dataset into our environment.

shopper_intention <- read_csv("pca_use_case/online_shoppers_intention.csv")

The data is shown below:

glimpse(shopper_intention)
#> Rows: 12,330
#> Columns: 18
#> $ Administrative          <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0...
#> $ Administrative_Duration <dbl> 0, 0, -1, 0, 0, 0, -1, -1, 0, 0, 0, 0, 0, 0...
#> $ Informational           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
#> $ Informational_Duration  <dbl> 0, 0, -1, 0, 0, 0, -1, -1, 0, 0, 0, 0, 0, 0...
#> $ ProductRelated          <dbl> 1, 2, 1, 2, 10, 19, 1, 1, 2, 3, 3, 16, 7, 6...
#> $ ProductRelated_Duration <dbl> 0.000000, 64.000000, -1.000000, 2.666667, 6...
#> $ BounceRates             <dbl> 0.200000000, 0.000000000, 0.200000000, 0.05...
#> $ ExitRates               <dbl> 0.200000000, 0.100000000, 0.200000000, 0.14...
#> $ PageValues              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
#> $ SpecialDay              <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.4, 0.0, 0.8...
#> $ Month                   <chr> "Feb", "Feb", "Feb", "Feb", "Feb", "Feb", "...
#> $ OperatingSystems        <fct> 1, 2, 4, 3, 3, 2, 2, 1, 2, 2, 1, 1, 1, 2, 3...
#> $ Browser                 <fct> 1, 2, 1, 2, 3, 2, 4, 2, 2, 4, 1, 1, 1, 5, 2...
#> $ Region                  <fct> 1, 1, 9, 2, 1, 1, 3, 1, 2, 1, 3, 4, 1, 1, 3...
#> $ TrafficType             <dbl> 1, 2, 3, 4, 4, 3, 3, 5, 3, 2, 3, 3, 3, 3, 3...
#> $ VisitorType             <chr> "Returning_Visitor", "Returning_Visitor", "...
#> $ Weekend                 <fct> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FA...
#> $ Revenue                 <fct> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F...

The dataset has 12,330 observations and 18 variables, so we have 17 predictor variables and 1 target variable. Here is a description of the variables in the dataset:

  • Administrative = Administrative Value
  • Administrative_Duration = Duration in Administrative Page
  • Informational = Informational Value
  • Informational_Duration = Duration in Informational Page
  • ProductRelated = Product Related Value
  • ProductRelated_Duration = Duration in Product Related Page
  • BounceRates = percentage of visitors who enter the site from that page and then leave (“bounce”) without triggering any other requests to the analytics server during that session.
  • ExitRates = Exit rate of a web page
  • PageValues = page value of each web page
  • SpecialDay = special days (e.g., Valentine's Day)
  • Month = Month of the year
  • OperatingSystems = Operating system used
  • Browser = Browser used
  • Region = Region of the user
  • TrafficType = Traffic Type
  • VisitorType = Types of Visitor
  • Weekend = Weekend or not
  • Revenue = Revenue will be generated or not

Based on the description, our variables appear to be in the correct data types. Next, let us visualize the correlation between the numeric predictor variables using the ggcorr() function from the GGally package.

ggcorr(select_if(shopper_intention, is.numeric), 
       label = T, 
       hjust = 1, 
       layout.exp = 3)

It looks like several variables are correlated with one another, but the correlations are not particularly strong. Now, let us split the data into train and test sets, with 80% of the data for training and 20% for testing.

RNGkind(sample.kind = "Rounding")
set.seed(417)
splitted <- initial_split(data = shopper_intention, prop = 0.8, strata = "Revenue")

Now, let us check the proportion of our target variable, Revenue, in the training dataset.

prop.table(table(training(splitted)$Revenue))
#> 
#>     FALSE      TRUE 
#> 0.8452103 0.1547897

Based on the proportion above, only about 15.5% of the visitors to the website purchased any goods and thus generated revenue for the shop. In other words, the proportions of our target classes are imbalanced.

Then, let us check whether there are any missing values in each variable.

colSums(is.na(shopper_intention))
#>          Administrative Administrative_Duration           Informational 
#>                      14                      14                      14 
#>  Informational_Duration          ProductRelated ProductRelated_Duration 
#>                      14                      14                      14 
#>             BounceRates               ExitRates              PageValues 
#>                      14                      14                       0 
#>              SpecialDay                   Month        OperatingSystems 
#>                       0                       0                       0 
#>                 Browser                  Region             TrafficType 
#>                       0                       0                       0 
#>             VisitorType                 Weekend                 Revenue 
#>                       0                       0                       0

Based on the output above, our data has some missing values (NA), but they make up well under 5% of the data. Hence, we can simply remove the rows with NA in our preprocessing step.

Online Shopper Revenue Prediction with PCA

In this article, we do several preprocessing steps using the recipe() function from the recipes package. Each preprocessing step is expressed as a step_*() function, including the PCA step. The PCA step in our recipe is step_pca(all_numeric(), threshold = 0.90), which means we apply PCA to the numeric variables only and keep enough components to cover 90% of the cumulative variance of the data.

rec <- recipe(Revenue~., training(splitted)) %>% 
  step_naomit(all_predictors()) %>% # remove the observation that has NA (missing value)
  step_nzv(all_predictors()) %>% # remove the near zero variance variable
  step_upsample(Revenue, ratio = 1, seed = 100) %>% # balancing the target variable proportion
  step_center(all_numeric()) %>% # make all the predictor has 0 mean
  step_scale(all_numeric()) %>% # make the predictor has 1 sd
  step_pca(all_numeric(), threshold = 0.90) %>% # do the pca by using 90% variance of the data
  prep() # prepare the recipe
train <- juice(rec)
test <- bake(rec, testing(splitted))

Now, let us peek at our train dataset after the preprocessing has been applied.

head(train)
#> # A tibble: 6 x 13
#>   Month OperatingSystems Browser Region VisitorType Weekend Revenue   PC1
#>   <fct> <fct>            <fct>   <fct>  <fct>       <fct>   <fct>   <dbl>
#> 1 Feb   1                1       1      Returning_~ FALSE   FALSE   -3.80
#> 2 Feb   2                2       1      Returning_~ FALSE   FALSE   -1.70
#> 3 Feb   3                3       1      Returning_~ TRUE    FALSE   -1.30
#> 4 Feb   2                2       1      Returning_~ FALSE   FALSE   -1.09
#> 5 Feb   2                4       3      Returning_~ FALSE   FALSE   -3.83
#> 6 Feb   1                2       1      Returning_~ TRUE    FALSE   -3.74
#> # ... with 5 more variables: PC2 <dbl>, PC3 <dbl>, PC4 <dbl>, PC5 <dbl>,
#> #   PC6 <dbl>

As seen in the train dataset above, we have 1 target variable, 6 categorical predictors, and 6 new numeric predictors, the principal components (PCs) that capture 90% of the variance, which will be fed into our model.
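To double-check that the preprocessing did what we intended, here is a small sanity check (a sketch, assuming the train object created above):

# class balance after step_upsample: should be roughly 50/50
prop.table(table(train$Revenue))

# number of principal components kept by the 90% variance threshold
sum(grepl("^PC", names(train)))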

In our first model, the one that uses PCA in the preprocessing step, we build a random forest using 5-fold cross-validation with 3 repeats to predict whether a visitor to our website will generate revenue. We also use the tic() and toc() functions to measure the time elapsed while training the random forest.

RNGkind(sample.kind = "Rounding")
set.seed(100)
tic()
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
model <- train(Revenue ~ ., data = train, method = "rf", trControl = ctrl)
toc()

After running the code, the time consumed to build the model is 1608.41 seconds, or around 27 minutes.
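As a side note, tictoc can also return the elapsed time as a value rather than only printing it, which is handy when comparing runs programmatically. A minimal sketch:

# measure a computation and keep the elapsed seconds in a variable
tic()
Sys.sleep(1)             # placeholder for a long-running computation
timing <- toc(quiet = TRUE)
timing$toc - timing$tic  # elapsed time in seconds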

Then, we use the model to predict the test dataset.

prediction_pca <- predict(model, test)

Now, let us check the accuracy of the model using a confusion matrix.

confusionMatrix(prediction_pca, test$Revenue, positive = "TRUE")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction FALSE TRUE
#>      FALSE  1954  170
#>      TRUE    128  211
#>                                           
#>                Accuracy : 0.879           
#>                  95% CI : (0.8655, 0.8916)
#>     No Information Rate : 0.8453          
#>     P-Value [Acc > NIR] : 1.064e-06       
#>                                           
#>                   Kappa : 0.5155          
#>                                           
#>  Mcnemar's Test P-Value : 0.01755         
#>                                           
#>             Sensitivity : 0.55381         
#>             Specificity : 0.93852         
#>          Pos Pred Value : 0.62242         
#>          Neg Pred Value : 0.91996         
#>              Prevalence : 0.15469         
#>          Detection Rate : 0.08567         
#>    Detection Prevalence : 0.13764         
#>       Balanced Accuracy : 0.74616         
#>                                           
#>        'Positive' Class : TRUE            
#> 

Online Shopper Revenue Prediction without PCA

Now, we want to compare the model that uses PCA in the preprocessing step with a model built from the same preprocessing steps but without PCA. Let us make the recipe first.

rec2 <- recipe(Revenue~., training(splitted)) %>% 
  step_naomit(all_predictors()) %>% 
  step_nzv(all_predictors()) %>% 
  step_upsample(Revenue, ratio = 1, seed = 100) %>% 
  step_center(all_numeric()) %>% 
  step_scale(all_numeric()) %>% 
  prep()
train2 <- juice(rec2)
test2 <- bake(rec2, testing(splitted))

Then, take a look at our training data.

head(train2)
#> # A tibble: 6 x 17
#>   Administrative Administrative_~ Informational Informational_D~ ProductRelated
#>            <dbl>            <dbl>         <dbl>            <dbl>          <dbl>
#> 1         -0.787           -0.532        -0.452           -0.287         -0.725
#> 2         -0.787           -0.532        -0.452           -0.287         -0.706
#> 3         -0.787           -0.532        -0.452           -0.287         -0.551
#> 4         -0.787           -0.532        -0.452           -0.287         -0.378
#> 5         -0.787           -0.538        -0.452           -0.294         -0.725
#> 6         -0.500           -0.538        -0.452           -0.294         -0.725
#> # ... with 12 more variables: ProductRelated_Duration <dbl>, BounceRates <dbl>,
#> #   ExitRates <dbl>, PageValues <dbl>, Month <fct>, OperatingSystems <fct>,
#> #   Browser <fct>, Region <fct>, TrafficType <dbl>, VisitorType <fct>,
#> #   Weekend <fct>, Revenue <fct>

As seen above, we now have 16 predictors. Only SpecialDay was removed (by the near-zero-variance filter); unlike in the previous model, the remaining numeric predictors are kept as-is rather than collapsed into principal components. Next, we apply the random forest algorithm with exactly the same tuning to compare the time consumed and the accuracy of the two models.

RNGkind(sample.kind = "Rounding")
set.seed(100)
tic()
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
model2 <- train(Revenue ~ ., data = train2, method = "rf", trControl = ctrl)
toc()
prediction <- predict(model2, test2)
confusionMatrix(prediction, test2$Revenue, positive = "TRUE")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction FALSE TRUE
#>      FALSE  1949  143
#>      TRUE    133  238
#>                                           
#>                Accuracy : 0.8879          
#>                  95% CI : (0.8748, 0.9001)
#>     No Information Rate : 0.8453          
#>     P-Value [Acc > NIR] : 6.578e-10       
#>                                           
#>                   Kappa : 0.5669          
#>                                           
#>  Mcnemar's Test P-Value : 0.588           
#>                                           
#>             Sensitivity : 0.62467         
#>             Specificity : 0.93612         
#>          Pos Pred Value : 0.64151         
#>          Neg Pred Value : 0.93164         
#>              Prevalence : 0.15469         
#>          Detection Rate : 0.09663         
#>    Detection Prevalence : 0.15063         
#>       Balanced Accuracy : 0.78040         
#>                                           
#>        'Positive' Class : TRUE            
#> 

Results:

  • The online shopper data has only a few variables that are correlated with one another.
  • The two models above (with PCA and without) reach almost the same accuracy: 0.879 with PCA versus 0.888 without.
  • The time elapsed with PCA is 1608.41 seconds, versus 1936.95 seconds without. Using PCA therefore saves 328.54 seconds, or around 5.5 minutes, as the quick arithmetic below shows.
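The quick arithmetic behind that last point:

# time saved and relative speed-up, from the two timings above
1936.95 - 1608.41  # 328.54 seconds saved (around 5.5 minutes)
1936.95 / 1608.41  # roughly 1.2x faster with PCA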

Now, what if we have more numeric predictors and stronger correlations?

Applying PCA on the Breast Cancer Dataset

In this section, we use the breast cancer dataset. Suppose we want to predict whether a patient's tumor is diagnosed as malignant or benign. The predictor variables are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass; they describe characteristics of the cell nuclei present in the image. The data can be downloaded from the UCI Machine Learning Repository.

Here, we will again create two models: the first uses PCA on the predictors in the preprocessing step, and the second does not.

cancer <- read_csv("pca_use_case/breast-cancer-wisconsin-data/data.csv")

Now, let us take a look at our data.

glimpse(cancer)
#> Rows: 569
#> Columns: 33
#> $ id                      <dbl> 842302, 842517, 84300903, 84348301, 8435840...
#> $ diagnosis               <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M"...
#> $ radius_mean             <dbl> 17.990, 20.570, 19.690, 11.420, 20.290, 12....
#> $ texture_mean            <dbl> 10.38, 17.77, 21.25, 20.38, 14.34, 15.70, 1...
#> $ perimeter_mean          <dbl> 122.80, 132.90, 130.00, 77.58, 135.10, 82.5...
#> $ area_mean               <dbl> 1001.0, 1326.0, 1203.0, 386.1, 1297.0, 477....
#> $ smoothness_mean         <dbl> 0.11840, 0.08474, 0.10960, 0.14250, 0.10030...
#> $ compactness_mean        <dbl> 0.27760, 0.07864, 0.15990, 0.28390, 0.13280...
#> $ concavity_mean          <dbl> 0.30010, 0.08690, 0.19740, 0.24140, 0.19800...
#> $ `concave points_mean`   <dbl> 0.14710, 0.07017, 0.12790, 0.10520, 0.10430...
#> $ symmetry_mean           <dbl> 0.2419, 0.1812, 0.2069, 0.2597, 0.1809, 0.2...
#> $ fractal_dimension_mean  <dbl> 0.07871, 0.05667, 0.05999, 0.09744, 0.05883...
#> $ radius_se               <dbl> 1.0950, 0.5435, 0.7456, 0.4956, 0.7572, 0.3...
#> $ texture_se              <dbl> 0.9053, 0.7339, 0.7869, 1.1560, 0.7813, 0.8...
#> $ perimeter_se            <dbl> 8.589, 3.398, 4.585, 3.445, 5.438, 2.217, 3...
#> $ area_se                 <dbl> 153.40, 74.08, 94.03, 27.23, 94.44, 27.19, ...
#> $ smoothness_se           <dbl> 0.006399, 0.005225, 0.006150, 0.009110, 0.0...
#> $ compactness_se          <dbl> 0.049040, 0.013080, 0.040060, 0.074580, 0.0...
#> $ concavity_se            <dbl> 0.05373, 0.01860, 0.03832, 0.05661, 0.05688...
#> $ `concave points_se`     <dbl> 0.015870, 0.013400, 0.020580, 0.018670, 0.0...
#> $ symmetry_se             <dbl> 0.03003, 0.01389, 0.02250, 0.05963, 0.01756...
#> $ fractal_dimension_se    <dbl> 0.006193, 0.003532, 0.004571, 0.009208, 0.0...
#> $ radius_worst            <dbl> 25.38, 24.99, 23.57, 14.91, 22.54, 15.47, 2...
#> $ texture_worst           <dbl> 17.33, 23.41, 25.53, 26.50, 16.67, 23.75, 2...
#> $ perimeter_worst         <dbl> 184.60, 158.80, 152.50, 98.87, 152.20, 103....
#> $ area_worst              <dbl> 2019.0, 1956.0, 1709.0, 567.7, 1575.0, 741....
#> $ smoothness_worst        <dbl> 0.1622, 0.1238, 0.1444, 0.2098, 0.1374, 0.1...
#> $ compactness_worst       <dbl> 0.6656, 0.1866, 0.4245, 0.8663, 0.2050, 0.5...
#> $ concavity_worst         <dbl> 0.71190, 0.24160, 0.45040, 0.68690, 0.40000...
#> $ `concave points_worst`  <dbl> 0.26540, 0.18600, 0.24300, 0.25750, 0.16250...
#> $ symmetry_worst          <dbl> 0.4601, 0.2750, 0.3613, 0.6638, 0.2364, 0.3...
#> $ fractal_dimension_worst <dbl> 0.11890, 0.08902, 0.08758, 0.17300, 0.07678...
#> $ X33                     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...

The dataset has 569 observations and 33 variables (32 predictors, 1 response variable). The variables are described below:

  • ID = ID number
  • diagnosis = (M = malignant, B = benign)

Ten real-valued features are computed for each cell nucleus:

  • radius (mean of distances from center to points on the perimeter)
  • texture (standard deviation of gray-scale values)
  • perimeter
  • area
  • smoothness (local variation in radius lengths)
  • compactness (perimeter^2 / area - 1.0)
  • concavity (severity of concave portions of the contour)
  • concave points (number of concave portions of the contour)
  • symmetry
  • fractal dimension (“coastline approximation” - 1)

The mean, standard error, and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, and field 23 is Worst Radius. A quick check of this naming pattern is sketched below.
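We can verify the naming pattern directly from the column names. A small sketch (run before dropping the id and X33 columns, so the 30 feature columns sit at positions 3 to 32):

# each of the 10 measurements should appear with each of the 3 statistics;
# expect a count of 10 each for mean, se, and worst
table(sub(".*_(mean|se|worst)$", "\\1", names(cancer)[3:32]))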

From the data, the id and X33 variables do not help us predict a cancer patient's diagnosis. Let us remove them from the data.

cancer <- cancer %>% 
  select(-c(X33, id))

Then, let us check whether there are any missing values in each variable.

colSums(is.na(cancer))
#>               diagnosis             radius_mean            texture_mean 
#>                       0                       0                       0 
#>          perimeter_mean               area_mean         smoothness_mean 
#>                       0                       0                       0 
#>        compactness_mean          concavity_mean     concave points_mean 
#>                       0                       0                       0 
#>           symmetry_mean  fractal_dimension_mean               radius_se 
#>                       0                       0                       0 
#>              texture_se            perimeter_se                 area_se 
#>                       0                       0                       0 
#>           smoothness_se          compactness_se            concavity_se 
#>                       0                       0                       0 
#>       concave points_se             symmetry_se    fractal_dimension_se 
#>                       0                       0                       0 
#>            radius_worst           texture_worst         perimeter_worst 
#>                       0                       0                       0 
#>              area_worst        smoothness_worst       compactness_worst 
#>                       0                       0                       0 
#>         concavity_worst    concave points_worst          symmetry_worst 
#>                       0                       0                       0 
#> fractal_dimension_worst 
#>                       0

No variables have missing values this time. Now, let us check the correlations to see whether the variables here are more highly correlated with one another than in the online shopper data.

ggcorr(cancer, label = T, hjust = 1, label_size = 2, layout.exp = 6)

From the visualization above, the variables in this data are more highly correlated with one another than those in the online shopper data.

RNGkind(sample.kind = "Rounding")
set.seed(100)
idx <- initial_split(cancer, prop = 0.8, strata = "diagnosis")
cancer_train <- training(idx)
cancer_test <- testing(idx)

Breast Cancer Prediction with PCA

Using the breast cancer dataset, we first build a model with PCA in the preprocessing step. As before, we keep 90% of the variance of the data.

rec_cancer_pca <- recipe(diagnosis~., cancer_train) %>% 
  step_naomit(all_predictors()) %>% 
  step_nzv(all_predictors()) %>%  
  step_center(all_numeric()) %>%  
  step_scale(all_numeric()) %>%  
  step_pca(all_numeric(), threshold = 0.9) %>%  
  prep() 
cancer_train_pca <- juice(rec_cancer_pca)
cancer_test_pca <- bake(rec_cancer_pca, cancer_test)

After applying PCA to the breast cancer dataset, here are the variables we will be using.

head(cancer_train_pca)
#> # A tibble: 6 x 8
#>   diagnosis   PC1    PC2    PC3    PC4    PC5     PC6    PC7
#>   <fct>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>   <dbl>  <dbl>
#> 1 M         -9.15  -1.58  0.900 -3.89  -0.655  1.33   -2.06 
#> 2 M         -2.33   3.98  0.528 -1.04   0.584 -0.0925 -0.104
#> 3 M         -7.21 -10.1   3.13  -0.868 -2.31   2.92   -1.35 
#> 4 M         -3.91   2.22 -1.51  -2.72   0.833 -1.28    0.829
#> 5 M         -2.18   2.86  1.66  -0.242 -0.108 -0.196   0.194
#> 6 M         -3.16  -3.26  3.06   0.153 -1.55   0.542   0.213

From the table above, we now use 7 principal components instead of the original 30 predictor variables.
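Optionally, we can peek at which original variables drive the first component. In rec_cancer_pca, step_pca is the fifth step, so tidy() with number = 5 returns its loadings (a sketch using the objects defined above):

# loadings of the retained components; a larger |value| means the original
# variable contributes more to that component
tidy(rec_cancer_pca, number = 5) %>% 
  filter(component == "PC1") %>% 
  arrange(desc(abs(value))) %>% 
  head()

Now, let us train the model on these components.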

RNGkind(sample.kind = "Rounding")
set.seed(100)
tic()
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
model_cancer_pca <- train(diagnosis ~ ., data = cancer_train_pca, method = "rf", trControl = ctrl)
toc()

With PCA, training on this dataset took 4.88 seconds. Next, we predict the test dataset using model_cancer_pca.

pred_cancer_pca <- predict(model_cancer_pca, cancer_test_pca)

Now, let us check our model's performance using a confusion matrix.

confusionMatrix(pred_cancer_pca, cancer_test_pca$diagnosis, positive = "M")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  B  M
#>          B 70  3
#>          M  1 39
#>                                           
#>                Accuracy : 0.9646          
#>                  95% CI : (0.9118, 0.9903)
#>     No Information Rate : 0.6283          
#>     P-Value [Acc > NIR] : <2e-16          
#>                                           
#>                   Kappa : 0.9235          
#>                                           
#>  Mcnemar's Test P-Value : 0.6171          
#>                                           
#>             Sensitivity : 0.9286          
#>             Specificity : 0.9859          
#>          Pos Pred Value : 0.9750          
#>          Neg Pred Value : 0.9589          
#>              Prevalence : 0.3717          
#>          Detection Rate : 0.3451          
#>    Detection Prevalence : 0.3540          
#>       Balanced Accuracy : 0.9572          
#>                                           
#>        'Positive' Class : M               
#> 

The accuracy of the model on the test data when using PCA is 0.96. Next, we build a model without PCA to compare against.

Breast Cancer Prediction without PCA

In this part, we classify the breast cancer diagnoses without using PCA in the preprocessing step. Let us create a recipe for it.

rec_cancer <- recipe(diagnosis~., cancer_train) %>% 
  step_naomit(all_predictors()) %>% 
  step_nzv(all_predictors()) %>% 
  step_center(all_numeric()) %>% 
  step_scale(all_numeric()) %>% 
  prep()
cancer_train <- juice(rec_cancer)
cancer_test <- bake(rec_cancer, cancer_test)

Here, we create a model using the same algorithm and specification as the previous model.

tic()
set.seed(100)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
model_cancer <- train(diagnosis ~ ., data = cancer_train, method = "rf", trControl = ctrl)
toc()

Training without PCA took 11.21 seconds, which means the model that uses PCA in preprocessing trains more than 2x faster (4.88 versus 11.21 seconds).

pred_cancer <- predict(model_cancer, cancer_test)

How about the accuracy of the model? Is it higher when we do not use PCA? Let us check the confusion matrix below.

confusionMatrix(pred_cancer, cancer_test$diagnosis, positive = "M")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  B  M
#>          B 71  5
#>          M  0 37
#>                                           
#>                Accuracy : 0.9558          
#>                  95% CI : (0.8998, 0.9855)
#>     No Information Rate : 0.6283          
#>     P-Value [Acc > NIR] : < 2e-16         
#>                                           
#>                   Kappa : 0.9029          
#>                                           
#>  Mcnemar's Test P-Value : 0.07364         
#>                                           
#>             Sensitivity : 0.8810          
#>             Specificity : 1.0000          
#>          Pos Pred Value : 1.0000          
#>          Neg Pred Value : 0.9342          
#>              Prevalence : 0.3717          
#>          Detection Rate : 0.3274          
#>    Detection Prevalence : 0.3274          
#>       Balanced Accuracy : 0.9405          
#>                                           
#>        'Positive' Class : M               
#> 

It turns out that, based on the confusion matrix above, the accuracy without PCA (0.95) is slightly lower than with PCA (0.96). Hence, PCA works well on data that is high dimensional and whose variables are highly correlated with one another.

Results:

  • The breast cancer dataset has many variables that are correlated with one another.
  • The two models above (with PCA and without) reach almost the same accuracy: 0.96 with PCA versus 0.95 without.
  • The time elapsed with PCA is 4.88 seconds, versus 11.21 seconds without. Using PCA saves 6.33 seconds, making the computation more than 2x faster.

Conclusion

Principal Component Analysis (PCA) is very useful for speeding up computation by reducing the dimensionality of the data. Moreover, when you have high-dimensional data whose variables are highly correlated with one another, PCA can also improve the accuracy of a classification model. The trade-off is that PCA makes your machine learning model less interpretable. Also, PCA only applies when your dataset contains more than one numeric variable whose dimensionality you want to reduce.

