PCA in R

Feb 24

Imagine you've got data on various cars, including their miles per gallon (mpg), horsepower (hp), and weight (wt). A principal components analysis (PCA) can help you understand which of these aspects contribute most significantly to the differences between the cars. For a graduate student in engineering, this could mean identifying design features that most impact fuel efficiency. For someone in marketing, it might reveal how these features cluster together, suggesting distinct consumer preferences.

In our PCA of mtcars, one of the datasets included with R, the free statistical software, the first two principal components paint the bulk of the picture: together they capture roughly 84% of the variance across the dataset's 11 variables. Here is the code:

# Perform PCA using the base R stats package

pca_result <- prcomp(mtcars, scale. = TRUE)

# Summary of PCA results

summary(pca_result)

# Standard biplot

biplot(pca_result)

PC1 and PC2 Axes: In the biplot, the horizontal axis (PC1) and the vertical axis (PC2) represent the first and second principal components, respectively. These two components are the new axes of the transformed space, and together they capture more of the variation than any other pair of components.
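To see how much variation each axis captures, the shares can be recomputed from the fitted object. The following is a minimal sketch rather than part of the original analysis: it refits the same model so that it runs on its own, and the names component_variance, prop_variance, and cum_variance are introduced here purely for illustration.

```r
# Refit the PCA from above so this block runs on its own
pca_result <- prcomp(mtcars, scale. = TRUE)

# Each component's variance is its squared standard deviation
component_variance <- pca_result$sdev^2

# Share of the total variance captured by each component
prop_variance <- component_variance / sum(component_variance)
names(prop_variance) <- colnames(pca_result$x)

# Running total: how much the first k components capture together
cum_variance <- cumsum(prop_variance)

# Show the first two components, rounded for readability
round(rbind(proportion = prop_variance, cumulative = cum_variance)[, 1:2], 3)
```

These should be the same figures that summary(pca_result) reports in its "Proportion of Variance" and "Cumulative Proportion" rows.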

Position of Cars: Each car's position in this space is determined by its scores on these two components. Cars that are closer together are more similar in terms of the original variables used in the PCA.
Vectors (Red Arrows): The vectors represent the loadings of the original variables on the principal components. Their direction and length indicate how each variable influences the principal components:
Direction: The direction of the arrow shows which way along PC1 and PC2 the variable increases. Cars lying further in that direction tend to have higher values of that variable.
Length: The length of the arrow indicates how strongly the variable influences the two plotted components. Longer arrows represent variables that have a stronger effect on how the cars are spread out in the PC1-PC2 plane.
Variable Correlations: Vectors pointing in roughly the same direction indicate positive correlation between those variables, vectors pointing in opposite directions indicate negative correlation, and vectors at right angles indicate little correlation. Because the plot shows only two components, these relationships are approximate. For example, 'mpg' is negatively correlated with 'hp' and 'disp'.
Contribution to Variance: The closeness of the 'hp', 'disp', and 'wt' (weight) vectors to the PC1 axis indicates that these variables contribute significantly to PC1, which the Proportion of Variance row of the summary() output shows to be the component capturing the largest share of the variance (about 60% here).
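The arrows described above are drawn from the loadings stored in the fitted object, so the reading of the plot can be checked numerically. This is a rough sketch under the same model as above; loadings and arrow_length are illustrative names, and because biplot() rescales the arrows before drawing them, the plotted lengths will not match these values exactly.

```r
# Refit the PCA from above so this block runs on its own
pca_result <- prcomp(mtcars, scale. = TRUE)

# Loadings of every variable on the first two components
loadings <- pca_result$rotation[, 1:2]
round(loadings, 2)

# Length of each loading vector in the PC1-PC2 plane
# (biplot() rescales the arrows, so plotted lengths differ from these)
arrow_length <- sqrt(rowSums(loadings^2))
round(sort(arrow_length, decreasing = TRUE), 2)

# Check the correlations the arrows suggest against the raw data
cor(mtcars$mpg, mtcars$hp)
cor(mtcars$mpg, mtcars$disp)
cor(mtcars$hp, mtcars$disp)
```

The signs of these correlations should agree with whether the corresponding arrows point the same way or in opposite directions.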

From the plot, it seems that 'hp' (horsepower) and 'disp' (displacement) are strongly associated with PC1 and lie in the positive direction of PC1, indicating that cars with higher values in these variables will have higher PC1 scores. Variables such as 'mpg' (miles per gallon) lie in the negative direction of PC1, suggesting that cars with better fuel efficiency have lower PC1 scores.
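The link between a car's variables and its PC1 score can be made concrete: each score is a weighted sum of the car's standardized values, with the loadings as weights. Here is a small sketch of that calculation, assuming the same model as above; z and pc1_manual are names made up for this example. Note that the overall sign of a component is arbitrary, so on some setups every PC1 sign may be flipped without changing the interpretation.

```r
# Refit the PCA from above so this block runs on its own
pca_result <- prcomp(mtcars, scale. = TRUE)

# Standardize each variable (mean 0, sd 1), as scale. = TRUE does internally
z <- scale(mtcars)

# PC1 score = sum over variables of (standardized value * PC1 loading)
pc1_manual <- as.vector(z %*% pca_result$rotation[, "PC1"])

# Should match the scores stored by prcomp()
all.equal(pc1_manual, unname(pca_result$x[, "PC1"]))

# Correlation of each original variable with the PC1 scores;
# the signs mirror the directions of the arrows along PC1
round(cor(mtcars, pca_result$x[, "PC1"])[, 1], 2)
```

Variables whose correlation with PC1 has the same sign as that of 'hp' pull a car's score the same way as horsepower does; those with the opposite sign, such as 'mpg', pull it the other way.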

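The Maserati Bora, Ford Pantera L, and Ferrari Dino stand out as a group, set apart from the other cars mainly along PC2. Combined with the loading vectors, this reflects what the three share: manual transmissions, five forward gears, and fast quarter-mile times. The Maserati Bora and Ford Pantera L also sit on the positive side of PC1, in line with their high horsepower and displacement, while the smaller-engined Ferrari Dino sits near the middle of PC1. On the other hand, cars like the Honda Civic and Toyota Corolla are at the negative end of PC1, indicating lower horsepower and better fuel efficiency.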
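Where individual cars fall can be confirmed by reading their scores straight from the fitted object and comparing them with the raw data. The snippet below is one possible way to do this; cars_of_interest is an illustrative name, and the signs of the scores may be flipped on some setups because the orientation of each component is arbitrary.

```r
# Refit the PCA from above so this block runs on its own
pca_result <- prcomp(mtcars, scale. = TRUE)

# Cars named in the text (row names of mtcars)
cars_of_interest <- c("Maserati Bora", "Ford Pantera L", "Ferrari Dino",
                      "Honda Civic", "Toyota Corolla")

# Scores of these cars on the first two components
round(pca_result$x[cars_of_interest, c("PC1", "PC2")], 2)

# Raw values of the variables discussed in the text, for comparison
mtcars[cars_of_interest, c("mpg", "hp", "disp")]
```

Comparing the two tables side by side shows how each car's position on the plot relates to its horsepower, displacement, and fuel efficiency.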

That’s just one example of how PCA can offer unexpected and useful insights into data.

BridgeText can help you with all of your statistics needs.