While I was learning about PCA (principal component analysis) on Coursera, I came across two websites that explain it relatively well:
- Amoeba, a contributor on Stack Exchange, who explained the concept in terms ranging from layman’s to mathematical.
- George Dallas, a UK-based social scientist, who simplified the concept with a step-by-step illustration.
I thought it would be great to combine all three sources and get a better grasp of the concept, using cheese as an example.
What does PCA do in general?
Generally, cheese has a whole list of different characteristics: taste, texture, milk source, country of origin, etc. But some of these properties are related, so measuring all of them is redundant.
In layman’s terms, PCA summarizes the related characteristics into a few main characteristics for describing each cheese. This process is called ‘dimensionality reduction’ or ‘data compression’.
How does PCA “summarize” the characteristics of cheese?
Let’s say two cheese characteristics, texture and milk source, are correlated, and we graph them in a scatter plot. Each ‘X’ point is a particular type of cheese. In summarizing them into a new cheese characteristic, PCA has one goal: to minimize the projection error, which is the average squared distance from every ‘X’ point to the projection line. To achieve this, PCA does the following two steps:
- Data compression: PCA constructs a new property by drawing a line through the center of this cheese cloud and projecting all ‘X’ points onto the line (giving the red points).
- Reconstruction: the blue line with its red points is now the new ‘summarized’ cheese characteristic, which is called the “first principal component”.
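To make the two steps concrete, here is a minimal sketch in Python with NumPy. The cheese data is made up, and the names (`texture`, `milk_source`, `project`, `projection_error`) are my own, not from any of the three sources. Instead of the usual linear algebra, it simply tries many line angles through the center of the cloud and keeps the one with the smallest projection error, which is a slow but transparent way to see what PCA is after.

```python
import numpy as np

# Hypothetical data: 200 cheeses with two correlated characteristics
rng = np.random.default_rng(0)
texture = rng.normal(size=200)
milk_source = 0.8 * texture + 0.3 * rng.normal(size=200)
X = np.column_stack([texture, milk_source])

# Move the origin to the center of the cheese cloud
center = X.mean(axis=0)
Xc = X - center


def project(Xc, angle):
    """Project centered points onto the line through the origin at `angle` (radians)."""
    direction = np.array([np.cos(angle), np.sin(angle)])
    positions = Xc @ direction                  # one number per cheese
    red_points = np.outer(positions, direction)  # the projections, in 2-D
    return positions, red_points


def projection_error(Xc, angle):
    """Average squared distance from each point to its projection."""
    _, red_points = project(Xc, angle)
    return np.mean(np.sum((Xc - red_points) ** 2, axis=1))


# Brute force: try every half degree and keep the best line
angles = np.deg2rad(np.arange(0.0, 180.0, 0.5))
errors = np.array([projection_error(Xc, a) for a in angles])
best_angle = angles[np.argmin(errors)]

# Data compression: each cheese is now described by a single number
compressed, red_points = project(Xc, best_angle)

# Reconstruction: the red points, back in the original coordinates
reconstructed = center + red_points
```

Real PCA implementations do not search over angles like this; the eigenvector approach discussed further down finds the same line directly.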
Isn’t this a linear regression?
No, this is a common misconception.
- In linear regression, we minimize the squared error from every ‘X’ point to our predictor line, measured as vertical distances. In PCA, we minimize the shortest (orthogonal) distances from our data points to the line, which are always at 90° to the blue line.
- In linear regression, we use our examples x to predict y. In PCA, there is no y: we take a number of features and find a lower-dimensional representation that stays as close as possible to all of them. We are not predicting any results.
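A quick numerical comparison shows that the two lines really are different. This is only a sketch on made-up data: `vertical_error` and `orthogonal_error` are helper names I introduced, and the PCA direction is taken from NumPy’s eigen-decomposition of the covariance matrix (the eigenvector idea is explained further down).

```python
import numpy as np

# Hypothetical correlated data, centered so both lines pass through the origin
rng = np.random.default_rng(1)
x = rng.normal(size=300)
y = 0.5 * x + 0.5 * rng.normal(size=300)
x, y = x - x.mean(), y - y.mean()

# Linear regression of y on x: the slope that minimizes vertical distances
regression_slope = np.sum(x * y) / np.sum(x * x)

# PCA: the direction that minimizes orthogonal distances
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(x, y))
pc1 = eigenvectors[:, -1]          # eigh sorts eigenvalues in ascending order
pca_slope = pc1[1] / pc1[0]


def vertical_error(slope):
    """Average squared vertical distance to the line y = slope * x."""
    return np.mean((y - slope * x) ** 2)


def orthogonal_error(slope):
    """Average squared perpendicular distance to the line y = slope * x."""
    return np.mean((y - slope * x) ** 2) / (1 + slope ** 2)


print(f"regression slope: {regression_slope:.3f}, PCA slope: {pca_slope:.3f}")
print(f"vertical error:   regression {vertical_error(regression_slope):.3f}, "
      f"PCA {vertical_error(pca_slope):.3f}")
print(f"orthogonal error: regression {orthogonal_error(regression_slope):.3f}, "
      f"PCA {orthogonal_error(pca_slope):.3f}")
```

Each line wins on its own measure of error and loses on the other one, which is the whole difference between the two methods.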
The shortest distance sounds like Pythagoras’ theorem, doesn’t it?
Good observation! Remember the main goal of PCA, which is to draw the blue line so that the projection error is as small as possible?
That projection error is measured as the average squared length of the orange lines. The spread of the red points, in turn, is measured as the average squared distance from the center of the cheese cloud to each red point. As I mentioned earlier, the angle between the orange lines and the blue line is always 90°, and hence the sum of these two quantities is equal to the average squared distance between the center of the cheese cloud and each ‘X’, which is precisely Pythagoras’ theorem!
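The Pythagoras argument can be checked numerically. The sketch below uses made-up cheese data and a helper I named `decompose`; for a few arbitrary line angles it computes the spread of the projected points and the projection error, and compares their sum with the average squared distance from the center to each point.

```python
import numpy as np

# Hypothetical cheese cloud, centered at the origin
rng = np.random.default_rng(2)
texture = rng.normal(size=200)
milk_source = 0.8 * texture + 0.3 * rng.normal(size=200)
Xc = np.column_stack([texture, milk_source])
Xc = Xc - Xc.mean(axis=0)


def decompose(angle_degrees):
    """Return (spread along the line, projection error) for a line at this angle."""
    a = np.deg2rad(angle_degrees)
    direction = np.array([np.cos(a), np.sin(a)])
    along = Xc @ direction                       # position of each projected point
    projected = np.outer(along, direction)
    spread = np.mean(along ** 2)                 # avg squared distance center -> projection
    error = np.mean(np.sum((Xc - projected) ** 2, axis=1))  # avg squared perpendicular length
    return spread, error


# Average squared distance from the center to each original point
total = np.mean(np.sum(Xc ** 2, axis=1))

for angle in (0, 30, 60, 90):
    spread, error = decompose(angle)
    print(f"{angle:>2} deg: spread={spread:.3f}  error={error:.3f}  sum={spread + error:.3f}")
print(f"total: {total:.3f}")
```

Because the sum is the same for every angle, making the projection error as small as possible is the same thing as making the spread along the line as large as possible.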
So what does it have to do with eigenvectors and eigenvalues?
Eigenvectors and eigenvalues exist in pairs: every eigenvector has a corresponding eigenvalue.
- An eigenvector is the direction of a line. Here, it refers to the direction of the blue line (whether it is vertical, horizontal, at 45 degrees, etc.).
- An eigenvalue is a number indicating how dispersed the data are in that direction. Here, it refers to how spread out the red dots are on the blue line.
The eigenvector with the highest eigenvalue is therefore the first principal component.
The x and y axes are perpendicular, at 90° to each other. Remember that the orange and blue lines are perpendicular as well? Here is where eigenvectors come into the picture: in data compression, a perpendicular (dashed) line X2 is drawn to the blue line X1. Both X1 and X2 are eigenvectors of the data’s covariance matrix, and each comes with its own eigenvalue.
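Here is a small sketch of how X1 and X2 could be computed with NumPy, again on made-up cheese data. It assumes the usual recipe of taking the eigenvectors of the covariance matrix; the variable names are mine.

```python
import numpy as np

# Hypothetical cheese cloud
rng = np.random.default_rng(3)
texture = rng.normal(size=200)
milk_source = 0.8 * texture + 0.3 * rng.normal(size=200)
X = np.column_stack([texture, milk_source])
Xc = X - X.mean(axis=0)

# 2 x 2 covariance matrix of the two characteristics
covariance = np.cov(Xc, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(covariance)

# Reorder so that the largest eigenvalue comes first
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

X1 = eigenvectors[:, 0]   # first principal component (the blue line)
X2 = eigenvectors[:, 1]   # the perpendicular (dashed) direction

# The eigenvector is a direction ...
angle_of_X1 = np.degrees(np.arctan2(X1[1], X1[0])) % 180

# ... and its eigenvalue is the variance of the points projected onto it
spread_along_X1 = np.var(Xc @ X1, ddof=1)

print(f"X1 points at {angle_of_X1:.1f} degrees")
print(f"eigenvalue {eigenvalues[0]:.3f} vs. spread along X1 {spread_along_X1:.3f}")
print(f"X1 . X2 = {X1 @ X2:.1e}")
```

The dot product of X1 and X2 comes out as zero (up to rounding), which is the numerical way of saying the two lines are perpendicular.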
Next, notice that the reconstruction graph is illustrated in terms of the new axes X1 and X2 instead of the old axes. We neither discarded nor changed any data. The eigenvectors just allowed us to look at the data from a different angle, with a new set of dimensions. In fact, there are always as many eigenvectors as there are original dimensions. (For a more interesting illustration, you can check out George Dallas’ blog posts.)
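The claim that switching to the new axes loses nothing can be verified directly. In this sketch (made-up data, my own variable names) the cheeses are re-expressed on the eigenvector axes and then converted back to the original axes.

```python
import numpy as np

# Hypothetical cheese cloud
rng = np.random.default_rng(4)
texture = rng.normal(size=200)
milk_source = 0.8 * texture + 0.3 * rng.normal(size=200)
X = np.column_stack([texture, milk_source])
center = X.mean(axis=0)
Xc = X - center

# Two original dimensions give two eigenvectors (the columns of this matrix)
_, eigenvectors = np.linalg.eigh(np.cov(Xc, rowvar=False))

# Coordinates of every cheese on the new axes
rotated = Xc @ eigenvectors

# The eigenvectors are orthonormal, so the transpose undoes the rotation
recovered = rotated @ eigenvectors.T + center

print("eigenvector matrix shape:", eigenvectors.shape)
print("original data recovered:", np.allclose(recovered, X))
print("covariance on the new axes:")
print(np.round(np.cov(rotated, rowvar=False), 3))
```

On the new axes the off-diagonal covariance is zero, so the two new coordinates are uncorrelated with each other even though texture and milk source were not.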
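So, now that most of what the texture and milk source characteristics (x1 and x2) tell us is merged into X1, what does X2 represent? It holds only the small amount of variation that X1 leaves over, so it can be left out with little loss of information. This is how dimension reduction works! The freed-up axis is where you could consider plotting a new, uncorrelated variable for a more useful graph and statistics.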
See, PCA is useful because it reduces the amount of data we have to store in computer memory, which in turn can speed up our learning algorithm!
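As a rough illustration of the storage saving, the sketch below keeps only the coordinate along X1 for 1,000 made-up cheeses and compares array sizes. The `variance_kept` ratio (largest eigenvalue divided by the sum of all eigenvalues) is a common way to judge how much of the spread survives; the names are mine.

```python
import numpy as np

# Hypothetical data: 1,000 cheeses, two correlated characteristics
rng = np.random.default_rng(5)
texture = rng.normal(size=1000)
milk_source = 0.8 * texture + 0.3 * rng.normal(size=1000)
X = np.column_stack([texture, milk_source])
Xc = X - X.mean(axis=0)

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(Xc, rowvar=False))
X1 = eigenvectors[:, -1]          # direction with the largest eigenvalue

# Keep the coordinate along X1 only
compressed = Xc @ X1

# Share of the total spread that the single kept coordinate still carries
variance_kept = eigenvalues[-1] / eigenvalues.sum()

print(f"original:   {X.shape}, {X.nbytes} bytes")
print(f"compressed: {compressed.shape}, {compressed.nbytes} bytes")
print(f"variance kept: {variance_kept:.1%}")
```

With only two characteristics the saving is a factor of two; the same idea pays off much more when hundreds of correlated features are reduced to a handful.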
Any other common misconceptions of PCA I should be aware of?
Misconception 1: PCA checks characteristics that are redundant and discards them
No. Once again, it doesn’t discard the redundant characteristics. Instead, it combines them and constructs new characteristics which still summarize the list of cheeses well.
Misconception 2: PCA reduces the number of examples
We are not reducing the number of examples. We are reducing the number of features by compressing the characteristics into a new property.
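Both points can be seen in a small sketch with four made-up characteristics that all depend on one hidden trait (I call it `richness`; it and the `loadings` are purely illustrative). The summarized characteristic gives a nonzero weight to every original characteristic, and the number of cheeses stays the same.

```python
import numpy as np

# Hypothetical data: 500 cheeses, 4 characteristics driven by one hidden trait
rng = np.random.default_rng(6)
n_cheeses = 500
richness = rng.normal(size=n_cheeses)
loadings = np.array([1.0, 0.8, 0.6, 0.9])
X = np.outer(richness, loadings) + 0.2 * rng.normal(size=(n_cheeses, 4))
Xc = X - X.mean(axis=0)

eigenvalues, eigenvectors = np.linalg.eigh(np.cov(Xc, rowvar=False))
weights = eigenvectors[:, -1]     # first principal component

# One new characteristic per cheese, built from all four originals
summary = Xc @ weights

print("weight on each original characteristic:", np.round(weights, 2))
print("examples before:", X.shape[0], "after:", summary.shape[0])
print("features before:", X.shape[1], "after: 1")
```

No original characteristic is thrown away with a weight of zero; each one contributes to the new property, and only the number of columns shrinks.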
In the next article, I’ll write about the trading applications of PCA.