Principal Component Analysis: The Lovecraft Experiment (part 1)
I’ve been tweeting my linear algebra experiments today. Here’s a quick rundown of my goal. I want to take the data (tags) people use to describe stuff (in my case books) online and visualize that in such a way that it’s readily apparent which things are similar. The easiest way to map a whole bunch of points (books again) is to plot them on a plane … which gives me 2 dimensions to work with. The problem then becomes one of reducing 1000’s of dimensions (one for each tag) into 2.
Principal Component Analysis is to tool for this job. From the wikipedia page, this is the description of PCA that informs my metaphor:
If a multivariate dataset is visualised as a set of coordinates in a high-dimensional data space (1 axis per variable), PCA supplies the user with a lower-dimensional picture, a “shadow” of this object when viewed from its (in some sense) most informative viewpoint.
I acquired a copy of an academic paper on PCA, but first I decided to see how far I could get with a simple example and the insight that wikipedia could provide. I got this far:
The discussion on the PCA page pointed me at the page for Sample Covariance Matrix which was much clearer on the actual method for generating said matrix. Now it looks like I need to calculate the Spectral Decomposition using the suggested method of Singular Value Decomposition. I’m a bit stuck at Householder Transformation, so I figured I’d document my progress so far.
I decided to take one of my most recent reads, Wonder Boys (WB) and compare it to Atrocity Archives (AA). To play with 3 variables (one more than the 2 I’ll end up plotting) and still end up with a nice 3×3 matix, I included Call of Cuthulu (CC). For the purposes of this exercise, I’m describing these three books with the following tags:
In the table below, I’ve set on a 1-10 scale how much the book relates to the particular tag. In my final application that would be compiled by the number of users tagging a book and the number of times each tag was used for that book.
We can quibble extensively over the qualitative nature of the data above (in general I love such discussions), but right now I’m solely interested in the analysis. No matter how I go about the calculation, I’m going to need a vector that represents the mean for each row:
This is how I’m going to calculate the covariance:
In the above equation, x is the 3×3 matrix defined by my table above. x-bar is the “mean vector,” (6,9,7) and N ends up being 3. The results, which I computed by hand, so as to reacquaint myself with the linear algebra:
Stay Tuned. In part 2 of this series I will figure out what to do with my pretty covariant matrix. There will likely be a part 3 with a pretty graphic and some thoughts on the whole experience.
A special thanks goes out to Wolfram Alpha for generating the image of the covariant matrix above.
If there are any friendly mathematicians out there that want to point me at some resources, I’d be eternally grateful. Also, my final build of the code to run all of this will probably be in Java, although I’ve been eyeing Scala as something I’d like to play with. Feel free to contact me if you’re interested in helping out on the coding side of things. And last, but not least, the end use for all of this is going to be a Quantitative Visual Book Recommendation Engine. I’m going to need a ton of user input for this, so feel free to let me know if you’re excited by this project and want to help in any way. As always, I can be emailed at mentatjack dot com with the username mentat1.
~ by mentatjack on July 18, 2009.