## The Setup

I’ve been tweeting my linear algebra experiments today. Here’s a quick rundown of my goal. I want to take the data (tags) people use to describe stuff (in my case books) online and visualize that in such a way that it’s readily apparent which things are similar. The easiest way to map a whole bunch of points (books again) is to plot them on a plane … which gives me 2 dimensions to work with. The problem then becomes one of reducing 1000’s of dimensions (one for each tag) into 2.

Principal Component Analysis is to tool for this job. From the wikipedia page, this is the description of PCA that informs my metaphor:

If a multivariate dataset is visualised as a set of coordinates in a high-dimensional data space (1 axis per variable), PCA supplies the user with a lower-dimensional picture, a “shadow” of this object when viewed from its (in some sense) most informative viewpoint.

I acquired a copy of an academic paper on PCA, but first I decided to see how far I could get with a simple example and the insight that wikipedia could provide. I got this far:

The discussion on the PCA page pointed me at the page for Sample Covariance Matrix which was much clearer on the actual method for generating said matrix. Now it looks like I need to calculate the Spectral Decomposition using the suggested method of Singular Value Decomposition. I’m a bit stuck at Householder Transformation, so I figured I’d document my progress so far.

## The Data

I decided to take one of my most recent reads, Wonder Boys (WB) and compare it to Atrocity Archives (AA). To play with 3 variables (one more than the 2 I’ll end up plotting) and still end up with a nice 3×3 matix, I included Call of Cuthulu (CC). For the purposes of this exercise, I’m describing these three books with the following tags:

• Lovecraft
• Fantasy
• Modern

In the table below, I’ve set on a 1-10 scale how much the book relates to the particular tag. In my final application that would be compiled by the number of users tagging a book and the number of times each tag was used for that book.

Lovecraft Fantasy Modern
WB 8 0 10
AA 9 8 10
CC 10 10 1

We can quibble extensively over the qualitative nature of the data above (in general I love such discussions), but right now I’m solely interested in the analysis. No matter how I go about the calculation, I’m going to need a vector that represents the mean for each row:

This is how I’m going to calculate the covariance:

In the above equation, x is the 3×3 matrix defined by my table above. x-bar is the “mean vector,” (6,9,7) and N ends up being 3. The results, which I computed by hand, so as to reacquaint myself with the linear algebra:

Stay Tuned. In part 2 of this series I will figure out what to do with my pretty covariant matrix. There will likely be a part 3 with a pretty graphic and some thoughts on the whole experience.

A special thanks goes out to Wolfram Alpha for generating the image of the covariant matrix above.

If there are any friendly mathematicians out there that want to point me at some resources, I’d be eternally grateful. Also, my final build of the code to run all of this will probably be in Java, although I’ve been eyeing Scala as something I’d like to play with. Feel free to contact me if you’re interested in helping out on the coding side of things. And last, but not least, the end use for all of this is going to be a Quantitative Visual Book Recommendation Engine. I’m going to need a ton of user input for this, so feel free to let me know if you’re excited by this project and want to help in any way. As always, I can be emailed at mentatjack dot com with the username mentat1.

~ by mentatjack on July 18, 2009.

### 7 Responses to “Principal Component Analysis: The Lovecraft Experiment (part 1)”

1. […] some progress with Jama, which allows me to represent and manipulate matrices in Java. I ran my lovecraft example data through the paces of SVD, QR, LU, and such, but the real test will be when I try and manipulate the […]

2. […] Component Analysis: The Java Iteration Much more satisfying than my earlier attempt to do this by hand, I’m actually building the tools that will eventually power TagShadow. […]

3. You might be interested in looking into methods termed as
“multidimensional scaling(MDS)” which seem to address
the same problem as yours. PCA is just one of linear MDS methods.

• Thanks for the heads up!

Just a quick glance at the MDS Wikipedia Entry states that this is rather cumbersome for more than 20 variables.

My current TagShadow Prototype uses 1500 variables and PCA seems to handle it fairly well. I’ll probably explore the results of this particular analysis for awhile before starting from scratch with a new one.

I’d love your input on the project.

4. Most MDS methods operate on the distance matrix, similar
as PCA operates on covariance matrix. There should be no
problem with high dimensionality, at least not more so than PCA.
The number of data points is rather a limiting factor, at it
decides the dimension of the distance matrix.
Some more recent developements in MDS run under term “manifold learning”, you can find more interesting stuff through google.
Anywhere, I am lookking forward to your part2 and part3