## Principal Component Analysis: The Lovecraft Experiment (part 1)

## The Setup

I’ve been tweeting my linear algebra experiments today. Here’s a quick rundown of my goal. I want to take the data (tags) people use to describe stuff (in my case books) online and visualize that in such a way that it’s readily apparent which things are similar. The easiest way to map a whole bunch of points (books again) is to plot them on a plane … which gives me 2 dimensions to work with. The problem then becomes one of reducing 1000’s of dimensions (one for each tag) into 2.

Principal Component Analysis is to tool for this job. From the wikipedia page, this is the description of PCA that informs my metaphor:

If a multivariate dataset is visualised as a set of coordinates in a high-dimensional data space (1 axis per variable), PCA supplies the user with a lower-dimensional picture, a “shadow” of this object when viewed from its (in some sense) most informative viewpoint.

I acquired a copy of an academic paper on PCA, but first I decided to see how far I could get with a simple example and the insight that wikipedia could provide. I got this far:

The discussion on the PCA page pointed me at the page for Sample Covariance Matrix which was much clearer on the actual method for generating said matrix. Now it looks like I need to calculate the Spectral Decomposition using the suggested method of Singular Value Decomposition. I’m a bit stuck at Householder Transformation, so I figured I’d document my progress so far.

## The Data

I decided to take one of my most recent reads, Wonder Boys (WB) and compare it to Atrocity Archives (AA). To play with 3 variables (one more than the 2 I’ll end up plotting) and still end up with a nice 3×3 matix, I included Call of Cuthulu (CC). For the purposes of this exercise, I’m describing these three books with the following tags:

- Lovecraft
- Fantasy
- Modern

In the table below, I’ve set on a 1-10 scale how much the book relates to the particular tag. In my final application that would be compiled by the number of users tagging a book and the number of times each tag was used for that book.

Lovecraft | Fantasy | Modern | |
---|---|---|---|

WB | 8 | 0 | 10 |

AA | 9 | 8 | 10 |

CC | 10 | 10 | 1 |

We can quibble extensively over the qualitative nature of the data above (in general I love such discussions), but right now I’m solely interested in the analysis. No matter how I go about the calculation, I’m going to need a vector that represents the mean for each row:

This is how I’m going to calculate the covariance:

In the above equation, x is the 3×3 matrix defined by my table above. x-bar is the “mean vector,” (6,9,7) and N ends up being 3. The results, which I computed by hand, so as to reacquaint myself with the linear algebra:

Stay Tuned. In part 2 of this series I will figure out what to do with my pretty covariant matrix. There will likely be a part 3 with a pretty graphic and some thoughts on the whole experience.

A special thanks goes out to Wolfram Alpha for generating the image of the covariant matrix above.

If there are any friendly mathematicians out there that want to point me at some resources, I’d be eternally grateful. Also, my final build of the code to run all of this will probably be in Java, although I’ve been eyeing Scala as something I’d like to play with. Feel free to contact me if you’re interested in helping out on the coding side of things. And last, but not least, the end use for all of this is going to be a Quantitative Visual Book Recommendation Engine. I’m going to need a ton of user input for this, so feel free to let me know if you’re excited by this project and want to help in any way. As always, I can be emailed at mentatjack dot com with the username mentat1.

### Rate this:

*Related*

~ by mentatjack on July 18, 2009.

Posted in Uncategorized

Tags: Charles Stross, fantasy, linear algebra, lovecraft, Michael Chabon, modern, principal component analysis, tagShadow

[…] some progress with Jama, which allows me to represent and manipulate matrices in Java. I ran my lovecraft example data through the paces of SVD, QR, LU, and such, but the real test will be when I try and manipulate the […]

Tagshadow: Jama, Weka and Scatter Plots « MentatJack said this on September 11, 2009 at 9:53 am |

[…] Component Analysis: The Java Iteration Much more satisfying than my earlier attempt to do this by hand, I’m actually building the tools that will eventually power TagShadow. […]

Pricinipal Component Analysis: The Java Iteration « MentatJack said this on September 12, 2009 at 12:32 pm |

You might be interested in looking into methods termed as

“multidimensional scaling(MDS)” which seem to address

the same problem as yours. PCA is just one of linear MDS methods.

James said this on October 5, 2009 at 2:34 pm |

Thanks for the heads up!

Just a quick glance at the MDS Wikipedia Entry states that this is rather cumbersome for more than 20 variables.

My current TagShadow Prototype uses 1500 variables and PCA seems to handle it fairly well. I’ll probably explore the results of this particular analysis for awhile before starting from scratch with a new one.

I’d love your input on the project.

mentatjack said this on October 5, 2009 at 4:10 pm |

Most MDS methods operate on the distance matrix, similar

as PCA operates on covariance matrix. There should be no

problem with high dimensionality, at least not more so than PCA.

The number of data points is rather a limiting factor, at it

decides the dimension of the distance matrix.

Some more recent developements in MDS run under term “manifold learning”, you can find more interesting stuff through google.

Anywhere, I am lookking forward to your part2 and part3

blogs about PCA.

James said this on October 5, 2009 at 6:46 pm |

The TagShadow mapes look very interesting. Are the positions calculated with PCA? It would also be interesting create a

2D map for these 1500 tags to see clusters among the tags. To create such map you just need to transpose tag table.

James said this on October 5, 2009 at 7:03 pm |

The matrix that I start from has one row per book and one column per tag. From that compute the covariance matrix and eventually the principal components. The plot is the first and second principal components mapped against each other. The third-fifth principal components contribute respectively to the RGB values of the point.

I divided these into “plots” based on a “pivot tag” as displaying 65,000 books proved a bit unmanageable. That’s where the “Master Cloud” came from.

I’ve made numerous other posts on this blog related to the TagShadow project and I’ve also started collecting my thoughts on the TagShadow forum.

It’s a pleasure to converse with someone who understands this (probably more than me). I’m by no means an expert on any sort matrix based analysis, be it PCA or MDS, but it’s been much fun plugging away at it and getting a chance to stretch my Java development muscles.

mentatjack said this on October 5, 2009 at 7:16 pm |