PCA (Principal Component Analysis) is a core data science technique that not only helps you understand collinearity among the independent variables in a dataset, but can also give you a reduced-dimension model by rotating your high-D data and keeping only the first few components. Here’s some quick info on getting PCA in Spotfire. If you want more info on PCA, check out Wikipedia or the great interactive example on Setosa.
I often use PCA as a first step when I get a dataset, to understand which variables have a strong relationship and how much “information” is stored in the dataset. By looking at the variation explained by each Principal Component, you get a sense for how much unique information is in each variable (and which ones are worth holding on to for analysis). PCA can’t handle categorical variables, so you would need to dummy code them prior to analysis.
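As a rough illustration of the dummy coding step, here is a minimal sketch using base R's model.matrix(). The data frame `wells` and its columns are made up for the example and are not part of the original post.

```r
# Hypothetical example data: two numeric columns and one categorical column
wells <- data.frame(
  depth = c(1200, 1500, 1100, 1800, 1650, 1300),
  rate  = c(35, 50, 28, 64, 58, 40),
  basin = factor(c("A", "B", "A", "C", "B", "C"))
)

# model.matrix() expands the factor into 0/1 indicator columns.
# The "- 1" drops the intercept, so the factor gets one column per level.
dummied <- model.matrix(~ . - 1, data = wells)

# The result is an all-numeric matrix that prcomp() can accept
dummied.pca <- prcomp(dummied, center = TRUE, scale. = TRUE)

colnames(dummied)
```

Keeping one indicator column per level means those columns always sum to 1, so the last principal component carries essentially no variance. That is harmless for exploration, but you may prefer to drop one level.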
Running PCA in TERR
Because PCA is built into R (the prcomp function), you can easily drop it into Spotfire as a TERR Data Function.
It’s important to remember that PCA is not robust to missing data. Even if only one value is missing from a row of data, that whole sample gets thrown out (or the function croaks). Replace those NAs with a reasonable value (like the column average, 0, or the minimum) or get rid of the row.
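The two options for missing values might look something like this in base R. This is only a sketch: the small `Data` table here is invented for illustration, and mean imputation is just one of the "reasonable value" choices mentioned above.

```r
# Hypothetical input with a couple of missing values
Data <- data.frame(
  x = c(1.0, 2.0, NA, 4.0, 5.0),
  y = c(10, NA, 30, 40, 50),
  z = c(3, 1, 4, 1, 5)
)

# Option 1: drop any row that contains an NA
dropped <- na.omit(Data)

# Option 2: replace each NA with the mean of its column
imputed <- as.data.frame(lapply(Data, function(col) {
  col[is.na(col)] <- mean(col, na.rm = TRUE)
  col
}))

nrow(dropped)            # rows left after dropping
colSums(is.na(imputed))  # remaining NAs per column
```

Dropping rows is the safer default when only a few samples are affected. Imputing keeps every row, but it pulls the filled-in values toward the center of the data, which can understate the variance of that column.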
At Ruths.ai, we are going to release a Data Science Toolkit that packages this and other common data science techniques into a nice Tools menu. So stay tuned!
The code below runs the PCA and also builds the output tables you will need to export from the Data Function.
# Assume your incoming DataTable is called "Data"
# Get rid of na's
input.data = na.omit(Data)
# Run the PCA
data.pca = prcomp(input.data, center=TRUE, scale. = TRUE)
# Collect outputs
Output <- capture.output(print(data.pca))
# For each input data row, calculate the PC values
RotatedValues <- predict(data.pca, Data)
# Get the rotation matrix
RotationMatrix <- as.data.frame(data.pca$rotation)
# We need to make the rownames a column export
RotationMatrix$Column <- row.names(RotationMatrix)
Components <- data.frame(PC=colnames(as.data.frame(data.pca$rotation)), SDev=data.pca$sdev)
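To turn the standard deviations in the Components table into the "variation explained" figure mentioned earlier, you can square them and divide by the total. The sketch below assumes a synthetic `Data` table generated just for the example. The VarianceExplained and Cumulative columns are additions for illustration, not outputs of the original Data Function.

```r
# Hypothetical stand-in for the incoming "Data" table:
# column b is nearly a multiple of a, column c is independent noise
set.seed(42)
n <- 50
a <- rnorm(n)
Data <- data.frame(a = a, b = 2 * a + rnorm(n, sd = 0.1), c = rnorm(n))

data.pca <- prcomp(na.omit(Data), center = TRUE, scale. = TRUE)

# The variance of each component is the square of its standard deviation
variance <- data.pca$sdev^2

Components <- data.frame(
  PC = colnames(data.pca$rotation),
  SDev = data.pca$sdev,
  VarianceExplained = variance / sum(variance),
  Cumulative = cumsum(variance) / sum(variance)
)

print(Components)
```

Because a and b are almost perfectly correlated here, the last component should explain very little of the variance. That is the kind of redundancy this table is meant to surface.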
Founder and CEO of Petro.ai