• Would you like to know more about the R code featured in the pairwise comparison post earlier this week.

In an earlier post this week, I provided TERR code to create a pairwise comparison table and attach it to a scatter plot.  That post wasn’t terribly long, but I wanted to explain the code line by line so that you get a feel for what’s it’s doing, but when I added that to the post, it was too long, so I have broken it out separately.  Now, for simplicity here, I will use a different table than the previous posting, but the code is exactly the same.

script corrected

(1) checkNumeric <- sapply(PairsInputTable, is.numeric)
(2) pairs <- cor(na.omit(PairsInputTable[, checkNumeric, drop=FALSE]))

(3) which <- row(pairs) > col(pairs)
(4) PairsOutputTable <- data.frame(
(5)     X = rownames(pairs)[row(pairs)[which]],
(6)     Y = colnames(pairs)[col(pairs)[which]],
(7)    Pairwise = pairs[which])

Line 1 —  an object called checkNumeric is created using the sapply function.  Sapply is party of the apply family of functions.  I learned about sapply with lapply.  Lapply returns a list and is good to use when the content returned is of different types, but you can be more efficient with sapply when the content returned is homogeneous, such as with this case.  Both sapply() and lapply() will apply a function to the object referenced.  In this case, the oject is ‘PairsInputTable’ and the function is is.numeric.  Line 1  is checking to see that all the data is numeric and is assigning the result to an object called checkNumeric. Here is a sample set of code in RStudio showing what line 1 is doing.

I created 3 vectors and combine them into a data frame so that I have a small table to work with.

Create vectors

Here is what myTable looks like when you call it.

call table

Here the checkNumeric object is created with sapply and then called.

check numeric

Just for comparison, I created a fourth vector that is not numeric and created a table called mytable2, along with another object checkNumeric2 so that you can see how the results differ.

check numeric with non nulls mytable2

Now that we know what checkNumeric will return, let’s move on to Line 2.

(2) pairs <- cor(na.omit(PairsInputTable[, checkNumeric, drop=FALSE]))

In line 2, three things are happening —

  1. The PairsInputTable[, checkNumeric, drop = FALSE])) code is dropping any non-numeric columns.  When working with data frames, square brackets are used to reference specific elements in the data frame.  The syntax is Table[row,column], so PairsInputTable[,checkNumeric, drop = FALSE] does nothing to the rows, and uses checkNumeric to drop non-numeric columns.
  2. The na.omit function is dealing with NA values.
  3. The cor() function is creating the correlation table.

Here are some screenshots with the code broken down so you can make sense of it.

This is where vector4 is dropped because it is not numeric.

mytableclean

Now, that was a fairly simple example, and here I have added two additional vectors with non-numeric and NA values so that you can see how they are impact and also to be able to see the result of na.omit.  As you can see, na.omit will remove the entire record where there are NA values.

final

Finally, here I have applied the cor() function to mytableclean.  In this case the results are 1 because my simple example is a lot of small numbers that are similar values, but you can see what the code is doing.  It is returning the correlation between each of the vectors.

cor

Line 3 — (3) which <- row(pairs) > col(pairs)

Before I dive into line 3, I am going to clean stuff up a bit now.  As I’ve gone through several different examples and cases, it’s gotten a bit messy, and I need a correlation table that returns several values, not just 1, so I have replaced the vectors and this is our new starting place, but all of the code is the same.

restart

The row and col functions are simply returning position numbers, not the actual correlation table values.  Then, the object ‘which’ is created to hold the result of row(pairs) > col(pairs).  That operation is looking to see if the row position is greater than the column position. The resulting TRUE values are the locations of correlations between the vectors that we want in the final output.

Line3

Lines 4 – 7

(4) PairsOutputTable <- data.frame(
(5) X = rownames(pairs)[row(pairs)[which]],
(6) Y = colnames(pairs)[col(pairs)[which]],
(7) Pairwise = pairs[which])

Lines 4 – 7 are creating the PairsOutputTable object, which is our final result.  The data frame is made up of three columns — X, Y and Pairwise.  In order to figure out how this works, let’s remind ourselves of what ‘pairs’ and ‘which’ are.

pairs which

And the output we are trying to create would look like this:

output

So, the next lines of code are  used to define the X and Y columns:

(5) X = rownames(pairs)[row(pairs)[which]],
(6) Y = colnames(pairs)[col(pairs)[which]],

As you can see in the screenshot below rownames(pairs) and colnames(pairs) simply returns the row and column names of the data frame.

row and col names

Similarly, row(pairs) and col(pairs) returns the row numbers for all positions in the data frame and col numbers for all positions in the data frame.

row pairs col pairs

The which object will return TRUE/FALSE.

which

If we take the code one step further and add in [row(pairs)[which]] and col(pairs)[which]], this is what we get.  It is returning the rownames from pairs but is being influenced by the TRUE/FALSE in WHICH.

onestep further

If you are unfamiliar with R, this is probably the most confusing part.  Single square brackets are used to return elements from a data frame based on position.  In the screenshot below, I am calling our eventual output, the PairsOutputTable.  When I enter PairsOutputTable[,2], the second column is returned.  When I call PairsOutputTable[2,], the second row is returned.  It starts getting a little complicated to explain when and why the single square brackets versus the double square brackets should be used, so I’m going to skip over that part.

calling elements

So, here is what we know:

  1. rownames and colnames returns the names of the rows and columns row and col names
  2. row(pairs) and col(pairs) returns the row  and column numbers  row pairs col pairs
  3. which returns TRUE/FALSE  which
  4. Square brackets pick up elements in certain positions  onestep further
  5. The combined code returns this  onestep further

In X, we get Vector 2, Vector 3, Vector 3 because as you look across the rows, there is one TRUE value in Vector 2 and two TRUE values in Vector 3.

In Y, we get Vector 1, Vector 1, and Vector 2 because as you look down the columns, there are two TRUE values in Vector 1 and one TRUE value in Vector 2.

WHEW! That was a lot of explanation for 2 lines of code.

If we throw in the last part line of code to define the 3rd column, Pairwise, the result this.  This line of code is creating an object called ‘Pairwise’ that looks at the ‘pairs’ object.  Square brackets implies that we want to return specific elements of the ‘pairs’ object based on the what is contained in the ‘which’ object.  ‘Which’ is only TRUE/FALSE, so what will be returned are only the TRUE values.

which  pairwise

When you string it all together, it looks like this:

throw it all together

Or this:  …depending on how much you want to do at one time.

other

 

Wow, that was a much longer write up than I expected, but hopefully it fully explains how the code works.

 

 

Written by Julie Schellberg of Big Mountain Analytics, LLC
Residing in Whitefish, MT, an analytics partner with Ruths.ai. Specializing in Spotfire analytics, dabbling in Power BI and R.