Homework 4 Code Hint



Once you have created your matrix using Python and written both the matrix and the names of the executives to disk, use the following R code to import them and do some cleaning.
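# Read the matrix and the executive addresses; neither file has a header row.
enron <- read.table("enron_matrix.dat",header=FALSE)
execs <- read.table("execs.dat",header=FALSE,colClasses="character")
# Keep only the part of each address that comes before the @ sign.
my.vec <- apply(execs, 1, function(an.exec){return(strsplit(an.exec,"@",fixed=TRUE)[[1]][1])})

# Label the rows and the columns of the matrix with the cleaned names.
row.names(enron) <- as.matrix(my.vec)
names(enron) <- my.vec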

First, I read in the file enron_matrix.dat without a header. Then I read in execs.dat, also without a header. Recall that execs.dat contains the names of all possible senders and receivers.
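If you want to try the two read.table calls before your own files are ready, here is a minimal sketch. The file contents and the names matrix.file, execs.file, enron.demo and execs.demo are made up for illustration; your real enron_matrix.dat and execs.dat will be larger.

```r
# Hypothetical stand-ins for enron_matrix.dat and execs.dat, written to temp files.
matrix.file <- tempfile(fileext = ".dat")
execs.file  <- tempfile(fileext = ".dat")
writeLines(c("0 2 1", "3 0 0", "1 4 0"), matrix.file)
writeLines(c("alice.a@enron.com", "bob.b@enron.com", "carol.c@enron.com"), execs.file)

# Same calls as in the hint: no header row in either file.
enron.demo <- read.table(matrix.file, header = FALSE)
execs.demo <- read.table(execs.file, header = FALSE, colClasses = "character")

dim(enron.demo)          # 3 rows, 3 columns
class(execs.demo[, 1])   # "character", because of colClasses
```

The matrix should be square, with one row and one column per line of the names file, so comparing dim(enron.demo) with nrow(execs.demo) is a quick sanity check.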

However, there is a problem. By default, R versions before 4.0.0 convert character columns to factors when reading a table (since 4.0.0, stringsAsFactors defaults to FALSE). We do not want factors here! One way to overcome this is to use the colClasses option in read.table. Using this option, we can specify the class of each column as a character vector of class names. Since execs contains only one column, I do not need a vector of classes, just one class. In class I provided another way to do this. The following snippet is the general way to extract the desired data from an erroneously detected factor:

levels(execs[i,1])[as.numeric(execs[i,1])]

Recall that every element of execs was initially a factor (in lab, that is). Using a loop, I iterate over i and use the levels function to extract the levels of the factor, which is a character vector. as.numeric(execs[i,1]) converts the factor value found at execs[i,1] into its integer code; the integer is the position of that value's level in the vector of levels. Indexing the levels by that integer returns the data value in its native form, in this case a string. That is useful if you get stuck in factor land.
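To see the factor lookup in isolation, here is a small sketch. The addresses and the names execs.f, code and all.names are made up; stringsAsFactors = TRUE is passed explicitly so that the column is a factor regardless of your R version. The as.character line at the end is an alternative that was not part of the lab, shown only for comparison.

```r
# Made-up addresses; stringsAsFactors = TRUE forces the factor behaviour.
execs.f <- read.table(text = "bob.b@enron.com\nalice.a@enron.com\nbob.b@enron.com",
                      header = FALSE, stringsAsFactors = TRUE)
is.factor(execs.f[, 1])             # TRUE

i <- 1
code <- as.numeric(execs.f[i, 1])   # integer code: position in the vector of levels
levels(execs.f[, 1])[code]          # "bob.b@enron.com", the original string

# Alternative (an addition, not from the lab): convert the whole column at once.
all.names <- as.character(execs.f[, 1])
```

Note that as.numeric on a factor gives the integer codes, not the original values, which is exactly why the lookup through levels is needed.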

But there is still one more problem. All of the executive names are actually email addresses, and every one of them ends in the redundant @enron.com. Let's get rid of it. The strsplit function takes a string and splits it on a delimiter, which by default is interpreted as a regular expression. In this case the delimiter is the @ sign. fixed=TRUE tells strsplit to treat the delimiter as a literal string rather than a regular expression.

Now here is where it gets weird. When I split a single string on a delimiter, a list with one element is returned to me (list elements are accessed with [[ ]]). That element is a vector of the strings produced by the split. The first element of the vector is the name of the executive, and the second element is enron.com. We want to take the first element of the vector contained in the first element of the list returned to us...weird, I know.
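The list-inside-a-vector structure is easier to see on a single example. This is only a sketch; the address and the names address, pieces and many are made up.

```r
address <- "jane.doe@enron.com"    # made-up address
pieces <- strsplit(address, "@", fixed = TRUE)

is.list(pieces)    # TRUE: strsplit always returns a list
length(pieces)     # 1: one list element per input string
pieces[[1]]        # "jane.doe" "enron.com"
pieces[[1]][1]     # "jane.doe"

# strsplit is vectorized: two input strings give a list with two elements.
many <- strsplit(c("jane.doe@enron.com", "john.roe@enron.com"), "@", fixed = TRUE)
length(many)       # 2
```

The list is there because strsplit accepts a whole vector of strings and each string may split into a different number of pieces.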

Instead of using a loop, I use the apply function. apply takes a matrix or array (a data frame such as execs is coerced to a matrix first) and applies a function to either its rows or its columns. The first argument is the object I want to apply the function to, execs. The second argument is the margin that I want to apply the function over: 1 means apply to each row, 2 means apply to each column. The third argument is a function, either built-in or user-defined. I defined an unnamed function (unnamed since I won't use it again) that accepts a dummy argument an.exec. This dummy argument corresponds to one row of execs, which here is a single email address. I then pass this dummy variable to the strsplit function. You should avoid using loops when possible, although here a loop may be cleaner.
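For comparison, here is a sketch of the same cleaning step written three ways. The data and the names execs.demo, strip.domain, via.apply, via.loop and via.sapply are made up for illustration, and the sapply version is an extra alternative that the hint itself does not use.

```r
# Made-up one-column data frame standing in for execs.
execs.demo <- data.frame(V1 = c("jane.doe@enron.com", "john.roe@enron.com"),
                         stringsAsFactors = FALSE)

# Helper (named here only so all three versions can share it).
strip.domain <- function(an.exec) strsplit(an.exec, "@", fixed = TRUE)[[1]][1]

# 1. apply over rows, as in the hint.
via.apply <- apply(execs.demo, 1, strip.domain)

# 2. An explicit loop over the rows.
via.loop <- character(nrow(execs.demo))
for (i in seq_len(nrow(execs.demo))) {
  via.loop[i] <- strip.domain(execs.demo[i, 1])
}

# 3. sapply over the single column (an alternative, not from the hint).
via.sapply <- sapply(execs.demo[, 1], strip.domain, USE.NAMES = FALSE)

all(via.apply == via.loop)     # TRUE
all(via.apply == via.sapply)   # TRUE
```

All three produce the same character vector, so pick whichever one you find easiest to read.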

Now my.vec contains the cleaned-up executive names. Next we want to assign these names to the rows and to the columns of our matrix. We use row.names(enron) <- as.matrix(my.vec) to assign the row names. In current versions of R the as.matrix coercion is not required: R vectors have no row or column orientation, and row.names accepts a plain character vector, so row.names(enron) <- my.vec works as well. Row names of a data frame must be unique, so this step fails if two executives end up with the same name. I use names(enron) <- my.vec to assign labels to the columns.
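Once the labels are in place you can index the matrix by name, which is a handy check that the labeling worked. A minimal sketch, with made-up names and counts (my.vec.demo and enron.demo are hypothetical stand-ins for my.vec and enron):

```r
my.vec.demo <- c("jane.doe", "john.roe", "sam.poe")   # made-up cleaned names
enron.demo <- as.data.frame(matrix(c(0, 2, 1,
                                     3, 0, 0,
                                     1, 4, 0), nrow = 3, byrow = TRUE))

anyDuplicated(my.vec.demo)            # 0 means the names are unique, as row names require

row.names(enron.demo) <- my.vec.demo  # row labels
names(enron.demo) <- my.vec.demo      # column labels

enron.demo["jane.doe", "john.roe"]    # 2: the entry in row jane.doe, column john.roe
```

Whether a row stands for the sender or the receiver depends on how you built the matrix in Python, so check that against your own code.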

Now you can have some fun with your visualization method!

