Homework 4 (Translated)



After a relaxing break, your homework for this week consists of two parts. I highly encourage you to come to lab as I will help you get started.

Part 1

From page 23 of lecture 7: Make these plots, or something like them; that is, pick some collection of executives and compare them based on their emailing patterns. You are encouraged to find a solution other than the one I provided you, as the tools I used were (in part) aimed at showing off features of the language and are not necessarily optimal choices for the task at hand. Use your visualization(s) and give me some information about these executives. In particular:

  1. Produce the visualization described above.
  2. Using the counts as the variable for analysis, what does your visualization tell you? What is being represented?
  3. Based on your answer to the previous question, what does clustering in the visualization refer to? Are there even any clusters? What type of people are in each cluster? What does each cluster represent (this may require some Googling of the history of Enron and its collapse)?
  4. What other aspects of the email exchange patterns could be analyzed using these types of visualizations?
Hint: You need to modify your Python code from Homework 3 to create a matrix where each executive has its own row and column.
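
A minimal sketch of the matrix described in the hint, not the reference solution. It assumes your Homework 3 code can already produce a list of (sender, recipient) pairs; the names build_count_matrix and matrix_to_csv, the executive labels, and the sample messages are all made up for illustration.

```python
import csv
import io


def build_count_matrix(executives, messages):
    """Return counts where counts[i][j] is the number of emails
    sent from executives[i] to executives[j]."""
    index = {name: i for i, name in enumerate(executives)}
    n = len(executives)
    counts = [[0] * n for _ in range(n)]
    for sender, recipient in messages:
        # Skip messages that involve anyone outside the chosen group.
        if sender in index and recipient in index:
            counts[index[sender]][index[recipient]] += 1
    return counts


def matrix_to_csv(executives, counts):
    """Render the matrix as CSV text with a header row and row labels."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([""] + executives)
    for name, row in zip(executives, counts):
        writer.writerow([name] + row)
    return buf.getvalue()


# Made-up example data; replace with the pairs extracted from the mail files.
executives = ["exec_a", "exec_b", "exec_c"]
messages = [
    ("exec_a", "exec_b"),
    ("exec_a", "exec_b"),
    ("exec_b", "exec_c"),
    ("outsider", "exec_a"),  # ignored: sender is not in the chosen group
]

counts = build_count_matrix(executives, messages)
csv_text = matrix_to_csv(executives, counts)
print(csv_text)
```

If you write csv_text to a file in your own directory, R should be able to load it with read.csv(..., row.names = 1) and convert it with as.matrix() before plotting.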

Commentary: In other words, for each executive, we now know how many times that executive emailed every other executive in the company. Your task is to find a way to visualize the patterns in the data using R. Mark has given you a few examples: page 4 bottom-left, page 22 bottom-right. The variable to analyze is the number of emails sent from person A to person B (for each pair of executives). This exercise is an example of a practical use of what you will be studying in depth in 202B. As Mark mentions, you may find better ways to visualize the patterns of email exchange in the data. If so, by all means use them for your answers :-).

In your write-up, do as the blurb above says, and also provide your graphics and your code. Your R code should be on a separate page and in Courier font.

Part 2

From page 34, lecture 7: Characterize the space savings between storing data as a character vector versus a factor.

This is something you will see over and over again in 202B (with respect to execution time), so this is good practice!

Create a character vector of some fixed length (start with 1 element), and create a factor that has 1 level (1 character, 1 level, make sense?). Measure the space each one uses with the object.size() function given in the notes. Now increase the length of the character vector and the number of levels of the factor together, so that you have a character vector of length 2 and a corresponding factor with 2 levels. Keep doing this, measuring the amount of space used by each at every step, until you have, say, a factor with 200 levels and a character vector of length 200; you will need a loop. Plot your results: space consumed by a factor with n levels and by a character vector of length n, both against n. Create only ONE plot, so experiment with the different line type options (or colors if you have a color printer) and include a legend. Comment on the plot and compare the two curves.

The mail files are located on lab-compute in the 202A directory. Please have all of your output written to your own directory, not to the data directory, because that causes a mess. We have the /data directory and the 202A directory set to read-only, so you must use your own directories for storing your output and your code.

Due Thursday, November 29 in lecture. Please bring a printed copy of your assignment to class. ALSO, email me your R code (and Python code, if applicable) on Thursday so I can run it. You can email your entire assignment, but make sure you bring a printed copy to class so you can discuss your answers.