Homework 1: Solutions



The answers given here for each question are only suggested answers. I accepted many other answers as long as your assumptions and data mining methods were well justified. These solutions are meant as a "motivational" piece and may introduce techniques that you have not learned yet but that I used to solve the problem (mainly in the last two problems). By the end of 202A, all of these methods will make sense to you.


Because the document is long, I have split the solutions up by problem.

Errors, Caveats and Gotchas Oh My!

I tend to write a lot of comments on papers, so do not be discouraged. I mark several kinds of things, especially errors and caveats in your method. Errors come in many flavors, but I am mainly looking for data mining errors: not understanding the structure (or the meaning) of the data, and not using your tools correctly to extract the data you need. The goal is also to extract the data we are interested in as accurately as possible, with as few false positives and as few missed entries as possible, so I pointed out these caveats in your commands as well.
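To make the false-positive caveat concrete, here is a minimal sketch in Python. The language choice and the three log lines are assumptions made for illustration; they are not taken from the homework data. It counts "404 Not Found" hits twice: once by searching the whole line for the text 404, and once by checking only the HTTP status field (the ninth whitespace-separated field).

```python
# Hypothetical log lines for illustration only (not from the homework data).
LINES = [
    '192.0.2.1 - - [12/Oct/2009:14:03:27 -0700] "GET /a.html HTTP/1.1" 404 210 "-" "Mozilla/5.0"',
    '192.0.2.2 - - [12/Oct/2009:14:06:11 -0700] "GET /img/404.png HTTP/1.1" 200 1500 "-" "Mozilla/5.0"',
    '192.0.2.3 - - [12/Oct/2009:14:07:52 -0700] "GET /b.html HTTP/1.1" 200 4040 "-" "Mozilla/5.0"',
]

# Naive approach: any line containing the text "404" anywhere.
naive = [line for line in LINES if "404" in line]

# Field-aware approach: compare only the HTTP status field.
# Fields are numbered from 1 in the write-up, so field 9 is index 8 here.
by_field = [line for line in LINES if line.split()[8] == "404"]

print(len(naive), "lines contain the text 404")
print(len(by_field), "lines have HTTP status 404")
```

The naive search also picks up a file named 404.png and a byte count of 4040, which is exactly the kind of false positive I marked on papers.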

Overall Comments and Suggestions

The first step in data mining is to understand the format of your data. Some of you made the critical error of extracting the wrong fields for your analysis.

Field              Name               Description
1                  IP address         Originating IP address of the hit.
4                  Date and time      Date and time of the hit, in the form [Date/Month/Year:Hour:Min:Sec
5                  Time zone          Time zone of the web server (always -7). This field is not used.
6                  HTTP method        The type of action requested of the web server: GET, POST, HEAD, LINK, OPTIONS, PROPFIND, PUT. (See RFC 2616 for more info.)
7                  HTTP request       The URL as requested by the visitor by entering a URL (direct access), clicking a link, or by reference.
8                  HTTP version       Version of the HTTP protocol used: 1.0 or 1.1.
9                  HTTP status        Code denoting the result of the request. (This is an abbreviated list containing only the codes found in the access log. For the complete list, see RFC 2616.)
                                          200  OK
                                          206  Partial Content
                                          301  Moved Permanently
                                          302  Found
                                          304  Not Modified
                                          400  Bad Request
                                          401  Unauthorized
                                          403  Forbidden
                                          404  Not Found
                                          405  Method Not Allowed
                                          416  Requested Range Not Satisfiable
                                          500  Internal Server Error
                                          501  Not Implemented
10                 Bytes transferred  Number of bytes transferred for this request.
11                 HTTP referrer      The page (if any) that brought the user to the request. (A good way to see how people are getting to your page.)
12 to end of line  User agent         The client software (browser, application or bot) that made the request.
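The field numbers above are positions after splitting a log line on whitespace. As a rough sketch of that mapping, the Python snippet below splits one line and names each field as in the table. The sample line, the function name parse_hit, and the dictionary keys are all hypothetical and chosen only for illustration.

```python
# Hypothetical sample line (not from the homework data).
SAMPLE = ('192.0.2.10 - - [12/Oct/2009:14:03:27 -0700] '
          '"GET /index.html HTTP/1.1" 200 5120 '
          '"http://example.com/start" "Mozilla/5.0 (X11; Linux) Firefox/3.5"')


def parse_hit(line):
    """Split a log line on whitespace and name the fields as in the table.

    The table counts fields from 1, so field n is f[n - 1] here.
    """
    f = line.split()
    return {
        "ip": f[0],                                  # field 1
        "datetime": f[3].lstrip("["),                # field 4
        "timezone": f[4].rstrip("]"),                # field 5
        "method": f[5].lstrip('"'),                  # field 6
        "request": f[6],                             # field 7
        "http_version": f[7].rstrip('"'),            # field 8
        "status": f[8],                              # field 9
        "bytes": f[9],                               # field 10
        "referrer": f[10].strip('"'),                # field 11
        "user_agent": " ".join(f[11:]).strip('"'),   # field 12 to end of line
    }


hit = parse_hit(SAMPLE)
print(hit["ip"], hit["status"], hit["request"])
print(hit["user_agent"])
```

This sketch assumes every line is well formed. A request or referrer that itself contains a space would shift the later fields, which is one of the caveats worth checking for in real data.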

If you have any questions about this homework assignment or the solutions, email me or see me during office hours.