The answers given here for each homework question are only suggested answers. I accepted many other answers as long as your assumptions and data mining methods were well justified. These solutions are meant as a "motivational" piece and may introduce material that you have not learned yet but that I used to solve the problem (mainly in the last two problems). By the end of 202A, all of these methods will make sense to you.
I tend to write a lot of comments on papers, so do not be discouraged. I mark several things, especially errors and caveats in your method. Errors come in many flavors, but I am mainly looking for data mining errors: not understanding the structure (or the meaning) of the data, and not using your tools correctly to extract the data you need. The goal is to extract the data we are interested in as accurately as possible, with as few false positives and as few missed entries as possible, so I also pointed out these caveats in your commands.
Overall Comments and Suggestions

Put your commands and output in a fixed-width font such as Courier. This makes them much more readable. Everything else should be in a standard font such as Arial or Helvetica.
The first step in data mining is to understand the format of your data. Some of you made the critical error of extracting the wrong fields for your analysis.
| Field | Name | Description |
| --- | --- | --- |
| 1 | IP address | Originating IP address of the hit. |
| 4 | Date and time | Date and time of hit: [Date/Month/Year:Hour:Min:Sec |
| 5 | Time Zone | Time zone of the web server (always -7). This field is not used. |
| 6 | HTTP Method | The type of action requested from the web server: GET, POST, HEAD, LINK, OPTIONS, PROPFIND, PUT. (See RFC 2616 for more info.) |
| 7 | HTTP Request | The URL requested by the visitor, whether by entering a URL (direct access), clicking a link, or by reference. |
| 8 | HTTP Version | Version of HTTP protocol used: 1.0 or 1.1. |
| 9 | HTTP Status | Code denoting the result of the request. This is an abbreviated list containing only the codes found in the access log; for the complete list, see RFC 2616. 200 OK; 206 Partial Content; 301 Moved Permanently; 302 Found; 304 Not Modified; 400 Bad Request; 401 Unauthorized; 403 Forbidden; 404 Not Found; 405 Method Not Allowed; 416 Requested Range Not Satisfiable; 500 Internal Server Error; 501 Not Implemented |
| 10 | Bytes Transferred | Number of bytes transferred for this request. |
| 11 | HTTP Referrer | The page (if any) that led the visitor to the request. (A good way to see how people are getting to your page.) |
| 12 to end of line. | User Agent | The client software (browser, application or bot) that made the request. |
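
To make the field numbering in the table concrete, here is a minimal sketch in Python that splits one log line on whitespace and names the pieces by field number. The sample line, the function name `parse_hit`, and the dictionary keys are all hypothetical, made up for illustration; they do not come from the homework data. It also assumes the fields are separated by whitespace, which is how the field numbers above appear to be counted.

```python
# Hypothetical log line, invented for illustration (not from the homework data).
SAMPLE = (
    '192.0.2.10 - - [12/Oct/2008:06:25:14 -0700] '
    '"GET /index.html HTTP/1.1" 200 5120 '
    '"http://example.com/start.html" "Mozilla/5.0 (X11; Linux x86_64)"'
)


def parse_hit(line):
    """Split a log line on whitespace and label the fields listed in the table.

    Field N in the table is parts[N - 1] here, because Python counts from 0.
    """
    parts = line.split()
    return {
        "ip": parts[0],                                  # field 1
        "datetime": parts[3].lstrip("["),                # field 4 (starts with "[")
        "timezone": parts[4].rstrip("]"),                # field 5 (ends with "]")
        "method": parts[5].lstrip('"'),                  # field 6 (starts with a quote)
        "request": parts[6],                             # field 7
        "version": parts[7].rstrip('"'),                 # field 8 (ends with a quote)
        "status": int(parts[8]),                         # field 9
        "bytes": parts[9],                               # field 10, kept as text
        "referrer": parts[10].strip('"'),                # field 11
        "user_agent": " ".join(parts[11:]).strip('"'),   # field 12 to end of line
    }


hit = parse_hit(SAMPLE)
print(hit["ip"], hit["status"], hit["request"])
```

One caveat with this approach: a request URL that contains a space would shift every later field by one, so a line like that would be read incorrectly. Checking that the status field looks like a three-digit code is one way to spot such lines.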