GPS and WiFi Datasets



Field of Application: mapping, GPS, 802.11, WiFi, Internet, navigation, visualization.
Goal: understanding geographic data and visualization of such data.

routelatlonsp.Rdata - R dataset containing GPS data that can be read in R.
routelatlonsp.csv

ryantrip-wifi-gps.Rdata - R dataset containing GPS and WiFi data.
ryantrip-wifi-gps.csv

Explanation of the Data
The datasets above contain two types of data recorded on a road trip. On our way from home, I recorded wireless access points (APs) and their locations; on the way back, I recorded only our GPS coordinates, once per second. The datasets are grouped together here because they were recorded in the same way and in the same context.

Dataset Format - ryantrip-wifi-gps

Column # | Variable Name | Type/Units | Layman's Description
1 | Latitude | degrees | Latitude reading at which the AP was discovered.
2 | Longitude | degrees | Longitude reading at which the AP was discovered.
3 | SSID | text | The "name" of the AP.
4 | Type | factor | We won't use this. It specifies the type of infrastructure on which the AP operates: either as part of a network (BSS), or as an ad-hoc cluster of systems.
5 | BSSID | xx:xx:xx:xx:xx:xx | Usually called a MAC address. A unique identifier assigned to every Ethernet device manufactured.
6 | TimeGMT | time | Greenwich Mean Time
7 | SNR | decibels (dB) | "Signal to Noise Ratio"
8 | Sig | decibels (dB) | Signal strength.
9 | Noise | decibels (dB) | Measurement of the amount of noise/randomness (useless information) detected in transmission.
10 | Secure | boolean | True if WEP (Wired Equivalent Privacy) is enabled on the AP, false otherwise (unsecured).
11 | Channelbits | flag | ???
12 | Bcnintvl | integer | Can be adjusted to improve power consumption by clients. Useless in this study.
13 | DataRate | integer, Mbps | The speed at which data is transferred over the air. (megabits per second)
14 | LastChannel | integer | Wifi data is transferred over one of 11 channels. The value of this variable indicates which channel was used.

TASKS for ryantrip-wifi-gps dataset.

These are just some things to think about. These questions can be answered using statistics and data analysis. There are multiple ways to approach most of these, mostly dictated by the reader's depth of knowledge in stats.

  1. There are two common data rates in use for wireless internet: 54Mbps and 11Mbps.
    1. Make a histogram of the data rates of the access points in this dataset.

      hist(data$DataRate)

    2. Which datarate is more common?
    3. Are there any other datarates in the dataset?
  2. The variable SSID in layman’s terms is the access point’s “name.” It can sometimes be blank.
    1. What are the 6 most common SSIDs? Why do you think these SSIDs are the most common? What do you think is “special” about them?

      You can obtain a count by using summary(data$SSID).

    2. Do you notice any patterns in the SSIDs? What do you think these patterns represent?
  3. The variable Sig contains the strength of the wireless signal in decibels.
    1. Make a histogram of signal strengths. What does the histogram suggest about signal strengths encountered?
    2. What are the mean, median, and standard deviation of the signal strengths? What does this suggest?
    3. Make a boxplot of the signal strengths. Are there any outliers? What are the SSIDs corresponding to these access points?

      Outliers are represented on a boxplot with dots. Find the third quartile and call it q3. Then run the following command

      boxplot(data$Sig[data$Sig > q3])

    4. Can you provide some reasons why these outliers may have such strong signals?
  4. Take a look at the Noise variable. Do you notice anything strange about this variable?
  5. The variable Secure is a Boolean variable. That is, the value is TRUE if the access point is secure (requires a password or key to use) and FALSE if it is not. If an access point is not secure (Secure = FALSE), then anyone can access the network connected to the access point (but not necessarily the Internet).

    Use data[data$Secure==FALSE,] to retrieve the rows corresponding to unsecured APs. Then use the nrow command on this result to help you determine the percentage.

    What percentage of access points in this dataset are not secured?

  6. Businesses, department stores, rest stops, hotels, restaurants, etc. sometimes provide wireless internet service to their customers (for free, or for pay), and sometimes have their own private access points for conducting their own transactions. The SSIDs for access points at each of these places usually follow some pattern. By using this information, we can very roughly approximate the location of the facility in question (or the nearest cross-streets), or at least identify that there is one nearby.

    The table below lists some common SSID naming schemes in this dataset:

    SSID Scheme | Business/Venue
    Orangen (n is an integer, or may be blank) | The Home Depot
    Wayport_Access | McDonald's

    1. At some of these locations, there may be multiple APs associated with one physical location. For example, orange2, orange4 and orange8 may all be located at one Home Depot location.

      Choose from either McDonald’s or Home Depot.

      Devise some way of determining which APs correspond to a physical location of the venue you chose (visually or statistically). Describe your method. After this, you can use the aggregate command to calculate the rough estimate of the location of the actual venue.
    2. Using this dataset and your algorithm from part 1, plot the locations of the venues (for example, if you chose McDonald’s, plot all of the McDonald’s locations as suggested by the SSIDs) using some graphical method.
  7. There are a lot of APs in this dataset whose SSIDs begin with “2WIRE.” This is an AP made by a specific manufacturer, for use with internet and multimedia services provided by AT&T, Qwest and Verizon. Assume that each AP is one subscriber.
    1. How many 2WIRE subscribers are there in this dataset?
    2. Attempt to determine which areas 2WIRE serves statistically, or visually.
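
One possible way to approach task 6 is sketched below, under stated assumptions rather than as the intended solution: sightings of a venue's SSID that follow each other closely in space are treated as one physical location, and their coordinates are averaged with aggregate. The toy data frame aps, the helper groupByGap, and the 0.01 degree threshold are all hypothetical; only the column names come from the format table above.

```r
# Toy stand-in for the rows of the dataset whose SSID matches one venue
# scheme. These coordinates are invented for illustration only.
aps <- data.frame(
  Latitude  = c(34.0501, 34.0503, 34.0502, 35.2210, 35.2212),
  Longitude = c(-118.2501, -118.2499, -118.2500, -119.0105, -119.0103),
  SSID      = c("orange2", "orange4", "orange8", "orange", "orange3")
)

# Hypothetical helper: walk through the rows in recording order and start a
# new group whenever the jump in latitude or longitude from the previous row
# exceeds 'gap' degrees (0.01 degrees of latitude is roughly 1.1 km).
groupByGap <- function(lat, lon, gap = 0.01) {
  jump <- c(FALSE, abs(diff(lat)) > gap | abs(diff(lon)) > gap)
  cumsum(jump) + 1
}

aps$Venue <- groupByGap(aps$Latitude, aps$Longitude)

# Average the coordinates within each group to estimate each venue's location.
venues <- aggregate(cbind(Latitude, Longitude) ~ Venue, data = aps, FUN = mean)
print(venues)
```

With the real data, the rows for one venue would first have to be selected by SSID, and the threshold would need tuning; plot(venues$Longitude, venues$Latitude) then gives a rough map of the estimated locations. Grouping by gaps relies on the rows being in the order they were recorded.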

 


Dataset Format - routelatlonsp

Column # | Variable Name | Type/Units
1 | Time | UTC, hhmmss.cc
2 | Latitude | Degrees and decimal minutes, ddmm.mm
3 | Longitude | Degrees and decimal minutes, ddmm.mm
4 | Speed | Left as an exercise (see the tasks below)
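
As a worked example of the coordinate format: a reading of 4030.50 means 40 degrees and 30.50 minutes, which is 40 + 30.50/60 = 40.508333 decimal degrees. The sketch below assumes the fractional part is decimal minutes (this is how the decDeg function later on this page treats it); the sample readings are made up and do not come from the dataset.

```r
# Hypothetical sample readings in ddmm.mm form (not taken from the dataset).
raw <- c(4030.50, -8315.25)

deg     <- trunc(abs(raw) / 100)     # whole degrees: 40 and 83
minute  <- abs(raw) - 100 * deg      # minutes with decimal part: 30.50 and 15.25
decimal <- sign(raw) * (deg + minute / 60)

print(decimal)
```

The sign is taken from the original reading, so a negative reading stays negative after conversion.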

TASKS for routelatlonsp dataset.

These are just some things to think about. These questions can be answered using statistics and data analysis. There are multiple ways to approach most of these, mostly dictated by the reader's depth of knowledge in stats.

  1. After setting your working directory, load the R dataset, routelatlonsp.Rdata using the code:

    load("routelatlonsp.Rdata")

    This loads a dataframe called data into your workspace.

  2. Create a new variable called Speed. There are two ways to approach this. The first is to take the coordinates at times t and t-1, compute the distance between them with the distance formula (or a more sophisticated method), and convert the result to miles per hour. The second is to modify the Perl script to return the speed data and then merge the datasets in R. The two methods will give slightly different answers.
  3. The variable Speed contains my current speed as reported by the GPS. Well-documented datasets should always state the units used for numeric variables (if any), but I have decided to leave them out. It is safe to assume that the speed is measured in some unit per hour.
    1. Using basic knowledge of the capabilities of the modern automobile, what is the most likely unit in which speed is measured?
    2. Create a new variable called speedTrans to convert this speed to the units from part 1 (use Google Calculator to find the conversion). What expression did you use? Your code should look like:

      speedTrans <- data$Speed*___?____

    3. Plot a histogram of speed.

      hist(speedTrans)

      What do you notice? (it might help to do the plot again, removing speed=0). What does this suggest about the roads and routes that make up the trip?

      You can remove cases where speed is 0 using the code:

      hist(data$Speed[data$Speed != 0])

  4. Plot speed as a function of time using plot(data$Speed).
    1. Which time indices most likely correspond to travel on a highway? Explain.
    2. Provide an explanation for speeds of 0 mph. How many times does this happen? Sometimes speed is 0 for long periods, and sometimes for just a few seconds. What does each situation suggest?
    3. How much time could have been saved if our speed had never been 0?
  5. Export your R dataset to a comma-separated values file. Import it into GPSVisualizer and create some type of visualization that illustrates speed throughout the trip. But first, convert the dataset into one that GPSVisualizer can read.

    Copy and paste the following R code into a blank file in your working directory. Name the file convertGPS.R. Use the command source("convertGPS.R") to read in the file.
    You could also just copy and paste the following code into R, but be sure to hit enter after the last line of code after pasting.

    # Function decDeg takes a data frame 'data' and converts the latitude reading in
    # column latcol and the longitude reading in column loncol from ddmm.mm
    # (degrees and decimal minutes) into decimal degrees.
    decDeg <- function(data, latcol, loncol) {

          # Convert one vector of ddmm.mm readings to signed decimal degrees.
          toDecimal <- function(x) {
                Deg <- trunc(abs(x) / 100)       # whole degrees
                DecMin <- abs(x) - 100 * Deg     # minutes, including the decimal part
                # Take the sign from x itself so readings with 0 whole degrees
                # are not zeroed out.
                sign(x) * (Deg + DecMin / 60)
          }

          data[, latcol] <- toDecimal(data[, latcol])
          data[, loncol] <- toDecimal(data[, loncol])

          # Rename the converted columns to "latitude" and "longitude".
          names(data)[latcol] <- "latitude"
          names(data)[loncol] <- "longitude"
          data
    }

    To convert the dataset, enter the command

    newDatasetName <- decDeg(data,2,3)

    To write the new dataset as a CSV file, enter

    write.csv(newDatasetName, file="routelatlonsp.csv")

    Now feed that data file to GPSVisualizer. That part is left as an exercise.

  6. Compute the midpoint (GPS coordinate) of the trip using the following methods
    1. Draw a line from start to finish, find the midpoint.
    2. Calculate the total distance traveled and divide by 2.
    3. Calculate the total time traveled (t_final – t_initial) and divide by 2.
    4. Draw dotted vertical lines on your original plot designating these three calculations.
    5. What city can best be described as the halfway point in this trip?
  7. The elusive variable Time is measured in a universal time format. Recall that the GPS unit records an entry (a data row) once every second.
    1. Is the variable Time really necessary to answer the rest of this question?
    2. How many hours and minutes of this trip were recorded by the GPS? What statistical concept does this correspond to? (range)
    3. For how much time (hours and minutes) total is the speed 0?
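
Tasks 2 and 7 can be approached along the following lines. This is one hedged possibility, not the author's method: it estimates speed between consecutive one-second fixes with the haversine great-circle formula, and uses rle to total the time spent stopped. The function names haversineMiles and speedMph, the Earth radius of 3958.8 miles, and the toy coordinates are assumptions for illustration; the coordinates must already be in decimal degrees (for example after running decDeg).

```r
# Great-circle distance in miles between points given in decimal degrees.
# 3958.8 miles is a commonly used mean Earth radius (an assumption here).
haversineMiles <- function(lat1, lon1, lat2, lon2, radius = 3958.8) {
  toRad <- pi / 180
  dLat <- (lat2 - lat1) * toRad
  dLon <- (lon2 - lon1) * toRad
  a <- sin(dLat / 2)^2 +
       cos(lat1 * toRad) * cos(lat2 * toRad) * sin(dLon / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

# Speed in miles per hour between consecutive rows, assuming one row per
# second. The first row has no predecessor, so its speed is NA.
speedMph <- function(lat, lon) {
  n <- length(lat)
  miles <- haversineMiles(lat[-n], lon[-n], lat[-1], lon[-1])
  c(NA, miles * 3600)
}

# Toy track (invented): stationary at first, then moving north for 2 seconds.
lat <- c(40.0000, 40.0000, 40.0000, 40.0000, 40.0003, 40.0006)
lon <- rep(-83.0000, 6)
speed <- speedMph(lat, lon)

# Run-length encode the "stopped" indicator: each TRUE run is one stop,
# and its length is the duration of that stop in seconds.
stopped <- !is.na(speed) & speed == 0
runs <- rle(stopped)
stopLengths <- runs$lengths[runs$values]

print(speed)
print(length(stopLengths))   # number of separate stops
print(sum(stopLengths))      # total seconds at speed 0
```

Real GPS readings jitter slightly even when the vehicle is parked, so a small threshold (for example speed < 1) may work better than an exact comparison with 0 when speed is derived from coordinates. The same rle step applies unchanged to the Speed column reported by the GPS.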

-- END OF EXERCISES --
