How I Got the Data

<< Back

Getting data is not always easy. This dataset is a perfect example. The original data was contained in the appendix of a PDF file, which can be found here.

The data looked like the following. There is no easy way to extract the data using copy/paste, and regions were not defined to make extraction any easier (newer PDFs sometimes define "regions" that allow software to extract information from the file).


Source:

Fortunately, the data follows a pattern: every page starts with a header such as the one in bold above, as well as the column headings. I used the trial version of Cogniview's PDF2XL utility to extract the 13 pages of data by copying it to the clipboard and then pasting it into a blank Excel file. Cool! University students can request a FREE one-year license here.

I save this new Excel document as a comma separated values (CSV) file.

Note that for each second of recording, there are sometimes multiple readings. SAS would be an excellent choice for analysis with this data, but suppose we do not have access to SAS. Since having multiple readings per second is of little use for my analysis, I decided in advance that I would collapse the data at each second by averaging the data values. I would use the R command aggregate to do this.

I need to clearly group readings from the same timestamp into separate records, each containing a timestamp. I could do this manually (which is how I started), but with over 3,000 rows, this quickly became out of the question. Instead, I write a Perl script that automatically injects the timestamp into the empty timestamp field. If you have data similar to this, you can download my script and change it to your needs.

#!/usr/bin/perl

# Use to correct timestamp problem in TWA data set.

open IN, "< ../../Sites/sandbox/TWA800.csv" or die "Could not open input.";
open(OUT,">TWA800-new.csv");

while($line = <IN>) {
     @tokens = split(',',$line);
     if ($tokens[0] =~ /[0-9]{2}:[0-9]{2}:[0-9]{2}/) {
          print "Found a match\n";
          #We have a new second of recording.
          #Get the token/time
          $time = $tokens[0];
          #This line is ok...output to file.
          print OUT $line;
     }
     elsif ($tokens[0] == "") {
          #More data from same timestamp.
          #Insert last found timestamp into this token.
          $tokens[0] = $time;
          #now the line is ok, create a new line and write it.
          $newline = join(',',@tokens);
          print OUT $newline;
     }
}
close(IN);
close(OUT);

Now the file looks like below. Each row is labeled with the timestamp corresponding to that entry.


In Progress: the dataset is not in its final form yet. There are still some problems with it that I need to resolve.

I create an aggregate dataset containing only one record for each second of flight. One problem is that R reads in the VHF column as a factor with levels "Key", "Off", and "". I use line 2 to get rid of this extraneous factor level. I also need to instruct R to ignore all of the NA values when calculating the means, and I must drop the Time variable and the VHF variable (both factors) from the aggregation because that will cause problems (can't take the average of two unordered levels).

data <- read.csv("TWA800-new.csv",header=T)
data$VHF <- factor(data$VHF,c("Key","Off"))

options(warn = 0)
final <- aggregate(data[,-c(1,16)],by=list(Second = as.ts(data$Time)),mean,na.rm=TRUE)

#Compute pressure from altitude
PressureAlt <- 14.7*(((145569.2-final$Altitude)/145442.2)^5.25593789584)

 

*Ignore the last line. That is "toy data."

To Do: There are still some problems with this dataset.

First, the Time variable is not recognized as a timestamp. Rather, it was detected as a factor.

Second, in the aggregated data, the independent variable Second is the number of seconds that have elapsed in the recording. I would prefer that it be the timestamp of the recording.


Dataset (Version: 09/11/06)

<< Back