November 22, 1999
Questions on the final exam will look like those on the midterm and like the sample questions, though it is difficult for me to determine what "like" means here. It is also difficult to judge in advance how difficult a question is.
The description of the fourth project assignment is now available. The project is due before the last lecture of CSE 130, on Wednesday December 8.
Happy Thanksgiving! The lecture on Wednesday this week is canceled.
Answer: place a random number at the start of each line, sort the file according to these numbers, and finally remove the random numbers.
The time complexity of the entire shuffling process is dominated by the time complexity of sorting, which is O(n log n).
Adding and removing the random numbers is only O(n). However the shuffling process should work for files that do not fit into main memory. Therefore traditional space complexity is less important than the number of I/O operations required, where a sequential access is much less expensive than a random access. Here, adding random numbers can be done in one sequential pass, and removing the random numbers can also be done in another sequential pass.
No programming language can make a slow algorithm fast. Therefore, no PL can compensate for a bad choice of algorithm. If part of a process is intrinsically slow, then a slow implementation of another part of the process is harmless. Adding and removing the random numbers can be done by an interpreted language.
Whatever the PL one uses, one should try to reuse existing code.
An important criterion for choosing a PL is whether it facilitates code
reuse.
where addrand and remrand are small programs in awk as follows:% awk -f addrand myfile | sort -n | awk -f remrand
As a special-purpose language designed for processing text files, awk has a fixed builtin top-level control structure: iteration over each line of its input file. The operations inside the top-level braces { and } are performed once for each line of the input file.# add a random number at the start of every line
{
print rand(), $0
}# remove the first field of every line
{
$1 = ""
print $0
}
The special variable $0 refers to the entire input line, while $1 refers to the first field of the input line. One of the low-level aspects of awk is that it is far from pure: one can assign a new value to any variable, even one that is conceptually unchangeable, like a part of the current input line.
Much of the power of scripting languages comes from combining them with software provided by the underlying operating system. Here we are using two features of Unix. One is the standard sorting program; the other is the pipe | operation.
Reusing code is easier when the code to be reused has a simple interface
that hides as much complexity as possible. One great idea of Unix
is that many different operations can all have the same interface:
one file in, and one file out, where a fileis simply a linear sequence
of bytes of arbitrary length.
Scripting languages such as awk are typically a mix of high-level and low-level features. High-level features include:
This example shows a high-level operation on a string, splitting it into fields stored in an array:
This example shows how dynamic typing simplifies the syntax of a function definition, and how awk reuses familiar syntax from C:indexnames = "^tnx ^irx ^djc ^gox ^spc ^jpn sjpnx btaex"
nin = split(indexnames, iname, " ")
for (i = 1; i <= nin; i++) m = readprices(iname[i],temp)
Any Unix shell command can be assembled as a string and then executed:function pretty (ticker, b, name, a, r2, tstat) {
printf "%sd = %5.3f * %sd' + %6.4f \tr^2 = %5.3f \ttstat = %5.3f\n", \
ticker, b, name, a, r2, tstat;
}
There are high-level commands for reading text files:filename = "data/" ticker
command = "lynx -dump \"" url "\" | tac > " filename
system(command)
Files do not need to be opened explicitly, and closing them is optional.FS = ",";
n = 0;
do { i = getline < filename;
if (i != 1) break;
...
}
while (i == 1);
An interpreter is not significantly slower than a compiler if (1) the program spends most of its time waiting for I/O or other processes, or (2) the program spends most of its time executing builtin operations. For typical applications of scripting languages, at least one of these conditions is true. (1) is true for file processing or for job control. (2) is true when using the builtin data structures, whose operations are implemented by optimized compiled code.
Code is typically of high-quality and reusable only if programmers invest
special effort. Scripting languages are designed for quick-and-dirty
programming, so it is reasonable to provide reuse facilities for system
code but not for programmer code.