CSE130 LECTURE NOTES


November 22, 1999
 
 

ADMINISTRATION

We are returning the midterm exams today.  The maximum is 31 out of 40, and the mean is 17.  The low mean does not imply that most people in the class are doing badly, just that there was a mismatch between the sample questions, the actual midterm, and your preparation.

Questions on the final exam will look like those on the midterm and like the sample questions, though it is difficult for me to determine what "like" means here.  It is also difficult to judge in advance how difficult a question is.

The description of the fourth project assignment is now available.  The project is due before the last lecture of CSE 130, on Wednesday December 8.

Happy Thanksgiving!  The lecture on Wednesday this week is canceled.
 
 

AN EXAMPLE OF A TASK FOR A SCRIPTING LANGUAGE SUCH AS awk

How should we shuffle the lines of a large file into random order?

Answer: place a random number at the start of each line, sort the file according to these numbers, and finally remove the random numbers.

The time complexity of the entire shuffling process is dominated by the time complexity of sorting, which is O(n log n).

Adding and removing the random numbers is only O(n).  However the shuffling process should work for files that do not fit into main memory.  Therefore traditional space complexity is less important than the number of I/O operations required, where a sequential access is much less expensive than a random access.  Here, adding random numbers can be done in one sequential pass, and removing the random numbers can also be done in another sequential pass.

No programming language can make a slow algorithm fast.  Therefore, no PL can compensate for a bad choice of algorithm.  If part of a process is intrinsically slow, then a slow implementation of another part of the process is harmless.  Adding and removing the random numbers can be done by an interpreted language.

Whatever the PL one uses, one should try to reuse existing code.  An important criterion for choosing a PL is whether it facilitates code reuse.
 
 

SHUFFLING THE LINES OF A FILE IN AWK

We write two tiny scripts to add and remove the random numbers and then use Unix pipes to combine the three operations:
% awk -f addrand myfile | sort -n | awk -f remrand
where addrand and remrand are small programs in awk as follows:
# add a random number at the start of every line
{
print rand(), $0
}

# remove the first field of every line
{
$1 = ""
print $0
}

As a special-purpose language designed for processing text files, awk has a fixed builtin top-level control structure: iteration over each line of its input file.  The operations inside the top-level braces { and } are performed once for each line of the input file.

The special variable $0 refers to the entire input line, while $1 refers to the first field of the input line.  One of the low-level aspects of awk is that it is far from pure: one can assign a new value to any variable, even one that is conceptually unchangeable, like a part of the current input line.

Much of the power of scripting languages comes from combining them with software provided by the underlying operating system.  Here we are using two features of Unix.  One is the standard sorting program; the other is the pipe | operation.

Reusing code is easier when the code to be reused has a simple interface that hides as much complexity as possible.  One great idea of Unix is that many different operations can all have the same interface: one file in, and one file out, where a fileis simply a linear sequence of bytes of arbitrary length.
 
 

FEATURES OF SCRIPTING LANGUAGES

A scripting language is one used to write scripts.  A script is a small program that controls the operation of some high-level system such as a robot or an operating system shell.  (A shell is a human-oriented command-line interface for an operating system that provides commands for file management and job management.)

Scripting languages such as awk are typically a mix of high-level and low-level features.  High-level features include:

Low-level features include: One major feature of scripting languages is itself a mixture of high-level and low-level: dynamic type-checking.  On the one hand, this facilitates rapid prototyping. On the other hand, dynamic type-checking means that the interpreter (or compiler) can catch fewer bugs so the programmer must do more work.
 
 

MORE EXAMPLES OF awk FEATURES

For a full explanation of all these examples, see the online awk manual.

This example shows a high-level operation on a string, splitting it into fields stored in an array:

indexnames = "^tnx ^irx ^djc ^gox ^spc ^jpn sjpnx btaex"
nin = split(indexnames, iname, " ")
for (i = 1; i <= nin; i++) m = readprices(iname[i],temp)
This example shows how dynamic typing simplifies the syntax of a function definition, and how awk reuses familiar syntax from C:
function pretty (ticker, b, name, a, r2, tstat) {
  printf "%sd = %5.3f * %sd' + %6.4f \tr^2 = %5.3f \ttstat = %5.3f\n", \
  ticker, b, name, a, r2, tstat;
}
Any Unix shell command can be assembled as a string and then executed:
filename = "data/" ticker
command = "lynx -dump \"" url "\" | tac > " filename
system(command)
There are high-level commands for reading text files:
FS = ",";
n = 0;
do { i = getline < filename;
 if (i != 1) break;
 ...
 }
while (i == 1);
Files do not need to be opened explicitly, and closing them is optional.
 
 

SCRIPTING LANGUAGE DESIGN DECISIONS

Let's think about the consequences and interactions of the design decisions manifested in awk.

An interpreter is not significantly slower than a compiler if (1) the program spends most of its time waiting for I/O or other processes, or (2) the program spends most of its time executing builtin operations.  For typical applications of scripting languages, at least one of these conditions is true.  (1) is true for file processing or for job control.  (2) is true when using the builtin data structures, whose operations are implemented by optimized compiled code.

Code is typically of high-quality and reusable only if programmers invest special effort.  Scripting languages are designed for quick-and-dirty programming, so it is reasonable to provide reuse facilities for system code but not for programmer code.
 



Copyright (c) by Charles Elkan, 1999.