Writings of a techie wizard
Fri, 27 Jul 2012

(Note: there is a good discussion of this post and the Knuth-McIlroy exchange on Hacker News.)

The other day I came across a blog post about an interesting exchange between two world-class programmers, Donald Knuth and Doug McIlroy. I talked about McIlroy in my last nerd interlude. To briefly summarize what happened (the article goes into more detail and is worth reading, at least if you're into this sort of thing), Knuth was asked by a computer magazine to write a program that illustrated "literate programming", a technique that Knuth had spent much time developing, and McIlroy was asked to critique what Knuth did. The program was supposed to solve the following problem, as described in the blog post:

Read a file of text, determine the n most frequently used words, and print out a sorted list of those words along with their frequencies.

As it turned out, the heart of McIlroy's critique was simple but devastating: a six-line Unix shell script that accomplished exactly the same task as Knuth's 10-page Pascal program. McIlroy then went on to explain why this was a better method of solving the problem than Knuth's, his main argument being that his method re-used general-purpose utilities instead of writing custom code. Just to lay my cards on the table at once, I'm with McIlroy on this one, as the author of the blog post appears to be, and I'll spend most of the rest of this post giving some reasons why. Since I'll be responding to comments that were made on that post, I would have preferred to do this there, but unfortunately it's too late to comment on the blog post itself.
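
For reference, here is the pipeline; the same six stages appear, with one line modified, at the end of my extended version later in this post. The ${1} is the script's single argument, the number n of words to print:

tr -cs A-Za-z '\n' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed ${1}q

Each stage does one generic job: the first tr puts each run of letters on its own line, the second tr folds upper case to lower, sort brings identical words together, uniq -c collapses each group to one line prefixed with a count, sort -rn orders the lines by count from highest to lowest, and sed ${1}q prints the first n lines and quits.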

A number of commenters on the post said that McIlroy's critique was unfair, because Knuth's program was intended to illustrate his literate programming techniques, not to serve as an actual real-world solution to the problem he was given. This misses the point. McIlroy points out not just that reusable code is a good thing in general, but that Knuth's program specifically neither uses nor produces any. It is a one-off program to solve a one-off problem, and it doesn't break the problem down into generic sub-problems, as the shell pipeline does. It implements everything specifically for this particular problem. As McIlroy notes:

Everything there--even input conversion and sorting--is programmed monolithically and from scratch. In particular the isolation of words, the handling of punctuation, and the treatment of case distinctions are built in. Even if data-filtering programs for these exact purposes were not at hand, these operations would well be implemented separately: for separation of concerns, for easier development, for piecewise debugging, and for potential reuse.

It's quite true that this criticism is orthogonal to the question of whether literate programming, in general, is a good thing. But it is certainly not orthogonal to the question of how to use any programming methodology, literate or otherwise. McIlroy is not saying that Knuth's program is bad; he's saying it could have been a lot better, and served as a much better showpiece for literate programming, if it had taken a different approach to solving the actual problem.

Other commenters noted that you can have reusable code with libraries just as easily as with separate Unix utilities called by shell scripts. That is quite true. (For example, see my Python solution to the problem, linked to below, which is built entirely out of Python built-in functions and routines from the Python standard library.) But either way, you have to have the reusable code available. What if you don't? That was McIlroy's point: if you don't have any reusable code yet (which Knuth didn't have in the Pascal environment he was working in), why not build some as part of building your program? Instead of writing a one-off program to solve a one-off problem, why not write a library, or set of libraries, that provides a set of generic tools that you can then use to compose the solution to your specific problem?

Of course McIlroy already had the generic tools, the Unix utilities. But that's because he helped build them. He invented Unix pipes, remember? Knuth had an opportunity to do for Pascal what McIlroy and those he worked with did for Unix: Knuth could have emerged from his effort with a bunch of Pascal libraries that did a lot of the same general tasks as the Unix utilities. But he didn't.

This is also why the commenters who talked about "portability" are missing the point. Yes, McIlroy's specific solution would only work on a Unix system where those utilities are available. But again, they're there because he helped build them. Knuth's solution is "portable" to any OS that has a Pascal compiler, but so what? Once you've compiled it, you still have a one-off program to solve a one-off problem. Why not build the generic utilities in Pascal instead, and have them be portable? (And in any case the Unix utilities are written in C and so are portable to any OS that has a C compiler, i.e., to a superset of those that have Pascal compilers. Yes, some of the system calls would have to be changed, but that would be true of the Pascal version as well.)

Finally, another commenter noted that Knuth's program is easily extensible and claimed that McIlroy's shell script is not. He gave handling contractions as an example. Now, I am certainly no McIlroy at shell scripting (or programming in general, for that matter--I noted in my previous nerd interlude that McIlroy had written a set of Haskell one-liners that do the same job as the multiple files of Python code I wrote to implement power series as generators), but even I can see how to handle contractions in a shell script: just add four lines to the beginning of McIlroy's pipeline, and modify one of his lines:

sed "s/[^A-Za-z]'/ /g" |
sed "s/^'/ /" |
sed "s/'[^A-Za-z]/ /g" |
sed "s/'\$/ /" |
tr -cs "'A-Za-z" '\n' |
tr A-Z a-z |
sort |
uniq -c |
sort -rn |
sed ${1}q

The first four lines just change quotes that don't lie between two letters--i.e., that aren't part of contractions--to spaces. (There's probably a slicker way to do this with sed, but what's there works, and as I said, I'm not McIlroy.) Then we just modify the tr line to include the remaining quotes in the set of characters we don't change to newlines. Boom. Done.

The commenter gives further challenges: count multiple spellings (for example, "color" vs. "colour") as single words, and handle hyphenated words. He seems to think those would be difficult in a shell script. But given how contractions were handled just now, it should be obvious that those other cases can be handled as well. Hyphenated words can be treated the same way contractions were--adding the hyphen character, as an alternate to the quote, to the sed regular expressions and to the set of characters the tr line keeps should do it. Multiple spellings require a little more, since the script would have to take as input a table of which different spellings count as the same word. But given such a table, using it to canonicalize the spellings of words prior to sorting them is straightforward; a rough sketch of that step is below. True, it might actually require writing a small separate filter script to do the canonicalization--but such a script would just be another generic, reusable tool in the same spirit as the Unix utilities. (The actual programming is left as an exercise for the reader--though at some point I may post my own solutions.)
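
To give a rough idea of what such a filter stage could look like--this is only a sketch, and the table file spellings.txt, with the variant spelling in the first column and the canonical spelling in the second, is just an assumption for the example--it could be a single awk invocation inserted into the pipeline between the tr A-Z a-z line and the first sort (the table entries would need to be lower case, since the words have already been lower-cased by that point):

awk '
    # First pass, over the spellings table: remember variant -> canonical.
    NR == FNR { canon[$1] = $2; next }
    # Second pass, over the words on standard input, one per line:
    # print the canonical spelling if the word has one, else the word itself.
    { if ($1 in canon) print canon[$1]; else print $1 }
' spellings.txt - |

Like the other stages, this one knows nothing about word counting; it is just a generic "rewrite tokens according to a table" filter, which is exactly the kind of reusable tool McIlroy was talking about.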

Of course all these cases could be handled in Pascal (or Python--my Python solution includes an extended version that handles contractions) as well. But once again, debating whether they would be harder to handle in shell is missing the point. Knuth's program could no doubt have been modified to handle contractions, multiple spellings, hyphenation, etc., etc. But how would he have modified it? Would he just have written more custom code? In shell, you just keep composing the same simple utilities in new ways. That's what makes them so powerful. (And as I noted above, I have tried to take the same approach in my Python solutions.)

So with all due respect to Donald Knuth (which is a lot, make no mistake), I have to agree with McIlroy's final assessment:

Knuth has shown us here how to program intelligibly, but not wisely. I buy the discipline. I do not buy the result.

Postscript

As I mentioned above, I couldn't resist the temptation to program a solution to this problem in Python. It's quite a bit more than six lines. :-) You can see it on github here, along with an enhanced version that handles contractions, and McIlroy's shell pipeline along with my enhanced version above, for comparison. There's also a test text file on which you can run the different versions to see the output, along with the expected output for each version.

Posted at 23:59   |   Category: opinions   |   Tags: computers   |   Permalink