A Princeton professor, finding some time for himself during the summer academic lull, emailed an old friend a couple of months ago. Brian Kernighan said hello, asked how their visit to the US was going, and dropped off hundreds of lines of code that could add Unicode support to AWK, the text-parsing tool he helped create for Unix at Bell Labs in 1977.
“I’ve tested it quite a bit, but clearly more testing is needed,” Kernighan wrote in the email, which longtime contributor Arnold Robbins posted to the onetrueawk repo in late May as a kind of pseudo-commit. “Once I figure out how … I’ll try to submit a pull request. I wish I had a better understanding of git, but despite your help, I still don’t have a proper understanding, so it might take a while.”
Kernighan is the “K” in AWK, a special-purpose language for extracting and manipulating text that was key to Unix’s pipeline features and interoperability between systems. A working awk function (AWK is the language, awk the command to invoke it) is critical to both the Single UNIX Specification and IEEE POSIX certification for interoperability. There are countless variants of awk, including modern Unicode-enabled derivatives, but the “One True AWK,” sometimes known as nawk, is a sort of canonical version based on Kernighan’s 1985 book The AWK Programming Language and his subsequent input.
Kernighan is also the “K” in “K&R C,” the seminal 1978 book The C Programming Language, which he co-authored with Dennis Ritchie and which follows programmers around, in mind and on dog-eared paper. C’s roots go much deeper. Kernighan taught C to Bell Labs workers and convinced its creator, Ritchie, to collaborate on a book to spread the knowledge. That book spawned the “one true brace style,” the endless debates that accompany it, and a structure that underlies nearly every modern programming language.
The onetrueawk repository, where Kernighan appeared in late May, is a relatively quiet place, with 21 contributors, 46 watchers on GitHub, and commits every few months. As noted by The Register, Kernighan’s Unicode fix became known mainly because he mentioned it in an interview on the YouTube channel Computerphile.
“It was always confusing that AWK only worked with ASCII or maybe 8-bit input, but it doesn’t actually handle Unicode at all,” Kernighan tells Professor David Brailsford. “A few months ago I spent some time working with (laughs) an incredibly old program. At the moment I have it to where it will actually handle UTF-8 input and output, so you can have regular expressions that, you know, pick out Japanese characters, things like that.”
Kernighan, now 80, fondly mentions in the interview that he also did something “quick and dirty” to enable AWK to process CSV files.