[Source](http://www.pement.org/sed/bookindx.txt)

(12 Nov. 2001 - some minor typos corrected from earlier versions.)

\---------- Forwarded message ----------
Date: Sat, 15 Mar 1997 03:10:29 -0600
From: Eric Pement
To: Al Aab
Subject: using sed to make indexes for books (long)

SUBJECT: using sed to make indexes for books (long)

I work with book- and magazine-publishing, and some time ago I needed to create an index for a book after typesetting. On our proof pages (hard copy), we used a yellow marker to highlight the terms we wanted to index, and then several volunteers used the computer to enter the terms and the page numbers, separating them with a semicolon. Each term was entered on one line. The initial input file looked something like this:

    Buddhism, Zen; 1
    atheism; 1
    dualism; 1
    Solomon; 2
    Lausanne Covenant; 4
    Lewis, C.S.; 4
    Lausanne Covenant; 5
    Mormonism; 6
    Latter-day Saints; 6
    Trinity; 6
    Lausanne, Switzerland; 8
    Trinity; 8
    . . . .

Note that the data was entered in the order that we completed each page or chapter.

Next, we sorted the file with a sort utility: case-insensitive and numeric-aware (i.e., the number "3" must come before "19"; in a normal ASCII sort, "19" would appear before "3"). To get a sort which satisfied both conditions was extremely difficult, even using the GNU sort program (the manual pages for GNU sort don't explain the switches very well). The proper syntax to use is:

    sort -t";" +0f -1 +1n input.file

Briefly explaining the switches, -t";" sets the field delimiter to be a semicolon. Fields are numbered beginning at zero (0), not one. Thus, "+0f -1" means the first sort key will begin at field 0 (the 1st field, to normal people) and end before reaching field 1, and be case-insensitive ("f" for folded). "+1n" means that the next sort key will begin at field 1 (the 2nd field) and, being followed by no "-NUM" value, will continue to the end of the line. The "n" means this field will be sorted according to numeric values, even including decimal points, instead of in ASCII order.
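A note for readers trying this today: the zero-based "+POS -POS" key syntax shown above is an obsolete form, and many current sort implementations may reject it or treat "+0f" as a file name. The sketch below expresses the same two keys with the one-based -k option, where "+0f -1" becomes "-k 1,1f" and "+1n" becomes "-k 2n". The function name sort_index_entries and the sample lines are only illustrations, and LC_ALL=C is an added assumption to make the result independent of the local collation rules.

```shell
#!/bin/sh
# sort_index_entries is a hypothetical helper name, not part of the original post.
# It reads "term; page" lines on stdin and sorts them by term (case folded),
# then by page number (numeric).
sort_index_entries() {
    # -t ';'    use a semicolon as the field delimiter
    # -k 1,1f   first key: field 1 only, lowercase folded to uppercase
    # -k 2n     second key: field 2 to end of line, compared numerically
    LC_ALL=C sort -t ';' -k 1,1f -k 2n
}

sorted_output=$(printf '%s\n' \
    'atheism; 118' \
    'Bible; 3' \
    'Adam; 30-32' \
    'atheism; 9' \
    'Adam; 13' \
    'Bible; 11-14' | sort_index_entries)

printf '%s\n' "$sorted_output"
```

With this input, "atheism" sorts between "Adam" and "Bible" because of the case folding, and page 9 comes before page 118 because of the numeric key. A range such as "30-32" is compared by its leading number only.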
If you use other sort utilities, the command syntax will probably differ. The entries for the sorted file now looked like this:

    Adam; 13
    Adam; 21
    Adam; 30-32
    agnosticism; 9
    agnosticism; 120
    atheism; 1
    atheism; 9
    atheism; 40-41
    atheism; 118
    Bible; 3
    Bible; 11-14
    Bible; 22

We wanted to convert the data shown above to look like this, in a format ready for printing:

    Adam, 13, 21, 30-32
    agnosticism, 9, 120
    atheism, 1, 9, 40-41, 118
    Bible, 3, 11-14, 22
    . . .

Four years ago I used an awk script to perform this conversion, but I have now realized that a sed script could do this simply and with fewer lines of code. The sed script I came up with to perform this conversion looked like this (at first, anyway):

    # INDEXER.SED v1.0 - indexes sorted input file
    # Annotated for seders mailing list
    {                # on every line of the file...
    :loop
    $! N;            # if not the last line, get the Next line
    s/^\([^;]*;\) \(.*\)\n\1 \(.*\)/\1 \2, \3/
    t loop;          # if previous substitution occurred, goto :loop
    s/;/,/;          # replace the semicolon with a comma
    P;               # print first line of pattern buffer
    D;               # delete 1st line of buffer & redo the loop
    }

This script works! Well, sort of. As long as the input file is perfectly formatted, the script works fine. But I discovered that if *one* line anywhere in the file was in error, the script would fail to change every line after it.
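Before looking at the failure, it may help to run the v1.0 commands on a small, well-formed sample. The following is a minimal sketch, not the author's file: it passes the same commands to sed one -e at a time so that no script file is needed, and it assumes the substitution was originally written with backslashed groups and back-references (\( \), \n, \1, \2, \3). The function name make_index and the sample lines are hypothetical.

```shell
#!/bin/sh
# make_index is a hypothetical helper name, not part of the original post.
# It reads sorted "term; page" lines on stdin and merges each run of
# identical terms into a single "term, page, page, ..." line.
make_index() {
    # :loop    label for the merge loop
    # $!N      if not on the last line, append the next line to the buffer
    # s/.../   if both buffered lines start with the same "term;", join them
    # t loop   if that substitution happened, try to merge another line
    # s/;/,/   turn the first semicolon in the buffer into a comma
    # P        print the first line of the buffer
    # D        delete that first line and restart with what remains
    sed -e ':loop' \
        -e '$!N' \
        -e 's/^\([^;]*;\) \(.*\)\n\1 \(.*\)/\1 \2, \3/' \
        -e 't loop' \
        -e 's/;/,/' \
        -e 'P' \
        -e 'D'
}

index_output=$(printf '%s\n' \
    'Adam; 13' \
    'Adam; 21' \
    'Adam; 30-32' \
    'agnosticism; 9' \
    'agnosticism; 120' \
    'Bible; 3' | make_index)

printf '%s\n' "$index_output"
```

On this input the sketch prints three lines: "Adam, 13, 21, 30-32", "agnosticism, 9, 120" and "Bible, 3". Note that the final s/;/,/ acts on the first semicolon anywhere in the buffer, which at that point may hold two lines.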
Consider the two following sets of input files (made very short for explanation here):

    ===SET 1======        ===SET 2======
    Adam; 13              Adam; 13
    Adam; 21              Adam; 21
    Adam; 30-32           Adam; 30-32
    agnosticism; 9        agnosticism; 9
    agnosticism; 120      agnosticism; 120
    atheism; 9            atheism, 9        # this line differs in SET2
    atheism; 40-41        atheism; 40-41
    atheism; 118          atheism; 118
    Bible; 3              Bible; 3
    Bible; 11-14          Bible; 11-14
    Bible; 22             Bible; 22
    binitarian; 82        binitarian; 82

Now, compare the output produced by "sed -f INDEXER.SED set1 set2":

    Adam, 13, 21, 30-32        Adam, 13, 21, 30-32
    agnosticism, 9, 120        agnosticism, 9, 120
    atheism, 9, 40-41, 118     atheism, 9
    Bible, 3, 11-14, 22        atheism, 40-41
    binitarian, 82             atheism, 118
                               Bible, 3
                               Bible, 11-14
                               Bible, 22
                               binitarian, 82

As you can see, the absence of a sing