Perl Tips
Perl is the Swiss Army Knife of scripting languages. It can do everything
shell scripts and standard Unix tools like grep, sed, and awk can do, plus
much, much more.
Documentation
A unique feature of Perl is that it doesn't come with just one man page,
but with a whole slew of man pages that describe various aspects of Perl,
and serve as tutorials, reference manuals, and FAQ pages. Here are some
of the most useful of these pages for beginners:
-
man perl
Main man page, lists the various auxiliary pages
available.
-
man perlintro
Perl introduction for beginners.
-
man perlrequick
Perl regular expressions quick start.
-
man perlcheat
Perl cheat sheet (very neat!).
-
man perlfaq1
The first Perl FAQ page. (There are 9 such pages,
perlfaq1 through perlfaq9.)
-
perldoc -q keyword
Extracts entries matching "keyword" from the
perlfaq pages. Example: "perldoc -q reverse".
-
perldoc -f function
Man page for perl function "function".
Example: "perldoc -f reverse".
-
perldoc -q books
The Perl books section of the Perl FAQ pages.
A large listing of recommended books, classified by level.
My own top three recommendations are "Learning Perl" by
Schwartz/Phoenix/Foy; "Programming Perl" by Wall/Christiansen/Orwant;
and "The Perl Cookbook" (Christiansen/Torkington), all published
by O'Reilly. The first is a beginner's tutorial; the second is a "must
have" reference for anyone seriously into Perl; and the third is icing on
the cake, with lots of nifty tricks.
Line-based operation
The simplest way to use perl in commandline mode is as a filter
that operates on a file (or on standard input), manipulates the file
one line at a time, and outputs the result to standard output,
much as standard Unix utilities such as grep, sed, and awk do.
The basic structure of the command is one of the following:
-
perl -lape'.....' file
For each line in "file",
apply the command(s) '.....' (e.g., a substitution), then
print the line to standard output.
-
perl -lane'.....' file
For each line in "file",
apply the command(s) '.....', but
do not print the line. In this case, '.....' will usually contain a
print command (possibly conditional), and only output generated by such an
explicit print command will get printed.
The commandline options used here have the following meaning:
-
The "e" option indicates that the following string
is to be interpreted as a Perl script (i.e., a sequence of commands).
To prevent the shell from interpreting parts of the script itself, it is
best to enclose the script in single quotes (').
-
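The "l" (the letter "ell", not the digit one or the bar symbol)
option takes care of end-of-line handling: it strips the trailing linebreak
from each input line and adds one back to everything that gets printed.
Without it, the output of a command such as print $F[1] would run together
on a single line.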
-
The "a" option causes Perl to autosplit each line into an array of fields
$F[0], $F[1], ..., with whitespace acting as the default field separator.
Note that in Perl, array indices start at 0, so the first array element
has index 0.
-
The "p" and "n" options indicate whether or not each line is
printed by default.
The following examples illustrate the use of Perl for line-by-line
processing of files.
-
perl -lane'print $F[1]' file
Print second field of each line (i.e., the output consists of the
second column of the file).
Note that $F[0] is the first field, $F[1] the second, etc.
-
perl -lane'print $F[-1]' file
Print the last column of the file.
In Perl, negative array indices denote array elements counted from the
right. Thus $F[-1] denotes the last field (column), $F[-2] the second
last, etc.
-
perl -lane'print "$F[2],$F[1]"' file
Print the third and second columns of the file, in that order,
separated by a comma.
-
perl -lpe's/\s+/,/g' file
Replace any sequence of consecutive blank spaces
by a comma.
This converts a tabular list
with fields separated by blanks
to one in which the fields are separated by commas. (The latter format
is the CSV format, a common spreadsheet format that can be used to import
files into Excel.)
The s/.../.../g syntax is similar to that of sed.
The "g" modifier in the substitution command denotes a "global"
substitution; without it, only the first match of the pattern on each
line would get substituted.
"\s" stands for any whitespace character (blank, tab, etc.). The plus
sign "+" indicates that the substitution pattern should match one or more
instances of "\s"; thus, any chunk of consecutive whitespace characters
gets replaced by a single comma.
(The "a" (autosplit) option is not needed here since no use of the field
array $F[...] is made; however, it would not hurt to leave it in.)
-
perl -pe 's/3/1/g' file
Replace every occurrence of the digit 3 in the file by 1.
(Here the "a" (autosplit) option is not needed, nor is the "l"
(end-of-line processing) option, though one could, of course, leave
those options in.)
-
perl -i.bak -pe 's/3/1/g' file
The same, but with "in place" editing.
The "i" option is a powerful option of Perl that causes the commands to
be performed on the file itself. Thus, there is no need to save the modified
file under a temporary filename and then copy that file over the original
file. In the above form of this option,
the original version of the file is saved onto
a file with extension ".bak". Saving the original version onto a backup
file is a safety mechanism; the name of the backup file can be changed by
replacing the string ".bak" by something else. If no such string is
provided in the "-i" option, then the file is modified without a backup
being made.
-
perl -lpe's/\d+/NNN/g' file
Replace any string of digits by "NNN".
Here "\d" stands for any single digit, and the plus sign indicates one or
more instances of whatever precedes it. Thus, \d+ stands for any string of
digits.
-
perl -lpe's/^/$. /' file
Print the file, with line numbers prepended to each line.
The "$." variable denotes the line number; the caret symbol (^) denotes a
match at the beginning of the line. In this case the pattern in s/.../.../
consists of the caret alone, so it matches the empty string at the start of
the line, and the "substitution" simply amounts to tacking on the
replacement string at the beginning of the line.
-
perl -lpe's/^\s+//' file
Delete any blank spaces at the beginning of each line.
-
perl -lpe's/^\s+//;s/\s+$//' file
Same, but also delete any blank spaces at the end of each line.
The two substitutions specified by the s/.../.../ commands are separated
by a semicolon and are executed sequentially. In the second substitution
command, the dollar sign ($) plays a role analogous to the caret sign and
denotes the end of the line.
-
perl -lane'print if (/\d\d\d\d/)' file
Print all lines in file that contain (at least) four consecutive
digits.
The string enclosed in /.../ is interpreted as a pattern that needs to be
matched in order for the if clause to evaluate as true. The string
\d\d\d\d stands for 4 consecutive digits. (This is a grep-like operation;
grep can do the same with the pattern [0-9][0-9][0-9][0-9], but Perl's
"\d" shorthand is more compact and its regular expressions are richer.)
-
perl -lane'print if (/\S/)' file
Print any line that contains a non-whitespace character. This effectively
deletes blank lines (or lines containing only whitespace) from the file.
"\S" stands for the complement of "\s", i.e., any character that is not
whitespace.
-
perl -lane'print length($_)' file
Print the length (measured in characters) of each input line.
-
perl -lane'print if (length($_) > 40)' file
Print all lines in file that have
length (measured in characters) greater than 40.
Operating on entire files
Perl's power really shines when one wants to perform operations on chunks
of files that extend over multiple lines (e.g., deleting line breaks in
paragraphs). Standard Unix utilities like sed or awk are ill-suited for
that, but with Perl this is easy by changing the record separator
(which defaults to a linebreak) to something else using the '-0' option.
Of particular interest are the following cases:
-
Slurp mode: perl -0777
The "0777" string (note that "0" here is the digit 0)
causes the record separator to be set to "undefined", which in turn
causes Perl to operate on the entire file as if it were one line
("slurp mode").
-
Paragraph mode: perl -00
The "00" (two digits 0) string causes Perl to interpret one or more consecutive
blank lines as the record separator. Thus Perl operates on each paragraph
as if it were a line.
Here are some examples using these modes:
-
perl -00 -lpe's/\n/ /g' file
Delete all linebreaks within each paragraph, replacing them by a single
blank space. The net effect is that each paragraph becomes a single line.
Here "\n" stands for a linebreak character.
-
perl -00 -lpe's/\n/ /g; s/\.\s*/\.\n/g' file
Same, but after having deleted all linebreaks within paragraphs,
reinsert linebreaks at the end of each sentence. As a result, each sentence
gets its own line.
The asterisk (*) in "\s*" denotes 0 or more instances of "\s". Thus,
"\.\s*" matches a period, plus any whitespace following it.
The period is used here as an end-of-sentence marker. It must be escaped
with a backslash (\.) since in a Perl pattern an unescaped period matches
any single character.
-
perl -00 -lpe's/\n/ /g; s/\.\s*/\.\n/g' file | perl -lane'print $#F+1'
Same as before, but pipe the output into another command that prints out
the number of "words" (in the sense of any consecutive string of
nonblanks) for each sentence.
$#F denotes the last index in the array $F[...]. Since the indexing starts
with 0, one has to add 1 to obtain the number of elements in this array.
-
perl -0777 -lape's/\s+/,/g' file
Replace all whitespace in file by commas, crossing line boundaries.
The resulting file consists of a single long line, with fields separated
by commas. Such a format may be useful for importing to other
programs.
-
perl -0777 -lape's/\s+/\n/g' file
Replace each chunk of one or more whitespace characters in file by
a single newline.
The resulting file consists of one "word" per line. This is useful for
getting word statistics, as in the next example.
-
perl -0777 -lape's/\s+/\n/g' file | sort | uniq -c | sort -nr | less
A one-line word frequency counter: it generates a list of all distinct
"words" in the file, with their frequency of occurrence, sorted
from most frequent to least frequent.
The "sort" command sorts the words alphabetically.
The "uniq" command eliminates duplicate words; with the "-c" option it
also prints the number of occurrences. The second "sort" command, with
the "-nr" option, sorts the resulting list numerically in descending order.
Finally, the "less" command shows the result one page at a time.
Cool stuff
-
perl -lne 'print if "$_" eq reverse' file
Finds all palindromic lines in file. In particular, if each line
contains a single number, it displays all palindromes among these
numbers. If applied to a dictionary file (such as
/usr/share/lib/dict/words), with one word per line, it displays the
palindromes among the words.
-
perl -e '$n=1;while ($n++){sleep 1;print "\n$n is prime" if (("p" x $n) !~ /^((p)\2+)\1+$/)}'
Print out all prime numbers, one per second. (Note that the entire command
must be on a single line.)
-
perl -lape'tr/a-z/n-za-m/' file
A one-line encrypter. Rotate all (lower case) letters by 13 characters: a
-> n, b -> o, etc.
-
perl -lape's/(\w)/\U$1/g' file
Change all letters in file to upper case.
-
perl -lape's/(\w)/\L$1/g' file
Change all letters in file to lower case.
Sources of practice material
-
/usr/share/dict/words
The standard Unix dictionary file. List of "words", one per line.
-
Project Gutenberg
A large repository of ebooks in plain text format.
Last modified: Mon 20 Jul 2009 10:07:23 AM CDT
A.J. Hildebrand