String Processing With Perl

Perl comes with a powerful string manipulation API. Be afraid.

Perl Of Wisdom

When Larry Wall first introduced Perl to an unsuspecting world, he had only one goal in mind: to create a language that simplified the task of extracting text and formatting it for display by combining the best of sed, awk and C. The result of his efforts, the Practical Extraction and Report Language, was so powerful that developers all over the world started using it...and not just for text processing either!

Over the next few years, Perl underwent quite a few rewrites, gradually evolving into a full-featured programming language that could handle almost everything you threw at it. A modular system allowed developers to add to and enhance the language's features with their own custom code, thereby ensuring that the language was always at the forefront of new technologies, and guaranteeing a continuous stream of new converts. As a result, by 1995, Perl was a de facto component of almost every developer's toolkit.

Throughout all this, though, Perl stayed true to its roots. The language was originally designed for string processing and, today, still boasts one of the more powerful string handling APIs in the business. Perl's constructs and functions deal with strings quickly and efficiently, and its speed at text processing - string manipulation, comparison, sorting, extracting - makes it a great choice for Web developers, who find both speed and power essential to their daily development activities.

Over the next few pages, this article will offer you a broad overview of Perl's string manipulation capabilities, serving as both a handy reference and a tool to help you write more efficient code. Regardless of whether you're new to Perl or if you've been working with the language for a while, you should find something interesting in here.

Let's get started!

Jumping Jacks

We'll begin right at the top, with some very basic definitions and concepts.

In Perl, the term "string" refers to a sequence of characters. The following are all valid examples of strings:

"I’m back!"

"by golly miss molly"

"a long time ago in a galaxy far, far away"

String values can be assigned to a variable using the standard assignment operator.

$identity = "Jedi";

String values may be enclosed in either double quotes ("") or single quotes ('') - the following variable assignments are equivalent:

$character = "Luke";

$character = 'Luke';

String values enclosed in double quotes are automatically parsed for variable names; if variable names are found, they are automatically replaced with the appropriate variable value.

#!/usr/bin/perl

$character = "Chewbacca";
$race = "Wookie";

# this would contain the string "Chewbacca is a Wookie"
$sentence = "$character is a $race";
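The two quoting styles are only equivalent for plain text; once a variable name appears inside the string, they part ways. Here's a minimal sketch of the difference. The "use strict" and "my" declarations are my additions (the examples in this article get by without them), and the variable names are purely illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $character = "Chewbacca";

# double quotes: $character is replaced by its value
my $double = "$character roars";

# single quotes: the text is kept literally, dollar sign and all
my $single = '$character roars';

# braces show Perl where the variable name ends
my $plural = "${character}s roar";

print "$double\n";    # Chewbacca roars
print "$single\n";    # $character roars
print "$plural\n";    # Chewbaccas roar
```

Single quotes are the safer choice whenever a string contains a literal $ or @ that you don't want Perl to touch.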

Perl also allows you to create strings which span multiple lines. The original formatting of the string, including newlines and whitespace, is retained when such a string is printed.

# multi-line block
$html_output = <<EOF;
<html>
<head></head>
<body>
<ul>
    <li>Human
    <li>Wookie
    <li>Ewok
</ul>
</body>
</html>
EOF

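The << operator introduces what Perl calls a "here-document": everything that follows, right up to a line containing only the marker "EOF", is treated as a single multi-line string (here, it is assigned to $html_output). With a bare marker like this one, variables inside the block are replaced just as they would be in a double-quoted string. This comes in very handy when you need to output a chunk of HTML code, or any other multi-line string.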
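The way the marker is quoted decides whether variables inside the block are replaced. The following sketch, with made-up variable names, puts the two behaviours side by side.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $race = "Ewok";

# bare marker: behaves like a double-quoted string
my $interpolated = <<EOF;
<li>$race</li>
EOF

# single-quoted marker: behaves like a single-quoted string
my $literal = <<'EOF';
<li>$race</li>
EOF

print $interpolated;    # <li>Ewok</li>
print $literal;         # <li>$race</li>
```

Note that the closing marker has to sit on a line of its own, with no leading spaces and no trailing semicolon.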

Strings can be concatenated with the string concatenation operator, represented by a period (.).

#!/usr/bin/perl

# set up some string variables
$a = "the cow ";
$b = "jumped over ";
$c = "the moon ";

# combine them using the concatenation operator

# this returns "the cow jumped over the moon " (note the trailing space)
$statement = $a . $b . $c;

# and this returns "the moon jumped over the cow "
$statement = $c . $b . $a;

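Note that if your string contains a backslash, or the same quote character that is used to enclose it, it's necessary to escape that character with a backslash. Characters such as newlines and carriage returns can be written as the escape sequences \n and \r inside double quotes.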

# will cause an error due to mismatched quotes
$film = 'America's Sweethearts';

# will be fine
$film = 'America\'s Sweethearts';

The print() function is used to output a string or string variable.

#!/usr/bin/perl

# string
print "Last Tango In Paris";

# string variable
$film = "Last Tango In Paris";
print $film;

But if you thought that all you can do is concatenate and print strings, think again - you can also repeat strings with the repetition operator, represented by the character x.

#!/usr/bin/perl

# set a string variable
$insult = "Loser!\n";

# repeat it
print($insult x 7);

Here's the output:

Loser!
Loser!
Loser!
Loser!
Loser!
Loser!
Loser!

Nasty, huh?

Choppy Waters

With the basics out of the way, let's now turn to some of the other string functions available in Perl. Other than print(), the two functions you're likely to encounter most often are chop() and chomp().

There's a very subtle difference between these two functions. Take a look at this next example, which demonstrates chop() in action.

#!/usr/bin/perl

$statement = "Look Ma, no hands";

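chop($statement);
print $statement;

The chop() function removes the last character from a string, modifying the string in place. Its return value is the character that was removed (here, "s") and not the mutilated string, which is why the example prints the variable after chopping it:

Look Ma, no hand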

Now, how about chomp()?

#!/usr/bin/perl

# ask a question...
print "Gimme a number! ";

# get an answer...
$number = <STDIN>;

# process the answer...
chomp($number);
$square = $number * $number;

# display the result
print "The square of $number is $square\n";

In this script, prior to using the chomp() function, the variable $number contains the data entered by the user at the prompt, together with a newline (\n) character caused by pressing the Enter key. Before the number can be processed, it is important to remove the newline character, as leaving it in could adversely affect the rest of the program. Hence, chomp().

The chomp() function's sole purpose is to remove the newline character from the end of a variable, if it exists. Once that's taken care of, the number is multiplied by itself, and the result is displayed.

Gimme a number! 5
The square of 5 is 25
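To make the difference between the two functions concrete, here's a small sketch (variable names are illustrative) that looks at both what each function returns and what it leaves behind.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = "hello\n";

# chomp() removes a trailing newline and returns how many characters it removed
my $removed = chomp($line);      # 1, and $line is now "hello"

# a second chomp() finds no newline, so nothing happens
my $removed_again = chomp($line);    # 0, and $line is still "hello"

# chop() removes the last character, whatever it is, and returns it
my $word = "hello";
my $last = chop($word);          # "o", and $word is now "hell"

print "$line $removed $removed_again $word $last\n";    # hello 1 0 hell o
```

In other words, chomp() is safe to call more than once on the same input, while every call to chop() eats another character.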

The length() function returns the length of a particular string, and can come in handy for operations which involve processing every character in a string.

#!/usr/bin/perl

$str = "The wild blue fox jumped over the ripe yellow pumpkin";

# returns 53
print length($str);

Making New Friends

The split() function splits a string into smaller components on the basis of a user-specified pattern, and then returns these elements as an array.

#!/usr/bin/perl

$str = "I'm not as think as you stoned I am";

# split into individual words on whitespace delimiter and store in array @words
@words = split (/ /, $str);

This function is particularly handy if you need to take a string containing a list of items (for example, a comma-delimited list) and separate each element of the list for further processing.

Here's an example:

#!/usr/bin/perl

$str = "Rachel,Monica,Phoebe,Joey,Chandler,Ross";

# split into individual words and store in array
@arr = split (/,/, $str);

# print each element of array
foreach $item (@arr)
{
        print("$item\n");
}

Here's the output:

Rachel
Monica
Phoebe
Joey
Chandler
Ross

Obviously, you can also do the reverse - the join() function creates a single string from all the elements of an array, glueing them together with a user-defined separator. Reversing the example above, we have:

#!/usr/bin/perl

@arr = ("Rachel", "Monica", "Phoebe", "Joey", "Chandler", "Ross");

# create string from array
$str = join (" and ", @arr);

# returns "Rachel and Monica and Phoebe and Joey and Chandler and Ross are friends"
print "$str are friends";
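Real-world lists are rarely as tidy as the ones above, so here's a sketch of a few split() variations that tend to be useful: a pattern that swallows stray spaces around the commas, the special ' ' separator for whitespace, and an optional third argument that limits the number of pieces. The sample data is invented for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# the pattern absorbs any whitespace around each comma
my @names = split(/\s*,\s*/, "Rachel , Monica,Phoebe,  Joey");

# a single-space string (not a regex) skips leading whitespace
# and splits on runs of whitespace
my @words = split(' ', "  the   cow jumped ");

# a limit of 2 stops splitting after the first separator
my @pair = split(/,/, "a,b,c,d", 2);

# join() puts the cleaned-up names back together
my $rejoined = join(",", @names);

print "$rejoined\n";            # Rachel,Monica,Phoebe,Joey
print scalar(@words), "\n";     # 3
print "$pair[1]\n";             # b,c,d
```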

Not My Type

The chr() and ord() functions come in handy when converting from ASCII codes to characters and vice-versa. For example,

#!/usr/bin/perl

# returns "A"
print chr(65);

# returns 97
print ord("a");

If you prefer numbers to letters, you can use the hex() and oct() functions to convert strings of hexadecimal and octal digits into their decimal values.

#!/usr/local/bin/perl

# returns 170
print hex("AA");

# returns 40
print oct("50");
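Both functions only go one way, from a string of digits to a number. Going back the other way is a job for sprintf(), as this sketch shows; it also demonstrates that oct() understands the 0x and 0b prefixes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# string of digits to number
my $from_hex      = hex("AA");         # 170
my $from_oct      = oct("50");         # 40
my $from_prefixed = oct("0xAA");       # 170, oct() recognizes the 0x prefix
my $from_binary   = oct("0b1010");     # 10, and the 0b prefix too

# number back to a string of digits
my $to_hex = sprintf("%x", 170);       # "aa"
my $to_oct = sprintf("%o", 40);        # "50"

print "$from_hex $from_oct $from_prefixed $from_binary $to_hex $to_oct\n";
```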

And if you ever find the need to reverse a string, well, you can always reach for the reverse() function...

#!/usr/bin/perl

$str = "Wassup, dood?";

# reverse string
# $rts now contains ?dood ,pussaW
$rts = reverse($str);

# returns "Sorry, you seem to be talking backwards - what does ?dood ,pussaW mean?"
print "Sorry, you seem to be talking backwards - what does $rts mean?";
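One catch worth knowing about: reverse() only reverses the characters of a string when it is called in scalar context, as it is in the assignment above. In list context it reverses the order of a list instead, which for a single string changes nothing. A short sketch:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $str = "Wassup, dood?";

# scalar context: the characters are reversed
my $rts = reverse($str);            # "?dood ,pussaW"

# list context: a one-element list is "reversed", so nothing changes
my @list = reverse($str);           # ("Wassup, dood?")

# print() supplies list context, so force scalar context explicitly
my $forced = scalar reverse($str);  # "?dood ,pussaW"

print "$rts\n";
print "@list\n";
print "$forced\n";
```

If a "print reverse($str);" ever seems to do nothing, this is almost certainly the reason.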

Of Jumping Cows And Purple Pumpkins

Next up, the substr() function. As the name implies, this is the function that allows you to slice and dice strings into smaller strings. Here's what it looks like:

substr(string, start, length)

where "string" is a string or string variable, "start" is the position to begin slicing at (counting from 0), and "length" is the number of characters to return from "start".

Here's an example which demonstrates how this works:

#!/usr/bin/perl

$str = "The cow jumped over the moon, purple pumpkins all over the world rejoiced.";

# returns "purple pumpkin"
print substr($str, 30, 14);
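substr() has a few more tricks up its sleeve: the length can be left out, the start position can be negative, and an optional fourth argument replaces the selected slice. Here's a sketch using a shortened version of the sentence above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $str = "The cow jumped over the moon";

# start at position 4, take 3 characters
my $animal = substr($str, 4, 3);        # "cow"

# no length: everything from position 8 to the end
my $rest = substr($str, 8);             # "jumped over the moon"

# negative start: count back from the end of the string
my $tail = substr($str, -4);            # "moon"

# fourth argument: replace the slice in place; the old text is returned
my $old = substr($str, 4, 3, "dog");    # "cow"

print "$str\n";    # The dog jumped over the moon
```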

You can use the case-sensitive index() function to locate the first occurrence of a character or substring in a string (positions are counted from 0, and -1 means no match was found),

#!/usr/bin/perl

$str = "Robin Hood and his band of merry men";

# returns 29
print index($str, "r");

and the rindex() function to locate its last occurrence.

#!/usr/bin/perl

$str = "Robin Hood and his band of merry men";

# returns 33
print  rindex($str, "m");

You can also tell these functions to skip a certain number of characters before beginning the search - consider the following example, which demonstrates by beginning the search after skipping the first five characters in the string.

#!/usr/bin/perl

$str = "Robin Hood and his band of merry men";

# skips the first 5 characters
# returns 7 for the first "o" in "Hood"
print index($str, "o", 5);
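The optional start position is what makes it possible to find every occurrence of a substring rather than just the first. The loop below is one way to do it; it relies on index() returning -1 when there is nothing left to find. The @positions array is simply a name chosen for this sketch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $str = "Robin Hood and his band of merry men";
my @positions;

# find the first "o", then keep searching from just past each hit
my $pos = index($str, "o");
while ($pos != -1) {
    push @positions, $pos;
    $pos = index($str, "o", $pos + 1);
}

print "@positions\n";    # 1 7 8 24
```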

On The Case

The next few string functions come in very handy when adjusting the case of a text string from lower- to upper-case, or vice-versa:

lc() - convert string to lower case

uc() - convert string to upper case

ucfirst() - convert the first character of string to upper case

lcfirst() - convert the first character of a string to lower case

Here's an example:

#!/usr/bin/perl

$str = "Something's rotten in the state of Denmark";

# returns "something's rotten in the state of denmark"
print lc($str);

# returns "SOMETHING'S ROTTEN IN THE STATE OF DENMARK"
print uc($str);

# returns "something's rotten in the state of Denmark"
print lcfirst($str);

# re-initialize for next bit of code
$str = "something's rotten in the state of Denmark";

# returns "Something's rotten in the state of Denmark"
print ucfirst($str);

You've already used the print() function extensively to display output. However, the print() function doesn't allow you to format output in any significant manner - for example, you can't write 1 as 00001, or trim a long decimal down to two places. And so clever Perl developers came up with the sprintf() function, which allows you to define the format in which data is printed. (Thousands separators, as in 1,000, are a separate matter: sprintf() doesn't insert those either.)

Consider the following example:

#!/usr/bin/perl

# returns 1.66666666666667
print(5/3);

As you might imagine, that's not very friendly. Ideally, you'd like to display just the "significant digits" of the result. And so, you'd use the sprintf() function:

#!/usr/bin/perl

# returns 1.67
print sprintf("%1.2f", (5/3));

A quick word of explanation here: Perl's sprintf() function is very similar to the printf() function that C programmers are used to. In order to format the output, you need to use "field templates", which represent the format you'd like to display.

Some common field templates are:

%s - string

%d - decimal number

%x - hexadecimal number

%o - octal number

%f - float number

You can also combine these field templates with numbers which indicate the field width and precision - for example, %1.2f implies that Perl should display two digits after the decimal point. If you'd like the formatted string to have a minimum length, put the width before the template letter: Perl pads with spaces by default (%5d), or with zeroes if the width is prefixed with a 0 (%05d).
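Here's a sketch that runs through a handful of templates, including width and padding. The last step inserts thousands separators with a repeated substitution, a common idiom rather than anything sprintf() offers; treat it as one possible approach, assuming a plain non-negative integer.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $zero_padded = sprintf("%05d", 1);        # "00001"
my $right       = sprintf("%8s", "Jedi");    # "    Jedi"
my $left        = sprintf("%-8s|", "Jedi");  # "Jedi    |"
my $rounded     = sprintf("%.2f", 5/3);      # "1.67"
my $hex         = sprintf("%x", 255);        # "ff"

# thousands separators: keep inserting a comma before the
# last group of three digits until no more can be added
my $grouped = 1234567;
1 while $grouped =~ s/^(\d+)(\d{3})/$1,$2/;

print "$zero_padded\n$right\n$left\n$rounded\n$hex\n$grouped\n";
```

The last line of output is 1,234,567.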

Desperately Seeking Susan

You can also search for specific patterns in your strings with regular expressions, something that Perl supports better than most other languages.

In Perl, a regular expression is applied to a variable with the binding operator, represented by =~

$flag =~ m/susan/

This expression returns true if $flag contains "susan", using the "m" (match) operator.

You can also perform string substitution with regular expressions with the "s" operator, as in the following example.

$flag =~ s/susan/JANE/

This replaces "susan" in the variable $flag with "JANE" using the "s" operator.
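Both operators accept modifiers after the closing slash. The sketch below uses two of the most common ones, "i" for case-insensitive matching and "g" for "every occurrence, not just the first", on an invented test string. It works on copies so that the original string survives for comparison.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $flag = "Susan met susan and SUSAN";

# plain match: case-sensitive, so only the lower-case "susan" counts
my $found = ($flag =~ m/susan/) ? 1 : 0;            # 1

# "i" modifier: ignore case, so the leading "Susan" matches too
my $found_at_start = ($flag =~ m/^susan/i) ? 1 : 0; # 1

# plain substitution: only the first match is replaced
(my $first_only = $flag) =~ s/susan/JANE/;

# "g" and "i" together: every match, regardless of case;
# s/// returns the number of replacements it made
my $all = $flag;
my $count = ($all =~ s/susan/JANE/gi);              # 3

print "$first_only\n";    # Susan met JANE and SUSAN
print "$all\n";           # JANE met JANE and JANE
```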

Here is a simple example that validates an email address:

#!/usr/bin/perl

# get input
print "So what's your email address, anyway?\n";
$email = <STDIN>;
chomp($email);

# match and display result
if($email =~
/^([a-zA-Z0-9])+([\.a-zA-Z0-9_-])*@([a-zA-Z0-9_-])+(\.[a-zA-Z0-9_-]+)+$/)
{
    print("Ummmmm....that sounds good!\n");
} else
{
    print("Hey - who do you think you're kidding?\n");
}

Obviously, this is simply an illustrative example - if you're planning to use it on your Web site, you need to refine it a bit. You have been warned!

If you want to find out the number of times a particular character appears in a string, Perl offers the very cool "tr" operator, which returns the number of characters it matched.

#!/usr/bin/perl

# get input
print "Gimme a string: ";
$string = <STDIN>;
chomp($string);

# put string into default variable for tr
$_ = $string;

# count the spaces in $_ and print result
$blanks = tr/ / /;
print ("There are $blanks blank characters in \"$string\".\n");

Here's an example session:

Gimme a string: This is a test.
There are 3 blank characters in "This is a test.".
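There's no need to go through the default variable: "tr" can be bound to any variable with =~, and with an empty replacement list it counts without changing anything. Since "tr" works on individual characters, counting a multi-character pattern needs a regular expression instead; the sketch below uses a global match in list context for that. Variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = "This is a test.";

# count spaces; the empty replacement list leaves $string untouched
my $blanks = ($string =~ tr/ //);            # 3

# count any character from a set
my $vowels = ($string =~ tr/aeiouAEIOU//);   # 4

# count occurrences of a pattern: collect all matches
# into an empty list and take the size of that list
my $is_count = () = $string =~ /is/g;        # 2, from "This" and "is"

print "$blanks $vowels $is_count\n";    # 3 4 2
```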

You can have Perl return the offset just past the end of the last global (/g) match in a string with the pos() function,

#!/usr/bin/perl

$string = "The name's Bond, James Bond";

# search for the character d
$string =~ /d/g;

# returns 15
print pos($string);

and automatically quote special characters with backslashes with the quotemeta() function.

#!/usr/bin/perl

$string = "#@!#@!#@!";

$string = quotemeta($string);

# returns \#\@\!\#\@\!\#\@\!
print $string;
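Where quotemeta() really earns its keep is when a string supplied by someone else has to be matched literally inside a regular expression. The sketch below, built on a made-up search string, shows what goes wrong without it; \Q and \E are the in-pattern equivalent of calling quotemeta().

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $needle = "1+1=2";
my $text   = "Everyone knows 1+1=2.";

# unquoted, the + is read as "one or more", so the pattern
# looks for text like "11=2" and fails to find it
my $unquoted = ($text =~ /$needle/) ? 1 : 0;       # 0

# \Q...\E quotes the interpolated string, so it matches literally
my $quoted = ($text =~ /\Q$needle\E/) ? 1 : 0;     # 1

# quotemeta() produces the same escaped string explicitly
my $escaped = quotemeta($needle);                  # 1\+1\=2

print "$unquoted $quoted $escaped\n";
```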

Sadly, that's about all we have time for. In case you want more, consider visiting the following links:

The Perl string API, at http://www.perldoc.com/perl5.6/pod/perlfunc.html

The Perl 101 series, at http://www.melonfire.com/community/columns/trog/article.php?id=5

A discussion of regular expressions, at http://www.melonfire.com/community/columns/trog/article.php?id=2

Until next time...be good.

Note: Examples are illustrative only, and are not meant for a production environment. Melonfire provides no warranties or support for the source code described in this article. YMMV!

This article was first published on 10 Apr 2003.