OCLUG, 2009-05-05, Richard Guy Briggs, REGEXes

rgb@tricolour.net, http://tricolour.net

What is a REGEX?

From Wikipedia:

In computing, regular expressions provide a concise and flexible means for identifying strings of text of interest, such as particular characters, words, or patterns of characters. Regular expressions (abbreviated as regex or regexp, with plural forms regexes, regexps, or regexen) are written in a formal language that can be interpreted by a regular expression processor, a program that either serves as a parser generator or examines text and identifies parts that match the provided specification.

Why is this useful?

REGEXes can be used to search a large volume of text and isolate the items of interest. They can also be used in search-and-replace patterns to automate the replacement of specific patterns of text. There are Basic Regular Expressions (BRE) and Extended Regular Expressions (ERE); which one a tool uses, and the exact differences between them, depend on the environment. In BREs, metacharacters such as ( ) { } must be escaped with a backslash to take on their special meaning, whereas in EREs they are special when unescaped.

What is some of the history?

In the 1950s, mathematician Stephen Kleene came up with the notation for "regular sets". UNIX founder Ken Thompson later incorporated this notation first into the QED editor, subsequently into the ed editor, and then into the separate command "grep", which gets its name from the ed regular expression search command g/re/p (global / regular expression / print).

The first free REGEX library was written by Henry Spencer in 1986. It is the basis for most of the REGEX implementations to which you would be exposed.

PerlRE was derived from it and extended.

The PCRE library was developed by Philip Hazel for Exim and is intended to closely mimic the extended functionality of PerlRE.

Concepts:

Atoms: An atom is a unit to be matched, possibly with a quantification.
Literal: any normal character that matches itself.
- ```
cat
```
  matches "cat" anywhere in the input to be searched.
Branch: "|" is used to separate alternatives
- ```
a|b
```
  matches "a" or "b"
Grouping: "(" and ")" can be used to group a pattern for alternatives, or for later reference for another match or for replacement.
- ```
grey|gray
```
  is the same as
```
gr(e|a)y
```
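
The concepts above can be tried out in any REGEX engine. The following is a minimal sketch using Python's standard `re` module (a Perl-style engine, chosen here only for illustration; the variable names are hypothetical and not part of the original talk):

```python
import re

# Literal: "cat" matches anywhere in the input, even inside a longer word.
literal_hit = re.search(r"cat", "concatenate") is not None
print(literal_hit)      # True

# Branch: "a|b" matches either alternative.
branch_hits = re.findall(r"a|b", "abc")
print(branch_hits)      # ['a', 'b']

# Grouping: "gr(e|a)y" accepts exactly the same words as "grey|gray".
words = ["grey", "gray", "groy"]
grouped = [w for w in words if re.fullmatch(r"gr(e|a)y", w)]
branched = [w for w in words if re.fullmatch(r"grey|gray", w)]
print(grouped)          # ['grey', 'gray']
print(grouped == branched)  # True

# A group can also be referenced later, here as \1 in a replacement.
replaced = re.sub(r"gr(e|a)y", r"gr[\1]y", "grey and gray")
print(replaced)         # gr[e]y and gr[a]y
```
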
Quantification, greedy:

? 0 or 1 occurrence

* 0 or more occurrences

+ 1 or more occurrences

{N} N occurrences

{M,N} between M and N occurrences

{,N} at most N occurrences

{N,} at least N occurrences

Adding a "?" after a quantifier makes it lazy: it matches the shortest string possible.
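
The difference between greedy and lazy quantifiers is easiest to see side by side. This is a small sketch in Python's `re` module; the sample text and variable names are made up for the illustration:

```python
import re

text = "<b>one</b> and <b>two</b>"

# Greedy: ".*" takes as much as possible, so the match runs to the LAST </b>.
greedy = re.findall(r"<b>.*</b>", text)
# Lazy: ".*?" takes as little as possible, stopping at the FIRST </b>.
lazy = re.findall(r"<b>.*?</b>", text)

print(greedy)  # ['<b>one</b> and <b>two</b>']
print(lazy)    # ['<b>one</b>', '<b>two</b>']

# Counted quantifiers.
optional_u = re.fullmatch(r"colou?r", "color") is not None   # "u" occurs 0 or 1 times
exactly_3 = re.fullmatch(r"a{3}", "aaa") is not None         # exactly 3 occurrences
too_many = re.fullmatch(r"a{2,3}", "aaaa") is not None       # 4 is outside 2..3
two_plus = re.findall(r"\d{2,}", "7 42 2009")                # at least 2 digits

print(optional_u, exactly_3, too_many)  # True True False
print(two_plus)                         # ['42', '2009']
```
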

Special Sequences:

\t	tab
\n	newline
\r	carriage return
\b	match any word boundary
\B	not word boundary
^	beginning of line
$	end of line
[]	"bracket expression" is a set of characters
^	as the first character within [], means "not the following characters"
.	match any character except newline
\.	match "."
\\	match "\"
\^	match "^"
\$	match "$"
\{	match "{"
\}	match "}"
\(	match "("
\)	match ")"
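
A few of these special sequences in action, as a rough sketch in Python's `re` module. The sample line and variable names are invented for the example, and other engines may differ in detail (for instance, not every BRE implementation supports \b):

```python
import re

line = "Fred met Freddy. Price: $5.00"

# \b restricts the match to "Fred" as a whole word.
whole_word = re.findall(r"\bFred\b", line)
# Without \b, the "Fred" inside "Freddy" matches too.
anywhere = re.findall(r"Fred", line)
print(whole_word)   # ['Fred']
print(anywhere)     # ['Fred', 'Fred']

# ^ anchors at the beginning of the line, $ at the end.
starts_with = re.search(r"^Fred", line) is not None
ends_with = re.search(r"Fred$", line) is not None
print(starts_with, ends_with)   # True False

# Bracket expression, and its negation with a leading ^.
vowels = re.findall(r"[aeiou]", "regex")
non_vowels = re.findall(r"[^aeiou]", "regex")
print(vowels)       # ['e', 'e']
print(non_vowels)   # ['r', 'g', 'x']

# "." matches any character; "\." matches only a literal dot.
dot_any = re.findall(r"5.00", "5.00 5x00")
dot_literal = re.findall(r"5\.00", "5.00 5x00")
print(dot_any)      # ['5.00', '5x00']
print(dot_literal)  # ['5.00']

# "\$" matches a literal dollar sign rather than end of line.
price = re.findall(r"\$\d+\.\d{2}", line)
print(price)        # ['$5.00']
```
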

Character Classes:

POSIX	Perl	ASCII	Description
`[:alnum:]`		`[A-Za-z0-9]`	Alphanumeric characters
`[:word:]`	`\w`	`[A-Za-z0-9_]`	Alphanumeric characters plus "_"
	`\W`	`[^\w]`	non-word character
`[:alpha:]`		`[A-Za-z]`	Alphabetic characters
`[:blank:]`		`[ \t]`	Space and tab
`[:cntrl:]`		`[\x00-\x1F\x7F]`	Control characters
`[:digit:]`	`\d`	`[0-9]`	Digits
	`\D`	`[^\d]`	non-digit
`[:graph:]`		`[\x21-\x7E]`	Visible characters
`[:lower:]`		`[a-z]`	Lowercase letters
`[:print:]`		`[\x20-\x7E]`	Visible characters and spaces
`[:punct:]`		[-!"#$%&'()*+,./:;<=>?@[\\\]^_`{|}~]	Punctuation characters
`[:space:]`	`\s`	`[ \t\r\n\v\f]`	Whitespace characters
	`\S`	`[^\s]`	non-whitespace character
`[:upper:]`		`[A-Z]`	Uppercase letters
`[:xdigit:]`		`[A-Fa-f0-9]`	Hexadecimal digits
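
The Perl-style shorthands and the ASCII bracket expressions from the table can be compared directly. The sketch below uses Python's `re` module, which does not support the POSIX `[:name:]` classes, so only the Perl and ASCII columns are exercised. The `re.ASCII` flag is used because Python's `\d`, `\w` and `\s` otherwise also match non-ASCII Unicode characters; the sample string is made up:

```python
import re

sample = "Room 42b,\tFloor_3 0xFF"

# \d versus the explicit ASCII class [0-9]
digits_short = re.findall(r"\d+", sample, flags=re.ASCII)
digits_ascii = re.findall(r"[0-9]+", sample)
print(digits_short)   # ['42', '3', '0']

# \w versus [A-Za-z0-9_]
words_short = re.findall(r"\w+", sample, flags=re.ASCII)
words_ascii = re.findall(r"[A-Za-z0-9_]+", sample)
print(words_short)    # ['Room', '42b', 'Floor_3', '0xFF']

# \s matches spaces and tabs (among other whitespace)
spaces = re.findall(r"\s", sample, flags=re.ASCII)
print(spaces)         # [' ', '\t', ' ']

# Hexadecimal digits, the ASCII equivalent of [:xdigit:]
hex_digits = re.findall(r"0x([A-Fa-f0-9]+)", sample)
print(hex_digits)     # ['FF']
```
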

Examples:

email address:
- \b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b (with case-insensitive matching)
html bold tag:
- <b[^>]*>(.*?)</b>
10-digit phone number:
- \(?\d{3}[) ]\s?\d{3}[- ]\d{4} (poor)
- (\(\d{3}\)|\d{3})\s?\d{3}[- ]\d{4} (better)
IP address:
- (\d{1,3}\.){3}\d{1,3} (poor)
- ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?) (better)
grep:
- grep "\bFred\b" addresses.txt
sed:
- sed "s/\([a-zA-Z0-9._%+-]\+\)@\([a-zA-Z0-9.-]\+\.[a-zA-Z]\{2,4\}\)/\1-at-\2/g" < addresses.txt > addresses2.txt
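
To see why the second IP address pattern is "better", and what an address-obscuring substitution like the sed one does, here is a hedged sketch in Python's `re` module. The helper `is_ip`, the constant names and the sample strings are hypothetical; the patterns are the IP address ones listed above, plus an ERE-style equivalent of the sed expression:

```python
import re

# The two IP address patterns from the examples.
POOR_IP = r"(\d{1,3}\.){3}\d{1,3}"
OCTET = r"(2[0-4]\d|25[0-5]|[01]?\d\d?)"
BETTER_IP = r"(" + OCTET + r"\.){3}" + OCTET


def is_ip(pattern, candidate):
    """Return True if the whole candidate string matches the pattern."""
    return re.fullmatch(pattern, candidate) is not None


# The poor pattern only counts digits, so it accepts impossible octets.
poor_accepts_bad = is_ip(POOR_IP, "999.999.999.999")
# The better pattern limits each octet to 0-255.
better_accepts_bad = is_ip(BETTER_IP, "999.999.999.999")
better_accepts_good = is_ip(BETTER_IP, "192.168.0.255")
print(poor_accepts_bad, better_accepts_bad, better_accepts_good)  # True False True

# Search and replace with two groups, referenced as \1 and \2.
EMAIL = r"([a-zA-Z0-9._%+-]+)@([a-zA-Z0-9.-]+\.[a-zA-Z]{2,4})"
obscured = re.sub(EMAIL, r"\1-at-\2", "Contact fred@example.com today")
print(obscured)  # Contact fred-at-example.com today
```

Note that Python uses ERE-like syntax, so the groups and `+` are written without the backslashes that sed's default BRE mode requires.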

Where is it used?

regex: Vim, expr, lex, Emacs, sed, grep, awk

PerlRE: Perl, Python, Java, Ruby, TCL

PCRE: Perl-Compatible Regular Expressions: PHP, Apache HTTP Server, Exim MTA, KDE, Postfix, Analog, Nmap, Safari

More information:

general concept: http://en.wikipedia.org/wiki/Regex
manpage: man 7 regex (general concept)
manpage: man 3 regex (C function)
manpage: man grep
manpage: man sed
vim: :help regex
http://tricolour.net/oclug/regex-tutorial.html
