C#

Learn step by step



Regular Expressions

Regular expressions are one way of searching for substrings within strings: the string is scanned for text that fits a pattern. A familiar example is the pair of wildcard symbols "*" and "?" used on the DOS command line. The first stands for zero or more arbitrary characters, the second for exactly one arbitrary character. Read that way, the search pattern "text?.*" finds the files textf.txt, text1.asp and others like them, but does not find text.txt or text.htm.

.NET Regular Expressions

The regular expression classes make regular expressions available in .NET. They all live in the System.Text.RegularExpressions namespace. On the .NET Framework that namespace is implemented in System.dll, which the C# compiler references by default, so no extra /r: switch is normally needed. For example: csc foo.cs will build the foo.exe assembly with the regular expression classes available.

There are only six classes and one delegate to get acquainted with:

  • Capture: Contains the results of a single match
  • CaptureCollection: A sequence of Capture objects
  • Group: The result of a single group capture, inherits from Capture
  • Match: The result of a single expression match, inherits from Group
  • MatchCollection: A sequence of Match objects
  • MatchEvaluator: A delegate for use during replacement operations
  • Regex: An instance of a compiled regular expression

The Regex class also contains several static methods:

  • Escape: Escapes regex metacharacters within a string
  • IsMatch: Returns a boolean that tells whether the supplied regular expression matches anywhere within the string
  • Match: Returns a Match instance for the first match
  • Matches: Returns all matches as a MatchCollection
  • Replace: Replaces the text matched by the regular expression with a replacement string
  • Split: Returns an array of strings, cutting the input wherever the expression matches
  • Unescape: Unescapes any escaped characters within a string
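
As a rough illustration of the static methods listed above, the sketch below applies IsMatch, Matches, Replace and Split to one input string. The class name StaticMethodsDemo, the helper names and the sample string are assumptions made for this example only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StaticMethodsDemo
{
    // Replace: rewrites every comma, together with any white space
    // around it, as ", ".
    public static string NormalizeSeparators(string list)
    {
        return Regex.Replace(list, @"\s*,\s*", ", ");
    }

    // Split: cuts the string wherever the separator pattern matches.
    public static string[] SplitList(string list)
    {
        return Regex.Split(list, @"\s*,\s*");
    }

    public static void Main()
    {
        // Hypothetical sample input.
        string input = "one, two,three ,  four";

        // IsMatch: does the input contain a digit anywhere?
        Console.WriteLine(Regex.IsMatch(input, @"\d"));   // False

        // Matches: every run of word characters, as a collection.
        foreach (Match m in Regex.Matches(input, @"\w+"))
        {
            Console.WriteLine(m.Value);
        }

        Console.WriteLine(NormalizeSeparators(input));    // one, two, three, four
        Console.WriteLine(SplitList(input).Length);       // 4
    }
}
```

The static methods are convenient for one-off calls; when the same pattern is applied repeatedly, creating one Regex instance and reusing it is usually the better choice.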


using System;
using System.Text.RegularExpressions;

namespace RegularExpressionsSample
{
    class Program
    {
        static void Main(string[] args)
        {
            // "s", then either "i" or "a", then "mple".
            string regularExpression = "s[ia]mple";
            string inputString = "Is it simple sample?";

            // Match returns the first match; NextMatch continues
            // the search from where that match ended.
            Match m = Regex.Match(inputString, regularExpression);
            Console.WriteLine("Match = " + m.ToString());
            Console.WriteLine("Next match = " + m.NextMatch().ToString());
        }
    }
}

Output

Match = simple
Next match = sample

Regular Expression Options:

Regular expression options can be passed to the constructor of the Regex class, and to the static methods that take an options argument.

  • RegexOptions.None - Specifies that no options are set.
  • RegexOptions.IgnoreCase - Specifies case-insensitive matching.
  • RegexOptions.Multiline - Multiline mode. Changes the meaning of ^ and $ so they match at the beginning and end, respectively, of any line, and not just the beginning and end of the entire string.
  • RegexOptions.Singleline - Specifies single-line mode. Changes the meaning of the dot (.) so it matches every character (instead of every character except \n).
  • RegexOptions.ExplicitCapture - Specifies that the only valid captures are groups that are explicitly named or numbered in the form (?<name>...).
  • RegexOptions.IgnorePatternWhitespace - Eliminates unescaped white space from the pattern and enables comments marked with the hash sign (#).
  • RegexOptions.Compiled - Specifies that the regular expression is compiled to an assembly. Matching becomes faster, but constructing the Regex takes longer. Tempting as it is, this option should be used only when the expression will be applied many times, for example inside a loop.
  • RegexOptions.ECMAScript - Enables ECMAScript-compliant behavior for the expression. This flag can be used only in conjunction with the IgnoreCase, Multiline, and Compiled flags. The use of this flag with any other flags results in an exception.
  • RegexOptions.RightToLeft - Specifies that the search will be from right to left instead of from left to right.
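
To make the effect of the options more concrete, here is a small sketch that counts matches of the pattern '^error' under different option combinations. The class name OptionsDemo, the helper CountErrorLines and the sample log text are invented for the illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OptionsDemo
{
    // Counts how often '^error' matches under the given options.
    public static int CountErrorLines(string text, RegexOptions options)
    {
        Regex regex = new Regex("^error", options);
        return regex.Matches(text).Count;
    }

    public static void Main()
    {
        // Hypothetical three-line log.
        string log = "Error: disk full\nerror: retrying\nno error here";

        // No options: ^ matches only at the very start, and "Error" differs in case.
        Console.WriteLine(CountErrorLines(log, RegexOptions.None));        // 0

        // IgnoreCase: "Error" at the start of the string now matches.
        Console.WriteLine(CountErrorLines(log, RegexOptions.IgnoreCase));  // 1

        // Multiline: ^ also matches after each \n, so line 2 matches.
        Console.WriteLine(CountErrorLines(log, RegexOptions.Multiline));   // 1

        // Options are flags and can be combined with |.
        Console.WriteLine(CountErrorLines(log,
            RegexOptions.IgnoreCase | RegexOptions.Multiline));            // 2
    }
}
```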

Basics of regular expression syntax

I will not try to write a complete reference to every symbol used in regular expression patterns; MSDN serves that purpose. Only the basic metacharacters are covered here. In what follows, double quotes mark the text that a regular expression matches, and single quotes mark regular expression syntax.

In a pattern, a metacharacter that should be treated as an ordinary character must be preceded by the escape character \. That is, to find "[", it is necessary to write '\['. The \ signals that the character following it is to be read specially: as a literal metacharacter, a character constant and so on. For example, 'n' means the letter "n", while '\n' means a newline character. The sequence '\\' matches "\", and '\(' matches "(". Because \ is also the escape character of ordinary C# string literals, as in C++, each backslash has to be doubled there ("\\[") unless a verbatim string (@"\[") is used. Other languages differ; VB string literals, for example, need no doubling.
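
The escaping rules can be tried out with a short sketch like the one below. It shows the same pattern written as a regular and as a verbatim C# string, and uses Regex.Escape to search for arbitrary text literally. EscapingDemo and ContainsLiteral are names made up for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapingDemo
{
    // Searches for 'literal' as plain text: Regex.Escape puts a backslash
    // in front of every metacharacter it contains.
    public static bool ContainsLiteral(string input, string literal)
    {
        return Regex.IsMatch(input, Regex.Escape(literal));
    }

    public static void Main()
    {
        // Hypothetical sample input.
        string input = "a[0] = (1 + 2)";

        // Regular string literal: the backslash is doubled for the C# compiler.
        Console.WriteLine(Regex.IsMatch(input, "\\["));          // True

        // Verbatim string literal: the pattern is written as the regex engine sees it.
        Console.WriteLine(Regex.IsMatch(input, @"\["));          // True

        // Without escaping, "(1 + 2)" would be read as a group containing a quantifier.
        Console.WriteLine(ContainsLiteral(input, "(1 + 2)"));    // True
        Console.WriteLine(ContainsLiteral(input, "(1 + 3)"));    // False
    }
}
```

Verbatim strings are generally the more readable choice for patterns, which is why the later listings in this article use them.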

Classes of symbols (Character class)

Square brackets specify a set of characters (called a character class) to search for. For example, 'b[ai]rge' matches the words "barge" and "birge": a "b", followed by either "a" or "i", and ending in "rge". The reverse is also possible, that is, a class can list the characters that must not occur at that position. Thus '[^1-6]' matches any character except the digits 1 through 6. Note that inside a character class '\b' stands for the backspace character.
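
A minimal sketch of the two character classes just mentioned follows. The helper FindAll and the sample strings are assumptions chosen for the demonstration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CharacterClassDemo
{
    // Collects the text of every match of 'pattern' in 'input'.
    public static string[] FindAll(string input, string pattern)
    {
        List<string> found = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            found.Add(m.Value);
        }
        return found.ToArray();
    }

    public static void Main()
    {
        // [ai] accepts "a" or "i" in second position, so "burge" is skipped.
        Console.WriteLine(string.Join(" ", FindAll("barge birge burge", "b[ai]rge")));  // barge birge

        // [^1-6] accepts every character except the digits 1 through 6.
        Console.WriteLine(string.Join(" ", FindAll("0123456789", "[^1-6]")));           // 0 7 8 9
    }
}
```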

Quantifiers
When it is not known in advance how many characters the required substring contains, special symbols called quantifiers can be used. For example, 'hel+o' matches "he", followed by one or more "l", and ending in "o". Keep in mind that a quantifier applies to the preceding expression (a character, a class or a group), not necessarily to a single character.

Symbol   Overview
*        Matches 0 or more occurrences of the preceding expression.
         For example, 'zo*' matches "z" and "zoo".
+        Matches 1 or more occurrences of the preceding expression.
         For example, 'zo+' matches "zo" and "zoo", but not "z".
?        Matches 0 or 1 occurrence of the preceding expression.
         For example, 'do(es)?' matches "do" and "does".
*?       Matches 0 or more occurrences of the preceding expression, using as few characters as possible.
         For example, 'dog*?' matches only "do" in "dogg".
+?       Matches 1 or more occurrences of the preceding expression, using as few characters as possible.
         For example, 'dog+?' matches only "dog" in "dogg".
{n}      n is a non-negative integer. Matches exactly n occurrences.
         For example, 'o{2}' does not match the "o" in "Bob", but matches the two "o" in "food".
{n,}     n is a non-negative integer. Matches at least n occurrences.
         For example, 'o{2,}' does not match the "o" in "Bob", but matches all the "o" in "foooood".
         'o{1,}' is equivalent to 'o+'.
         'o{0,}' is equivalent to 'o*'.
{n,m}    n and m are non-negative integers, where n <= m. Matches at least n and at most m occurrences.
         For example, 'o{1,3}' matches the first three "o" in "fooooood".
         'o{0,1}' is equivalent to 'o?'.
         No space is allowed between the comma and the numbers.
|        Placed between two expressions, matches one or the other (this is called alternation, because it chooses between two alternatives).

An important feature of the quantifiers '*' and '+' is their greed: they match as much as they can, which is not always what is wanted.
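
The difference between a greedy and a lazy quantifier is easiest to see side by side. The sketch below is only an illustration; the helper FirstMatch and the sample markup are assumptions, not part of the original text.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedinessDemo
{
    // Returns the text of the first match, or "" when there is none.
    public static string FirstMatch(string input, string pattern)
    {
        return Regex.Match(input, pattern).Value;
    }

    public static void Main()
    {
        // Hypothetical sample input.
        string html = "<b>bold</b> and <i>italic</i>";

        // Greedy: .+ runs to the end and backs off only as far as the last ">".
        Console.WriteLine(FirstMatch(html, "<.+>"));    // <b>bold</b> and <i>italic</i>

        // Lazy: .+? stops at the first ">" that lets the pattern succeed.
        Console.WriteLine(FirstMatch(html, "<.+?>"));   // <b>
    }
}
```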

Other special characters

The table below lists other special symbols:

Symbol   Overview
\a       Matches a bell (alarm).
\b       Matches a backspace inside a [] character class; elsewhere it matches a word boundary.
\t       Matches a tab.
\r       Matches a carriage return.
\v       Matches a vertical tab.
\f       Matches a form feed.
\n       Matches a new line.
\e       Matches an escape.
\w       Matches a word character: a letter, a digit or the underscore (a-z, A-Z, 0-9 and _ in ASCII text).
\W       Matches any character that is not a word character.
\s       Matches any white-space character (space, tab, new line and so on).
\S       Matches any character that is not white space.
\d       Matches a digit (0-9).
\D       Matches any character that is not a digit.
.        Matches any character except a new line (\n), unless RegexOptions.Singleline is set.
$        Matches at the end of the string or, with RegexOptions.Multiline, at the end of a line.
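
A few of these symbols in action, as a hedged sketch: the helpers CountWholeWord and DigitsOnly and all sample strings are hypothetical, and the word-boundary form of \b (outside a character class) is the one used here.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ShorthandDemo
{
    // Counts occurrences of 'word' that are not part of a longer word.
    public static int CountWholeWord(string text, string word)
    {
        string pattern = @"\b" + Regex.Escape(word) + @"\b";
        return Regex.Matches(text, pattern).Count;
    }

    // Removes every character that is not a digit.
    public static string DigitsOnly(string text)
    {
        return Regex.Replace(text, @"\D", "");
    }

    public static void Main()
    {
        // "concat" and "cat_1" do not count: letters, digits and "_" are word characters.
        Console.WriteLine(CountWholeWord("cat concat cat_1 cat", "cat"));   // 2

        Console.WriteLine(DigitsOnly("Tel: 555 1234"));                      // 5551234

        // \s+ treats any run of spaces, tabs and new lines as one separator.
        Console.WriteLine(Regex.Split("a \t b\nc", @"\s+").Length);          // 3
    }
}
```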

Grouping and Backreferences

You can group patterns by placing them in parentheses. You can give a name to the group as well. Here are some of the grouping constructs you'll be using:

Construction   Overview
()             Defines a simple (numbered) group.
(?<name>)      Defines a group named "name".
(?i:)          Ignores case when matching within the group.
\number        Matches the text captured by a previous group with that number.
               For example, '(\w)\1' finds doubled word characters.
\k<name>       Matches the text captured by a previous group with the specified name.
               For example, '(?<char>\w)\k<char>' finds doubled word characters. The expression '(?<43>\w)\43' does the same. You can use single quotes instead of angle brackets; for example, \k'char'.
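
As one possible use of a named group with a \k backreference, the sketch below looks for words that are accidentally typed twice in a row. The class RepeatedWordDemo, the group name "word" and the sample sentence are assumptions for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RepeatedWordDemo
{
    // Returns each word that is immediately followed by a copy of itself.
    public static string[] FindRepeatedWords(string text)
    {
        // (?<word>\w+) captures a word; \k<word> requires the same text again.
        Regex regex = new Regex(@"\b(?<word>\w+)\s+\k<word>\b");
        List<string> repeated = new List<string>();
        foreach (Match m in regex.Matches(text))
        {
            repeated.Add(m.Groups["word"].Value);
        }
        return repeated.ToArray();
    }

    public static void Main()
    {
        string[] words = FindRepeatedWords("this is is a test test case");
        Console.WriteLine(string.Join(", ", words));   // is, test
    }
}
```

The \b anchors keep the pattern from reporting the "is" inside "this is" as a repetition.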

using System;
using System.Text.RegularExpressions;

namespace RegularExpressionsSample
{
    class Program
    {
        static void Main(string[] args)
        {
            // A dollar sign, one or more digits, a dot and exactly two digits.
            // Should match everything except the last two amounts.
            string regularExpression = @"\$(\d+)\.(\d\d)";
            string inputString = "$1.57 $316.15 $19.30 $0.30 $0.00 $41.10 $5.1 $.5";

            for (Match m = Regex.Match(inputString, regularExpression); m.Success; m = m.NextMatch())
            {
                GroupCollection gc = m.Groups;

                Console.WriteLine("The number of groups: " + gc.Count);
                // Group 0 is the entire matched string itself,
                // while Group 1 is the first group to be captured.
                for (int i = 0; i < gc.Count; i++)
                {
                    Group g = gc[i];
                    Console.WriteLine(g.Value);
                }
            }
        }
    }
}

Output

The number of groups: 3
$1.57
1
57
The number of groups: 3
$316.15
316
15
The number of groups: 3
$19.30
19
30
The number of groups: 3
$0.30
0
30
The number of groups: 3
$0.00
0
00
The number of groups: 3
$41.10
41
10

Replacement

Substitutions are allowed only within a replacement pattern. For similar functionality within a regular expression, use a backreference such as \1.

Character escapes and substitutions are the only special constructs recognized in a replacement pattern. All other syntactic constructs are allowed in regular expressions only and not recognized in replacement patterns. For example, the replacement pattern 'a*${test}b' inserts the string "a*" followed by the substring matched by the "test" capturing group, if any, followed by the string "b". The * character is not recognized as a metacharacter within a replacement pattern. Similarly, $-patterns are not recognized within a regular expression matching pattern. Within a regular expression, $ denotes the end of the string. Other examples are: '$123' substitutes the last substring matched by group number 123 (decimal), and ${name} substitutes the last substring matched by a (?<name>) group.


Formatting string with replace method.


using System;
using System.Text.RegularExpressions;

namespace RegularExpressionsSample
{
    class Program
    {
        static void Main(string[] args)
        {
            // Group 1: leading white space, group 2: variable name, group 3: type name.
            string regularExpression = @"(\s*)Dim\s+(\w+)\s+As\s+(\w+)";
            string inputString = "Dim abc As Integer";
            // Keep the indentation, then emit the type before the name, C# style.
            string replacement = "$1$3 $2;";
            Console.WriteLine(Regex.Replace(inputString, regularExpression, replacement));
        }
    }
}

Output

Integer abc;
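
Besides numbered substitutions, a replacement can refer to named groups with ${name}, and the MatchEvaluator delegate allows the replacement text to be computed in code. The following sketch shows both; ReplacementDemo, ReorderDate and DoubleNumbers are hypothetical names, and the date layout is an assumption of the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReplacementDemo
{
    // Rewrites yyyy-mm-dd as dd/mm/yyyy using named-group substitutions.
    public static string ReorderDate(string input)
    {
        return Regex.Replace(
            input,
            @"(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})",
            "${day}/${month}/${year}");
    }

    // Doubles every number in the text. The lambda is converted to a
    // MatchEvaluator and is called once per match.
    public static string DoubleNumbers(string input)
    {
        return Regex.Replace(
            input,
            "[0-9]+",
            m => (int.Parse(m.Value) * 2).ToString());
    }

    public static void Main()
    {
        Console.WriteLine(ReorderDate("Released 2024-01-15"));   // Released 15/01/2024
        Console.WriteLine(DoubleNumbers("3 apples, 12 pears"));  // 6 apples, 24 pears
    }
}
```

A replacement string is enough when the matched pieces only need rearranging; a MatchEvaluator is worth its extra code when the result depends on a calculation or a lookup.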

Lookaround
There are two directions of lookaround - lookahead and lookbehind - and two flavors of each direction - positive assertion and negative assertion. The syntax for each is:

  • (?=...) - Positive lookahead
  • (?!...) - Negative lookahead
  • (?<=...) - Positive lookbehind
  • (?<!...) - Negative lookbehind

Understanding look(ahead|behind) requires an understanding of the difference between matching text and matching position. To help with this understanding I should state first that lookaround assertions are non-consuming. To see what I mean, let's look at the following simple example.

regularExpression = "stop";
inputString = "stopping";

When the above pattern is applied to the text, the "context" of the parser ends up at the position between the two "p" characters in the word stopping. This is because the regular expression parser bumps along the string as it gets a match, like so:

  • Start - ^stopping
  • Match "s" - s^topping
  • Match "t" - st^opping
  • Match "o" - sto^pping
  • Match "p" - stop^ping

Once the parser has consumed a character, that character is part of the match and the parser's position lies beyond it. To understand where this causes difficulty, consider this: what if you needed to match the word "stop", but only when it was contained in the word "stopped" and not in any other combination such as "stopper"? With lookahead you can simply assert that condition, like so: (?=stopped\b)stop

This works because, with lookaround, the parser is not bumped along the string. This can be especially useful for finding a position in a document by combining a lookahead assertion with a lookbehind assertion. To demonstrate, suppose we need to match the string "stop" when it is contained within the string "estopped" but not within "astopped". To do this, combine a negative lookbehind assertion on "a" with a positive lookahead assertion on "stopped", like this: (?<!a)(?=stopped\b)stop

In other words, you are matching a position at which to start matching text. In the string "estopped" both assertions hold at the position just after the "e", and only from there does the parser consume "stop":

  • Start - e^stopped
  • Match "s" - es^topped
  • Match "t" - est^opped
  • Match "o" - esto^pped
  • Match "p" - estop^ped
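
The combined lookbehind and lookahead pattern can be checked with a sketch along these lines. The helper FindStopPositions and the test sentence are assumptions; the pattern itself is the one discussed above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    // Returns the start index of every "stop" that begins the word
    // "stopped" and is not preceded by the letter "a".
    public static int[] FindStopPositions(string text)
    {
        Regex regex = new Regex(@"(?<!a)(?=stopped\b)stop");
        List<int> positions = new List<int>();
        foreach (Match m in regex.Matches(text))
        {
            positions.Add(m.Index);
        }
        return positions.ToArray();
    }

    public static void Main()
    {
        // Hypothetical input: only "estopped" and the final "stopped" qualify.
        string text = "estopped astopped stopper stopped";
        foreach (int position in FindStopPositions(text))
        {
            // Each match is only four characters long: the assertions
            // check the surrounding text without consuming it.
            Console.WriteLine("stop at index " + position);
        }
        // Prints:
        // stop at index 1
        // stop at index 26
    }
}
```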


One example of using lookaround is validating "special" password conditions such as: "Password must be between 8 and 20 characters, must contain at least 2 letter characters and at least 2 digit characters. It can only contain either letter or digit characters."

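For such a password constraint, the following expression would probably do quite nicely: ^(?=.*?\d.*?\d)(?=.*?[a-zA-Z].*?[a-zA-Z])[a-zA-Z\d]{8,20}$ Note that \w would not do for the letter test, because it also matches digits and the underscore.

using System;
using System.Text.RegularExpressions;

namespace RegularExpressionsSample
{
    class Program
    {
        static void Main(string[] args)
        {
            // First lookahead: at least two digits somewhere in the input.
            // Second lookahead: at least two letters somewhere in the input.
            // Body: 8 to 20 characters, letters and digits only.
            string regularExpression = @"^(?=.*?\d.*?\d)(?=.*?[a-zA-Z].*?[a-zA-Z])[a-zA-Z\d]{8,20}$";
            Console.WriteLine("Please input password for check:");
            string inputString = Console.ReadLine();
            // ReadLine returns null at end of input, so check for null as well as "".
            if (!string.IsNullOrEmpty(inputString) && Regex.IsMatch(inputString, regularExpression))
            {
                Console.WriteLine("It's correct security password");
            }
            else
            {
                Console.WriteLine("It's incorrect password.");
            }
            Console.Read();
        }
    }
}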

Output

Please input password for check:
abc4D5678
It's correct security password

Commonly used regular expression patterns

Pattern
    Description

^[\w-\.]+@([\w-]+\.)+[\w-]{2,4}$
    Matches email addresses and checks that they are of the proper form. It checks that the top-level domain is between 2 and 4 characters long.

^(\(?\+?[0-9]*\)?)?[0-9_\- \(\)]*$
    Matches phone numbers, allowing for an international dialing code at the start and for the hyphens and spaces that are sometimes entered.

^\d{1,2}\/\d{1,2}\/\d{4}$
    Matches dates of the form XX/XX/YYYY, where XX can be 1 or 2 digits long and YYYY is always 4 digits long.

^([0-1][0-9]|[2][0-3]):([0-5][0-9])$
    Matches a time in the format HH:MM.

\b(([01]?\d?\d|2[0-4]\d|25[0-5])\.){3}([01]?\d?\d|2[0-4]\d|25[0-5])\b
    Matches decimal IP addresses.

^\d{5}$|^\d{5}-\d{4}$
    Matches either a 5-digit ZIP code or a ZIP+4 code formatted as 5 digits, a hyphen, and another 4 digits.

^\d{3}-\d{2}-\d{4}$
    Matches a hyphen-separated Social Security Number (SSN) in the format NNN-NN-NNNN.

^(((4\d{3})|(5[1-5]\d{2})|(6011))-?\d{4}-?\d{4}-?\d{4}|3[47]\d{13})$
    Matches major credit cards including: Visa (length 16, prefix 4), Mastercard (length 16, prefix 51-55), Discover (length 16, prefix 6011), American Express (length 15, prefix 34 or 37). All 16-digit formats accept optional hyphens (-) between each group of four digits.

(http|https|ftp)\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(:[a-zA-Z0-9]*)?/?([a-zA-Z0-9\-\._\?\,\'/\\\+&%\$#\=~])*
    Matches some URLs.
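
Patterns like these are typically wrapped in small validation helpers. The sketch below does that for the date, time and ZIP code patterns from the table; the class PatternDemo and its method names are invented here, and the checks cover only the format, not whether the value itself is meaningful (31/31/2024 passes the date pattern).

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternDemo
{
    // XX/XX/YYYY with one- or two-digit XX parts.
    public static bool IsDate(string s)
    {
        return Regex.IsMatch(s, @"^\d{1,2}\/\d{1,2}\/\d{4}$");
    }

    // HH:MM on a 24-hour clock.
    public static bool IsTime(string s)
    {
        return Regex.IsMatch(s, @"^([0-1][0-9]|[2][0-3]):([0-5][0-9])$");
    }

    // Five-digit ZIP code or ZIP+4.
    public static bool IsZip(string s)
    {
        return Regex.IsMatch(s, @"^\d{5}$|^\d{5}-\d{4}$");
    }

    public static void Main()
    {
        Console.WriteLine(IsDate("1/2/2024"));      // True
        Console.WriteLine(IsDate("2024-01-02"));    // False
        Console.WriteLine(IsTime("23:59"));         // True
        Console.WriteLine(IsTime("24:00"));         // False
        Console.WriteLine(IsZip("12345-6789"));     // True
        Console.WriteLine(IsZip("1234"));           // False
    }
}
```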

