String pattern

A string pattern is a string composed of combinations of characters that can be used to describe all strings of a certain form. String patterns are used with some functions provided by Lua, for example to tell them to look in a string for substrings of a certain form. The following is an example of what can be done with string patterns:

local s = "I am a string!"
for i in string.gmatch(s, "%S+") do --Where "%S+" is the string pattern.
    print(i)
end
I
am
a
string!

But what makes the code above so cool? Perhaps you've wanted to make a list of people without using a table, or maybe you need to parse a string. String patterns can help do this!

As said before, string patterns are strings that look a little different and are used for a different purpose than ordinary strings. Here we will look at what the different parts of a string pattern mean.

In these examples, we will use the string.match and string.gmatch functions.

Simple matching

Guess what? You already know some string patterns! Any string that contains no magic characters (covered below) is a pattern that matches itself.

pattern = "Roblox"
print( ("Welcome to Roblox"):match(pattern) )
print( ("Welcome to the Wiki"):match(pattern) )
Roblox
nil

Character Classes

There's only so far we can go by using this kind of pattern matching. Sometimes, we want to match any of a set of characters. Here's an example:

pattern = "%d words"
print( ("This sentence has 5 words"):match(pattern) )
print( ("This one has more than 2 words"):match(pattern) )
5 words
2 words

The following table shows the meaning of each character class:

Pattern   Represents                                       Example matches
.         Any character                                    #32kasGJ1%fTlk?@94
%a        An uppercase or lowercase letter                 aBcDeFgHiJkLmNoPqRsTuVwXyZ
%l        A lowercase letter                               abcdefghijklmnopqrstuvwxyz
%u        An uppercase letter                              ABCDEFGHIJKLMNOPQRSTUVWXYZ
%p        Any punctuation character                        #^;,.
%w        An alphanumeric character (a letter or a digit)  aBcDeFgHiJkLmNoPqRsTuVwXyZ0123456789
%d        Any digit                                        0123456789
%s        A whitespace character                           space, \t, \n, and \r
%c        A control character
%x        A hexadecimal (base 16) digit                    0123456789ABCDEFabcdef
%z        The NULL character, '\0'

Two more pattern items start with % but are not character classes:

%f[set]   The frontier pattern (not officially documented). It matches the empty position where a character outside the set is followed by a character inside it.
%bxy      The balanced match. It matches x, y, and everything in between, and it handles nested pairs (x and y must be different characters). For example, %b() matches everything between a pair of parentheses, including the parentheses.
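The last two entries are easier to understand with an example. The following is a small sketch; the sample strings and variable names are made up for illustration.

```lua
-- %b() matches a balanced pair of parentheses, including nested pairs.
local call = "print(math.max(1, 2)) -- done"
local args = call:match("%b()")
print(args) --> (math.max(1, 2))

-- %f[%w] matches the empty position where a non-alphanumeric character
-- (or the start of the string) is followed by an alphanumeric one.
local assignment = "y10 = 3"
local plain = assignment:match("%d+")           -- digits anywhere
local frontier = assignment:match("%f[%w]%d+")  -- digits that start a word
print(plain)    --> 10
print(frontier) --> 3
```

Without the frontier, %d+ picks up the "10" inside "y10". With it, the match has to begin where a word begins, so only the standalone "3" qualifies.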

Any non-magic character (not one of ^$()%.[]*+-?) represents itself in a pattern. To search for a literal magic character, precede it with a %. For example, to look for a percent symbol, use %%.

Here's an example that shows how the . character can be used. . matches any character, and %. matches an actual period.

pattern = "A t.n of spam%."
print( ("A tin of spam."):match(pattern) )
print( ("A tun of spam."):match(pattern) )
print( ("A ton of spam!"):match(pattern) ) -- not a period!
A tin of spam.
A tun of spam.
nil
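If the text you want to find comes from somewhere else (user input, for example), you may not know which magic characters it contains. One possible approach is to escape all of them before searching. The escapePattern function below is a hypothetical helper written for this example; it is not part of Lua's string library.

```lua
-- Hypothetical helper: puts a % in front of every magic character so the
-- returned pattern matches the original text literally.
local function escapePattern(text)
    -- The extra parentheses drop gsub's second return value (the count).
    return (text:gsub("[%^%$%(%)%%%.%[%]%*%+%-%?]", "%%%0"))
end

local price = "$9.99 (sale)"
local pattern = escapePattern(price)
local found = ("Now only $9.99 (sale)!"):match(pattern)
print(pattern) --> %$9%.99 %(sale%)
print(found)   --> $9.99 (sale)
```

If you only need to locate literal text, string.find also accepts true as its fourth argument, which turns pattern matching off for that call.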

One of the things you might notice about the character classes above is that they are all lowercase. Making them uppercase reverses their effect. For instance, %s represents whitespace, but %S represents any non-whitespace character. %l represents a lowercase letter, while %L represents its complement: any character except a lowercase letter. Let's look at this example, which matches a digit followed by five non-digits:

pattern = "%d%D%D%D%D%D"
print( ("This sentence has 5 words"):match(pattern) )
print( ("21 times 3 equals 63"):match(pattern) )
5 word
1 time

Quantifiers

Character classes describe which characters may match. Quantifiers describe how many of them may match in a row.

Quantifier   Meaning
?            Match 0 or 1 of the preceding character specifier
*            Match 0 or more of the preceding character specifier, as many as possible
+            Match 1 or more of the preceding character specifier, as many as possible
-            Match 0 or more of the preceding character specifier, as few as possible

The + quantifier

Let's say you have a string that contains a number, such as "It costs 100 tix", and you want to extract the number. If you know how many digits the number has, you could use the pattern %d%d%d which would match three digits in a row. But what happens if you don't know how many digits there are? For this, you can use quantifiers. In this example, the + quantifier is suitable.

pattern = "%d+"
print( ("It costs 100 tix"):match(pattern) ) 
print( ("It costs OVAR 9000 tix"):match(pattern) )
100
9000

Now how does this work exactly? As we know, a character class followed by a '+' matches one or more repetitions. For this example, it means that it would match the first digits it finds until it reaches the end of the string or a non-digit.
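Because %d+ grabs a whole run of digits at once, it combines nicely with string.gmatch when a string holds several numbers. Here is a short sketch that adds them up; the price list is invented for the example.

```lua
local prices = "hat: 15 tix, sword: 120 tix, cape: 40 tix"
local total = 0

-- Each iteration yields one complete run of digits as a string.
for digits in prices:gmatch("%d+") do
    total = total + tonumber(digits)
end

print(total) --> 175
```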

The * quantifier

The difference between + and * is that + matches 1 or more characters, while * matches 0 or more. This means that if the character class that is followed by this quantifier isn't represented in the string, it doesn't matter, because the match isn't required.

pattern = "%d,%s*%d" -- digit, comma, whitespace, digit
print( ("1,    2"):match(pattern) )
print( ("3,4"):match(pattern) )
1,    2
3,4

As you can see, it matches a digit, followed by a comma, any amount of whitespace (if there is any), and then another digit. If you had used +, the second example would have returned nil, because + requires at least one match. The * pattern is very useful when you have something in the string that is optional.
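To see that difference for yourself, you can run both versions of the pattern side by side. The variable names strict and lenient are just labels chosen for this sketch.

```lua
local strict = "%d,%s+%d"   -- at least one whitespace character required
local lenient = "%d,%s*%d"  -- whitespace is optional

local a = ("3,4"):match(strict)
local b = ("3,4"):match(lenient)
local c = ("1, 2"):match(strict)

print(a) --> nil
print(b) --> 3,4
print(c) --> 1, 2
```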

The - quantifier

The - quantifier is a little different from the previous two. Like *, it matches 0 or more characters. However, + and * try to match the longest possible sequence, whereas - attempts to match the shortest possible sequence. Here's an example that shows the difference:

example = "C:/Users/Telamon/Documents"
print( example:match("/.-/") )
print( example:match("/.*/") )
/Users/
/Users/Telamon/

From the example, you see that the - found the shortest possible sequence and stopped only at the second /, while the * matched the longest sequence and stopped at the last / in the string.

This concept is usually referred to as "greediness". Quantifiers that match the longest possible sequence are considered greedy, while ones that match the shortest possible sequence are considered non-greedy.
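A common place where greediness matters is pulling out text between two delimiters. The sketch below uses a made-up line with two quoted parts to show both behaviors, plus one thing to watch out for with the - quantifier.

```lua
local line = 'say "hello" then "goodbye"'

local greedy = line:match('".*"')  -- runs to the LAST quote
local lazy = line:match('".-"')    -- stops at the NEXT quote

print(greedy) --> "hello" then "goodbye"
print(lazy)   --> "hello"

-- With nothing after it in the pattern, - is satisfied with zero
-- characters, so this matches the empty string.
local empty = ("abc"):match("%a-")
print(#empty) --> 0
```

In other words, a non-greedy quantifier is mostly useful when something follows it in the pattern that tells it where to stop.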

The ? quantifier

The ? quantifier is used to make certain characters in the string optional.

pattern = "wik?is?"
print( ("This is the wiki"):match(pattern) )
print( ("There are multiple wikis"):match(pattern) )
print( ("You do not spell it wikki"):match(pattern) )
print( ("This is not a wii"):match(pattern) )
wiki
wikis
nil
wii

From the example you can see that the ? made the s and k optional, allowing the pattern to match "wii" and "wikis". However, only one k was allowed, so "wikki" was not matched.
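Another typical use for ? is an optional sign in front of a number. This is a small sketch with invented sample strings; note that the minus sign is a magic character, so it is escaped as %-.

```lua
local pattern = "%-?%d+"  -- an optional minus sign, then one or more digits

local cold = ("Temperature: -12 degrees"):match(pattern)
local warm = ("Temperature: 30 degrees"):match(pattern)

print(cold) --> -12
print(warm) --> 30
```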

Anchors

Anchors are characters in the pattern which ensure that the pattern matches at the beginning of the string, at the end of the string, or both.

Consider the following example, which matches any sequence of digits in a string. This example will be modified later on:

pattern = "%d+"
example = "123,456,789"
for number in example:gmatch(pattern) do
    print(number)
end
123
456
789

Anchor to beginning

If you put a ^ character at the beginning of the pattern, then the pattern will match only at the beginning of the string. In other words, the match must contain the beginning of the string.

Let's try anchoring the pattern to the beginning of the string:

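pattern = "^%d+"
example = "123,456,789"
print( example:match(pattern) )
123

As you can see, it matched only the first sequence, because it was at the beginning of the string. Note that string.match is used here instead of string.gmatch: in string.gmatch, a ^ at the start of a pattern does not work as an anchor, because that would prevent the iteration.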

Also, consider the following example:

pattern = "^%d+"
example = "nondigit,123,456,789"
print(example:match(pattern))
nil

Even though there are sequences of digits, none of them were matched, because none of them were at the beginning of the string.

Anchor to end

Additionally, if you put a $ character at the end of the pattern, then the pattern will match only at the end of the string.

pattern = "%d+$"
example = "123,456,789"
for digit in example:gmatch(pattern) do
    print(digit)
end
789

As you can see, only the last sequence was matched, because it was located at the end of the string.

Using both anchors

Both anchors can be used in a pattern at the same time. Let's try it:

pattern = "^%d+$"
example = "123,456,789"
print( example:match(pattern) )
nil

That didn't work very well. That's because there are non-digit characters between the beginning and the end of the string. Let's eliminate those characters:

pattern = "^%d+$"
example = "123456789"
print( example:match(pattern) )
123456789

That's more like it. As you can see, using both anchors can be useful for validating the entire string, instead of just parts of it.
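Here is how that kind of validation might look when wrapped in a function. The name isAllDigits is made up for this sketch.

```lua
-- Hypothetical helper: true only when the whole string is digits.
local function isAllDigits(text)
    -- ^ and $ force the match to cover the string from start to end.
    return text:match("^%d+$") ~= nil
end

print(isAllDigits("123456789")) --> true
print(isAllDigits("123,456"))   --> false
print(isAllDigits(""))          --> false
```

The empty string is rejected because + requires at least one digit. Using * instead would accept it.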

Sets

Sets are used when a single character class cannot do the whole job. For instance, you might want to match both lowercase letters (%l) as well as punctuation characters (%p) using a single class. So how would we do this? Let's take a look at this example:

pattern = "[%l%p]+"
example = "123 Hello! I am another string."
print( example:match(pattern) )
ello!

As you can see from the example, sets are defined by the [ and ] around them. You also see that the classes for lowercase letters and punctuation are contained within. This means that the set acts as a single class that represents both lowercase letters and punctuation. By contrast, %l%p would match a lowercase letter followed by a punctuation character.

You aren't restricted to using only character classes, though! You can also use normal characters to add to the set.

-- a sequence of digits, letters "t", "n", and "a", and underscores.
pattern = "[%dtna_]+"
example = "tan_30, word, natt, 99ants, banner"
 
for match in example:gmatch(pattern) do
    print(match)
end
tan_30
natt
99ant
ann

Ranges

You can also specify a range of characters. A range is written as x-y, where x is the character at the start of the range, and y is the character at the end of the range (ex: "0-9" for all digits). Let's see how this works in the following example:

pattern = "[a-kq-z3-6]+"    -- a-k, q-z, 3-6
example = "abcdefghijklmnopqrstuvwxyz0123456789"
 
for match in example:gmatch(pattern) do
    print(match)
end
abcdefghijk
qrstuvwxyz
3456

As you can see, sequences of the characters "a" through "k", "q" through "z", and "3" through "6" were successfully matched.

In a more technical explanation, a range of characters is specified as every character between the byte values of the starting and ending characters. Common ranges like "a-z" or "0-9" work because their byte representations are sequenced (see ASCII Table). Because of this, you could use the pattern "[(-/]" to match the characters "()*+,-./", or even "[%z\1-\255]" to match any character.
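Ranges, quantifiers, and anchors work well together. As an illustration, the sketch below checks whether a string looks like a typical variable name: a letter or underscore first, then any number of letters, digits, or underscores. The function name isIdentifier is an assumption made for this example.

```lua
-- Hypothetical helper: checks the shape of a variable-like name.
local function isIdentifier(name)
    -- First character: a letter or underscore.
    -- Remaining characters: letters, digits, or underscores (zero or more).
    return name:match("^[A-Za-z_][A-Za-z0-9_]*$") ~= nil
end

print(isIdentifier("player_1")) --> true
print(isIdentifier("1player"))  --> false
print(isIdentifier("my-name"))  --> false
```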

Complements

There's still one last thing you can do. Like character classes, sets have complements. These are indicated by a ^ character at the start of the set.

-- a sequence of characters which are neither whitespace nor commas
pattern = "[^%s,]+"
example = "Hello, I like strings!"
 
for match in example:gmatch(pattern) do
    print(match)
end
Hello
I
like
strings!

This pattern is the complement of [%s,]. From the example, you can see that the complement of a set is defined by using the ^ character at the beginning of the set. All this does is reverse the meaning of the set.
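Complemented sets are a handy way to split a string on a separator: "everything that is not a comma" is exactly one field. The following is a minimal sketch with made-up data.

```lua
local csv = "red,green,,blue"
local fields = {}

-- [^,]+ matches a run of one or more characters that are not commas.
for field in csv:gmatch("[^,]+") do
    table.insert(fields, field)
end

print(#fields)                     --> 3
print(table.concat(fields, " | ")) --> red | green | blue
```

Notice that the empty field between the two commas is skipped, because + needs at least one character. Keep that in mind if empty fields matter in your data.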

Captures

Captures are used to get the pieces of a string that match part of a pattern. Captures are defined by parentheses around that part of the pattern. For instance, (%a%s) is a capture for a letter followed by a whitespace character. When a pattern with captures matches, the captured pieces are returned instead of the whole match. Let's look at this example:

pattern = "(%a+)%s=%s(%d+)"
example = "TwentyOne = 21"
 
key, val = example:match(pattern)
print( key )
print( val )
TwentyOne
21


Now what happens if you want to get a list by using captures? You can use string.gmatch to do this.

pattern = "(%a+)%s?=%s?(%d+)" --Captures a run of letters, skips an optional space, an equals sign, and another optional space, then captures a run of digits
example = "TwentyOne = 21 Two=2 One =7 Four= 4"
for key, val in example:gmatch(pattern) do --You see how gmatch returns the captures instead of the matches to the pattern here.
    print( key, val )
end
TwentyOne 21
Two 2
One 7
Four 4

Note that 'key' and 'val' refer to capture 1 and capture 2. The names do not matter, but it is good practice to choose relevant ones. As you can see, string.gmatch iterated through all the matches in the string and returned only the captures. That is what captures are for: picking out a certain part of the string to use.
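Instead of printing the captures, you will often want to store them. Here is a short sketch that builds a table from key/value text; the settings string and the parsed table are invented for the example.

```lua
local settings = "width=800 height = 600 depth= 32"
local parsed = {}

-- The first capture becomes the key, the second becomes the value.
for key, value in settings:gmatch("(%a+)%s?=%s?(%d+)") do
    parsed[key] = tonumber(value)  -- captures are strings, so convert
end

print(parsed.width)  --> 800
print(parsed.height) --> 600
print(parsed.depth)  --> 32
```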


Finally, you can leave a capture empty. An empty capture, (), captures the current position in the string. This means that, unlike non-empty captures, it returns a number instead of a string. Look at this example:

pattern = "()%a+()" --Captures the location of the first character, skips over a string of letters, and then captures the next character's position.
example = "Hello!"
 
cap1, cap2 = example:match(pattern)
print( cap1, cap2 )
1 6

From the example, once a match was found, string.match returned the first and second captures' positions in the string instead of returning the characters 'H' and '!'.
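Position captures can be mixed with ordinary captures in the same pattern. The sketch below, using a made-up sentence, records where every word starts and ends.

```lua
local sentence = "Hello big world"
local words = {}

-- () before and after the word give its start position and the position
-- just past its end; (%a+) gives the word itself.
for first, word, after in sentence:gmatch("()(%a+)()") do
    local entry = string.format("%s: %d-%d", word, first, after - 1)
    table.insert(words, entry)
    print(entry)
end
--> Hello: 1-5
--> big: 7-9
--> world: 11-15
```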


Practice

Exercise 1

Instructions

Make a function that validates a password. The password must contain at least two letters, at least three digits, a comma, and the word ROBLOX (in all caps).

Solution
function validatePassword(password) 
	local valid = password:match("%d.*%d.*%d") ~= nil
	valid = valid and password:match("%a.*%a") ~= nil
	valid = valid and password:match(",") ~= nil
	valid = valid and password:match("ROBLOX") ~= nil
	return valid
end
 
print("First Password: ")
print(validatePassword("123eoROBLOX43g,r")) --should output true
print("Second Password: ")
print(validatePassword("23h,robloxw")) --should output false


Exercise 2

Instructions

Make a function that parses these instructions as a string, finds all instances of the letter E, and stores the indexes of all occurrences in an array.

Solution
local instructions = "Make a function that parses these instructions as a string, finds all instances of the letter E, and stores the indexes of all occurrences in an array. "
local indexes = {}
local i = 0
 
while true do
	i = instructions:find("[Ee]", i + 1)    -- find the next "e" or "E"
	if i == nil then break end
	table.insert(indexes, i)
	print(i)
end


See also