- Approximate equality
- Ranges
- Character classes
- Quantifiers
- Ruby .match method
- Ruby .scan method
- Ruby ^ and $ validators
- Ruby flags
- Regex greedy vs reluctant <-- This is a common issue, look at the examples
- Regex gsub
Regular expressions are used to search and filter strings. The basic syntax is outlined below, where =~ is the match operator (loosely, "approximate equality"), which returns the index of the first match or nil if there is none, and / / delimits the regular expression.
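# Syntax "string" =~ /regex/ (spaces inside the slashes are matched literally, so leave them out)
"bob" =~ /bob/ # => returns 0, the index position where the match starts
"bob" =~ /cat/ # => returns nil, no match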
Use square brackets to match any one character from a set or range, such as [a-z] or [0-9]. A set such as [Bb] matches either the uppercase or the lowercase letter.
# Syntax "string" =~ /[Bb]/
"bob" =~ /[Bb]ob/ # => 0
"Bob" =~ /[Bb]ob/ # => 0
"bob" =~ /[a-zA-Z]ob/ # => 0
"012345" =~ /[0-9]/ # => 0
"MY NAME IS bob" =~ /[a-z]/ # => returns 11, the index position where the first lowercase letter is found
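Use ^ as the first character inside square brackets to match anything but the listed characters. Loosely comparable to !== in JavaScript, since it means "not any of these".
"012345" =~ /[^A-Z]/ # => returns 0, the first character that is not an uppercase letter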
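As a rough sketch of how ranges and negated sets behave across several inputs, the snippet below maps a few strings to their match index. The `samples` array and the variable names are made up for illustration.

```ruby
# Hypothetical sample strings, not taken from the notes above.
samples = ["bob", "Bob", "012345", "ABC"]

# Index of the first character that is NOT an uppercase letter, or nil.
first_non_upper = samples.map { |s| s =~ /[^A-Z]/ }
p first_non_upper # => [0, 1, 0, nil]

# Index of the first digit, or nil when the string has none.
first_digit = samples.map { |s| s =~ /[0-9]/ }
p first_digit # => [nil, nil, 0, nil]
```

Because nil is falsy and every index (including 0) is truthy in Ruby, the result of =~ can be used directly in an if condition.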
Shorthand character classes check for word characters, digits and whitespace.
A lowercase class matches anything that IS in the character class, and the uppercase version matches anything that IS NOT in that class. Each example finds the first character that satisfies the class rule and returns its index position within the string.
# \w any word character, i.e. [a-zA-Z0-9_] (letters, digits and the underscore)
"!! cat01" =~ /\w/ # => returns 3 first instance of word or digit
# \W any non-word character
"!! cat01" =~ /\W/ # => returns 0 first instance of non-word or digit
# \d any digit character
"!! cat01" =~ /\d/ # => returns 6 first instance of digit
# \D any non-digit character
"!! cat01" =~ /\D/ # => returns 0 first instance of non-digit
# \s any space character and new lines
"!! cat01" =~ /\s/ # => returns 2 first instance of space character
# \S any not-space character
"!! cat01" =~ /\S/ # => returns 0 first instance of non-space character
# . wild card character. Represents any value except a new line
"!! cat01" =~ /./ # => returns 0 first anything
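Two edge cases of these classes are easy to miss, so here is a minimal sketch of them. The m flag used in the last example is not covered elsewhere in these notes and is included only to show the contrast with the default behaviour of the wild card.

```ruby
# \w also covers the underscore, not just letters and digits.
underscore_index = "_" =~ /\w/
p underscore_index # => 0

# The . wild card does not match a newline by default...
plain_dot = "a\nb" =~ /a.b/
p plain_dot # => nil

# ...but it does once the m flag is added to the regex.
multiline_dot = "a\nb" =~ /a.b/m
p multiline_dot # => 0
```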
We can use () and | to check for multiple alternatives.
"Bob" =~ /(Joe|Bob)/ # returns 0
"Joe" =~ /(Joe|Bob)/ # returns 0
Quantifiers are symbols that say how many times the preceding regex character may occur. Note: Remember that a quantifier refers to the preceding character only. So in the example of /hello*/ it checks whether the character o comes after hell any number of times.
# char* equals 0 or more - any amount of times
"howdy hell" =~ /hello*/ # return 6
# char+ equals 1 or more - check if char occurs once or more
"howdy hell" =~ /hello+/ # return nil
# char? equals 0 or 1 - check if char occurs once or not at all
"howdy hell" =~ /hello?/ # return 6 (the o is optional, so "hell" on its own matches)
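To see what each quantifier actually consumes, it can help to look at the matched text rather than the index. The sketch below uses String#[] with a regex, which returns the matched substring or nil; the input strings are made up for illustration.

```ruby
text = "hellooo"

# * takes zero or more o's, and is greedy, so it takes all three.
star = text[/hello*/]
p star # => "hellooo"

# + takes one or more o's, again as many as it can.
plus = text[/hello+/]
p plus # => "hellooo"

# ? takes at most one o.
optional = text[/hello?/]
p optional # => "hello"

# + needs at least one o, so a bare "hell" does not match.
no_o = "hell"[/hello+/]
p no_o # => nil
```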
Match a character that occurs a specific number of times. Use curly brackets after the character, passing in the count (or a minimum and/or maximum) for the preceding character.
# char{2} must occur exactly twice
"hello" =~ /hel{2}/ # => find hell (ll twice consecutively)
0 # => match at index position 0
# char{2,} must occur at least twice
"hellllllllo" =~ /hel{2,}o/ # => at least two l's
0 # => match at index position 0
# char{,1} must occur at most once
"hello" =~ /hel{,1}o/ # => at most one l between he and o
nil # => no match, because "hello" has two l characters there
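The curly-bracket forms can be compared side by side by looking at the matched text. This is only a sketch: the {2,3} form (a minimum and a maximum together) and the input strings are additions for illustration, not examples from the notes above.

```ruby
# Exactly two l's.
exactly_two = "hello"[/hel{2}/]
p exactly_two # => "hell"

# Between two and three l's; the quantifier is greedy, so it takes three.
two_to_three = "hellllo"[/hel{2,3}/]
p two_to_three # => "helll"

# At most one l before the o.
at_most_one = "helo"[/hel{,1}o/]
p at_most_one # => "helo"

# Two l's are too many for {,1}, so there is no match.
too_many = "hello"[/hel{,1}o/]
p too_many # => nil
```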
Ruby's .match method returns a MatchData object (or nil if there is no match) and stores the captured groups under numbered indexes. We wrap the parts we want to capture in parentheses, and they become accessible through the MatchData object.
matches = "202-55-1701".match(/(\d+)-(\d+)-(\d+)/) # => #<MatchData "202-55-1701" 1:"202" 2:"55" 3:"1701">
matches[1] # => "202"
matches[2] # => "55"
matches[3] # => "1701"
We can name the captured groups with the syntax (?<name>regex) and then look them up by string or symbol.
matches = "202-55-1701".match(/(?<first>\d+)-(?<second>\d+)-(?<banana>\d+)/) # => #<MatchData "202-55-1701" first:"202" second:"55" banana:"1701">
matches['first'] # => "202"
matches[:second] # => "55"
matches['banana'] # => "1701"
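A short sketch of two things that tend to come up with MatchData in practice: pulling all groups out at once, and guarding against the nil that .match returns when nothing matches. The `pattern` variable and the non-matching input string are assumptions for illustration.

```ruby
pattern = /(?<first>\d+)-(?<second>\d+)-(?<banana>\d+)/
matches = "202-55-1701".match(pattern)

# All captured groups as an array, in order.
all_groups = matches.captures
p all_groups # => ["202", "55", "1701"]

# Named groups as a hash keyed by group name (string keys).
by_name = matches.named_captures
p by_name # => {"first"=>"202", "second"=>"55", "banana"=>"1701"}

# .match returns nil when the regex does not match, so guard before indexing.
no_match = "no digits here".match(pattern)
safe_first = no_match && no_match[:first]
p safe_first # => nil
```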
Ruby .scan method returns an array of matched regex strings.
scanned = "202-55-1701".scan(/\d+/) # => ["202", "55", "1701"]
scanned[0] # => '202'
scanned[1] # => '55'
scanned[2] # => '1701'
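When the regex passed to .scan contains capture groups, each result becomes an array of those groups instead of a plain string. A minimal sketch, with a made-up input string containing two numbers:

```ruby
text = "202-55-1701 and 303-66-1802"

# With groups in the regex, scan returns one inner array per match.
rows = text.scan(/(\d+)-(\d+)-(\d+)/)
p rows # => [["202", "55", "1701"], ["303", "66", "1802"]]

# Pick the first group out of every match.
first_parts = rows.map(&:first)
p first_parts # => ["202", "303"]
```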
Use the ^ symbol at the start of the regex to say nothing else comes before it on the line, and $ at the end of the regex to say nothing comes after it on the line.
"ruby " =~ /^ruby/ # => 0
" ruby " =~ /^ruby/ # => nil
" ruby" =~ /ruby$/ # => 1
" ruby " =~ /ruby$/ # => nil
"ruby" =~ /^ruby$/ # => 0
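In Ruby, ^ and $ anchor to the start and end of each line, not of the whole string, which matters when validating input that may contain newlines. The sketch below contrasts them with \A and \z, which anchor to the whole string; those two anchors and the input strings are additions for illustration.

```ruby
text = "first line\nruby"

# ^ matches at the start of any line, so the second line qualifies.
line_start = text =~ /^ruby/
p line_start # => 11

# \A only matches at the very start of the string.
string_start = text =~ /\Aruby/
p string_start # => nil

# $ matches at the end of any line; \z only at the very end of the string.
line_end = "ruby\nmore" =~ /ruby$/
p line_end # => 0
string_end = "ruby\nmore" =~ /ruby\z/
p string_end # => nil
```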
Use i at the end of the regex for case-insensitive matching.
"RUBY" =~ /ruby/i # => 0
"RUBY" =~ /ruby/ # => nil
Use x at the end of the regex to allow whitespace, new lines and # comments inside the pattern, which makes long expressions easier to read. Note: JavaScript has no built-in equivalent of this flag.
some_num = "Number: 202-555-1701."
matches_multi = some_num.match(/
(?<first>\d+)- # This should match 202
(?<second>\d+)- # This should match 555
(?<banana>\d+) # This should match 1701
/x)
matches_single = some_num.match(/(?<first>\d+)-(?<second>\d+)-(?<banana>\d+)/)
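One consequence of the x flag is that a literal space in the pattern is ignored too, so a space that should be matched has to be written as an escape such as \s. A minimal sketch of that pitfall, reusing the same example string:

```ruby
some_num = "Number: 202-555-1701."

# With x, the space after the colon is ignored, so this effectively looks
# for "Number:" followed directly by a digit and finds nothing.
ignored_space = some_num =~ /Number: \d+/x
p ignored_space # => nil

# Writing the space as \s makes it part of the pattern again.
escaped_space = some_num =~ /Number:\s\d+/x
p escaped_space # => 0
```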
NOTE: THIS IS IMPORTANT, YOU WILL COME ACROSS THE BUG OFTEN
When we use the .* matcher, the regex is greedy: it tries to match as much of the string as possible. To make it match as little as possible, we suffix it with a ? (Note: this is quite complicated, refer to the explanation below for more information).
some_num = "202-555-1701"
# Using the greedy .* swallows most of the digits, so the groups come out wrong
some_num.match(/(\d+).*(\d+).*(\d+)/) # => #<MatchData "202-555-1701" 1:"202" 2:"0" 3:"1">
# Using the reluctant .*? solves the issue
some_num.match(/(\d+).*?(\d+).*?(\d+)/) # => #<MatchData "202-555-1701" 1:"202" 2:"555" 3:"1701">
# Another example
html = "<html><div></div></html>"
# Greedy approach matches everything up to the last > in the string
html.match(/<(.*)>/) # => #<MatchData "<html><div></div></html>" 1:"html><div></div></html">
# Reluctant approach matches the first instance of the closing arrow bracket
html.match(/<(.*?)>/) # => #<MatchData "<html>" 1:"html">
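Two related variations on the HTML example, offered as a sketch rather than a recommendation: a negated character class gives the same result as the reluctant quantifier here, and .scan with the reluctant form collects every tag. The variable names are made up for illustration.

```ruby
html = "<html><div></div></html>"

# [^>]* can never run past a closing bracket, so greediness is not a problem.
first_tag = html.match(/<([^>]*)>/)[1]
p first_tag # => "html"

# scan with the reluctant form returns every tag name in order.
all_tags = html.scan(/<(.*?)>/).flatten
p all_tags # => ["html", "div", "/div", "/html"]
```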
Summary for ? symbol: This is confusing because the ? symbol has three separate meanings and uses in regex
/(?<symbol>\d+)/
- when used at the beginning of parentheses, followed by <name>, it names the capture group
/(optional)?/
- when used after a character or group it means that character or group is optional
/.*?/
- when used after a quantifier it makes the quantifier reluctant (see above)
Enter your regex: .*foo // greedy quantifier
Enter input string to search: xfooxxxxxxfoo
I found the text "xfooxxxxxxfoo" starting at index 0 and ending at index 13.
Enter your regex: .*?foo // reluctant quantifier
Enter input string to search: xfooxxxxxxfoo
I found the text "xfoo" starting at index 0 and ending at index 4.
I found the text "xxxxxxfoo" starting at index 4 and ending at index 13.
Enter your regex: .*+foo // possessive quantifier
Enter input string to search: xfooxxxxxxfoo
No match found.
A greedy quantifier first matches as much as possible. So the .* matches the entire string. Then the matcher tries to match the f following, but there are no characters left. So it "backtracks", making the greedy quantifier match one less thing (leaving the "o" at the end of the string unmatched). That still doesn't match the f in the regex, so it "backtracks" one more step, making the greedy quantifier match one less thing again (leaving the "oo" at the end of the string unmatched). That still doesn't match the f in the regex, so it backtracks one more step (leaving the "foo" at the end of the string unmatched). Now, the matcher finally matches the f in the regex, and the o and the next o are matched too. Success!
A reluctant or "non-greedy" quantifier first matches as little as possible. So the .*? matches nothing at first, leaving the entire string unmatched. Then the matcher tries to match the f following, but the unmatched portion of the string starts with "x" so that doesn't work. So the matcher backtracks, making the non-greedy quantifier match one more thing (now it matches the "x", leaving "fooxxxxxxfoo" unmatched). Then it tries to match the f, which succeeds, and the o and the next o in the regex match too. Success!
In the example above, the matcher then starts the process over with the remaining unmatched portion of the string, following the same process, which is how the second match "xxxxxxfoo" is found.
A possessive quantifier is just like the greedy quantifier, but it doesn't backtrack. So it starts out with .* matching the entire string, leaving nothing unmatched. Then there is nothing left for it to match with the f in the regex. Since the possessive quantifier doesn't backtrack, the match fails there.
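The three quantifier styles from the transcript above can be reproduced in Ruby, which also supports the possessive *+ form. This is a sketch of the same input string; the variable names are made up for illustration.

```ruby
input = "xfooxxxxxxfoo"

# Greedy: .* grabs everything, then backtracks just enough to find the last foo.
greedy = input[/.*foo/]
p greedy # => "xfooxxxxxxfoo"

# Reluctant: .*? grows one character at a time, so scan finds two matches.
reluctant = input.scan(/.*?foo/)
p reluctant # => ["xfoo", "xxxxxxfoo"]

# Possessive: .*+ grabs everything and never gives it back, so foo cannot match.
possessive = input =~ /.*+foo/
p possessive # => nil
```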
Find and replace method. Syntax: .gsub(/regex/,'replace')
'RUBY RED'.gsub(/R/,'B') # => 'BUBY BED' (every match is replaced)
Pass a block to run a Ruby method on every substring that matches the regex
'RUBY RED'.gsub(/R/) {|letter| letter.downcase} # => 'rUBY rED'
# Syntactic sugar
'RUBY RED'.gsub(/R/, &:downcase) # => 'rUBY rED'
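A few related forms, sketched with made-up inputs: .sub replaces only the first match, a replacement string can refer back to captured groups with \1 and \2, and .gsub also accepts a hash that maps each matched string to its replacement.

```ruby
# sub replaces only the first match; gsub replaces every match.
first_only = 'RUBY RED'.sub(/R/, 'B')
p first_only # => "BUBY RED"
every_one = 'RUBY RED'.gsub(/R/, 'B')
p every_one # => "BUBY BED"

# \1 and \2 in a single-quoted replacement refer to the captured groups.
swapped = 'John Smith'.gsub(/(\w+) (\w+)/, '\2 \1')
p swapped # => "Smith John"

# A hash maps each matched string to its replacement.
mapped = 'RUBY RED'.gsub(/[RE]/, 'R' => 'r', 'E' => 'e')
p mapped # => "rUBY reD"
```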