6,691 unanswered questions tagged

The road to hell is paved with regular expressions
- Ilian Iliev

  1. Dot is a special char
  2. Carriage return in content
  3. Denial of Service by regexp
  4. Non escaping of special chars
  5. Over specifying
  6. Validating an URI with a regexp
  7. Write once debug everywhere DSL

1. Dot is a special char

Simple elegant, you read it, you understand it

# process all ruby files 
process_ruby_file(file_name) if file_name.match(/.rb/)
# and profit

But ouch it matches also sample.erb or paris_rb.html
You know what? A regular expression is a program in itself and it deserves tests!

Want to test it?

  • Take the matching code and depending on your OO taste, make a
    • class : class RubyFileMatcher
    • method : def ruby_file?(file_name) or Regex.ruby_source_file_regex
    • constant RUBY_FILE_REGEXP (perhaps too ugly)
  • Take your favorite testing framework and throw valid/invalid string to verify the behavior.
  processor.ruby_file?('sample/app.rb').must_equal true

  processor.ruby_file?('views/paris_rb.html').must_equal false
  processor.ruby_file?('views/paris_rb').must_equal false
  processor.ruby_file?('views/paris.erb').must_equal false

How can I fix it?
You can fix the regular expression by escaping the . and matching the end of the string with:

/\.rb$/

In fact you don't really need a regex. Let's drop the regex and use

File.extname(file) == '.rb'

to detect all ruby files like rubocop's TargetFinder

More on the subject:

  • small ads for the Paris.rb meetup
  • File.extname
  • GitLab repository/user regexp are centralized in a utility class Regex and can be reused in validations, routes,...

2. Carriage return in content

You have a Git repository model and want to validate the fullname (rails/rails)

class Repository < ActiveRecord::Base
   ...
   validates(:full_name, allow_blank: false,
      format: {
         with: /^[A-Za-z0-9\-_\.]*\/[A-Za-z0-9\-_\.]*$/
      }
    )
end

But when we provide a multiline content

"rails/rails\nalert('pwned!');"

... it's still valid!

You can catch that trap with Brakeman

Repository | Format Validation | Insufficient validation for 'full_name'
 using /^[A-Za-z0-9\-_\.]*\/[A-Za-z0-9\-_\.]*$/. Use \A and \z as anchors 
 near line 41

You can fix the code by using the anchors with:

     /\A[A-Za-z0-9\-_\.]*\/[A-Za-z0-9\-_\.]*\z/

Shameless Plug: in fact no need to launch Brakeman, PullReview will detect it as you push to GitHub, BitBucket or your GitLab instance.

More on the subject:

3. Denial of Service by regexp

Some input can cause pathological lookup and performance degradation

require 'bm'
ANCHORED_VERSION_PATTERN = /\A\s*(#{VERSION_PATTERN})*\s*\z/
Benchmark.measure do
  '2.3422222.222.222222222.22222.ads0as.dasd0.ddd2222.2.qd3e.' =~ ANCHORED_VERSION_PATTERN
end

55 seconds later, you finally got the result

=> #<Benchmark::Tms:0xb7f203fc @label="", @real=55.452385805, 
   @cstime=0.0, @cutime=0.0, @stime=0.010000000000000009, 
   @utime=55.30000000000007, @total=55.310000000000066>

That issue was reported in Bundler regexp validate version string, removing the backtracking helped the performance

ANCHORED_VERSION_PATTERN = /\A\s*(#{VERSION_PATTERN})?\s*\z/

You use that regexp in an online form, you could face a Denial of Service when "well crafted" content is submitted. With a few HTTP posts all your Unicorm/Passenger process would get exhausted.

As a corollary, don't let user submit regexp in your app content.

More on the subject:

4. Non escaping of special chars

You have a Cucumber step like this one

Then /^I should see "(.*)"$/ do |text|
  assert_match /#{text}/, @response.body
end

What if I pass the following text to the step to check my HTML content

Security (2)

It won't work as expected... due to the parentheses interpreted as a regexp group.
The fix is to escape the text

Then /^I should see "(.*)"$/ do |text|
  assert_match /#{Regexp.escape(text)}/, @response.body
end

Remember escaping white spaces (\t,...) and other special chars like "(.*/) ?+"

More on the subject:

5. Over specifying

You want to parse a line like this one and extract each attributes of the soldier:

 line = 'Name: Some Soldier, Rank: Major, Serial: 426879824, Boots: black';
 line.match /(\w*): (\w*\s*\w*), (\w*): (\w*), (\w+): (\d{9}), (\w*):  (\w*)/
 => #<MatchData "Name: Some Soldier, Rank: Major, Serial: 426879824, Boots:  black"
    1:"Name" 2:"Some Soldier" 3:"Rank" 4:"Major" 
    5:"Serial" 6:"426879824" 7:"Boots" 8:"black">

Small variation in format (a new attribute Gender), different spacing may easily break your code, your are doing too much with a regexp, over-specifying the format.

It's often better to identify pattern in a pattern with scan or split method:

 line = 'Name: Some Soldier, Rank: Major, Serial: 426879824, Boots: black';
 data = line.split(',').map do |label_value| 
   match_data = label_value.match(/\s?(.*):\s*(.*)/) 
   [match_data[1],match_data[2]]
 end
 Hash[data]
 => {"Name"=>"Some Soldier", "Rank"=>"Major", 
        "Serial"=>"426879824", "Boots"=>"black"}

More on the subject:

  • Common Regex Gotchas (perl)
  • regex crossword
  • again in this sample, you can replace the last regexp with a split(':') and a strip of extra white spaces.
  • Want to parse known format like CSV, HTML, XML, JSON, ...? Don't roll your own regular expression use a dedicated Gem or stdlib !

6. Validating an URI with a regexp.

Please don't use a home made/googled regexp

validates(
   :website, 
   allow_blank: true, 
   uri: { format: /(^$)|(^(http|https):\/\/[a-z0-9]+([\-\.]{1}[a-z0-9]+)*\.[a-z]{2,5}(([0-9]{1,5})?\/.*)?$)/ix }
)

The URI scheme is tricky to implement with a single regexp.

Relying on the same validation as your network library is probably safer if you want to restrict the url to https:// and https:// use the following code.

require 'uri'

def valid?(url)
  uri = URI.parse(url)
  uri.kind_of?(URI::HTTP)
rescue URI::InvalidURIError
  false
end

More on the subject:

7. Write once debug everywhere DSL

As explained by Brandon Beacher in his article
Regexp in ruby can be cryptic:

/(\d{3})-(\d{3})-(\d{4})/.match("678-248-2440")
if $~
  PhoneNumber.new(
    area_code: $1,
    prefix: $2,
    line_number: $3
  )
end

Named group might help keeping the intent clearer

match_data = /(?<area_code>\d{3})-(?<prefix>\d{3})-(?<line_number>\d{4})/.match("678-248-2440")

if match_data
  PhoneNumber.new(
     area_code: match_data[:area_code],
     prefix: match_data[:prefix],
     line_number: match_data[:line_number]
  )
end

Named group helps on readability and make the regexp "self-documenting". In other user cases this would also keep the code more flexible in case of small variation in the output, you receive a provider name before the phone number.

match_data = /(?<provider_name>.*)\s(?<area_code>\d{3})-(?<prefix>\d{3})-(?<line_number>\d{4})/.match("BelgiumTelecom 678-248-2440")

The phone number parsing stays the same, as the code doesn't depend on the position/index, you just add a provided named group to extract the provider.

Closing words

xkcd regular expression

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
- Jamie Zawinski

First regular expressions are powerful, their simplicity hides often complex processing. Solving your text parsing/validation problems in one line is great but this line needs way much more testing than the other lines in your codebase. You don’t have a PhD in Regular Expression (and most of your colleague either), so don’t leave them blind when they’ll have to adapt your regexp. Leave a few unit tests to show what is an expected valid/invalid content.

Second "when you have a hammer, everything looks like a nail", most of the time, you can drop your homemade expression and favor a well tested implementation (eg File.extname).

Want to learn more about ruby and regexp?

 

Want to know if you forgot a \A…\z in your validation’s regex?
Audit your code with PullReview (free for opensource!) or launch Brakeman.