Regex – Intro and Use Cases

If you ever need to manipulate data (and if you run or market an online store, you do), an understanding of regular expressions is a fantastic string to add to your bow.

You may not be convinced right now, but hopefully by the end of this tutorial you will be, at least to some degree.

Regular expressions (sometimes abbreviated to Regex) are, most simply described, a code or shorthand for ‘pattern matching’ against a string of text.

The building blocks of the language allow a user to identify very specifically those strings that do or, just as importantly, do not match the expression and, depending on the use case, take action.

Their uses within a feed builder application like Feed Donkey are manifold, but in essence they give our users more control and more options to match, extract or replace their data depending on their requirements.

Regex can look a bit inaccessible at first glance, but this really should not put you off learning at least the basics. In fact, the basics are probably enough to get much more from your data and perhaps solve some stubborn problems.

The Basics

The beauty of regular expressions is that they are so flexible: if there is a pattern, you can write an expression to match it, and it can be a huge ‘net’ to catch lots of matches or laser-guided to something very specific.

To learn the skill it is more productive to get a handle on the basic principles and then try it yourself; it is much simpler to write your own expressions than to read and decipher somebody else’s.

In this tutorial we are going to cover the following areas.

  • Meta-characters
  • Basic functional elements
  • Quantifiers
  • Positional indicators
  • Character sets & group constructs

Once we have covered all of these areas and you understand how to use them in combination, you will have enough to unlock a large proportion of what regex has to offer. There are more advanced topics beyond these, but you will probably not need them 99% of the time.

Before you start, you are going to need a tool for testing your expressions, to check that they are having the desired effect. I use Atom, a great text editor that is available for Windows, Mac and Linux.

There are a number of others that you could use; the web-based tool regex101.com is fantastic and equally suitable.

Meta-characters

Meta-characters are characters that perform a function. You will see these functions described in the sections to come. The characters in question are the following:

Meta-characters (Need to be escaped):

.[{()\^$|?*+

The first basic principle, though, is that if you are trying to match any of these characters ‘literally’, they need to be ‘escaped’.

For example, the ‘period’ or ‘full-stop’ character has a special meaning in regular expressions: it will match any character.

To match the ‘literal’ period, escape the character with a back-slash.

\.
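As a rough illustration, here is the difference between the two in Python. Using Python's built-in re module is an assumption made for this sketch (the engine behind Feed Donkey may differ in its details), and the sample string is made up.

```python
import re

sample = "v1.2"  # hypothetical sample text

# Unescaped: '.' matches any character except a newline.
any_char = re.findall(r".", sample)

# Escaped: '\.' matches only a literal period.
literal_period = re.findall(r"\.", sample)

print(any_char)        # ['v', '1', '.', '2']
print(literal_period)  # ['.']
```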

Basic Elements

. – Any character except new line

\d – Digit (0-9)

\D – Not a digit (0-9)

\w – Word character (a-z, A-Z, 0-9, _)

\W – Not a word character

\s – Whitespace (space, tab, newline)

\S – Not whitespace (space, tab, newline)

These are the basic elements or ‘tokens’. Each one matches any single character that meets the described definition. Use them in combination with quantifiers and positional elements to build your pattern match.
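To see what each token picks out, here is a small sketch in Python (an assumed choice of language, with a made-up sample string). Each call lists every single character that the token matches.

```python
import re

sample = "SKU 42_a"  # hypothetical sample text

digits = re.findall(r"\d", sample)      # digits only
non_digits = re.findall(r"\D", sample)  # everything that is not a digit
word_chars = re.findall(r"\w", sample)  # letters, digits and underscore
whitespace = re.findall(r"\s", sample)  # spaces, tabs, newlines

print(digits)      # ['4', '2']
print(non_digits)  # ['S', 'K', 'U', ' ', '_', 'a']
print(word_chars)  # ['S', 'K', 'U', '4', '2', '_', 'a']
print(whitespace)  # [' ']
```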

Quantifiers

* – 0 or more

+ – 1 or more

? – 0 or 1

{3} – Exact number

{3,4} – Range of numbers (minimum, maximum)

For example, to match a three-digit number you could use

\d\d\d

but you could shorten this to

\d{3}

Another example is matching a list of URLs that are either http or https.

In a valid URL the ‘s’ character may or may not be there, but if it is present there will only be one of them. Use ? to indicate this

https?://
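Both examples can be checked with a few lines of Python. This is only a sketch: the language choice and the sample URLs are assumptions for illustration.

```python
import re

# \d{3} is shorthand for \d\d\d: both accept exactly three digits.
three_long = re.fullmatch(r"\d\d\d", "123") is not None
three_short = re.fullmatch(r"\d{3}", "123") is not None
too_short = re.fullmatch(r"\d{3}", "12") is not None

# 's?' makes the 's' optional, so both http and https are accepted.
urls = ["http://example.com", "https://example.com", "ftp://example.com"]
web_urls = [url for url in urls if re.match(r"https?://", url)]

print(three_long, three_short, too_short)  # True True False
print(web_urls)  # ['http://example.com', 'https://example.com']
```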

Positional Indicators

\b – Word boundary

\B – Not a word boundary

^ – Beginning of a string

$ – End of a string

Positional indicators are ‘non-matching tokens’, sometimes described as ‘anchors’. They do not match any characters within the string themselves; they match a position, and are used in combination with the character-matching parts of the expression.

For example, say you are looking to match only the string ‘gin’, meaning the drink, out of this set of words:

original

gin

ginger

origin

Here you can use the ‘word boundary’ positional indicator to match only the string that meets the pattern you require. A word boundary occurs where a block of consecutive word characters, a ‘word’, is broken by white space, a non-word character or the start/end of a line.

For example

\b

would find a word boundary at the start and at the end of each word in the list above.

Out of all these words, gin ‘the drink’ is the only one in which the letters ‘gin’ are both prefixed and suffixed by a word boundary.

\bgin\b

So this expression would find the match that you required in this scenario.
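The word list above can be tested in Python to confirm this. The sketch below assumes Python's re module and compares the plain pattern with the word-boundary version.

```python
import re

words = ["original", "gin", "ginger", "origin"]

# Without boundaries, 'gin' is found inside every word.
contains_gin = [w for w in words if re.search(r"gin", w)]

# With \b on both sides, only the standalone word matches.
whole_word_gin = [w for w in words if re.search(r"\bgin\b", w)]

print(contains_gin)    # ['original', 'gin', 'ginger', 'origin']
print(whole_word_gin)  # ['gin']
```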

Character Sets & Groups

[] – Matches characters in brackets

[^ ] – Matches characters NOT in brackets

| – Either or

( ) – Capturing group

(?: ) – Non-capturing (matching) group

CHARACTER SETS

Character sets are a way of grouping together character match tokens – in simple terms ‘match if the character is one of these characters in-between the square brackets’.

For example, let’s take a simple list of random words

abba

jabba

yabba-dabba

[ba]+

Here the character set is partnered with the quantifier ‘one or more’.

This would only fully match ‘abba’, as it is the only word constructed entirely of a’s and b’s.

To match all the words in this list we could use

[baydj-]+
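Here is a short Python sketch of both character sets against the word list (the language is an assumption; any regex tester would show the same thing). It uses fullmatch to ask whether the whole word fits, and findall to show the partial matches the first set still finds.

```python
import re

words = ["abba", "jabba", "yabba-dabba"]

# Words made up entirely of a's and b's.
only_a_and_b = [w for w in words if re.fullmatch(r"[ba]+", w)]

# The wider set covers every character used in the list.
all_words = [w for w in words if re.fullmatch(r"[baydj-]+", w)]

# The narrow set still finds partial matches inside longer words.
partial = re.findall(r"[ba]+", "yabba-dabba")

print(only_a_and_b)  # ['abba']
print(all_words)     # ['abba', 'jabba', 'yabba-dabba']
print(partial)       # ['abba', 'abba']
```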

The characters within the brackets can be in any order. Other acceptable contents include

  • a range of either digits or alphabetical characters
[a-e0-5]+

this would match:

lowercase alphabetical characters a, b, c, d & e

&

numbers 0, 1, 2, 3, 4, & 5

  • basic tokens
[\w\D]+

this would match all word characters (letters, digits and underscore)

&

anything that is not a digit, which taken together means any character at all.

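Note that if you are trying to match the literal version of most meta-characters, they do not need to be escaped when they are within the square brackets of a character set. The exceptions are the back-slash itself, the closing square bracket, a ^ placed first and a - placed between two characters, which all keep a special meaning inside the brackets.

[a-z+.]+

So this expression would match the literal characters + ‘plus’ and . ‘period’; they do not need to be written as \+ and \. as they would normally.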
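A quick way to convince yourself of this is the Python sketch below (language and sample string are assumptions for illustration). Inside the set the plus and period are treated as ordinary characters; outside a set they have to be escaped to get the same literal meaning.

```python
import re

sample = "a+b.c"  # hypothetical sample text

# Inside the set, '+' and '.' are literal, so the whole string is one match.
inside_set = re.findall(r"[a-z+.]+", sample)

# Outside a set, they must be escaped to be matched literally.
escaped = re.findall(r"\+|\.", sample)

print(inside_set)  # ['a+b.c']
print(escaped)     # ['+', '.']
```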

GROUPS

The next topic we are going to cover is groups. This area is important for two basic reasons.

  1. Using groups to create ‘either or’ statements. Groups can be used in combination with the | ‘pipe’ character to create these.
  2. Creating ‘capturing groups’. We will come on to this in more detail; understanding capturing groups is important for using regular expressions within Feed Donkey.

So for example, let’s take this random list of names:

Dr T Jones

Mr C Benjamin

Rev F King

Ms G Finley

Mrs R Nutall

And we want to identify and match with part of that list depending upon their title, so we write an expression with a grouped, ‘either or’ statement.

(Dr|Rev)\s[A-Z]\s[a-zA-Z]+

This expression can be broken down into its constituent elements

  • The match must start with Dr or Rev (Dr|Rev)
  • Single white-space \s
  • Single uppercase letter [A-Z]
  • Single white-space \s
  • Lowercase or uppercase letters, one or more [a-zA-Z]+

When we wrap part of an expression in brackets as we have here, we are identifying it as a ‘capturing group’, meaning we intend to use this part.

If we test this in the regex101.com testing tool, it reports ‘Dr’ and ‘Rev’ as the first capture group of their respective matches.

This is important because the Feed Donkey functions REGEX EXTRACT and REGEX REPLACE recognise the 1st capture group.

In this example, therefore, we would either extract or replace just the Dr or Rev part of each matching name.

This may not be exactly what you want to achieve, however, so let’s re-write the same expression with a non-capturing (or matching) group.

(?:Dr|Rev)\s[A-Z]\s[a-zA-Z]+

Now we don’t have a group captured in the expression, so where there is a match the extract and replace functions would act on the full match, i.e.

‘Dr T Jones’ & ‘Rev F King’

But now we could introduce a capture group into another part of the expression if required. Say we wanted to match the either-or group on the title, but capture the surname

(?:Dr|Rev)\s[A-Z]\s([a-zA-Z]+)
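The three variations can be compared side by side in Python. This is a sketch only: the extract helper is hypothetical and simply imitates the behaviour described above (use capture group 1 if there is one, otherwise the full match). It is not Feed Donkey's implementation of REGEX EXTRACT.

```python
import re

names = ["Dr T Jones", "Mr C Benjamin", "Rev F King", "Ms G Finley", "Mrs R Nutall"]


def extract(pattern, text):
    """Hypothetical helper: return capture group 1 if the pattern defines one,
    otherwise the full match. Returns None when there is no match."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return match.group(1) if match.re.groups else match.group(0)


def extract_all(pattern):
    """Apply extract() to every name and keep only the names that matched."""
    results = (extract(pattern, name) for name in names)
    return [r for r in results if r is not None]


titles = extract_all(r"(Dr|Rev)\s[A-Z]\s[a-zA-Z]+")          # capturing group on the title
full_names = extract_all(r"(?:Dr|Rev)\s[A-Z]\s[a-zA-Z]+")    # no capturing group
surnames = extract_all(r"(?:Dr|Rev)\s[A-Z]\s([a-zA-Z]+)")    # capturing group on the surname

print(titles)      # ['Dr', 'Rev']
print(full_names)  # ['Dr T Jones', 'Rev F King']
print(surnames)    # ['Jones', 'King']
```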

That completes the topic areas. There are many more advanced areas within the regular expression ‘language’, some eye-wateringly complicated, but will you need any more than the above in practice? I would argue that it’s fairly unlikely.

This is one of the great things about this ‘skill’: you absolutely do not need to reach full-on grand-master level before it starts to become useful. You simply need to know a good spread of the basics.

Real World Examples

So let’s now look at a couple of real-life examples of how regular expressions can help you with building feed data columns.

  1. Validating Barcodes

The data problem: I have lots of number-string barcodes in my product database, but not all of them meet the Google GTIN (Global Trade Item Number) standards.

In your Google Merchant Center feed you only want to submit barcodes that meet the standards, i.e. barcodes that are 12, 13 or 14 digits in length.

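^[0-9]{12,14}$

The above expression will only match strings that consist entirely of digits and are a minimum of 12 and a maximum of 14 digits long. The ^ and $ anchors matter here: without them, the expression would also match the first 14 digits of a longer, invalid number.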

If you build a column using your store’s data source and then add the transformation rule REGEX EXTRACT with this expression, only valid barcodes will be used to populate this column.
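The length check can be tried out in Python before building the column. The sketch below uses made-up barcode values and Python's re module, both of which are assumptions for illustration. It only checks the shape of the value (digits and length), not whether the barcode's check digit is correct. fullmatch requires the whole string to fit the pattern, while search accepts a match anywhere in the string.

```python
import re

# Hypothetical barcode values, for illustration only.
barcodes = ["012345678905", "5012345678900", "12345", "1234567890123456", "ABC123456789"]

gtin_shape = re.compile(r"[0-9]{12,14}")

# fullmatch requires the entire string to fit the pattern.
valid = [code for code in barcodes if gtin_shape.fullmatch(code)]

# search accepts a match anywhere, so the 16-digit value slips through.
loose = [code for code in barcodes if gtin_shape.search(code)]

print(valid)  # ['012345678905', '5012345678900']
print(loose)  # ['012345678905', '5012345678900', '1234567890123456']
```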

  2. Creating a Custom Label From Tags Data

Data problem: I have useful attribute data in tags that I could use to enhance my advertising – but it is difficult to separate out.

I’m an online wine merchant and I wish to use the grape variety as a custom label in my shopping feed. The barriers to accessing this easily are

  • It is mixed in with other attributes like colour and country of origin
  • A further complication is that in the list the data is labelled either ‘Grape’ or ‘Grapes’

This tags data might look something like this

So we could write an expression that finds that label in the list, plus the specific grape variety text

Grapes?:\s([a-zA-Z\s]+),?

Within the expression there is a capture group, which allows us to extract just the variable grape variety text
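To try the expression outside the builder, here is a minimal Python sketch. The tag strings are invented for illustration (real tags data will vary by store), and the grape_label helper is hypothetical: it returns the first capture group, which is the behaviour described above for REGEX EXTRACT.

```python
import re

# Hypothetical tag strings; real tags data will vary by store.
tags = [
    "Colour: Red, Country: France, Grape: Pinot Noir, Vintage: 2019",
    "Colour: White, Grapes: Sauvignon Blanc, Country: Chile",
    "Colour: Rose, Country: Spain",
]

grape_pattern = re.compile(r"Grapes?:\s([a-zA-Z\s]+),?")


def grape_label(tag_string):
    """Return the captured grape variety, or None if no grape label is present."""
    match = grape_pattern.search(tag_string)
    return match.group(1) if match else None


labels = [grape_label(t) for t in tags]
print(labels)  # ['Pinot Noir', 'Sauvignon Blanc', None]
```

Because the character set inside the capture group allows only letters and white space, the capture stops at the comma that separates one tag from the next.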

In the Feed Donkey builder this would look something like this

Hopefully this has been a worthwhile read for you. Of course, the team here at Feed Donkey are happy to help you get the best out of your data; if you need us, simply reach out.

This tutorial was inspired by a great (but longer) video introduction to Regex created by Corey Schafer. Check it out too.