How to use Regular Expressions in Bulk Operations

          Help and Guidance 2021: New Page: Version 1.1

Introduction


If you want to do a regex (Regular Expression) search as part of a Bulk Operation, you probably know something of what that entails. Regex enables the use of much more specific and complex search and replace strings. It allows for wildcards (i.e. patterns that match any character in part of the search string) and provides access to characters that are not usually allowed in simple search methods.

It therefore has the potential to be very dangerous! We offer a couple of external links to give you more background or to refresh your knowledge.

If you have an editing need that you feel could be solved by regex but are not familiar with the technique, you are advised to raise the issue on the maintainers' group.

Everything below must be considered experimental and is currently under review (it is retained only to assist with that review).

 

A worked example in GENUKI

For most maintainers the most likely reason to use regex is to correct problem external links. What follows is a real example that came up in May 2021.

The problem

The error report for County Armagh showed over 200 'Not Found' (404) errors for the same domain spread over several parish pages.

One of these was the existing link:

http://www.craigavonhistoricalsociety.org.uk/rev/haddendoctoratsea.html

The correct url was now shown as:

http://www.craigavonhistoricalsociety.org.uk/rev/haddendoctoratsea.php

In other words, .html had become .php.

The page name (haddendoctoratsea in this link) is the variable part, since it differs for each target page within that domain. A simple VBO would not identify each separate instance and therefore could not correct them all in one process. With regex there is a way of doing so.

The Regex Approach

This is how regex was used in this example.

On the Bulk Ops page:

  • After setting up county, content type etc., the first step in the process is Choose Nodes. Inserting "craigavonhistoricalsociety" in the Topic Content / Contains box produces a list of the 10 parish pages which contain these links.
  • Select them all and click Search & Replace.
  • On the next page click on Options and then tick the box "The search and replace fields contain regular expressions. Enclose the search pattern in slashes." The enclosing slashes apply ONLY to the Search box.
  • In the Search box insert
    • /craigavonhistoricalsociety\.org\.uk\/rev\/(.*?)\.html\"/
  • In the Replace box insert
    • craigavonhistoricalsociety.org.uk/rev/$1.php"
  • Here (.*?) captures the page name, whatever it is, and $1 in the Replace box puts that captured name back.
  • Note that when this example is used as a guide to your own regex processes for link correction from html to php suffixes, only the domain and path part of these strings (craigavonhistoricalsociety.org.uk/rev/ here) needs to be changed.
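The search and replace above can be tried outside Bulk Operations before touching any pages. What follows is only a sketch in Python, not the Bulk Operations engine itself: Python writes the captured group as \1 where the Replace box uses $1, and it needs no enclosing slashes or backslashes before / and ". The second link in the sample text is hypothetical and is there only to show that several page names are handled in one pass.

```python
import re

# Same pattern as the Search box, minus the enclosing slashes.
# Python does not need a backslash before "/" or '"'.
PATTERN = re.compile(r'craigavonhistoricalsociety\.org\.uk/rev/(.*?)\.html"')

# Python writes the first captured group as \1 (Bulk Operations uses $1).
REPLACEMENT = r'craigavonhistoricalsociety.org.uk/rev/\1.php"'

# Sample text: the first URL is the one from the example,
# the second page name ("anotherpage") is made up for illustration.
sample = (
    '<a href="http://www.craigavonhistoricalsociety.org.uk/rev/haddendoctoratsea.html">A</a> '
    '<a href="http://www.craigavonhistoricalsociety.org.uk/rev/anotherpage.html">B</a>'
)

# subn returns the new text and the number of replacements made.
fixed, count = PATTERN.subn(REPLACEMENT, sample)

print(count)
print(fixed)
```

Both links end in .php after the replacement, and the page names themselves are untouched.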

The Warning Again!

Do not run a full bulk operations process using regex without first testing it on at least a single instance, maybe a couple if you're the nervous sort! If in doubt, ask.
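One cautious way to follow this advice is to preview what a pattern would change in the source of a single page before running anything in Bulk Operations. The sketch below assumes you have pasted that page's source into a string; the function name preview and the link text in the sample are hypothetical. It changes nothing and only lists each match alongside its proposed replacement.

```python
import re

def preview(pattern, replacement, html):
    """Return (old, new) pairs for every match, leaving html untouched."""
    return [
        (match.group(0), match.expand(replacement))
        for match in re.finditer(pattern, html)
    ]

# Pattern and replacement in Python syntax (\1 rather than $1).
pattern = r'craigavonhistoricalsociety\.org\.uk/rev/(.*?)\.html"'
replacement = r'craigavonhistoricalsociety.org.uk/rev/\1.php"'

# Source of one page, pasted in by hand; the link text is made up.
page = (
    '<a href="http://www.craigavonhistoricalsociety.org.uk/rev/'
    'haddendoctoratsea.html">Doctor at sea</a>'
)

changes = preview(pattern, replacement, page)
for old, new in changes:
    print(old, "->", new)
```

If the list is empty, or a proposed replacement looks wrong, the pattern needs more work before it goes anywhere near a bulk operation.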

 


Other examples

Here is another example, a recent (10/2024) regex solution which, had it been used, would have avoided manual link error correction on 534 place nodes.

The problem was the need to delete dud links and their accompanying text, where the links all pointed at different areas of the target site, i.e. they were distinguished by parish name.
 

Starting from the code:

           <a href="http://www1.somerset.gov.uk/archives/asp/parish.asp?Parish=Clutton">Details of Somerset Heritage Centre holdings</a> relating to this parish.

First wrap it all in slashes (/) at both ends, then put a backslash before every forward slash in the URL, as in:

/<a href="http:\/\/www1.somerset.gov.uk\/archives\/asp\/parish.asp?Parish=Clutton">Details of Somerset Heritage Centre holdings<\/a>relating to this parish./

Now insert a backslash between the second asp and the ?, and another before each dot in the URL, so that these characters are treated literally.

Then replace the parish name with a pattern that matches any name, as in Parish=[^"]+">

Finally drop the trailing text after the closing </a>, which gives you this regex:

 /<a href="http:\/\/www1\.somerset\.gov\.uk\/archives\/asp\/parish\.asp\?Parish=[^"]+">Details of Somerset Heritage Centre holdings<\/a>/

Then use Bulk Operations to replace it with nothing. Don’t forget to click the Regex button under Options.

It is essential to try it on a couple of pages first to ensure it works.

In the above case we were left with other common text and superfluous code, which were removed using Bulk Ops processes.
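The deletion can be sketched in Python to see what it removes and what it leaves behind. This is only an illustration under the assumption that Python's regex behaviour matches Bulk Operations for this pattern; the enclosing slashes and the backslashes before / are dropped because Python does not need them. The two sample strings are the Clutton and Alcombe lines quoted on this page.

```python
import re

# The final regex from above, in Python syntax.
PATTERN = re.compile(
    r'<a href="http://www1\.somerset\.gov\.uk/archives/asp/parish\.asp\?Parish=[^"]+">'
    r'Details of Somerset Heritage Centre holdings</a>'
)

pages = {
    "Clutton": (
        '<a href="http://www1.somerset.gov.uk/archives/asp/parish.asp?Parish=Clutton">'
        'Details of Somerset Heritage Centre holdings</a> relating to this parish.'
    ),
    "Alcombe": (
        '<a href="http://www1.somerset.gov.uk/archives/asp/parish.asp?Parish=Alcombe">'
        'Details of Somerset Heritage Centre holdings</a>relating to this parish.'
    ),
}

# Replace every match with nothing, as in the Bulk Operation.
cleaned = {name: PATTERN.sub("", html) for name, html in pages.items()}

for name, text in cleaned.items():
    print(name, repr(text))
```

The link is removed for both parishes, but the words "relating to this parish." remain, which is the leftover common text mentioned above.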


Just to illustrate the complexity of regex, these next two examples didn't work, and as at 17/10/2024 the reason isn't clear.

 

Second example

Start with:

 <a href="http://www1.somerset.gov.uk/archives/ASP/pics.asp?Place=Claverton">Postcards of Claverton</a>

Regex:

 /<a href=“http:\/\/www1\.somerset\.gov\.uk\/archives\/ASP\/pics\.asp\?Place=[^”]+”>Postcards of [^<]+<\/a>/

 

Third example

 The <a href="http://www1.somerset.gov.uk/archives/maps/os62htm/1407.htm">Ordnance Survey 1:10560 County Series 2nd edition (c.1900) map of the area</a> provided by Somerset Heritage Centre.

Regex:

/<a href=“http:\/\/www1\.somerset\.gov\.uk\/archives\/maps\/[^”]+”>Ordnance Survey 1:10560 County Series 2nd edition \(c.1900\) map of the area<\/a>/
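One thing worth checking in both of these patterns is the quotation marks. As shown on this page they are curly (“ and ”), while the HTML uses straight ones ("). Whether the patterns were actually entered into Bulk Operations with curly quotes is not known, so this is a possible cause rather than a confirmed one. The Python sketch below, using the second example, shows only that the pattern as displayed cannot match the HTML, while the same pattern with straight quotes can.

```python
import re

# The HTML from the second example, with straight quotes.
html = (
    '<a href="http://www1.somerset.gov.uk/archives/ASP/pics.asp?Place=Claverton">'
    'Postcards of Claverton</a>'
)

# The pattern as displayed on this page (curly quotes), in Python syntax:
# no enclosing slashes and no backslash before "/".
as_written = (
    r'<a href=“http://www1\.somerset\.gov\.uk/archives/ASP/pics\.asp'
    r'\?Place=[^”]+”>Postcards of [^<]+</a>'
)

# The same pattern with both curly quotes swapped for straight ones.
straight = as_written.replace("\u201c", '"').replace("\u201d", '"')

curly_matches = re.search(as_written, html) is not None
straight_matches = re.search(straight, html) is not None

print("curly quotes match:", curly_matches)
print("straight quotes match:", straight_matches)
```

If the curly quotes turn out to be only a display quirk of this page, the cause lies elsewhere and the reason remains open.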


As an experiment, a different approach was tried on the first example in this section. When this one was used in BO, it removed all the code in that topic.

The regex used was:

#^.*href.*www1.somerset.gov.uk/archives/asp/parish.asp.*details.*somerset.*heritage.*centre.*relating.*this.*parish.*$#i


The original HTML as viewed under Source was:

<ul>
              <li>
                <a href="http://www1.somerset.gov.uk/archives/ASP/pics.asp?Place=Alcombe">Postcards of Alcombe</a>
                <p>
                  <li>
                    <a href="http://www1.somerset.gov.uk/archives/asp/parish.asp?Parish=Alcombe">Details of Somerset Heritage Centre holdings</a>relating to this parish.
                    <p>
            </ul>
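A possible explanation for the whole topic disappearing is that the pattern begins with ^.* and ends with .*$, so a match covers everything from the start to the end of the line it sits on. If the topic is stored as one long line (an assumption, since the Source view shows it across several lines), that is the entire topic. The Python sketch below illustrates the effect. It assumes Python and Bulk Operations treat ^, $ and . the same way when no extra flags are set, and the variable names are hypothetical.

```python
import re

# The experimental regex, without its # delimiters; the trailing "i"
# becomes the IGNORECASE flag.
PATTERN = re.compile(
    r'^.*href.*www1.somerset.gov.uk/archives/asp/parish.asp'
    r'.*details.*somerset.*heritage.*centre.*relating.*this.*parish.*$',
    re.IGNORECASE,
)

# The pieces of the original HTML, one per line of the Source view
# (indentation omitted).
lines = [
    '<ul>',
    '<li>',
    '<a href="http://www1.somerset.gov.uk/archives/ASP/pics.asp?Place=Alcombe">Postcards of Alcombe</a>',
    '<p>',
    '<li>',
    '<a href="http://www1.somerset.gov.uk/archives/asp/parish.asp?Parish=Alcombe">'
    'Details of Somerset Heritage Centre holdings</a>relating to this parish.',
    '<p>',
    '</ul>',
]

one_line = "".join(lines)     # topic stored as a single line (assumed)
multi_line = "\n".join(lines)  # topic stored with line breaks

result_one_line = PATTERN.sub("", one_line)
result_multi_line = PATTERN.sub("", multi_line)

print(repr(result_one_line))
print(result_multi_line == multi_line)
```

On the single-line version everything is deleted, including the Postcards link. On the version with line breaks nothing matches at all, because ^ only anchors at the very start of the text and . does not cross a line break.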