Tag Archives: Sanitization

Data Sanitization with Regular Expressions, Sort of

A DBA on Twitter, using the #sqlhelp hashtag, posted about a particular problem they were having in which they wanted to sanitize some data which was presumably entered in some sort of free-form text field. Their basic idea was that they would get data like any of the following:

my id is 12345
id=12345
user name (12345)

The DBA would then want to be able to sanitize this data and come up with a string containing the simple 5-digit number (in the case presented, the id should always be 5 digits).

This is a common enough problem, one that most people working with human-input data would be familiar with. There are several ways to approach it, but some work better than others, and when you are talking about a high-load system, you want the quickest and most efficient method. In my opinion, this is a perfect opportunity to use Regular Expressions. However, T-SQL does not have native support for full-fledged Regular Expressions, which is a shame, but it does have support for a small subset of Regular Expressions, and that is all we need for this particular case.

So, first, what are we trying to accomplish? We want to take any string and look for digits, between 0 and 9. T-SQL has a function named PATINDEX that takes a subset of of regular expression patterns and an expression that evaluates to some string data type and returns an integer of the index in the string where the pattern first occurs, or zero if the pattern is not matched anywhere in the string. Combining this function with SUBSTRING should give us all of the tools we need to solve this problem.

The code to return the five-digit id as a string is as follows:

DECLARE @yourStringToTest VARCHAR(20)            -- as an example for testing
DECLARE @pattern VARCHAR(20)
SET     @yourStringToTest = 'my id is 12345'
SET     @pattern = '%[0-9]%'
SELECT
    SUBSTRING(
                @yourStringToTest,               -- Source String (for the substring)
                PATINDEX(
                            @pattern,            -- The pattern we are matching
                            @yourStringToTest    -- Source String (for the pattern matching)
                            ),
                5                                -- The number of characters we are wanting to return
                )

As you can see, this is a very simple solution, but what about if the data has some really strange artifacts? What if they have data like:

my 1st ID is 12345
requesting 3 actions, my id is 12345

This poses an additional problem, how do we get to the actual id, which is the five-digit number, but skip over the integers that occur earlier in the string. If we try the code we have above, we are going to get “1st I” and “3 act”, respectively. Clearly we need to add some additional logic to our pattern. We need to ensure that the pattern looks for 5 consecutive digits, as opposed to just the first digit we encounter. Additionally, what if we want to get any id that is at least 3 digits, but not greater than 8, for example? These are all issues that would be solved very easily with Regular Expressions, unfortunately T-SQL does not support the full set of Regular Expression syntax.

There are, however, some interesting alternatives that have been created by members of the SQL community. They involve using COM objects, so I would be wary using them in any production system, but as a one-off script you need to run, it is definitely worth a shot. You can read about this method and see the code necessary to accomplish it by checking out the post made by yoinky on this SQLTeam.com forum thread.

Additionally, you could solve the problem by creating a simple SSIS package to iterate through your source data and run a Script Component to utilize the Regular Expression pattern. I’ll work on an example of that for a future post.

I would like to see Microsoft include full support of Regular Expressions in a future release of SQL Server, but until then, at least we have PATINDEX and LIKE.