
String Manipulation on large data using LIKE operator or PATINDEX

The idea is to perform data cleansing on a Date and Time field, converting different formats into one unified format such as YYYYMMDD. Here is the sample data I will use for the data cleansing performed below.


CREATE TABLE #T 
    (
        D1 [VARCHAR] (20) NULL 
    )
INSERT INTO #T VALUES('5/15/2006 18:11')
INSERT INTO #T VALUES('5/13/2006 17:24')
INSERT INTO #T VALUES('2007-02-07 11:38:07')
INSERT INTO #T VALUES('9/4/2008 6:55 19a')
INSERT INTO #T VALUES('5/11/2006 0:47:52')
INSERT INTO #T VALUES('2007-01-04 12:43:00')
INSERT INTO #T VALUES('')
INSERT INTO #T VALUES('  6/11/2006 0:47:52')
INSERT INTO #T VALUES('9/22/2002 07:20:07')
INSERT INTO #T VALUES('09/22/2002 07:20:07')
INSERT INTO #T VALUES('4/13/2007')
INSERT INTO #T VALUES('2/1/2007 20:23') 
INSERT INTO #T VALUES('04-21-2011 18:47:42')
INSERT INTO #T VALUES('04-21-2008 17:11:00')
INSERT INTO #T VALUES('2/1/2007 11:44') 
INSERT INTO #T VALUES('5/9/2007 9:22')
INSERT INTO #T VALUES('04/13/2007')
INSERT INTO #T VALUES('')
INSERT INTO #T VALUES('2004-04-21 16:00:00')
INSERT INTO #T VALUES('04-21-2008 17:11:00')
INSERT INTO #T VALUES('2004')
INSERT INTO #T VALUES('2010-04-21 16:00:00')
INSERT INTO #T VALUES('0000-04-21 16:00:00')
INSERT INTO #T VALUES('2011')
INSERT INTO #T VALUES('2004')
INSERT INTO #T VALUES('1700')
INSERT INTO #T VALUES('1800')
INSERT INTO #T VALUES('Nov 2 2010')
INSERT INTO #T VALUES('02-22-03 00:00:00')
INSERT INTO #T VALUES('02-22-03 00:00:00')
INSERT INTO #T VALUES('10/5/1941 8/2/2005')
INSERT INTO #T VALUES('30.09.2002')
INSERT INTO #T VALUES('05:05:06' )

ALTER TABLE #T ADD [D2] VARCHAR(20) 
ALTER TABLE #T ADD [Update_Type] TinyInt 

SELECT * FROM #T

The reason I am using string manipulation instead of the CONVERT() function is that the table could contain millions of dates that CONVERT() cannot handle. Even though most records could use CONVERT(), I would still rather use the string manipulation strategy, since in one transaction it converts all the dates, whether CONVERT() can handle them or not. See examples of such dates:

SELECT CONVERT(VARCHAR(15),CAST('30.09.2002' AS DATE),112)

The ISDATE() function returns 0 for this date, yet it could be converted to 20020930, and there could be 5 million dates like it.
If you run the above code you will get the error below:

Msg 241, Level 16, State 1, Line 1 Conversion failed when converting date and/or time from character string.
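ISDATE() confirms the point for this value; a quick check (the result assumes the default us_english language setting, where the string is read as mm.dd.yyyy):

```sql
-- Under us_english, '30' is not a valid month, so this returns 0
SELECT ISDATE('30.09.2002')
```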
Here is another example of a date that requires string manipulation:

SELECT CONVERT(VARCHAR(15),CAST('1/27/2007 6:55 17a' AS DATE),112)


The actual date could be 20070127, but ISDATE() returns 0, as shown below.

SELECT ISDATE('1/27/2007 6:55 17a') 
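On SQL Server 2012 and later, TRY_CONVERT() is a useful companion check: it returns NULL instead of raising Msg 241 when a cast fails. A small sketch (results assume default us_english settings):

```sql
-- TRY_CONVERT (SQL Server 2012+) returns NULL instead of raising an error
SELECT TRY_CONVERT(DATE, '1/27/2007 6:55 17a')   -- NULL: still needs string manipulation
SELECT TRY_CONVERT(DATE, '30.09.2002', 104)      -- style 104 (dd.mm.yyyy) does parse this one
```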


The following are the 4 patterns I will use for this article. Each pattern will be converted to the YYYYMMDD format, so a date that appears as 9/19/2009 will be shown as 20090919. This is realistic data that I need to deal with when importing data and then performing data cleansing using MS SQL scripts.
1. 9/19/2009
2. 1/1/2009
3. 09/25/2010
4. 2009-01-09

Extracting this format of date: 9/19/2009

SELECT 
    D1
FROM
    #T  
WHERE  D1 <> '' AND D1 LIKE '%[0-9][/][0-9][0-9][/][0-9][0-9][0-9][0-9]%' 

Adding the left wildcard brings undesirable results, mixing in dates of different formats as shown below. You need to use only the right wildcard so that exactly the requested pattern shows up. This way you can perform the string manipulation without any surprises. As a rule of thumb, I also make sure all data is trimmed before starting to work on it. (For the import I use SSIS.)

--5/15/2006 18:11
--5/13/2006 17:24
--5/11/2006 0:47:52
--  6/11/2006 0:47:52
--9/22/2002 07:20:07
--09/22/2002 07:20:07
--4/13/2007
--04/13/2007
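Trimming first avoids surprises like the row with leading spaces. A minimal sketch (LTRIM/RTRIM work on all versions; the TRIM() function itself needs SQL Server 2017+). Note that the sample output later in this article was produced without this step, which is why the row with leading spaces stays NULL there:

```sql
-- Strip leading/trailing spaces so rows like '  6/11/2006 0:47:52'
-- match the patterns used in this article
UPDATE #T
    SET D1 = LTRIM(RTRIM(D1))
WHERE D1 LIKE ' %' OR D1 LIKE '% '
```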

1. When running the below code for the 9/19/2009 format, you get the right data.

SELECT 
    D1
FROM
    #T  
WHERE  D1 <> '' AND D1 LIKE '[0-9][/][0-9][0-9][/][0-9][0-9][0-9][0-9]%' 

2. Extracting this format of this kind of date: 1/1/2009

SELECT 
    D1
FROM
    #T  
WHERE D1 <> '' AND D1 LIKE '[0-9][/][0-9][/][0-9][0-9][0-9][0-9]%' 

Running the above code with only the right wildcard guarantees the right format, as shown below.

--9/4/2008 6:55 19a
--2/1/2007 20:23
--2/1/2007 11:44
--5/9/2007 9:22

3. Extracting this format of date: 09/25/2010

SELECT 
    D1
FROM
    #T  
WHERE D1 <> '' AND D1 LIKE '[0-9][0-9][/][0-9][0-9][/][0-9][0-9][0-9][0-9]%' 

The rows below show up:

09/22/2002 07:20:07
04/13/2007 

4. Handle this format: 2009-01-09

SELECT
    D1
FROM 
    #T 
WHERE D1 <> '' AND D1 LIKE '[0-9][0-9][0-9][0-9][-][0-9][0-9][-][0-9][0-9]%' 

Running the above query returns the records below:

--2007-02-07 11:38:07
--2007-01-04 12:43:00
--2004-04-21 16:00:00

I test every pattern as shown below to make sure I am getting the right data. In a real table I would have a clustered index on the PK and add a WHERE clause such as User_PK BETWEEN 1 AND 10000, so the data comes back much quicker.

SELECT
    D1,
    REPLACE(LEFT(D1,10),'-','') AS ResultDate
FROM 
    #T 
WHERE D1 <> '' AND D1 LIKE '[0-9][0-9][0-9][0-9][-][0-9][0-9][-][0-9][0-9]%' 

Running the above query returns the records below:

--2007-02-07 11:38:07  20070207
--2007-01-04 12:43:00   20070104
--2004-04-21 16:00:00   20040421

For the table on which you perform the data cleansing, the idea is to create another column where all the updates are applied. This way I can compare and verify that the conversion was done correctly. I like to use the TINYINT column that was also added as a flag for the different types of conversions. In reality there could be at least 20 types of conversions needed to fully convert the date field. The code below converts all the different types of dates into YYYYMMDD in one script.

UPDATE #T
    SET D2 = CASE WHEN D1 LIKE '[0-9][/][0-9][0-9][/][0-9][0-9][0-9][0-9]%' THEN 
                SUBSTRING(D1,6,4) + '0' + REPLACE(LEFT(D1,5),'/','')            -- 9/19/2009
             WHEN D1 LIKE '[0-9][/][0-9][/][0-9][0-9][0-9][0-9]%' THEN 
                SUBSTRING(D1,5,4) + '0' + LEFT(D1,1) + '0' + SUBSTRING(D1,3,1)  -- 9/9/2009
             WHEN D1 LIKE '[0-9][0-9][/][0-9][0-9][/][0-9][0-9][0-9][0-9]%' THEN 
                SUBSTRING(D1,7,4) + REPLACE(LEFT(D1,5),'/','')                  -- 09/19/2009
             WHEN D1 LIKE '[0-9][0-9][0-9][0-9][-][0-9][0-9][-][0-9][0-9]%' THEN
                REPLACE(LEFT(D1,10),'-','')                                     -- 2009-01-09
         ELSE D2 END, 
    Update_Type = CASE WHEN D1 LIKE '[0-9][/][0-9][0-9][/][0-9][0-9][0-9][0-9]%' THEN 1 -- 9/19/2009
           WHEN D1 LIKE '[0-9][/][0-9][/][0-9][0-9][0-9][0-9]%' THEN 2                  -- 9/9/2009
           WHEN D1 LIKE '[0-9][0-9][/][0-9][0-9][/][0-9][0-9][0-9][0-9]%' THEN 3        -- 09/19/2009
           WHEN D1 LIKE '[0-9][0-9][0-9][0-9][-][0-9][0-9][-][0-9][0-9]%' THEN 4        -- 2009-01-09
        ELSE Update_Type END
FROM 
    #T  
WHERE D1 <> ''    

SELECT * FROM #T 

Running the above SELECT statement shows the updated columns, as seen below:

--D1                    D2          Update_Type
--5/15/2006 18:11       20060515    1
--5/13/2006 17:24       20060513    1
--2007-02-07 11:38:07   20070207    4
--9/4/2008 6:55 19a     20080904    2
--5/11/2006 0:47:52     20060511    1
--2007-01-04 12:43:00   20070104    4
--                      NULL        NULL
--  6/11/2006 0:47:52   NULL        NULL
--9/22/2002 07:20:07    20020922    1
--09/22/2002 07:20:07   20020922    3
--4/13/2007             20070413    1
--2/1/2007 20:23        20070201    2
--04-21-2011 18:47:42   NULL        NULL
--04-21-2008 17:11:00   NULL        NULL
--2/1/2007 11:44        20070201    2
--5/9/2007 9:22         20070509    2
--04/13/2007            20070413    3
--                      NULL        NULL
--2004-04-21 16:00:00   20040421    4
--04-21-2008 17:11:00   NULL        NULL
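To spot-check the result, the rows where the built-in conversion does succeed can be compared against D2. A sketch, assuming SQL Server 2012+ for TRY_CONVERT (it returns no rows when everything agrees):

```sql
-- Sanity check: where the built-in conversion works, it should agree with D2
SELECT
    D1,
    D2,
    CONVERT(VARCHAR(8), TRY_CONVERT(DATE, D1), 112) AS BuiltInResult
FROM
    #T
WHERE D2 IS NOT NULL
  AND TRY_CONVERT(DATE, D1) IS NOT NULL
  AND D2 <> CONVERT(VARCHAR(8), TRY_CONVERT(DATE, D1), 112)
```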

The entire update should be run in batches; please see this article I wrote about the technique and how to implement it:

Processing-hundreds-of-millions-records-got-much-easier

This technique keeps the log file very small.
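A minimal batching sketch of that idea, shown here against the sample table with the pattern-1 conversion (the batch size of 10,000 is illustrative; the linked article covers the full approach):

```sql
-- Update in chunks so each transaction, and therefore the log, stays small
DECLARE @Rows INT = 1;
WHILE @Rows > 0
BEGIN
    UPDATE TOP (10000) #T
        SET D2 = SUBSTRING(D1,6,4) + '0' + REPLACE(LEFT(D1,5),'/',''),
            Update_Type = 1
    WHERE D2 IS NULL
      AND D1 LIKE '[0-9][/][0-9][0-9][/][0-9][0-9][0-9][0-9]%';
    SET @Rows = @@ROWCOUNT;   -- 0 when no matching rows remain
END
```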

If you prefer to run each process one at a time, you can use the code below for the 9/19/2009 format:

UPDATE #T
    SET D2 = SUBSTRING(D1,6,4) + '0' + REPLACE(LEFT(D1,5),'/',''),
    Update_Type = 1
WHERE D1 <> '' AND D1 LIKE '[0-9][/][0-9][0-9][/][0-9][0-9][0-9][0-9]%' 
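The 04-21-2008-style rows stayed NULL in the output above; the same approach extends to them with one more pattern. A sketch, where Update_Type = 5 is simply the next flag value I am assuming for this conversion:

```sql
-- Convert MM-DD-YYYY rows such as 04-21-2008 17:11:00 into YYYYMMDD
UPDATE #T
    SET D2 = SUBSTRING(D1,7,4) + LEFT(D1,2) + SUBSTRING(D1,4,2),  -- 04-21-2008 -> 20080421
        Update_Type = 5  -- assumed flag value for this pattern
WHERE D1 <> '' AND D1 LIKE '[0-9][0-9][-][0-9][0-9][-][0-9][0-9][0-9][0-9]%'
```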

I also tested the PATINDEX() function against the LIKE operator, as shown below. Running against a 300-million-record table, LIKE was faster in most cases, by at least a second. However, LIKE becomes very slow when only a small number of records match a given pattern in such a big table, such as the 14,030 records found for one specific pattern.

Running that same pattern with PATINDEX() took only 10 seconds, while LIKE took 1 minute and 9 seconds. I usually run a COUNT() first to get an idea of how many records will be processed for each pattern. In most cases I will use the LIKE operator for this kind of data cleansing. Make sure to create an index on the field you process, and remember to create the same type of field for the UPDATE. The query below took only 10 seconds to count 14,030 records out of 300 million.

SELECT  -- 10 seconds
    COUNT(D1)
FROM
    #T  
WHERE PATINDEX('[0-9][0-9][-][0-9][0-9][-][0-9][0-9][0-9][0-9]%',D1)=1 AND D1 <>'' 

This one took 1 minute and 9 seconds on the same table.

SELECT 
    COUNT(D1)
FROM
    #T 
WHERE D1 <> '' AND D1 LIKE '[0-9][0-9][-][0-9][0-9][-][0-9][0-9][0-9][0-9]%' 
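As mentioned above, an index on the processed column helps these pattern scans, since the engine can scan the narrower index instead of the whole table. A sketch (the index name is illustrative, and on a real 300-million-row table you would create it on the permanent table rather than on #T):

```sql
-- Narrower index scan instead of a full table scan for the pattern queries
CREATE NONCLUSTERED INDEX IX_T_D1 ON #T (D1);
```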

----------------------------------------------------------
---------------------- Conclusion: -----------------
----------------------------------------------------------

I covered some of the steps I take for different types of data conversion, and in the process realized that PATINDEX() can be much faster than the LIKE operator under some conditions. There may be types of data that cannot be converted and must be ignored; that is why the second field is nice to have, since it becomes the final field for the production database.

Many dates come in different patterns and could be legitimate candidates for conversion, yet still fail the CONVERT() function. Only string manipulation provides full control and minimizes date-conversion errors.


Written by: Doron Farber - The Farber Consulting Group, Inc

There is 1 comment for "String Manipulation on large data using LIKE operator or PATINDEX"

By Tony Basir | May 27,2011

Nice article. I remember doing this type of thing on text files downloaded from a mainframe using procedural code 20 years ago. It's a lot faster these days with SQL
