String Manipulation on Large Data Using the LIKE Operator or PATINDEX()
The idea is to perform data cleansing on the Date and Time field, converting different formats into one unified format such as YYYYMMDD. Here is sample data to be used for the data cleansing that I will perform below.
The reason I am using string manipulation instead of the CONVERT() function is that the table could have millions of dates that cannot be converted with CONVERT(), so string manipulation is required. Even though most records could use CONVERT(), I would still rather use the string manipulation strategy, since in one transaction I convert all the dates, whether CONVERT() can handle them or not. See examples of these dates:
The ISDATE() function will return 0, but this date could be converted as 20090930, and I could have 5 million of these dates.
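The original code block is not preserved in this copy; a minimal sketch of the failing check (the literal '2009-30-09' is an assumption, chosen as a value that plausibly represents 20090930 with the month and day swapped) could look like this:

```sql
-- '2009-30-09' fails validation because 30 is not a valid month,
-- yet it plausibly represents 20090930.
SELECT ISDATE('2009-30-09');         -- returns 0
SELECT CONVERT(date, '2009-30-09');  -- fails with Msg 241
```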
If you run the above code, you will get the error below:
Msg 241, Level 16, State 1, Line 1 Conversion failed when converting date and/or time from character string.
Here is another example of a date that requires string manipulation:
The actual date could be 20070127, but ISDATE() returns 0 as seen below.
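The original check is not shown here; a sketch of it (the day-first literal '27/01/2007' is an assumption) might be:

```sql
-- Under US English settings (DATEFORMAT mdy), 27 is read as a month,
-- so ISDATE() rejects the value even though it plausibly means 20070127.
SELECT ISDATE('27/01/2007');  -- returns 0
```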
The following are the 4 patterns I will use for this article. Each pattern will be converted to the YYYYMMDD format, so a date that appears as 9/19/2009 will be shown as 20090919. This is realistic data that I need to deal with when importing data and then performing data cleansing using MS SQL scripts.
1. 9/19/2009
2. 1/1/2009
3. 09/25/2010
4. 2009-01-09
Extracting this format of date: 9/19/2009
Adding the wildcard on the left brings back undesirable results with dates in different formats, as shown below. You need to use only the wildcard on the right so the requested pattern shows up as needed. This way you can perform the string manipulation without any surprises. As a rule of thumb, I also make sure all data is trimmed before starting to work on it. (For the import I use SSIS.)
1. When running the code below for this format, 9/19/2009, you get the right data.
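The query itself is not preserved in this copy; a sketch of what I would expect here (the names User_Table and Date_Field are assumptions) is:

```sql
-- Trailing wildcard only: a leading % would also match longer
-- formats such as 09/25/2010, which is why it is left out.
SELECT Date_Field
FROM   User_Table
WHERE  Date_Field LIKE '[0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]%';
```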
2. Extracting this kind of date: 1/1/2009
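A sketch of this pattern (same assumed table and column names as before):

```sql
-- One-digit month and one-digit day:
SELECT Date_Field
FROM   User_Table
WHERE  Date_Field LIKE '[0-9]/[0-9]/[0-9][0-9][0-9][0-9]%';
```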
Running the above code with the wildcard on the right guarantees the right format, as shown below.
3. Extracting this format of date: 09/25/2010
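A sketch of this pattern (same assumed names):

```sql
-- Two-digit month and two-digit day:
SELECT Date_Field
FROM   User_Table
WHERE  Date_Field LIKE '[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]%';
```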
The data below shows up:
4. Handle this format: 2009-01-09
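A sketch of this pattern (same assumed names):

```sql
-- ISO-style dashes: YYYY-MM-DD
SELECT Date_Field
FROM   User_Table
WHERE  Date_Field LIKE '[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]%';
```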
Running the above query returns the records below:
I will test every pattern as shown below to make sure I am getting the right data. In a real table I will have a clustered index on the PK and will add a WHERE clause such as User_PK BETWEEN 1 AND 10000; this way the data comes back much quicker.
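A sketch of such a test, combining a pattern with the PK range (User_Table, User_PK, and Date_Field are assumed names):

```sql
-- Restricting by the clustered PK keeps the scanned range small:
SELECT COUNT(*)
FROM   User_Table
WHERE  User_PK BETWEEN 1 AND 10000
  AND  Date_Field LIKE '[0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]%';
```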
Running the above query returns the records below:
For the table on which you perform the data cleansing, the idea is to create another column where all the updates will be applied. This way I can compare and verify that the conversion was done correctly. I also added a TINYINT column, which I like to use as a flag for the different types of conversions. In reality there could be at least 20 types of conversions needed for the full conversion of the date field. See the code below, which converts all the different types of dates into YYYYMMDD in one script.
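The script itself is not preserved in this copy; a sketch of the one-script conversion (the names User_Table, Date_Field, Date_Clean, and Date_Flag are assumptions) might look like this:

```sql
-- One pass over the table: each LIKE pattern maps to a YYYYMMDD rebuild,
-- and a TINYINT flag records which conversion type was applied.
-- Rows matching no pattern are left NULL in Date_Clean for later review.
UPDATE User_Table
SET Date_Clean =
      CASE
        -- 9/19/2009  ->  20090919
        WHEN Date_Field LIKE '[0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]%'
          THEN SUBSTRING(Date_Field, 6, 4) + '0' + LEFT(Date_Field, 1)
               + SUBSTRING(Date_Field, 3, 2)
        -- 1/1/2009   ->  20090101
        WHEN Date_Field LIKE '[0-9]/[0-9]/[0-9][0-9][0-9][0-9]%'
          THEN SUBSTRING(Date_Field, 5, 4) + '0' + LEFT(Date_Field, 1)
               + '0' + SUBSTRING(Date_Field, 3, 1)
        -- 09/25/2010 ->  20100925
        WHEN Date_Field LIKE '[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]%'
          THEN SUBSTRING(Date_Field, 7, 4) + LEFT(Date_Field, 2)
               + SUBSTRING(Date_Field, 4, 2)
        -- 2009-01-09 ->  20090109
        WHEN Date_Field LIKE '[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]%'
          THEN LEFT(Date_Field, 4) + SUBSTRING(Date_Field, 6, 2)
               + SUBSTRING(Date_Field, 9, 2)
      END,
    Date_Flag =
      CASE
        WHEN Date_Field LIKE '[0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]%'      THEN 1
        WHEN Date_Field LIKE '[0-9]/[0-9]/[0-9][0-9][0-9][0-9]%'           THEN 2
        WHEN Date_Field LIKE '[0-9][0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]%' THEN 3
        WHEN Date_Field LIKE '[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]%' THEN 4
      END;

-- Compare the original and converted values side by side:
SELECT Date_Field, Date_Clean, Date_Flag
FROM   User_Table;
```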
Running the above SELECT statement shows the updated columns, as shown below:
The entire update should be run in batches; please see this article I wrote about how to implement it:
Processing-hundreds-of-millions-records-got-much-easier
This technique keeps the log file very small.
If you prefer to run each process one at a time, you can use the code shown below for this format: 9/19/2009
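A sketch of the single-pattern version (same assumed names as above):

```sql
-- Convert only the 9/19/2009 style (one-digit month, two-digit day):
UPDATE User_Table
SET    Date_Clean = SUBSTRING(Date_Field, 6, 4)   -- year
                  + '0' + LEFT(Date_Field, 1)     -- zero-padded month
                  + SUBSTRING(Date_Field, 3, 2),  -- day
       Date_Flag  = 1
WHERE  Date_Field LIKE '[0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]%';
```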
I also tested the PATINDEX() function instead of the LIKE operator, as shown below, comparing the two directly. Running against a 300-million-record table, LIKE was usually faster, by at least a second in most cases. However, LIKE becomes far slower when a pattern matches only a small number of records, such as the 14,030 records found for one specific pattern in the big table mentioned above.
Running the same pattern using the PATINDEX() function took only 10 seconds, while using LIKE took 1 minute and 9 seconds. I usually run a COUNT() first to get an idea of how many records will be processed for each pattern. In most cases I will still use the LIKE operator for this kind of data cleansing. Make sure to create an index on the field you process, and remember to create a matching column for the UPDATE. The PATINDEX() version took only 10 seconds to count 14,030 records out of 300 million.
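The query timed above is not preserved; a sketch of the PATINDEX() version (same assumed names) might be:

```sql
-- PATINDEX returns the 1-based position of the pattern, or 0 if no match:
SELECT COUNT(*)
FROM   User_Table
WHERE  PATINDEX('[0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]%', Date_Field) > 0;
```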
The LIKE version took 1 minute and 9 seconds on the same table.
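For comparison, a sketch of the equivalent LIKE count (same assumed names):

```sql
SELECT COUNT(*)
FROM   User_Table
WHERE  Date_Field LIKE '[0-9]/[0-9][0-9]/[0-9][0-9][0-9][0-9]%';
```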
----------------------------------------------------------
---------------------- Conclusion: -----------------
----------------------------------------------------------
I covered some of the steps I take for different types of data conversion, and in the process I found that PATINDEX() can be much faster than the LIKE operator under some conditions. There could be types of data that cannot be converted and must be ignored. That's why a second field is nice to have, since it becomes the final field for the production database.
There are many dates with different patterns that are legitimate candidates for conversion but will fail the CONVERT() function. Only string manipulation provides full control and minimizes date-conversion errors.