
String Manipulation on large data using LIKE operator or PATINDEX

The idea is to perform data cleansing on a Date and Time field, converting different formats into one unified format such as YYYYMMDD. Here is the sample data I will use for the data cleansing performed below.


CREATE TABLE #T 
    (
        D1 [VARCHAR] (20) NULL 
    )
INSERT INTO #T VALUES('5/15/2006 18:11')
INSERT INTO #T VALUES('5/13/2006 17:24')
INSERT INTO #T VALUES('2007-02-07 11:38:07')
INSERT INTO #T VALUES('9/4/2008 6:55 19a')
INSERT INTO #T VALUES('5/11/2006 0:47:52')
INSERT INTO #T VALUES('2007-01-04 12:43:00')
INSERT INTO #T VALUES('')
INSERT INTO #T VALUES('  6/11/2006 0:47:52')
INSERT INTO #T VALUES('9/22/2002 07:20:07')
INSERT INTO #T VALUES('09/22/2002 07:20:07')
INSERT INTO #T VALUES('4/13/2007')
INSERT INTO #T VALUES('2/1/2007 20:23') 
INSERT INTO #T VALUES('04-21-2011 18:47:42')
INSERT INTO #T VALUES('04-21-2008 17:11:00')
INSERT INTO #T VALUES('2/1/2007 11:44') 
INSERT INTO #T VALUES('5/9/2007 9:22')
INSERT INTO #T VALUES('04/13/2007')
INSERT INTO #T VALUES('')
INSERT INTO #T VALUES('2004-04-21 16:00:00')
INSERT INTO #T VALUES('04-21-2008 17:11:00')
INSERT INTO #T VALUES('2004')
INSERT INTO #T VALUES('2010-04-21 16:00:00')
INSERT INTO #T VALUES('0000-04-21 16:00:00')
INSERT INTO #T VALUES('2011')
INSERT INTO #T VALUES('2004')
INSERT INTO #T VALUES('1700')
INSERT INTO #T VALUES('1800')
INSERT INTO #T VALUES('Nov 2 2010')
INSERT INTO #T VALUES('02-22-03 00:00:00')
INSERT INTO #T VALUES('02-22-03 00:00:00')
INSERT INTO #T VALUES('10/5/1941 8/2/2005')
INSERT INTO #T VALUES('30.09.2002')
INSERT INTO #T VALUES('05:05:06' )

ALTER TABLE #T ADD [D2] VARCHAR(20) 
ALTER TABLE #T ADD [Update_Type] TinyInt 

SELECT * FROM #T

The reason I am using string manipulation instead of the CONVERT() function is that the table could contain millions of dates that CONVERT() cannot handle. Even though most records could use CONVERT(), I would still rather use the string manipulation strategy, since in one transaction it converts all the dates, whether CONVERT() can handle them or not. See examples of such dates:

SELECT CONVERT(VARCHAR(15),CAST('30.09.2002' AS DATE),112)

The ISDATE() function returns 0 for this date, yet it could be converted to 20020930, and there could be 5 million dates like it.
If you run the above code you will get the error below:

Msg 241, Level 16, State 1, Line 1 Conversion failed when converting date and/or time from character string.
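ISDATE() confirms the point for this value; a quick check (the result assumes the default us_english language setting, where the string is read as mm.dd.yyyy):

```sql
-- Under us_english, '30' is not a valid month, so this returns 0
SELECT ISDATE('30.09.2002')
```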
Here is another example of a date that requires string manipulation:

SELECT CONVERT(VARCHAR(15),CAST('1/27/2007 6:55 17a' AS DATE),112)


The actual date could be 20070127, but ISDATE() returns 0, as shown below.

SELECT ISDATE('1/27/2007 6:55 17a') 
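On SQL Server 2012 and later, TRY_CONVERT() is a useful companion check: it returns NULL instead of raising Msg 241 when a cast fails. A small sketch (results assume default us_english settings):

```sql
-- TRY_CONVERT (SQL Server 2012+) returns NULL instead of raising an error
SELECT TRY_CONVERT(DATE, '1/27/2007 6:55 17a')   -- NULL: still needs string manipulation
SELECT TRY_CONVERT(DATE, '30.09.2002', 104)      -- style 104 (dd.mm.yyyy) does parse this one
```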


The following are the 4 patterns I will use for this article. Each pattern will be converted to the YYYYMMDD format, so a date that appears as 9/19/2009 will be shown as 20090919. This is realistic data that I need to deal with when importing data and then performing data cleansing using MS SQL scripts.
1. 9/19/2009
2. 1/1/2009
3. 09/25/2010
4. 2009-01-09

Extracting this format of date: 9/19/2009

SELECT 
    D1
FROM
    #T  
WHERE  D1 <> '' AND D1 LIKE '%[0-9][/][0-9][0-9][/][0-9][0-9][0-9][0-9]%' 

Adding the left wildcard brings undesirable results, mixing in dates of different formats as shown below. You need to use only the right wildcard so that exactly the requested pattern shows up. This way you can perform the string manipulation without any surprises. As a rule of thumb, I also make sure all data is trimmed before starting to work on it. (For the import I use SSIS.)

--5/15/2006 18:11
--5/13/2006 17:24
--5/11/2006 0:47:52
--  6/11/2006 0:47:52
--9/22/2002 07:20:07
--09/22/2002 07:20:07
--4/13/2007
--04/13/2007
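Trimming first avoids surprises like the row with leading spaces. A minimal sketch (LTRIM/RTRIM work on all versions; the TRIM() function itself needs SQL Server 2017+). Note that the sample output later in this article was produced without this step, which is why the row with leading spaces stays NULL there:

```sql
-- Strip leading/trailing spaces so rows like '  6/11/2006 0:47:52'
-- match the patterns used in this article
UPDATE #T
    SET D1 = LTRIM(RTRIM(D1))
WHERE D1 LIKE ' %' OR D1 LIKE '% '
```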

1. When running the below code for the 9/19/2009 format, you get the right data.

SELECT 
    D1
FROM
    #T  
WHERE  D1 <> '' AND D1 LIKE '[0-9][/][0-9][0-9][/][0-9][0-9][0-9][0-9]%' 

2. Extracting this format of this kind of date: 1/1/2009

SELECT 
    D1
FROM
    #T  
WHERE D1 <> '' AND D1 LIKE '[0-9][/][0-9][/][0-9][0-9][0-9][0-9]%' 

Running the above code with only the right wildcard guarantees the right format, as shown below.

--9/4/2008 6:55 19a
--2/1/2007 20:23
--2/1/2007 11:44
--5/9/2007 9:22

3. Extracting this format of date: 09/25/2010

SELECT 
    D1
FROM
    #T  
WHERE D1 <> '' AND D1 LIKE '[0-9][0-9][/][0-9][0-9][/][0-9][0-9][0-9][0-9]%' 

The rows below show up:

09/22/2002 07:20:07
04/13/2007 

4. Handle this format: 2009-01-09

SELECT
    D1
FROM 
    #T 
WHERE D1 <> '' AND D1 LIKE '[0-9][0-9][0-9][0-9][-][0-9][0-9][-][0-9][0-9]%' 

Running the above query returns the records below:

--2007-02-07 11:38:07
--2007-01-04 12:43:00
--2004-04-21 16:00:00

I test every pattern as shown below to make sure I am getting the right data. In a real table I would have a clustered index on the PK and add a WHERE clause such as User_PK BETWEEN 1 AND 10000, so the data comes back much quicker.

SELECT
    D1,
    REPLACE(LEFT(D1,10),'-','') AS ResultDate
FROM 
    #T 
WHERE D1 <> '' AND D1 LIKE '[0-9][0-9][0-9][0-9][-][0-9][0-9][-][0-9][0-9]%' 

Running the above query returns the records below:

--2007-02-07 11:38:07  20070207
--2007-01-04 12:43:00   20070104
--2004-04-21 16:00:00   20040421

For the table on which you perform the data cleansing, the idea is to create another column where all the updates are applied. This way I can compare and verify that the conversion was done correctly. I like to use the TINYINT column that was also added as a flag for the different types of conversions. In reality there could be at least 20 types of conversions needed to fully convert the date field. The code below converts all the different types of dates into YYYYMMDD in one script.

UPDATE #T
    SET D2 = CASE WHEN D1 LIKE '[0-9][/][0-9][0-9][/][0-9][0-9][0-9][0-9]%' THEN 
                SUBSTRING(D1,6,4) + '0' + REPLACE(LEFT(D1,5),'/','')            -- 9/19/2009
             WHEN D1 LIKE '[0-9][/][0-9][/][0-9][0-9][0-9][0-9]%' THEN 
                SUBSTRING(D1,5,4) + '0' + LEFT(D1,1) + '0' + SUBSTRING(D1,3,1)  -- 9/9/2009
             WHEN D1 LIKE '[0-9][0-9][/][0-9][0-9][/][0-9][0-9][0-9][0-9]%' THEN 
                SUBSTRING(D1,7,4) + REPLACE(LEFT(D1,5),'/','')                  -- 09/19/2009
             WHEN D1 LIKE '[0-9][0-9][0-9][0-9][-][0-9][0-9][-][0-9][0-9]%' THEN
                REPLACE(LEFT(D1,10),'-','')                                     -- 2009-01-09
         ELSE D2 END, 
    Update_Type = CASE WHEN D1 LIKE '[0-9][/][0-9][0-9][/][0-9][0-9][0-9][0-9]%' THEN 1 -- 9/19/2009
           WHEN D1 LIKE '[0-9][/][0-9][/][0-9][0-9][0-9][0-9]%' THEN 2                  -- 9/9/2009
           WHEN D1 LIKE '[0-9][0-9][/][0-9][0-9][/][0-9][0-9][0-9][0-9]%' THEN 3        -- 09/19/2009
           WHEN D1 LIKE '[0-9][0-9][0-9][0-9][-][0-9][0-9][-][0-9][0-9]%' THEN 4        -- 2009-01-09
        ELSE Update_Type END
FROM 
    #T  
WHERE D1 <> ''    

SELECT * FROM #T 

Running the above SELECT statement shows the updated columns, as seen below:

--D1                    D2          Update_Type
--5/15/2006 18:11       20060515    1
--5/13/2006 17:24       20060513    1
--2007-02-07 11:38:07   20070207    4
--9/4/2008 6:55 19a     20080904    2
--5/11/2006 0:47:52     20060511    1
--2007-01-04 12:43:00   20070104    4
--                      NULL        NULL
--  6/11/2006 0:47:52   NULL        NULL
--9/22/2002 07:20:07    20020922    1
--09/22/2002 07:20:07   20020922    3
--4/13/2007             20070413    1
--2/1/2007 20:23        20070201    2
--04-21-2011 18:47:42   NULL        NULL
--04-21-2008 17:11:00   NULL        NULL
--2/1/2007 11:44        20070201    2
--5/9/2007 9:22         20070509    2
--04/13/2007            20070413    3
--                      NULL        NULL
--2004-04-21 16:00:00   20040421    4
--04-21-2008 17:11:00   NULL        NULL
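To spot-check the result, the rows where the built-in conversion does succeed can be compared against D2. A sketch, assuming SQL Server 2012+ for TRY_CONVERT (it returns no rows when everything agrees):

```sql
-- Sanity check: where the built-in conversion works, it should agree with D2
SELECT
    D1,
    D2,
    CONVERT(VARCHAR(8), TRY_CONVERT(DATE, D1), 112) AS BuiltInResult
FROM
    #T
WHERE D2 IS NOT NULL
  AND TRY_CONVERT(DATE, D1) IS NOT NULL
  AND D2 <> CONVERT(VARCHAR(8), TRY_CONVERT(DATE, D1), 112)
```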

The entire update should be run in batches; please see this article I wrote about the technique and how to implement it:

Processing-hundreds-of-millions-records-got-much-easier

This technique keeps the log file very small.
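A minimal batching sketch of that idea, shown here against the sample table with the pattern-1 conversion (the batch size of 10,000 is illustrative; the linked article covers the full approach):

```sql
-- Update in chunks so each transaction, and therefore the log, stays small
DECLARE @Rows INT = 1;
WHILE @Rows > 0
BEGIN
    UPDATE TOP (10000) #T
        SET D2 = SUBSTRING(D1,6,4) + '0' + REPLACE(LEFT(D1,5),'/',''),
            Update_Type = 1
    WHERE D2 IS NULL
      AND D1 LIKE '[0-9][/][0-9][0-9][/][0-9][0-9][0-9][0-9]%';
    SET @Rows = @@ROWCOUNT;   -- 0 when no matching rows remain
END
```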

If you prefer to run each process one at a time, you can use the code below for the 9/19/2009 format:

UPDATE #T
    SET D2 = SUBSTRING(D1,6,4) + '0' + REPLACE(LEFT(D1,5),'/',''),
    Update_Type = 1
WHERE D1 <> '' AND D1 LIKE '[0-9][/][0-9][0-9][/][0-9][0-9][0-9][0-9]%' 
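The 04-21-2008-style rows stayed NULL in the output above; the same approach extends to them with one more pattern. A sketch, where Update_Type = 5 is simply the next flag value I am assuming for this conversion:

```sql
-- Convert MM-DD-YYYY rows such as 04-21-2008 17:11:00 into YYYYMMDD
UPDATE #T
    SET D2 = SUBSTRING(D1,7,4) + LEFT(D1,2) + SUBSTRING(D1,4,2),  -- 04-21-2008 -> 20080421
        Update_Type = 5  -- assumed flag value for this pattern
WHERE D1 <> '' AND D1 LIKE '[0-9][0-9][-][0-9][0-9][-][0-9][0-9][0-9][0-9]%'
```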

I also tested the PATINDEX() function against the LIKE operator, as shown below. Running against a 300-million-record table, LIKE was faster in most cases, by at least a second. However, LIKE becomes very slow when only a small number of records match a given pattern in such a big table, such as the 14,030 records found for one specific pattern.

Running that same pattern with PATINDEX() took only 10 seconds, while LIKE took 1 minute and 9 seconds. I usually run a COUNT() first to get an idea of how many records will be processed for each pattern. In most cases I will use the LIKE operator for this kind of data cleansing. Make sure to create an index on the field you process, and remember to create the same type of field for the UPDATE. The query below took only 10 seconds to count 14,030 records out of 300 million.

SELECT  -- 10 seconds
    COUNT(D1)
FROM
    #T  
WHERE PATINDEX('[0-9][0-9][-][0-9][0-9][-][0-9][0-9][0-9][0-9]%',D1)=1 AND D1 <>'' 

This one took 1 minute and 9 seconds on the same table.

SELECT 
    COUNT(D1)
FROM
    #T 
WHERE D1 <> '' AND D1 LIKE '[0-9][0-9][-][0-9][0-9][-][0-9][0-9][0-9][0-9]%' 
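As mentioned above, an index on the processed column helps these pattern scans, since the engine can scan the narrower index instead of the whole table. A sketch (the index name is illustrative, and on a real 300-million-row table you would create it on the permanent table rather than on #T):

```sql
-- Narrower index scan instead of a full table scan for the pattern queries
CREATE NONCLUSTERED INDEX IX_T_D1 ON #T (D1);
```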

----------------------------------------------------------
---------------------- Conclusion: -----------------
----------------------------------------------------------

I covered some of the steps I take for different types of data conversion, and in the process realized that PATINDEX() can be much faster than the LIKE operator under some conditions. There may be types of data that cannot be converted and must be ignored; that is why the second field is nice to have, since it becomes the final field for the production database.

Many dates come in different patterns and could be legitimate candidates for conversion, yet still fail the CONVERT() function. Only string manipulation provides full control and minimizes date-conversion errors.


Written by: Doron Farber - The Farber Consulting Group, Inc

There is 1 comment for "String Manipulation on large data using LIKE operator or PATINDEX"

By Tony Basir | May 27,2011

Nice article. I remember doing this type of thing on text files downloaded from a mainframe using procedural code 20 years ago. It's a lot faster these days with SQL
