window-functions

Remove duplicates

De-duplication of data is a common need, especially when dealing with data-quality issues in environments that end up with duplicate rows due to lack of enforcement of uniqueness with constraints. As an example, the following code prepares sample data with duplicate orders in a table called MyItem:

IF EXISTS (SELECT 1 FROM sys.tables WHERE name = 'MyItems')
        DROP TABLE dbo.MyItems
GO
               
               
CREATE TABLE dbo.MyItems
(
        IdItem INT IDENTITY(1,1),
        Name VARCHAR(50),
        Description VARCHAR(256)
)
GO

INSERT INTO dbo.MyItems(Name, Description)
VALUES
        ('Item1', 'Description 1'),
        ('Item1', 'Description 1'),
        ('Item2', 'Description 2'),
        ('Item3', 'Description 3'),
        ('Item3', 'Description 3'),
        ('Item3', 'Description 3 - different'),
        ('Item4', 'Description 4')
       
SELECT *
FROM dbo.MyItems
GO

An option is to filter either just the distinct rows or all but the distinct rows. You compute both ROW_NUMBER and RANK based on Name + Description ordering, like so:

SELECT
        IdItem, Name, Description,
        ROW_NUMBER() OVER (ORDER BY Name, Description) as row_num,
        RANK() OVER (ORDER BY Name, Description) as rnk
FROM dbo.MyItems
GO

Suppose that you need to de-duplicate the data, keeping only one occurrence of each unique Name+Description combination.

WITH cte
AS
(
        SELECT
                IdItem, Name, Description,
                ROW_NUMBER() OVER (ORDER BY Name, Description) as row_num,
                RANK() OVER (ORDER BY Name, Description) as rnk
        FROM dbo.MyItems
)
DELETE FROM cte
WHERE row_num != rnk

SELECT IdItem, Name, Description
FROM dbo.MyItems

When you’re done, run the following code for cleanup:

IF EXISTS (SELECT 1 FROM sys.tables WHERE name = 'MyItems')
        DROP TABLE dbo.MyItems
GO

Islands problem - identifying consecutive ranges

I will present a T-SQL query that uses ROW_NUMBER for identifying consecutive ranges of existing values (also known as islands problem).
Given the below script, the expected ranges are: [1,2], [11-14], [25], [32-33].

SET NOCOUNT ON;
IF OBJECT_ID('dbo.MyTable') IS NOT NULL
        DROP TABLE dbo.MyTable;
GO

CREATE TABLE dbo.MyTable
(
col1 INT NOT NULL
CONSTRAINT PK_MyTable PRIMARY KEY
);

INSERT INTO dbo.MyTable(col1)
VALUES (1),(2),(11),(12),(13),(14),(25),(31),(32),(33);
GO

Step 1: use the ROW_NUMBER function to calculate row numbers based on col1 ordering

SELECT col1, ROW_NUMBER() OVER (ORDER BY col1) as row_num
FROM dbo.MyTable

Step 2: identify islands
The difference between two columns is constant and unique for each island as shown in this query:

SELECT col1, col1 - ROW_NUMBER() OVER (ORDER BY col1) as grp
FROM dbo.MyTable

Step 3: complete solution

;WITH cte
AS
(
        SELECT col1, col1 - ROW_NUMBER() OVER (ORDER BY col1) as grp
        FROM dbo.MyTable
)
SELECT MIN(col1) as start_range, MAX(col1) as end_range
FROM cte
GROUP BY grp

Subscribe to RSS - window-functions