sql-server-2008

Remove duplicates

De-duplication of data is a common need, especially when dealing with data-quality issues in environments that end up with duplicate rows due to lack of enforcement of uniqueness with constraints. As an example, the following code prepares sample data with duplicate orders in a table called MyItem:

IF EXISTS (SELECT 1 FROM sys.tables WHERE name = 'MyItems')
        DROP TABLE dbo.MyItems
GO
               
               
CREATE TABLE dbo.MyItems
(
        IdItem INT IDENTITY(1,1),
        Name VARCHAR(50),
        Description VARCHAR(256)
)
GO

INSERT INTO dbo.MyItems(Name, Description)
VALUES
        ('Item1', 'Description 1'),
        ('Item1', 'Description 1'),
        ('Item2', 'Description 2'),
        ('Item3', 'Description 3'),
        ('Item3', 'Description 3'),
        ('Item3', 'Description 3 - different'),
        ('Item4', 'Description 4')
       
SELECT *
FROM dbo.MyItems
GO

An option is to filter either just the distinct rows or all but the distinct rows. You compute both ROW_NUMBER and RANK based on Name + Description ordering, like so:

SELECT
        IdItem, Name, Description,
        ROW_NUMBER() OVER (ORDER BY Name, Description) as row_num,
        RANK() OVER (ORDER BY Name, Description) as rnk
FROM dbo.MyItems
GO

Suppose that you need to de-duplicate the data, keeping only one occurrence of each unique Name+Description combination.

WITH cte
AS
(
        SELECT
                IdItem, Name, Description,
                ROW_NUMBER() OVER (ORDER BY Name, Description) as row_num,
                RANK() OVER (ORDER BY Name, Description) as rnk
        FROM dbo.MyItems
)
DELETE FROM cte
WHERE row_num != rnk

SELECT IdItem, Name, Description
FROM dbo.MyItems

When you’re done, run the following code for cleanup:

IF EXISTS (SELECT 1 FROM sys.tables WHERE name = 'MyItems')
        DROP TABLE dbo.MyItems
GO

Pagination in SQL Server

Solution for SQL Server 2008
The ROW_NUMBER function can be used for paging purposes. You assign row numbers to the result rows based on the desired ordering, and then filter the right range of row numbers based on given page-number and page-size arguments.

DECLARE
        @pagenum AS INT = 2,
        @pagesize AS INT = 10;
       
WITH cte AS
(
        SELECT ROW_NUMBER() OVER( ORDER BY Column1) AS rownum, Column1, Column2, Column3
        FROM dbo.MyTable
)
SELECT Column1, Column2, Column3
FROM cte
WHERE rownum BETWEEN (@pagenum - 1) * @pagesize + 1 AND @pagenum * @pagesize
ORDER BY rownum;

For optimal performance, you want to have an index defined on the window ordering elements as the index keys (and eventually include in the index the rest of the attributes that appear in the query for coverage purposes).

Solution for SQL Server 2012 or SQL Server 2014
An alternative solution to paging is to use the new OFFSET/FETCH filtering option:

DECLARE
        @pagenum AS INT = 2,
        @pagesize AS INT = 10;

SELECT Column1, Column2, Column3
FROM dbo.MyTable
ORDER BY Column1
OFFSET (@pagenum - 1) * @pagesize ROWS FETCH NEXT @pagesize ROWS ONLY;

How to avoid recompilation

- Don’t interleave DDL and DML statements.

- Avoid recompilation caused by statistics changes: use the KEEPFIXED PLAN option or disable the auto update statistics on the relevant table.

- Use table variables instead of temporary tables (as statistics are not created for table variables)

- Avoid changing SET options within the stored procedure.

- Use the OPTIMIZE FOR query hint that uses parameter values
supplied by you to compile the plan, regardless of the values of the parameter passed in by the calling application. It is recommended to use it in case you have a set of parameters that are "representative" from statistics point of view. This way you'll have a plan that works best most of the time.

CREATE PROCEDURE dbo.GetSales
@CustomerId INT
AS
SELECT ...
FROM
       dbo.Sales S
       INNER JOIN ...
WHERE S.CustomerId = @CustomerId
OPTION (OPTIMIZE FOR (@CustomerId = 1)) ;
GO

- Using plan guides. A plan guide allows you to use query hint or other optimization techniques without having to modify the query or procedure text. For example when the query is part of a third-party application and you are not able to modify it by including OPTION (OPTIMIZE FOR).
Example of creating a plan guide:

EXEC sp_create_plan_guide
       @name = N'MyGuide',
       @stmt = N'SELECT ...
                           FROM
                                 dbo.Sales S
                                 INNER JOIN ...
                                 WHERE S.CustomerId = @CustomerId'
,
       @type = N'OBJECT',
       @module_or_batch = N'dbo.CustomerList',
       @params = NULL,
       @hints = N'OPTION (OPTIMIZE FOR (@CustomerId = 1))' ;

Causes of recompilations

- The schema of regular tables, temporary tables, or views referred to in the stored procedure statement have changed. Schema changes include changes to the metadata of the table or the indexes on the table.

- Bindings (such as defaults) to the columns of regular or temporary tables have changed.

- Statistics on the table indexes or columns have changed past a certain threshold.

- An object did not exist when the stored procedure was compiled, but it was created during execution. This is called deferred object resolution.

- SET options have changed. (SET NOCOUNT doesn’t cause stored procedure recompilation)

- The execution plan was aged and deallocated.

- An explicit call of sp_recompile on a table, view, trigger or stored procedure

- There was an explicit use of the RECOMPILE hint:
CREATE PROCEDURE <spName> WITH RECOMPILE => prevents the caching of the stored procedure plan
EXEC <spname> WITH RECOMPILE; => a new plan is generated temporarily, without being cached of affecting any other existing cached plan
OPTION (RECOMPILE)</sql> => causes statement recompilation

Finding queries that are using table scans

The T-SQL script used to discover queries that are using table scans by searching within query plan for TableScan tag:

SELECT TOP 20
        DB_NAME(ST.dbid)AS database_name
        , ST.text as sql_Statement
        , CP.usecounts
        , QP.query_plan
FROM
        sys.dm_exec_cached_plans CP
        CROSS APPLY sys.dm_exec_sql_text(CP.plan_handle) ST
        CROSS APPLY sys.dm_exec_query_plan(CP.plan_handle) QP
WHERE
        ST.dbid = DB_ID()
        AND CAST(QP.query_plan AS NVARCHAR(MAX)) LIKE '%<TableScan%'
ORDER BY CP.usecounts DESC

Determining when a query was last run

Running the T-SQL script given will identify the top 20 times a given query was run, as ordered by last_execution_time. You can identify the relevant query by supplying part of it as a filter via the WHERE clause.

For example, determine when a specific stored procedure was last time changed:

SELECT DISTINCT TOP 20
        DB_NAME(QT.dbid) AS DatabaseName
        , QT.text as sql_statement
        , QS.last_execution_time
FROM
        sys.dm_exec_query_stats QS
        CROSS APPLY sys.dm_exec_sql_text(QS.sql_handle) as QT
WHERE
        QT.dbid = DB_ID()               -- limit search on current database
        AND QT.text LIKE '%CREATE PROCEDURE%GetEmployees%'
ORDER BY QS.last_execution_time DESC;

Find missing indexes using DMVs

The following query determines which are the missing indexes and displays their column details based on dm_db_missing_index_group_stats DMV.

USE <DatabaseName>;

SELECT 
        REPLACE(ID.[Statement], '[' + DB_NAME(ID.database_id) + '].', '') as table_name,
        CAST(S.avg_total_user_cost * S.avg_user_impact * S.user_seeks as INT) as [score],
        ID.equality_columns, ID.inequality_columns, ID.included_columns,               
        'CREATE INDEX '
        + 'IX_' + OBJECT_NAME(ID.object_id) + '_' + CAST(ID.index_handle as VARCHAR)
        + ' ON ' + REPLACE(ID.[Statement], '[' + DB_NAME(ID.database_id) + '].', '')
        + ' (' + ISNULL (ID.equality_columns,'')
        + CASE
                WHEN ID.equality_columns IS NOT NULL AND ID.inequality_columns IS NOT NULL
                THEN ', ' + ID.inequality_columns
                ELSE ''
          END
        + ')'
        + ISNULL (' INCLUDE (' + ID.included_columns + ')', '') AS create_index_statement
FROM
        sys.dm_db_missing_index_details ID
        INNER JOIN sys.dm_db_missing_index_groups G
                ON ID.index_handle = G.index_handle
        INNER JOIN sys.dm_db_missing_index_group_stats S
                ON G.index_group_handle = S.group_handle
WHERE
        ID.database_id = DB_ID()
        AND OBJECTPROPERTY(ID.[object_id], 'IsMsShipped') = 0  
ORDER BY score DESC;

Get database restore history

Returns last 10 restores performed on your server:

SELECT TOP 10
        RH.destination_database_name AS [Database],
        RH.user_name AS [Restored By],
        RH.restore_date AS [Restore Started],
        BMF.physical_device_name AS [Restored From],
        RF.destination_phys_name AS [Restored To],
        RH.*
FROM
        msdb.dbo.restorehistory RH
        INNER JOIN msdb.dbo.backupset BS ON RH.backup_set_id = BS.backup_set_id
        INNER JOIN msdb.dbo.restorefile RF ON RH.restore_history_id = RF.restore_history_id
        INNER JOIN msdb.dbo.backupmediafamily BMF ON BMF.media_set_id = BS.media_set_id
--WHERE destination_database_name = '<DatabaseName>'
ORDER BY RH.restore_history_id DESC
GO

Compute the last day of the year

You bring the last day of a year (let's say December 31th 1900) and then you add it the number of years passed:

DECLARE @dt AS DATETIME = SYSDATETIME();
SELECT DATEADD(year, DATEDIFF(year,'19001231', @dt), '19001231');
GO

Improve performance using filtered indexes

A filtered index is a non-clustered index that contains only a subset of the number of rows contained in a table. You add a WHERE clause to reduce the number of rows that are stored at the leaf level.

CREATE NONCLUSTERED INDEX <Index_Name>
ON <Table Name>(<Column(s)>)
WHERE <Filter Condition>

A good real example it is use is on Status column, as you might have a large number of rows with Closed status that are not part of your current searches (the business wants to query for Open, Processing, Invoicing, etc)

CREATE NONCLUSTERED INDEX [IX_Orders_IdStatus_Filterered]
ON [dbo].[Orders] ([IdStatus])
WHERE ([IdStatus] IN ('O', 'P', 'I'))

Another well-suited scenarios is where you frequently have to filter out NULL values.

Pros:

  • By having fewer rows in an index, less I/O is done when that index is used.
  • Index size is dramatically reduced.
  • The statistics on these filtered indexes are filtered as well, which typically results in them being more accurate.
  • Cons:

  • Use certain expressions, such as BETWEEN, NOT IN, or a CASE statement.
  • Parameterized queries doesn't take advantage of filtered indexes.
  • Pages

    Subscribe to RSS - sql-server-2008