Contained Database

A contained database is a database that is isolated from other databases and from the instance of SQL Server that hosts it. It stores all of its settings and metadata within itself, so it has no configuration dependencies on the instance of the SQL Server Database Engine where it is installed.

This feature was introduced in SQL Server 2012 and is also available on Azure SQL Database. It is not available on Azure SQL Data Warehouse or Parallel Data Warehouse.

One of the benefits of contained databases is that you no longer struggle with orphaned users when moving databases or working with high availability. SQL Server 2012 introduced the notion of boundaries: borders that define where a database and its metadata and settings end and where server-level features, metadata, and dependencies begin. In addition to implicitly allowing increased portability, boundaries are also used to help explicitly isolate data, settings, features, and capabilities.

There are three containment levels:

  • NONE. This default containment mode is what exists prior to SQL Server 2012. There are no boundaries defined or visible.
  • PARTIAL. With partially contained databases, you can define clearer boundaries between databases and the server, making it easier for metadata to be hosted within the databases. This, in turn, makes SQL Server databases more portable and less dependent on underlying hosts (see the example after this list).
  • FULL. Full containment will enable greater database portability and allow for strict enforcement of containment boundaries as a means of fully isolating databases from each other and from the underlying host.
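
To try this out, here is a minimal sketch of switching a database to partial containment; YourDatabase is a placeholder name, and the server-level option must be enabled first.

EXEC sp_configure 'contained database authentication', 1;  -- allow contained databases on the instance
RECONFIGURE;
GO

ALTER DATABASE YourDatabase  -- placeholder database name
SET CONTAINMENT = PARTIAL;
GO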

In SQL Server 2016 there are four ways in which SQL Server helps users isolate the database from the instance:

  • The metadata that describes the database is maintained in the database and not in the master database.
  • All metadata is defined using the same collation.
  • User authentication can be performed by the database, reducing the database's dependency on the logins of the instance of SQL Server.
  • The SQL Server environment (DMVs, XEvents, etc.) reports and can act upon containment information.

There are two types of users for contained databases:

  • Contained database user with password. These users are authenticated by the database (see the example after this list).
  • Windows principals. Authorized Windows users and members of authorized Windows groups can connect directly to the database and do not need logins in the master database. The database trusts the authentication by Windows.
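
As a minimal sketch, creating both kinds of contained users looks like this; the database name, user names and password are placeholders.

USE YourDatabase;  -- a partially contained database (placeholder name)
GO

-- Contained database user with password, authenticated by the database itself
CREATE USER ContainedAppUser WITH PASSWORD = 'Str0ng!Passw0rd';
GO

-- Windows principal mapped directly to the database, no login needed in master
CREATE USER [YOURDOMAIN\SomeUser];
GO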

Benefits of using Partially Contained Databases
Database Movement
When moving a database, problems occur because some important information is unavailable after the move to another instance. Login information is stored within an instance and not within a database, so when you move the database to another instance that information is lost. A partially contained database can store this information within the database itself, so it is still available after the move.

Benefit of Contained Database Users with Always On
By reducing the ties to the instance of SQL Server, partially contained databases can be useful during failover when you use Always On Availability Groups. Contained users are a very significant feature in an Always On solution. If the users are contained users, then in the event of a failover people can connect to the failover node without logins having to be created on the instance hosting that node.

Initial Database Development
When creating a database, the developer may not always know where the database will be deployed or all the environmental variables that could impact it. With a non-contained model, the developer would need to consider all these environmental variables and the impact they could have on the database, and develop accordingly. Using partially contained databases, the developer can identify instance-level impacts on the database.

Database Administration
Database administration becomes easier when database settings are maintained in the database itself instead of in the master database. This allows each database owner to have more control over their database without being given sysadmin permission.

Limitations with Partially Contained Databases
As with almost any feature, there are certain limitations. The following features are not allowed in a partially contained database:

  • Replication, change data capture and change tracking
  • Numbered procedures
  • Schema-bound objects that depend on built-in functions with collation changes
  • Binding changes resulting from collation changes, including references to objects, columns, symbols or types

Identifying Database Containment
There are two tools to identify the containment status of a database: the sys.dm_db_uncontained_entities view and the database_uncontained_usage extended event. sys.dm_db_uncontained_entities is a view that shows all potentially uncontained entities in the database; however, for entities that use dynamic SQL it cannot be determined until runtime whether the entity is contained or not. The database_uncontained_usage XEvent fires whenever an uncontained entity is identified at run time, so any uncontained user entities you have not run will not be identified by this XEvent.
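
As a starting point, here is a minimal sketch of checking a database with the view (run it in the database you want to inspect):

SELECT  class_desc
       ,EntityName = OBJECT_NAME(major_id)  -- NULL for non-object classes
FROM sys.dm_db_uncontained_entities;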

This can be an awesome feature for database developers from SQL Server 2012 onwards. Hope this post was helpful for you!


Temp Table VS Table Variable

There have been many discussions on whether a temp table or a table variable is better to use. It depends on what you would like to do, as they both have their pros, cons and limitations.

Table Variable Limitations

  • Table variable changes do not get rolled back during a transaction ROLLBACK whereas temp table changes will.
  • Table variables cannot have non-clustered indexes (prior to SQL Server 2014; see the Indexes section below)
  • You cannot add constraints to a table variable after it is declared; constraints can only be defined inline in the DECLARE statement
  • You cannot add default values to table variable columns after declaration; defaults must be defined inline
  • Statistics cannot be created against table variables
  • Table variables cannot have a parallel plan, whereas temp tables are not restricted in this manner
  • Table variables cannot be selected into, altered, truncated or be the target of DBCC commands such as DBCC CHECKIDENT or of SET IDENTITY_INSERT
  • Query hints are not supported
  • CHECK constraints on table variables are not considered by the optimizer for simplification, implied predicates or contradiction detection.

Similarities between Temp Tables and Table Variables

  • Clustered indexes can be created on both
  • Both are logged in the transaction log
  • Instantiated in the tempdb
  • You can perform DML queries against the tables

Logical Location
Table variables behave more as if they were part of the current database. If not specified explicitly, table variable column collations will be those of the current database, whereas temp tables use the default collation of tempdb. User-defined data types and XML collections must be in tempdb for you to use them in temp tables; table variables, however, can use them from the current database.

  • In a contained database, temporary table data is collated in the collation of the contained database, while all metadata associated with the temporary tables is in the catalog collation.
  • In a contained database, named constraints may not be used in temporary tables.
  • In a contained database, temporary tables may not refer to user-defined functions, user-defined types or XML collections.

Visibility to different scopes
Table variables are limited in that they can only be accessed in the batch and scope in which they are declared. Temp tables are accessible within child batches (nested triggers, procedures, EXEC calls). Temp tables created at the outer scope (@@NESTLEVEL = 0) can span batches too, as they persist until the session ends.
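
A quick sketch of the difference, using dynamic SQL as the child batch:

CREATE TABLE #Outer (ID INT);
DECLARE @tv TABLE (ID INT);

EXEC ('SELECT COUNT(*) FROM #Outer');   -- works: the temp table is visible to the child batch
--EXEC ('SELECT COUNT(*) FROM @tv');    -- would fail: the table variable is not declared in the child batch

DROP TABLE #Outer;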

Lifetime
Table variables are created implicitly when a batch containing a DECLARE @MyTableVariable TABLE statement is executed (before any user code in that batch runs), and they are dropped implicitly at the end of the batch. Temp tables are created explicitly when the CREATE TABLE #MyTemp statement is encountered and can be dropped explicitly with DROP TABLE, or they will be dropped when the session ends. If a temp table is created in a child batch (@@NESTLEVEL > 0), it will be dropped implicitly when that scope ends.

Within stored routines both types can be cached rather than repeatedly creating and dropping new tables. There are restrictions on when this caching can occur; it is possible to violate those conditions with temp tables, whereas the restrictions on table variables prevent any violations. The maintenance overhead on cached temp tables is slightly greater than for table variables. For more info on this, take a look here.

Transactions
Temp table operations are carried out as part of the user transaction itself, whereas table variable operations are carried out as system transactions, independent of any outer user transaction. A ROLLBACK command will therefore affect a temp table but leave a table variable untouched.
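
A small sketch that shows this behaviour:

DECLARE @tv TABLE (ID INT);
CREATE TABLE #tt (ID INT);

BEGIN TRAN;
	INSERT INTO @tv VALUES (1);
	INSERT INTO #tt VALUES (1);
ROLLBACK;

SELECT TableVariableRows = COUNT(*) FROM @tv;   -- 1: the table variable keeps its row
SELECT TempTableRows     = COUNT(*) FROM #tt;   -- 0: the temp table change was rolled back

DROP TABLE #tt;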

Logging
Both generate log records in the tempdb transaction log; a common misconception is that this is not the case for table variables. The quantity of logging is very similar. A very important difference is that log records related to temp tables cannot be cleared out until any containing user transaction finishes, so a long-running transaction that at some point writes to a temp table will prevent log truncation in tempdb, whereas the autonomous transactions spawned for table variables do not.

Indexes
For versions prior to SQL Server 2014, indexes can only be created implicitly on table variables as a side effect of adding a unique constraint or primary key. This means that only unique indexes are supported. Table variables do not support included columns, filtered indexes (until 2016) or partitioning, whereas temp tables do.

In SQL Server 2014 non-unique indexes can be declared inline in the table variable definition, and in SQL Server 2016 filtered indexes can be applied to table variables.
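
A minimal sketch of the inline syntax on SQL Server 2014 or later; the table and index names are just placeholders.

DECLARE @Orders TABLE
(
	OrderID		INT NOT NULL PRIMARY KEY,			-- unique index via the primary key (all versions)
	CustomerID	INT NOT NULL INDEX IX_CustomerID,	-- inline non-unique index (SQL Server 2014+)
	OrderDate	DATE NOT NULL
);

INSERT INTO @Orders VALUES (1, 100, '2016-01-01'), (2, 100, '2016-01-02');
SELECT OrderID, OrderDate FROM @Orders WHERE CustomerID = 100;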

Hope the above gives you a bit more clarity on the benefits of each one in helping you decide whether to use a temp table or table variable in your scenarios.


Detecting Row Count Anomalies with Simple Linear Regression

As a follow-up to my previous post on Detecting ETL Duration Anomalies with Simple Linear Regression, I thought I would share how to detect Row Count Anomalies with Simple Linear Regression.

This is very useful for detecting when table row counts suddenly grow by an abnormal amount, or are flushed unexpectedly.

You can use the scripts from my previous post here to create the user-defined table type, table-valued function and failure log table.

Once you have run the above scripts you can move on to creating the row count history table that will store a history of the row counts.

CREATE TABLE [tests].[TableRecCountHist](
	[TableRecCountHistoryID] [int] IDENTITY(1,1) NOT NULL,
	[TableTestDate] [date] NOT NULL,
	[TableTestTime] [time](7) NOT NULL,
	[TableName] [varchar](100) NOT NULL,
	[TableRecCount] [int] NOT NULL,
	[IsError] [bit] NULL,
	[Sensitivity] [decimal](18, 3) NULL,
	[CurrCntSigma] [decimal](18, 5) NULL
) ON [PRIMARY]

Next you will create the stored procedure to detect the row count anomalies. This procedure will insert into the history table to keep track of the table row counts in order to work out whether there is an anomaly or not.

CREATE PROC [tests].[tst_TableRowCount]
AS
BEGIN
	SET NOCOUNT ON;
	/*----------------------------------------------	
	  * Uncomment for debug
	------------------------------------------------*/
	 --IF OBJECT_ID('tempdb..#PreOp') IS NOT NULL DROP TABLE #PreOp
	 --IF OBJECT_ID('tempdb..#tReport') IS NOT NULL DROP TABLE #tReport

	/*----------------------------------------------
	  * Tunable parameters
	------------------------------------------------*/
	DECLARE @nDaysHist			SMALLINT	= 90,	-- Set day count to limit history retention 
			@nSampleMAX			SMALLINT	= 30,	-- Set maximum history count to evaluate using simple linear regression
			@nSampleMIN			SMALLINT	= 3,	-- Set minimum history count to evaluate using simple linear regression
			@STDevSensitivity	SMALLINT	= 6		-- Set a value to increase or decrease the upper and lower threshold

	/*-----------------------------------------------
	  *	List tables to exclude
	-------------------------------------------------*/
	DECLARE @tExceptions TABLE	(
									FullTableName varchar(200) NOT NULL
								)

	INSERT INTO @tExceptions
	SELECT 'Staging.ops.Address' 
	UNION
	SELECT 'TestDWH.dbo.Address'

	/*-----------------------------------------------------------------------------------------------------------------------------------
	  FLUSH                       
		* Clear history table of records older than @nDaysHist days
	*/-----------------------------------------------------------------------------------------------------------------------------------
	DELETE
	FROM tests.TableRecCountHist
	WHERE TableTestDate < DATEADD(DAY, -@nDaysHist, GETDATE()) 
	;

	/*-----------------------------------------------------------------------------------------------------------------------------------
	  GET COUNTS                                             
		* Get table record count for all tables in the [Staging] and [TestDWH] databases
		* Counts will be recorded per table, no matter how many times they have been recorded for that day already
	*/-----------------------------------------------------------------------------------------------------------------------------------
	;WITH TABcte ([object_id], name, database_name, SCHEMA_ID) AS
	(
		SELECT *
		FROM
		(
			SELECT [object_id], name, database_name = 'Staging', SCHEMA_ID FROM [Staging].sys.tables
			UNION ALL
			SELECT [object_id], name, database_name = 'TestDWH',  SCHEMA_ID FROM [TestDWH].sys.tables 
		)DT
	)
	,SCHcte AS
	(
		SELECT database_name = 'Staging', [schema_id], name FROM [Staging].sys.schemas 
		UNION ALL
		SELECT database_name = 'TestDWH',	[schema_id], name FROM [TestDWH].sys.schemas
	)
	,DPScte AS
	(
		SELECT database_name = 'Staging', [object_id], index_id, row_count FROM [Staging].sys.dm_db_partition_stats 
		UNION ALL
		SELECT database_name = 'TestDWH',[object_id], index_id, row_count FROM [TestDWH].sys.dm_db_partition_stats
	)
	,TRCcte AS
	(
		SELECT	 TableID		= TAB.[object_id]
				,TableTestDate	= CAST(GETDATE() AS DATE) 
				,TableTestTime	= CAST(GETDATE() AS TIME)	
				,FullName		= CONCAT(TAB.database_name, '.', SCH.name, '.', TAB.name)
				,TableRecCount	= ISNULL(RC.RCount,0) 
		FROM TABcte		TAB
		JOIN SCHcte		SCH		ON	TAB.[schema_id] = SCH.[schema_id]
								AND	TAB.database_name = SCH.database_name
		CROSS 
		APPLY
		(
			SELECT RCount = SUM (row_count)
			FROM DPScte
			WHERE [object_id] = TAB.[object_id]
			AND (index_id = 0 OR index_id = 1)
			AND DPScte.database_name = TAB.database_name
		)				RC
		WHERE TAB.name != '__RefactorLog'
		AND TAB.name != 'sysdiagrams'
		AND SCH.name NOT IN ('tests','ops','errorlog') --Add Schema names of schemas you do not want tested.
	)

	INSERT INTO tests.TableRecCountHist (TableTestDate, TableTestTime, TableName, TableRecCount)
	SELECT DISTINCT
			 TableTestDate
			,TableTestTime
			,FullName
			,TableRecCount
	FROM TRCcte	

	/*-----------------------------------------------------------------------------------------------------------------------------------
	  MAGIC
		* Using Simple Linear Regression, issue an estimation for what the row count per table should be "today"
		* Based on the regression line, define thresholds that will determine alert
	*/-----------------------------------------------------------------------------------------------------------------------------------
	SELECT	 TableName
			,X						= DATEDIFF(DAY, MIN(TableTestDate) OVER (PARTITION BY TableName), TableTestDate) + 1
			,TableTestDate
			,Y						= TableRecCount
			,SampleOrdinal			= ROW_NUMBER() OVER (PARTITION BY TableName ORDER BY TableTestDate DESC) 
			,MeasurementsPerTable	= COUNT(*) OVER (PARTITION BY TableName)
	INTO #PreOp
	FROM
	(
		SELECT	 [TableRecCountHistoryID]
				,[TableTestDate]
				,[TableTestTime]
				,trch.[TableName]
				,[TableRecCount]
				,RecentPerDay	= ROW_NUMBER() OVER (PARTITION BY trch.TableName, trch.TableTestDate ORDER BY trch.TableTestTime DESC)
		FROM tests.TableRecCountHist	trch
		WHERE trch.IsError IS NULL
		AND trch.TableTestDate < CAST(GETDATE() AS DATE)
		AND NOT EXISTS	(
							SELECT *
							FROM @tExceptions ex
							WHERE trch.TableName = ex.FullTableName 
						)
	)DT
	WHERE RecentPerDay = 1

	DECLARE @tXYData XYDataPoints
	INSERT INTO @tXYData(GroupID,X,Y)
	SELECT	 TableName
			,X
			,Y
	FROM #PreOp
	WHERE SampleOrdinal <= @nSampleMAX AND MeasurementsPerTable >= @nSampleMIN

	/*-----------------------------------------------
	  * Table to use for updating the error log
	-----------------------------------------------*/
	DECLARE @ErrorOutput TABLE	( 
									FullTableName varchar(200) NOT NULL
								)

	;WITH Pruned AS
	(
		SELECT	 TableName
				,X						
				,TableTestDate
				,Y						
				,SampleOrdinal			
				,MeasurementsPerTable	
		FROM #PreOp pO
		WHERE	EXISTS	(
							SELECT *
							FROM @tXYData	tD
							WHERE tD.GroupID = pO.TableName	
						)
	)
	,FNResponse AS
	(
		SELECT	 GroupID				
				,Slope				
				,Intercept			
				,StDevError			
		FROM dbo.fn_GetSimpleLinearRegression(@tXYData)
	)
	,TodayXCTE AS
	(
		SELECT	 TableName
				,X						
				,TableTestDate
				,Y						
				,SampleOrdinal			
				,MeasurementsPerTable
				,GroupID				
				,Slope				
				,Intercept			
				,StDevError		
				,PredictedY	=	(	Slope * (
												DATEDIFF(DAY, MIN(TableTestDate) OVER (PARTITION BY TableName), GETDATE()) + 1
											)
								) + Intercept
		FROM Pruned		P
		JOIN FNResponse	F	ON P.TableName = F.GroupID
	)
	,ThresholdCTE AS
	(
		SELECT DISTINCT
				 TableName
				,PredictedY
				,Y
				,UpperThreshold	= PredictedY + (@STDevSensitivity * STDevError)
				,LowerThreshold	= PredictedY - (@STDevSensitivity * STDevError)
				,StDevError
		FROM TodayXCTE
		WHERE SampleOrdinal = 1
	)
	,TblRecCntHistCTE AS
	(
		SELECT	 TableName
				,TableTestDate
				,TableRecCount
				,IsError
				,Sensitivity
				,CurrCntSigma
				,RecentPerDay	= ROW_NUMBER() OVER (PARTITION BY trch.TableName, trch.TableTestDate ORDER BY trch.TableTestTime DESC)
		FROM tests.TableRecCountHist	trch
	)
	/*-----------------------------------------------
	  * Update only most recent counts in history 
		table
	-----------------------------------------------*/
	UPDATE trch
	SET  IsError		= 1
		,Sensitivity	= @STDevSensitivity
		,CurrCntSigma	= CEILING(ABS(trch.TableRecCount - ths.PredictedY) / NULLIF(ths.StDevError,0))
	/*-----------------------------------------------
	  * Output updated values to table variable 
		for later use 
	-----------------------------------------------*/
	OUTPUT	INSERTED.TableName
	INTO	@ErrorOutput

	FROM TblRecCntHistCTE	trch
	LEFT
	JOIN ThresholdCTE		ths	ON trch.TableName = ths.TableName
	WHERE	trch.RecentPerDay = 1
		AND trch.TableTestDate = CAST(GETDATE() AS DATE)
		AND trch.IsError IS NULL
		AND (
				(trch.TableRecCount NOT BETWEEN ths.LowerThreshold AND ths.UpperThreshold) 
				OR trch.TableRecCount = 0
			)

	/*-----------------------------------------------
	  * Log entry to failure log
	-----------------------------------------------*/
	INSERT INTO tests.TestFailureLog (TestDate, TestTime, TestCollection, TestName)		
	SELECT	 TestDate			= CAST(GETDATE() AS DATE)
			,TestTime			= CAST(GETDATE() AS TIME)
			,TestCollection		= 'TableRowCount'
			,FullTableName
	FROM @ErrorOutput	EO
	WHERE NOT EXISTS	(
							SELECT *
							FROM tests.TestFailureLog	TL
							WHERE TL.TestDate		= CAST(GETDATE() AS DATE) 
							AND TL.TestCollection	= 'TableRowCount'
							AND TL.TestName			= EO.FullTableName
						)
END
GO

If you have multiple test cases in a tests schema, you can create another proc that will execute all the procs in that schema. Then you can create a job that executes this procedure and have it email out the results of the failure log table every day.

CREATE PROC [tests].[RunTestSuite] 
	@Debug INT = 0
AS
BEGIN
	DECLARE @i				SMALLINT		= 0,
			@DynamicSQL		NVARCHAR(500)	= '',
			@CurrentLogID	INT				= 0

	CREATE TABLE #TestsToRun
	(
		TestOrdinal		SMALLINT,
		TestProcName	VARCHAR(50)	
	)

	INSERT INTO #TestsToRun
	(
		TestOrdinal,
		TestProcName
	)
	SELECT	 TestOrdinal	= ROW_NUMBER() OVER (ORDER BY  ROUTINE_NAME)
			,TestProcName	= CONCAT(ROUTINE_SCHEMA, '.', ROUTINE_NAME)
	FROM INFORMATION_SCHEMA.ROUTINES	
	WHERE ROUTINE_SCHEMA = 'tests'
	AND ROUTINE_NAME LIKE 'tst_%'

	SET @i = @@ROWCOUNT

	SET @CurrentLogID = IDENT_CURRENT('tests.TestFailureLog')

	WHILE @i > 0
	BEGIN
		SELECT @DynamicSQL = CONCAT('EXECUTE ', TestProcName)
		FROM #TestsToRun 
		WHERE TestOrdinal = @i		
		
		IF @Debug = 1 
		BEGIN
			PRINT @DynamicSQL
		END ELSE
		BEGIN
			EXECUTE sp_executesql @DynamicSQL
		END
	
		SET @i = @i - 1;
	END

	

	SELECT	 TestFailureLogID
			,TestDate
			,TestTime
			,TestCollection
			,TestName
	FROM tests.TestFailureLog
	WHERE TestFailureLogID > @CurrentLogID

END

A special mention to fellow colleagues that worked on this code as well, Mark Diamond, Ivor Williams and Janlo Du Toit.

Hope this was as useful for you as it was for us! You can download the script file for this post here.


Detecting ETL Duration Anomalies with Simple Linear Regression

Thought I would share a piece on detecting ETL duration anomalies using Simple Linear Regression. This post gives a brief overview of what simple linear regression is and demonstrates how you can use it to detect ETL duration anomalies in your testing.

What happens with ETLs is that they normally run for a set amount of time, give or take a few minutes. Then all of a sudden one starts running extremely long. If you have nothing in place to check this, other than looking at it manually each day, then this could be your solution.

Let me begin by explaining what Simple Linear Regression is.

The Simple Linear Regression model is used in statistics to plot a straight line through a set of points on a graph, in such a way that the sum of squared residuals of the model is as small as possible.

In order to plot the line on the graph you need to know where the line intercepts the y axis and the slope of the line. You will also need to choose a sensitivity, expressed in standard deviations, to decide when something is in fact an anomaly. See the figure below.
LR
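
For reference, these are the standard least-squares formulas that the function in Step 2 below implements, with x̄ and ȳ the averages of the X and Y values per group:

Slope = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²
Intercept = ȳ - Slope * x̄

The error measure used for the thresholds is the standard deviation of the absolute residuals |yi - (Slope * xi + Intercept)|, multiplied by the chosen sensitivity.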

Step 1:
To start you will need to create a user-defined table type to accept the values of the X and Y data points.

CREATE TYPE dbo.XYDataPoints AS TABLE
(
	GroupID VARCHAR(200) NOT NULL,
	X FLOAT NULL,
	Y FLOAT NULL
)

Step 2:
You need to create a table-valued function which will calculate the simple linear regression of the ETL durations.

CREATE FUNCTION [dbo].[fn_GetSimpleLinearRegression] (@MyInput AS XYDataPoints READONLY)

RETURNS @MyOutput TABLE 
(
	-- Columns returned by the function
	 GroupID			VARCHAR(200)
	,Slope				FLOAT
	,Intercept			FLOAT
	,StDevError			DECIMAL(18,5)
)
AS 

BEGIN

	;WITH BASE AS -- Check for NULL source
	(	
		SELECT	 GroupID
				,X
				,Y
		FROM @MyInput
	)
	,XYBAR AS
	(
		SELECT	GroupID
				,Y
				,yBar = AVG(Y) OVER(PARTITION BY GroupID ORDER BY X ROWS BETWEEN UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING)
				,X
				,xBar = AVG(X) OVER(PARTITION BY GroupID ORDER BY X ROWS BETWEEN UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING)
				,xCount	= COUNT(X) OVER(PARTITION BY GroupID ORDER BY X ROWS BETWEEN UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING)
		FROM BASE
	)
	,SLOPE AS
	(
		SELECT	GroupID
				,Slope = SUM((X - xBar) * (Y - yBar)) / SUM(POWER(X - xBar,2))
				,MAX(yBar) AS yBar
				,MAX(xBar) AS xBar
				,MAX(xCount) AS xCount
		FROM XYBAR
		GROUP BY GroupID
	)
	,INTERCEPT AS
	(
		SELECT	 GroupID
				,Slope
				,xBar
				,yBar
				,xCount
				,Intercept = yBar - xBar * Slope
		FROM SLOPE
	)
	INSERT @MyOutput
	SELECT	 INTERCEPT.GroupID
			,Slope
			,Intercept
			,StDevError = STDEV	(	
									ABS(y - ((Slope * (x)) + Intercept))
								)
	FROM BASE
	JOIN INTERCEPT ON BASE.GroupID = INTERCEPT.GroupID
	GROUP BY Slope,Intercept, INTERCEPT.GroupID

	RETURN
END

Step 3:
You need to create a table to contain a history of the duration of ETLs. I chose to create a tests schema and to create all the objects for this detection under it.

CREATE TABLE [tests].[ETLLogDurationHistory](
	[ETLLogDurationHistoryID] [int] IDENTITY(1,1) NOT NULL,
	[ETLName] [varchar](250) NOT NULL,
	[DateRun] [date] NOT NULL,
	[TimeRun] [time](7) NOT NULL,
	[Duration] [time](0) NOT NULL,
	[DurationInt] [int] NOT NULL,
	[ETLOutcome] [varchar](250) NOT NULL,
	[IsError] [bit] NULL,
	[Sensitivity] [decimal](18, 3) NULL,
	[CurrCntSigma] [decimal](18, 5) NULL,
 CONSTRAINT [PK_ETLLogDurationHistory] PRIMARY KEY CLUSTERED 
(
	[ETLLogDurationHistoryID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

Step 4:
You need to create a table to store the error details that were detected for the ETL duration anomalies.

CREATE TABLE [tests].[TestFailureLog](
	[TestFailureLogID] [int] IDENTITY(1,1) NOT NULL,
	[TestDate] [date] NOT NULL,
	[TestTime] [time](7) NOT NULL,
	[TestCollection] [varchar](100) NOT NULL,
	[TestName] [varchar](300) NOT NULL,
 CONSTRAINT [PK_TestFailureLog] PRIMARY KEY CLUSTERED 
(
	[TestFailureLogID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

Step 5:
Create the proc to detect the ETL duration anomalies. This proc will also insert into the history table to keep track of how long ETLs ran for in order to work out whether there is an anomaly or not.

CREATE PROC [tests].[tst_ETLDurationAnomaly] 
AS
BEGIN
	SET NOCOUNT ON;
	/*----------------------------------------------	
	  * Uncomment for debug
	------------------------------------------------*/
	 --IF OBJECT_ID('tempdb..#PreOp') IS NOT NULL DROP TABLE #PreOp
	 --IF OBJECT_ID('tempdb..#tReport') IS NOT NULL DROP TABLE #tReport

	/*----------------------------------------------
	  * Tunable parameters
	------------------------------------------------*/
	DECLARE @nDaysHist			SMALLINT	= 90,	-- Set day count to limit history retention 
			@nSampleMAX			SMALLINT	= 30,	-- Set maximum history count to evaluate using simple linear regression
			@nSampleMIN			SMALLINT	= 3,	-- Set minimum history count to evaluate using simple linear regression
			@STDevSensitivity	SMALLINT	= 6		-- Set a value to increase or decrease the upper and lower threshold
			
	/*-----------------------------------------------------------------------------------------------------------------------------------
		* FLUSH TABLES                    
		* Clear history table of records older than @nDaysHist days
	-----------------------------------------------------------------------------------------------------------------------------------*/
	DELETE
	FROM tests.ETLLogDurationHistory
	WHERE DateRun < DATEADD(DAY, -@nDaysHist, GETDATE()) ; 
	/*-----------------------------------------------------------------------------------------------------------------------------------
	  GET DURATIONS
		* Get duration for all ETLs
		* Duration will be recorded per ETL, no matter how many times they have been recorded for that day already
	-----------------------------------------------------------------------------------------------------------------------------------*/
	INSERT INTO [tests].[ETLLogDurationHistory] (ETLName,DateRun,TimeRun,Duration,DurationInt,ETLOutcome)
	SELECT	  ETLName	=	name
		 ,DateRun	=	CAST(CAST(CONVERT(DATE, CONVERT(VARCHAR(10), run_date)) AS VARCHAR(10)) AS DATE)
		 ,TimeRun	=	CAST(CAST(CONVERT(VARCHAR(8), DATEADD(SECOND, run_time, ''), 114) AS VARCHAR(10)) AS TIME(0))
		 ,Duration	=	CAST(CONVERT(CHAR(8), DATEADD(SECOND, run_duration, ''), 114)  AS TIME(0))
		 ,DurationInt	=	run_duration
		 ,ETLOutcome	=	CASE 
						WHEN SJH.run_status=0 THEN 'Failed'
						WHEN SJH.run_status=1 THEN 'Succeeded'
						WHEN SJH.run_status=2 THEN 'Retry'
						WHEN SJH.run_status=3 THEN 'Cancelled'
						ELSE 'Unknown'  
						END 
	FROM MSDB.dbo.sysjobhistory	SJH  
	JOIN MSDB.dbo.sysjobs		SJ ON SJH.job_id = SJ.job_id  
	WHERE step_id = 0  
	AND	name NOT IN (
				'collection_set_1_noncached_collect_and_upload'
				,'collection_set_2_upload'
				,'collection_set_3_upload'
				,'mdw_purge_data_[BI_MDW]'
				,'SSIS Server Maintenance Job'
				,'sysutility_get_cache_tables_data_into_aggregate_tables_hourly'
				,'sysutility_get_views_data_into_cache_tables'
				,'syspolicy_purge_history'
				,'sysutility_get_cache_tables_data_into_aggregate_tables_daily'
			    )
	AND	CAST(CONVERT(DATE, CONVERT(VARCHAR(10), run_date)) AS VARCHAR(10)) + ' ' 
		+ CAST(CONVERT(VARCHAR(8), DATEADD(SECOND, run_time, ''), 114) AS VARCHAR(10)) >= DATEADD(D,-1,GetDate())  
	ORDER BY name,run_date,run_time  

	/*-----------------------------------------------------------------------------------------------------------------------------------
	    * MAGIC
		* Using Simple Linear Regression, issue an estimation for what the Duration per ETL should be "today"
		* Based on the regression line, define thresholds that will determine alert

	*/-----------------------------------------------------------------------------------------------------------------------------------
	SELECT	 ETLName
			,X						= DATEDIFF(DAY, MIN(DateRun) OVER (PARTITION BY ETLName), DateRun) + 1
			,DateRun
			,Y						= DurationInt
			,SampleOrdinal			= ROW_NUMBER() OVER (PARTITION BY ETLName ORDER BY DateRun DESC) 
			,MeasurementsPerTable	= COUNT(*) OVER (PARTITION BY ETLName)
	INTO #PreOp
	FROM
	(
		SELECT	 ETLLogDurationHistoryID
				,ETLName
				,DateRun
				,TimeRun
				,Duration
				,DurationInt
				,ETLOutcome
				,RecentPerDay	= ROW_NUMBER() OVER (PARTITION BY ETLName, DateRun ORDER BY TimeRun DESC)
		FROM tests.ETLLogDurationHistory	EDH
		WHERE EDH.DateRun < CAST(GETDATE() AS DATE)
	)DT
	WHERE RecentPerDay = 1

	DECLARE @tXYData XYDataPoints
	INSERT INTO @tXYData(GroupID,X,Y)
	SELECT	 ETLName
			,X
			,Y
	FROM #PreOp
	WHERE SampleOrdinal <= @nSampleMAX AND MeasurementsPerTable >= @nSampleMIN

	/*-----------------------------------------------
	  * Table to use for updating the error log
	-----------------------------------------------*/
	DECLARE @ErrorOutput TABLE	( 
						FullTableName VARCHAR(200) NOT NULL
					)

	;WITH Pruned AS
	(
		SELECT	 ETLName
				,X						
				,DateRun
				,Y						
				,SampleOrdinal			
				,MeasurementsPerTable	
		FROM #PreOp pO
		WHERE	EXISTS	(
							SELECT *
							FROM @tXYData	tD
							WHERE tD.GroupID = pO.ETLName	
						)
	)
	,FNResponse AS
	(
		SELECT	 GroupID				
				,Slope				
				,Intercept			
				,StDevError			
		FROM dbo.fn_GetSimpleLinearRegression(@tXYData)
	)
	,TodayXCTE AS
	(
		SELECT	 ETLName
				,X						
				,DateRun
				,Y						
				,SampleOrdinal			
				,MeasurementsPerTable
				,GroupID				
				,Slope				
				,Intercept			
				,StDevError		
				,PredictedY	=	(	Slope * (
												--	TodayX
												DATEDIFF(DAY, MIN(DateRun) OVER (PARTITION BY ETLName), GETDATE()) + 1
											)
								) + Intercept
		FROM Pruned		P
		JOIN FNResponse	F	ON P.ETLName = F.GroupID
	)
	,ThresholdCTE AS
	(
		SELECT DISTINCT
				 ETLName
				,PredictedY
				,Y
				,UpperThreshold	= PredictedY + (@STDevSensitivity * STDevError)
				,LowerThreshold	= PredictedY - (@STDevSensitivity * STDevError)
				,StDevError
		FROM TodayXCTE
		WHERE SampleOrdinal = 1
	)
	,TblETLDurHistCTE AS
	(
		SELECT	 ETLName
				,DateRun
				,TimeRun
				,Duration
				,DurationInt
				,ETLOutcome
				,IsError
				,Sensitivity
				,CurrCntSigma
				,RecentPerDay	= ROW_NUMBER() OVER (PARTITION BY ETLName, DateRun ORDER BY TimeRun DESC)
		FROM tests.ETLLogDurationHistory	EDH
	)
	/*-----------------------------------------------
	  * Update only most recent record in history 
		table
	-----------------------------------------------*/
	UPDATE EDH
	SET  IsError		= 1
		,Sensitivity	= @STDevSensitivity
		,CurrCntSigma	= CEILING(ABS(EDH.DurationInt - ths.PredictedY) / NULLIF(ths.StDevError,0))
	/*-----------------------------------------------
	  * Output updated values to table variable 
		for later use 
	-----------------------------------------------*/
	OUTPUT	INSERTED.ETLName
	INTO	@ErrorOutput

	FROM TblETLDurHistCTE	EDH
	LEFT
	JOIN ThresholdCTE		ths	ON EDH.ETLName = ths.ETLName
	WHERE	EDH.RecentPerDay = 1
		AND EDH.DateRun = CAST(GETDATE() AS DATE)
		AND EDH.IsError IS NULL
		AND (
				(EDH.DurationInt NOT BETWEEN ths.LowerThreshold AND ths.UpperThreshold) 
				OR EDH.DurationInt = 0
			)

	/*-----------------------------------------------
	  * Log entry to failure log
	-----------------------------------------------*/
	INSERT INTO tests.TestFailureLog (TestDate, TestTime, TestCollection, TestName)		
	SELECT	 TestDate			= CAST(GETDATE() AS DATE)
			,TestTime			= CAST(GETDATE() AS TIME)
			,TestCollection		= 'ETLDuration'
			,FullTableName
	FROM @ErrorOutput
END

You can download the script file here.

Hope you are able to find this as useful as I do. Feel free to ask any questions in the comment section below!

Retrieving Data Types of All Columns in a Database

Thought I would share a quick post about retrieving all the data types of all columns in a database. This can be quite a handy script when you are comparing development and production databases to make sure all the data types match in each table.

The script displays the order in which the columns appear in each table, the schema the table belongs to, whether each column is nullable, and the maximum length, precision and scale of the column.

SELECT  [Schema]			= SCH.name 
       ,TableName			= TBS.name
       ,column_name			= COL.name 
       ,COL.column_id
       ,DatatypeSchema		= SCHEMA_NAME(TYP.[schema_id])
       ,Datatypename		= TYP.name
       ,TYP.is_user_defined
       ,TYP.is_assembly_type
       ,COL.is_nullable
       ,COL.max_length
       ,COL.[Precision]
       ,COL.scale
FROM sys.columns    COL
JOIN sys.types      TYP ON COL.user_type_id   = TYP.user_type_id
JOIN sys.tables     TBS ON TBS.[object_id]    = COL.[object_id]
JOIN sys.schemas    SCH ON SCH.[schema_id]    = TBS.[schema_id]
ORDER BY SCH.name, TBS.name, COL.column_id

Hope this can be of great use to you!

Email Job Execution History in HTML Format

Knowing whether your jobs have succeeded or failed can be quite handy, especially when it comes to knowing whether your backups or ETLs are failing. You can choose to set up email notifications on each job, although if you have multiple servers you could be receiving a lot of email throughout the day.

There is a simple script with which you can email out the job history for all the jobs executed on a server in an HTML table format, in one email. You can then set this up on each server and have one mail per server to check every day. This script can be scheduled as a job that executes a T-SQL query.

First things first, you need to make sure you have Database Mail set up on the server and have a default account set. You can use your company's SMTP server settings, or if you are doing this for yourself, you can use Gmail. There are many other providers to use besides Gmail.

For this post I shall be setting up the database mail settings with Gmail’s SMTP server details.

Part 1: Setting up database mail
Expand the Management node in Object Explorer. Right-click on Database Mail and select Configure Database Mail.
Step1

Click next on the Wizard start screen and by default the Set up Database Mail radio button will be selected. You can click next on this window to go through the steps of setting up Database Mail.
Step2
Step3

Next you will set up a Profile name for your database mail. I have chosen Job Execution mail to make it relevant for what it will be used for.
Step4

An SMTP account will need to be added. You will select Add on the same New Profile window.

Next complete the mail account details. I have set mine up to use my Gmail account. If you are using your company’s SMTP mail server details, you will need to find out the details and the port number to use. In my experience I have used Anonymous authentication when using a company SMTP server. Once everything is filled in you will click on OK.
Step5

You shall see your newly created profile for your mails. Then select next.
Step6

In the next window you need to manage your profile security. You shall see the profile name you just created. You will need to set this as the default profile for sp_send_dbmail to use when sending out the emails. Then click on next.
Step7

In the next window you can configure system parameters. I chose to leave the settings recommended by Microsoft SQL Server and clicked next.
Step8

A completion wizard will appear with a summary of everything that will be created and you can click on Finish.
Step9

Once all has executed you shall see green ticks next to each process that was completed. Finally you can click close and the database mail is now set up.
Step10

Now that the database mail is set up, you can use the script to email out your job execution history in HTML format.

Part 2: HTML email script
The first section of the script is where you set up your HTML table and place your query for what you want to email out. You can also style the table to have background colours and set padding (you might need to brush up on some CSS).

At the end of the table setup you specify who you want this to be emailed to, the subject, the body and the body format.

To test, you can run this from within a new query window in SSMS. Later on, if you choose, you can set this up as a SQL job to email out once a day or as frequently as you prefer.
EmailQuery
You can download the script for the Email Jobs Query here.
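
If you would rather type it out than download it, here is a minimal sketch of such a script, assuming the "Job Execution mail" profile created above; the recipient address is a placeholder and the styling is kept to a bare minimum.

DECLARE @Body NVARCHAR(MAX) =
	  N'<h3>Job execution history for ' + @@SERVERNAME + N'</h3>'
	+ N'<table border="1" cellpadding="4">'
	+ N'<tr><th>Job</th><th>Run date</th><th>Run time</th><th>Status</th></tr>'
	+ CAST((
		SELECT	 td = SJ.name, ''
				,td = SJH.run_date, ''
				,td = SJH.run_time, ''
				,td = CASE SJH.run_status
						WHEN 0 THEN 'Failed'
						WHEN 1 THEN 'Succeeded'
						WHEN 2 THEN 'Retry'
						WHEN 3 THEN 'Cancelled'
						ELSE 'Unknown'
					  END
		FROM MSDB.dbo.sysjobhistory	SJH
		JOIN MSDB.dbo.sysjobs		SJ ON SJH.job_id = SJ.job_id
		WHERE SJH.step_id = 0		-- add a filter on run_date here if the history is long
		FOR XML PATH('tr'), TYPE
	  ) AS NVARCHAR(MAX))
	+ N'</table>';

EXEC msdb.dbo.sp_send_dbmail
	 @profile_name	= 'Job Execution mail'		-- the profile created in Part 1
	,@recipients	= 'someone@yourcompany.com'	-- placeholder address
	,@subject		= 'Job execution history'
	,@body			= @Body
	,@body_format	= 'HTML';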


Scripting Multiple SQL Server Agent Jobs

Thought I would share this useful tip on how to script multiple SQL Server Agent jobs into one script in order to execute it on another server.

For example, you have ETLs and backups scheduled on a DEV server and you want to replicate the jobs onto the PROD server. If you have one or two jobs it would be easy to create them manually, but when you have quite a few more jobs that need to be replicated it can be quite tedious to create each one manually on the other server.

There is a simple trick for you to script out multiple jobs from SSMS and run them on another server.

Step 1:
Connect to the server where the jobs you want to replicate are located. Expand the SQL Server Agent node in Object Explorer, click on the Jobs node and press F7 on your keyboard.
Step1

Alternatively, you can access the Object Explorer Details pane by clicking on View in the top menu bar and selecting the second option, Object Explorer Details.
Step1.1

This will open up Object Explorer Details
Step1.2

Step 2:
Next you will select all the jobs that you would like to replicate. You can do this by holding down Ctrl on your keyboard and clicking on each job. Then right-click on one of the highlighted jobs and select Script job as -> Create To -> New Query Editor Window

Step2

Step 3:
This will script out all the jobs into one window. You will then need to do a find and replace to change the server name to the name of the server you will be executing this on.
Step3

You can do this by highlighting the server name, in this case BIDEV, and pressing Ctrl + H on your keyboard, which will open the Find and Replace dialogue box. Next you will enter the name of the server you want to run the script on and click Replace All.
Step3.1

A pop up will appear to tell you how many occurrences were replaced.
Step3.2

Step 4:
The last step will be to copy and paste the script onto the server on which you would like to replicate the jobs. Alternatively, you can save the script, copy it over to the server you would like to run it on, open it up in SSMS and click Execute.


SQL OUTPUT Clause

For my first post I thought I would share about the SQL OUTPUT clause which returns information from modified rows (INSERT, DELETE, UPDATE).

This can be quite useful for returning information to confirmation screens in an application or for auditing and archiving the modified rows.

Inserted and Deleted tables are two memory-resident tables that reside within SQL Server and are used with the OUTPUT clause. Whenever any DML (INSERT, DELETE, UPDATE) statement is executed, these tables are populated.

The results of an INSERT statement are stored in the Inserted table and the results of a DELETE statement are stored in the Deleted table. An UPDATE statement is essentially a delete and an insert operation combined, therefore the deleted value is written to the Deleted table and the new value is written to the Inserted table.

On a side note, you cannot directly query Inserted and Deleted tables to see what data they are currently holding, but you can use them with the OUTPUT clause as well as with Triggers.

The below examples will show you how to use the OUTPUT clause for DML statements. This will just print the results on screen.

First let us create a table and insert some values:

IF OBJECT_ID('dbo.Music') IS NOT NULL
DROP TABLE dbo.Music
GO
CREATE TABLE dbo.Music
(
   ID			  INT IDENTITY(1,1) CONSTRAINT PK_Music_ID PRIMARY KEY
  ,SongName	VARCHAR(200) NOT NULL
  ,Artist		VARCHAR(50) NOT NULL
)
GO

INSERT INTO dbo.Music (SongName, Artist) 
VALUES	 ('Symphony No.9', 'Wolfgang Amadeus Mozart')
        ,('Don Giovanni','Wolfgang Amadeus Mozart')
        ,('Gabriel''s Oboe','Yo-yo Ma')
        ,('Inception: Dream is Collapsing','Hans Zimmer')
        ,('Cello Suits','Johann Sebastian Bach')
GO

SELECT * FROM dbo.Music
GO

Example: Using the OUTPUT clause with the INSERT statement

INSERT INTO dbo.Music (SongName, Artist)
OUTPUT INSERTED.ID, INSERTED.SongName, INSERTED.Artist
VALUES ('Swan Lake', 'Pyotr Ilyich Tchaikovsky');
GO

SELECT * FROM dbo.Music
GO

Example: Using the OUTPUT clause with the DELETE statement

DELETE FROM dbo.Music
OUTPUT DELETED.ID, DELETED.SongName, DELETED.Artist
WHERE ID=5;
GO

SELECT * FROM dbo.Music
GO

Example: Using the OUTPUT clause with the UPDATE statement

UPDATE dbo.Music
SET Artist = 'Ludwig van Beethoven'
OUTPUT DELETED.Artist, INSERTED.Artist
WHERE ID = 1;
GO

SELECT * FROM dbo.Music;
GO

The next few examples will show you how to store the results in tables, temp tables and table variables.

Example 1 – table:
Inserting the data returned from an OUTPUT clause into a table can be done using the OUTPUT INTO clause. Keep in mind that you first need to create the target table, which must have the same number of columns and matching data types as the source table.

IF OBJECT_ID('dbo.Music_Inserted') IS NOT NULL
DROP TABLE dbo.Music_Inserted
GO
CREATE TABLE dbo.Music_Inserted
(
   ID			  INT CONSTRAINT PK_Music_Inserted_ID PRIMARY KEY -- plain INT (not IDENTITY) so the OUTPUT INTO clause can write the ID value
  ,SongName	VARCHAR(200) NOT NULL
  ,Artist		VARCHAR(50) NOT NULL
)
GO

SET IDENTITY_INSERT dbo.Music ON;  -- needed because an explicit value is supplied for the identity column
INSERT INTO dbo.Music (ID, SongName, Artist)
OUTPUT INSERTED.* INTO dbo.Music_Inserted 
VALUES (5, 'Duniya', 'Piyush Mishra');
SET IDENTITY_INSERT dbo.Music OFF;
GO

-- Result of Music_Inserted table and base table.
SELECT * FROM dbo.Music_Inserted;
SELECT * FROM dbo.Music;
GO

Example 2 – temp table:
With a temp table you will need to do the same as above and create the temp table first and then use the OUTPUT INTO clause to insert the data into the temp table.

IF OBJECT_ID('tempdb..#Music_Deleted') IS NOT NULL
DROP TABLE dbo.#Music_Deleted
GO
CREATE TABLE dbo.#Music_Deleted
(
   ID			  INT 
  ,SongName	VARCHAR(200) NOT NULL
  ,Artist		VARCHAR(50) NOT NULL
)
GO

DELETE FROM dbo.Music
OUTPUT DELETED.* INTO dbo.#Music_Deleted
WHERE ID IN (4,5);
GO

-- Result of temporary table and base table.
SELECT * FROM dbo.#Music_Deleted;
SELECT * FROM dbo.Music;

Example 3 – table variable:
You will declare a table variable with the same structure as the source table.

DECLARE @Music_Deleted TABLE
(
   ID			  INT 
  ,SongName	VARCHAR(200) NOT NULL
  ,Artist		VARCHAR(50) NOT NULL
)

DELETE FROM dbo.Music
OUTPUT DELETED.* 
INTO @Music_Deleted
WHERE ID IN (1,2);

-- Result of table variable
SELECT * FROM @Music_Deleted;