Limiting SSIS to a Single File Type

A while ago I was having some issues with SSIS moving files. The cause was a hidden file called thumbs.db.

When SSIS ran the move file task it would fail because it cannot place duplicate files in the same location, and the thumbs.db file was the problem. Even though it was hidden as a Windows file, SSIS still saw it and wanted to move it.

This is where limiting SSIS to a specific file type saves you from this problem.

To do this you filter on the Foreach Loop Container. Select the container and then click on Properties.

In the Properties pane, right at the top, you will see FileSpec. There you can set the file type you would like to filter on.
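
For example, to restrict the loop to one file type you set FileSpec to a wildcard pattern (these patterns are only illustrative):

*.csv (picks up only CSV files)
Invoice_*.txt (picks up only text files whose names start with Invoice_)

A hidden thumbs.db no longer matches the pattern, so the loop simply never sees it.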

And that’s how you limit SSIS to a specific file type!

SSIS Error – the path for isserverexec.exe cannot be found

Recently I have been working on getting multiple SQL Server instances to run on one machine. When I uninstalled an old instance it affected the latest one, and all of a sudden my ETL jobs failed. It did not make sense at all, seeing as I had removed a different instance.

I got the following error when trying to run my ETL jobs and when trying to deploy them:
“the path for isserverexec.exe cannot be found”

Solutions online recommended going into the registry, which did not work for me.

I ended up running the installation media again to add features to an existing instance. And there it was: Integration Services was not installed for the instance that was throwing the error.

I installed SSIS on the instance that needed it, and everything worked perfectly after that; the jobs ran as usual.

Hope this helps someone else in the future.

Setting Up SSIS in SQL Server Management Studio

Before you can deploy an SSIS project you need to set up SSIS in SQL Server Management Studio. When installing SQL Server you can select to install SSIS; however, the installer does not set up the SSISDB or configure any settings for SSIS.

Here are the steps for setting up SSIS in SQL Server Management Studio.

Step 1:
Right-click on Integration Services Catalogs and select Create Catalog.

Step 2:
Check the box next to Enable CLR Integration, set a password for your SSISDB and click OK.

You will then see your SSISDB listed under Databases and under Integration Services Catalogs.

Now you will be able to deploy your SSIS project to your server where you have set up SSIS.
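
If you would like to confirm the setup from a query window afterwards, a quick sanity check could look like this (a minimal sketch):

-- Confirm the SSISDB catalog database exists
SELECT name FROM sys.databases WHERE name = N'SSISDB';

-- Confirm CLR integration is enabled (run_value should be 1)
EXEC sp_configure 'clr enabled';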

The next post will be on deploying an SSIS project.

MDS Backup Strategies

Recently a colleague and I had a discussion about which method is best for backing up MDS (Master Data Services) for disaster recovery. We came to the conclusion that it all depends on the SQL Server version you are recovering to. Below are the two methods for backing up an MDS database.

Model Backup
Pros:

  • Fast to create, can backup specific versions
  • Will always be able to deploy a model to a MDS install
  • Can be automated via SSIS
  • Can be restored to a server version that is one higher (+1), the same (0) or one lower (-1) than the source

Cons:

  • If you want to save backups of different models, then the models will have to be backed up separately.
  • Security is not retained; it will have to be redone. Can take about 2 minutes per user. On a small model this should not be a problem, however, when scaled out it could be very time consuming.
  • Saving more than one backup of a model would require a script/maintenance task to remove backups older than X amount of time.

When deploying the model, the option DeployClone must be used, not DeployNew. DeployNew will remove the MUID and therefore will not allow any changes to be deployed to that model at a later stage. DeployClone keeps it all intact and will allow for future updates.
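
On the command line the pairing looks roughly like this, using the MDSModelDeploy tool that ships with MDS (the model, version, path and service names below are placeholders for your own):

:: Create a backup package of a specific model version, including the data
MDSModelDeploy createpackage -model "Customer" -version "VERSION_1" -package "C:\MDSBackups\Customer.pkg" -includedata -service "MDS1"

:: Restore it as a clone so the MUID is kept and future updates can still deploy
MDSModelDeploy deployclone -package "C:\MDSBackups\Customer.pkg" -service "MDS1"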

Database Backup
Pros:

  • Security is retained
  • Simple to set-up and maintain
  • Easy to restore
  • Maintenance tasks to clean up old backups

Cons:

  • Cannot be restored to a server version less than the current version.

The database backup for MDS can be scheduled just like other backups. You can use the maintenance wizard to set up the backup and clean up task on a schedule.
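
A minimal sketch of the backup itself, assuming the MDS database is simply named MDS and the target folder exists:

BACKUP DATABASE [MDS]
TO DISK = N'D:\Backups\MDS_Full.bak'
WITH COMPRESSION, INIT, STATS = 10;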

If you know what version the disaster recovery environment will be on, for example the same version as the current MDS server or higher, then it is simpler to use the Database Backup method. If you are uncertain what version the disaster recovery environment will be on, then you would want to use the Model Backup method in case you have to restore to a version lower than the current MDS server version.

Hope this helps you to choose what is best suited for your environment!

Special thanks to Janlo Du Toit for the discussion on these methods 🙂

Detecting ETL Duration Anomalies with Simple Linear Regression

Thought I would share a piece on detecting ETL duration anomalies using Simple Linear Regression. This post gives a brief overview of what simple linear regression is and then demonstrates how to use it to detect ETL duration anomalies.

ETLs normally run for a set amount of time, give or take a few minutes. Then all of a sudden one starts running extremely long. If you have nothing in place to check for this, other than looking at it manually each day, then this could be your solution.

Let me begin by explaining what Simple Linear Regression is.

Simple Linear Regression is a statistical model where a straight line is plotted through a set of points on a graph, in such a way that the sum of squared residuals of the model is as small as possible.
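
Concretely, using the same names as the SQL below (xBar and yBar are the per-group averages of the X and Y values), the slope and intercept that minimise the squared residuals are:

Slope = SUM((X - xBar) * (Y - yBar)) / SUM((X - xBar)^2)
Intercept = yBar - Slope * xBar

These are exactly the calculations the function in Step 2 performs.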

In order to plot the line on the graph you need to know where the line intercepts the y axis and the slope of the line. You will also need to set a standard-deviation threshold for deciding when something is in fact an anomaly. See the figure below.
[Figure: regression line plotted through the data points, with upper and lower thresholds either side]

Step 1:
To start you will need to create a user-defined data type to accept the values of the X and Y data points.

CREATE TYPE dbo.XYDataPoints AS TABLE
(
	GroupID VARCHAR(200) NOT NULL,
	X FLOAT NULL,
	Y FLOAT NULL
)

Step 2:
You need to create a table-valued function which will calculate the simple linear regression of the ETL’s durations.

CREATE FUNCTION [dbo].[fn_GetSimpleLinearRegression] (@MyInput AS XYDataPoints READONLY)

RETURNS @MyOutput TABLE 
(
	-- Columns returned by the function
	 GroupID			VARCHAR(200)
	,Slope				FLOAT
	,Intercept			FLOAT
	,StDevError			DECIMAL(18,5)
)
AS 

BEGIN

	;WITH BASE AS -- Source data points
	(	
		SELECT	 GroupID
				,X
				,Y
		FROM @MyInput
	)
	,XYBAR AS
	(
		SELECT	GroupID
				,Y
				,yBar = AVG(Y) OVER(PARTITION BY GroupID ORDER BY X ROWS BETWEEN UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING)
				,X
				,xBar = AVG(X) OVER(PARTITION BY GroupID ORDER BY X ROWS BETWEEN UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING)
				,xCount	= COUNT(X) OVER(PARTITION BY GroupID ORDER BY X ROWS BETWEEN UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING)
		FROM BASE
	)
	,SLOPE AS
	(
		SELECT	GroupID
				,Slope = SUM((X - xBar) * (Y - yBar)) / SUM(POWER(X - xBar,2))
				,MAX(yBar) AS yBar
				,MAX(xBar) AS xBar
				,MAX(xCount) AS xCount
		FROM XYBAR
		GROUP BY GroupID
	)
	,INTERCEPT AS
	(
		SELECT	 GroupID
				,Slope
				,xBar
				,yBar
				,xCount
				,Intercept = yBar - xBar * Slope
		FROM SLOPE
	)
	INSERT @MyOutput
	SELECT	 INTERCEPT.GroupID
			,Slope
			,Intercept
			,StDevError = STDEV	(	
									ABS(y - ((Slope * (x)) + Intercept))
								)
	FROM BASE
	JOIN INTERCEPT ON BASE.GroupID = INTERCEPT.GroupID
	GROUP BY Slope,Intercept, INTERCEPT.GroupID

	RETURN
END
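
As a quick sanity check of the function (hypothetical values): feed it points that lie exactly on the line y = 2x + 1 and you should get a Slope of 2, an Intercept of 1 and a StDevError of 0 back.

DECLARE @pts dbo.XYDataPoints;

INSERT INTO @pts (GroupID, X, Y)
VALUES	 ('Demo', 1, 3)
		,('Demo', 2, 5)
		,('Demo', 3, 7);

SELECT * FROM dbo.fn_GetSimpleLinearRegression(@pts);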

Step 3:
You need to create a table to contain a history of the duration of ETLs. I chose to create a tests schema and to put all the objects for this detection under it.

CREATE TABLE [tests].[ETLLogDurationHistory](
	[ETLLogDurationHistoryID] [int] IDENTITY(1,1) NOT NULL,
	[ETLName] [varchar](250) NOT NULL,
	[DateRun] [date] NOT NULL,
	[TimeRun] [time](7) NOT NULL,
	[Duration] [time](0) NOT NULL,
	[DurationInt] [int] NOT NULL,
	[ETLOutcome] [varchar](250) NOT NULL,
	[IsError] [bit] NULL,
	[Sensitivity] [decimal](18, 3) NULL,
	[CurrCntSigma] [decimal](18, 5) NULL,
 CONSTRAINT [PK_ETLLogDurationHistory] PRIMARY KEY CLUSTERED 
(
	[ETLLogDurationHistoryID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

Step 4:
You need to create a table to store the error details that were detected for the ETL duration anomalies.

CREATE TABLE [tests].[TestFailureLog](
	[TestFailureLogID] [int] IDENTITY(1,1) NOT NULL,
	[TestDate] [date] NOT NULL,
	[TestTime] [time](7) NOT NULL,
	[TestCollection] [varchar](100) NOT NULL,
	[TestName] [varchar](300) NOT NULL,
 CONSTRAINT [PK_TestFailureLog] PRIMARY KEY CLUSTERED 
(
	[TestFailureLogID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

Step 5:
Create the proc to detect ETL duration anomalies. The proc also inserts into the history table to keep track of how long ETLs ran, which is the data used to work out whether there is an anomaly or not.

CREATE PROC [tests].[tst_ETLDurationAnomaly] 
AS
BEGIN
	SET NOCOUNT ON;
	/*----------------------------------------------	
	  * Uncomment for debug
	------------------------------------------------*/
	 --IF OBJECT_ID('tempdb..#PreOp') IS NOT NULL DROP TABLE #PreOp
	 --IF OBJECT_ID('tempdb..#tReport') IS NOT NULL DROP TABLE #tReport

	/*----------------------------------------------
	  * Tunable parameters
	------------------------------------------------*/
	DECLARE @nDaysHist			SMALLINT	= 90,	-- Set day count to limit history retention 
			@nSampleMAX			SMALLINT	= 30,	-- Set maximum history count to evaluate using simple linear regression
			@nSampleMIN			SMALLINT	= 3,	-- Set minimum history count to evaluate using simple linear regression
			@STDevSensitivity	SMALLINT	= 6		-- Set a value to increase or decrease the upper and lower threshold
			
	/*-----------------------------------------------------------------------------------------------------------------------------------
		* FLUSH TABLES                    
		* Clear history table of records older than @nDaysHist days
	-----------------------------------------------------------------------------------------------------------------------------------*/
	DELETE
	FROM tests.ETLLogDurationHistory
	WHERE DateRun < DATEADD(DAY, -@nDaysHist, GETDATE()) ; 
	/*-----------------------------------------------------------------------------------------------------------------------------------
		* Get duration for all ETLs
		* Duration will be recorded per ETL, no matter how many times it has been recorded for that day already
	-----------------------------------------------------------------------------------------------------------------------------------*/
	INSERT INTO [tests].[ETLLogDurationHistory] (ETLName,DateRun,TimeRun,Duration,DurationInt,ETLOutcome)
	SELECT	  ETLName	=	name
		 ,DateRun	=	CAST(CAST(CONVERT(DATE, CONVERT(VARCHAR(10), run_date)) AS VARCHAR(10)) AS DATE)
		 ,TimeRun	=	CAST(CAST(CONVERT(VARCHAR(8), DATEADD(SECOND, run_time, ''), 114) AS VARCHAR(10)) AS TIME(0))
		 ,Duration	=	CAST(CONVERT(CHAR(8), DATEADD(SECOND, run_duration, ''), 114)  AS TIME(0))
		 ,DurationInt	=	run_duration
		 ,ETLOutcome	=	CASE 
						WHEN SJH.run_status=0 THEN 'Failed'
						WHEN SJH.run_status=1 THEN 'Succeeded'
						WHEN SJH.run_status=2 THEN 'Retry'
						WHEN SJH.run_status=3 THEN 'Cancelled'
						ELSE 'Unknown'  
						END 
	FROM MSDB.dbo.sysjobhistory	SJH  
	JOIN MSDB.dbo.sysjobs		SJ ON SJH.job_id = SJ.job_id  
	WHERE step_id = 0  
	AND	name NOT IN (
				'collection_set_1_noncached_collect_and_upload'
				,'collection_set_2_upload'
				,'collection_set_3_upload'
				,'mdw_purge_data_[BI_MDW]'
				,'SSIS Server Maintenance Job'
				,'sysutility_get_cache_tables_data_into_aggregate_tables_hourly'
				,'sysutility_get_views_data_into_cache_tables'
				,'syspolicy_purge_history'
				,'sysutility_get_cache_tables_data_into_aggregate_tables_daily'
			    )
	AND	CAST(CONVERT(DATE, CONVERT(VARCHAR(10), run_date)) AS VARCHAR(10)) + ' ' 
		+ CAST(CONVERT(VARCHAR(8), DATEADD(SECOND, run_time, ''), 114) AS VARCHAR(10)) >= DATEADD(D,-1,GetDate())  
	ORDER BY name,run_date,run_time  

	/*-----------------------------------------------------------------------------------------------------------------------------------
		* MAGIC
		* Using Simple Linear Regression, issue an estimation for what the Duration per ETL should be "today"
		* Based on the regression line, define thresholds that will determine an alert
	-----------------------------------------------------------------------------------------------------------------------------------*/
	SELECT	 ETLName
			,X						= DATEDIFF(DAY, MIN(DateRun) OVER (PARTITION BY ETLName), DateRun) + 1
			,DateRun
			,Y						= DurationInt
			,SampleOrdinal			= ROW_NUMBER() OVER (PARTITION BY ETLName ORDER BY DateRun DESC) 
			,MeasurementsPerTable	= COUNT(*) OVER (PARTITION BY ETLName)
	INTO #PreOp
	FROM
	(
		SELECT	 ETLLogDurationHistoryID
				,ETLName
				,DateRun
				,TimeRun
				,Duration
				,DurationInt
				,ETLOutcome
				,RecentPerDay	= ROW_NUMBER() OVER (PARTITION BY ETLName, DateRun ORDER BY TimeRun DESC)
		FROM tests.ETLLogDurationHistory	EDH
		WHERE EDH.DateRun < CAST(GETDATE() AS DATE)
	)DT
	WHERE RecentPerDay = 1

	DECLARE @tXYData XYDataPoints
	INSERT INTO @tXYData(GroupID,X,Y)
	SELECT	 ETLName
			,X
			,Y
	FROM #PreOp
	WHERE SampleOrdinal <= @nSampleMAX AND MeasurementsPerTable >= @nSampleMIN

	/*-----------------------------------------------
	  * Table to use for updating the error log
	-----------------------------------------------*/
	DECLARE @ErrorOutput TABLE	( 
						ETLName VARCHAR(200) NOT NULL
					)

	;WITH Pruned AS
	(
		SELECT	 ETLName
				,X						
				,DateRun
				,Y						
				,SampleOrdinal			
				,MeasurementsPerTable	
		FROM #PreOp pO
		WHERE	EXISTS	(
							SELECT *
							FROM @tXYData	tD
							WHERE tD.GroupID = pO.ETLName	
						)
	)
	,FNResponse AS
	(
		SELECT	 GroupID				
				,Slope				
				,Intercept			
				,StDevError			
		FROM dbo.fn_GetSimpleLinearRegression(@tXYData)
	)
	,TodayXCTE AS
	(
		SELECT	 ETLName
				,X						
				,DateRun
				,Y						
				,SampleOrdinal			
				,MeasurementsPerTable
				,GroupID				
				,Slope				
				,Intercept			
				,StDevError		
				,PredictedY	=	(	Slope * (
												--	TodayX
												DATEDIFF(DAY, MIN(DateRun) OVER (PARTITION BY ETLName), GETDATE()) + 1
											)
								) + Intercept
		FROM Pruned		P
		JOIN FNResponse	F	ON P.ETLName = F.GroupID
	)
	,ThresholdCTE AS
	(
		SELECT DISTINCT
				 ETLName
				,PredictedY
				,Y
				,UpperThreshold	= PredictedY + (@STDevSensitivity * STDevError)
				,LowerThreshold	= PredictedY - (@STDevSensitivity * STDevError)
				,StDevError
		FROM TodayXCTE
		WHERE SampleOrdinal = 1
	)
	,TblETLDurHistCTE AS
	(
		SELECT	 ETLName
				,DateRun
				,TimeRun
				,Duration
				,DurationInt
				,ETLOutcome
				,IsError
				,Sensitivity
				,CurrCntSigma
				,RecentPerDay	= ROW_NUMBER() OVER (PARTITION BY ETLName, DateRun ORDER BY TimeRun DESC)
		FROM tests.ETLLogDurationHistory	EDH
	)
	/*-----------------------------------------------
	  * Update only most recent record in history 
		table
	-----------------------------------------------*/
	UPDATE EDH
	SET  IsError		= 1
		,Sensitivity	= @STDevSensitivity
		,CurrCntSigma	= CEILING(ABS(EDH.DurationInt - ths.PredictedY) / NULLIF(ths.StDevError,0))
	/*-----------------------------------------------
	  * Output updated values to table variable 
		for later use 
	-----------------------------------------------*/
	OUTPUT	INSERTED.ETLName
	INTO	@ErrorOutput

	FROM TblETLDurHistCTE	EDH
	LEFT
	JOIN ThresholdCTE		ths	ON EDH.ETLName = ths.ETLName
	WHERE	EDH.RecentPerDay = 1
		AND EDH.DateRun = CAST(GETDATE() AS DATE)
		AND EDH.IsError IS NULL
		AND (
				(EDH.DurationInt NOT BETWEEN ths.LowerThreshold AND ths.UpperThreshold) 
				OR EDH.DurationInt = 0
			)

	/*-----------------------------------------------
	  * Log entry to failure log
	-----------------------------------------------*/
	INSERT INTO tests.TestFailureLog (TestDate, TestTime, TestCollection, TestName)		
	SELECT	 TestDate			= CAST(GETDATE() AS DATE)
			,TestTime			= CAST(GETDATE() AS TIME)
			,TestCollection		= 'ETLDuration'
			,TestName			= ETLName
	FROM @ErrorOutput
END
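
Once everything is created, running the test is just an EXEC away. Schedule it daily after your ETL window, or run it manually and check the failure log, along these lines:

EXEC tests.tst_ETLDurationAnomaly;

-- Anything flagged today?
SELECT *
FROM tests.TestFailureLog
WHERE TestDate = CAST(GETDATE() AS DATE);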

You can download the script file here.

Hope you are able to find this as useful as I do. Feel free to ask any questions in the comment section below!

Formatting Shortcuts in SSDT for SSIS Packages

When developing SSIS packages the tasks and transformations tend to be misaligned after dragging them onto the designer from the SSIS toolbox. You can correct this by selecting Format > Auto Layout > Diagram.

When you have multiple packages in your SSIS project this can become quite tedious. There is a quicker way: set up a keyboard shortcut to align everything. This is the most useful shortcut I have come across while developing SSIS projects.

To set up the shortcut you can follow the steps below.

Step 1:
Select Tools > Options

Step 2:
Select Keyboard under Environment on the left of the pop-up. Then select Format.Diagram on the right-hand side of the pop-up.

Step 3:
Click in the “Press shortcut keys” box, press the key combination you would like to use for this command, and click OK. I recommend CTRL+F, CTRL+D as it is easy to remember.

Now your shortcut is set up for you to use to align everything perfectly in each package!

Creating a “Run As” Account for Scheduled ETL Jobs

When working in large SQL production environments that have multiple jobs doing different things, you might want the ETL jobs to run under an account other than the SQL Server Agent service account.

This helps with security by limiting how much access the account has to the databases that the packages in the ETLs connect to.

This is quite simple to set up, and I recommend using a Windows AD account. In the example below I demonstrate how to set up an account to run the ETL jobs.

Step 1:
You will need to add the account to Logins under Security on the server where the jobs execute and the ETL projects are deployed.

Step 2:
You will need to create a Credential under Security that links to the account you created. Right-click on Credentials and select New Credential…

Step 3:
Give the Credential a name and select the ellipsis just below it so that you can pick the Windows account you added under Logins in Step 1, then click OK. Type in the password for the Windows AD account and click OK. Now your credential is set up.

Step 4:
Next you will need to add the Credential to the SSIS Package Execution Proxy so that it can execute SSIS packages on the server. Expand the SQL Server Agent node, expand the Proxies node, right-click on SSIS Package Execution and select New Proxy…

Step 5:
Give your Proxy a name. I chose to use the same name as the Windows AD account throughout the entire setup. Then select the ellipsis next to Credential, add the Credential you created in Step 3 and click OK.

You will notice that SQL Server Integration Services Package is checked under “Active to the following subsystems”. This allows the account to execute SSIS packages. You can click OK and the setup is complete.

You will now see the Proxy you created under SSIS Package Execution.

Step 6:
When you create a new job, set the step type to SQL Server Integration Services Package in the Steps section. You will then be able to select the Proxy account you set up from the Run as drop-down, and your job will run as that account.

An important note to remember is that the Windows AD account you are using will need Read/Write access on the databases that are used as connections in the ETL packages.
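
If you prefer to script the same setup instead of clicking through SSMS, the equivalent T-SQL is roughly the following (DOMAIN\ETLRunner and the password are placeholders for your own AD account):

USE [master];

-- Steps 2-3: credential mapped to the Windows AD account
CREATE CREDENTIAL [ETLRunner]
WITH IDENTITY = N'DOMAIN\ETLRunner', SECRET = N'<StrongPasswordHere>';

-- Steps 4-5: proxy over that credential, granted the SSIS subsystem
EXEC msdb.dbo.sp_add_proxy
	 @proxy_name		= N'ETLRunner'
	,@credential_name	= N'ETLRunner'
	,@enabled			= 1;

EXEC msdb.dbo.sp_grant_proxy_to_subsystem
	 @proxy_name	= N'ETLRunner'
	,@subsystem_id	= 11;	-- 11 = SSIS package execution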


Email Job Execution History in HTML Format

Knowing whether your jobs have succeeded or failed can be quite handy, especially when it comes to backups or ETLs that are failing. You can set up email notifications on each job, although if you have multiple servers you could be receiving a lot of email throughout the day.

There is a simple script that emails out the job history for all the jobs executed on a server in one HTML-formatted table. You can set this up on each server and have one mail per server to check every day. The script can be scheduled as a job that executes a T-SQL query.

First things first: you need to make sure you have Database Mail set up on the server with a default account. You can use your company's SMTP server settings, or, if you are doing this for yourself, Gmail (there are many other options besides Gmail).

For this post I shall be setting up the database mail settings with Gmail’s SMTP server details.

Part 1: Setting up database mail
Expand the Management node in Object Explorer. Right-click on Database Mail and select Configure Database Mail.

Click Next on the Wizard start screen; by default the Set up Database Mail radio button will be selected. Click Next again to go through the steps of setting up Database Mail.

Next you will set up a profile name for your database mail. I have chosen “Job Execution mail” to make it relevant to what it will be used for.

An SMTP account will need to be added. You will select Add on the same New Profile window.

Next, complete the mail account details. I have set mine up to use my Gmail account. If you are using your company's SMTP mail server, you will need to find out the details and the port number to use; in my experience, Anonymous authentication works when using a company SMTP server. Once everything is filled in, click OK.

You will see your newly created mail profile. Select Next.

In the next window you manage your profile security. You will see the profile name you just created; set it as the default profile so that sp_send_dbmail uses it when sending out the emails. Then click Next.

In the next window you can configure system parameters. I chose to leave the settings recommended by Microsoft SQL Server and clicked Next.

A completion wizard will appear with a summary of everything that will be created; click Finish.

Once everything has executed you will see green ticks next to each completed step. Finally, click Close, and Database Mail is set up.

Now that Database Mail is set up, you can use the script to email out your job execution history in HTML format.

Part 2: HTML email script
The first section of the script is where you set up your HTML table and place your query for what you want to email out. You can also style the table to have background colours and set padding (you might need to brush up on some CSS).

At the end of the table setup you specify who the mail should be sent to, along with the subject, the body and the body format.

To test, you can run this from a new query window in SSMS. Later, if you choose, you can set it up as a SQL job that emails once a day or as frequently as you prefer.
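
The shape of the script is roughly the following (a sketch, not the exact download; the profile name matches the one created in Part 1, and the recipient address is a placeholder):

DECLARE @Body NVARCHAR(MAX);

SET @Body =
	  N'<html><body><h3>Job Execution History</h3>'
	+ N'<table border="1" cellpadding="4" style="border-collapse:collapse;">'
	+ N'<tr style="background-color:#DDDDDD;"><th>Job</th><th>Run Date</th><th>Outcome</th></tr>'
	+ CAST((
		SELECT	 td = SJ.name, ''
				,td = SJH.run_date, ''
				,td = CASE SJH.run_status WHEN 1 THEN 'Succeeded' WHEN 0 THEN 'Failed' ELSE 'Other' END
		FROM msdb.dbo.sysjobhistory	SJH
		JOIN msdb.dbo.sysjobs		SJ ON SJH.job_id = SJ.job_id
		WHERE SJH.step_id = 0
		AND SJH.run_date >= CONVERT(INT, CONVERT(VARCHAR(8), DATEADD(DAY, -1, GETDATE()), 112))
		ORDER BY SJ.name
		FOR XML PATH('tr'), TYPE
	  ) AS NVARCHAR(MAX))
	+ N'</table></body></html>';

EXEC msdb.dbo.sp_send_dbmail
	 @profile_name	= N'Job Execution mail'
	,@recipients	= N'you@yourcompany.com'
	,@subject		= N'Job Execution History'
	,@body			= @Body
	,@body_format	= N'HTML';
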
You can download the script for the Email Jobs Query here.
