Tuesday, 20 April 2010

Navigating SSIS packages with Linq to Xml

I’m currently writing a validator tool which compares data files against the metadata from a flat file connection manager in an SSIS package and outputs a data quality report. It’s intended to provide a quick way of verifying customer’s data before doing an on-site demonstration. Hopefully this will solve the problem of wasted journeys that I’ve often had due to data being an incorrect format. The problem stems from the fact that it takes several hours for the customer to download the data from their provider and confidentiality precludes it being sent to us over the wire – in addition to which the width and length of data makes it difficult to fix quality issues on the fly.

The only limitation is that the tool cannot make use of the Dts namespaces (because their availability in the deployment environment cannot be guaranteed), which means that some part of the vehicle is going to need re-inventing. Also, some of the data files we’re dealing with have in excess of 1000 columns, so we’d like to be able to simply reflect the column metadata from the existing dtsx packages (rather than replicate it elsewhere). This has led me to the obvious conclusion of issuing XPath statements against the packages so that I can build a list of columns with data type information and such. However this got rather unwieldy, so I though it would be a good time to use LINQ to XML. I’ve avoided this thus far, as it’s very rare that I work with raw XML these days – I’m one of those lazy facists who believes that XML should never be seen outside of standards documents or the message layer (although as with all such ill-founded pronouncements this falls down somewhat with derivatives such as XHTML and XAML). To be honest, it’s swings and roundabouts between using LINQ to XML and the ‘old way’. It’s still full of string literals and anonymous types, but iteration syntax is slightly more elegant.

I won’t reproduce the entire project here – just enough to make it through the basics of reading the package. I may cover the data type matching in another post. The following constant is used throughout the code:

public const string DTS_NAMESPACE = @"{www.microsoft.com/SqlServer/Dts}";

First up is getting a list of the connection managers in the package (which is a PITA because the name is an attributed child node of the <ConnectionManager> element):

var connectionManagers = from cm in packageXml.Root.Elements(
                DTS_NAMESPACE + "ConnectionManager").Elements(DTS_NAMESPACE + "Property")
                        where cm.Attribute(DTS_NAMESPACE + "Name").Value == "ObjectName"
                        select new
                            connectionManagerName = cm.Value,
                            connectionManagerXml = cm.Parent

This gives us the name of the connection manager and the child XML nodes in separate variables. We can then iterate through this (for this example we’re grabbing flat file connections only):

foreach (var connectionManager in connectionManagers)
   XDocument connectionXml = XDocument.Parse(

   List<string> connectionTypes = (from ct in connectionXml.Root.Elements(DTS_NAMESPACE + "Property")
                                   where ct.Attribute(DTS_NAMESPACE + "Name").Value == "CreationName"
                                   select ct.Value.ToString()).ToList<string>();

   if (connectionTypes.Count == 1)
       string connectionType = connectionTypes[0];

       if (connectionType.ToUpper() == "FLATFILE")
           // all the Xml for the connection manager is in the connectionXml variable

We can get at the connection-level properties using the following method:

private string GetConnectionManagerProperty(
   XDocument connectionManagerXml, string propertyName)
   List<string> properties = (from pr in connectionManagerXml.Root.Elements(DtsxPackage.DTS_NAMESPACE + "Property")
                              where pr.Attribute(DtsxPackage.DTS_NAMESPACE + "Name").Value == propertyName
                              select pr.Value.ToString()).ToList<string>();
   Debug.Assert(properties.Count <= 1);

   if (properties.Count == 1)
       return properties[0];

   return string.Empty;

like so:

string unicodeFlag = GetConnectionManagerProperty(connectionManagerXml, "Unicode");

Similarly we can iterate through the columns like so:

var columns = from cl in connectionManagerXml.Root.Descendants(DtsxPackage.DTS_NAMESPACE + "FlatFileColumn")
             select cl;

foreach (var columnProperties in columns)
   var properties = (from pr in columnProperties.Elements(DtsxPackage.DTS_NAMESPACE + "Property")
                     where pr.Attribute(DtsxPackage.DTS_NAMESPACE + "Name").Value == "ObjectName" ||
                     pr.Attribute(DtsxPackage.DTS_NAMESPACE + "Name").Value == "DataType" ||
                     pr.Attribute(DtsxPackage.DTS_NAMESPACE + "Name").Value == "DataPrecision" ||
                     pr.Attribute(DtsxPackage.DTS_NAMESPACE + "Name").Value == "DataScale" ||
                     pr.Attribute(DtsxPackage.DTS_NAMESPACE + "Name").Value == "ColumnDelimiter" ||
                     pr.Attribute(DtsxPackage.DTS_NAMESPACE + "Name").Value == "MaximumWidth"
                     select new PropertyPair()
                         Name = pr.Attribute(DtsxPackage.DTS_NAMESPACE + "Name").Value,
                         Value = pr.Value

It would be nice to be able to deconstruct the a dtsx in a single XPath/XML statement, but the rather strange schema for dtsx appears to preclude that. Hopefully some XmlMagician will show me how much more concisely and simply this could be achieved.

No comments:

Post a Comment