Charteris Community Server


Chris Dickson's Blog

  • Exploring the WCF Named Pipe Binding - Part 3

    In this post I will show one way to restrict access to the named pipe created by the WCF named pipe listener, to provide a partial workaround for the security flaw mentioned in my last post.

    The strategy is to target directly the internal property AllowedUsers on the type System.ServiceModel.Channels.NamedPipeChannelListener. We cannot call this property normally because it is internal to WCF, but reflection allows an alternative way to invoke it. Since this only needs to be done once, when Open is called on the ServiceHost to build the service run-time, the performance cost of using reflection is not an issue here. We will populate this AllowedUsers collection with the SID for a Group representing the authorised users of the service we are protecting, supplied as a parameter of the binding before the service is opened. It turns out we also need to add the SID for the service account itself, for reasons I will explain in more detail below. WCF will then use this collection of SIDs, rather than its default list (EVERYONE), when calling CreateNamedPipe in the PipeConnectionListener.

    After Open has been called on the ServiceHost, WCF builds the server run-time stack. The part of this process which interests us is the point where the channel listener is created by the transport binding element. When using the standard netNamedPipe binding, the relevant transport binding element is of type System.ServiceModel.Channels.NamedPipeTransportBindingElement. We can conveniently amend the configuration of the listener by subclassing NamedPipeTransportBindingElement and overriding the virtual method BuildChannelListener<>(). This allows us to get a reference to the listener after it has been created by the standard WCF transport binding element code, but before BeginAccept() is called on it (which is when the first pipe instance is created).

    Here is some code for a custom named pipe binding which implements this strategy:

    using System;
    using System.Collections.Generic;
    using System.ServiceModel.Channels;
    using System.ServiceModel;
    using System.Reflection;
    using System.Security.Principal;
    using System.Threading;

    namespace Charteris.ChrisDicksonBlog.Samples
    {
        // Custom binding built from the standard NetNamedPipeBinding elements,
        // with the transport element swapped for the ACL-aware subclass below.
        public class AclSecuredNamedPipeBinding : CustomBinding
        {
            public AclSecuredNamedPipeBinding() : base()
            {
                NetNamedPipeBinding standardBinding = new NetNamedPipeBinding(NetNamedPipeSecurityMode.Transport);
                foreach (BindingElement element in standardBinding.CreateBindingElements())
                {
                    NamedPipeTransportBindingElement transportElement = element as NamedPipeTransportBindingElement;
                    base.Elements.Add(
                        null != transportElement
                            ? new AclSecuredNamedPipeTransportBindingElement(transportElement)
                            : element);
                }

                // The service account itself must be allowed (explained later in this post).
                AddUserOrGroup(WindowsIdentity.GetCurrent().User);
            }

            // Adds a user or group SID to the list which will be written to the pipe DACL.
            public void AddUserOrGroup(SecurityIdentifier sid)
            {
                List<SecurityIdentifier> allowedUsers
                    = Elements.Find<AclSecuredNamedPipeTransportBindingElement>().AllowedUsers;

                if (!allowedUsers.Contains(sid))
                {
                    allowedUsers.Add(sid);
                }
            }
        }

        public class AclSecuredNamedPipeTransportBindingElement : NamedPipeTransportBindingElement
        {
            // The listener type is internal to WCF, so it has to be looked up by name.
            private static Type namedPipeChannelListenerType
                = Type.GetType("System.ServiceModel.Channels.NamedPipeChannelListener, System.ServiceModel", false);

            public AclSecuredNamedPipeTransportBindingElement(NamedPipeTransportBindingElement inner) : base(inner)
            {
                // When cloning one of our own elements, copy its SID list too.
                if (inner is AclSecuredNamedPipeTransportBindingElement)
                {
                    _allowedUsers = new List<SecurityIdentifier>(
                        ((AclSecuredNamedPipeTransportBindingElement)inner)._allowedUsers);
                }
            }

            public override BindingElement Clone()
            {
                return new AclSecuredNamedPipeTransportBindingElement(this);
            }

            public override IChannelListener<TChannel> BuildChannelListener<TChannel>(BindingContext context)
            {
                // Let WCF build the listener, then overwrite its internal AllowedUsers
                // property by reflection before the first pipe instance is created.
                IChannelListener<TChannel> listener = base.BuildChannelListener<TChannel>(context);
                PropertyInfo p = namedPipeChannelListenerType.GetProperty(
                    "AllowedUsers", BindingFlags.Instance | BindingFlags.NonPublic);
                p.SetValue(listener, _allowedUsers, null);
                return listener;
            }

            internal List<SecurityIdentifier> AllowedUsers { get { return _allowedUsers; } }
            private List<SecurityIdentifier> _allowedUsers = new List<SecurityIdentifier>();
        }
    }
     

    As it stands, this code allows the SIDs of service users to be added in code but not by means of service configuration. The latter is left as an exercise for the reader, as they say.

    Using this custom binding, we can restrict use of a service endpoint to members of a specific Windows group, by means of code like this in the service host:

    AclSecuredNamedPipeBinding binding = new AclSecuredNamedPipeBinding();

    SecurityIdentifier allowedGroup
         = (SecurityIdentifier)(new NTAccount("NPServiceUsers").Translate(typeof(SecurityIdentifier)));

    binding.AddUserOrGroup(allowedGroup);

    ...

    _serviceHost.AddServiceEndpoint(... , binding, ...);

    ...

    _serviceHost.Open();

    I described this as a partial workaround for the flaw in the default security provided by the standard binding. It is not a full workaround because SIDs which are allowed access to the pipe still have the powerful permission FILE_CREATE_PIPE_INSTANCE, which ideally we would not want anyone other than the service account itself to have.

    I said I would say something about why we need to add the service account itself to the AllowedUsers collection. This relates back to the CREATOR OWNER anomaly in the pipe DACL, which I raised in my last post. You might think (and I suspect one of the WCF developers thought) that this ACE in the DACL would grant the service account the rights it needs to set up the listener and handle client requests arriving on the pipe. This isn't the case, though... it is actually the EVERYONE ACE which enables a service using the standard binding to work correctly.

    Let's look at what happens if we remove the line

              AddUserOrGroup(WindowsIdentity.GetCurrent().User);

    from the constructor of the custom binding, so that the DACL on the pipe just contains the NETWORK deny ACE, an ACE allowing access to our service users' group, and the CREATOR OWNER ACE. In other words, just like the one created by the standard binding, except with our service users' group instead of EVERYONE.

    With this configuration, the service appears to start correctly, but as soon as the first client message hits the pipe, the service host starts to consume CPU cycles uncontrollably (and ultimately has to be killed) and the client never gets any response. Turning on tracing shows that the service is repeatedly trying to create a new pipe instance, and failing with an Access Denied error:

    <E2ETraceEvent xmlns="http://schemas.microsoft.com/2004/06/E2ETraceEvent"><System xmlns="http://schemas.microsoft.com/2004/06/windows/eventlog/system"><EventID>131075</EventID><Type>3</Type><SubType Name="Error">0</SubType><Level>2</Level><TimeCreated SystemTime="2008-05-14T09:47:27.8109616Z" /><Source Name="System.ServiceModel" /><Correlation ActivityID="{905d5b25-0f13-4f25-b3fb-a31d9a69738f}" /><Execution ProcessName="WCFDemoNPServer" ProcessID="5916" ThreadID="3" /><Channel /><Computer>#####</Computer></System><ApplicationData><TraceData><DataItem><TraceRecord xmlns="http://schemas.microsoft.com/2004/10/E2ETraceEvent/TraceRecord" Severity="Error"><TraceIdentifier>http://msdn.microsoft.com/en-GB/library/System.ServiceModel.Diagnostics.ThrowingException.aspx</TraceIdentifier><Description>Throwing an exception.</Description><AppDomain>WCFDemoNPServer.exe</AppDomain>
    <Exception>
    <ExceptionType>System.ServiceModel.AddressAccessDeniedException, System.ServiceModel, Version=3.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089</ExceptionType>
    <Message>
    Cannot listen on pipe 'net.pipe://localhost/WCFDemoNPServer/NPService': Unrecognized error 5 (0x5)
    </Message>
    <StackTrace>  
       at System.ServiceModel.Channels.PipeConnectionListener.CreatePipe()
       at System.ServiceModel.Channels.PipeConnectionListener.BeginAccept(AsyncCallback callback, Object state)
       at System.ServiceModel.Channels.BufferedConnectionListener.BeginAccept(AsyncCallback callback, Object state)
       at System.ServiceModel.Channels.TracingConnectionListener.BeginAccept(AsyncCallback callback, Object state)
       at System.ServiceModel.Channels.ConnectionAcceptor.AcceptIfNecessary(Boolean startAccepting)
       at System.ServiceModel.Channels.ConnectionAcceptor.HandleCompletedAccept(IAsyncResult result)
       at System.ServiceModel.Channels.ConnectionAcceptor.AcceptCompletedCallback(IAsyncResult result)
       at System.ServiceModel.Diagnostics.Utility.AsyncThunk.UnhandledExceptionFrame(IAsyncResult result)
       at System.ServiceModel.AsyncResult.Complete(Boolean completedSynchronously)
       at System.ServiceModel.Channels.PipeConnectionListener.PendingAccept.OnAcceptComplete(Boolean haveResult, Int32 error, Int32 numBytes)
       at System.ServiceModel.Channels.OverlappedContext.CompleteCallback(UInt32 error, UInt32 numBytes, NativeOverlapped* nativeOverlapped)
       at System.ServiceModel.Diagnostics.Utility.IOCompletionThunk.UnhandledExceptionFrame(UInt32 error, UInt32 bytesRead, NativeOverlapped* nativeOverlapped)
       at System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32 errorCode, UInt32 numBytes, NativeOverlapped* pOVERLAP)
    </StackTrace>

    <ExceptionString>System.ServiceModel.AddressAccessDeniedException: Cannot listen on pipe 'net.pipe://localhost/WCFDemoNPServer/NPService': Unrecognized error 5 (0x5) ---&gt; System.IO.PipeException: Cannot listen on pipe 'net.pipe://localhost/WCFDemoNPServer/NPService': Unrecognized error 5 (0x5)
       --- End of inner exception stack trace ---</ExceptionString><InnerException><ExceptionType>System.IO.PipeException, System.ServiceModel, Version=3.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089</ExceptionType><Message>Cannot listen on pipe 'net.pipe://localhost/WCFDemoNPServer/NPService': Unrecognized error 5 (0x5)</Message><StackTrace>   at System.ServiceModel.Channels.PipeConnectionListener.CreatePipe()
       at System.ServiceModel.Channels.PipeConnectionListener.BeginAccept(AsyncCallback callback, Object state)
       at System.ServiceModel.Channels.BufferedConnectionListener.BeginAccept(AsyncCallback callback, Object state)
       at System.ServiceModel.Channels.TracingConnectionListener.BeginAccept(AsyncCallback callback, Object state)
       at System.ServiceModel.Channels.ConnectionAcceptor.AcceptIfNecessary(Boolean startAccepting)
       at System.ServiceModel.Channels.ConnectionAcceptor.HandleCompletedAccept(IAsyncResult result)
       at System.ServiceModel.Channels.ConnectionAcceptor.AcceptCompletedCallback(IAsyncResult result)
       at System.ServiceModel.Diagnostics.Utility.AsyncThunk.UnhandledExceptionFrame(IAsyncResult result)
       at System.ServiceModel.AsyncResult.Complete(Boolean completedSynchronously)
       at System.ServiceModel.Channels.PipeConnectionListener.PendingAccept.OnAcceptComplete(Boolean haveResult, Int32 error, Int32 numBytes)
       at System.ServiceModel.Channels.OverlappedContext.CompleteCallback(UInt32 error, UInt32 numBytes, NativeOverlapped* nativeOverlapped)
       at System.ServiceModel.Diagnostics.Utility.IOCompletionThunk.UnhandledExceptionFrame(UInt32 error, UInt32 bytesRead, NativeOverlapped* nativeOverlapped)
       at System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32 errorCode, UInt32 numBytes, NativeOverlapped* pOVERLAP)
    </StackTrace><ExceptionString>System.IO.PipeException: Cannot listen on pipe 'net.pipe://localhost/WCFDemoNPServer/NPService': Unrecognized error 5 (0x5)</ExceptionString></InnerException>

    </Exception></TraceRecord></DataItem></TraceData></ApplicationData></E2ETraceEvent> 

    In essence, it seems to me that what is happening is that the service succeeds in getting a handle to the first pipe instance, at the time the pipe is created, but because it hasn't granted itself an ACE in the DACL on the pipe, it is locking itself out of obtaining handles to new instances of the pipe, which it needs to do as soon as a client request is received on the first instance. And there is clearly another bug in the IO completion code for the PipeConnectionListener, which causes this exception to recurse rather than faulting the service host.

    So, we have to make the service account itself an AllowedUser, to stop this happening.

    Posted Jun 23 2008, 02:08 PM by chrisdi with 5 comment(s)
  • Exploring the WCF Named Pipe Binding - Part 2

    In my previous post I explained how the named pipe for a WCF NetNamedPipe endpoint is named, and how a client discovers this name in order to connect to the service. This time, I'm looking at the Windows-level security.

    Both the named pipe itself, and the shared memory object used by the server to publish the name of the pipe to clients, are objects which Windows secures with Access Control Lists (ACLs). Let's look at the named pipe itself first of all...

    The ACL set up when WCF creates the named pipe looks like this in SDDL (Security Descriptor Definition Language):

    D:(D;;FA;;;NU)(A;;0x12019f;;;WD)(A;;0x12019f;;;CO)

    The elements of this SDDL translate as follows:

    (D;;FA;;;NU) - Deny Full Access to NETWORK USERS - that is: deny the access rights specified by the access mask FILE_ALL_ACCESS, to any security context having membership of the group with well-known SID S-1-5-2

    (A;;0x12019f;;;WD) - Allow the access rights specified by the access mask 0x0012019f, to EVERYONE (the well-known SID S-1-1-0)

    (A;;0x12019f;;;CO) - Allow the access rights specified by the access mask 0x0012019f, to the well-known SID S-1-3-0 (CREATOR OWNER)

    The first entry enforces the rule that a WCF service endpoint with NetNamedPipe binding can only be accessed by a client process running on the same machine as the service. This is because any logon token created when a user is authenticated over a network protocol has the NETWORK USERS SID S-1-5-2 added to it by the system.
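
    As a rough illustration of how the ACE strings above break down, here is a minimal sketch of a parser which splits a DACL-only SDDL string into its ACEs. The SddlAceSplitter type and its members are hypothetical names invented for this example and are not part of WCF or the .NET Framework. It only handles strings that start with the "D:" marker and makes no attempt to deal with owner, group or SACL sections.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of WCF): splits the ACEs out of a DACL-only
// SDDL string such as "D:(D;;FA;;;NU)(A;;0x12019f;;;WD)(A;;0x12019f;;;CO)".
public static class SddlAceSplitter
{
    public struct Ace
    {
        public string Type;     // "A" = allow, "D" = deny
        public string Rights;   // e.g. "FA" or "0x12019f"
        public string Trustee;  // e.g. "NU", "WD", "CO"
    }

    public static List<Ace> Parse(string sddl)
    {
        if (sddl == null || !sddl.StartsWith("D:", StringComparison.Ordinal))
            throw new ArgumentException("Expected an SDDL string starting with 'D:'.");

        List<Ace> aces = new List<Ace>();
        int pos = 2;
        while (pos < sddl.Length)
        {
            int open = sddl.IndexOf('(', pos);
            if (open < 0) break;
            int close = sddl.IndexOf(')', open);
            if (close < 0) throw new FormatException("Unterminated ACE.");

            // Field order inside each ACE:
            // ace_type;ace_flags;rights;object_guid;inherit_object_guid;account_sid
            string[] fields = sddl.Substring(open + 1, close - open - 1).Split(';');
            if (fields.Length < 6) throw new FormatException("ACE has fewer than six fields.");

            Ace ace = new Ace();
            ace.Type = fields[0];
            ace.Rights = fields[2];
            ace.Trustee = fields[5];
            aces.Add(ace);
            pos = close + 1;
        }
        return aces;
    }
}
```

    Calling SddlAceSplitter.Parse on the DACL shown above should yield three entries: a deny ACE for NU, followed by allow ACEs for WD and CO.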

    The second ACE allows any authenticated user which is not a network logon to have the specified access to the named pipe. The access mask 0x0012019f corresponds to the following access rights:

    0x00100000 - SYNCHRONIZE

    0x00020000 - READ_CONTROL

    0x00000100 - FILE_WRITE_ATTRIBUTES

    0x00000080 - FILE_READ_ATTRIBUTES

    0x00000010 - FILE_WRITE_EA

    0x00000008 - FILE_READ_EA

    0x00000004 - FILE_CREATE_PIPE_INSTANCE

    0x00000002 - FILE_WRITE_DATA

    0x00000001 - FILE_READ_DATA

    More on this in a moment.
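
    The breakdown of the mask can be checked mechanically. The following is a small sketch, using a hypothetical PipeAccessMask helper (my own name, not a WCF or Windows API), which tests each of the bits listed above and reports the names of those which are set. The bit values are the ones given in the list.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: decodes a pipe access mask into the right names
// listed in the post. Not part of WCF or the .NET Framework.
public static class PipeAccessMask
{
    private static readonly uint[] Bits =
    {
        0x00100000, 0x00020000, 0x00000100, 0x00000080,
        0x00000010, 0x00000008, 0x00000004, 0x00000002, 0x00000001
    };

    private static readonly string[] Names =
    {
        "SYNCHRONIZE", "READ_CONTROL", "FILE_WRITE_ATTRIBUTES", "FILE_READ_ATTRIBUTES",
        "FILE_WRITE_EA", "FILE_READ_EA", "FILE_CREATE_PIPE_INSTANCE",
        "FILE_WRITE_DATA", "FILE_READ_DATA"
    };

    public static List<string> Decode(uint mask)
    {
        List<string> result = new List<string>();
        uint remaining = mask;
        for (int i = 0; i < Bits.Length; i++)
        {
            if ((mask & Bits[i]) != 0)
            {
                result.Add(Names[i]);
                remaining &= ~Bits[i];
            }
        }

        // Report any bits which are not in the table rather than silently dropping them.
        if (remaining != 0)
        {
            result.Add("UNKNOWN(0x" + remaining.ToString("x8") + ")");
        }
        return result;
    }
}
```

    Passing 0x0012019f to PipeAccessMask.Decode should return all nine names with no unknown bits left over, which confirms that the list accounts for the whole mask.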

    The third ACE looks a bit odd to me. My understanding is that CREATOR OWNER is a placeholder SID which is really only relevant when a new security descriptor is being created for a new object using an existing descriptor as the pattern: if the template descriptor contains ACEs for the CREATOR OWNER SID, the corresponding ACEs in the security descriptor created for the new object have the SID for the principal which created the object. No logon token actually contains the CREATOR OWNER SID, as far as I know. Now, when an access check is being done against an ACL-protected object, only the ACEs which match a SID in the logon token are relevant to granting or denying permission. If I'm right that no logon token is ever going to contain the CREATOR OWNER SID, then this third ACE on the pipe's DACL will never have any function in an access check performed when a handle to the pipe is acquired. I suspect that the intention of the WCF developers was that this ACE would provide the access permissions for the service process whose channel listener created the pipe: but it doesn't do this, as I will demonstrate in a subsequent post.

    For the remainder of this post, let's focus on that second ACE, which grants permissions to the EVERYONE group. Did you raise an eyebrow at that FILE_CREATE_PIPE_INSTANCE permission? Do we really want EVERYONE to have permission to create an instance of the service's named pipe? No, we certainly do not! This is a bug in WCF which opens a serious security vulnerability.

    The problem is that any code at all, which is able to execute on the machine where the service lives, can call the Win32 API CreateNamedPipe with appropriate arguments and get a valid server-side handle to an instance of the WCF service's named pipe. It can then call ConnectNamedPipe, whereupon it will be in direct competition with the actual service for incoming client connections to the service. Sooner or later some unsuspecting client trying to send a request to the service will be allocated to the instance of the pipe "owned" by the rogue process rather than one owned by the service.

    At best, the client's request to the service will just fail. But the rogue process might also read the data in the client's request; use the client's credentials by calling ImpersonateNamedPipeClient; or possibly return spoof response data to the client.

    We really need to do something about this, but what? Can we control the DACL which gets put on the pipe, when the service runtime is created? Let's deconstruct exactly where this happens...

    The DACL applied to a named pipe is determined by the lpSecurityAttributes argument passed to Windows when CreateNamedPipe is first called:

    HANDLE WINAPI CreateNamedPipe(
      __in      LPCTSTR lpName,
      __in      DWORD dwOpenMode,
      __in      DWORD dwPipeMode,
      __in      DWORD nMaxInstances,
      __in      DWORD nOutBufferSize,
      __in      DWORD nInBufferSize,
      __in      DWORD nDefaultTimeOut,
      __in_opt  LPSECURITY_ATTRIBUTES lpSecurityAttributes
    );
    

    In WCF, this function is declared in System.ServiceModel.Channels.UnsafeNativeMethods, and is called by the private method CreatePipe() of System.ServiceModel.Channels.PipeConnectionListener, which is the implementation of IConnectionListener used by the service channel stack of the netNamedPipe binding. CreatePipe() is invoked when IConnectionListener.BeginAccept() is called by the service runtime. Our old friend Reflector shows us that the lpSecurityAttributes argument for CreateNamedPipe() is constructed in the PipeConnectionListener.CreatePipe method, using a hard-coded constant -1073741824, and a private member field, allowedSids, of type List<SecurityIdentifier>.

    That constant, -1073741824, is just 0xC0000000 interpreted as a signed 32-bit integer, which is the value of GENERIC_READ|GENERIC_WRITE (defined in WinNT.h). This specifies the access mask which is granted to each of the allowed SIDs. Generic access masks are translated by Windows into the corresponding standard and specific access mask bits applicable to the type of object being secured: in this case, the translated mask is the 0x0012019f we saw in the pipe DACL actually created.
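
    The arithmetic here is easy to verify. The sketch below uses a hypothetical GenericAccess class of my own; the constant values are the ones defined in WinNT.h. It assumes that the generic rights are mapped using the standard file-object mapping (FILE_GENERIC_READ and FILE_GENERIC_WRITE), which is consistent with the mask observed on the pipe.

```csharp
using System;

// Hypothetical helper for checking the constants discussed above.
// The values are those defined in WinNT.h.
public static class GenericAccess
{
    public const uint GENERIC_READ = 0x80000000;
    public const uint GENERIC_WRITE = 0x40000000;

    // File-specific equivalents of the generic rights (assumed mapping for pipes).
    public const uint FILE_GENERIC_READ = 0x00120089;
    public const uint FILE_GENERIC_WRITE = 0x00120116;

    // GENERIC_READ|GENERIC_WRITE reinterpreted as a signed 32-bit integer,
    // which is how the constant appears in decompiled code.
    public static int AsSignedConstant()
    {
        return unchecked((int)(GENERIC_READ | GENERIC_WRITE));
    }

    // The specific mask obtained once both generic rights have been mapped.
    public static uint MappedFileMask()
    {
        return FILE_GENERIC_READ | FILE_GENERIC_WRITE;
    }
}
```

    GenericAccess.AsSignedConstant() gives -1073741824 and GenericAccess.MappedFileMask() gives 0x0012019f, matching the constant seen in Reflector and the mask seen in the DACL respectively.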

    The list of allowed SIDs for the PipeConnectionListener is supplied in its constructor. If we look at the NamedPipeTransportBindingElement which defines how the transport channel is built for the netNamedPipe binding, we see that it too has a private List<SecurityIdentifier> field, called allowedUsers, and a corresponding internal property, AllowedUsers. So it looks as though the original intention of the WCF design was that the binding should define a set of SIDs which were to be allowed to access the pipe, and each one would get GENERIC_READ|GENERIC_WRITE access to the pipe. If this worked, it would not solve the problem that the DACL gives away FILE_CREATE_PIPE_INSTANCE rights to the pipe, but at least it would restrict access (including for that particular right) to a group of SIDs which the service configuration could control. This would be a big improvement on giving the right away to EVERYONE, even if it does not completely solve the problem.

    Unfortunately, the plumbing does not appear to be all there in the WCF bits to make this work: the allowedUsers in the binding element is not hooked up to the allowedSids of the PipeConnectionListener when the service runtime is built. In my next post, we'll look at ways to get round this.

    Posted Jun 16 2008, 07:04 PM by chrisdi with 2 comment(s)
  • Exploring the WCF Named Pipe Binding - Part 1

    This is the first in a series of posts in which I will aim to explain some details of the named pipe binding provided by Windows Communication Foundation (WCF), discovered during the course of some exploring I have been doing. My motivations for looking into this were:

    1. The standard binding (NetNamedPipeBinding) exposes very few properties relating to configuration of the underlying transport mechanism. Having some awareness of the Windows named pipe APIs from previous work, I was interested to understand how the WCF binding mapped to the underlying transport protocol; which named pipe configuration options were "baked into" the WCF implementation and which might be controlled/tweaked with a bit of work in the channel stack.
    2. I wanted to understand in more detail the security characteristics of the binding.
    3. I was just nosey :-) 

    In this post I will start by looking at how the named pipe used by a service endpoint with the NetNamedPipe binding is created, and how clients locate it in order to connect.

    I had expected that if I looked into the service process using a tool like Process Explorer, I would see it holding a handle to a named pipe with a name closely related to the URI of the endpoint. What I see instead is a handle to a pipe named something like...

    \\.\pipe\197ad019-6e5f-48cb-8f88-02ae11dfd8c0

    ... clearly the pipe name has been created using a GUID. I also note that the name of the pipe changes each time I stop and restart my service host, so the GUID is being regenerated each time the endpoint runtime is built by WCF.

    How then does a client of the service know how to communicate with the endpoint? Somehow it must be able to resolve the well-known URI for the endpoint into whatever is the current name of the pipe it must use to send messages to the service. It turns out that this is accomplished using what amounts to a mini metadata publishing mechanism which is exclusive to the NetNamedPipe binding. This mechanism is based on a named Windows file mapping object backed by the system paging file. It is the name of this object which is invariant, and directly derived from the endpoint URI... though in a far from obvious way.

    So in order to locate the correct pipe, a client of a WCF NetNamedPipe service endpoint has to:

    • know that the special metadata mechanism exists
    • know how to derive from the endpoint URI the name of the file mapping object through which the metadata is published
    • locate the file mapping object and use it to open a view on the shared memory
    • know how to interpret the metadata stored in the shared memory, and translate it into the name of the pipe currently being used by the endpoint

    Some more details for those who are interested:

    Deriving the file mapping object name from the URI

    The shared memory file mapping object created by the service endpoint listener (System.ServiceModel.Channels.PipeConnectionListener) has a name which looks something like this:

    net.pipe:EbmV0LnBpnGU6Ly9rL1dDRkRFTU9OUF1g6e9cUFNFUlZJQ0Uv

    This is derived from the following components:

    ["net.pipe"] [:E|:H] [base-64 encoded byte[] X]

    Where the X is constructed as:

        - when the second component is :E   :    UTF8 encoding of ["net.pipe://"] [URI hostname-or-wildcard*] [URI path or parent path]

        - when the second component is :H   :    the SHA-1 hash of the above (used when the UTF8 encoding of the above exceeds 127 bytes)

    *The URI hostname-or-wildcard depends on the HostNameComparisonMode setting for the endpoint's transport binding - this property is set to HostNameComparisonMode.StrongWildcard in the standard NetNamedPipeBinding, and is not exposed as a property of the binding itself. This means that this component of the name will be "+" (the strong wildcard symbol) unless a custom binding has been used to tweak the HostNameComparisonMode property of the transport binding element.
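
    The naming scheme just described can be sketched in a few lines. This is only an illustration of the description above, not the WCF implementation: PipeSharedMemoryName is a hypothetical name, and any normalisation which WCF applies to the URI before encoding it (casing or trailing slashes, for example) is assumed to have been done by the caller.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical sketch of the naming scheme described in the post.
// The caller is assumed to pass an already-normalised host and path.
public static class PipeSharedMemoryName
{
    public static string Build(string hostOrWildcard, string path)
    {
        string canonical = "net.pipe://" + hostOrWildcard + path;
        byte[] bytes = Encoding.UTF8.GetBytes(canonical);
        string marker = ":E";

        // Long names are replaced by their SHA-1 hash, flagged with ":H".
        if (bytes.Length > 127)
        {
            using (SHA1 sha1 = SHA1.Create())
            {
                bytes = sha1.ComputeHash(bytes);
            }
            marker = ":H";
        }

        return "net.pipe" + marker + Convert.ToBase64String(bytes);
    }
}
```

    For example, PipeSharedMemoryName.Build("+", "/SomeService/") produces a name beginning "net.pipe:E", where the path is just a made-up value. Decoding the base-64 portion of such a name recovers the string which was encoded, which is a handy way to inspect a name seen in Process Explorer.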

    Data stored by the service in the shared memory object

    The service stores 20 bytes of data in the shared memory, representing an instance of the structure System.ServiceModel.Channels.PipeSharedMemory+SharedMemoryContents, which looks like this...

    [StructLayout(LayoutKind.Sequential)]
    struct SharedMemoryContents
    {
        public bool isInitialized;
        public Guid pipeGuid;
    }

    The client uses the GUID stored in this object to construct the pipe name through which to connect to the service endpoint.
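
    To make that last step concrete, here is a minimal sketch of turning the 20 bytes of shared memory into a pipe name. It assumes the layout implied by the struct above, with the bool occupying the first 4 bytes and the GUID the remaining 16, and it assumes the pipe name is simply the \\.\pipe\ prefix followed by the GUID in its default string form, as in the example seen earlier. SharedMemoryReader is a hypothetical name, and the code works on a byte array rather than opening the file mapping itself.

```csharp
using System;

// Hypothetical sketch: interprets a copy of the shared memory contents.
// Assumed layout: 4-byte initialised flag followed by a 16-byte GUID.
public static class SharedMemoryReader
{
    public static string PipeNameFrom(byte[] contents)
    {
        if (contents == null || contents.Length < 20)
            throw new ArgumentException("Expected at least 20 bytes of shared memory contents.");

        bool isInitialized = BitConverter.ToInt32(contents, 0) != 0;
        if (!isInitialized)
        {
            return null; // the service has not published a pipe name yet
        }

        byte[] guidBytes = new byte[16];
        Array.Copy(contents, 4, guidBytes, 0, 16);

        // Default Guid formatting gives the lower-case hyphenated form.
        return @"\\.\pipe\" + new Guid(guidBytes).ToString();
    }
}
```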

    Of course, the WCF client stack knows how to jump through these hoops, as it uses the same set of System.ServiceModel types as the service used to set up the mechanism. So you don't really need to know anything about all this if your service client is also a WCF application using the standard binding... which it will be if you are doing things as the WCF designers intended: the named pipe binding was designed solely for WCF-to-WCF scenarios. 

    That's not to say that, in principle, there is any fundamental reason why a named pipe binding to a WCF service should not be able to support any arbitrary client implementation which knows how to write messages to and read messages from a named pipe. Perhaps there are integration scenarios involving legacy unmanaged code or mixed technologies on a single box, where a more open named pipe binding might be useful, not least because the underlying transport mechanism is very fast. But the standard NetNamedPipe binding won't help with this. In practice, it is going to be much easier to use one of the bindings based on standard interoperable protocols, or to provide a COM wrapper around a WCF client implementation.

    Posted May 19 2008, 03:18 PM by chrisdi with 4 comment(s)
  • .NET Framework 2.0 KB928365 Patch problems

    The build process for the project I am currently working on just got hosed by Microsoft Security Update for Microsoft .NET Framework 2.0 (KB928365). A hundred-odd projects which had been building correctly for weeks or months suddenly started to error during solution build, with post-build event failures complaining of file paths not found.

    The failing post-build events all contained constructs like:

    msbuild /v:m "$(ProjectDir)\..\Common\postbuild.proj" ...

    The error was happening because these events were now being executed as though they read

    msbuild /v:m "<project folder>\Common\postbuild.proj" ...

    i.e. looking for the Common folder as a child of the project folder rather than its sibling.

    The eagle-eyed may have spotted that the expression in our project file generates a superfluous backslash after the project folder, because the $(ProjectDir) macro is expanded by VS to include a trailing backslash.

    It turns out that this security patch changes behaviour at the operating system level concerning the interpretation of file system paths containing duplicated backslash characters. Whereas the OS was previously forgiving of duplicates, treating them just as single backslash characters, the new behaviour somewhat bizarrely treats '\\..\' as though it were just '\' .
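
    In the project file itself the obvious remedy is presumably just to drop the extra backslash after $(ProjectDir). Where paths like this are assembled in code, the doubled separator can be avoided by trimming before joining. The sketch below shows the idea; PostBuildPath is a hypothetical helper and the directory names used with it are made up.

```csharp
using System;

// Hypothetical helper: joins a directory and a relative path with exactly one
// backslash between them, whether or not the directory has a trailing one.
public static class PostBuildPath
{
    public static string Join(string directory, string relative)
    {
        return directory.TrimEnd('\\') + "\\" + relative.TrimStart('\\');
    }
}
```

    With this, joining C:\src\MyProject\ and ..\Common\postbuild.proj gives C:\src\MyProject\..\Common\postbuild.proj, with no doubled backslash for the patched behaviour to trip over.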

    I'm wondering whether this change in behaviour is intentional, and somehow related to a security issue, or whether it is an unintentional side effect of something MS have patched. I'm also wondering how many latent defects there are out there in deployed applications, which are going to be exposed by this.  

     

  • BizTalk starts to morph...?

    Something big happened last week: the birth of a new concept, the Internet Service Bus, or as Microsoft's Connected Systems Division has christened it, BizTalk Services. Read, for example, Dennis Pilarinos on what this is, and Clemens Vasters on why it's important.

    Besides being an extremely interesting initiative because of what it is, I find the choice of name quite intriguing also... is this the first public evidence of a Redmond strategy to morph BizTalk from a product to a brand; from a messaging application platform to a marketing smorgasbord of services and tools for connecting distributed systems?

    I think it's quite likely: the advent of WCF and WF has always appeared to me as the writing on the wall for BizTalk as a single monolithic product, which the absorption of the BizTalk product group into the CSD seemed to confirm.

    Posted May 01 2007, 05:25 PM by chrisdi with no comments
  • Throttling BizTalk service instances

    Most of the time, performance tuning of our software is about speeding things up... why would you ever want to slow things down? Well, as it happens, there are scenarios when working with BizTalk Server where slowing things down a bit is a desirable goal.

    A common one in BizTalk Server 2004 arises when using the SOAP send adapter which comes in the box to make calls to SOAP web services from within orchestrations. The SOAP send adapter uses the standard CLR thread pool to dispatch SOAP requests across the network, and process the responses received from the web service. Unfortunately, if you drive it too hard, at peak throughput you will start to see the send adapter logging exceptions like this and suspending the request messages for retry:

    Event ID 5740 - The adapter "SOAP" raised an error message. Details "There were not enough free threads in the ThreadPool object to complete the operation.

    This is actually a symptom of a problem in the ASP.NET HTTP stack rather than in the BizTalk product itself. The issue is explained in this Microsoft Knowledge Base article: briefly, each SOAP request uses a worker thread to make the request, an IO completion thread to service the response, and additional threads to perform authentication handshakes; if the available thread pool threads are tied up on requests and there are no free threads which can be used to service responses, these errors occur. Changes were made in BizTalk Server 2004 SP1 to alleviate this problem, and the throughput limit before the problem manifests can be increased by careful tuning of the CLR thread pool parameters as described in the KB article. The problem doesn't go away completely, however. In a typical BizTalk 2004 installation with optimal tuning of the CLR parameters, spikes in throughput of more than about 150 SOAP requests per second will encounter this issue. (I'm told that the situation is much better in BizTalk Server 2006 - the ASP.NET 2.0 stack was rewritten to considerably reduce the possibility of thread pool starvation - but I do not have first-hand experience or confirmation of this).

    Unfortunately, even if your normal peak throughput is considerably less than this limit, spikes can still occur due to temporary outages of one sort or another. BizTalk's automatic retry functionality then serves to exacerbate the problem, because the retry interval is fixed so messages which previously failed close together in time are resubmitted in a spike. Also, when the problem starts to occur, thread pool threads affected become tied up for a fairly lengthy timeout period (100 seconds, I think) before being freed up, so the problem tends to escalate. If you need high throughput rates it is highly desirable that you avoid the problem ever occurring.

    To this end, what we would like to be able to do is smooth out any spikes in the rate at which our orchestrations feed web service requests to the SOAP send adapter. Unfortunately, the knobs provided by BizTalk 2004 for controlling the work rate of orchestration hosts are much too blunt an instrument for doing this effectively - host throttling parameters are global to the BizTalk Group in BizTalk 2004. A different approach which I have used with some success is a fairly direct regulation of the rate at which the orchestrations execute the Send shape to the Web Port. I use a C# helper type, which I have called ExecutionBrake, which is invoked by the orchestration just before it initiates the SOAP request, and which acts as a governor controlling the peak rate at which concurrent instances can execute within that host instance.

    The idea is to use thread synchronisation primitives within this helper type to identify when the rate of executing requests is approaching the desired limit, and apply as light a touch as possible to delay execution of just enough threads to keep the rate from peaking above the limit. Unless there is a sustained spike, short waits using System.Threading.Thread.Sleep() are sufficient. If a sustained load in excess of the limit is experienced, a fallback method is used, whereby the helper type indicates to the orchestration that it should enter a longer delay loop using a Delay shape.

    The following sample code should illustrate the idea:

    using System;
    using System.Collections;
    using System.Threading;

    namespace Charteris.ChrisDicksonBlog.Samples
    {
        public class ExecutionBrake
        {
            /// <summary>
            /// Maintains a register of the named instances which are active in the
            /// current AppDomain. For any unique name, the braking is implemented by
            /// an internal Singleton object.
            /// </summary>
            /// <param name="uniqueName">Unique brake name to register</param>
            private static void RegisterBrake(string uniqueName)
            {
                // Double-checked locking: Hashtable supports concurrent readers
                // alongside a single writer, and writers are serialised by the lock
                if (!_brakingImplementations.ContainsKey(uniqueName))
                {
                    lock (_sync)
                    {
                        if (!_brakingImplementations.ContainsKey(uniqueName))
                        {
                            ExecutionBrakeImpl impl = new ExecutionBrakeImpl();
                            Thread.MemoryBarrier();
                            _brakingImplementations.Add(uniqueName, impl);
                        }
                    }
                }
            }

            /// <summary>
            /// Collection of singleton implementation objects, keyed on unique name
            /// </summary>
            private static Hashtable _brakingImplementations = new Hashtable();

            private static object _sync = new object();

            public ExecutionBrake(string uniqueName)
            {
                _uniqueName = uniqueName;
                RegisterBrake(uniqueName);
            }

            /// <summary>
            /// Key method called by the orchestration. If the return value is zero, the orchestration
            /// continues to make the SOAP request. If non-zero, the orchestration should loop via a
            /// Delay shape and call this method again before proceeding. The return value can be used
            /// to seed the Delay shape's configuration, so that retries are spread randomly.
            /// </summary>
            public int ThrottleExecution()
            {
                return ((ExecutionBrakeImpl)_brakingImplementations[_uniqueName]).ThrottleExecution();
            }

            private string _uniqueName;

            private class ExecutionBrakeImpl
            {
                public int ThrottleExecution()
                {
                    // Maintain a count of threads currently executing this method. The corresponding
                    // decrement is in the finally block
                    int threadsInThisMethod = Interlocked.Increment(ref _threadsUnderControlCount);
                    try
                    {
                        if (threadsInThisMethod >= _deferThreshold)
                        {
                            // We have more than enough threads already so defer this one immediately
                            return NextDeferralHint();
                        }

                        int numberOfSleeps = 0;
                        if (_threadsReleasedThisIntervalCount >= _brakingThreshold)
                        {
                            // We have already reached the limit of threads which can be released in
                            // the current reference interval, so this thread must wait
                            ++numberOfSleeps;
                            Thread.Sleep(NextSleepDuration());
                        }

                        // Keep checking for a release window, then sleeping, alternately until this thread has
                        // either been released or has used the maximum number of sleeps
                        while (numberOfSleeps < _maximumThreadSleeps)
                        {
                            lock (_sync)
                            {
                                // If the reference period has ended, we can start a new one and
                                // be the first thread released in the new period
                                if (0 > DateTime.Compare(_endOfControlInterval, DateTime.Now))
                                {
                                    _endOfControlInterval = DateTime.Now + _thresholdInterval;
                                    _threadsReleasedThisIntervalCount = 1;
                                    return 0;
                                }

                                // Otherwise we can go if the count for the current interval hasn't been exceeded
                                if (_threadsReleasedThisIntervalCount < _brakingThreshold)
                                {
                                    ++_threadsReleasedThisIntervalCount;
                                    return 0;
                                }

                                // Otherwise we'll need to sleep and loop again
                            }
                            ++numberOfSleeps;
                            Thread.Sleep(NextSleepDuration());
                        }
                    }
                    finally
                    {
                        Interlocked.Decrement(ref _threadsUnderControlCount);
                    }

                    // We were not able to release the thread, so return a non-zero deferral hint
                    return NextDeferralHint();
                }

                // System.Random is not thread-safe, so calls to it are serialised.
                // The hint is offset by one so that it is never zero, because a zero
                // return value would tell the orchestration to proceed.
                private int NextDeferralHint()
                {
                    lock (_random)
                    {
                        return 1 + _random.Next(_maximumDeferralDurationHint);
                    }
                }

                private int NextSleepDuration()
                {
                    lock (_random)
                    {
                        return _random.Next(_maximumThreadSleepDuration);
                    }
                }

                private DateTime _endOfControlInterval = DateTime.Now;
                private int _threadsUnderControlCount;
                private int _threadsReleasedThisIntervalCount;
                private object _sync = new object();
                private Random _random = new Random();

                // Configuration parameters.
                // For the purposes of this sample these are constants, but
                // in practice they would need some configuration mechanism
                // to tune the braking for any particular named brake.
                private const int _brakingThreshold = 100;
                private const int _deferThreshold = 200;
                private TimeSpan _thresholdInterval = new TimeSpan(0, 0, 0, 1);
                private const int _maximumThreadSleepDuration = 150;
                private const int _maximumThreadSleeps = 3;
                private const int _maximumDeferralDurationHint = 500;
            }
        }
    }

    Naturally, if there are multiple host instances executing the orchestration, this braking mechanism smooths the rate of execution of each one independently, and the parameters need to be configured with this in mind.

    Don't expect such a mechanism to enable very precise regulation of execution rates, particularly with multiple host instances, but it can be used effectively to prevent abnormal spikes in message volumes causing the sort of problems described above with the SOAP adapter.
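
    For anyone who would like to see the calling pattern outside the Orchestration Designer, the loop which the orchestration builds around the brake can be approximated in plain C#. This is only a rough sketch under stated assumptions: the ThrottledCaller type and its Execute method are hypothetical names introduced for illustration, the Func<int> delegate stands in for ExecutionBrake.ThrottleExecution(), and Thread.Sleep stands in for the Delay shape.

```csharp
using System;
using System.Threading;

// Hypothetical helper, not part of the original sample: it mimics in plain C#
// the loop an orchestration would build around ExecutionBrake.
public static class ThrottledCaller
{
    // 'throttle' stands in for ExecutionBrake.ThrottleExecution(): it returns
    // zero when the caller may proceed, or a non-zero deferral hint otherwise.
    // The hint is treated here as a delay in milliseconds (an assumption).
    // Returns the number of deferrals taken before the request was made.
    public static int Execute(Func<int> throttle, Action sendRequest)
    {
        int deferrals = 0;
        int delayHint = throttle();
        while (delayHint != 0)
        {
            // In the orchestration this would be a Delay shape seeded with the hint
            Thread.Sleep(delayHint);
            ++deferrals;
            delayHint = throttle();
        }

        // Released by the brake, so the SOAP request can now be made
        sendRequest();
        return deferrals;
    }
}
```

    With the sample class above, a caller might pass brake.ThrottleExecution as the first argument, where brake is an ExecutionBrake constructed with whatever unique name identifies the Web Port being protected.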

    Posted Mar 06 2007, 01:56 PM by chrisdi with 3 comment(s)
  • Windows Workflow: what it is and isn't

    I have been meaning for some time to post about my enthusiasm for Dharma Shukla and Bob Schmidt's wonderful book Essential Windows Workflow Foundation. In the best tradition of "Essential ..." titles started by Don Box's book on COM, they explain what WF is for, and how it is built, in a clear and thorough way. The emphasis is on explanation of the principles and the abstractions which make up the framework, at a good level of technical detail but in an easy to follow way. Quite a contrast with the MSDN documentation for WF, it has to be said.

    The first chapter by itself is a masterpiece: not a single line of WF code, but a methodical deconstruction of a simple "Hello, World!" console application which along the way elucidates brilliantly all the reasons why WF exists.

    This is so, so much more helpful than all those courses, presentations and magazine articles which start and end with the Visual Studio Workflow Designer, and foster the widespread misconception that WF is all about programming with pictures. This book places discussion of graphical designers where it deserves to be, in a few short sections at the end of the last chapter ("Miscellanea"), after a comprehensive demonstration, over the previous 7 chapters, of what can be achieved with WF without once invoking the VS designer.

    I would urge anyone who found the last paragraph surprising or controversial to get and read a copy of the book as soon as possible. Because WF is NOT about drawing flowcharts and turning them into programs: it is about the efficient implementation of distributed systems for executing long-lived processes (perhaps very many concurrent instances of each process) which may spend much of their time in a passive state waiting for some external stimulus to reawaken them to execute their next step.

    Posted Jan 30 2007, 05:50 PM by chrisdi with no comments
  • BizTalk Orchestration Exception Handling: What's changed?

    One thing which has changed significantly between BizTalk Server 2004 and BizTalk Server 2006 is the way that the product behaves when unhandled exceptions occur during execution of an orchestration.

    In BizTalk Server 2004, the model is brutally straightforward: any exception which is not caught and handled in an exception handler block suspends the orchestration instance in a state (Suspended (Not Resumable)) from which it cannot be resurrected, and BizTalk's XLANG/s engine logs an error event in the Application event log looking like this:

    Event Type: Error
    Event Source: XLANG/s
    Event Category: None
    Event ID: 10034
    Date: 21/11/2006
    Time: 15:56:20
    User: N/A
    Computer: CHRISDI-VM1
    Description:
    Uncaught exception terminated service BTSDefaultExceptions.BizTalk_Orchestration1(b08ac914-74bc-6032-8029-29cd7aeb5541), instance 497a22a5-0e93-4cc4-8fd8-43499e58e3ea

    Invalid data: Some text
    Exception type: ApplicationException
    Source: ExceptionGenerator
    Target Site: Void DoNotALot(System.String)
    Help Link:
    Additional error information:

    For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.

    This behaviour creates some major problems for applications:

    1. it is often impossible to tell from the limited information in the event log entry what caused the failure
    2. the orchestration state cannot be retrieved
    3. messages associated with the failed orchestration can only be recovered by using HAT to dump them out to a file, or by implementing special recovery code using WMI
    4. the suspended service instances remain, clogging up the MessageBox, until they are purged by administrative action

    As a result, it is imperative when developing orchestrations in BizTalk Server 2004:

    • to adopt a standard pattern which provides a last-resort exception handler: that is, a scope containing catch blocks for both .NET exceptions of type System.Exception and General Exceptions, in which a more informative description of the exception context can be constructed and used as the Message property of a new exception which is then rethrown; this mitigates the first issue above, as the richer context information will then appear in the XLANG/s event log entry.
    • to implement recovery logic explicitly for any exception types which can be anticipated. This logic might comprise:
      • compensation of work already committed within atomic scopes;
      • retry loops using the Suspend shape (e.g. to handle temporary resource outages such as network connectivity failures);
      • business process level error handling (e.g. sending a response message containing an error status following a data validation exception);
      • graceful termination of the orchestration with appropriate logging, in the event of unrecoverable exceptions.
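
    The last-resort handler described in the first bullet above might delegate the construction of the richer message to a small .NET helper called from the catch block's Expression shape. The following is a sketch under stated assumptions rather than a prescribed implementation: the OrchestrationExceptionContext type, its Enrich method and its parameters are hypothetical names, not BizTalk APIs, and the orchestration is assumed to pass in whatever identifying context it has to hand. The General Exception catch block has no exception object available, so the helper accepts null for that case.

```csharp
using System;
using System.Text;

// Hypothetical helper, not a BizTalk API: builds a more informative exception
// for a last-resort catch block to rethrow, so that the extra context shows
// up in the XLANG/s event log entry.
public static class OrchestrationExceptionContext
{
    // 'original' may be null when called from a General Exception catch block,
    // where no .NET exception object is available.
    public static ApplicationException Enrich(
        Exception original, string orchestrationName, string stepName, string correlationId)
    {
        StringBuilder message = new StringBuilder();
        message.AppendFormat("Unhandled exception in orchestration '{0}' at step '{1}'",
            orchestrationName, stepName);
        message.AppendFormat(" (correlation id: {0}). ", correlationId);

        if (original == null)
        {
            message.Append("Non-.NET exception: no further details available");
        }
        else
        {
            message.AppendFormat("{0}: {1}", original.GetType().FullName, original.Message);
        }

        // Preserve the original exception as the inner exception
        return new ApplicationException(message.ToString(), original);
    }
}
```

    The catch block would then throw the returned exception, so that the default XLANG/s handling logs the enriched message instead of the bare original.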

    Essentially, the aim in BizTalk 2004 must be to avoid the XLANG/s default exception handling behaviour completely.

    BizTalk Server 2006's exception handling model is quite different: an exception which is not caught within the exception handlers of the orchestration causes the orchestration instance to be moved to the Suspended (Resumable) state, rather than Suspended (Not Resumable). An event log error entry is still written by XLANG/s, but it contains more context information, including the name of the orchestration shape where the exception emerged and the stack trace at the exception site (as well as some bad spelling ;-)):

    Event Type: Error
    Event Source: XLANG/s
    Event Category: None
    Event ID: 10034
    Date:  21/11/2006
    Time:  11:20:42
    User:  N/A
    Computer: VM-WS2K3
    Description:
    Uncaught exception (see the 'inner exception' below) has suspended an instance of service 'BizTalk_Server_Project1.BizTalk_Orchestration1(872ef22d-51f6-5cde-aaa0-5a7dcc036b3b)'.
    The service instance will remain suspended until administratively resumed or terminated.
    If resumed the instance will continue from its last persisted state and may re-throw the same unexpected exception.
    InstanceId: 014e260d-336c-458e-9d47-829afbe9148a
    Shape name: Expression_1
    ShapeId: 17d5c466-163b-460e-8372-cf385835b397
    Exception thrown from: segment 1, progress 6
    Inner exception: We don't like the name 'name_0'!
           
    Exception type: ApplicationException
    Source: ClassLibrary1
    Target Site: Void .ctor(System.String)
    The following is a stack trace that identifies the location where the exception occured

       at ClassLibrary1.Class1..ctor(String name)
       at BizTalk_Server_Project1.BizTalk_Orchestration1.segment1(StopConditions stopOn)
       at Microsoft.XLANGs.Core.SegmentScheduler.RunASegment(Segment s, StopConditions stopCond, Exception& exp)

           

    For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.

    As indicated in the event log text, the orchestration instance can be resumed (using HAT or WMI), and it will restart from its last persisted state. This goes some way to resolving the issues from which the BizTalk 2004 model suffers: the event log entry is more informative; and the orchestration state can be recovered much more easily.

    But we should not allow ourselves to think that these product changes mean that we can simplify to any great extent the way that we handle exceptions within our orchestrations. The fact that orchestration service instances can be resumed after an unanticipated and unhandled exception may be a saving grace in some circumstances, but it does not provide an automatic panacea which can be relied upon to replace specific exception handling within the orchestration:

    • for one thing, the resume mechanism is one that you do not want to be invoking routinely. It either requires use of the HAT tool by a human user, requiring appropriate alerting of operations personnel; or it needs to be done automatically using WMI, and controlling BizTalk via WMI can be problematic if done in high volumes - the BizTalk WMI providers place considerable load on the MessageBox, are not written to be highly scalable and may not behave correctly when multiple operations are performed concurrently (particularly those which change service instance states).
    • although more exception context information is displayed in the event log entry than was the case with BizTalk Server 2004, using this context information in any automated recovery mechanism is not very easy.
    • working out where the various persistence points in an orchestration are is not as easy as it might be. And it is possible for orchestration shapes executed after the last persistence point to have side effects: if these are not idempotent, automatic resumption from the last persistence point may lead to incorrect system state.

    So in general, I would not advocate trying to incorporate this out-of-the-box exception handling functionality into your routine error handling mechanisms; you still need to analyse all the failure scenarios in your process carefully, catch all anticipatable exceptions in the appropriate places, and provide appropriate compensation blocks and explicit retry loops where necessary; the default exception handling functionality should remain a safety blanket which you ideally never use. Regarding any XLANG/s event log error as a symptom of a defect in your orchestration is, I think, a good mindset with which to approach the design of orchestration exception handling.

    There is one exception scenario where BizTalk Server 2004 was deficient, and workarounds were very difficult, but BizTalk Server 2006 functionality is very much better: that is when an exception occurs during an attempt by BizTalk to execute a persistence point, for example if network problems cause connection errors when the BizTalk host instance running the orchestration tries to communicate with the MessageBox. In BizTalk Server 2004, such exceptions caused an unhandled exception in the orchestration, and the orchestration service instance was rendered immediately Suspended (Not Resumable). In BizTalk Server 2006 the BTSNtSvc host process is now more savvy when it encounters difficulties communicating with the MessageBox: it logs an error to the event log but keeps Active orchestration instances alive while it periodically retries the database call. If a retry succeeds, the orchestration instance proceeds on its way completely unaware of the temporary hiatus in MessageBox communication.

    Posted Dec 19 2006, 06:03 PM by chrisdi with 3 comment(s)
  • InvalidOperation? I was only trying to open a connection...

    I stumbled across a little ADO.NET gotcha the other day...

    By and large, ADO.NET's SQL client library presents a good, consistent exception interface when something goes wrong: errors and warnings are reported as instances of SqlException and there is lots of context wrapped up in these exception objects enabling exception handling code to distinguish between the various error conditions and implement a suitable failure or recovery strategy. This is fortunate, because lots of things can go wrong when calling into a database across a network connection, many of them being transient issues which call for retry strategies of one sort or another. So a common pattern in a data access layer involves catching and mapping SqlException instances to a specific exception handling implementation, following every call to the database:

    try
    {
       using (SqlConnection connection = ...)
       {
          ...
          connection.Open();
          // ... use the connection to call the database
       }
    }
    catch (SqlException sqlException)
    {
       HandleException(sqlException);   // Application-specific mapping function
    }

    This is so much nicer than using APIs which have a list as long as your arm of different exception types which might be thrown depending on the error condition.

    I was disappointed to discover that ADO.NET (1.1) is not perfect in this respect either. A system under heavy load suddenly threw this exception at us, neatly circumventing all our carefully crafted database exception handling and requiring operator intervention for a condition which was transient and eminently suitable for an automated approach to recovery:

    System.InvalidOperationException: Timeout expired.  The timeout period elapsed prior to obtaining a connection from the pool.  This may have occurred because all pooled connections were in use and max pool size was reached.
       at System.Data.SqlClient.SqlConnectionPoolManager.GetPooledConnection(SqlConnectionString options, Boolean& isInTransaction)
       at System.Data.SqlClient.SqlConnection.Open()
       at ... (our code)

    What made this even more disappointing was that a brief foray with Reflector immediately showed that this was not the side effect of some catch-all exception handling which happened to throw InvalidOperation as a default choice, but must have been the result of a conscious decision to throw InvalidOperation for a very specific timeout condition (the exception thrown is even instantiated in its own dedicated utility method System.Data.SqlClient.SQL.PooledOpenTimeout()). I would love to know what the thought process behind that design decision was. InvalidOperation is supposed to mean that the method call was invalid for the current state of the object on which it is made. Er... excuse me... I just called Open() on a freshly instantiated SqlConnection. And the GetPooledConnection method where it was actually thrown is static.

    InvalidOperationException unfortunately provides exactly zero additional context data beyond what it inherits from its base types System.SystemException and System.Exception, which leaves us having to parse the Message property to distinguish this particular timeout condition from any other InvalidOperationException which we might catch. Not very nice. 
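
    For what it is worth, the message-parsing workaround can at least be confined to one place. The sketch below is an illustration only: the PoolTimeoutDetector type and its IsPoolTimeout method are hypothetical names, and the marker string is simply a fragment of the message text quoted above, so the check assumes an English-language framework installation and may break if the wording ever changes.

```csharp
using System;

// Hypothetical helper: recognises the connection pool timeout by its message
// text, since InvalidOperationException carries no other distinguishing data.
public static class PoolTimeoutDetector
{
    // Fragment of the message quoted in the post. Assumes English-language
    // framework resources; localised messages would not match.
    private const string PoolTimeoutMarker = "obtaining a connection from the pool";

    public static bool IsPoolTimeout(Exception exception)
    {
        InvalidOperationException invalidOperation = exception as InvalidOperationException;
        return invalidOperation != null
            && invalidOperation.Message != null
            && invalidOperation.Message.IndexOf(
                   PoolTimeoutMarker, StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

    A data access layer could then add a catch block for InvalidOperationException alongside the existing SqlException handler, route the exception to its transient-error handling when IsPoolTimeout returns true, and rethrow otherwise.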

    Thankfully, I see that the implementation behind SqlConnection.Open has changed substantially in Framework 2.0, and SqlConnectionPoolManager.GetPooledConnection() no longer exists, so let's hope this has been fixed.

  • SQL Server: The effect of collation on CHECKSUM

    Diagnosing a bug in a SQL Server stored procedure recently I unearthed an interesting ‘feature’ of the CHECKSUM function which I couldn't find documented elsewhere.

    The stored procedure received some data as Unicode string arguments, and after some validation and processing stored it in tables defined with VARCHAR columns. (Yes, I know this sounds like a less than optimal design, but this was an application integration task and both the source data and the target database schema were defined by other applications). The problem was arising in another stored procedure which used the data from the tables, involving a join on a hash column which the first stored procedure had calculated using CHECKSUM. This join had the expected behaviour in most environments, but failed to generate any of the expected rows when executed on a particular SQL Server instance.

    The cause of the problem was eventually tracked down to the fact that the stored procedure code was not being completely fastidious about conversions between VARCHAR and NVARCHAR in the parts of the code where the CHECKSUM hash was being generated. On most SQL instances, this was not a problem since CHECKSUM(<VARCHAR>) yielded exactly the same value as CHECKSUM(<NVARCHAR>) for the string values concerned. It transpired, however, that for certain collations this relationship does not hold true, and the SQL Server instance where the problem emerged happened to have a default collation which exposed this difference.

    The following query evolved during diagnosis of the problem, and illustrates where the potential pitfalls lie when using CHECKSUM in T-SQL code which may need to be portable between SQL Server instances with potentially differing collations:

    set nocount on

    create table #collations (name sysname, sql nvarchar(1000))
    create table #compare (collationName sysname, vcharHash int, nvcharHash int)

    insert #collations
    SELECT
          name,
          'SELECT ''' + name + ''', CHECKSUM(''1234'' COLLATE ' + name + '), CHECKSUM(N''1234'' COLLATE ' + name + ')'
    FROM ::fn_helpcollations()

    declare c cursor for select sql from #collations
    declare @sql nvarchar(1000)

    open c
    fetch next from c into @sql
    while @@FETCH_STATUS = 0
    begin
          -- exclude Hindi collations from consideration as they can only apply to NVARCHAR values
          if NOT @sql like 'SELECT ''Hindi_%'
          begin
                insert #compare
                exec sp_executesql @sql
          end
          fetch next from c into @sql
    end

    close c
    deallocate c

    print 'Collations where CHECKSUM(VARCHAR) differs from CHECKSUM(NVARCHAR)'
    select * from #compare where vcharHash != nvcharHash

    print 'Distinct CHECKSUM values where CHECKSUM(VARCHAR) is the same as CHECKSUM(NVARCHAR)'
    select
          case WHEN collationName like '%_BIN' THEN 'Binary sort order' ELSE 'Dictionary sort order' END AS [Sort type],
          MAX(vcharHash) as [Hash Value],
          COUNT(*) as [Number of collations]
    from #compare
    where vcharHash = nvcharHash
    group by vcharHash, case WHEN collationName like '%_BIN' THEN 'Binary sort order' ELSE 'Dictionary sort order' END

    drop table #collations
    drop table #compare

    If you try running this on any SQL Server instance, you will note that:

    • The relationship CHECKSUM(VARCHAR) == CHECKSUM(NVARCHAR) holds (at least for the test string "1234" and all others that I have tried) for all collations other than the SQL Server sort order collations provided for backward compatibility with older SQL Server versions
    • CHECKSUM appears to always generate a different result with binary sort order collations than it does with dictionary sort order collations

    So, the moral of the story is: if you use CHECKSUM to generate hash values for use in indices or comparisons, be very careful and consistent about the collation of the expressions you pass to CHECKSUM, or you may end up with obscure latent bugs.  

  • Blogging

    Well, I've talked about doing it for some time... now that some of my colleagues have taken away my remaining excuses (no time to set it up, etc.), I can't avoid blogging any longer.

    Most of my work as a developer and architect is in the enterprise application integration arena, using Microsoft technologies, so expect to see posts about BizTalk, Windows Communication Foundation and other topics relevant to integration 'plumbing'.

    Posted Oct 31 2006, 11:41 AM by chrisdi with 1 comment(s)