Today a short article about an issue I faced on Entra ID joined Windows devices that failed to access on-premises resources.
I was recently testing Microsoft's Global Secure Access solution to provide access from Entra joined Windows devices to on-premises file shares, now that private DNS has recently become available. The solution was in place, and from a virtual machine on which I was signed in with a synced user account I could reach the on-premises shares.
So far so good. But then I enrolled a physical device in Intune and Entra ID and noticed I couldn't reach the file share that I could previously reach from my VM. As I used Windows Hello on my physical device and not on the VM, I first thought Windows Hello was the issue in this case. I checked my Cloud Kerberos Trust implementation, checked the Microsoft documentation and this excellent blog post on this topic, and scratched my head a couple of times.
When I locked my device and signed in again, this time with my password, I could reach the on-prem file shares. Pretty strange if you ask me.
I started to do some basic checks.
Resolve-DnsName returned the information I expected for LDAP and Kerberos.
Test-NetConnection to my Domain Controller on port 88 (Kerberos) and 389 (LDAP) worked fine.
Test-NetConnection on port 445 (SMB) to my server on which the SMB file share is located also showed a success.
nltest /dsgetdc returned the domain controller information as expected.
Opening the Advanced diagnostics of the Global Secure Access client showed a lot of connections on port 88 to my domain controllers. But these connections became active and were closed again almost immediately.
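The basic checks above can be sketched in PowerShell roughly as follows. This is a minimal sketch, not the exact commands I ran: the domain and file server names are the ones that appear in the event log in this post, while the domain controller name pkdc01 and the specific SRV records queried are assumptions you should replace with your own.

```powershell
# Assumed names: replace with your own domain, domain controller and file server.
$domain     = 'peterklapwijk.internal'
$dc         = "pkdc01.$domain"        # hypothetical domain controller name
$fileServer = "pkfile01.$domain"

# SRV records that clients use to locate LDAP and Kerberos services
Resolve-DnsName -Name "_ldap._tcp.dc._msdcs.$domain" -Type SRV
Resolve-DnsName -Name "_kerberos._tcp.$domain" -Type SRV

# TCP reachability of Kerberos (88) and LDAP (389) on the domain controller
Test-NetConnection -ComputerName $dc -Port 88
Test-NetConnection -ComputerName $dc -Port 389

# TCP reachability of SMB (445) on the file server
Test-NetConnection -ComputerName $fileServer -Port 445

# Ask the DC locator for a domain controller in the domain
nltest "/dsgetdc:$domain"
```

If all of these succeed, as they did for me, the network path itself is probably not the problem.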
After a while I hit myself on the head. I realized the issue in my lab environment could be the exact same Kerberos timing issue for which I recently applied a workaround in another environment. That would mean that not being able to reach the file shares when using Windows Hello was just a coincidence.
When I opened the Event Viewer and checked the SmbClient/Security event log, a lot of events with ID 31001 were listed.
Log Name: Microsoft-Windows-SmbClient/Security
Source: Microsoft-Windows-SMBClient
Event ID: 31001
Task Category: None
Level: Error
Keywords: (128)
User: SYSTEM
Description:
Smb2DiagReasonISC.
Error: The system cannot contact a domain controller to service the authentication request. Please try again later.
Security status: 0xC0000388
User name:
Logon ID: 0x85472
Server name: \pkfile01.peterklapwijk.internal
Principal name: cifs/pkfile01.peterklapwijk.internal
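To pull these events without clicking through the Event Viewer, something like the following should work. It is a small sketch that uses the log name and event ID from the entry above; the limit of 10 events is an arbitrary choice, and you may need an elevated session if access to the log is denied.

```powershell
# List the most recent SMB client security events with ID 31001
Get-WinEvent -FilterHashtable @{
    LogName = 'Microsoft-Windows-SmbClient/Security'
    Id      = 31001
} -MaxEvents 10 |
    Select-Object TimeCreated, Id, Message |
    Format-List
```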
In the other environment it turned out that (some) users were not able to reach on-premises resources directly after they signed in to their Windows device. The cause turned out to be Kerberos negative caching on Windows machines.
Kerberos negative caching on Windows machines refers to the process where the system temporarily stores (caches) failed Kerberos attempts. This caching helps to reduce the load on the Key Distribution Center (KDC) by preventing repeated attempts that are expected to fail again. The downside is that a failure that happens right after sign-in, such as the "cannot contact a domain controller" error in the event above, keeps being returned until the cached entry expires, even when the domain controller has become reachable in the meantime.
This Microsoft article describes what Kerberos negative caching is. The default caching time is 10 minutes. It turned out that if I just waited for 10 minutes, or executed klist purge_bind in an elevated command prompt, I could immediately connect to the on-premises file share, even with Windows Hello.
The workaround for me was pretty straightforward: create a DWORD registry value FarKdcTimeout under Computer\HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Lsa\Kerberos\Parameters.
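As a quick way to confirm you are hitting the same issue, the purge can be run and the share retested right away. This is a sketch: the share name share is a hypothetical placeholder, and the commands need to run in an elevated session.

```powershell
# Run in an elevated PowerShell or command prompt session.

# Show the cached domain controller bindings before purging
klist query_bind

# Purge the cached bindings (the command that unblocked access for me)
klist purge_bind

# Retest access; 'share' is a hypothetical share name, replace with your own
Test-Path '\\pkfile01.peterklapwijk.internal\share'
```

If the share is reachable directly after the purge but not before, the negative cache is a likely suspect.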
I set the value to 1, which means the cache timeout is 1 minute.
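Setting this value could look like the sketch below, assuming an elevated PowerShell session on the device. The value name, path and data come from the workaround above; whether you deploy it this way, through an Intune remediation script, or by another method is up to you.

```powershell
# Run elevated. Path and value name are the ones used in the workaround above.
$path = 'HKLM:\SYSTEM\CurrentControlSet\Control\Lsa\Kerberos\Parameters'

# Create the Parameters key only if it does not exist yet,
# so existing values under it are left untouched
if (-not (Test-Path -Path $path)) {
    New-Item -Path $path -Force | Out-Null
}

# Create or overwrite FarKdcTimeout as a DWORD with value 1 (minute)
New-ItemProperty -Path $path -Name 'FarKdcTimeout' -PropertyType DWord -Value 1 -Force | Out-Null

# Read the value back to verify
Get-ItemProperty -Path $path -Name 'FarKdcTimeout'
```

A value of 1 worked in my environments; test what fits yours before rolling it out broadly.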
On X I was pointed to this article, which contains some information regarding the Kerberos negative cache and the Global Secure Access client. It turns out that a network change triggers the Kerberos stack to refresh itself. Connecting through GSA does not trigger such a refresh, and neither does a solution like Zscaler, for example.
5 Comments
The value you set for the registry key is 1, correct?
Yes, for our environment that's correct. It configures the cache time to 1 minute.
Peter, did you try it with 0?
No, I didn’t.
Thank you for this blog post – resolved for me with one more thing.
KDC on DCs was hardened in Default GPO to fail on “KDC support for claims, compound authentication and Kerberos armoring” with Value 3.
This needed to be changed at least to Value 2 – Always provide claims
https://admx.help/?Category=Windows_10_2016&Policy=Microsoft.Policies.KDC::CbacAndArmor