Switching on DFS causes entire network to hang 2

Status
Not open for further replies.

Tightpants

Technical User
Jan 22, 2004
238
GB
Why would switching on the DFS service on our Windows 2003 servers cause the entire network to "hang" at random intervals? Our new servers were installed about a year ago and from the start we suffered with all the workstations locking up from time to time. After trying all sorts of things the cure seemed to be disabling the DFS service.

Any ideas?
 
what's your replication interval set to?

-Brandon Wilson
MCSE00/03, MCSA:Messaging, MCSA03, A+
almost got a paragraph there :)
 
I have checked DFS and there are no roots configured therefore I cannot check the replication interval, unless you know where else I can check! In the event log there is a FRS category so something must be working somewhere. I will have a more detailed look tomorrow and get back to you.
 
what kind of servers, role-wise? member servers I take it...no DCs I hope

gotta keep in mind that if these are DCs, ntfrs.exe is used by both DFS and FRS for sysvol replication

If you have a File Replication Service log, you are likely on a DC, unless there is a dfs root, which can be hidden in the console...if you right click on distributed file system and select show roots...see if anything shows up in there

I'll check at work tomorrow for any known issues

but first i need you to elaborate on what you mean by "hang"

what's the workstation OS?

also need you to elaborate on locking up.

is it through Office, or on Office docs I should say...any sort of pattern like that?


what are the FRS errors

first thing i would check is for a memory leak in dfs though....

SP1 releases tomorrow by the way (in fact, I just got out of a training class for it about 2 hrs ago), and it has A LOT of fixes in there, and actually makes quite a few very nice changes...although we did identify a few places where support calls will be generated on a medium scale...but there is a lot of nice functionality as I say...including updates to ntfrs and dfs related items...and metadata cleanup now completely cleans up AD and no longer requires manual removal of failed DCs from adsiedit :)
the security configuration wizard is another nice plus...but that's a spot where we expect a lot of support calls from those who do not fully understand the effects of lmcompatibility and smb signing

it might be worth it for you to try the SP1 install

however something to keep in mind about SP1...slipstreamed SP1 installs, or upgrades from NT4 to slipstreamed 2003 w/SP1, will enable the windows firewall and break a lot of stuff

like I say though, get me a more elaborate description, and you're bound to be asked for more details or more questions about your environment anyway...I'll look into it for you, and if there's a hotfix or anything I can locate that can save you the SP1 install, I'll send it to you (that statement is erroneous since all hotfixes are included in the next SP unless a better one comes out...with very few other exceptions)
enable the dfs service again on a server and check task mgr for the cpu and memory usage of the executable

is there any frequency pattern to the hangs or lockups?

-Brandon Wilson
MCSE00/03, MCSA:Messaging, MCSA03, A+
almost got a paragraph there :)
 
The two servers are domain controllers, not member servers. There are no roots configured in DFS. The only event ids in the FRS event log are information events 13501 and 13516.

When I say "hang" I meant stop responding. You can be working away in AutoCAD or MS Office and suddenly the mouse will stop moving. You cannot do anything for maybe 30 seconds up to a few minutes. The frequency and pattern of the hangs is completely random.

All the workstations except one (an NT4 workstation) are running Windows XP Pro SP2 and the latest critical updates.

I have tried checking CPU and memory use for the DFS service however I could be sitting there for hours watching for something to happen. I have tried event logging but this didn't show anything unusual.

I would be quite happy to install Windows Server 2003 SP1. We only had a couple of minor problems with the Windows XP firewall when XP SP2 arrived.
 
well since you already have an installation, it will not enable the firewall. so no worries there.

ok so if you do a net share on the guy getting the 13516 (if not both), are sysvol and netlogon shared?

you log the 13501 at service start

but you log the 13516 BECAUSE you have disabled DFS

this is why I made the comment about hoping they are not DCs: sysvol cannot replicate with the dfs service disabled, so that is going to be a major problem...your AD replication will succeed, providing there are no other problems, but FRS will fail

what kind of switches are you running? and are they smart or dumb switches?

dfs service is not causing your issue

are the files being shared also hosted on the DCs?

what this translates to for you >>>> eventually your policies will be skewed and user logons will differ in behavior depending on GPO changes made on the other DC

bottom line is that you cannot keep that disabled and have a healthy domain...you'll be talkin to me on the phone in a week and I'll be fixin ya myself
as far as tracking a leak down
use perfmon
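
Rather than watching Task Manager for hours, one option is to let perfmon log the service's Private Bytes counter and look at the trend afterwards. A minimal sketch, assuming the readings have already been copied out of the log as (seconds, bytes) pairs; the function names, the sample readings, and the 1 MB/hour threshold are made up for illustration and are not anything perfmon provides:

```python
def growth_per_hour(samples):
    """Least-squares slope of memory use, in bytes per hour.

    samples: list of (seconds_since_start, bytes_in_use) pairs,
    e.g. Private Bytes readings copied out of a perfmon log.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_b = sum(b for _, b in samples) / n
    # Ordinary least-squares slope: cov(t, b) / var(t)
    cov = sum((t - mean_t) * (b - mean_b) for t, b in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("all samples have the same timestamp")
    return cov / var * 3600  # bytes/second -> bytes/hour


def looks_like_leak(samples, threshold_bytes_per_hour=1_000_000):
    # Hypothetical threshold: steady growth above ~1 MB/hour deserves a closer look.
    return growth_per_hour(samples) > threshold_bytes_per_hour


if __name__ == "__main__":
    # Made-up readings, one every 10 minutes, growing 2 MB each time.
    readings = [(i * 600, 50_000_000 + i * 2_000_000) for i in range(12)]
    print(round(growth_per_hour(readings)), looks_like_leak(readings))
```

A steady positive slope over several hours points at a leak; a flat or sawtooth line does not.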

you could very easily be seeing the problem occur at group policy refresh

ask a user how long goes by from issue to issue, if about 1.5 hrs, its likely GP refresh
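
If users can jot down when each hang starts, the gaps can be checked against the refresh interval. A rough sketch, assuming the default background refresh for workstations (90 minutes plus a random offset of up to 30 minutes) has not been changed by policy; the timestamps, the 5 minute slack, and the function names are invented for illustration:

```python
from datetime import datetime, timedelta

# Default background refresh for workstations: 90 minutes plus a random
# offset of up to 30 minutes (assumption: the policy is still at default).
REFRESH_MIN = timedelta(minutes=90)
REFRESH_MAX = timedelta(minutes=120)


def gaps_between(hang_times):
    """Return the gaps between consecutive hang timestamps, oldest first."""
    ordered = sorted(hang_times)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def fraction_matching_refresh(hang_times, slack=timedelta(minutes=5)):
    """Fraction of gaps that fall inside the refresh window, with some slack."""
    gaps = gaps_between(hang_times)
    if not gaps:
        return 0.0
    hits = sum(1 for g in gaps if REFRESH_MIN - slack <= g <= REFRESH_MAX + slack)
    return hits / len(gaps)


if __name__ == "__main__":
    # Made-up hang times noted by one user on one morning.
    start = datetime(2005, 3, 30, 9, 0)
    noted = [start + timedelta(minutes=m) for m in (0, 95, 200, 310)]
    print(fraction_matching_refresh(noted))
```

A fraction near 1 on several workstations would support the GP refresh theory; a fraction near 0 makes it unlikely. A hang that nobody noted shows up as a double-length gap, so treat the result as a hint only.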

it is possible that if you are sharing the folders on the DCs, replication is happening whenever a change is made, and your users have 0 priority when it comes to the DCs replicating...with the exception of authentication



-Brandon Wilson
MCSE00/03, MCSA:Messaging, MCSA03, A+
almost got a paragraph there :)
 
oh about the switches...if Cisco or smart 3Coms (or equal), try disabling spanning tree altogether, if you can't do that, then enable portfast....one or the other must be done though.

-Brandon Wilson
MCSE00/03, MCSA:Messaging, MCSA03, A+
almost got a paragraph there :)
 
'try disabling spanning tree altogether'

Bold statements like that bring networks down......

Andy
 
We do get some USERENV errors when the servers are rebooted and they relate to group policy not updating. This happens because DFS is not working. However I can replicate the settings using Sites and Services or GPUPDATE.

The switch is a 3Com but during the process of trying to isolate the problem we tried a completely different Netgear switch which did not cure the problem.

You could well be correct when you say that the workstations are being excluded during replication. This suggests that the FRS is struggling which relates to your first response about the replication interval.
 
workstations do not replicate
their objects in AD do

AD sites and services is actually for AD replication, but it does force both

the userenv errors....1030 and 1058? There's a hotfix for that, it is also included in SP2

let me know the errors and I'll tell ya how to tshoot it...

now gpupdate...that command should always be run as gpupdate /force...otherwise it doesn't do anything immediately...and even with force, it only applies some of the settings

xp by default takes two reboots to apply group policy

I'll tell ya what...go to this link and download the 5th file in the list (MPSRPT_DirSvc.exe) and double click to run it....do this on an XP machine, as well as both DCs...then we'll figure out a way for you to get me the cab files generated (they will be in %systemroot%\MPSReports\DirSvc\Logs\Cab).

gpupdate is only to apply policy...does not mean anything from a replication perspective

good go on the switch btw


what i am curious to know:
1. Who does your PDC emulator point to as preferred DNS? This should be himself and himself only (if not currently set that way, change it, as you have to if you want anything to ever work right). If you just changed this, restart the netlogon service
2. Who does the replica DC point to for preferred DNS....on the replica this should be the PDC emulator in your particular environment, with itself as secondary (the PDC does not need a secondary)
3. Are both DCs Global Catalogs? If so, identify your infrastructure master and take the global catalog function off of him by going to AD sites and services, right clicking his ntds object, and unchecking global catalog...this may be the cause of your occasional slow browsing issue.....
4. the dns tab in advanced tcp/ip properties should be set to defaults, that is, register this connection's addresses in DNS, append primary and connection specific DNS suffixes, append parent suffixes of the primary dns suffix, and the dns server address(es)
5. All clients should have the same settings for DNS as outlined in steps 1 and 2, if not, change them (if not using any WebDAV apps go ahead and disable the webclient service as well)
6. ensure all required services are started, this includes dhcp client (who is responsible for ALL dynamic registrations, whether in DNS or for an actual DHCP address acquisition), dns client, tcp/ip netbios helper, windows time, and a host of others...if you are at default in services, stay that way
7. is the firewall disabled on your xp clients?


The MPS reports will tell us if you have a replication problem, or if sysvol is not shared on one of the DCs...actually you can give me that input beforehand.....just run "net share" from a cmd prompt on each domain controller...do sysvol and netlogon get listed?
a lot of food for thought eh
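
The net share check can be scripted if there are more than a couple of DCs to look at. A rough sketch, assuming the command's output is pasted or piped in as text; the sample below is typed from the usual layout of that output rather than captured from a real server, and the function names and paths are made up:

```python
REQUIRED = {"SYSVOL", "NETLOGON"}


def shared_names(net_share_output):
    """Collect the share names listed in the text printed by 'net share'."""
    names = set()
    in_table = False
    for line in net_share_output.splitlines():
        if line.startswith("---"):
            in_table = True  # the share table starts after the dashed rule
            continue
        if not in_table or not line.strip():
            continue
        if line[0].isspace():
            continue  # wrapped remark belonging to the previous share
        if line.startswith("The command"):
            break  # trailing status line
        names.add(line.split()[0].upper())
    return names


def missing_dc_shares(net_share_output):
    """Return the shares a DC should have but does not list."""
    return sorted(REQUIRED - shared_names(net_share_output))


if __name__ == "__main__":
    # Illustrative output only; resource paths are placeholders.
    sample = r"""Share name   Resource                        Remark

-------------------------------------------------------------------------------
C$           C:\                             Default share
NETLOGON     C:\WINDOWS\SYSVOL\sysvol\example.com\SCRIPTS
                                             Logon server share
SYSVOL       C:\WINDOWS\SYSVOL\sysvol        Logon server share
The command completed successfully.
"""
    print(missing_dc_shares(sample))
```

An empty list means both shares are present on that DC.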

if DFS is disabled still, you are not replicating files in sysvol.

there are two replication engines in AD...AD replication and file replication.....normally AD replication stays good, which is why you likely see a user you create on one DC on the other...
the question is, if you make a txt file on one DC inside of %systemroot%\sysvol\sysvol\<domain.com>, does it replicate to the other DC? try it and see, if it does, try it the other way, make a txt file on the other DC and look for it on the first one we tested from....replication will happen (if not before 5 minutes) every 5 minutes with a 30 second offset for the first two "busy" attempts, and a 15 second offset thereafter....so we should see these files within a max of 10 minutes (your time should be more like 30 seconds to 1 minute)
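
The text-file test can also be timed with a small script. A minimal sketch, assuming both sysvol folders are reachable as paths from the machine running it (for example through each DC's SYSVOL share); the server names, the domain name, the marker file name, and the function names are placeholders, and the 10 minute timeout simply mirrors the maximum quoted above:

```python
import time
from pathlib import Path


def wait_for_file(path, timeout_s=600, poll_s=15):
    """Poll until path exists; return seconds waited, or None on timeout."""
    start = time.monotonic()
    while True:
        if Path(path).exists():
            return time.monotonic() - start
        if time.monotonic() - start >= timeout_s:
            return None
        time.sleep(poll_s)


def replication_test(source_dir, replica_dir, name="repl-test.txt", **wait_args):
    """Create a marker file in source_dir and wait for it to appear in replica_dir."""
    marker = Path(source_dir) / name
    marker.write_text("FRS replication test\n")
    return wait_for_file(Path(replica_dir) / name, **wait_args)


if __name__ == "__main__":
    # Hypothetical DC and domain names; substitute your own.
    waited = replication_test(r"\\DC1\SYSVOL\example.com", r"\\DC2\SYSVOL\example.com")
    if waited is None:
        print("marker file did not arrive within the timeout")
    else:
        print(f"marker file arrived after about {waited:.0f} seconds")
```

Swap the two paths to test the other direction, and delete the marker file afterwards so it does not linger in sysvol.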

-Brandon Wilson
MCSE00/03, MCSA:Messaging, MCSA03, A+
almost got a paragraph there :)
 
shite i forgot the link to mps reports



also, SP1 for Win2003 was due to be released today, but for some oddball reason one of our dev guys pulled it off the line to go out today...which sucks, there are A LOT of enhancements...makes life much much easier

if you are still at RTM build, this is definitely a good idea

as of this second, only RC2 is out on the net.....

I'll try to find out the deal with that tomorrow

anyway get me that data, and I'll save you a $250 call to my team to fix ya :)

-Brandon Wilson
MCSE00/03, MCSA:Messaging, MCSA03, A+
almost got a paragraph there :)
 
The USERENV errors are 1030 and 1058 but these would go away if DFS was enabled.

1. The PDC emulator points to itself for DNS. There is no alternate DNS server.
2. The replica DC points to itself for DNS and the PDC for the alternate DNS server. Both are configured as DNS servers.
3. Both DCs are Global Catalogs.
4. The DNS advanced options are set as you have listed. "Use this connection's DNS suffix in DNS registration" is not ticked.
5. We are not running any WebDAV apps as far as I know so I can disable the webclient service which is currently visible in my network places.
6. DHCP and other services are all running as default.
7. The firewall is enabled on XP but we had the problems way before XP SP2.

Most of the above settings were checked during the process of trying to eliminate our problems.

When I run "net share" both NETLOGON and SYSVOL are listed.

I created a test text file in the %systemroot%\sysvol\sysvol\<domain.com> folder and it replicated straight away, in both directions.

The CAB files generated by the Reporting Tool are huge. Is there anything in particular I need to check? Looks like there is a couple of weeks reading there!
 
lol

no not that hard

dcdiag and netdiag, just do a find in the txt file for "failed"

gpotool, just do a find for "error"

regentries has to actually be compared
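
The keyword searches above can be done in one pass instead of opening each file by hand. A small sketch, assuming the report text has already been read out of the cab files into strings; the tool-to-keyword table follows the advice above, while the function names and the sample report lines are invented:

```python
# Keyword to look for in each tool's report, as suggested above.
KEYWORDS = {"dcdiag": "failed", "netdiag": "failed", "gpotool": "error"}


def find_lines(text, keyword):
    """Return (line_number, line) pairs containing keyword, ignoring case."""
    needle = keyword.lower()
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if needle in line.lower()
    ]


def scan_reports(reports):
    """reports maps a tool name to its report text; unknown tools are skipped."""
    return {
        tool: find_lines(text, KEYWORDS[tool])
        for tool, text in reports.items()
        if tool in KEYWORDS
    }


if __name__ == "__main__":
    # Made-up report fragments, just to show the shape of the result.
    reports = {
        "dcdiag": "Starting test: Connectivity\n  ......... DC1 passed test Connectivity\n",
        "netdiag": "IPX test\n    Some check FAILED.\n",
        "gpotool": "Policies OK\n",
    }
    for tool, hits in scan_reports(reports).items():
        print(tool, hits)
```

The registry entries still need a side-by-side comparison, so they are deliberately left out of the table.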

so replication is good, that is good

the reason your clients get the 1030 and 1058 is because with DFS disabled, you cannot access a DFS share, and sysvol is a DFS share, and sysvol is where your group policies are stored....another reason you must have DFS enabled

replication really should not be happening between those boxes though if DFS is disabled...but I've seen stranger things...they must've patched it (I'll have to search our internal kb to find out when the hell they did that)

you need to change your DNS config to what I outlined...stop islanding your DNS by pointing replicas to themselves as preferred DNS, it needs to be the PDC, the only time that is acceptable is if the DC is in a remote site, and has forwarders to all other DNS servers in the domain

turn the firewall off on your clients....it's more trouble than it's worth unless you know all the exceptions...I believe there is a kb of exceptions required by windows firewall for domain authentication...plug those words into google and you'll prolly get it coming up (providing the article is public of course)

do you only have the one domain? do you plan on ever having trusts with other domains, or adding child domains, or anything like that? If so, you must take the GC function off of your infrastructure master fsmo holder.....having the GC function on the infrastructure master causes the fsmo role to be non-functional because the GC function uses the same subsets and logic as the infrastructure master does, but the GC function, when on an infrastructure master, will take priority and basically never allow the infrastructure master to work

I cannot stress that enough...and that can very easily cause browsing issues on the domain too...meaning the hanging symptoms you described


-Brandon Wilson
MCSE00/03, MCSA:Messaging, MCSA03, A+
almost got a paragraph there :)
 
also cannot stress enough the importance of having your DNS settings configured correctly on replica DCs...if you don't, you'll be asking for headaches later

although luckily 2003 is much much better than 2000 was with respect to DNS islanding

-Brandon Wilson
MCSE00/03, MCSA:Messaging, MCSA03, A+
almost got a paragraph there :)
 
I have checked dcdiag, netdiag and gpotool for "failed" and "error". The only errors found were in netdiag on both servers. The errors were... Opening \device\nwlnkipx failed.

But this is with DFS disabled. I may get different results if I enabled DFS.

I know the XP firewall is not a problem because we had problems before SP2.

We only have one domain and we are unlikely to change this. So I guess that means we are unlikely to have problems with the GC function on an infrastructure master.

A little bit of history... At the outset our DNS was not working properly and we were relying on hosts files - that was the way it was configured. I sought a second opinion and the DNS configuration was corrected. We are just about to install a new server at a second site so we could have similar problems there. I have been advised to configure the second site as a separate "site" in Sites and Services with a different subnet. The server on the second site will be a DC, DHCP server and DNS server (with forwarders to the DNS server in the main site).

I really appreciate your helpful suggestions. Thank you. I will be offline for a few days so it may be a while before I can get back to you with more information.
 
Ok the failure you mentioned. It's not a problem:
Opening \device\nwlnkipx failed.

this is ipx/spx (novell junk), and if it's not installed, it reports a failure there

do me a favor and uncheck the global catalog box as described before

even without other domains, I have seen GCs on infrastructure masters cause strange problems

it really sounds like you are getting some network chatter somewhere from something that is taking up a lot of bandwidth....also possible someone is trying to force browser elections on the network..which could be identified by errors on clients or other servers.

do you have some sniffing software there? I would suggest grabbing one. You can probably find some free ones on webattack dot com.

are you familiar with network monitor at all or ethereal? I would be interested to see 4 traces if possible. 1 from the client when the problem is occurring, 1 from the client when the problem is not occurring, 1 from the server the client is connecting to when the problem is occurring (taken at the same time as the client-with-problem trace), and 1 from the server when the problem is not occurring. you can streamline this if you find a client working correctly and 1 not at the same time, then you can do all in one trace.



-Brandon Wilson
MCSE00/03, MCSA:Messaging, MCSA03, A+
almost got a paragraph there :)
 
I have downloaded and installed Ethereal. When I get a free half day I will switch on DFS and check the information Ethereal comes up with.

Obviously there is no easy answer.

I will also try installing Windows Server 2003 SP1 and see if that makes any difference.

Thanks.
 
if you install SP1, make sure to do it on all DCs

in response to ABD100. The spanning tree algorithm can (and will) break replication, authentication, GPO application, and a slew of other items
enabling portfast bypasses all of those problems, and is the current recommendation (actually a requirement, a customer refusing to take this step will not get supported if, when all else fails, it appears spanning tree is the cause...I have seen this literally 100s of times)
the old recommendation was to completely disable spanning tree, and that is still a very viable troubleshooting step...don't expect to fix anything if you don't eliminate possible causes...spanning tree being a major one...however that is not involved here

what does your DNS config look like? Be sure the DNS tab in advanced tcp/ip properties is at default, and also ensure netbios over tcpip is enabled on the WINS tab

should be PDC emulator points to himself and himself only
all other DCs and clients in same site as PDC should use PDC as preferred, and others as alternates
in remote sites...pick a DC and use him for preferred on all DCs and clients in that site, and set a forwarder on the preferred to the PDC
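
The three rules above are easy to get wrong on one box out of many, so here is a sketch of checking a written-down inventory against them. Everything in it is an assumption for illustration: the Host record, the site names, the addresses, and the idea of keeping such an inventory by hand; it does not read anything from Windows.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    ip: str
    site: str
    preferred_dns: str
    alternate_dns: list = field(default_factory=list)
    forwarders: list = field(default_factory=list)  # only meaningful on DNS servers


def check_dns_layout(hosts, pdc_ip, site_dns):
    """Return a list of complaints; an empty list means the rules are met.

    site_dns maps each remote site to the DC picked as preferred DNS there.
    """
    pdc = {h.ip: h for h in hosts}[pdc_ip]
    problems = []
    for h in hosts:
        if h.ip == pdc_ip:
            # Rule 1: the PDC emulator points to itself and itself only.
            if h.preferred_dns != pdc_ip or h.alternate_dns:
                problems.append(f"{h.name}: PDC emulator should point to itself only")
        elif h.site == pdc.site:
            # Rule 2: same site as the PDC, so the PDC is preferred.
            if h.preferred_dns != pdc_ip:
                problems.append(f"{h.name}: preferred DNS should be the PDC emulator ({pdc_ip})")
        else:
            # Rule 3: remote site uses its chosen DC, which forwards to the PDC.
            chosen = site_dns.get(h.site)
            if chosen is None:
                problems.append(f"{h.name}: no preferred DNS server chosen for site {h.site}")
            elif h.preferred_dns != chosen:
                problems.append(f"{h.name}: preferred DNS should be {chosen}")
            if h.ip == chosen and pdc_ip not in h.forwarders:
                problems.append(f"{h.name}: needs a forwarder to the PDC emulator ({pdc_ip})")
    return problems


if __name__ == "__main__":
    # Placeholder inventory: one replica DC is "islanded" on purpose.
    inventory = [
        Host("pdc", "10.0.1.1", "main", "10.0.1.1"),
        Host("dc2", "10.0.1.2", "main", "10.0.1.2", ["10.0.1.1"]),
        Host("dc3", "10.0.2.1", "branch", "10.0.2.1", ["10.0.1.1"], ["10.0.1.1"]),
    ]
    for line in check_dns_layout(inventory, "10.0.1.1", {"branch": "10.0.2.1"}):
        print(line)
```

Clients can go in the same inventory as the DCs, since the same preferred-server rules apply to them.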

I would be interested in seeing a userenv log.



-Brandon Wilson
MCSE00/03, MCSA:Messaging, MCSA03, A+
almost got a paragraph there :)
 
The new server went in yesterday in the second site. This server is a DHCP and DNS server for that site. All was working okay yesterday but after rebooting the two servers on the main site overnight, DNS wasn't working this morning, ie. Internet browsing would not work. When I checked the DNS settings on the PDC it did not have any forwarders configured. When I added our ISP's DNS servers on the forwarders tab in DNS it all started working again. I don't know how it could have worked before the changes but this could be significant.

By the way I also noticed that the DHCP server on the PDC was not authorised. I have authorised it now though.

Can I just check the DNS settings with you again?

- The PDC (192.168.1.1) points to itself because it is a DNS server. In the forwarders tab it has the ISP's DNS servers listed.
- The replica DC (192.168.1.2) on the same site points to itself as a DNS server (because it is configured as one) and has the PDC (192.168.1.1) configured as a secondary DNS server in the LAN properties. There is nothing listed in the forwarders tab.
- The DC at the remote site (192.168.2.1) is also configured as a DNS server. It points to itself as the primary DNS and the PDC in the other site (192.168.1.1) as the secondary DNS server. I have added the ISP's DNS servers in the forwarders tab but I am not sure that this is correct. If I take out the forwarders then all DNS queries will have to go via our VPN link between the sites which will increase network traffic.

If I can get these DNS settings correct, and I think there were a couple of errors there, I think I may try switching DFS back on. DFS is enabled in the remote site and is not causing problems.

I also need to install Windows Server 2003 SP1 which I haven't done yet. I am getting to grips with Ethereal in case I need it later.
 
In response to ADGod - yes spanning-tree portfast is the way to go to remove interface start-up delay problems but disabling STP altogether without any analysis of the network is a BIG NO-NO and is likely to make things MUCH WORSE.
I have heard people make rash comments regarding STP to customers and then have to clean up the mess after they willingly disable it and then lose connectivity to everything. STP can be a complex subject if not designed correctly - working out the roots of each VLAN, the designated ports, the root ports, which switch is blocking and where etc etc.
In modern Layer-3 networks (hopefully everyone is moving away from extended Layer-2 networks?) STP is not normally an issue but make sure it is designed correctly.

Good luck

Andy
 