|Contents||Bulletin||Scripting in shell and Perl||Network troubleshooting||History||Humor|
This AppNote describes how to configure a NFSv4 Server and Client on a SLES 10 box.
Table of Contents
- Daemons for NFSv4
- About NFSv4 Daemons
- NFS Server Configuration
- Client side configuration
5.1. Automount in NFSv4
5.2. Using /etc/fstab to mount NFSv4 exported volume
NFS is a UNIX protocol for large scale client/server file sharing. It is analogous to the server Message Block (SMB) and Common Internet File System (CIFS) protocols on Microsoft Windows. The Network File System Version 4 is a distributed filesystem protocol which owes heritage to NFSv2 and NFSv3. Unlike previous versions of NFS the present version(NFSv4) supports traditional file access while integrating support for file locking and mount protocol. There are many additional features with NFSv4 such as support for strong security, compound operations, client caching and internationalization.
NFSv4 is the successor of NFSv3. It has been designed to work on a LAN or over the Internet.
NFSv4 comes with several new features:
- Advanced security management
- Firewall friendly
- Advanced and aggressive cache management
- Non Unix compatibility (Windows)
- Easy to administer (Replication, migration)
- Crash recovery (Client and server sides)
NFSv4 uses 32 KBytes pages.
The NFSv3 and NFSv4 protocols are not compatible. A NFSv4 client cannot access a NFSv3
server, and vice versa. However, in order to simplify migrations from NFSv3 to NFSv4, both
NFSv3 and NFSv4 services are launched by the command: rpc.nfsd.
In the case of NFSv3 and NFSv4 clients simultaneously accessing the same server, one must be aware that two different file systems are used: there is no backward support to NFSv3 by the NFSv4 server.
In order to ensure a better reliability over the Internet, NFSv4 only uses TCP. To help NFS setup for internet use, one unique network port is used on NFSv4. This predetermined port is fixed. The default is port 2049.
2. Daemons for NFSv4:
client side both sides server side user commands: mountexportfs user daemons: portmapidmapd nfsd kernel parts: NFSv4RPCXDRTCPIpv4
The following are the Daemons that should be running on a NFSv4 Server:
- rpc.nfsd 8
The following are the Daemons that should be running on a NFSv4 client:
3. About NFSv4 Daemons
A NFSv4 client communicates with corresponding NFSv4 Server via Remote Procedure Calls (RPS's). The client sends a request and gets a reply from the server.
A NFSv4 server can only provide/export a single, hierarchical file system tree. If a server has to share more than one logical file system tree, the single trees are integrated in a new virtual root directory. This construction, called pseudo file system, is the one which is provided/exported to clients.
rpc.mountd — This process receives mount requests from NFS clients and verifies the requested file system is currently exported. This process is started automatically by the nfs service and does not require user configuration. This is not used with NFSv4.
rpc.idmapd — rpc.idmapd is the NFSv4 ID <-> name mapping daemon. It provides functionality to the NFSv4 kernel client and server, to which it communicates via upcalls, by translating user and group IDs to names, and vice versa.
rpc.svcgssd — This process provides the server transport mechanism for the authentication process (Kerberos Version 5) with NFSv4. This service is required for use with NFSv4.
rpc.gssd — This process provides the client transport mechanism for the authentication process (Kerberos Version 5) with NFSv4. This service is required for use with NFSv4.
To start the NFS server issue the command:/etc/init.d/idmapd /etc/init.d/svcgssd start (only if kerberos support is enabled/required) /etc/init.d/nfsserver start
On the NFS client type the following commands:/etc/init.d/idmapd start /etc/init.d/gssd start (only if kerberos support is enabled/required)
To check the exported volume from the server type the following command:showmount -e NFSserver name
4. NFS Server Configuration
This document explains how to configure and use NFSv4 on a SLES 10 box and covers the basic NFSv4 configuration and the automount facility using autofs. This setup is made on SUSE Linux 10.1.
To enable NFSv4 on the machine check: /etc/sysconfig/nfs
NFS_SUPPORT = "yes"
In /etc/exports make an entry of your exported path with the export options for eg:-
/etc/exports - contains a list of all directories that are to be exported via
NFS. The syntax is slightly different from NFSv3. Here is a sample entry:/nfs *(rw,fsid=0,insecure,no_subtree_check,sync,no_root_squash) /nfs gss/krb5(rw,fsid=0,insecure,no_subtree_check,sync,no_root_squash) /nfs gss/krb5i(rw,fsid=0,insecure,no_subtree_check,sync,no_root_squash) /nfs gss/krb5p(rw,fsid=0,insecure,no_subtree_check,sync,no_root_squash)
Note: Single line entry for each security mode fsid - The value 0 has a special meaning when use with NFSv4. NFSv4 has a concept of a root of the overall exported filesystem (Pseudofilesystem). The export point exported with fsid=0 will be used as this root.
no_subtree_check - If a subdirectory of a filesystem is exported, but the whole filesystem isn't then whenever a NFS request arrives, the server must check not only that the accessed file is in the appropriate filesystem (which is easy) but also that it is in the exported tree (which is harder). This check is called the subtree_check. This option disables subtree_check.
Insecure - The insecure option in this entry also allows clients with NFS implementations that don't use a reserved port for NFS.
- /nfs *(rw,fsid=0,no_subtree_check,no_root_squash,sync)
exported paths export options for nfsv4
To export multiple volumes in NFSv4, follow the steps below:
If we want to export two directories say /NFS1 & /NFS2, then export the NFS1 as explained above. But for NFS2 we have to create a directory NFS2 in /NFS1.
- mkdir /NFS1/NFS2
- Bind the directory /NFS2 to /NFS1/NFS2 to do this execute the following command:
mount ?bind /NFS2 /NFS1/NFS2
- Now configure /etc/exports with the sample entries shown below:
NOTE:- notice the highlighted fields in the above entries.
- Mount the server from the client
mount -t nfs4 nfsserver:/ /mnt/
You should be able to access the files under /NFS1 and files under /NFS1/NFS2.
Checklist to ensure NFSv4 is up and running:
- ps -ef | grep nfsd; ps -ef | grep idmapd; ps -ef | grep svcgssd to check server side daemons
- ps -ef | grep idmapd; ps -ef | grep gssd to check client side daemons
- rpcinfo -p to check all registered RPC programs & versions
- Check firewall is enabled on server/client from Yast -> Security and Users -> Firewall. Make sure NFS services is enabled.
- showmount -e server to check mount information on NFS server
- If you are using NFSv4, make sure that one and only one path is exported with fsid=0. Refer Pseudo file systems for more information.
- If users are not mapped properly check whether idmapd is running in both server & client and dns domain name is properly configured.
- If you encounter problems when you use kerberos security mode, check whether rpc.svcgssd (server) & rpc.gssd (clients) daemons are running and keytab file is extracted.
- If you unable to mount, check the exports file entry.
5. Client Side Configuration
5.1 Automount in NFSv4
To automount a NFSv4 exported volume using Autofs, follow the steps below:
There are two files which are mainly responsible for automount to work using autofs. These two files fall under /etc directory.
- auto.misc or auto.home or auto.xxxxxx.Here xxxxxx can be any name.
Here are the contents of auto.master:# # $Id: auto.master,v 1.4 2005/01/04 14:36:54 raven Exp $ # # Sample auto.master file # This is an automounter map and it has the following format # key [ -mount-options-separated-by-comma ] location # For details of the format look at autofs(5). #/misc /etc/auto.misc --timeout=60 #/smb /etc/auto.smb #/misc /etc/auto.misc #/net /etc/auto.net /export /etc/auto.misc
In the above file, auto.misc file can also be auto.home or auto.xxxxxxx and the corresponding entry in the auto.misc or auto.home or auto.xxxxxx should be the one below (I have used auto.misc).# # $Id: auto.misc,v 1.2 2003/09/29 08:22:35 raven Exp $ # # This is an automounter map and it has the following format # key [ -mount-options-separated-by-comma ] location # Details may be found in the autofs(5) manpage cd -fstype=iso9660,ro,nosuid,nodev :/dev/cdrom export -fstype=nfs4,rw NFSServer:/ # the following entries are samples to pique your imagination #linux - ro,soft,intr ftp.example.org:/pub/linux #boot -fstype=ext2 :/dev/hda1 #floppy -fstype=auto :/dev/fd0 #floppy -fstype=ext2 :/dev/fd0 #e2floppy -fstype=ext2 :/dev/fd0 #jaz -fstype=ext2 :/dev/sdc1 #removable -fstype=ext2 :/dev/hdd
After making these entries we have to restart the autofs. Type the command:/etc/init.d/autofs restart or /service/autofs restart
After this, you can check the status of the autofs by issuing the command:/service/autofs status
Now check for the mount by typing the command: mount. It shows something like this:$ mount /dev/hda1 on / type reiserfs (rw,acl,user_xattr) proc on /proc type proc (rw) sysfs on /sys type sysfs (rw) udev on /dev type tmpfs (rw) devpts on /dev/pts type devpts (rw,mode=0620,gid=5) securityfs on /sys/kernel/security type securityfs (rw) automount(pid4176) on /export type autofs (rw,fd=4,pgrp=4176,minproto=2,maxproto=4) nfsd on /proc/fs/nfsd type nfsd (rw) rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
Now type: ls /mntpoint/exportedDir
After executing the above command, type: df -hFilesystem 1K-blocks Used Available Use% Mounted on /dev/hda1 31462264 2810852 28651412 9% / udev 518296 88 518208 1% /dev Nfsserver:/ 69575360 2268576 67306784 4% /export/export
5.2 Using /etc/fstab to mount NFSv4 exported volume
The NFS exported volume can also be mounted on the client just by making an entry in the /etc/fstab file. If your NFS server name is NFSserver and the mount point on the client is /mnt point then the entry in the fstab should look like something below.
The following entry is made in /etc/fstab/dev/sda1 / reiserfs defaults 1 1 /dev/sda2 swap swap defaults 0 0 proc /proc proc defaults 0 0 sysfs /sys sysfs noauto 0 0 usbfs /proc/bus/usb usbfs noauto 0 0 devpts /dev/pts devpts mode=0620,gid=5 0 0 /dev/fd0 /media/floppy auto noauto,user,sync 0 0 NFSserver:/ /mnt point nfs4 rw,user,noauto 0 0
After making this entry in the /etc/fstab file, at the command prompt of the client just give the command: mount /mnt point
Some Useful commands on NFS Server and Clients:
To check the NFS threads type: rpcinfo on the server to check the server threads.$bb: rpcinfo -p program vers proto port 100000 2 tcp 111 portmapper 100000 2 udp 111 portmapper 100024 1 udp 32770 status 100021 1 udp 32770 nlockmgr 100021 3 udp 32770 nlockmgr 100021 4 udp 32770 nlockmgr 100024 1 tcp 57017 status 100021 1 tcp 57017 nlockmgr 100021 3 tcp 57017 nlockmgr 100021 4 tcp 57017 nlockmgr 1073741824 1 tcp 33805 100003 2 udp 2049 nfs 100003 3 udp 2049 nfs 100003 4 udp 2049 nfs 100003 2 tcp 2049 nfs 100003 3 tcp 2049 nfs 100003 4 tcp 2049 nfs 100005 1 udp 975 mountd 100005 1 tcp 976 mountd 100005 2 udp 975 mountd 100005 2 tcp 976 mountd 100005 3 udp 975 mountd 100005 3 tcp 976 mountd
To check the mount points on the client, type: mount.$bb: mount /dev/hda3 on / type reiserfs (rw,acl,user_xattr) proc on /proc type proc (rw) sysfs on /sys type sysfs (rw) udev on /dev type tmpfs (rw) devpts on /dev/pts type devpts (rw,mode=0620,gid=5) nfsd on /proc/fs/nfsd type nfsd (rw) rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw) xxx.xxx.xxx.xxx:/ on /mnt type nfs4 (rw,addr=xxx.xxx.xxx.xxx)
Here xxx.xxx.xxx.xxx is the ip address of the NFS server.
You can check the highlighted line to find out which version of NFS mount is done.
Kernel NFS debugging can be enabled through /proc file system. All the debug messages will be logged in /var/log/messages.echo "65535" > /proc/sys/sunrpc/nfsd_debug (debugging server) echo "65535" > /proc/sys/sunrpc/nfs_debug (debugging client) echo "65535" > /proc/sys/sunrpc/rpc_debug (RPC)
Note: Things tend to slow down in a production system if you enable all debugging. Make sure you revert it after getting the debug output.
General Information and References for the NFSv4 protocol
- RFC1094 - NFS version 2
- RFC1813 - NFS version 3
- RFC2054 - WebNFS Client Specification
- RFC2054 - WebNFS Server Specification
- RFC2224 - NFS URL Scheme
- RFC2624 - NFS Version 4 Design Considerations
- RFC2224 - Security Negotiation for WebNFS
Standards Track RFCs of Interest:
- RFC1831 - RPC: Remote Procedure Call Protocol Specification Version 2
- RFC1832 - XDR: External Data Representation Standard
- RFC1964 - The Kerberos Version 5 GSS-API Mechanism
- RFC2025 - The Simple Public-Key GSS-API Mechanism (SPKM)
- RFC2203 - RPCSEC_GSS Protocol Specification
- RFC2581 - TCP Congestion Control
- RFC2623 - NFS Version 2 and Version 3 Security Issues and the NFS Protocol's Use of RPCSEC_GSS and Kerberos V5
- RFC2743 - Generic Security Service Application Program Interface, Version 2, Update 1
- RFC2847 - LIPKEY - A Low Infrastructure Public Key Mechanism Using SPKM
- RFC3010 - NFS version 4 Protocol (Obsoleted by RFC3530)
- RFC3530 - NFS version 4 Protocol
26 July 2006 - 11:00pm
Submitted by: bpraveen1
In order to use NFS you need to run portmap service and rpc.statd and rpc.lockd daemons. Use following commands to start
these services (RedHat/Fedora Linux):
# chkconfig portmap on
# chkconfig nfslock on
# /etc/init.d/portmap start
# /etc/init.d/nfslock start
Assuming that NAS is configured properly you need to type following command to access NAS (please refer our
# mkdir /backup
# mount -o tcp 184.108.40.206:/mountpoint /backup Linux supports UDP by default and TCP as an option. TCP
may improve performance in some cases (as a side effect it may increase the CPU load on the local server). If you want to
use UDP just type following command:
# mount 220.127.116.11:/mount/point /backup
You can also mount NFS share by editing /etc/fstab file:
# vi /etc/fstab
Append following line:
18.104.22.168:/mountpoint /backup nfs defaults 0 0
Save the file and exit to shell prompt.
Try to pass following values to mount command improve NFS performance:
# mount -t nfs -o nocto, rsize=32768,wsize=32768
There are few more options supported to tweak NFS please consult man page of nfs.
Quick Server Setup Guide
Quick Client Setup Guide
Frequently Asked Questions:
The Questions and Answers section is divided into categories as follows:
Servers indicate whether the requested data is permanently stored by setting a corresponding field in the response to each NFS write operation. A server can respond to an UNSTABLE write request with an UNSTABLE reply or a FILE_SYNC reply, depending on whether or not the requested data resides on permanent storage yet. An NFS protocol-compliant server must respond to a FILE_SYNC request only with a FILE_SYNC reply.
Clients ensure that data that was written using a safe asynchronous write has been written onto permanent storage using a new operation available in Version 3 called a COMMIT. Servers do not send a response to a COMMIT operation until all data specified in the request has been written to permanent storage. NFS Version 3 clients must protect buffered data that has been written using a safe asynchronous write but not yet committed. If a server reboots before a client has sent an appropriate COMMIT, the server can reply to the eventual COMMIT request in a way that forces the client to resend the original write operation. Version 3 clients use COMMIT operations when flushing safe asynchronous writes to the server during a close(2) or fsync(2) system call, or when encountering memory pressure.
For more information on the NFS Version 3 protocol, read RFC 1813.
Server support for TCP appears in 2.4.19 and later 2.4 kernels, and in 2.6 and later kernels. Not all 2.4-based distributions support NFS over TCP in the Linux NFS server.
A Linux implementation of NFS Version 4 is under development at the University of Michigan's Center for Information Technology Integration under the direction of Andy Adamson. This version is available now in the Linux 2.6 kernel. Although this is a reference implementation of an NFS Version 4 client and server, one of two such implementations required as part of the IETF's standards process, it is still missing some features. These features are currently under development and should appear soon. For more information, visit CITI U-M's NFSv4 project web site.
mount: RPC: Unable to receive; errno = Connection refused
You will also subsequently get the following (non-fatal) warning when you unmount any nfs mounted file system at all, regardless of when it was mounted:
Bad UMNT RPC: RPC: Program/version mismatch; low version = 3, high version = 3
Support for NFS security mechanisms using RPCSEC GSSAPI is now under development in Linux, based on work that is already in the 2.6 kernel. When completed, RPCSEC GSSAPI will work with all versions of the NFS protocol. In addition to the three flavors of Kerberos security (authentication, integrity checking, and full privacy), RPCSEC GSSAPI will eventually support other security flavors such as SPKM3, and will be fully compatible with other implementations such as the one in Solaris.
Besides kernel support for RPCSEC GSSAPI, additional support is required in the form of various user-level changes (the mount command, and a pair of rpcgss daemons, for example). Currently, only Fedora Core 2 has RPCSEC GSSAPI enabled in its kernels and user-level support integrated into its standard distribution. We expect that, as this work matures, it will be adopted by all 2.6-based distributions.
Currently Fedora Core 2 supports only the use of Kerberos 5 authentication with NFS Version 4. Because of bugs and missing features, for now support for Linux NFS with Kerberos is appropriate only for early adopters, and not for production use.
For more information on RPCSEC GSS, read RFC 2203. Information on the Linux implementation of RPCSEC GSSAPI is available here.
For more information on the NFS Version 4 protocol, read RFC 3530.
Early versions of the Linux NFS Version 4 prototype used two separate clients: the original client that supported NFS Versions 2 and 3, and a new separate client that supported only NFS Version 4. For various reasons this prevented the ability to mount NFS Version 4 servers at the same time as NFS Version 2 and 3 servers were mounted. This was an implementation choice, not a protocol limitation. This is no longer the case: the Linux 2.5 NFS client, and all future versions of the Linux NFS client, support all three versions seamlessly, and can concurrently mount servers that export version 2, version 3, and version 4.
The goal is that NFS Version 4 will coexist with versions 2 and 3 in much the same way as NFS Version 3 coexists with NFS Version 2 today. Upgrading should be nearly transparent.
There are some minor interoperability issues when applications running on clients make use of some of the new features of NFS Version 4 such as mandatory locking, share reservations, and delegations. These features help make NFS Version 4 more compatible with traditional Windows file systems like CIFS. Network Appliance, who makes file servers that can export file systems via both CIFS and NFS concurrently, has published papers describing some of these issues. See:
So, when an application opens a file stored in NFS, the NFS client checks that it still exists on the server, and is permitted to the opener, by sending a GETATTR or ACCESS operation. When the application closes the file, the NFS client writes back any pending changes to the file so that the next opener can view the changes. This also gives the NFS client an opportunity to report any server write errors to the application via the return code from close(). This behavior is referred to as close-to-open cache consistency.
Linux implements close-to-open cache consistency by comparing the results of a GETATTR operation done just after the file is closed to the results of a GETATTR operation done when the file is next opened. If the results are the same, the client will assume its data cache is still valid; otherwise, the cache is purged.
Close-to-open cache consistency was introduced to the Linux NFS client in 2.4.20. If for some reason you have applications that depend on the old behavior, you can disable close-to-open support by using the "nocto" mount option.
There are still opportunities for a client's data cache to contain stale data. The NFS version 3 protocol introduced "weak cache consistency" (also known as WCC) which provides a way of checking a file's attributes before and after an operation to allow a client to identify changes that could have been made by other clients. Unfortunately when a client is using many concurrent operations that update the same file at the same time, it is impossible to tell whether it was that client's updates or some other client's updates that changed the file.
For this reason, some versions of the Linux 2.6 NFS client abandon WCC checking entirely, and simply trust their own data cache. On these versions, the client can maintain a cache full of stale file data if a file is opened for write. In this case, using file locking is the best way to ensure that all clients see the latest version of a file's data.
A system administrator can try using the "noac" mount option to achieve attribute cache coherency among multiple clients. Almost every client operation checks file attribute information. Usually the client keeps this information cached for a period of time to reduce network and server load. When "noac" is in effect, a client's file attribute cache is disabled, so each operation that needs to check a file's attributes is forced to go back to the server. This permits a client to see changes to a file very quickly, at the cost of many extra network operations.
Be careful not to confuse "noac" with "no data caching." The "noac" mount option will keep file attributes up-to-date with the server, but there are still races that may result in data incoherency between client and server. If you need absolute cache coherency among clients, applications can use file locking, where a client purges file data when a file is locked, and flushes changes back to the server before unlocking a file; or applications can open their files with the O_DIRECT flag to disable data caching entirely.
For a better understanding of the compromises faced in the design of NFS caching, see Callaghan's "NFS Illustrated."
Most NFS clients, including the Linux NFS client in kernels newer than 2.4.20, support "close to open" cache consistency, which provides good performance and meets the sharing needs of most applications. This style of cache consistency does not provide strict coherence of the file size attribute among multiple clients, which would be necessary to ensure that append writes are always placed at the end of a file.
Read all about the NFS cache consistency model here.
Alternately, the NFS protocol could include a specific atomic append write operation, but today's versions of the protocol do not. The designers of the NFS protocol felt that atomic append writes would be rarely used, so they never added the feature. Even with such a feature, keeping the file size attribute up to date would be challenging.
ESTALE is an error reported by a server when a file handle is not valid. Here are some common reasons why a file handle is not valid:
A client can recover when it encounters an ESTALE error during a pathname resolution, but not during a READ or WRITE operation. An NFS client prevents data corruption by notifying applications immediately when a file has been replaced during a read or write request. After all, it is usually catastrophic if an application writes to or reads from the wrong file.
Thus in general, to recover from an ESTALE error, an application must close the file or directory where the error occurred, and reopen it so the NFS client can resolve the pathname again and retrieve the new file handle.
Older Linux NFS clients do not recover from an ESTALE error, even during pathname resolution. In the 2.6.12 kernel and later, the Linux VFS layer can redrive pathname resolution when an ESTALE is encountered to recover appropriately.
Note that this limitation becomes especially significant for hardware that supports larger pages. For instance, many distributors provide a Linux kernel built for Itanium processors that uses 16KB pages rather than 4KB pages normally found on 32-bit x86 systems. On such a system, if wsize is smaller than 16KB, the client always sends write operations serially, if they occur in the same page.
Finally, note that the maximum transfer size permitted by the Linux server (NFSSVC_MAXBLKSIZE) is set to 32KB when applying all patches involved with the implementation of NFS over TCP in the 2.4 kernels. The latest 2.4 kernels have TCP support integrated, and allow transfer sizes up to 32KB.
For the Linux NFS client, however, the problem is somewhat worse because it is an anonymous file system. Local disk-based file systems have a block device associated with them, but anonymous file systems do not. /proc, for example, is an anonymous file system, and so are other network file systems like AFS. All anonymous file systems share the same major number, so there can be a maximum of only 255 anonymous file systems mounted on a single host.
Usually you won't need more than ten or twenty total NFS mounts on any given client. In some large enterprises, though, your work and users might be spread across hundreds of NFS file servers. To work around the limitation on the number of NFS file systems you can mount on a single host, we recommend that you set up and run one of the automounter daemons for Linux. An automounter finds and mounts file systems as they are needed, and unmounts any that it finds are inactive. You can find more information on Linux automounters here.
You may also run into a limit on the number of privileged network ports on your system. The NFS client uses a unique socket with its own port number for each NFS mount point. Using an automounter helps address the limited number of available ports by automatically unmounting file systems that are not in use, thus freeing their network ports. NFS version 4 support in the Linux NFS client uses a single socket per client-server pair, which also helps increase the allowable number of NFS mount points on a client.
When set to "sync," Linux server behavior strictly conforms to the NFS protocol. This is default behavior in most other server implementations. When set to "async," the Linux server replies to NFS clients before flushing data or metadata modifying operations to permanent storage, thus improving performance, but breaking all guarantees about server reboot recovery.
Since Linux 2.4, the NFS Version 3 server recognizes the "async" export option. When this option is set, the server replies to clients before data has been written to permanent storage. The server also sends a FILE_SYNC response to the client, indicating that the client need not retain buffered data or send a subsequent COMMIT operation. This exposes the client to the same undetectable corruption as exists for NFS Version 2 (with "async") if the server crashes before it has actually written data to stable storage. (See question B6 for further discussion of this behavior and its consequences.) Note that even if a client sends a Version 3 COMMIT operation, the server replies immediately if the file system has been exported with the "async" option.
Conversely, when the "sync" export option is used on a Linux 2.4 server, both Version 2 and Version 3 writes behave as required by the NFS protocol specification. In this case, NFS Version 3 has a performance advantage over NFS Version 2, while maintaining data resilience during a server crash.
Note well that "[a]sync" also affects some metadata operations on the server.
In the Linux implementation of NFS Version 2, when the "async" export option is in effect, a Linux NFS server may crash before posting all NFS write requests to disk. A Version 2 client, however, always assumes data is permanently written to stable storage, and that it is safe to discard buffers containing the written data.
After a server crash, the Version 2 client cannot know that unwritten data is lost; this is why Version 2 writes are supposed to be permanent before the server replies. Even if a client still has the modified data in its cache, the data on the server no longer matches what is cached on the client (since some or all of the writes did not complete before the server crashed). This may cause applications to make future decisions based on data cached by the client rather than what is on the server, thus further corrupting the file.
For the Linux implementation of NFS Version 3, using the "async" export option to allow faster writes is no longer necessary. NFS Version 3 explicitly allows a server to reply before writing data to disk, under controlled circumstances. It allows clients and servers to communicate about the disposition of written data so that in the event of a server reboot, a Version 3 client can detect the reboot and resend the data.
In summary, be sure all exports on your Linux NFS servers use the "sync" option by setting it explicitly or by upgrading your nfs-utils package to version 1.0.1 or later. If you need fast writes, be sure your clients mount using NFS Version 3. You may also improve write performance by adding the "wdelay" option to your exports.
Two ways of mitigating this effect are to:
This limit has been removed in 2.6 and later kernels.
When a client mounts a file server, the file server advertises the largest number of bytes it can read or write in a single operation. Clients always use the smaller of the server's maximum and the value specified by the rsize and wsize values specified by the client in the mount command.
Large values of rsize and wsize may inhibit performance when using UDP. UDP datagrams must be separated into fragments that fit within your network's Maximum Transfer Unit. The loss of any of these fragments requires retransmission of the whole datagram. This may have a particularly adverse impact on client performance if your network is congested. TCP is considerably better at recovering one or two lost segments and managing network congestion, so larger I/O operations are usually more effective at reliably boosting performance when using NFS over TCP.
The Linux NFS client uses synchronous writes under many circumstances, some of which are obvious, and some of which you may not expect. Applications enable synchronous writes for a single file by opening a file with the O_SYNC or O_DSYNC flags. System administrators enable synchronous writes for all files in a local file system by mounting that file system with the "sync" option. The "noac" mount option also enables synchronous writes. If it didn't, applications running on other clients would have a difficult time retrieving file modifications if a client delayed writes.
Currently the Linux NFS client has a limitation which prevents it from safely generating large synchronous writes. The client breaks large write requests into on-the-wire write operations that are no larger than a single page to guarantee that write requests arrive on the server's disk in byte order (some applications depend on this behavior). Even if you set wsize larger than a page, the client will break any application write request into page-sized NFS write operations to meet this guarantee.
In addition, if the server's page size is larger than the client's page size, the server is forced to do additional work when the client writes in small chunks. NFS clients normally align reads and writes to their own page size, which then may be unaligned on the server if it uses larger pages. Depending on the server OS and filesystem, this could result in a number of performance limiting problems.
The Linux IP layer transmits each fragment as it is breaking up a UDP datagram, encoding enough information in each fragment so that the receiving end can reassemble the individual fragments into the original UDP datagram. If something happens that prevents a client from continuing to fragment a packet (e.g., the output socket buffer space in the IP layer is exceeded), the IP layer stops sending fragments. In this case, the receiving end has a set of fragments that is incomplete, and after a certain time window, it will drop the fragments if it does not receive enough to assemble a complete datagram. When this occurs, the UDP datagram is lost. Clients detect this loss when they have not received a reply from the server after a certain time interval, and recover by retransmitting the datagram.
Under heavy write loads, the Linux NFS client can generate many large UDP datagrams. This can quickly exhaust output socket buffer space on the client. If this occurs many times in a short time, the client sends the server a large number of fragments, but almost never gets a whole datagram's worth of fragments to the server. This fills the server's IP reassembly queue, causing it to become unreachable via UDP until it expels the useless fragments from the queue.
Note that the same thing can occur on servers that are under a heavy read load. If the server's output socket buffers are too small, large reads will cause them to overflow during IP fragmentation. The client's IP reassembly queue then fills with worthless fragments, and little UDP traffic can get to the client.
Here are some symptoms of this problem:
The fix is to make the Linux's IP fragmentation logic continue fragmenting a datagram even when output socket buffer space is over its limit. This fix appears in kernels newer than 2.4.20. You can work around this problem in one of several ways:
Unfortunately, an NFS client has no way to determine that a server is squashing root. Thus the Linux client uses NFS Version 3 ACCESS operations when an application is running on a client as root. If an application runs as a normal user, a client uses it's own authentication checking, and doesn't bother to contact the server.
The Linux NFS client should cache the results of these ACCESS operations. In fact, in the new 2.6.x kernels, it does this and it extends ACCESS checking to all users to allow for generic uid/gid mapping on the server. This also enables proper support for Access Control Lists in the server's local file system. In pre-2.6 kernels, the stock NFS client does not cache the results of ACCESS operations.
Note that when a mount request arrives, mountd check .../etab to see if that host is allowed access. If it is, an entry is placed in .../rmtab and the filesystem is exported thus creating an entry in /proc/fs/nfs/exports.
When you run "exportfs -io <options> host:/dir then the entry in ../etab is changed, or a new one is added. If it is a subnet/wildcard/netgroup entry, then every line in ../rmtab is checked to see if it matches. When a match is found, a host-specific entry is given to (or changed in) the kernel. When you run "exportfs -a" it makes sure that all entries in /etc/exports are properly reflected in ../etab. Any extra entries in etab are left alone. Once the correct content of etab has been determined, rmtab is examine to create a list of specific-host entries for any new entries in etab. This host-specific entries are given to the kernel.
When you run "exportfs -r" it ignores the prior contents of ../etab and initializes etab to the contents of /etc/exportfs. Then it inspects rmtab and make an changes to /proc/fs/nfs/export that are necessary.
The first will grant hostname read and write access to /export/dir without squashing root privileges. The second will grant hostname read and write privileges with root squash, and it will grant everyone else read and write access, without squashing root privileges.
The most serious problem is that the FAT filesystem layout does not contain enough information to create a lasting identity needed for NFS to create persistent filehandles. For example, if you take a file, rename it to another directory, trunctate it, and write new data to it, there is nothing stored in the filesystem that can be used to show that the resulting file is, in any sense, the "same" as the original file, and there is no way to find the new file given any details about the original file. Therefore, the Linux NFS server cannot guarantee that once you have opened a file, you can continue to have access to that file, if the file is modified in the ways given above. NFS may then be unable to locate or identify the file correctly, and so may return ESTALE errors.
If you export a directory and one of its ancestors, and both reside on the same physical file system on the server, it can result in random client behavior when mounting.
These local file systems may work or may have a few minor-ish issues: iso9660, ntfs, reiser4, udf. Ask on the NFS mailing list for details.
Any file system based on FAT or not having the ability to provide permanent inode numbers will have trouble with NFS versions 2 and 3 (see question C4).
Local file systems that are known not to work with the Linux NFS server are: procfs, sysfs, tmpfs (and friends).
To perform this check, the server includes information about the parent directory of each file in NFS file handles that are handed out to NFS clients. If the file is renamed to a different directory, for example, this changes the file handle, even though the file itself is still the same file. This breaks NFS protocol-compliance, often causing misbehavior on clients such as ESTALE errors, inappropriate access to renamed or deleted files, broken hard links, and so on.
In the opinion of many, subtree checking causes much more trouble than it saves, and should be avoided in most cases. The subtree_check option is necessary only when you want to prevent a file handle guessing attack from gaining access to files that fall outside the exported part of your server's local file systems. If you need to be certain that noone can access files outside the exported part of a local file system, set up the partitions on your server so that you only export whole file systems.
Because of the design of the NFS protocol, there is no way for a file to be deleted from the name space but still remain in use by an application. Thus NFS clients have to emulate this using what already exists in the protocol. If an open file is unlinked, an NFS client renames it to a special name that looks like ".nfsXXXXX". This "hides" the file while it remains in use. This is known as a "silly rename." Note that NFS servers have nothing to do with this behavior.
After all applications on a client have closed the silly-renamed file, the client automatically finishes the unlink by deleting the file on the server. Generally this is effective, but if the client crashes before the file is removed, it will leave the .nfsXXXXX file. If you are sure that the applications using these files are no longer running, it is safe to delete these files manually.
The NFS version 4 protocol is stateful, and could actually support delete-on-last-close. Unfortunately there isn't an easy way to do this and remain backwards-compatible with version 2 and 3 accessors.
kernel: nfs: server server.domain.name not responding, still trying kernel: nfs: task 10754 can't get a request slot kernel: nfs: server server.domain.name OK
A. The "can't get a request slot" message means that the client-side RPC code has detected a lot of timeouts
(perhaps due to network congestion, perhaps due to an overloaded server), and is throttling back the number of concurrent
outstanding requests in an attempt to lighten the load. Some possible causes:
There have been some suggestioned solutions, but none have been implemented. One is to set up a special class of semaphores which are killable with 'SIGKILL', but replacing the relevant semaphores in the VFS and VM layers will not be possible before the 2.7 kernels the earliest. Another solution under consideration is to cause rpciod to awaken all waiting requests when a user requests an unmount, allowing them to exit with an error.
Until these are implemented, you can work around this problem by killing all processes waiting for I/O to complete in a given file system:
Another, less desirable, workaround is to use "soft" mounts. This will cause processes to stop retrying I/O after a time. Eventually processes become unstuck and your file system can be unmounted. However, soft mounts are not completely safe. See question E4 for a description of the risks of using "soft" mounts.
There are several common problems that can prevent rpc.statd from working. First, be sure that your client has the appropriate startup script enabled (/etc/rc.d/init.d/nfslock on Red Hat distributions). Next, make certain that when rpc.statd starts up, the network is already available for it to work (some DHCP-configured hosts may have a problem with this, for example).
Make sure that the client's nodename (uname -n) is the same as what is returned by gethostbyname(3) on your client. These can differ because of your nsswitch configuration, the contents of /etc/hosts, because your client is configured via DHCP, or because of DNS misconfiguration. The in-kernel lockd process uses a client's nodename to identify its locks when sending lock requests. Rpc.statd must send an identical string when it sends a recovery notification, otherwise the server has no way to match the notification to any locks it may still hold for the client.
It is also recommended that the nodenames for your NFS clients be fully qualified domain names, not just a hostname. If another client in a different domain with the same hostname contacts your server, a fully qualified nodename on both clients will allow the server to distinguish between locks set on each client.
When traversing a firewall between your clients and server, bi-directional RPC traffic must be allowed if you need lock recovery to work, as NLM is callback-based. Two important issues that may prevent the server from calling the client are:
Although some implementations of munmap(2) happen to write dirty pages to local file systems, the NFS version of munmap(2) does not. An msync(2) call is always required to guarantee that dirty mapped data is written to permanent storage. A subtle ramification of the Linux NFS client's treatment of munmap(2) is that does not consider munmap(2) to be a close operation for the purposes of close-to-open cache coherency.
The distinction between the MS_SYNC and MS_ASYNC flags is also important. MS_ASYNC will force dirty mapped pages to permanent storage eventually. Only MS_SYNC guarantees that the pages are written before msync(2) returns to your application. Therefore applications should use msync(MS_SYNC) to serialize data writes to mapped files.
Finally, the Linux NFS client may not flush dirty mapped pages when a file descriptor is closed via close(2). Oftentimes during close processing, the client may flush mapped pages along with pages dirtied by a write(2) call, but this behavior is not guaranteed. Many applications will open a file, map it, then close it and continue using the map. The behavior described above is an attempt to optimize the performance of this use case.
Copying over executables creates a window during which an NFS client's cache may hold parts of the old version and parts of the new version, all combined in the same file. The correct way to update executables and shared libraries on your NFS shares is to use the install program with the '-b' option. That renames the version of the executable that is in use, then creates a brand new file to contain the new version of the executable.
Here are some ways to serialize access to an NFS file.
It's worth noting that until early 2.6 kernels, O_EXCL creates were not atomic on Linux NFS clients. Don't use O_EXCL creates and expect atomic behavior among multiple NFS client unless you are running a kernel newer than 2.6.5.
It's a known issue that Perl uses flock()/BSD locking by default. This can break programs ported from other operating systems, such as Solaris, that expect flock/BSD locks to work like POSIX locks.
On Linux, using file locking instead of a hard link has the added benefit of checkpointing the client's cache with the server. When a file lock is acquired, the client will flush the page cache for that file so that any subsequent reads get new data from the server. When a file lock is released, any changes to the file on that client are flushed back to the server before the lock is released so that other clients waiting to lock that file can see the changes.
The NFS client in 2.6.12 provides support for flock()/BSD locks on NFS files by emulating the BSD-style locks in terms of POSIX byte range locks. Other NFS clients that use the same emulation mechanism, or that use fcntl()/POSIX locks, will then see the same locks that the Linux NFS client sees.
On local Linux filesystems, POSIX locks and BSD locks are invisible to one another. Thus, due to this emulation, applications running on a Linux NFS server will still see files locked by NFS clients as being locked with a fcntl()/POSIX lock, whether the application on the client is using a BSD-style or a POSIX-style lock. If the server application uses flock()BSD locks, it will not see the locks the NFS clients use.
Note that the mount command may update the contents of /etc/mtab whether or not the actual mount settings have changed in the kernel. So when you try mount -oremount with an NFS-specific mount option, subsequent mount commands may report that the setting is in effect. This is only because the mount command is reading /etc/mtab. The /proc/mounts file reflects the true mount options that the kernel is using.
Some programs upon receiving an I/O error will just try more I/O, making them unkillable again. For this reason, try killing all processes on the stuck mount first first, and then run "umount -f". When the I/O requests fail, the process will become killable, will see the signal, and will die. Sometimes it can at a couple of interations of the "kill processes" then "umount -f" cycle until the filesystem is unmounted, but it usually works.
If all else fails, you can still unmount the partition on which the processes are hanging using the "umount -l" command. This causes the stuck mount to become detached from the file system name space hierarchy on your client, and will thus no longer be visible to other processes. You can replace that mount point with another mount to the same server when it becomes available again, or to some other server if the remote data has moved. Note, though, that the old mount point will continue to consume client memory until the stuck processes have all died.
The export option no_auth_nlm is designed to alleviate this problem. Set it on any shares you wish to export to these clients. This will disable the authorization check on file lock requests.
A workaround to this problem is to use NFS Version 2. On the IRIX client, use vers=2 in your mount options.
Note that NFS over UDP now uses a retransmit timeout estimation algorithm in the latest 2.4 and 2.6 kernels, which means the timeo= mount option is less effective at preventing data corruption due to a soft timeout.
Groupthink : Understanding Micromanagers and Control Freaks : Toxic Managers : Bureaucracies : Harvard Mafia : Diplomatic Communication : Surviving a Bad Performance Review : Insufficient Retirement Funds as Immanent Problem of Neoliberal Regime : PseudoScience : Who Rules America : Two Party System as Polyarchy : Neoliberalism : The Iron Law of Oligarchy : Libertarian Philosophy
Skeptical Finance : John Kenneth Galbraith : Keynes : George Carlin : Skeptics : Propaganda : SE quotes : Language Design and Programming Quotes : Random IT-related quotes : Oscar Wilde : Talleyrand : Somerset Maugham : War and Peace : Marcus Aurelius : Eric Hoffer : Kurt Vonnegut : Otto Von Bismarck : Winston Churchill : Napoleon Bonaparte : Ambrose Bierce : Oscar Wilde : Bernard Shaw : Mark Twain Quotes
Vol 26, No.1 (January, 2013) Object-Oriented Cult : Vol 25, No.12 (December, 2013) Rational Fools vs. Efficient Crooks: The efficient markets hypothesis : Vol 25, No.08 (August, 2013) Cloud providers as intelligence collection hubs : Vol 23, No.10 (October, 2011) An observation about corporate security departments : Vol 23, No.11 (November, 2011) Softpanorama classification of sysadmin horror stories : Vol 25, No.05 (May, 2013) Corporate bullshit as a communication method : Vol 25, No.10 (October, 2013) Cryptolocker Trojan (Win32/Crilock.A) : Vol 25, No.06 (June, 2013) A Note on the Relationship of Brooks Law and Conway Law
Fifty glorious years (1950-2000): the triumph of the US computer engineering : Donald Knuth : TAoCP and its Influence of Computer Science : Richard Stallman : Linus Torvalds : Larry Wall : John K. Ousterhout : CTSS : Multix OS Unix History : Unix shell history : VI editor : History of pipes concept : Solaris : MS DOS : Programming Languages History : PL/1 : Simula 67 : C : History of GCC development : Scripting Languages : Perl history : OS History : Mail : DNS : SSH : CPU Instruction Sets : SPARC systems 1987-2006 : Norton Commander : Norton Utilities : Norton Ghost : Frontpage history : Malware Defense History : GNU Screen : OSS early history
The Peter Principle : Parkinson Law : 1984 : The Mythical Man-Month : How to Solve It by George Polya : The Art of Computer Programming : The Elements of Programming Style : The Unix Hater’s Handbook : The Jargon file : The True Believer : Programming Pearls : The Good Soldier Svejk : The Power Elite
Most popular humor pages:
Manifest of the Softpanorama IT Slacker Society : Ten Commandments of the IT Slackers Society : Computer Humor Collection : BSD Logo Story : The Cuckoo's Egg : IT Slang : C++ Humor : ARE YOU A BBS ADDICT? : The Perl Purity Test : Object oriented programmers of all nations : Financial Humor : Financial Humor Bulletin, 2008 : Financial Humor Bulletin, 2010 : The Most Comprehensive Collection of Editor-related Humor : Programming Language Humor : Goldman Sachs related humor : Greenspan humor : C Humor : Scripting Humor : Real Programmers Humor : Web Humor : GPL-related Humor : OFM Humor : Politically Incorrect Humor : IDS Humor : "Linux Sucks" Humor : Russian Musical Humor : Best Russian Programmer Humor : Microsoft plans to buy Catholic Church : Richard Stallman Related Humor : Admin Humor : Perl-related Humor : Linus Torvalds Related humor : PseudoScience Related Humor : Networking Humor : Shell Humor : Financial Humor Bulletin, 2011 : Financial Humor Bulletin, 2012 : Financial Humor Bulletin, 2013 : Java Humor : Software Engineering Humor : Sun Solaris Related Humor : Education Humor : IBM Humor : Assembler-related Humor : VIM Humor : Computer Viruses Humor : Bright tomorrow is rescheduled to a day after tomorrow : Classic Computer Humor
The Last but not Least
|You can use PayPal to make a contribution, supporting hosting of this site with different providers to distribute and speed up access. Currently there are two functional mirrors: softpanorama.info (the fastest) and softpanorama.net.|
The statements, views and opinions presented on this web page are those of the author and are not endorsed by, nor do they necessarily reflect, the opinions of the author present and former employers, SDNP or any other organization the author may be associated with. We do not warrant the correctness of the information provided or its fitness for any purpose.
Last modified: February, 19, 2014