Softpanorama
May the source be with you, but remember the KISS principle ;-)


Suse NFS service

Configuring a NFSv4 Server and Client on SUSE Linux Enterprise Server 10 Novell User Communities

This AppNote describes how to configure a NFSv4 Server and Client on a SLES 10 box.

Table of Contents

  1. Introduction
  2. Daemons for NFSv4
  3. About NFSv4 Daemons
  4. NFS Server Configuration
  5. Client side configuration
     
    5.1. Automount in NFSv4
    5.2. Using /etc/fstab to mount NFSv4 exported volume
  6. References

1. Introduction

NFS is a UNIX protocol for large-scale client/server file sharing. It is analogous to the Server Message Block (SMB) and Common Internet File System (CIFS) protocols on Microsoft Windows. The Network File System Version 4 is a distributed filesystem protocol which owes its heritage to NFSv2 and NFSv3. Unlike previous versions of NFS, the present version (NFSv4) supports traditional file access while integrating support for file locking and the mount protocol. NFSv4 adds many further features, such as support for strong security, compound operations, client caching and internationalization.

NFSv4 is the successor of NFSv3. It has been designed to work on a LAN or over the Internet.

NFSv4 comes with several new features:

NFSv4 uses 32 KBytes pages.

The NFSv3 and NFSv4 protocols are not compatible. An NFSv4 client cannot access an NFSv3 server, and vice versa. However, in order to simplify migrations from NFSv3 to NFSv4, both NFSv3 and NFSv4 services are launched by the same command: rpc.nfsd.

In the case of NFSv3 and NFSv4 clients simultaneously accessing the same server, one must be aware that two different file systems are used: there is no backward support to NFSv3 by the NFSv4 server.

In order to ensure a better reliability over the Internet, NFSv4 only uses TCP. To help NFS setup for internet use, one unique network port is used on NFSv4. This predetermined port is fixed. The default is port 2049.

 

2. Daemons for NFSv4:

                 client side   both sides                    server side
user commands:   mount                                       exportfs
user daemons:                  portmap, idmapd               nfsd
kernel parts:                  NFSv4, RPC, XDR, TCP, IPv4

The following daemons should be running on an NFSv4 server: nfsd, idmapd and, if Kerberos security is used, svcgssd.

The following daemons should be running on an NFSv4 client: idmapd and, if Kerberos security is used, gssd.

 

3. About NFSv4 Daemons

An NFSv4 client communicates with the corresponding NFSv4 server via Remote Procedure Calls (RPCs). The client sends a request and gets a reply from the server.

An NFSv4 server can only provide/export a single, hierarchical file system tree. If a server has to share more than one logical file system tree, the single trees are integrated in a new virtual root directory. This construction, called the pseudo file system, is the one which is provided/exported to clients.

rpc.mountd — This process receives mount requests from NFS clients and verifies the requested file system is currently exported. This process is started automatically by the nfs service and does not require user configuration. This is not used with NFSv4.

rpc.idmapd — rpc.idmapd is the NFSv4 ID <-> name mapping daemon. It provides functionality to the NFSv4 kernel client and server, to which it communicates via upcalls, by translating user and group IDs to names, and vice versa.

rpc.svcgssd — This process provides the server transport mechanism for the authentication process (Kerberos Version 5) with NFSv4. This service is required only when Kerberos security is used with NFSv4.

rpc.gssd — This process provides the client transport mechanism for the authentication process (Kerberos Version 5) with NFSv4. This service is required only when Kerberos security is used with NFSv4.

To start the NFS server issue the commands:

/etc/init.d/idmapd start
/etc/init.d/svcgssd start (only if kerberos support is enabled/required)
/etc/init.d/nfsserver start

On the NFS client type the following commands:

/etc/init.d/idmapd start
/etc/init.d/gssd start (only if kerberos support is enabled/required)  

To check the volumes exported by the server, type the following command, replacing NFSservername with the name of your server:

showmount -e NFSservername

 

4. NFS Server Configuration

This document explains how to configure and use NFSv4 on a SLES 10 box and covers the basic NFSv4 configuration and the automount facility using autofs. This setup is made on SUSE Linux 10.1.

To enable NFSv4 on the machine, check that /etc/sysconfig/nfs contains:

NFS4_SUPPORT="yes"

/etc/exports contains a list of all directories that are to be exported via NFS. Add an entry for each exported path together with its export options. The syntax is slightly different from NFSv3. Here is a sample entry:

 
      /nfs  *(rw,fsid=0,insecure,no_subtree_check,sync,no_root_squash)
      /nfs  gss/krb5(rw,fsid=0,insecure,no_subtree_check,sync,no_root_squash)
      /nfs  gss/krb5i(rw,fsid=0,insecure,no_subtree_check,sync,no_root_squash)
      /nfs  gss/krb5p(rw,fsid=0,insecure,no_subtree_check,sync,no_root_squash)

Note: Use a single-line entry for each security mode.

fsid - The value 0 has a special meaning when used with NFSv4. NFSv4 has a concept of a root of the overall exported filesystem (the pseudo file system). The export point exported with fsid=0 will be used as this root.

no_subtree_check - If a subdirectory of a filesystem is exported, but the whole filesystem isn't, then whenever an NFS request arrives, the server must check not only that the accessed file is in the appropriate filesystem (which is easy) but also that it is in the exported tree (which is harder). This check is called the subtree_check. This option disables subtree_check.

insecure - The insecure option in this entry allows clients with NFS implementations that don't use a reserved port for NFS.

      /nfs  *(rw,fsid=0,no_subtree_check,no_root_squash,sync)

In this entry, /nfs is the exported path and the parenthesized list holds the export options for NFSv4.

To export multiple volumes in NFSv4, follow the steps below:

If we want to export two directories, say /NFS1 and /NFS2, export /NFS1 as explained above. For /NFS2 we have to create a directory NFS2 in /NFS1.

  1. mkdir /NFS1/NFS2
  2. Bind the directory /NFS2 to /NFS1/NFS2. To do this, execute the following command:
    mount --bind /NFS2 /NFS1/NFS2
  3. Now configure /etc/exports with the sample entries shown below:
    /NFS1 *(rw,fsid=0,no_subtree_check,no_root_squash,sync)
    /NFS1/NFS2 *(rw,nohide,no_subtree_check,no_root_squash,sync)
    NOTE: notice the fsid=0 and nohide options in the above entries.
  4. Mount the server from the client
    mount -t nfs4 nfsserver:/ /mnt/
    You should be able to access the files under /NFS1 and files under /NFS1/NFS2.
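
The export entries for this layout can also be generated rather than typed by hand. The following is a minimal sketch, not part of the original AppNote: the function name nfs4_exports is hypothetical, the option strings are copied from the sample entries above, and the output is meant to be reviewed before it is placed in /etc/exports.

```shell
#!/bin/sh
# Hypothetical helper: print /etc/exports lines for an NFSv4 pseudo root.
# $1 is the pseudo root (exported with fsid=0); the remaining arguments are
# directories that have been bind-mounted below it (exported with nohide).
nfs4_exports() {
    root=$1
    shift
    common="no_subtree_check,no_root_squash,sync"
    printf '%s *(rw,fsid=0,%s)\n' "$root" "$common"
    for dir in "$@"; do
        # /NFS2 becomes /NFS1/NFS2, matching "mount --bind /NFS2 /NFS1/NFS2".
        printf '%s/%s *(rw,nohide,%s)\n' "$root" "$(basename "$dir")" "$common"
    done
}

# Example: the two-volume layout described in the steps.
nfs4_exports /NFS1 /NFS2
```

The sketch only prints text; creating the directories and running mount --bind still has to be done as root, as in steps 1 and 2.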

Checklist to ensure NFSv4 is up and running:

  1. ps -ef | grep nfsd; ps -ef | grep idmapd; ps -ef | grep svcgssd to check server side daemons
  2. ps -ef | grep idmapd; ps -ef | grep gssd to check client side daemons
  3. rpcinfo -p to check all registered RPC programs & versions
  4. Check whether the firewall is enabled on the server/client from YaST -> Security and Users -> Firewall. If it is, make sure the NFS services are allowed through it.
  5. showmount -e server to check mount information on NFS server
  6. If you are using NFSv4, make sure that one and only one path is exported with fsid=0. Refer to the description of the pseudo file system in section 3 for more information.
  7. If users are not mapped properly, check whether idmapd is running on both server and client and whether the DNS domain name is properly configured.
  8. If you encounter problems when you use Kerberos security mode, check whether the rpc.svcgssd (server) and rpc.gssd (client) daemons are running and the keytab file has been extracted.
  9. If you are unable to mount, check the exports file entry.
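
The daemon checks in items 1 and 2 can be scripted. Note that ps -ef | grep nfsd also matches the grep process itself, so the sketch below compares exact process names instead. It is an illustration only: the function name missing_daemons is hypothetical, and the process names (nfsd, rpc.idmapd, rpc.svcgssd) are assumed to be what ps reports on your system.

```shell
#!/bin/sh
# Hypothetical helper: read one process name per line on stdin and print
# every expected daemon that does not appear in the listing.
missing_daemons() {
    listing=$(cat)
    for name in "$@"; do
        # -x matches the whole line; -F treats the dot in "rpc.idmapd" literally.
        printf '%s\n' "$listing" | grep -qxF -- "$name" || echo "$name"
    done
}

# Usage on a server (drop rpc.svcgssd when Kerberos is not in use):
#   ps -e -o comm= | missing_daemons nfsd rpc.idmapd rpc.svcgssd
# Usage on a client:
#   ps -e -o comm= | missing_daemons rpc.idmapd rpc.gssd
```

No output means every listed daemon was found.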

 

5. Client Side Configuration

5.1 Automount in NFSv4

To automount a NFSv4 exported volume using Autofs, follow the steps below:

There are two files which are mainly responsible for automount to work using autofs. These two files fall under /etc directory.

  1. auto.master
  2. auto.misc or auto.home or auto.xxxxxx. Here xxxxxx can be any name.

Here are the contents of auto.master:

#
# $Id: auto.master,v 1.4 2005/01/04 14:36:54 raven Exp $
#
# Sample auto.master file
# This is an automounter map and it has the following format
# key [ -mount-options-separated-by-comma ] location
# For details of the format look at autofs(5).
#/misc  /etc/auto.misc --timeout=60
#/smb   /etc/auto.smb
#/misc  /etc/auto.misc
#/net   /etc/auto.net

/export /etc/auto.misc 

In the above file, auto.misc can also be auto.home or auto.xxxxxx. Whichever map file you name there should contain an entry like the export line shown below (I have used auto.misc).

#
# $Id: auto.misc,v 1.2 2003/09/29 08:22:35 raven Exp $
#
# This is an automounter map and it has the following format
# key [ -mount-options-separated-by-comma ] location
# Details may be found in the autofs(5) manpage

cd              -fstype=iso9660,ro,nosuid,nodev :/dev/cdrom

export          -fstype=nfs4,rw         NFSServer:/

# the following entries are samples to pique your imagination
#linux                 - ro,soft,intr          ftp.example.org:/pub/linux
#boot                  -fstype=ext2            :/dev/hda1
#floppy                -fstype=auto            :/dev/fd0
#floppy                -fstype=ext2            :/dev/fd0
#e2floppy              -fstype=ext2            :/dev/fd0
#jaz                   -fstype=ext2            :/dev/sdc1
#removable             -fstype=ext2            :/dev/hdd

After making these entries, restart autofs. Type the command:

/etc/init.d/autofs restart   or   service autofs restart

After this, you can check the status of autofs by issuing the command:

service autofs status

Now check for the mount by typing the command: mount. It shows something like this:

$ mount
/dev/hda1 on / type reiserfs (rw,acl,user_xattr)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
udev on /dev type tmpfs (rw)
devpts on /dev/pts type devpts (rw,mode=0620,gid=5)
securityfs on /sys/kernel/security type securityfs (rw)
automount(pid4176) on /export type autofs (rw,fd=4,pgrp=4176,minproto=2,maxproto=4)
nfsd on /proc/fs/nfsd type nfsd (rw)
rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw) 

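Now type: ls /mntpoint/exportedDir (with the maps shown above, that is ls /export/export). Accessing the directory triggers the automount.

After executing the above command, type: df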

Filesystem     1K-blocks    Used       Available    Use%    Mounted on
/dev/hda1      31462264     2810852    28651412     9%      /
udev           518296       88         518208       1%      /dev
Nfsserver:/    69575360     2268576    67306784     4%      /export/export

5.2 Using /etc/fstab to mount NFSv4 exported volume

The NFS exported volume can also be mounted on the client by making an entry in the /etc/fstab file. If your NFS server name is NFSserver and the mount point on the client is /mntpoint, the last line of the following /etc/fstab shows the entry to add:

/dev/sda1      /                  reiserfs      defaults             1  1
/dev/sda2      swap               swap          defaults             0  0
proc           /proc              proc          defaults             0  0
sysfs          /sys               sysfs         noauto               0  0
usbfs          /proc/bus/usb      usbfs         noauto               0  0
devpts         /dev/pts           devpts        mode=0620,gid=5      0  0
/dev/fd0       /media/floppy      auto          noauto,user,sync     0  0

NFSserver:/    /mntpoint          nfs4          rw,user,noauto       0  0

After making this entry in the /etc/fstab file, give the following command at the command prompt of the client: mount /mntpoint

Some useful commands on the NFS server and clients:

To list the RPC services registered on the server, including the NFS versions and transports it offers, type: rpcinfo -p

 $bb:  rpcinfo -p
   program      vers  proto port
    100000      2     tcp    111  portmapper
    100000      2     udp    111  portmapper
    100024      1     udp  32770  status
    100021      1     udp  32770  nlockmgr
    100021      3     udp  32770  nlockmgr
    100021      4     udp  32770  nlockmgr
    100024      1     tcp  57017  status
    100021      1     tcp  57017  nlockmgr
    100021      3     tcp  57017  nlockmgr
    100021      4     tcp  57017  nlockmgr
    1073741824  1     tcp  33805
    100003      2     udp   2049  nfs
    100003      3     udp   2049  nfs
    100003      4     udp   2049  nfs
    100003      2     tcp   2049  nfs
    100003      3     tcp   2049  nfs
    100003      4     tcp   2049  nfs
    100005      1     udp    975  mountd
    100005      1     tcp    976  mountd
    100005      2     udp    975  mountd
    100005      2     tcp    976  mountd
    100005      3     udp    975  mountd
    100005      3     tcp    976  mountd
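
To confirm from such a listing that NFS version 4 is offered over TCP, the relevant row can be picked out with awk. This is a small sketch under the assumption that the output has the four leading columns shown above (program, vers, proto, port); the function name nfs4_tcp_port is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read "rpcinfo -p" output on stdin and print the TCP
# port on which NFS (RPC program 100003) version 4 is registered, if any.
nfs4_tcp_port() {
    awk '$1 == 100003 && $2 == 4 && $3 == "tcp" { print $4; exit }'
}

# Usage: rpcinfo -p | nfs4_tcp_port
# Usage against a remote server: rpcinfo -p NFSserver | nfs4_tcp_port
```

An empty result means the server did not register NFSv4 over TCP with the portmapper.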

To check the mount points on the client, type: mount.

 $bb: mount
/dev/hda3 on / type reiserfs (rw,acl,user_xattr)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
udev on /dev type tmpfs (rw)
devpts on /dev/pts type devpts (rw,mode=0620,gid=5)
nfsd on /proc/fs/nfsd type nfsd (rw)
rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
xxx.xxx.xxx.xxx:/ on /mnt type nfs4 (rw,addr=xxx.xxx.xxx.xxx)

Here xxx.xxx.xxx.xxx is the ip address of the NFS server.

The last line of the output, of type nfs4, shows which version of NFS was used for the mount.
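
On a client with many mounts, the NFS entries can be filtered out of the mount output. The sketch below assumes the "device on mountpoint type fstype (options)" layout shown above and mount points without spaces; the function name nfs_mounts is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read "mount" output on stdin and print
# "fstype device mountpoint" for every nfs or nfs4 mount.
# The pattern deliberately excludes the nfsd pseudo file system.
nfs_mounts() {
    awk '$2 == "on" && $4 == "type" && $5 ~ /^nfs4?$/ { print $5, $1, $3 }'
}

# Usage: mount | nfs_mounts
```

The first field of each printed line tells whether the mount uses NFSv4 (nfs4) or an earlier version (nfs).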

Enable debugging

Kernel NFS debugging can be enabled through /proc file system. All the debug messages will be logged in /var/log/messages.

echo "65535"  > /proc/sys/sunrpc/nfsd_debug (debugging server)                                                 
echo "65535"   > /proc/sys/sunrpc/nfs_debug (debugging client)
echo "65535"   > /proc/sys/sunrpc/rpc_debug (RPC) 

Note: Things tend to slow down in a production system if you enable all debugging. Make sure you revert it after getting the debug output.
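
Because it is easy to forget to switch debugging off again, the three echo commands can be wrapped in a helper that sets or clears all flags at once. This is a minimal sketch: the function name nfs_debug_set and its optional directory argument are hypothetical, and writing 0 is assumed to be the way to clear the flags. It must be run as root to touch the real files.

```shell
#!/bin/sh
# Hypothetical helper: write one value to every NFS/RPC debug flag.
# $1 is the value (65535 enables all debugging, 0 is assumed to disable it).
# $2 is the directory holding the flags; it defaults to /proc/sys/sunrpc and
# can be pointed at a scratch directory for a dry run.
nfs_debug_set() {
    value=$1
    dir=${2:-/proc/sys/sunrpc}
    for name in nfsd_debug nfs_debug rpc_debug; do
        # Skip flags that are not present on this machine.
        [ -e "$dir/$name" ] || continue
        echo "$value" > "$dir/$name"
    done
}

# Usage (as root):
#   nfs_debug_set 65535    # enable, then reproduce the problem
#   nfs_debug_set 0        # revert once /var/log/messages has the output
```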

 

6. References

General Information and References for the NFSv4 protocol


Author Info

26 July 2006 - 11:00pm
Submitted by: bpraveen1

 

In order to use NFS you need to run the portmap service and the rpc.statd and rpc.lockd daemons. Use the following commands to start these services (RedHat/Fedora Linux):
# chkconfig portmap on
# chkconfig nfslock on
# /etc/init.d/portmap start
# /etc/init.d/nfslock start

Assuming that the NAS is configured properly, type the following commands to access it (please refer to our sample configuration diagram):
# mkdir /backup
# mount -o tcp 202.54.20.111:/mountpoint /backup

Linux supports UDP by default and TCP as an option. TCP may improve performance in some cases (as a side effect it may increase the CPU load on the local server). If you want to use UDP just type the following command:
# mount 202.54.20.111:/mountpoint /backup

You can also mount the NFS share by editing the /etc/fstab file:
# vi /etc/fstab

Append the following line:
202.54.20.111:/mountpoint /backup nfs defaults 0 0

Save the file and exit to shell prompt.

Try passing the following options to the mount command to improve NFS performance:
# mount -t nfs -o nocto,rsize=32768,wsize=32768 202.54.20.111:/mountpoint /backup

There are a few more options for tuning NFS; please consult the nfs man page.

Quick Overview

Quick Server Setup Guide

  1. Acquire and install a recent distribution of Linux.
  2. Set up your /etc/exports file (man exports for details).
  3. Consult your distribution's documentation to determine which /etc/init.d start-up script is used to start your server. Start NFS services by invoking this script as root, using the "start" parameter. Consider adding this script to the list of scripts that are automatically run at system start-up. (Red Hat uses the chkconfig command for this purpose).
  4. Read the NFS How-To for advice on tuning and securing your server.

 

Quick Client Setup Guide

  1. Acquire and install a recent distribution of Linux. To enable NLM lock recovery, ensure your client's host name, as returned by uname -n, matches the host name returned by DNS.
  2. The NLM protocol is handled by an in-kernel service in modern kernels, but the user-level rpc.statd program must be running to enable NLM lock recovery. Consult your distribution's documentation to determine which /etc/init.d start-up script is used to start it. Start the NSM daemon by invoking this script as root, using the "start" parameter. Consider adding this script to the list of scripts that are automatically run at system start-up. (Red Hat uses the chkconfig command for this purpose).
  3. Create the directories on your client where you will mount the NFS shares.
  4. Add entries in /etc/fstab corresponding to your mount points (man nfs for details).
  5. Use mount -a -t nfs to mount the NFS shares.
  6. During system boot-up, most distributions automatically mount NFS shares that are listed in /etc/fstab. If yours doesn't, check your distribution's documentation for instructions on how to configure your client to do this.

 

Frequently Asked Questions:

The Questions and Answers section is divided into categories as follows:

Section A: About the NFS protocol
Section B: Performance
Section C: Common export configuration errors
Section D: Commonly occurring error messages
Section E: Using Linux NFS with alternate platforms
A1. What are the primary differences between NFS Versions 2 and 3?
A. From the system point of view, the primary differences are these:

For more information on the NFS Version 3 protocol, read RFC 1813.

 

A2. Can I run NFS across the TCP/IP Transport Protocol?
A. Client support for NFS over TCP is integrated into all 2.4 and later kernels.

Server support for TCP appears in 2.4.19 and later 2.4 kernels, and in 2.6 and later kernels. Not all 2.4-based distributions support NFS over TCP in the Linux NFS server.

 

A3. Are there any other versions of NFS under development?
A. Yes. NFS Version 4 is being developed under the supervision of the Internet Engineering Task Force (IETF). The IETF hosts several documents that describe the NFS Version 4 working group's efforts to date. Several commercial vendors have already released NFS clients and servers that support the new version of NFS.

A Linux implementation of NFS Version 4 is under development at the University of Michigan's Center for Information Technology Integration under the direction of Andy Adamson. This version is available now in the Linux 2.6 kernel. Although this is a reference implementation of an NFS Version 4 client and server, one of two such implementations required as part of the IETF's standards process, it is still missing some features. These features are currently under development and should appear soon. For more information, visit CITI U-M's NFSv4 project web site.

 

A4. How can I prevent the use of NFS Version 2, or of other NFS versions?
A. The protocol version is determined at mount time, and can be modified by specifying the version of the NFS protocol, or the version of the transport protocol, supported by the server. For example, the client mount command

mount -o vers=3 foo:/ /bar

will request that the server use NFS Version 3 when granting a mount request. (Note that "vers" and "nfsvers" have the same meaning in the mount command; the string "vers" is compatible with NFS implementations on Solaris and other vendors.)

If you wish to prevent use of NFS Version 2 in all cases, then you must restart rpc.mountd on the server with the option "-N 1 -N 2". The best way to do this is to modify the rpc.mountd options in the NFS startup script on the server, and then shut down and restart NFS as a whole. After restarting rpc.mountd, you will get the following error when attempting to NFS mount a file system using NFS Version 2 (now unrecognized):

mount: RPC: Unable to receive; errno = Connection refused

You will also subsequently get the following (non-fatal) warning when you unmount any nfs mounted file system at all, regardless of when it was mounted:

Bad UMNT RPC: RPC: Program/version mismatch; low version = 3, high version = 3

A5. Can I use Kerberos authentication with NFS on Linux?
A. Sun defined a new interface called RPCSEC GSSAPI that creates the ability to use authentication plug-ins for protocols like NFS that ride on top of RPC. This is the standard way of providing Kerberos authentication support for NFS.

Support for NFS security mechanisms using RPCSEC GSSAPI is now under development in Linux, based on work that is already in the 2.6 kernel. When completed, RPCSEC GSSAPI will work with all versions of the NFS protocol. In addition to the three flavors of Kerberos security (authentication, integrity checking, and full privacy), RPCSEC GSSAPI will eventually support other security flavors such as SPKM3, and will be fully compatible with other implementations such as the one in Solaris.

Besides kernel support for RPCSEC GSSAPI, additional support is required in the form of various user-level changes (the mount command, and a pair of rpcgss daemons, for example). Currently, only Fedora Core 2 has RPCSEC GSSAPI enabled in its kernels and user-level support integrated into its standard distribution. We expect that, as this work matures, it will be adopted by all 2.6-based distributions.

Currently Fedora Core 2 supports only the use of Kerberos 5 authentication with NFS Version 4. Because of bugs and missing features, for now support for Linux NFS with Kerberos is appropriate only for early adopters, and not for production use.

For more information on RPCSEC GSS, read RFC 2203. Information on the Linux implementation of RPCSEC GSSAPI is available here.

A6. What are the main new features in version 4 of the NFS protocol?
A. Here is a short summary of new features. For a complete discussion of these features, see the documentation provided by the NFSv4 Working Group.

For more information on the NFS Version 4 protocol, read RFC 3530.

A7. I've heard NFS Version 4 is not interoperable with earlier versions of NFS. What's the real deal?
A. In the same way that an NFS Version 3-only client cannot communicate with an NFS Version 2-only server, an NFS Version 4-only client or server cannot communicate with clients and servers that only support earlier versions of NFS. NFS Version 4 uses a different version number in RPC headers to distinguish the new protocol version. Thus, clients that support only NFS Version 4 cannot communicate with servers that support only versions 2 and 3. True interoperability is achieved by implementing clients and servers that can communicate using all three protocol versions: NFS Versions 2, 3, and 4.

Early versions of the Linux NFS Version 4 prototype used two separate clients: the original client that supported NFS Versions 2 and 3, and a new separate client that supported only NFS Version 4. For various reasons this prevented the ability to mount NFS Version 4 servers at the same time as NFS Version 2 and 3 servers were mounted. This was an implementation choice, not a protocol limitation. This is no longer the case: the Linux 2.5 NFS client, and all future versions of the Linux NFS client, support all three versions seamlessly, and can concurrently mount servers that export version 2, version 3, and version 4.

The goal is that NFS Version 4 will coexist with versions 2 and 3 in much the same way as NFS Version 3 coexists with NFS Version 2 today. Upgrading should be nearly transparent.

There are some minor interoperability issues when applications running on clients make use of some of the new features of NFS Version 4 such as mandatory locking, share reservations, and delegations. These features help make NFS Version 4 more compatible with traditional Windows file systems like CIFS. Network Appliance, who makes file servers that can export file systems via both CIFS and NFS concurrently, has published papers describing some of these issues.

 

A8. What is close-to-open cache consistency?
A. Perfect cache coherency among disparate NFS clients is very expensive to achieve, so NFS settles for something weaker that satisfies the requirements of most everyday types of file sharing. Everyday file sharing is most often completely sequential: first client A opens a file, writes something to it, then closes it; then client B opens the same file, and reads the changes.

So, when an application opens a file stored in NFS, the NFS client checks that it still exists on the server, and is permitted to the opener, by sending a GETATTR or ACCESS operation. When the application closes the file, the NFS client writes back any pending changes to the file so that the next opener can view the changes. This also gives the NFS client an opportunity to report any server write errors to the application via the return code from close(). This behavior is referred to as close-to-open cache consistency.

Linux implements close-to-open cache consistency by comparing the results of a GETATTR operation done just after the file is closed to the results of a GETATTR operation done when the file is next opened. If the results are the same, the client will assume its data cache is still valid; otherwise, the cache is purged.

Close-to-open cache consistency was introduced to the Linux NFS client in 2.4.20. If for some reason you have applications that depend on the old behavior, you can disable close-to-open support by using the "nocto" mount option.

There are still opportunities for a client's data cache to contain stale data. The NFS version 3 protocol introduced "weak cache consistency" (also known as WCC) which provides a way of checking a file's attributes before and after an operation to allow a client to identify changes that could have been made by other clients. Unfortunately when a client is using many concurrent operations that update the same file at the same time, it is impossible to tell whether it was that client's updates or some other client's updates that changed the file.

For this reason, some versions of the Linux 2.6 NFS client abandon WCC checking entirely, and simply trust their own data cache. On these versions, the client can maintain a cache full of stale file data if a file is opened for write. In this case, using file locking is the best way to ensure that all clients see the latest version of a file's data.

A system administrator can try using the "noac" mount option to achieve attribute cache coherency among multiple clients. Almost every client operation checks file attribute information. Usually the client keeps this information cached for a period of time to reduce network and server load. When "noac" is in effect, a client's file attribute cache is disabled, so each operation that needs to check a file's attributes is forced to go back to the server. This permits a client to see changes to a file very quickly, at the cost of many extra network operations.

Be careful not to confuse "noac" with "no data caching." The "noac" mount option will keep file attributes up-to-date with the server, but there are still races that may result in data incoherency between client and server. If you need absolute cache coherency among clients, applications can use file locking, where a client purges file data when a file is locked, and flushes changes back to the server before unlocking a file; or applications can open their files with the O_DIRECT flag to disable data caching entirely.

For a better understanding of the compromises faced in the design of NFS caching, see Callaghan's "NFS Illustrated."

A9. Why does opening files with O_APPEND on multiple clients cause the files to become corrupted?
A. The NFS protocol does not support atomic append writes, so append writes are never atomic on NFS for any platform.

Most NFS clients, including the Linux NFS client in kernels newer than 2.4.20, support "close to open" cache consistency, which provides good performance and meets the sharing needs of most applications. This style of cache consistency does not provide strict coherence of the file size attribute among multiple clients, which would be necessary to ensure that append writes are always placed at the end of a file.

See question A8 for a description of the NFS cache consistency model.

Alternately, the NFS protocol could include a specific atomic append write operation, but today's versions of the protocol do not. The designers of the NFS protocol felt that atomic append writes would be rarely used, so they never added the feature. Even with such a feature, keeping the file size attribute up to date would be challenging.

 

A10. What does it mean when my application fails because of an ESTALE error?
A. The NFS protocol does not refer to files and directories by name or by path; it uses an opaque binary value called a file handle. In NFSv3 this file handle can be up to 64 bytes long; NFSv4 allows them to be even larger. A file's file handle is assigned by an NFS server, and is supposed to be unique on that server for the life of that file. Clients discover the value of a file's file handle by doing a LOOKUP operation, or by using part of the results of a READDIRPLUS operation. There is usually a special process done while mounting an NFS file system to determine the file handle of the file system's root directory.

ESTALE is an error reported by a server when a file handle is not valid. Here are some common reasons why a file handle is not valid:

  1. The file resides in an export that is not accessible. It could have been unexported, the export's access list may have changed, or the server could be up but simply not exporting its shares.
  2. The file handle refers to a deleted file. After a file is deleted on the server, clients don't find out until they try to access the file with a file handle they had cached from a previous LOOKUP. Using rsync or mv to replace a file while it is in use on another client is a common scenario that results in an ESTALE error.
  3. The file was renamed to another directory, and subtree checking is enabled on a share exported by a Linux NFS server. See question C7 for more details on subtree checking on Linux NFS servers.
  4. The device ID of the partition that holds your exported files has changed. File handles often contain all or part of a physical device ID, and that ID can change after a reboot, RAID-related changes, or a hardware hot-swap event on your server. Using the "fsid" export option on Linux will force the fsid of an exported partition to remain the same. See the "exports" man page for more details.
  5. The exported file system doesn't support permanent inode numbers. Exporting FAT file systems via NFS is problematic for this reason. This problem can be avoided by exporting only local filesystems which have good NFS support. See question C6 for more information.

A client can recover when it encounters an ESTALE error during a pathname resolution, but not during a READ or WRITE operation. An NFS client prevents data corruption by notifying applications immediately when a file has been replaced during a read or write request. After all, it is usually catastrophic if an application writes to or reads from the wrong file.

Thus in general, to recover from an ESTALE error, an application must close the file or directory where the error occurred, and reopen it so the NFS client can resolve the pathname again and retrieve the new file handle.

Older Linux NFS clients do not recover from an ESTALE error, even during pathname resolution. In the 2.6.12 kernel and later, the Linux VFS layer can redrive pathname resolution when an ESTALE is encountered to recover appropriately.

 

B1. What can I do to improve NFS performance in general?
A. Review the performance section of the NFS Howto doc and then look at several things:

 

B2. Everything seems so slow and I think the default rsize and wsize are set to 1024 - what's going on?
A. Normally, the Linux NFS client uses read-ahead and delayed writes to hide the latency of NFS read and write operations. However, the client can cache only a single read or write request per page. Thus, if reading or writing a whole page requires more than one on-the-wire read or write operation (which it certainly does if rsize or wsize is 1024), each of these operations must complete before the next one can be issued. In the case of small NFS Version 3 write operations, the write must be FILE_SYNC because the client must fully complete each write before it issues the next one.

Note that this limitation becomes especially significant for hardware that supports larger pages. For instance, many distributors provide a Linux kernel built for Itanium processors that uses 16KB pages rather than the 4KB pages normally found on 32-bit x86 systems. On such a system, if wsize is smaller than 16KB, the client always sends write operations serially if they occur in the same page.
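As a rough worked example of the arithmetic above, the number of on-the-wire operations needed to cover one page is the page size divided by rsize or wsize, rounded up. The helper name is made up for this sketch.

```python
import math


def wire_ops_per_page(page_size, transfer_size):
    """Number of on-the-wire operations needed to cover one page."""
    return math.ceil(page_size / transfer_size)


# Page sizes and transfer sizes taken from the discussion above.
for page_size, size in [(4096, 1024), (4096, 8192), (16384, 8192), (16384, 32768)]:
    ops = wire_ops_per_page(page_size, size)
    note = "serialized within the page" if ops > 1 else "one request per page"
    print(f"page={page_size:>5} rsize/wsize={size:>5} -> {ops} op(s), {note}")
```

Whenever the result is greater than one, the operations for that page have to complete one after another.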

Finally, note that the maximum transfer size permitted by the Linux server (NFSSVC_MAXBLKSIZE) is set to 32KB when applying all patches involved with the implementation of NFS over TCP in the 2.4 kernels. The latest 2.4 kernels have TCP support integrated, and allow transfer sizes up to 32KB.

B3. Why can't I mount more than 255 NFS file systems on my client? Why is it sometimes even less than 255?
A. On Linux, each mounted file system is assigned a major number, which indicates what file system type it is (e.g., ext3, nfs, isofs), and a minor number, which makes it unique among the file systems of the same type. In kernels prior to 2.6, Linux major and minor numbers have only 8 bits, so they may range numerically from zero to 255. Because a minor number has only 8 bits, a system can mount only 255 file systems of the same type. So a system can mount up to 255 NFS file systems, another 255 ext3 file systems, 255 more isofs file systems, and so on. Kernels 2.6 and later have 20-bit wide minor numbers, which alleviate this restriction.

For the Linux NFS client, however, the problem is somewhat worse because it is an anonymous file system. Local disk-based file systems have a block device associated with them, but anonymous file systems do not. /proc, for example, is an anonymous file system, and so are other network file systems like AFS. All anonymous file systems share the same major number, so there can be a maximum of only 255 anonymous file systems mounted on a single host.
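To see the major and minor numbers behind a mounted file system, the device ID reported by stat(2) can be split apart. The sketch below uses Python's os.major() and os.minor(); the paths are only examples, and the expectation that anonymous file systems share one major number (typically 0 on Linux) is an assumption to check on your own system.

```python
import os


def device_numbers(path):
    """Return (major, minor) of the device backing the file system at `path`."""
    st_dev = os.stat(path).st_dev
    return os.major(st_dev), os.minor(st_dev)


# "/" and "/proc" are only example paths; an NFS mount point works the same way.
for mount_point in ("/", "/proc"):
    if os.path.exists(mount_point):
        print(mount_point, device_numbers(mount_point))
```

Running this against several NFS mount points should show the same major number with a different minor number for each mount.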

Usually you won't need more than ten or twenty total NFS mounts on any given client. In some large enterprises, though, your work and users might be spread across hundreds of NFS file servers. To work around the limitation on the number of NFS file systems you can mount on a single host, we recommend that you set up and run one of the automounter daemons for Linux. An automounter finds and mounts file systems as they are needed, and unmounts any that it finds are inactive. You can find more information on Linux automounters here.

You may also run into a limit on the number of privileged network ports on your system. The NFS client uses a unique socket with its own port number for each NFS mount point. Using an automounter helps address the limited number of available ports by automatically unmounting file systems that are not in use, thus freeing their network ports. NFS version 4 support in the Linux NFS client uses a single socket per client-server pair, which also helps increase the allowable number of NFS mount points on a client.

B4. Why does NFS Version 2 seem so much faster than Version 3?
A. There are actually two problems here, plus a feature. First, some background; the NFS Version 2 protocol specification requires a server to record each write to permanent storage before it sends a reply to a client. This makes server and client reboot recovery very simple, and provides a good guarantee that data sent to the server is permanently stored. Linux servers (although not the Solaris reference implementation) allow this requirement to be relaxed by setting a per-export option in /etc/exports. The name of this export option is "[a]sync" (note that there is also a client-side mount option by the same name, but it has a different function, and does not defeat NFS protocol compliance).

When set to "sync," Linux server behavior strictly conforms to the NFS protocol. This is default behavior in most other server implementations. When set to "async," the Linux server replies to NFS clients before flushing data or metadata modifying operations to permanent storage, thus improving performance, but breaking all guarantees about server reboot recovery.

 

B5. Why does default NFS Version 2 performance seem equivalent to NFS Version 3 performance in 2.4 kernels?
A. See B4 for background information on how export options affect the Linux NFS server's write behavior.

Since Linux 2.4, the NFS Version 3 server recognizes the "async" export option. When this option is set, the server replies to clients before data has been written to permanent storage. The server also sends a FILE_SYNC response to the client, indicating that the client need not retain buffered data or send a subsequent COMMIT operation. This exposes the client to the same undetectable corruption as exists for NFS Version 2 (with "async") if the server crashes before it has actually written data to stable storage. (See question B6 for further discussion of this behavior and its consequences.) Note that even if a client sends a Version 3 COMMIT operation, the server replies immediately if the file system has been exported with the "async" option.

Conversely, when the "sync" export option is used on a Linux 2.4 server, both Version 2 and Version 3 writes behave as required by the NFS protocol specification. In this case, NFS Version 3 has a performance advantage over NFS Version 2, while maintaining data resilience during a server crash.

Note well that "[a]sync" also affects some metadata operations on the server.

 

B6. Why is the "async" export option unsafe, and is that really a serious problem?
A. The biggest problem is not just that it is unsafe, but that corruption may not be detected.

In the Linux implementation of NFS Version 2, when the "async" export option is in effect, a Linux NFS server may crash before posting all NFS write requests to disk. A Version 2 client, however, always assumes data is permanently written to stable storage, and that it is safe to discard buffers containing the written data.

After a server crash, the Version 2 client cannot know that unwritten data is lost; this is why Version 2 writes are supposed to be permanent before the server replies. Even if a client still has the modified data in its cache, the data on the server no longer matches what is cached on the client (since some or all of the writes did not complete before the server crashed). This may cause applications to make future decisions based on data cached by the client rather than what is on the server, thus further corrupting the file.

For the Linux implementation of NFS Version 3, using the "async" export option to allow faster writes is no longer necessary. NFS Version 3 explicitly allows a server to reply before writing data to disk, under controlled circumstances. It allows clients and servers to communicate about the disposition of written data so that in the event of a server reboot, a Version 3 client can detect the reboot and resend the data.

In summary, be sure all exports on your Linux NFS servers use the "sync" option by setting it explicitly or by upgrading your nfs-utils package to version 1.0.1 or later. If you need fast writes, be sure your clients mount using NFS Version 3. You may also improve write performance by adding the "wdelay" option to your exports.
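One way to audit a server for this is to scan /etc/exports for option lists that do not name "sync" explicitly. The following is a minimal sketch, not a full exports parser: the function name and the sample entries are invented for illustration, and it ignores line continuations, quoted paths and clients listed without any option list.

```python
import re

# Matches each "client(options)" group on an export line.
CLIENT_OPTS = re.compile(r"([^\s()]*)\(([^)]*)\)")


def exports_without_explicit_sync(exports_text):
    """Return (path, client) pairs whose option list lacks an explicit "sync"."""
    flagged = []
    for raw in exports_text.splitlines():
        line = raw.split("#", 1)[0].strip()      # drop comments
        parts = line.split(None, 1)
        if len(parts) < 2:
            continue
        path, clients = parts
        for match in CLIENT_OPTS.finditer(clients):
            client = match.group(1) or "(everyone)"
            options = [opt.strip() for opt in match.group(2).split(",")]
            if "sync" not in options:
                flagged.append((path, client))
    return flagged


# Hypothetical exports content, used only to exercise the function.
sample = (
    "/srv/data   client1(rw,sync) client2(rw,async)\n"
    "/srv/home   *.example.com(rw)   # relies on the default\n"
)
for path, client in exports_without_explicit_sync(sample):
    print(f"{path}: {client} does not set sync explicitly")
```

Entries that rely on the default are reported as well, since the default depends on the nfs-utils version in use.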

 

B7. I have achieved pretty fast speeds in some client benchmarks, but when my client is heavily loaded, it slows down considerably. Why does that happen?
A. The Linux client limits the total number of pending read or write operations per mount point. This prevents the client from exhausting its memory with cached read or write requests when the network or server is slow. The hard limit is 256 outstanding read or write operations per mount point. When that limit is reached, the client does not issue a new read or write operation until at least one outstanding read or write operation completes, thus serializing all reads and writes on that mount point until load is reduced.

Two ways of mitigating this effect are to:

  1. Increase rsize and wsize on your client's mount points. This increases the amount of data that can be involved in outstanding reads or writes at any given time.
  2. Mount the same server partition multiple times on your clients, and spread your applications among the mount points.

This limit has been removed in 2.6 and later kernels.
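Both mitigations follow from simple arithmetic: with a fixed number of outstanding operations per mount point, the amount of data that can be in flight grows with the transfer size and with the number of mount points. A small sketch, with a helper name chosen for this example:

```python
OUTSTANDING_LIMIT = 256  # per-mount limit quoted above for pre-2.6 clients


def max_bytes_in_flight(transfer_size, mounts=1, limit=OUTSTANDING_LIMIT):
    """Upper bound on data covered by outstanding reads or writes."""
    return transfer_size * limit * mounts


for size, mounts in [(8192, 1), (32768, 1), (8192, 2)]:
    mib = max_bytes_in_flight(size, mounts) / (1024 * 1024)
    print(f"rsize/wsize={size:>5} mounts={mounts} -> {mib:.0f} MiB in flight at most")
```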

 

B8. Why won't my client let me use rsize or wsize larger than 8KB when I mount my Linux NFS server?
A. NFS Version 2 supports up to 8KB reads and writes. NFS Version 3 allows larger reads and writes (see question A1). Stock 2.4 kernels earlier than 2.4.20 do not support read or write operations larger than 8192 bytes for either NFS Version 2 or 3. Server-side TCP support, introduced as an experimental compile-time option in 2.4.20, increases the server's maximum I/O size to 32KB by increasing the value of NFSSVC_MAXBLKSIZE (see question B2).

When a client mounts a file server, the file server advertises the largest number of bytes it can read or write in a single operation. Clients always use the smaller of the server's maximum and the rsize and wsize values specified by the client in the mount command.

Large values of rsize and wsize may inhibit performance when using UDP. UDP datagrams must be separated into fragments that fit within your network's Maximum Transfer Unit. The loss of any of these fragments requires retransmission of the whole datagram. This may have a particularly adverse impact on client performance if your network is congested. TCP is considerably better at recovering one or two lost segments and managing network congestion, so larger I/O operations are usually more effective at reliably boosting performance when using NFS over TCP.
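To get a feel for why fragment loss hurts large UDP transfers, consider the chance that a datagram loses at least one of its fragments. The sketch below assumes that each fragment is lost independently with the same probability, which is a simplification; real networks tend to lose packets in bursts.

```python
def datagram_loss_probability(fragment_loss_rate, fragments):
    """Chance that at least one of `fragments` independent fragments is lost."""
    return 1.0 - (1.0 - fragment_loss_rate) ** fragments


# Fragment counts are example values; 1% fragment loss is an assumed rate.
for fragments in (1, 6, 23):
    p = datagram_loss_probability(0.01, fragments)
    print(f"{fragments:>2} fragment(s): {p:.1%} of datagrams need retransmission")
```

The more fragments a datagram needs, the more often the whole datagram has to be sent again.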

B9. I use the "sync" or "noac" mount options. I've increased my wsize, but write throughput is lower than I expect. Why is this?
A. Normally, an NFS client delays sending application write requests, allowing application processing to overlap with NFS write operations. An NFS client only causes an application to wait for writes to complete when the application closes or flushes a file. When a client sends write operations synchronously, however, the client causes applications to wait for each write operation to complete at the server. This results in much lower performance.

The Linux NFS client uses synchronous writes under many circumstances, some of which are obvious, and some of which you may not expect. Applications enable synchronous writes for a single file by opening a file with the O_SYNC or O_DSYNC flags. System administrators enable synchronous writes for all files in a local file system by mounting that file system with the "sync" option. The "noac" mount option also enables synchronous writes. If it didn't, applications running on other clients would have a difficult time retrieving file modifications if a client delayed writes.
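For reference, this is roughly what the per-file case looks like from an application. The sketch opens a file with O_SYNC through Python's os module; the function name and the temporary file are only for demonstration.

```python
import os
import tempfile


def write_synchronously(path, data):
    """Write `data` through a descriptor opened with O_SYNC."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC, 0o644)
    try:
        written = 0
        while written < len(data):
            # With O_SYNC each write() returns only after the data has been
            # handed to stable storage (on NFS, acknowledged by the server).
            written += os.write(fd, data[written:])
        return written
    finally:
        os.close(fd)


# Demonstration on a temporary local file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "journal.dat")
    demo_written = write_synchronously(demo_path, b"record-1\n")
    with open(demo_path, "rb") as f:
        demo_contents = f.read()
```

On an NFS mount, every write() issued this way waits for the server, which is the cost described above.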

Currently the Linux NFS client has a limitation which prevents it from safely generating large synchronous writes. The client breaks large write requests into on-the-wire write operations that are no larger than a single page to guarantee that write requests arrive on the server's disk in byte order (some applications depend on this behavior). Even if you set wsize larger than a page, the client will break any application write request into page-sized NFS write operations to meet this guarantee.

In addition, if the server's page size is larger than the client's page size, the server is forced to do additional work when the client writes in small chunks. NFS clients normally align reads and writes to their own page size, which then may be unaligned on the server if it uses larger pages. Depending on the server OS and filesystem, this could result in a number of performance limiting problems.

B10. Sometimes my server gets slow or becomes unresponsive, then comes back to life. I'm using NFS over UDP, and I've noticed a lot of IP fragmentation on my network. Is there anything I can do?
A. UDP datagrams larger than the IP Maximum Transfer Unit (MTU) must be divided into pieces that are small enough to be transmitted. If, for example, your network's MTU is 1524 bytes, the Linux IP layer must break any UDP datagram larger than 1524 bytes into separate packets, all of which must fit within the MTU. These separated packets are called fragments.
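The number of fragments a single NFS request produces can be estimated as follows. This sketch assumes IPv4 with a 20-byte header and no options, and it ignores the RPC and NFS headers that travel with the data, so real requests may need one fragment more than shown.

```python
import math

IP_HEADER = 20   # bytes, assuming IPv4 without options
UDP_HEADER = 8   # bytes


def fragment_count(udp_payload, mtu):
    """Estimate how many IP fragments a UDP datagram needs."""
    datagram = udp_payload + UDP_HEADER
    # Every fragment except the last must carry a multiple of 8 bytes.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    return math.ceil(datagram / per_fragment)


for size in (1024, 8192, 32768):
    print(f"wsize={size:>5} over MTU 1500 -> {fragment_count(size, 1500)} fragment(s)")
```

A lower rsize or wsize reduces the fragment count directly, which is why reducing it eases pressure on socket buffers.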

The Linux IP layer transmits each fragment as it is breaking up a UDP datagram, encoding enough information in each fragment so that the receiving end can reassemble the individual fragments into the original UDP datagram. If something happens that prevents a client from continuing to fragment a packet (e.g., the output socket buffer space in the IP layer is exceeded), the IP layer stops sending fragments. In this case, the receiving end has a set of fragments that is incomplete, and after a certain time window, it will drop the fragments if it does not receive enough to assemble a complete datagram. When this occurs, the UDP datagram is lost. Clients detect this loss when they have not received a reply from the server after a certain time interval, and recover by retransmitting the datagram.

Under heavy write loads, the Linux NFS client can generate many large UDP datagrams. This can quickly exhaust output socket buffer space on the client. If this occurs many times in a short time, the client sends the server a large number of fragments, but almost never gets a whole datagram's worth of fragments to the server. This fills the server's IP reassembly queue, causing it to become unreachable via UDP until it expels the useless fragments from the queue.

Note that the same thing can occur on servers that are under a heavy read load. If the server's output socket buffers are too small, large reads will cause them to overflow during IP fragmentation. The client's IP reassembly queue then fills with worthless fragments, and little UDP traffic can get to the client.

Here are some symptoms of this problem:

The fix is to make Linux's IP fragmentation logic continue fragmenting a datagram even when output socket buffer space is over its limit. This fix appears in kernels newer than 2.4.20. You can work around this problem in one of several ways:

  1. Use NFS over TCP. TCP does not use fragmentation, so it does not suffer from this problem. Using TCP may not be possible with older Linux NFS clients and servers that only support NFS over UDP.
  2. If you can't use NFS over TCP, upgrade your clients to 2.4.20 or later.
  3. If you can't upgrade your clients, increase the default size of your client's socket buffers (see below). 2.4.20 and later kernels do this automatically for the NFS client's socket buffers. See Section 5.3 of the NFS How-To for more information.
  4. If your rsize or wsize is very large, reduce it. This will reduce the load on your client's and server's output socket buffers.
  5. Reduce network congestion by ensuring your GbE links use full flow control, that your switch and router ports use adequate buffer sizes, and that all links are negotiating their fastest settings.

 

B11. Why does my server see so many ACCESS calls when using Linux clients?
A. Default NFS server behavior is to prevent root on client machines from having privileged access to exported files. Servers do this by mapping the "root" user to some unprivileged user (usually the user "nobody") on the server side. This is known as root squashing. Most servers, including the Linux NFS server, provide an export option to disable this behaviour and allow root on selected clients to enjoy full root privileges on exported file systems.

Unfortunately, an NFS client has no way to determine that a server is squashing root. Thus the Linux client uses NFS Version 3 ACCESS operations when an application is running on a client as root. If an application runs as a normal user, a client uses its own authentication checking, and doesn't bother to contact the server.

The Linux NFS client should cache the results of these ACCESS operations. In fact, in the new 2.6.x kernels, it does this and it extends ACCESS checking to all users to allow for generic uid/gid mapping on the server. This also enables proper support for Access Control Lists in the server's local file system. In pre-2.6 kernels, the stock NFS client does not cache the results of ACCESS operations.

 

C1. How are exported file systems and client mount points tracked on the server?
A. /etc/exports contains information about how file systems should normally be exported. This is only read by exportfs.

 

C2. Can I modify export permissions without needing to remount clients in order to have them take effect?
A. Yes. The safest thing to do is edit /etc/exports and run "exportfs -r".

Note that when a mount request arrives, mountd checks .../etab to see if that host is allowed access. If it is, an entry is placed in .../rmtab and the filesystem is exported, thus creating an entry in /proc/fs/nfs/exports.

When you run "exportfs -io <options> host:/dir", the entry in .../etab is changed, or a new one is added. If it is a subnet/wildcard/netgroup entry, then every line in .../rmtab is checked to see if it matches. When a match is found, a host-specific entry is given to (or changed in) the kernel. When you run "exportfs -a", it makes sure that all entries in /etc/exports are properly reflected in .../etab. Any extra entries in etab are left alone. Once the correct content of etab has been determined, rmtab is examined to create a list of specific-host entries for any new entries in etab. These host-specific entries are given to the kernel.

When you run "exportfs -r", it ignores the prior contents of .../etab and initializes etab to the contents of /etc/exports. Then it inspects rmtab and makes any changes to /proc/fs/nfs/exports that are necessary.

 

C3. My exports seem to be readable by everyone - or /etc/exports is not giving the intended permissions
A. /etc/exports is VERY sensitive to whitespace, so the following statements are not the same, due to the space between the client name "hostname" and the opening parenthesis:

/export/dir hostname(rw,no_root_squash)
/export/dir hostname (rw,no_root_squash)

The first will grant hostname read and write access to /export/dir without squashing root privileges. The second will export /export/dir to hostname with the default export options, which include root squashing, and it will grant everyone else read and write access without squashing root privileges.
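A quick check for this mistake is to look for option lists that are preceded by whitespace. The sketch below is a simple heuristic with an invented function name; it also reports entries that deliberately export to everyone, so its output needs a human look.

```python
import re

# An option list preceded by whitespace has no client attached to it.
DETACHED_OPTIONS = re.compile(r"\s\(([^)]*)\)")


def detached_option_lists(exports_text):
    """Return (line_number, options) for option lists not attached to a client."""
    findings = []
    for number, raw in enumerate(exports_text.splitlines(), start=1):
        line = raw.split("#", 1)[0]              # drop comments
        for match in DETACHED_OPTIONS.finditer(line):
            findings.append((number, match.group(1)))
    return findings


# The two statements from the answer above.
sample = (
    "/export/dir hostname(rw,no_root_squash)\n"
    "/export/dir hostname (rw,no_root_squash)\n"
)
for number, options in detached_option_lists(sample):
    print(f"line {number}: ({options}) applies to every client")
```

Only the second statement is reported, because its option list stands alone.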

 

C4. I believe the Linux NFS server will not export a fat32 partition. Is that correct?
A. The FAT file systems can be exported, starting with the early 2.4 kernels, but if used extensively, it may cause grief. First, only those operations supported by the exported file system will be honoured. Operations such as "chown", "link", and "symlink" are not supported by these file systems, and will fail. Read/write/create etc., should be fine, as long as the files remain relatively unchanged.

The most serious problem is that the FAT filesystem layout does not contain enough information to create a lasting identity needed for NFS to create persistent filehandles. For example, if you take a file, rename it to another directory, truncate it, and write new data to it, there is nothing stored in the filesystem that can be used to show that the resulting file is, in any sense, the "same" as the original file, and there is no way to find the new file given any details about the original file. Therefore, the Linux NFS server cannot guarantee that once you have opened a file, you can continue to have access to that file, if the file is modified in the ways given above. NFS may then be unable to locate or identify the file correctly, and so may return ESTALE errors.

C5. Sometimes my client gets a "permission denied" error when attempting to mount a file system, even though it managed it a few hours earlier with no change to the configuration on the server.
A. Your server's /etc/exports is probably misconfigured. If the exports file contains both domain names and IP addresses, it can result in random client behavior when mounting, especially if your clients have multiple IP addresses registered with DNS.

If you export a directory and one of its ancestors, and both reside on the same physical file system on the server, it can result in random client behavior when mounting.

 

C6. Which local file systems can I export with the Linux NFS server?
A. We expect the following local file systems to work, as they are tested often: ext2, ext3, jfs, reiserfs, xfs.

These local file systems may work or may have a few minor-ish issues: iso9660, ntfs, reiser4, udf. Ask on the NFS mailing list for details.

Any file system based on FAT or not having the ability to provide permanent inode numbers will have trouble with NFS versions 2 and 3 (see question C4).

Local file systems that are known not to work with the Linux NFS server are: procfs, sysfs, tmpfs (and friends).

 

C7. Why should I disable subtree checking on my NFS server exports?
A. When an NFS server exports a subdirectory of a local file system, but leaves the rest unexported, the NFS server must check whether each NFS request is against a file residing in the area that is exported. This check is called the subtree check.

To perform this check, the server includes information about the parent directory of each file in NFS file handles that are handed out to NFS clients. If the file is renamed to a different directory, for example, this changes the file handle, even though the file itself is still the same file. This breaks NFS protocol-compliance, often causing misbehavior on clients such as ESTALE errors, inappropriate access to renamed or deleted files, broken hard links, and so on.

In the opinion of many, subtree checking causes much more trouble than it saves, and should be avoided in most cases. The subtree_check option is necessary only when you want to prevent a file handle guessing attack from gaining access to files that fall outside the exported part of your server's local file systems. If you need to be certain that no one can access files outside the exported part of a local file system, set up the partitions on your server so that you only export whole file systems.

 

D1. I keep getting permission failure messages at my NFS server. What are they?
A. The messages you are mentioning take the following format:

Jan 7 09:15:29 server kernel: fh_verify: mail/guest permission failure, acc=4, error=13

Jan 7 09:23:51 server kernel: fh_verify: ekonomi/test permission failure, acc=4, error=13


They happen when an NFS setattr operation is attempted on a file you don't have write access to. These messages are harmless.
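If you want to count or filter these messages, a small parser is enough. The sketch below assumes the message layout shown above and assumes that the "error" value follows the usual errno numbering (13 is EACCES on Linux); the "acc" value is kept as text because its encoding is not described here.

```python
import errno
import re

FH_VERIFY = re.compile(
    r"fh_verify: (?P<path>\S+) permission failure, "
    r"acc=(?P<acc>\w+), error=(?P<error>\d+)"
)


def parse_fh_verify(line):
    """Extract path, acc and error from an fh_verify log line, or return None."""
    match = FH_VERIFY.search(line)
    if match is None:
        return None
    code = int(match.group("error"))
    return {
        "path": match.group("path"),
        "acc": match.group("acc"),
        "error": code,
        # Assumption: the kernel reports a standard errno value here.
        "errno_name": errno.errorcode.get(code, "unknown"),
    }


line = ("Jan 7 09:15:29 server kernel: fh_verify: mail/guest "
        "permission failure, acc=4, error=13")
print(parse_fh_verify(line))
```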

 

D2. What is a "silly rename"? Why do these .nfsXXXXX files keep showing up?
A. Unix applications often open a scratch file and then unlink it. They do this so that the file is not visible in the file system name space to any other applications, and so that the system will automatically clean up (delete) the file when the application exits. This is known as "delete on last close", and is a tradition among Unix applications.

Because of the design of the NFS protocol, there is no way for a file to be deleted from the name space but still remain in use by an application. Thus NFS clients have to emulate this using what already exists in the protocol. If an open file is unlinked, an NFS client renames it to a special name that looks like ".nfsXXXXX". This "hides" the file while it remains in use. This is known as a "silly rename." Note that NFS servers have nothing to do with this behavior.

After all applications on a client have closed the silly-renamed file, the client automatically finishes the unlink by deleting the file on the server. Generally this is effective, but if the client crashes before the file is removed, it will leave the .nfsXXXXX file. If you are sure that the applications using these files are no longer running, it is safe to delete these files manually.
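Leftover files can be located with a simple directory walk. This sketch only lists candidates and deliberately deletes nothing, since a .nfsXXXXX file may still be in use by a running application; the function name and the demonstration files are made up.

```python
import os
import tempfile


def find_silly_renames(root):
    """Return the paths of .nfsXXXXX files found under `root`."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.startswith(".nfs"):
                found.append(os.path.join(dirpath, name))
    return sorted(found)


# Demonstration with ordinary files that merely imitate the naming scheme.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("data.txt", ".nfs0001"):
        with open(os.path.join(tmp, name), "w") as f:
            f.write("x")
    demo_found = [os.path.basename(p) for p in find_silly_renames(tmp)]
    print(demo_found)
```

Review the list against the processes running on your clients before removing anything.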

The NFS version 4 protocol is stateful, and could actually support delete-on-last-close. Unfortunately there isn't an easy way to do this and remain backwards-compatible with version 2 and 3 accessors.

 

D3. What does this mean:   svc: unknown program 100227 (me 100003)
A. It refers to a mount request by an NFS client which supports the Solaris NFS_ACL side-band protocol. The Linux NFS server in the mainline kernels does not support this protocol, but many distributions include patches that provide NFS_ACL support in their NFS implementation. The message can be ignored safely.

 

D4. I frequently see this in my logs:
  kernel: nfs: server server.domain.name not responding, still trying
  kernel: nfs: task 10754 can't get a request slot
  kernel: nfs: server server.domain.name OK

A. The "can't get a request slot" message means that the client-side RPC code has detected a lot of timeouts (perhaps due to network congestion, perhaps due to an overloaded server), and is throttling back the number of concurrent outstanding requests in an attempt to lighten the load. Some possible causes:
 

 

D5. I just upgraded to the latest nfs-utils and now NLM locking no longer works on files residing on my NFS server. What's up?
A. The permissions on the /var/lib/nfs/sm and /var/lib/nfs/sm.bak directories must be addressed. Whichever user rpc.statd is running as must have ownership of and rw access to those directories. The permissions should be set to 700 for both. In addition, etab, rmtab, and xtab all must exist and be writable by root.

D6. I've mounted with the "intr" option but processes still become unkillable when my server is unavailable. How do I kill the processes so I can unmount them?
A. It is true that even when using the "intr" mount option, you will not always succeed in killing a task that is hanging on NFS. In these instances, the task is usually waiting in the kernel on some semaphore that is held by another process. Since signals cannot interrupt semaphores, the signal will have no effect on the hanging task.

There have been some suggested solutions, but none have been implemented. One is to set up a special class of semaphores which are killable with 'SIGKILL', but replacing the relevant semaphores in the VFS and VM layers will not be possible before the 2.7 kernels at the earliest. Another solution under consideration is to cause rpciod to awaken all waiting requests when a user requests an unmount, allowing them to exit with an error.

Until these are implemented, you can work around this problem by killing all processes waiting for I/O to complete in a given file system:

 

Another, less desirable, workaround is to use "soft" mounts. This will cause processes to stop retrying I/O after a time. Eventually processes become unstuck and your file system can be unmounted. However, soft mounts are not completely safe. See question E4 for a description of the risks of using "soft" mounts.

D7. How come lock recovery doesn't work for me?
A. When a client reboots, it should notify any servers it had previously mounted to release all locks that were held. It does this by invoking rpc.statd during system start up.

There are several common problems that can prevent rpc.statd from working. First, be sure that your client has the appropriate startup script enabled (/etc/rc.d/init.d/nfslock on Red Hat distributions). Next, make certain that when rpc.statd starts up, the network is already available for it to work (some DHCP-configured hosts may have a problem with this, for example).

Make sure that the client's nodename (uname -n) is the same as what is returned by gethostbyname(3) on your client. These can differ because of your nsswitch configuration, the contents of /etc/hosts, because your client is configured via DHCP, or because of DNS misconfiguration. The in-kernel lockd process uses a client's nodename to identify its locks when sending lock requests. Rpc.statd must send an identical string when it sends a recovery notification, otherwise the server has no way to match the notification to any locks it may still hold for the client.
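The nodename comparison can be scripted. The sketch below uses Python's socket.gethostbyname_ex(), which returns the canonical host name along with aliases and addresses; the function name and the injectable `resolve` parameter are conveniences for this example.

```python
import os
import socket


def nodename_matches_resolver(nodename=None, resolve=socket.gethostbyname_ex):
    """Compare `uname -n` with the canonical name the resolver returns."""
    if nodename is None:
        nodename = os.uname().nodename
    try:
        canonical, _aliases, _addresses = resolve(nodename)
    except OSError:
        return False, None   # the nodename does not resolve at all
    # The strings must be identical for lock recovery to match up.
    return canonical == nodename, canonical


if __name__ == "__main__":
    same, canonical = nodename_matches_resolver()
    print("nodename matches resolver:", same, "canonical name:", canonical)
```

A result of False suggests looking at /etc/hosts, the nsswitch configuration, DHCP settings or DNS, as described above.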

It is also recommended that the nodenames for your NFS clients be fully qualified domain names, not just a hostname. If another client in a different domain with the same hostname contacts your server, a fully qualified nodename on both clients will allow the server to distinguish between locks set on each client.

When traversing a firewall between your clients and server, bi-directional RPC traffic must be allowed if you need lock recovery to work, as NLM is callback-based. Two important issues that may prevent the server from calling the client are:

 

D8. When my application uses memory-mapped NFS files, it breaks. Why?
A. Usually this is because application developers rely on certain local file system behaviors to guarantee data consistency, rather than reading the mmap man pages carefully to understand what behavior is required by all file system implementations. Some examples:

Although some implementations of munmap(2) happen to write dirty pages to local file systems, the NFS version of munmap(2) does not. An msync(2) call is always required to guarantee that dirty mapped data is written to permanent storage. A subtle ramification of the Linux NFS client's treatment of munmap(2) is that it does not consider munmap(2) to be a close operation for the purposes of close-to-open cache coherency.

The distinction between the MS_SYNC and MS_ASYNC flags is also important. MS_ASYNC will force dirty mapped pages to permanent storage eventually. Only MS_SYNC guarantees that the pages are written before msync(2) returns to your application. Therefore applications should use msync(MS_SYNC) to serialize data writes to mapped files.

Finally, the Linux NFS client may not flush dirty mapped pages when a file descriptor is closed via close(2). Oftentimes during close processing, the client may flush mapped pages along with pages dirtied by a write(2) call, but this behavior is not guaranteed. Many applications will open a file, map it, then close it and continue using the map. The behavior described above is an attempt to optimize the performance of this use case.
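A minimal sketch of the safe pattern, flushing the mapping explicitly instead of relying on munmap(2) or close(2), is shown below. It relies on the assumption that mmap.flush() in CPython on Unix is implemented with msync(2) and MS_SYNC; the function name and the temporary file are for illustration only.

```python
import mmap
import os
import tempfile


def update_mapped_file(path, offset, payload):
    """Modify a file through a shared mapping and flush it before unmapping."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as mapping:
            mapping[offset:offset + len(payload)] = payload
            # Assumed to issue msync(MS_SYNC), so it returns only after the
            # dirty pages have been written out.
            mapping.flush()


# Demonstration on a temporary local file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "table.bin")
    with open(demo_path, "wb") as f:
        f.write(b"\0" * 4096)
    update_mapped_file(demo_path, 10, b"hello")
    with open(demo_path, "rb") as f:
        demo_bytes = f.read()
```

The flush happens before the mapping is released, so correctness does not depend on what munmap(2) or close(2) happen to do.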

 

D9. When I update shared executable files on my NFS exports, programs running on my clients all segfault. How come?
A. If you simply copy the new executable or library over an old version, you are violating the NFS cache consistency rules (described here) by changing a file that is being held open on your clients.

Copying over executables creates a window during which an NFS client's cache may hold parts of the old version and parts of the new version, all combined in the same file. The correct way to update executables and shared libraries on your NFS shares is to use the install program with the '-b' option. That renames the version of the executable that is in use, then creates a brand new file to contain the new version of the executable.
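The same rename-then-create idea can be sketched in a script. This is not a reimplementation of the install program: the function name, the "~" backup suffix and the ".new" staging name are assumptions for this example, and copying to a staging file first is an extra precaution added here so that the target name never refers to a half-written file.

```python
import os
import shutil
import tempfile


def install_with_backup(new_file, target, suffix="~"):
    """Rename the in-use `target` aside, then put a brand new file in place."""
    if os.path.exists(target):
        # The rename keeps the old inode, so clients that still hold the old
        # version open keep seeing consistent contents.
        os.replace(target, target + suffix)
    staging = target + ".new"   # hypothetical staging name
    shutil.copy2(new_file, staging)
    os.replace(staging, target)


# Demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "tool")
    update = os.path.join(tmp, "tool.build")
    with open(target, "wb") as f:
        f.write(b"old version")
    with open(update, "wb") as f:
        f.write(b"new version")
    old_inode = os.stat(target).st_ino
    install_with_backup(update, target)
    backup_keeps_inode = os.stat(target + "~").st_ino == old_inode
    target_is_new_file = os.stat(target).st_ino != old_inode
    with open(target, "rb") as f:
        installed = f.read()
    with open(target + "~", "rb") as f:
        backed_up = f.read()
```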

 

D10. I'm trying to use flock()/BSD locks to lock files used on multiple clients, but the files become corrupted. How come?
A. flock()/BSD locks act only locally on Linux NFS clients prior to 2.6.12. Use fcntl()/POSIX locks to ensure that file locks are visible to other clients.
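In Python, for example, fcntl.lockf() is a wrapper around the fcntl() locking calls, while fcntl.flock() uses BSD-style locks. A minimal sketch of taking a POSIX lock around an update follows; the helper name and the temporary file are made up for the demonstration.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def posix_lock(path):
    """Hold an exclusive fcntl()/POSIX lock on `path` for the `with` body."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX)   # blocks until the lock is granted
        yield fd
    finally:
        fcntl.lockf(fd, fcntl.LOCK_UN)   # closing the descriptor also unlocks
        os.close(fd)


# Demonstration on a temporary local file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "shared.log")
    with posix_lock(demo_path) as fd:
        os.write(fd, b"entry\n")
    with open(demo_path, "rb") as f:
        demo_log = f.read()
```

Every cooperating program must take the same kind of lock on the same file for this to serialize access.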

Here are some ways to serialize access to an NFS file.

It's worth noting that until early 2.6 kernels, O_EXCL creates were not atomic on Linux NFS clients. Don't use O_EXCL creates and expect atomic behavior among multiple NFS clients unless you are running a kernel newer than 2.6.5.
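For completeness, an exclusive create looks like this from an application. The sketch only shows the call pattern; whether the create is atomic across NFS clients depends on the kernel version, as noted above. The function name and the lock file name are illustrative.

```python
import os
import tempfile


def try_create_exclusive(path):
    """Return True if this call created `path`, False if it already existed."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    os.close(fd)
    return True


# Demonstration in a temporary directory: only the first attempt succeeds.
with tempfile.TemporaryDirectory() as tmp:
    lock_path = os.path.join(tmp, "job.lock")
    first_attempt = try_create_exclusive(lock_path)
    second_attempt = try_create_exclusive(lock_path)
```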

It's a known issue that Perl uses flock()/BSD locking by default. This can break programs ported from other operating systems, such as Solaris, that expect flock/BSD locks to work like POSIX locks.

On Linux, using file locking instead of a hard link has the added benefit of checkpointing the client's cache with the server. When a file lock is acquired, the client will flush the page cache for that file so that any subsequent reads get new data from the server. When a file lock is released, any changes to the file on that client are flushed back to the server before the lock is released so that other clients waiting to lock that file can see the changes.

The NFS client in 2.6.12 provides support for flock()/BSD locks on NFS files by emulating the BSD-style locks in terms of POSIX byte range locks. Other NFS clients that use the same emulation mechanism, or that use fcntl()/POSIX locks, will then see the same locks that the Linux NFS client sees.

On local Linux filesystems, POSIX locks and BSD locks are invisible to one another. Thus, due to this emulation, applications running on a Linux NFS server will still see files locked by NFS clients as being locked with a fcntl()/POSIX lock, whether the application on the client is using a BSD-style or a POSIX-style lock. If the server application uses flock()/BSD locks, it will not see the locks the NFS clients use.

 

D11. Why doesn't "mount -oremount,tcp" convert an NFS-mounted file system mounted with UDP to one mounted with TCP?
A. The "remount" option on the mount command only affects the generic mount options, such as ro/rw, sync, and so on (see man mount for a complete list of generic mount command options). The NFS-specific mount options listed on the nfs man page can't be changed with a "mount -oremount" style mount command. You must unmount your file system and mount it again with new options in order to modify the NFS-specific settings.

Note that the mount command may update the contents of /etc/mtab whether or not the actual mount settings have changed in the kernel. So when you try mount -oremount with an NFS-specific mount option, subsequent mount commands may report that the setting is in effect. This is only because the mount command is reading /etc/mtab. The /proc/mounts file reflects the true mount options that the kernel is using.
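To check which options are actually in effect, one can read /proc/mounts rather than trust the output of mount. A minimal sketch follows; the function name nfs_mount_options and the sample line are made up for illustration, and the exact option names the kernel prints vary between kernel versions.

```python
def nfs_mount_options(mounts_text):
    """Return {mount_point: [options]} for NFS entries in /proc/mounts text."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fstype, options = fields[:4]
        if fstype in ("nfs", "nfs4"):
            result[mount_point] = options.split(",")
    return result


# Made-up sample in /proc/mounts format, for illustration only.
SAMPLE = (
    "/dev/sda1 / ext3 rw,relatime 0 0\n"
    "server:/export /mnt/data nfs rw,vers=3,hard,proto=udp,timeo=11 0 0\n"
)

for mount_point, opts in nfs_mount_options(SAMPLE).items():
    print(mount_point, ",".join(opts))
```

On a real client, pass the contents of /proc/mounts instead of SAMPLE, for example open("/proc/mounts").read().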

D12. I didn't mount with "intr" (the default is "nointr") and some processes are unkillable when my server becomes unavailable. What can I do?
A. Use the umount command's "-f" flag to force an unmount. There will be a brief pause while the umount command attempts to contact the server, and then all outstanding requests to the server will be failed, thus making the processes killable.

Some programs, upon receiving an I/O error, will just try more I/O, making them unkillable again. For this reason, try killing all processes on the stuck mount first, and then run "umount -f". When the I/O requests fail, the processes will become killable, will see the signal, and will die. Sometimes it can take a couple of iterations of the "kill processes" then "umount -f" cycle until the filesystem is unmounted, but it usually works.

If all else fails, you can still unmount the partition on which the processes are hanging using the "umount -l" command. This causes the stuck mount to become detached from the file system name space hierarchy on your client, so it will no longer be visible to other processes. You can replace that mount point with another mount to the same server when it becomes available again, or to some other server if the remote data has moved. Note, though, that the old mount point will continue to consume client memory until the stuck processes have all died.

 

E1. I use a Tru64 Unix 4.x or SunOS 4.1.x client. NFS File locking does not seem to work unless I give all users permissions on the file.
A. The default specifications for NFS Versions 2 and 3 allow any user to lock a file regardless of whether that user has permission to access the file. The writers of the Linux NFS server regarded this behavior as insecure, and chose to only allow users who have access to a file to be able to lock it. However, older SunOS and Tru64 clients, and some HP/UX clients, take advantage of the NFS specification by making all NFS file lock requests with the credentials of the daemon. This means that if the daemon does not have access to the files, the server will refuse to lock them.

The export option no_auth_nlm is designed to alleviate this problem. Set it on any shares you wish to export to these clients. This will disable the authorization check on file lock requests.

 

E2. I'm not using Redhat or VALinux distros so the nfs-utils startup script in the rpm is broken. What do I do?
A. Comment out the following line in /etc/rc.d/init.d/nfs:

. /etc/rc.d/init.d/functions

 

E3. I'm using an Irix Client and I'm seeing an array of problems with file lists and cwd from a Linux server. The server is running NFS Version 3. Is this a Linux bug?
A. IRIX improperly deals with file handles of less than 32 bytes which the NFS server in Linux 2.4.x uses. SGI has addressed this problem in IRIX 6.5.13, which was released in 2001.

A workaround to this problem is to use NFS Version 2. On the IRIX client, use vers=2 in your mount options.

 

E4. Why do I get NFS timeouts when I mount a Linux NFS server from my Solaris NFS client?
A. You get NFS timeouts because you are using soft mounts. Normally, mounts are hard, which requires the client to continue attempts to reach the server forever. A soft mount allows the client to stop trying an operation after a period of time. A soft timeout may cause silent data corruption if it occurs during data or metadata transmissions, so you should only use soft mounts in the cases where client responsiveness is more important than data integrity. If you require the use of soft mounts over an unreliable link such as DSL, try using TCP, which is what Solaris uses by default. This will help manage the impact of brief network interruptions. If using TCP is not possible, then you should reduce the risk of using soft mounts with UDP by specifying long retransmission timeout values and a relatively large number of retries in the mount command options (e.g., timeo=30, retrans=10).

Note that NFS over UDP now uses a retransmit timeout estimation algorithm in the latest 2.4 and 2.6 kernels, which means the timeo= mount option is less effective at preventing data corruption due to a soft timeout.





Copyright © 1996-2014 by Dr. Nikolai Bezroukov. www.softpanorama.org was created as a service to the UN Sustainable Development Networking Programme (SDNP) in the author's free time. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License. The site uses AdSense, so you need to be aware of the Google privacy policy. Copyright of original materials belongs to the respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine. This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Grammar and spelling errors should be expected. The site contains some broken links as it develops like a living tree...

You can use PayPal to make a contribution, supporting hosting of this site with different providers to distribute and speed up access. Currently there are two functional mirrors: softpanorama.info (the fastest) and softpanorama.net.

Disclaimer:

The statements, views and opinions presented on this web page are those of the author and are not endorsed by, nor do they necessarily reflect, the opinions of the author's present and former employers, SDNP or any other organization the author may be associated with. We do not warrant the correctness of the information provided or its fitness for any purpose.

Last modified: February, 19, 2014