Windows File System Filter Driver FAQ
What is the difference between
cached I/O, user non-cached I/O, and paging I/O?
In a file
system or file system filter driver, read and write operations fall into
several different categories. For the purpose of discussing them, we normally consider the following types:
- Cached I/O. This includes normal user I/O, both via the Fast I/O path as well
as via the IRP_MJ_READ and IRP_MJ_WRITE path. It also includes the MDL
operations (where the caller requests the FSD return an MDL pointing to the
data in the cache).
- Non-cached user I/O. This includes all non-cached I/O operations that
originate outside the virtual memory system.
- Paging I/O. These are I/O operations initiated by
the virtual memory system in order to satisfy the needs of the demand paging
system.
Cached I/O is any I/O that can be satisfied by the file
system data cache. In such a case, the operation is normally to copy the
data from the virtual cache buffer into the user buffer. If the virtual cache
buffer contents are resident in memory, the copy is fast and the results
returned to the application quickly. If the virtual cache buffer contents are
not all resident in memory, then the copy process will trigger a page fault,
which generates a second re-entrant I/O operation via the paging mechanism.
Non-cached user I/O is I/O that must bypass the cache - even if the data is
present in the cache. For read operations, the FSD can retrieve the data
directly from the storage device without making any changes to the cache. For
write operations, however, an FSD must ensure that the cached data is properly invalidated (if this is even possible, which it
will not be if the file is also memory mapped).
Paging I/O is I/O that must be satisfied from the storage device (whether local
to the system or located on some "other" computer system) and it is being requested by the virtual memory system as part of
the paging mechanism (and hence has special rules that apply to its behavior as
well as its serialization).
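These three categories can be distinguished in a read/write dispatch routine by examining the IRP flags. The following is a minimal user-mode sketch of that decision, not driver code: the two flag values are reproduced from the WDK headers for the simulation, and the ClassifyIo helper is an invented name.

```c
#include <assert.h>

/* Flag values as defined in the WDK headers (copied here for the sketch). */
#define IRP_NOCACHE   0x00000001
#define IRP_PAGING_IO 0x00000002

typedef enum { CACHED_IO, NONCACHED_USER_IO, PAGING_IO } IO_CLASS;

/* Classify a read/write request the way a filter's dispatch routine might. */
static IO_CLASS ClassifyIo(unsigned long irpFlags)
{
    if (irpFlags & IRP_PAGING_IO)
        return PAGING_IO;          /* initiated by the virtual memory system */
    if (irpFlags & IRP_NOCACHE)
        return NONCACHED_USER_IO;  /* user I/O that must bypass the cache */
    return CACHED_IO;              /* may be satisfied from the cache */
}
```

Note that paging I/O arrives with IRP_NOCACHE set as well, which is why the IRP_PAGING_IO test must come first.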
I see I/O requests with the
IRP_MN_MDL minor function code. What does this mean?
How should I handle it in my file system or filter driver?
Kernel mode
callers of the read and write interface (IRP_MJ_READ and IRP_MJ_WRITE) can
utilize an interface that allows retrieval of a pointer to the data as it is
located in the file system data cache. This allows the kernel mode caller to
retrieve the data for the file without an additional data copy.
For example, the AFD driver (the kernel-mode driver underlying Winsock)
exports an API function that takes a socket handle and a file handle. The file contents are
"copied" directly to the corresponding communications socket. The AFD
driver accomplishes this task by sending an IRP_MJ_READ with the IRP_MN_MDL
minor operation. The FSD then retrieves an MDL describing the cached data (at Irp->MdlAddress) and completes
the request. When AFD has finished processing the operation, it must return the MDL to the FSD by sending an IRP_MJ_READ with the
IRP_MN_COMPLETE_MDL minor operation specified.
For a file system filter driver, the write operation
may be a bit more confusing. When a caller specifies IRP_MJ_WRITE/IRP_MN_MDL
the returned MDL may point to uninitialized data regions within the cache. That
is because the cache manager will refrain from reading the current data in from
disk (unless necessary) in anticipation of the caller replacing the data. When
the caller has updated the data, it releases the buffer by calling
IRP_MJ_WRITE/IRP_MN_COMPLETE_MDL. At that point the
data has been written back to the cache.
An FSD that is integrated with the cache manager can
implement these minor functions by calling CcMdlRead and CcPrepareMdlWrite. The corresponding functions
for completing these are CcMdlReadComplete and CcMdlWriteComplete. An FSD that is not integrated with the
cache manager can either indicate these operations are not supported (in which
case the caller must send a buffer and support the "standard"
read/write mechanism) or it can implement them in some other manner that is
appropriate.
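For an FSD integrated with the cache manager, the read-side handling amounts to pairing IRP_MN_MDL with CcMdlRead and IRP_MN_COMPLETE_MDL (the WDK name for the completion code) with CcMdlReadComplete. The sketch below models that pairing in user mode; the minor-code values match wdm.h, while FsdMdlRead, FsdMdlReadComplete, and the one-buffer cache model are invented for illustration.

```c
#include <assert.h>
#include <string.h>

/* Minor function codes as defined in wdm.h. */
#define IRP_MN_MDL          0x02
#define IRP_MN_COMPLETE_MDL 0x06

/* Stand-in for the file system's cached data and its outstanding-MDL state. */
static char g_cache[] = "cached file contents";
static int  g_mdlOutstanding = 0;

/* Models CcMdlRead: describe the cached data in place, with no data copy. */
static const char *FsdMdlRead(void)
{
    g_mdlOutstanding = 1;   /* caller now holds the cache pages via the MDL */
    return g_cache;
}

/* Models CcMdlReadComplete: the caller returns the MDL to the FSD. */
static void FsdMdlReadComplete(void)
{
    g_mdlOutstanding = 0;   /* cache pages may now be unlocked and recycled */
}

/* Dispatch the two minor codes the way an FSD read handler might. */
static const char *DispatchRead(unsigned char minor)
{
    switch (minor) {
    case IRP_MN_MDL:          return FsdMdlRead();
    case IRP_MN_COMPLETE_MDL: FsdMdlReadComplete(); return 0;
    default:                  return 0;  /* "normal" copy path not modeled */
    }
}
```

The write side follows the same shape with CcPrepareMdlWrite and CcMdlWriteComplete.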
Handling
FILE_COMPLETE_IF_OPLOCKED in a filter driver.
A filter
driver may be called on behalf of an application that has
specified the FILE_COMPLETE_IF_OPLOCKED create option. If the filter in turn calls ZwCreateFile, it may cause the
thread to deadlock.
A common problem for file systems is that of reentrancy. The Windows operating
system supports numerous reentrant operations. For example, an asynchronous
procedure call (APC) can be invoked within a given thread context as needed.
However, suppose an APC is delivered to a thread while
it is processing a file system request. Imagine that this APC in turn issues
another call into the file system. Recall that file systems utilize resource
locks internally to ensure correct operation. The file systems must also ensure
the correct order of lock acquisition in order to eliminate the possibility of
deadlocks arising. However it is not possible for the
file system to define a locking order in the face of arbitrary reentrancy!
To resolve this problem, the file systems disable certain types of reentrancy
that would not be safe. They do this by calling FsRtlEnterFileSystem when they enter a region of code that is not reentrant. When they leave that
region of code, they call FsRtlExitFileSystem to
enable reentrancy.
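In an FSD, the guard around a non-reentrant region takes the following shape. This is a user-mode model of the enter/exit nesting only; the underscored names are stand-ins for the real FsRtlEnterFileSystem/FsRtlExitFileSystem, which for normal kernel APCs behave like KeEnterCriticalRegion/KeLeaveCriticalRegion.

```c
#include <assert.h>

/* Models the per-thread kernel-APC-disable depth that the real
 * FsRtlEnterFileSystem / FsRtlExitFileSystem calls maintain. */
static int g_apcDisableDepth = 0;

static void FsRtlEnterFileSystem_(void) { g_apcDisableDepth++; }
static void FsRtlExitFileSystem_(void)  { g_apcDisableDepth--; }

static int ApcsDeliverable(void) { return g_apcDisableDepth == 0; }

/* Typical shape of a non-reentrant region in an FSD dispatch routine. */
static void DispatchSomething(void)
{
    FsRtlEnterFileSystem_();
    /* ... acquire FSD resources and do non-reentrant work ... */
    assert(!ApcsDeliverable());   /* normal kernel APCs are held off here */
    FsRtlExitFileSystem_();
}
```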
This is important to an understanding of this problem because the CIFS file
server uses oplocks as part of its cache consistency
mechanism between remote clients and local clients. This is
done using a "callback" mechanism, which is implemented using
APCs.
Normally, the FSD will block waiting for the completion of the APC that breaks
an oplock. Under certain circumstances, however, the
CIFS server thread that issued the operation requiring an oplock break is also the thread that must process the APC. Since the file system has
blocked APC delivery, and now the thread is blocked awaiting completion of the
APC, this approach leads to deadlock. Because of this, the Windows file system
developers introduced an additional option that advises the file system that if
an oplock break is required to process the
IRP_MJ_CREATE operation, it should not block, but instead should return a
special status code STATUS_OPLOCK_BREAK_IN_PROGRESS. This return value then
tells the caller that the file is not completely opened.
Instead, a subsequent call to the file system, using the
FSCTL_OPLOCK_BREAK_NOTIFY, must be made to ensure that
the oplock break has been completed.
Of course, this works because once the thread returns from the file system
driver with this status code, the pending APC can be delivered.
Note that FSCTL_OPLOCK_BREAK_NOTIFY, and the other calls for the oplock protocol, are documented in the Windows Platform
SDK.
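The resulting open sequence for a caller that cannot afford to block can be sketched as follows. This is a control-flow model only: invented stubs stand in for ZwCreateFile (with FILE_COMPLETE_IF_OPLOCKED specified) and for the FSCTL_OPLOCK_BREAK_NOTIFY call, and the status values are locally defined stand-ins rather than the real NTSTATUS codes.

```c
#include <assert.h>

/* Locally defined stand-ins for the NTSTATUS values (see ntstatus.h). */
#define STATUS_SUCCESS                  0
#define STATUS_OPLOCK_BREAK_IN_PROGRESS 0x108

static int g_oplockHeld = 1;   /* another client holds an oplock on the file */

/* Models ZwCreateFile with FILE_COMPLETE_IF_OPLOCKED: never blocks on the
 * break; instead reports that a break is now in progress. */
static int CreateCompleteIfOplocked(void)
{
    return g_oplockHeld ? STATUS_OPLOCK_BREAK_IN_PROGRESS : STATUS_SUCCESS;
}

/* Models FSCTL_OPLOCK_BREAK_NOTIFY: completes once the break has finished. */
static int OplockBreakNotify(void)
{
    g_oplockHeld = 0;          /* simulate the break being acknowledged */
    return STATUS_SUCCESS;
}

/* The non-deadlocking open sequence described in the text. */
static int OpenWithoutBlocking(void)
{
    int status = CreateCompleteIfOplocked();
    if (status == STATUS_OPLOCK_BREAK_IN_PROGRESS)
        status = OplockBreakNotify();  /* wait via a second, separate call */
    return status;
}
```

Because the wait happens in a separate call, the thread has left the file system between the two steps, so the pending APC can be delivered.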
What are the rules for my file
system/filter driver for handling paging I/O? What about paging file I/O?
The rules for
handling page faults are quite strict because incorrect handling can lead to
catastrophic system failure. For this reason, there are specific rules used to
ensure correct cooperation between file systems (and file system filter
drivers) and the virtual memory system. This is necessary because page faults are trapped by the VM system, but are then ultimately
satisfied by the file system (and associated storage stack). Thus, the file
system must not generate additional page faults because this may lead to an
"infinite" recursion. Normally, hardware platforms have a finite
limit to the number of page faults they will handle in a "stacked"
fashion.
Thus, the most reliable of the paging paths is that for the paging file. Any
access to the paging file must not generate any additional page faults.
Further, to avoid serialization problems, the file system is not responsible
for serializing this access - the paging file belongs to the VM system, and the
VM system is responsible for serializing access to this file (this eliminates
some of the re-entrant locking problems that occur with general file access).
To achieve this, the drivers involved must ensure they will not generate a page
fault. They must not call any routines that could generate a page fault (e.g., code
modules that can be paged). Although the file system is called at no higher
than APC_LEVEL, it should call only those routines that are safe to call at
DISPATCH_LEVEL, since such routines are guaranteed not to cause page
faults. None of the data used along this path may be pageable.
For all other files, paging I/O has less stringent rules. For a file system
driver the code paths cannot be paged - otherwise, the
page fault might be to fetch the very piece of code needed to satisfy the page
fault! For any file system, data may be paged, since
such a page fault can always be resolved eventually by retrieving the correct
contents off disk. Paging activity occurs at APC_LEVEL and thus limits
arbitrary re-entrancy in order to prevent calling
code paths that could generate yet more page faults.
IRP_MJ_CLEANUP vs IRP_MJ_CLOSE
The purpose
of IRP_MJ_CLEANUP is to indicate that the last handle reference against the
given file object has been released. The purpose of IRP_MJ_CLOSE is to indicate
that the last system reference against the given file object has been released.
This is because the operating system uses two distinct reference counts for any
object, including the file object. These values are stored within the object
header, with the HandleCount representing the number
of open handles against the object and the PointerCount representing the number of references against the object. Since the HandleCount always implies a reference (from a handle
table) to the object, the HandleCount is less than or
equal to the PointerCount.
Any kernel mode component may maintain a private reference to the object.
Routines such as ObReferenceObject, ObReferenceObjectByHandle, and IoGetDeviceObjectPointer all bump the reference count on a specific object. A kernel mode driver
releases that reference by using ObDereferenceObject to decrement the PointerCount on the given object.
A file system, or file system filter driver, will often see a long delay
between the IRP_MJ_CLEANUP and IRP_MJ_CLOSE because a component within the
operating system is maintaining a reference against the file object. Frequently,
this is because the memory manager maintains a reference against a file object
that is backing a section object. So long as the section object remains
"in use" the file object will be referenced.
Section objects, in turn, remain referenced for extended periods of time
because they are used by the memory manager in tracking the usage of memory for
file-backed shared memory regions (e.g., executables,
DLLs, memory mapped files). For example, the cache manager uses the section
object as part of its mappings of file system data within the cache. Thus, the period of time between the IRP_MJ_CLEANUP and the
IRP_MJ_CLOSE can be arbitrarily long.
The other complication here is that the memory manager uses only a single file
object to back the section object. Any subsequent file object created to access
that file will not be used to back the section and thus for these new file
objects the IRP_MJ_CLEANUP is typically followed by an IRP_MJ_CLOSE. Thus, the
first file object may be used for an extended period of time,
while subsequent file objects have a relatively short lifespan.
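The two-counter lifecycle can be made concrete with a small simulation: IRP_MJ_CLEANUP corresponds to the handle count reaching zero, IRP_MJ_CLOSE to the pointer count reaching zero. The FILE_OBJ structure and all function names below are invented for the illustration; in the kernel these counts live in the object header.

```c
#include <assert.h>

typedef struct {
    int handleCount;   /* open handles (each also holds a reference) */
    int pointerCount;  /* all references, so pointerCount >= handleCount */
    int cleanupSeen;   /* IRP_MJ_CLEANUP has been delivered */
    int closeSeen;     /* IRP_MJ_CLOSE has been delivered */
} FILE_OBJ;

static void OpenHandle(FILE_OBJ *f) { f->handleCount++; f->pointerCount++; }

/* Models ObReferenceObject: a private reference, no handle involved. */
static void Reference(FILE_OBJ *f)  { f->pointerCount++; }

/* Models ObDereferenceObject: last reference triggers IRP_MJ_CLOSE. */
static void Dereference(FILE_OBJ *f)
{
    if (--f->pointerCount == 0)
        f->closeSeen = 1;
}

/* Closing a handle: last handle triggers IRP_MJ_CLEANUP, and the handle's
 * own reference is then released. */
static void CloseUserHandle(FILE_OBJ *f)
{
    if (--f->handleCount == 0)
        f->cleanupSeen = 1;
    Dereference(f);
}
```

In the test below, a lingering reference (such as the memory manager's reference for a backing section) delays the "close" long after the "cleanup".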
What are the rules for managing MDLs
and User Buffers? How do I substitute my own buffer in an IRP?
In all
fairness, there are no "rules" for managing MDLs and user buffers.
There are suggestions that we can offer based upon observed behavior of the
file systems. First, we note that there are two basic sources of I/O operations
for a file system - the applications layer, and other operating system
components.
For application programs, most IRPs are still buffered. The operations for
which this is not necessarily the case are those that transfer larger
amounts of data: IRP_MJ_READ, IRP_MJ_WRITE, IRP_MJ_DIRECTORY_CONTROL,
IRP_MJ_QUERY_EA, IRP_MJ_SET_EA, IRP_MJ_QUERY_QUOTA, IRP_MJ_SET_QUOTA, and
the per-control-code options of IRP_MJ_DEVICE_CONTROL,
IRP_MJ_INTERNAL_DEVICE_CONTROL, and IRP_MJ_FILE_SYSTEM_CONTROL. If the
Flags field in the device object specifies DO_DIRECT_IO, then the buffer
for IRP_MJ_READ, IRP_MJ_WRITE, IRP_MJ_DIRECTORY_CONTROL, IRP_MJ_QUERY_EA,
IRP_MJ_SET_EA, IRP_MJ_QUERY_QUOTA, and IRP_MJ_SET_QUOTA is specified as a
memory descriptor list (MDL) pointed to by the MdlAddress field of the IRP.
If the Flags field in the device object specifies DO_BUFFERED_IO, then the
buffer for those same operations is a non-paged pool buffer pointed to by
the AssociatedIrp.SystemBuffer field of the IRP. The most common case for a
file system driver is that neither of these two flags is specified, in
which case the buffer for IRP_MJ_READ, IRP_MJ_WRITE,
IRP_MJ_DIRECTORY_CONTROL, IRP_MJ_QUERY_SECURITY, IRP_MJ_SET_SECURITY,
IRP_MJ_QUERY_EA, and IRP_MJ_SET_EA is a direct pointer to the
caller-supplied buffer via the UserBuffer field of the IRP.
Interestingly,
for IRP_MJ_QUERY_SECURITY and IRP_MJ_SET_SECURITY the buffer is always passed
as METHOD_NEITHER. Thus, the user buffer is pointed
to by Irp->UserBuffer. The file system is responsible for validating and
managing that buffer directly.
For the
control operations (IRP_MJ_DEVICE_CONTROL, IRP_MJ_INTERNAL_DEVICE_CONTROL,
and IRP_MJ_FILE_SYSTEM_CONTROL) the buffer description is a function of the
transfer method encoded in the specified control code, of which there are
four: METHOD_BUFFERED, METHOD_IN_DIRECT, METHOD_OUT_DIRECT, and
METHOD_NEITHER.
- METHOD_BUFFERED - the input data is in the buffer pointed to by
AssociatedIrp.SystemBuffer. Upon completion of the operation the output
data is in the same buffer. Transferring data between user and kernel mode
is handled by the I/O Manager.
- METHOD_IN_DIRECT - the input data is in the buffer pointed to by
AssociatedIrp.SystemBuffer. The secondary buffer is described by the MDL in
MdlAddress. When the I/O Manager probed and locked the memory for that
buffer, it probed it for read access (hence the IN part of this transfer
description). This often causes confusion because the secondary buffer is
conventionally called the output buffer, even though probing it for read
suggests it is being used as a secondary input buffer.
- METHOD_OUT_DIRECT - the input data is in the buffer pointed to by
AssociatedIrp.SystemBuffer. The secondary buffer is described by the MDL in
MdlAddress. When the I/O Manager probed and locked the memory for that
buffer, it probed it for write access (hence the OUT part of this transfer
description).
- METHOD_NEITHER - the input data is described by a pointer in the I/O
stack location (Parameters.DeviceIoControl.Type3InputBuffer). The output
data is described by a pointer in the IRP (UserBuffer). In both cases these
pointers are direct virtual address references to the original buffer. The
memory may, or may not, be valid.
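Which of the four methods applies is encoded in the low two bits of the control code itself, which is how the I/O Manager (and a filter) can decide at runtime. The macros below reproduce the standard WDK control-code encoding:

```c
#include <assert.h>

/* The standard control-code encoding from the WDK headers. */
#define METHOD_BUFFERED   0
#define METHOD_IN_DIRECT  1
#define METHOD_OUT_DIRECT 2
#define METHOD_NEITHER    3
#define CTL_CODE(DeviceType, Function, Method, Access) \
    (((DeviceType) << 16) | ((Access) << 14) | ((Function) << 2) | (Method))

/* The transfer method is always the low two bits of the code. */
#define METHOD_FROM_CTL_CODE(code) ((code) & 3)

#define FILE_DEVICE_UNKNOWN 0x22
#define FILE_ANY_ACCESS     0
```

A filter that sees an unfamiliar IOCTL can therefore at least determine how its buffers are described, even without knowing what the operation does.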
Regardless of the transfer method, any pointers stored within these buffers
are direct virtual memory references. For example, a driver that accepts a
further buffer pointer embedded in the input must treat it as ordinary
direct access to the application's address space.
For any driver, attempting to access a user buffer directly can lead to an
invalid memory reference. Such references cause the memory manager to raise
an exception (via ExRaiseStatus, for instance). If a driver has not
protected against such exceptions, the default kernel exception handler
will be invoked. This handler calls KeBugCheckEx, indicating
KMODE_EXCEPTION_NOT_HANDLED, with the exception reported as
STATUS_ACCESS_VIOLATION (0xC0000005). Protecting against such exceptions
requires the use of structured exception handling, which is described
elsewhere (see Question 1.38 for more information).
A normal driver will associate an MDL it creates to describe the user buffer
with the IRP. This is useful because it ensures that when the IRP is completed
the I/O Manager will clean up these MDLs, which eliminates the need for the
driver to provide special handling for such clean-up.
As a result, it is normal that if there is both an MdlAddress and UserBuffer address, the MDL describes the
corresponding user address (you can confirm this by comparing the value
returned by MmGetMdlVirtualAddress with the value
stored in the UserBuffer field). Of course, it is possible that a driver might associate multiple MDLs
with a single IRP (using the Next field of the MDL itself). This could be done
explicitly (by setting the field directly) or implicitly (by using IoAllocateMdl and indicating TRUE for the SecondaryBuffer parameter). This could be problematic for file system filter drivers, should a file system be
implemented to exploit this behavior.
The other source of I/O operations is from other OS components. It is
acceptable for such OS components to use the same access mechanism used by
user-mode components, in which case the earlier analysis still applies. In
addition, kernel mode components may utilize direct I/O - regardless of the
value in the Flags field of the DEVICE_OBJECT for the given file system. For
example, paging I/O operations are always submitted to the file system
using MDLs that describe the physical memory that will be used to store the
data. File systems should not reference the memory pointed to by
Irp->UserBuffer, although this address will appear to be valid (and will
even match the value returned by MmGetMdlVirtualAddress). The address is
not, in fact, valid, but it may be used by the file system when
constructing multiple sub-operations. Such Memory Manager MDLs cannot be
used for direct access to that memory, because the buffers have not been
mapped into memory and do not yet contain the correct data.
For a file system filter driver that wishes to modify
the data in some way, it is important to keep in mind the use of that memory.
For example, a traditional mistake for an encryption filter is to trap the
IRP_MJ_WRITE where the IRP_NOCACHE bit is set (which catches both user
non-cached I/O as well as paging I/O) and, using the provided MDL or user
buffer, encrypt the data in-place. The risk here is that some other thread will
gain access to that memory in its encrypted state. For example, if the file is
memory mapped, the application will observe the modified data, rather than the
original, cleartext data. Thus, there are a few rules
that need to be observed by file system filter drivers that choose to modify
the data buffers associated with a given IRP:
- The IRP_PAGING_IO bit changes the completion behavior of the I/O Manager.
MDLs in such IRPs are not discarded or cleaned up by the I/O Manager,
because they belong to the Memory Manager (see IoPageRead as an example).
Thus, filter drivers should be careful when setting this bit (e.g., if they
create a new IRP and send it down with the resulting data).
- Irp->UserBuffer must have the same value as is returned by
MmGetMdlVirtualAddress. If it does not, and the underlying file system must
break up the I/O operation into a series of sub-operations, it will do so
incorrectly (see how this is handled in deviosup.c in the FAT file system
example in the IFS Kit, which builds a partial MDL using IoBuildPartialMdl
and uses Irp->UserBuffer as an index reference for Irp->MdlAddress). If
substituting a new piece of memory (such as for the encryption driver),
make sure this field is set correctly.
- Never modify the buffer provided by the caller unless you are willing to
make those changes visible to the caller immediately. Keep in mind that in
a multi-threaded, shared-memory operating system the changes are -
literally - visible to other threads/processes/processors as you make them.
Changes should instead be made to a separate buffer, which can then be used
in lieu of the original buffer, either within the original IRP or within a
new IRP created for that task.
- Use the correct routine for the type of buffer (e.g.,
MmBuildMdlForNonPagedPool if the memory is allocated from non-paged pool).
- Any reference to a pointer within the user's address space must be
protected using __try and __except in order to prevent invalid user
addresses from causing the system to crash.
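The second rule - keeping Irp->UserBuffer consistent with the MDL - can be enforced mechanically, since MmGetMdlVirtualAddress is just the MDL's starting virtual address plus its byte offset. The MDL and IRP structures below are heavily reduced stand-ins for illustration only:

```c
#include <assert.h>

/* Reduced stand-ins for the kernel structures (illustration only). */
typedef struct { void *StartVa; unsigned long ByteOffset; } MDL;
typedef struct { MDL *MdlAddress; void *UserBuffer; } IRP;

/* What the WDK's MmGetMdlVirtualAddress macro computes. */
static void *MdlVirtualAddress(const MDL *mdl)
{
    return (char *)mdl->StartVa + mdl->ByteOffset;
}

/* A filter substituting its own buffer must keep the two views consistent,
 * or a lower file system that splits the I/O will index incorrectly. */
static void SubstituteBuffer(IRP *irp, MDL *newMdl)
{
    irp->MdlAddress = newMdl;
    irp->UserBuffer = MdlVirtualAddress(newMdl);  /* keep them in sync */
}
```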
What are the issues with respect to
IRQL APC_LEVEL? What does it do? Why should I use (or not use) FsRtlEnterFileSystem?
Windows is designed to be a fully re-entrant operating system. Thus,
in general, kernel components may make calls back into the OS without worrying
about deadlocks or other potential re-entrancy problems.
Windows also provides out-of-band execution of operations, such as asynchronous
procedure calls (APCs). An APC is a mechanism that
allows the operating system to invoke a given routine within a specific thread
context. This is, in turn, done by using a queue of
pending APC objects that is stored in the control structure used by the OS for
tracking a thread's state. Periodically, the kernel checks to see if there are pending APC objects that need to be processed by the
given thread (where "periodically" is an arbitrary decision of the
operating system). From a pragmatic programming standpoint, an APC can be
"delivered" (that is, the routine can be called) by a thread between
any two instructions. The delivery of APCs can be blocked by kernel code using
one of two mechanisms:
- Kernel APC delivery may be disabled by using KeEnterCriticalRegion and
re-enabled by using KeLeaveCriticalRegion. Note that special kernel APCs
are not disabled by this mechanism.
- Special kernel APC delivery may be disabled by raising the IRQL of the
thread to APC_LEVEL (or higher).
There are numerous uses for this; some of them stem from the nature of
threads and APCs on Windows. First, we note that a given
thread is restricted to running on only a single processor at any given time.
Thus, the operating system can eliminate multi-processor serialization issues
by requiring that an operation be done in one specific
thread context. For example, each thread maintains a list of outstanding I/O
operations it has initiated (this is ETHREAD->IrpList).
This list is only modified in the given thread's
context. By exploiting this, and by raising the IRQL to APC_LEVEL, the list
can be safely modified without resorting to more expensive
locking mechanisms (such as spin locks).
The primary disadvantage of running at APC_LEVEL is that it prevents I/O
completion APC routines from running. This in turn means that the driver
must handle completion carefully: it cannot use the Zw routines, and when
sending an IRP it must use its own signaling mechanism from its completion
routine to learn that the I/O has finished.