1.. SPDX-License-Identifier: GPL-2.0
2.. _iomap_operations:
3
4..
5        Dumb style notes to maintain the author's sanity:
6        Please try to start sentences on separate lines so that
7        sentence changes don't bleed colors in diff.
8        Heading decorations are documented in sphinx.rst.
9
10=========================
11Supported File Operations
12=========================
13
14.. contents:: Table of Contents
15   :local:
16
17Below are a discussion of the high level file operations that iomap
18implements.
19
20Buffered I/O
21============
22
23Buffered I/O is the default file I/O path in Linux.
24File contents are cached in memory ("pagecache") to satisfy reads and
25writes.
26Dirty cache will be written back to disk at some point that can be
27forced via ``fsync`` and variants.
28
29iomap implements nearly all the folio and pagecache management that
30filesystems have to implement themselves under the legacy I/O model.
31This means that the filesystem need not know the details of allocating,
32mapping, managing uptodate and dirty state, or writeback of pagecache
33folios.
34Under the legacy I/O model, this was managed very inefficiently with
35linked lists of buffer heads instead of the per-folio bitmaps that iomap
36uses.
37Unless the filesystem explicitly opts in to buffer heads, they will not
38be used, which makes buffered I/O much more efficient, and the pagecache
39maintainer much happier.
40
41``struct address_space_operations``
42-----------------------------------
43
44The following iomap functions can be referenced directly from the
45address space operations structure:
46
47 * ``iomap_dirty_folio``
48 * ``iomap_release_folio``
49 * ``iomap_invalidate_folio``
50 * ``iomap_is_partially_uptodate``
51
52The following address space operations can be wrapped easily:
53
54 * ``read_folio``
55 * ``readahead``
56 * ``writepages``
57 * ``bmap``
58 * ``swap_activate``
59
60``struct iomap_write_ops``
61--------------------------
62
63.. code-block:: c
64
65 struct iomap_write_ops {
66     struct folio *(*get_folio)(struct iomap_iter *iter, loff_t pos,
67                                unsigned len);
68     void (*put_folio)(struct inode *inode, loff_t pos, unsigned copied,
69                       struct folio *folio);
70     bool (*iomap_valid)(struct inode *inode, const struct iomap *iomap);
71     int (*read_folio_range)(const struct iomap_iter *iter,
72     			struct folio *folio, loff_t pos, size_t len);
73 };
74
75iomap calls these functions:
76
77  - ``get_folio``: Called to allocate and return an active reference to
78    a locked folio prior to starting a write.
79    If this function is not provided, iomap will call
80    ``iomap_get_folio``.
81    This could be used to `set up per-folio filesystem state
82    <https://lore.kernel.org/all/20190429220934.10415-5-agruenba@redhat.com/>`_
83    for a write.
84
85  - ``put_folio``: Called to unlock and put a folio after a pagecache
86    operation completes.
87    If this function is not provided, iomap will ``folio_unlock`` and
88    ``folio_put`` on its own.
89    This could be used to `commit per-folio filesystem state
90    <https://lore.kernel.org/all/20180619164137.13720-6-hch@lst.de/>`_
91    that was set up by ``->get_folio``.
92
93  - ``iomap_valid``: The filesystem may not hold locks between
94    ``->iomap_begin`` and ``->iomap_end`` because pagecache operations
95    can take folio locks, fault on userspace pages, initiate writeback
96    for memory reclamation, or engage in other time-consuming actions.
97    If a file's space mapping data are mutable, it is possible that the
98    mapping for a particular pagecache folio can `change in the time it
99    takes
100    <https://lore.kernel.org/all/20221123055812.747923-8-david@fromorbit.com/>`_
101    to allocate, install, and lock that folio.
102
103    For the pagecache, races can happen if writeback doesn't take
104    ``i_rwsem`` or ``invalidate_lock`` and updates mapping information.
105    Races can also happen if the filesystem allows concurrent writes.
106    For such files, the mapping *must* be revalidated after the folio
107    lock has been taken so that iomap can manage the folio correctly.
108
109    fsdax does not need this revalidation because there's no writeback
110    and no support for unwritten extents.
111
112    Filesystems subject to this kind of race must provide a
113    ``->iomap_valid`` function to decide if the mapping is still valid.
114    If the mapping is not valid, the mapping will be sampled again.
115
116    To support making the validity decision, the filesystem's
117    ``->iomap_begin`` function may set ``struct iomap::validity_cookie``
118    at the same time that it populates the other iomap fields.
119    A simple validation cookie implementation is a sequence counter.
120    If the filesystem bumps the sequence counter every time it modifies
121    the inode's extent map, it can be placed in the ``struct
122    iomap::validity_cookie`` during ``->iomap_begin``.
123    If the value in the cookie is found to be different to the value
124    the filesystem holds when the mapping is passed back to
125    ``->iomap_valid``, then the iomap should considered stale and the
126    validation failed.
127
128  - ``read_folio_range``: Called to synchronously read in the range that will
129    be written to. If this function is not provided, iomap will default to
130    submitting a bio read request.
131
132These ``struct kiocb`` flags are significant for buffered I/O with iomap:
133
134 * ``IOCB_NOWAIT``: Turns on ``IOMAP_NOWAIT``.
135
136 * ``IOCB_DONTCACHE``: Turns on ``IOMAP_DONTCACHE``.
137
138Internal per-Folio State
139------------------------
140
141If the fsblock size matches the size of a pagecache folio, it is assumed
142that all disk I/O operations will operate on the entire folio.
143The uptodate (memory contents are at least as new as what's on disk) and
144dirty (memory contents are newer than what's on disk) status of the
145folio are all that's needed for this case.
146
147If the fsblock size is less than the size of a pagecache folio, iomap
148tracks the per-fsblock uptodate and dirty state itself.
149This enables iomap to handle both "bs < ps" `filesystems
150<https://lore.kernel.org/all/20230725122932.144426-1-ritesh.list@gmail.com/>`_
151and large folios in the pagecache.
152
153iomap internally tracks two state bits per fsblock:
154
155 * ``uptodate``: iomap will try to keep folios fully up to date.
156   If there are read(ahead) errors, those fsblocks will not be marked
157   uptodate.
158   The folio itself will be marked uptodate when all fsblocks within the
159   folio are uptodate.
160
161 * ``dirty``: iomap will set the per-block dirty state when programs
162   write to the file.
163   The folio itself will be marked dirty when any fsblock within the
164   folio is dirty.
165
166iomap also tracks the amount of read and write disk IOs that are in
167flight.
168This structure is much lighter weight than ``struct buffer_head``
169because there is only one per folio, and the per-fsblock overhead is two
170bits vs. 104 bytes.
171
172Filesystems wishing to turn on large folios in the pagecache should call
173``mapping_set_large_folios`` when initializing the incore inode.
174
175Buffered Readahead and Reads
176----------------------------
177
178The ``iomap_readahead`` function initiates readahead to the pagecache.
179The ``iomap_read_folio`` function reads one folio's worth of data into
180the pagecache.
181The ``flags`` argument to ``->iomap_begin`` will be set to zero.
182The pagecache takes whatever locks it needs before calling the
183filesystem.
184
185Buffered Writes
186---------------
187
188The ``iomap_file_buffered_write`` function writes an ``iocb`` to the
189pagecache.
190``IOMAP_WRITE`` or ``IOMAP_WRITE`` | ``IOMAP_NOWAIT`` will be passed as
191the ``flags`` argument to ``->iomap_begin``.
192Callers commonly take ``i_rwsem`` in either shared or exclusive mode
193before calling this function.
194
195mmap Write Faults
196~~~~~~~~~~~~~~~~~
197
198The ``iomap_page_mkwrite`` function handles a write fault to a folio in
199the pagecache.
200``IOMAP_WRITE | IOMAP_FAULT`` will be passed as the ``flags`` argument
201to ``->iomap_begin``.
202Callers commonly take the mmap ``invalidate_lock`` in shared or
203exclusive mode before calling this function.
204
205Buffered Write Failures
206~~~~~~~~~~~~~~~~~~~~~~~
207
208After a short write to the pagecache, the areas not written will not
209become marked dirty.
210The filesystem must arrange to `cancel
211<https://lore.kernel.org/all/20221123055812.747923-6-david@fromorbit.com/>`_
212such `reservations
213<https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/>`_
214because writeback will not consume the reservation.
215The ``iomap_write_delalloc_release`` can be called from a
216``->iomap_end`` function to find all the clean areas of the folios
217caching a fresh (``IOMAP_F_NEW``) delalloc mapping.
218It takes the ``invalidate_lock``.
219
220The filesystem must supply a function ``punch`` to be called for
221each file range in this state.
222This function must *only* remove delayed allocation reservations, in
223case another thread racing with the current thread writes successfully
224to the same region and triggers writeback to flush the dirty data out to
225disk.
226
227Zeroing for File Operations
228~~~~~~~~~~~~~~~~~~~~~~~~~~~
229
230Filesystems can call ``iomap_zero_range`` to perform zeroing of the
231pagecache for non-truncation file operations that are not aligned to
232the fsblock size.
233``IOMAP_ZERO`` will be passed as the ``flags`` argument to
234``->iomap_begin``.
235Callers typically hold ``i_rwsem`` and ``invalidate_lock`` in exclusive
236mode before calling this function.
237
238Unsharing Reflinked File Data
239~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
240
241Filesystems can call ``iomap_file_unshare`` to force a file sharing
242storage with another file to preemptively copy the shared data to newly
243allocate storage.
244``IOMAP_WRITE | IOMAP_UNSHARE`` will be passed as the ``flags`` argument
245to ``->iomap_begin``.
246Callers typically hold ``i_rwsem`` and ``invalidate_lock`` in exclusive
247mode before calling this function.
248
249Truncation
250----------
251
252Filesystems can call ``iomap_truncate_page`` to zero the bytes in the
253pagecache from EOF to the end of the fsblock during a file truncation
254operation.
255``truncate_setsize`` or ``truncate_pagecache`` will take care of
256everything after the EOF block.
257``IOMAP_ZERO`` will be passed as the ``flags`` argument to
258``->iomap_begin``.
259Callers typically hold ``i_rwsem`` and ``invalidate_lock`` in exclusive
260mode before calling this function.
261
262Pagecache Writeback
263-------------------
264
265Filesystems can call ``iomap_writepages`` to respond to a request to
266write dirty pagecache folios to disk.
267The ``mapping`` and ``wbc`` parameters should be passed unchanged.
268The ``wpc`` pointer should be allocated by the filesystem and must
269be initialized to zero.
270
271The pagecache will lock each folio before trying to schedule it for
272writeback.
273It does not lock ``i_rwsem`` or ``invalidate_lock``.
274
275The dirty bit will be cleared for all folios run through the
276``->writeback_range`` machinery described below even if the writeback fails.
277This is to prevent dirty folio clots when storage devices fail; an
278``-EIO`` is recorded for userspace to collect via ``fsync``.
279
280The ``ops`` structure must be specified and is as follows:
281
282``struct iomap_writeback_ops``
283~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
284
285.. code-block:: c
286
287 struct iomap_writeback_ops {
288    int (*writeback_range)(struct iomap_writepage_ctx *wpc,
289        struct folio *folio, u64 pos, unsigned int len, u64 end_pos);
290    int (*writeback_submit)(struct iomap_writepage_ctx *wpc, int error);
291 };
292
293The fields are as follows:
294
295  - ``writeback_range``: Sets ``wpc->iomap`` to the space mapping of the file
296    range (in bytes) given by ``offset`` and ``len``.
297    iomap calls this function for each dirty fs block in each dirty folio,
298    though it will `reuse mappings
299    <https://lore.kernel.org/all/20231207072710.176093-15-hch@lst.de/>`_
300    for runs of contiguous dirty fsblocks within a folio.
301    Do not return ``IOMAP_INLINE`` mappings here; the ``->iomap_end``
302    function must deal with persisting written data.
303    Do not return ``IOMAP_DELALLOC`` mappings here; iomap currently
304    requires mapping to allocated space.
305    Filesystems can skip a potentially expensive mapping lookup if the
306    mappings have not changed.
307    This revalidation must be open-coded by the filesystem; it is
308    unclear if ``iomap::validity_cookie`` can be reused for this
309    purpose.
310
311    If this methods fails to schedule I/O for any part of a dirty folio, it
312    should throw away any reservations that may have been made for the write.
313    The folio will be marked clean and an ``-EIO`` recorded in the
314    pagecache.
315    Filesystems can use this callback to `remove
316    <https://lore.kernel.org/all/20201029163313.1766967-1-bfoster@redhat.com/>`_
317    delalloc reservations to avoid having delalloc reservations for
318    clean pagecache.
319    This function must be supplied by the filesystem.
320
321  - ``writeback_submit``: Submit the previous built writeback context.
322    Block based file systems should use the iomap_ioend_writeback_submit
323    helper, other file system can implement their own.
324    File systems can optionall to hook into writeback bio submission.
325    This might include pre-write space accounting updates, or installing
326    a custom ``->bi_end_io`` function for internal purposes, such as
327    deferring the ioend completion to a workqueue to run metadata update
328    transactions from process context before submitting the bio.
329    This function must be supplied by the filesystem.
330
331Pagecache Writeback Completion
332~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
333
334To handle the bookkeeping that must happen after disk I/O for writeback
335completes, iomap creates chains of ``struct iomap_ioend`` objects that
336wrap the ``bio`` that is used to write pagecache data to disk.
337By default, iomap finishes writeback ioends by clearing the writeback
338bit on the folios attached to the ``ioend``.
339If the write failed, it will also set the error bits on the folios and
340the address space.
341This can happen in interrupt or process context, depending on the
342storage device.
343Filesystems that need to update internal bookkeeping (e.g. unwritten
344extent conversions) should set their own bi_end_io on the bios
345submitted by ``->submit_writeback``
346This function should call ``iomap_finish_ioends`` after finishing its
347own work (e.g. unwritten extent conversion).
348
349Some filesystems may wish to `amortize the cost of running metadata
350transactions
351<https://lore.kernel.org/all/20220120034733.221737-1-david@fromorbit.com/>`_
352for post-writeback updates by batching them.
353They may also require transactions to run from process context, which
354implies punting batches to a workqueue.
355iomap ioends contain a ``list_head`` to enable batching.
356
357Given a batch of ioends, iomap has a few helpers to assist with
358amortization:
359
360 * ``iomap_sort_ioends``: Sort all the ioends in the list by file
361   offset.
362
363 * ``iomap_ioend_try_merge``: Given an ioend that is not in any list and
364   a separate list of sorted ioends, merge as many of the ioends from
365   the head of the list into the given ioend.
366   ioends can only be merged if the file range and storage addresses are
367   contiguous; the unwritten and shared status are the same; and the
368   write I/O outcome is the same.
369   The merged ioends become their own list.
370
371 * ``iomap_finish_ioends``: Finish an ioend that possibly has other
372   ioends linked to it.
373
374Direct I/O
375==========
376
377In Linux, direct I/O is defined as file I/O that is issued directly to
378storage, bypassing the pagecache.
379The ``iomap_dio_rw`` function implements O_DIRECT (direct I/O) reads and
380writes for files.
381
382.. code-block:: c
383
384 ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
385                      const struct iomap_ops *ops,
386                      const struct iomap_dio_ops *dops,
387                      unsigned int dio_flags, void *private,
388                      size_t done_before);
389
390The filesystem can provide the ``dops`` parameter if it needs to perform
391extra work before or after the I/O is issued to storage.
392The ``done_before`` parameter tells the how much of the request has
393already been transferred.
394It is used to continue a request asynchronously when `part of the
395request
396<https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c03098d4b9ad76bca2966a8769dcfe59f7f85103>`_
397has already been completed synchronously.
398
399The ``done_before`` parameter should be set if writes for the ``iocb``
400have been initiated prior to the call.
401The direction of the I/O is determined from the ``iocb`` passed in.
402
403The ``dio_flags`` argument can be set to any combination of the
404following values:
405
406 * ``IOMAP_DIO_FORCE_WAIT``: Wait for the I/O to complete even if the
407   kiocb is not synchronous.
408
409 * ``IOMAP_DIO_OVERWRITE_ONLY``: Perform a pure overwrite for this range
410   or fail with ``-EAGAIN``.
411   This can be used by filesystems with complex unaligned I/O
412   write paths to provide an optimised fast path for unaligned writes.
413   If a pure overwrite can be performed, then serialisation against
414   other I/Os to the same filesystem block(s) is unnecessary as there is
415   no risk of stale data exposure or data loss.
416   If a pure overwrite cannot be performed, then the filesystem can
417   perform the serialisation steps needed to provide exclusive access
418   to the unaligned I/O range so that it can perform allocation and
419   sub-block zeroing safely.
420   Filesystems can use this flag to try to reduce locking contention,
421   but a lot of `detailed checking
422   <https://lore.kernel.org/linux-ext4/20230314130759.642710-1-bfoster@redhat.com/>`_
423   is required to do it `correctly
424   <https://lore.kernel.org/linux-ext4/20230810165559.946222-1-bfoster@redhat.com/>`_.
425
426 * ``IOMAP_DIO_PARTIAL``: If a page fault occurs, return whatever
427   progress has already been made.
428   The caller may deal with the page fault and retry the operation.
429   If the caller decides to retry the operation, it should pass the
430   accumulated return values of all previous calls as the
431   ``done_before`` parameter to the next call.
432
433These ``struct kiocb`` flags are significant for direct I/O with iomap:
434
435 * ``IOCB_NOWAIT``: Turns on ``IOMAP_NOWAIT``.
436
437 * ``IOCB_SYNC``: Ensure that the device has persisted data to disk
438   before completing the call.
439   In the case of pure overwrites, the I/O may be issued with FUA
440   enabled.
441
442 * ``IOCB_HIPRI``: Poll for I/O completion instead of waiting for an
443   interrupt.
444   Only meaningful for asynchronous I/O, and only if the entire I/O can
445   be issued as a single ``struct bio``.
446
447 * ``IOCB_DIO_CALLER_COMP``: Try to run I/O completion from the caller's
448   process context.
449   See ``linux/fs.h`` for more details.
450
451Filesystems should call ``iomap_dio_rw`` from ``->read_iter`` and
452``->write_iter``, and set ``FMODE_CAN_ODIRECT`` in the ``->open``
453function for the file.
454They should not set ``->direct_IO``, which is deprecated.
455
456If a filesystem wishes to perform its own work before direct I/O
457completion, it should call ``__iomap_dio_rw``.
458If its return value is not an error pointer or a NULL pointer, the
459filesystem should pass the return value to ``iomap_dio_complete`` after
460finishing its internal work.
461
462Return Values
463-------------
464
465``iomap_dio_rw`` can return one of the following:
466
467 * A non-negative number of bytes transferred.
468
469 * ``-ENOTBLK``: Fall back to buffered I/O.
470   iomap itself will return this value if it cannot invalidate the page
471   cache before issuing the I/O to storage.
472   The ``->iomap_begin`` or ``->iomap_end`` functions may also return
473   this value.
474
475 * ``-EIOCBQUEUED``: The asynchronous direct I/O request has been
476   queued and will be completed separately.
477
478 * Any of the other negative error codes.
479
480Direct Reads
481------------
482
483A direct I/O read initiates a read I/O from the storage device to the
484caller's buffer.
485Dirty parts of the pagecache are flushed to storage before initiating
486the read io.
487The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DIRECT`` with
488any combination of the following enhancements:
489
490 * ``IOMAP_NOWAIT``, as defined previously.
491
492Callers commonly hold ``i_rwsem`` in shared mode before calling this
493function.
494
495Direct Writes
496-------------
497
498A direct I/O write initiates a write I/O to the storage device from the
499caller's buffer.
500Dirty parts of the pagecache are flushed to storage before initiating
501the write io.
502The pagecache is invalidated both before and after the write io.
503The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DIRECT |
504IOMAP_WRITE`` with any combination of the following enhancements:
505
506 * ``IOMAP_NOWAIT``, as defined previously.
507
508 * ``IOMAP_OVERWRITE_ONLY``: Allocating blocks and zeroing partial
509   blocks is not allowed.
510   The entire file range must map to a single written or unwritten
511   extent.
512   The file I/O range must be aligned to the filesystem block size
513   if the mapping is unwritten and the filesystem cannot handle zeroing
514   the unaligned regions without exposing stale contents.
515
516 * ``IOMAP_ATOMIC``: This write is being issued with torn-write
517   protection.
518   Torn-write protection may be provided based on HW-offload or by a
519   software mechanism provided by the filesystem.
520
521   For HW-offload based support, only a single bio can be created for the
522   write, and the write must not be split into multiple I/O requests, i.e.
523   flag REQ_ATOMIC must be set.
524   The file range to write must be aligned to satisfy the requirements
525   of both the filesystem and the underlying block device's atomic
526   commit capabilities.
527   If filesystem metadata updates are required (e.g. unwritten extent
528   conversion or copy-on-write), all updates for the entire file range
529   must be committed atomically as well.
530   Untorn-writes may be longer than a single file block. In all cases,
531   the mapping start disk block must have at least the same alignment as
532   the write offset.
533   The filesystems must set IOMAP_F_ATOMIC_BIO to inform iomap core of an
534   untorn-write based on HW-offload.
535
536   For untorn-writes based on a software mechanism provided by the
537   filesystem, all the disk block alignment and single bio restrictions
538   which apply for HW-offload based untorn-writes do not apply.
539   The mechanism would typically be used as a fallback for when
540   HW-offload based untorn-writes may not be issued, e.g. the range of the
541   write covers multiple extents, meaning that it is not possible to issue
542   a single bio.
543   All filesystem metadata updates for the entire file range must be
544   committed atomically as well.
545
546Callers commonly hold ``i_rwsem`` in shared or exclusive mode before
547calling this function.
548
549``struct iomap_dio_ops:``
550-------------------------
551.. code-block:: c
552
553 struct iomap_dio_ops {
554     void (*submit_io)(const struct iomap_iter *iter, struct bio *bio,
555                       loff_t file_offset);
556     int (*end_io)(struct kiocb *iocb, ssize_t size, int error,
557                   unsigned flags);
558     struct bio_set *bio_set;
559 };
560
561The fields of this structure are as follows:
562
563  - ``submit_io``: iomap calls this function when it has constructed a
564    ``struct bio`` object for the I/O requested, and wishes to submit it
565    to the block device.
566    If no function is provided, ``submit_bio`` will be called directly.
567    Filesystems that would like to perform additional work before (e.g.
568    data replication for btrfs) should implement this function.
569
570  - ``end_io``: This is called after the ``struct bio`` completes.
571    This function should perform post-write conversions of unwritten
572    extent mappings, handle write failures, etc.
573    The ``flags`` argument may be set to a combination of the following:
574
575    * ``IOMAP_DIO_UNWRITTEN``: The mapping was unwritten, so the ioend
576      should mark the extent as written.
577
578    * ``IOMAP_DIO_COW``: Writing to the space in the mapping required a
579      copy on write operation, so the ioend should switch mappings.
580
581  - ``bio_set``: This allows the filesystem to provide a custom bio_set
582    for allocating direct I/O bios.
583    This enables filesystems to `stash additional per-bio information
584    <https://lore.kernel.org/all/20220505201115.937837-3-hch@lst.de/>`_
585    for private use.
586    If this field is NULL, generic ``struct bio`` objects will be used.
587
588Filesystems that want to perform extra work after an I/O completion
589should set a custom ``->bi_end_io`` function via ``->submit_io``.
590Afterwards, the custom endio function must call
591``iomap_dio_bio_end_io`` to finish the direct I/O.
592
593DAX I/O
594=======
595
596Some storage devices can be directly mapped as memory.
597These devices support a new access mode known as "fsdax" that allows
598loads and stores through the CPU and memory controller.
599
600fsdax Reads
601-----------
602
603A fsdax read performs a memcpy from storage device to the caller's
604buffer.
605The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DAX`` with any
606combination of the following enhancements:
607
608 * ``IOMAP_NOWAIT``, as defined previously.
609
610Callers commonly hold ``i_rwsem`` in shared mode before calling this
611function.
612
613fsdax Writes
614------------
615
616A fsdax write initiates a memcpy to the storage device from the caller's
617buffer.
618The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DAX |
619IOMAP_WRITE`` with any combination of the following enhancements:
620
621 * ``IOMAP_NOWAIT``, as defined previously.
622
623 * ``IOMAP_OVERWRITE_ONLY``: The caller requires a pure overwrite to be
624   performed from this mapping.
625   This requires the filesystem extent mapping to already exist as an
626   ``IOMAP_MAPPED`` type and span the entire range of the write I/O
627   request.
628   If the filesystem cannot map this request in a way that allows the
629   iomap infrastructure to perform a pure overwrite, it must fail the
630   mapping operation with ``-EAGAIN``.
631
632Callers commonly hold ``i_rwsem`` in exclusive mode before calling this
633function.
634
635fsdax mmap Faults
636~~~~~~~~~~~~~~~~~
637
638The ``dax_iomap_fault`` function handles read and write faults to fsdax
639storage.
640For a read fault, ``IOMAP_DAX | IOMAP_FAULT`` will be passed as the
641``flags`` argument to ``->iomap_begin``.
642For a write fault, ``IOMAP_DAX | IOMAP_FAULT | IOMAP_WRITE`` will be
643passed as the ``flags`` argument to ``->iomap_begin``.
644
645Callers commonly hold the same locks as they do to call their iomap
646pagecache counterparts.
647
648fsdax Truncation, fallocate, and Unsharing
649------------------------------------------
650
651For fsdax files, the following functions are provided to replace their
652iomap pagecache I/O counterparts.
653The ``flags`` argument to ``->iomap_begin`` are the same as the
654pagecache counterparts, with ``IOMAP_DAX`` added.
655
656 * ``dax_file_unshare``
657 * ``dax_zero_range``
658 * ``dax_truncate_page``
659
660Callers commonly hold the same locks as they do to call their iomap
661pagecache counterparts.
662
663fsdax Deduplication
664-------------------
665
666Filesystems implementing the ``FIDEDUPERANGE`` ioctl must call the
667``dax_remap_file_range_prep`` function with their own iomap read ops.
668
669Seeking Files
670=============
671
672iomap implements the two iterating whence modes of the ``llseek`` system
673call.
674
675SEEK_DATA
676---------
677
678The ``iomap_seek_data`` function implements the SEEK_DATA "whence" value
679for llseek.
680``IOMAP_REPORT`` will be passed as the ``flags`` argument to
681``->iomap_begin``.
682
683For unwritten mappings, the pagecache will be searched.
684Regions of the pagecache with a folio mapped and uptodate fsblocks
685within those folios will be reported as data areas.
686
687Callers commonly hold ``i_rwsem`` in shared mode before calling this
688function.
689
690SEEK_HOLE
691---------
692
693The ``iomap_seek_hole`` function implements the SEEK_HOLE "whence" value
694for llseek.
695``IOMAP_REPORT`` will be passed as the ``flags`` argument to
696``->iomap_begin``.
697
698For unwritten mappings, the pagecache will be searched.
699Regions of the pagecache with no folio mapped, or a !uptodate fsblock
700within a folio will be reported as sparse hole areas.
701
702Callers commonly hold ``i_rwsem`` in shared mode before calling this
703function.
704
705Swap File Activation
706====================
707
708The ``iomap_swapfile_activate`` function finds all the base-page aligned
709regions in a file and sets them up as swap space.
710The file will be ``fsync()``'d before activation.
711``IOMAP_REPORT`` will be passed as the ``flags`` argument to
712``->iomap_begin``.
713All mappings must be mapped or unwritten; cannot be dirty or shared, and
714cannot span multiple block devices.
715Callers must hold ``i_rwsem`` in exclusive mode; this is already
716provided by ``swapon``.
717
718File Space Mapping Reporting
719============================
720
721iomap implements two of the file space mapping system calls.
722
723FS_IOC_FIEMAP
724-------------
725
726The ``iomap_fiemap`` function exports file extent mappings to userspace
727in the format specified by the ``FS_IOC_FIEMAP`` ioctl.
728``IOMAP_REPORT`` will be passed as the ``flags`` argument to
729``->iomap_begin``.
730Callers commonly hold ``i_rwsem`` in shared mode before calling this
731function.
732
733FIBMAP (deprecated)
734-------------------
735
736``iomap_bmap`` implements FIBMAP.
737The calling conventions are the same as for FIEMAP.
738This function is only provided to maintain compatibility for filesystems
739that implemented FIBMAP prior to conversion.
740This ioctl is deprecated; do **not** add a FIBMAP implementation to
741filesystems that do not have it.
742Callers should probably hold ``i_rwsem`` in shared mode before calling
743this function, but this is unclear.
744