http://www.linux-mag.com/cache/7497/1.html

Metadata Performance of Four Linux File Systems
Jeffrey B. Layton
Wed, 2 Sep 2009

Introduction

Using the principles of good benchmarking, we explore the metadata performance of four Linux file systems using a simple benchmark, fdtree.

In a previous article, the case was made for how far file system benchmarks have fallen. Benchmarks have become a marketing tool, to the point where they are mere numbers of little real use. That article reviewed a paper that examined nine years of storage and file system benchmarking, made some excellent observations, and offered recommendations for improving benchmarks.

This article isn't so much about benchmarks as a product; rather, it is an exploration looking for interesting observations or trends, or the lack thereof. In particular, this article examines the metadata performance of several Linux file systems using a specific micro-benchmark. Fundamentally, it is an exploration of whether there are any metadata performance differences between four Linux file systems (ext3, ext4, btrfs, and nilfs2), using a metadata benchmark called fdtree. So now it's time to eat our own dog food and do benchmarking following the recommendations previously mentioned.

Start at the Beginning - Why?

The previous article made several observations about benchmarking, one of which is that storage and file system benchmarks seldom, if ever, explain why the benchmark is being performed. This point should not be underestimated. Specifically, if the reason why a benchmark was performed cannot be adequately explained, then the benchmark itself becomes suspect (it may just be pure marketing material). Given this point, the reason the benchmark in this article is being performed is to examine or explore whether, and possibly by how much, the metadata performance of four Linux file systems differs, using a single metadata benchmark.
The goal is not to find which file system is "the best" - this is a single benchmark, fdtree. Rather, it is to search for differences and contrast the metadata performance of the file systems.

Why is examining metadata performance a worthwhile exploration? Glad you asked. There are a number of applications, workloads, and classes of applications that are metadata intensive. Mail servers can be very metadata intensive because of the need to read and write very small files. Some database workloads do a great deal of reading and writing of small files. In the world of technical computing, many bioinformatics applications, such as gene sequencing codes, do a great deal of small reads and writes.

The metadata benchmark used in this article is called fdtree. It is a simple bash script that stresses the metadata aspects of the file system using standard *nix commands. While it is not the most well-known benchmark in the storage and file system world, it is somewhat better known in the HPC (High Performance Computing) world.

An Examination of fdtree

Before jumping into the results, it is appropriate, and highly recommended, to examine the benchmark itself. fdtree is a simple bash script that performs four different metadata tests:

* Directory creation
* File creation
* File removal
* Directory removal

It creates a specified number of files of a given size (in blocks) in a top-level directory. It then creates a specified number of subdirectories, which in turn recursively create subdirectories down to a specified number of levels, each populated with files.

Directory Creation

This phase of the benchmark begins by creating the specified number of directories in the main directory using the simple "mkdir" command in a bash function, "create_dirs":

mkdir $base_name"L"$nl"D"$nd"/"

The bash variables specify the details of the directory names.
The next step is to call the "create_dirs" function recursively with a different base name (directory) to create all of the required directories:

create_dirs $((nl-1)) $base_name"L"$nl"D"$nd"/"

File Creation

This step of the benchmark creates the required number of files using the "dd" command in a bash function, "create_files":

dd if=/dev/zero bs=4096 count=$fsize of=$file_name > /dev/null 2>&1

The number of 4 KiB blocks per file ($fsize) is specified as part of the benchmark. To create files in the subdirectories, the function is called recursively.

File Removal

The third function in the benchmark removes the files that were created. This is done with the standard "rm" command in a function called "remove_files":

rm -f $file_name

The function "remove_files" is called recursively to remove all of the files.

Directory Removal

The fourth and final function removes the directories. This is done in a bash function, "remove_dirs", using the *nix command "rmdir":

rmdir $dir_names

The function "remove_dirs" is called recursively to remove all of the directories.

Overall, the script uses standard *nix commands and does not use any recursive options on those commands. It stresses the metadata capabilities of the file system because of the potentially large number of files and directories. One interesting quirk is that the script rounds its results - times and rates - to integer values. So the reported time for a test can be 0 seconds; that is, the test ran in less than 1 second.
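Pulling the four phases together, here is a condensed, illustrative sketch of how an fdtree-style test works. This is not the actual fdtree.bash script - the variable names, structure, and parameter values below are simplified for illustration (and the real script also times each phase, which is omitted here):

```shell
#!/bin/bash
# Condensed sketch of an fdtree-style metadata test (illustrative only --
# NOT the real fdtree.bash). It recursively builds a directory tree, fills
# it with files via dd, then tears everything down with non-recursive
# rm/rmdir, as fdtree does.

nd=2      # directories created per level (like fdtree's -d)
nf=3      # files created per directory   (like fdtree's -f)
fsize=1   # file size in 4 KiB blocks     (like fdtree's -s)

create_tree() {
    local nl=$1 base=$2              # levels remaining, parent directory
    [ "$nl" -le 0 ] && return
    local d f
    for d in $(seq 1 "$nd"); do
        local dir="${base}L${nl}D${d}/"
        mkdir "$dir"
        for f in $(seq 1 "$nf"); do
            # each file is $fsize blocks of 4096 bytes, like fdtree
            dd if=/dev/zero bs=4096 count="$fsize" of="${dir}file${f}" >/dev/null 2>&1
        done
        create_tree $((nl - 1)) "$dir"
    done
}

remove_tree() {
    local base=$1 dir
    for dir in "$base"*/; do
        [ -d "$dir" ] || continue    # glob may not match at the deepest level
        remove_tree "$dir"
        rm -f "$dir"file*            # remove files, no recursive rm
        rmdir "$dir"                 # remove the now-empty directory
    done
}

mkdir fdtree_sketch && cd fdtree_sketch || exit 1
create_tree 2 "./"                   # 2 levels of 2 dirs: 2 + 4 = 6 dirs, 18 files
echo "files created: $(find . -name 'file*' | wc -l)"
remove_tree "./"
echo "files left:    $(find . -name 'file*' | wc -l)"
```

With 2 directories per level, 2 levels, and 3 files per directory, the sketch creates 6 directories and 18 files, then removes them all - the same create/remove cycle fdtree performs at much larger scale.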
Running the Benchmark

In this article's exploration, fdtree was used in four different ways to stress metadata capability:

* Small files (4 KiB)
  + Shallow directory structure
  + Deep directory structure
* Larger files (4 MiB)
  + Shallow directory structure
  + Deep directory structure

The two file sizes, 4 KiB (1 block) and 4 MiB (1,000 blocks), were used to get some feel for the range of performance as a function of the amount of data. The two directory structures stress the metadata in different ways, to discover whether the layout has any impact on metadata performance. The shallow directory structure has many directories per level but few levels; the deep directory structure has few directories per level but many levels.

In choosing the specific fdtree parameters for this exploration, there were three overall goals:

* Keep the total run time to approximately 10-12 minutes at most
* Keep the total data for the two directory structures approximately the same
* Keep the run time for each of the four functions above 1 minute where possible

Not every function ran for a full minute - some ran for only a few seconds. These cases are noted in the results. The command lines for the four combinations follow.

Small Files - Shallow Directory Structure

./fdtree.bash -d 20 -f 40 -s 1 -l 3

This command creates 20 sub-directories from each upper-level directory at each level ("-d 20"), with 3 levels ("-l 3") - a basic tree structure, for a total of 8,421 directories. Each directory holds 40 files ("-f 40"), each sized at 1 block (4 KiB, "-s 1"), for a total of 336,840 files and 1,347,360 KiB of data.

Small Files - Deep Directory Structure

./fdtree.bash -d 3 -f 4 -s 1 -l 10

This command creates 3 sub-directories from each upper-level directory at each level ("-d 3"), with 10 levels ("-l 10").
This is a total of 88,573 directories. Each directory holds 4 files, each sized at 1 block (4 KiB), for a total of 354,292 files and 1,417,168 KiB of data.

Medium Files - Shallow Directory Structure

./fdtree.bash -d 17 -f 10 -s 1000 -l 2

This command creates 17 sub-directories from each upper-level directory at each level ("-d 17"), with 2 levels ("-l 2"), for a total of 307 directories. Each directory holds 10 files, each sized at 1,000 blocks (4 MiB), for a total of 3,070 files and 12,280,000 KiB of data.

Medium Files - Deep Directory Structure

./fdtree.bash -d 2 -f 2 -s 1000 -l 10

This command creates 2 sub-directories from each upper-level directory at each level ("-d 2"), with 10 levels ("-l 10"), for a total of 2,047 directories. Each directory holds 2 files, each sized at 1,000 blocks (4 MiB), for a total of 4,094 files and 16,376,000 KiB of data.

Each test was run 10 times for each of the four file systems (ext3, ext4, btrfs, nilfs2). The test system ran a stock CentOS 5.3 distribution, but with a 2.6.30 kernel, and e2fsprogs was upgraded to the latest version as of this writing, 1.41.9. The tests were run on the following system:

* GigaByte GA-MA78GM-US2H motherboard
* AMD Phenom II X4 920 CPU
* 8 GB of memory
* Linux 2.6.30 kernel
* The OS and boot drive are on an IBM DTLA-307020 (20 GB drive at Ultra ATA/100)
* /home is on a Seagate ST1360827AS
* There are two drives for testing, Seagate ST3500641AS-RK drives with a 16 MB cache each: /dev/sdb and /dev/sdc. Only the first drive, /dev/sdb, was used for all of the tests.

For all four file systems, the defaults were used when building the file systems. For btrfs, btrfs-progs v0.18 was used; for nilfs2, nilfs-utils-2.0.14. Both ext3 and ext4 were mounted with "data=ordered", since this is recommended practice to prevent data loss.
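As a cross-check on the totals quoted above: with "-d" directories per level and "-l" levels, the tree holds 1 + d + d^2 + ... + d^l directories (the top-level directory plus every level), each holding "-f" files of "-s" 4-KiB blocks. A small helper function (hypothetical - not part of fdtree) reproduces the numbers for all four runs:

```shell
#!/bin/bash
# Reproduce the directory/file/data totals for the four fdtree runs.
# totals() is a hypothetical helper for checking the arithmetic -- it is
# not part of fdtree itself.
# Usage: totals <dirs-per-level (-d)> <files-per-dir (-f)> <blocks-per-file (-s)> <levels (-l)>
totals() {
    local d=$1 f=$2 s=$3 l=$4
    local dirs=1 at_level=1 i       # start with the top-level directory
    # directories form a geometric series: 1 + d + d^2 + ... + d^l
    for ((i = 1; i <= l; i++)); do
        at_level=$((at_level * d))
        dirs=$((dirs + at_level))
    done
    local files=$((dirs * f))       # every directory holds f files
    local kib=$((files * s * 4))    # each block is 4 KiB
    echo "$dirs directories, $files files, $kib KiB"
}

totals 20 40 1 3      # small/shallow:  8421 directories, 336840 files, 1347360 KiB
totals 3  4  1 10     # small/deep:     88573 directories, 354292 files, 1417168 KiB
totals 17 10 1000 2   # medium/shallow: 307 directories, 3070 files, 12280000 KiB
totals 2  2  1000 10  # medium/deep:    2047 directories, 4094 files, 16376000 KiB
```

The output matches the figures quoted in the test descriptions, and also confirms the second goal above: the two small-file runs write roughly the same amount of data (about 1.3-1.4 GiB), as do the two medium-file runs (about 12-16 GiB).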
Benchmark Results

This section presents the results of the testing (exploration). For each of the four combinations and each of the four file systems, the tables list the average value with the standard deviation just below it (shown in red in the original).

The first combination tested was small files (4 KiB) with a shallow directory structure. Table 1 below lists the benchmark times.

Table 1 - Benchmark Times, Small Files (4 KiB) - Shallow Directory Structure
(average over 10 runs; standard deviation below each average)

File System | Directory Create (secs.) | File Create (secs.) | File Remove (secs.) | Directory Remove (secs.)
ext3        |  13.00 | 342.90 | 69.40 | 1.30
            |   3.61 |  42.69 |  6.92 | 0.46
ext4        |  10.60 | 327.20 | 58.10 | 1.40
            |   0.92 |   4.89 |  1.87 | 0.92
btrfs       |   8.80 | 335.00 | 65.30 | 1.40
            |   0.40 |   1.00 |  0.78 | 0.66
nilfs2      |   9.10 | 345.70 | 51.60 | 1.20
            |   0.30 |   8.14 |  0.92 | 0.40

The first test, directory creates, had an average run time of roughly 9-13 seconds across the four file systems, so the results may not be that meaningful. In addition, the directory remove test ran in about 1 second or less, so it too may not have much value.

Table 2 below lists the performance results.

Table 2 - Performance Results, Small Files (4 KiB) - Shallow Directory Structure
(average over 10 runs; standard deviation below each average)

File System | Directory Create (Dirs/sec) | File Create (Files/sec) | File Create (KiB/sec) | File Remove (Files/sec) | Directory Remove (Dirs/sec)
ext3        |  695.20 |   993.70 | 3,975.90 | 4,900.30 | 7,578.80
            |  177.37 |    94.66 |   378.91 |   473.28 | 1,684.40
ext4        |  800.00 | 1,029.10 | 4,118.30 | 5,803.40 | 7,368.30
            |   69.88 |    15.21 |    60.59 |   111.90 | 2,157.37
btrfs       |  958.40 | 1,005.00 | 4,021.70 | 5,167.10 | 7,017.40
            |   46.80 |     3.00 |    12.01 |    78.22 | 2,174.42
nilfs2      |  925.70 |   974.70 | 3,889.20 | 6,529.20 | 7,578.80
            |   27.90 |    21.67 |    88.54 |   112.75 | 1,684.40

The second combination tested was small files (4 KiB) with a deep directory structure. Table 3 below lists the benchmark times.
Table 3 - Benchmark Times, Small Files (4 KiB) - Deep Directory Structure
(average over 10 runs; standard deviation below each average)

File System | Directory Create (secs.) | File Create (secs.) | File Remove (secs.) | Directory Remove (secs.)
ext3        |  46.20 | 182.40 |  53.70 | 14.60
            |  26.97 |  72.55 |  24.78 |  7.55
ext4        | 187.00 | 443.20 | 192.50 | 73.30
            |  11.22 |   7.69 |  12.51 | 42.09
btrfs       | 102.40 | 398.60 | 132.50 | 38.10
            |   0.66 |   1.91 |   0.67 |  0.70
nilfs2      | 108.20 | 417.30 | 122.10 | 37.20
            |   2.68 |   6.48 |   3.39 |  0.60

For these tests, the first test, directory creates, took about 46 seconds for ext3 (the fastest). This time is fairly small and, consequently, the results may not be as applicable as the other tests, which had much longer run times. The last test, directory removes, took about 15 seconds for ext3 (the fastest). Again, this is quick enough that the result may not be as useful.

Table 4 below lists the performance results.

Table 4 - Performance Results, Small Files (4 KiB) - Deep Directory Structure
(average over 10 runs; standard deviation below each average)

File System | Directory Create (Dirs/sec) | File Create (Files/sec) | File Create (KiB/sec) | File Remove (Files/sec) | Directory Remove (Dirs/sec)
ext3        | 783.90 | 927.90 | 3,713.00 | 3,180.70 | 2,452.40
            |  39.08 |  16.58 |    65.88 |   209.90 |   207.90
ext4        | 475.00 | 799.10 | 3,198.00 | 1,848.00 | 1,539.60
            |  29.05 |  13.45 |    53.73 |   124.31 |   201.76
btrfs       | 864.30 | 888.10 | 3,554.80 | 2,673.60 | 2,324.90
            |   5.76 |   4.23 |    16.92 |    13.87 |    42.57
nilfs2      | 818.60 | 848.50 | 3,396.40 | 2,903.60 | 2,380.80
            |  19.11 |  12.71 |    51.73 |    75.52 |    36.60

The third combination tested was medium files (4 MiB) with a shallow directory structure. Table 5 below lists the benchmark times.

Table 5 - Benchmark Times, Medium Files (4 MiB) - Shallow Directory Structure
(average over 10 runs; standard deviation below each average)

File System | Directory Create (secs.) | File Create (secs.) | File Remove (secs.) | Directory Remove (secs.)
ext3        | 0.30 | 174.90 | 17.40 | 0.00
            | 0.46 |  17.46 |  3.47 | 0.00
ext4        | 0.20 | 156.80 | 11.80 | 0.20
            | 0.40 |   4.75 |  2.99 | 0.40
btrfs       | 0.50 | 114.40 | 15.60 | 0.10
            | 0.50 |   1.11 |  0.49 | 0.30
nilfs2      | 0.70 | 196.30 |  7.50 | 0.20
            | 0.78 |   3.07 |  2.87 | 0.40

For these tests, the first test, directory creates, took less than 1 second. This time is very small and, consequently, the results are not as applicable as some of the other tests. The file remove test took about 8-17 seconds - again a very short time, so the results may not be as applicable. The last test, directory removes, also ran in well under 1 second.

Table 6 below lists the performance results.

Table 6 - Performance Results, Medium Files (4 MiB) - Shallow Directory Structure
(average over 10 runs; standard deviation below each average)

File System | Directory Create (Dirs/sec) | File Create (Files/sec) | File Create (KiB/sec) | File Remove (Files/sec) | Directory Remove (Dirs/sec)
ext3        |  92.10 | 17.30 |  70,889.80 | 182.30 |   0.00
            | 140.69 |  1.90 |   6,798.06 |  32.53 |   0.00
ext4        |  61.40 | 18.90 |  78,393.20 | 278.30 |  61.40
            | 122.80 |  0.54 |   2,252.90 |  75.69 | 122.80
btrfs       | 153.50 | 26.20 | 107,342.50 | 196.20 |  30.70
            | 153.50 |  0.60 |   1,063.70 |   6.37 |  92.10
nilfs2      | 122.70 | 15.00 |  62,572.00 | 442.50 |  61.40
            | 133.80 |  0.00 |     968.91 |  90.62 | 122.80

The fourth and final combination tested was medium files (4 MiB) with a deep directory structure. Table 7 below lists the benchmark times.

Table 7 - Benchmark Times, Medium Files (4 MiB) - Deep Directory Structure
(average over 10 runs; standard deviation below each average)

File System | Directory Create (secs.) | File Create (secs.) | File Remove (secs.) | Directory Remove (secs.)
ext3        | 2.70 | 248.30 | 18.80 | 1.80
            | 0.78 |   9.99 |  4.07 | 1.08
ext4        | 3.20 | 219.50 | 13.40 | 1.20
            | 0.75 |   1.12 |  4.72 | 0.40
btrfs       | 2.40 | 159.30 | 16.20 | 1.10
            | 0.49 |   1.42 |  1.17 | 0.30
nilfs2      | 2.50 | 287.70 | 11.50 | 1.40
            | 0.50 |  10.67 |  0.50 | 0.49

The first test, directory creates, took 2-3 seconds, which is very short. The time for the third test, file removal, was also fairly short at 11-19 seconds.
The last test, directory removes, was extremely fast at less than 2 seconds. These three results are somewhat suspect because of the short run times.

Table 8 below lists the performance results.

Table 8 - Performance Results, Medium Files (4 MiB) - Deep Directory Structure
(average over 10 runs; standard deviation below each average)

File System | Directory Create (Dirs/sec) | File Create (Files/sec) | File Create (KiB/sec) | File Remove (Files/sec) | Directory Remove (Dirs/sec)
ext3        | 818.30 | 16.20 |  66,053.10 | 225.60 | 1,518.00
            | 213.10 |  0.60 |   2,515.72 |  35.42 |   658.48
ext4        | 671.70 | 18.10 |  74,607.50 | 331.50 | 1,842.20
            | 147.98 |  0.30 |     380.54 | 112.06 |   409.60
btrfs       | 886.60 | 25.20 | 102,807.40 | 253.20 | 1,944.60
            | 167.06 |  0.40 |     917.56 |  17.72 |   307.20
nilfs2      | 852.50 | 13.70 |  56,998.60 | 356.50 | 1,637.40
            | 170.50 |  0.64 |   2,122.26 |  15.50 |   501.66

Discussion of Results

Four different combinations were tested on each of the four file systems. Comparing the file systems is interesting, but comparing the same file system across the different tests is interesting as well.

First, let's examine the shallow directory structure results (Tables 2 and 6). For small files (4 KiB), the four file systems performed about the same on the file create and file remove tests (the directory create and remove tests ran too quickly to be really useful). All four file systems achieved about 1,000 file creates per second, or about 4,000 KiB per second (see Table 2). But medium files (4 MiB) produced very different results. In that case, btrfs was almost twice as fast as nilfs2 and about 50% faster than ext3 or ext4 in file creates per second and in throughput (KiB/s) (see Table 6). Note, however, that the small file case produced 109X more files and 27X more directories, yet only about 1/10th the total amount of data. This points out the extreme pressure that small files put on the metadata performance of file systems.

Second, we can perform the same comparison for the deep directory structure (Tables 4 and 8).
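As an aside, the shallow-structure ratios quoted above (109X more files, 27X more directories, roughly a tenth of the data) follow directly from the test totals given earlier; a quick integer-arithmetic check:

```shell
#!/bin/bash
# Check the shallow-structure ratios using the totals from the test
# descriptions (small: 336,840 files / 8,421 dirs / 1,347,360 KiB;
# medium: 3,070 files / 307 dirs / 12,280,000 KiB).
echo "file ratio (small/medium):      $((336840 / 3070))x"       # 109x more files
echo "directory ratio (small/medium): $((8421 / 307))x"          # 27x more directories
echo "data ratio (medium/small):      $((12280000 / 1347360))x"  # medium wrote ~9x more data
```

So the small-file run does about 100 times the metadata work of the medium-file run while moving only about a ninth of the data.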
As with the shallow directory structure, small files (4 KiB) put extreme pressure on file system metadata performance. Examining the results, the following observations can be made:

* Small files put extreme pressure on metadata performance regardless of file system
* For small files, a shallow versus deep directory structure did not appreciably impact metadata performance
* For larger files, a shallow versus deep directory structure also did not appreciably impact metadata performance
* For small files, btrfs has good file creation performance, but its file removal performance is not as good as ext3's and ext4's at this time
* For larger files, btrfs has both excellent file creation and removal performance relative to the other three file systems
* Log-based file systems such as nilfs2 should work well on metadata tests; the developers are still evolving the garbage collection (GC) algorithm, which should improve performance
* The standard deviation - the spread in the data - was much greater for ext3 than for the other three file systems. The reason(s) for this are not known.

This is a first attempt at useful benchmarks and analysis for Linux file systems using the approach outlined in the previous article on benchmarking. Future articles will use the same basic tenets. Be warned, though: this approach introduces a great deal of data into the article. Overall, however, it gives much more information than a quick table and a conclusion that "file system X is better" (usually followed by a run for cover). Please let me know in the forums if this type of article is useful.