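Last night I dug into how e2fsprogs implements flex_bg. flex_bg, short for Flexible Block Groups, is a feature introduced by the EXT4 filesystem. In short, the bitmaps (block bitmap, inode bitmap) and inode tables that EXT3 scattered across the individual groups are gathered together, each kind in its own contiguous run. Gathering them has at least two benefits:
- Fewer disk seeks: the frequently accessed per-group resources are placed in one contiguous region of the disk.
- More blocks can be allocated to a single extent/run at once: previously the groups cut the disk into many discontiguous fragments, so a single allocation request could obtain at most the blocks managed by one group. With the most common BLOCK_SIZE of 4K, one group manages at most 4K*8 = 32K blocks (128M). After subtracting the group's own metadata (bitmap blocks: 2; inode table: (32768 * 128 + 4095)/4096 = 1024 blocks), only 31742 blocks remain free. If the group also holds a superblock backup under sparse_super, one more block has to be subtracted for that superblock copy.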
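The arithmetic in the last bullet can be checked with a few lines of Python. This is only a sketch of the calculation as written above: the 32768 inodes per group and the 128-byte inode size are the figures used in that formula, and `group_capacity` is a name made up for this illustration.

```python
def group_capacity(block_size=4096, inode_size=128, inodes_per_group=32768):
    """Return (blocks_per_group, inode_table_blocks, free_blocks)."""
    # One bitmap block holds block_size * 8 bits, one bit per block in the group.
    blocks_per_group = block_size * 8
    # One block bitmap plus one inode bitmap.
    bitmap_blocks = 2
    # Round the inode table up to whole blocks, as in the formula above.
    inode_table_blocks = (inodes_per_group * inode_size + block_size - 1) // block_size
    free_blocks = blocks_per_group - bitmap_blocks - inode_table_blocks
    return blocks_per_group, inode_table_blocks, free_blocks


blocks, itable, free = group_capacity()
# blocks per group, group size in MiB, inode table blocks, free blocks
print(blocks, blocks * 4096 // 2**20, itable, free)  # 32768 128 1024 31742
```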
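With that problem in mind, the flex_bg implementation is fairly easy to follow. When an EXT4 volume is created, mke2fs takes the flex block group size given by the user (flex_bg_size, which must be a power of 2 and is counted in groups) and packs the metadata of flex_bg_size groups together, beginning with the first flex_bg_size groups of the volume. Consider the following example.
The experiment uses a 320G hard disk with a single partition: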
[root@srv ~]# fdisk -l /dev/sdc
Disk /dev/sdc: 320.1 GB, 320072931328 bytes
255 heads, 63 sectors/track, 38913 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xe4afe4af
Device Boot Start End Blocks Id System
/dev/sdc1 1 38914 312568832 7 HPFS/NTFS
Set flex_bg_size to 256 groups:
[root@srv ~]# mke2fs -j -O flex_bg,extents,uninit_bg -G 256 -I 256 /dev/sdc1
Then use debugfs to look at the group descriptors of the new EXT4 volume:
[root@srv ~]# debugfs /dev/sdc1
debugfs: stats
Filesystem volume name: <none>
Last mounted on: <not available>
Filesystem UUID: 5be014f5-5a27-4cf1-81dc-d1f55e71dfdd
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file uninit_bg
Filesystem flags: signed_directory_hash
Default mount options: (none)
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 19537920
Block count: 78142208
Reserved block count: 3907110
Free blocks: 76867144
Free inodes: 19537909
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 1005
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
Flex block group size: 256
Filesystem created: Fri Jul 8 23:02:47 2011
Last mount time: n/a
Last write time: Fri Jul 8 23:06:48 2011
Mount count: 0
Maximum mount count: 20
Last checked: Fri Jul 8 23:02:47 2011
Check interval: 15552000 (6 months)
Next check after: Wed Jan 4 23:02:47 2012
Lifetime writes: 4904 MB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: 63bc1c54-cf76-4546-99a3-7aca37c86fc1
Journal backup: inode blocks
Directories: 2
Group 0: block bitmap at 1025, inode bitmap at 1281, inode table at 1537
4089 free blocks, 8181 free inodes, 2 used directories, 8181 unused inodes
[Checksum 0x2e4a]
Group 1: block bitmap at 1026, inode bitmap at 1282, inode table at 2049
0 free blocks, 8192 free inodes, 0 used directories, 8192 unused inodes
[Inode not init, Checksum 0x4578]
Group 2: block bitmap at 1027, inode bitmap at 1283, inode table at 2561
4095 free blocks, 8192 free inodes, 0 used directories, 8192 unused inodes
[Inode not init, Checksum 0xa897]
...
Group 255: block bitmap at 1280, inode bitmap at 1536, inode table at 142337
32768 free blocks, 8192 free inodes, 0 used directories, 8192 unused inodes
[Inode not init, Block not init, Checksum 0xcd10]
Group 256: block bitmap at 8388608, inode bitmap at 8388864, inode table at 8389120
0 free blocks, 8192 free inodes, 0 used directories, 8192 unused inodes
[Inode not init, Checksum 0x42cd]
...
Group 2384: block bitmap at 75497552, inode bitmap at 75497808, inode table at 75538944
23296 free blocks, 8192 free inodes, 0 used directories, 8192 unused inodes
[Inode not init, Checksum 0xd1d2]
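As the output shows, the block bitmaps, inode bitmaps and inode tables of groups 0-255 are each stored together: the block bitmaps occupy blocks 1025 through 1280, the inode bitmaps run from 1281 through 1536, and the inode tables follow from block 1537 onwards. Group 256 then starts over at its own first block, 8388608 (256 * 32768). Its inode bitmap sits at 8388864, exactly 256 blocks after its block bitmap, rather than in the very next block as the conventional per-group layout would have it. Likewise group 2384 (2384 = 9 * 256 + 80) has its block bitmap at 75497552 = 9 * 256 * 32768 + 80. So the groups from 256 onwards do not fall back to the conventional scattered layout; group 256 appears to open the next flexible block group, and the same pattern repeats every 256 groups.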
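The bitmap positions in the listing above follow a simple pattern that can be checked in Python. This is only a sketch derived from the numbers debugfs printed here, not from the e2fsprogs source. The function name `bitmap_blocks` and the parameter `first_meta_block` (the block where the enclosing run of groups starts its block bitmaps) are made up for this illustration.

```python
BLOCKS_PER_GROUP = 32768  # "Blocks per group" from the stats output


def bitmap_blocks(group, flex_bg_size, first_meta_block):
    """Return (block_bitmap, inode_bitmap) block numbers for `group`.

    Assumption read off the debugfs output: within one run of flex_bg_size
    groups, all block bitmaps come first, then all inode bitmaps.
    """
    idx = group % flex_bg_size  # position of the group inside its run
    block_bitmap = first_meta_block + idx
    inode_bitmap = first_meta_block + flex_bg_size + idx
    return block_bitmap, inode_bitmap


# Groups 0-255: metadata starts at block 1025, as printed for group 0.
print(bitmap_blocks(0, 256, 1025))    # (1025, 1281)
print(bitmap_blocks(255, 256, 1025))  # (1280, 1536)

# Later runs: taking the first block of the run's leading group as the base
# reproduces the values printed for groups 256 and 2384.
for g in (256, 2384):
    base = (g // 256) * 256 * BLOCKS_PER_GROUP
    print(g, bitmap_blocks(g, 256, base))
# 256 (8388608, 8388864)
# 2384 (75497552, 75497808)
```

Inode table positions are left out on purpose: groups 0 to 2 are spaced 512 blocks apart (1537, 2049, 2561), but group 255 is reported at 142337 rather than 1537 + 255 * 512 = 132097, so the tables are evidently not packed with one fixed stride throughout.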
As a further experiment, set flex_bg_size to 4096 so that every group (2385 in total) falls inside a single flexible block group:
[root@srv ~]# mke2fs -j -O flex_bg,extents,uninit_bg -G 4096 -I 256 /dev/sdc1
Start debugfs to look at the group descriptors:
[root@srv ~]# debugfs /dev/sdc1
Filesystem volume name: <none>
Last mounted on: <not available>
Filesystem UUID: ab2057e4-2510-4c25-bd72-c2867bebb294
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file uninit_bg
Filesystem flags: signed_directory_hash
Default mount options: (none)
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 19537920
Block count: 78142208
Reserved block count: 3907110
Free blocks: 76867144
Free inodes: 19537909
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 1005
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
Flex block group size: 4096
Filesystem created: Fri Jul 8 23:14:02 2011
Last mount time: n/a
Last write time: Fri Jul 8 23:17:59 2011
Mount count: 0
Maximum mount count: 39
Last checked: Fri Jul 8 23:14:02 2011
Check interval: 15552000 (6 months)
Next check after: Wed Jan 4 23:14:02 2012
Lifetime writes: 4904 MB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: f6f4dbbc-3673-4b58-b11c-2fbae02d7ee3
Journal backup: inode blocks
Directories: 2
Group 0: block bitmap at 1025, inode bitmap at 5121, inode table at 9217
7511 free blocks, 8181 free inodes, 2 used directories, 8181 unused inodes
[Checksum 0xee4f]
Group 1: block bitmap at 1026, inode bitmap at 5122, inode table at 9729
0 free blocks, 8192 free inodes, 0 used directories, 8192 unused inodes
[Inode not init, Checksum 0x255d]
Group 2: block bitmap at 1027, inode bitmap at 5123, inode table at 10241
4095 free blocks, 8192 free inodes, 0 used directories, 8192 unused inodes
[Inode not init, Checksum 0x047e]
...
Group 2384: block bitmap at 3409, inode bitmap at 7505, inode table at 1265665
23296 free blocks, 8192 free inodes, 0 used directories, 8192 unused inodes
[Inode not init, Checksum 0x8406]
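As this shows, the whole flexible block group is divided into three parts: first all the block bitmaps, then all the inode bitmaps, and finally all the inode tables. Within each part the entries are laid out one after another on disk.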
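The starting blocks of the three parts can be reproduced from the superblock fields in the stats output. The sketch below rests on two assumptions that the post itself does not state: group descriptors are 32 bytes each (the feature list shows no 64bit), and group 0 begins with one superblock block, followed by the group descriptor table and the reserved GDT blocks. `region_starts` is a hypothetical helper name.

```python
def region_starts(block_count, blocks_per_group, block_size,
                  reserved_gdt_blocks, flex_bg_size, desc_size=32):
    """Estimate where the three metadata parts of the first flex group begin."""
    # Number of groups, rounded up; the last group may be short.
    groups = -(-block_count // blocks_per_group)
    last_group_blocks = block_count - (groups - 1) * blocks_per_group
    # Blocks needed for the group descriptor table (assumed 32-byte entries).
    gdt_blocks = -(-groups * desc_size // block_size)
    # Block 0 is the superblock, then the GDT, then the reserved GDT blocks.
    block_bitmaps = 1 + gdt_blocks + reserved_gdt_blocks
    inode_bitmaps = block_bitmaps + flex_bg_size
    inode_tables = inode_bitmaps + flex_bg_size
    return {
        "groups": groups,
        "last_group_blocks": last_group_blocks,
        "block_bitmaps": block_bitmaps,
        "inode_bitmaps": inode_bitmaps,
        "inode_tables": inode_tables,
    }


# Values taken from the stats output: Block count, Blocks per group,
# Block size, Reserved GDT blocks, and the two -G settings tried.
for flex in (256, 4096):
    print(flex, region_starts(78142208, 32768, 4096, 1005, flex))
# 256  -> groups 2385, last group 23296 blocks, parts start at 1025 / 1281 / 1537
# 4096 -> groups 2385, last group 23296 blocks, parts start at 1025 / 5121 / 9217
```

With -G 4096 the inode bitmaps start 4096 blocks after the block bitmaps even though only 2385 groups exist, which suggests that room is set aside for the full flex_bg_size rather than for the actual group count.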