Skip to content

8. Working with the staging area and the index file 处理暂存区和索引文件

8.1. What’s the index file? 什么是索引文件?

This final step will bring us to where commits happen (although actually creating them is for the next section!)

最后一步将引导我们进入提交的实际发生地(虽然实际创建提交是在下一节!)

You probably know that to commit in Git, you first “stage” some changes, using git add and git rm, and only then do you commit those changes. This intermediate stage between the last and the next commit is called the staging area.

你可能知道,在 Git 中进行提交时,首先要“暂存”一些更改,使用 git addgit rm,然后才提交这些更改。最后一次提交和下一次提交之间的这个中间阶段称为 暂存区

It would seem natural to use a commit or tree object to represent the staging area, but Git actually and uses a completely different mechanism, in the form of what it calls the index file.

看起来自然的是使用提交或树对象来表示暂存区,但 Git 实际上使用的是一种完全不同的机制,即所谓的 索引文件

After a commit, the index file is a sort of copy of that commit: it holds the same path/blob association than the corresponding tree. But it also holds extra information about files in the worktree, like their creation/modification time, so git status doesn’t often need to actually compare files: it just checks that their modification time is the same as the one stored in the index file, and only if it isn’t does it perform an actual comparison.

在提交之后,索引文件可以看作是该提交的某种副本:它持有与对应树相同的路径/Blob 关联。但它还包含关于工作区中文件的额外信息,比如创建/修改时间,因此 git status 并不需要实际比较文件:它只需检查文件的修改时间是否与索引文件中存储的时间相同,只有在不相同时才会进行实际比较。

You can thus consider the index file as a three-way association list: not only paths with blobs, but also paths with actual filesystem entries.

因此,你可以将索引文件视为一个三方关联列表:不仅包含路径与 Blob 的关联,还包含路径与实际文件系统条目的关联。

Another important characteristic of the index file is that unlike a tree, it can represent inconsistent states, like a merge conflict, whereas a tree is always a complete, unambiguous representation.

索引文件 的另一个重要特性是,与树不同,它可以表示不一致的状态,比如合并冲突,而树始终是完整且明确的表示。

When you commit, what git actually does is turn the index file into a new tree object. To summarize:

当你提交时,Git 实际上是将索引文件转换为一个新的树对象。总结如下:

  1. When the repository is “clean”, the index file holds the exact same contents as the HEAD commit, plus metadata about the corresponding filesystem entries. For instance, it may contain something like:

    There’s a file called src/disp.c whose contents are blob 797441c76e59e28794458b39b0f1eff4c85f4fa0. The real src/disp.c file, in the worktree, was created on 2023-07-15 15:28:29.168572151, and last modified 2023-07-15 15:28:29.1689427709. It is stored on device 65026, inode 8922881.

  2. 当仓库“干净”时,索引文件包含与 HEAD 提交完全相同的内容,以及对应文件系统条目的元数据。例如,它可能包含如下内容:

    有一个名为 src/disp.c 的文件,其内容为 Blob 797441c76e59e28794458b39b0f1eff4c85f4fa0。实际的 src/disp.c 文件在工作区中创建于 2023-07-15 15:28:29.168572151,最后修改于 2023-07-15 15:28:29.1689427709。它存储在设备 65026,inode 8922881 上。

  3. When you git add or git rm, the index file is modified accordingly. In the example above, if you modify src/disp.c, and add your changes, the index file will be updated with a new blob ID (the blob itself will also be created in the process, of course), and the various file metadata will be updated as well so git status knows when not to compare file contents.

  4. 当你使用 git addgit rm 时,索引文件会相应地被修改。在上述示例中,如果你修改了 src/disp.cadd 你的更改,索引文件将更新为新的 Blob ID(当然,Blob 本身也会在此过程中被创建),并且各种文件元数据也会被更新,以便 git status 知道何时不需要比较文件内容。

  5. When you git commit those changes, a new tree is produced from the index file, a new commit object is generated with that tree, branches are updated and we’re done.

  6. 当你将这些更改 git commit 时,将从索引文件生成一个新的树对象,生成一个新的提交对象,并更新分支,然后完成。

备注

A note on words关于术语的说明

The staging area and the index are thus the same thing, but the name “staging area” is more the name of the git user-exposed feature (that could have been implemented otherwise), the abstraction if you will; while “index file” refers specifically to the way this abstract feature is actually implemented in git.

因此,暂存区和索引是同一个概念,但“暂存区”这个名称更像是 Git 用户可见的功能名称(可以用其他方式实现),是某种抽象;而“索引文件”则专指这一抽象功能在 Git 中的实际实现方式。

8.2. Parsing the index 解析索引

The index file is by far the most complicated piece of data a Git repository can hold. Its complete documentation can be found in Git source tree at Documentation/gitformat-index.txt; you can browse it on the Github mirror. It’s made of three parts:

索引文件是 Git 仓库中最复杂的数据结构。其完整文档可以在 Git 源代码树的 Documentation/gitformat-index.txt 中找到;你可以在 GitHub 镜像上浏览。它由三部分组成:

  • An header with the format version number and the number of entries the index holds;
  • 一个包含格式版本号和索引条目数量的头部;
  • A series of entries, sorted, each representing a file; padded to multiple of 8 bytes.
  • 一系列已排序的条目,每个条目代表一个文件,填充到 8 字节的倍数;
  • A series of optional extensions, which we’ll ignore.
  • 一系列可选扩展,我们将忽略它们。

The first thing we need to represent is a single entry. It actually holds quite a lot of stuff, I’m leaving the details in comments. It’s worth observing that an entry stores both the SHA-1 of the associated blob in the object store and a ton of metadata about the actual file on the actual filesystem. Again, this is because git/wyag status will need to determine which files in the index were modified: it is much more efficient to begin by checking the last-modified timestamp and comparing it with a known values, before comparing actual files.

我们需要表示的第一件事是单个条目。它实际上包含了很多内容,具体细节将在注释中说明。值得注意的是,一个条目同时存储了与对象存储中的 blob 相关联的 SHA-1 和关于实际文件的许多元数据。这是因为 git/wyag status 需要确定索引中的哪些文件被修改:首先检查最后修改的时间戳并与已知值进行比较,效率更高,然后再比较实际文件。

python
class GitIndexEntry (object):
    def __init__(self, ctime=None, mtime=None, dev=None, ino=None,
                 mode_type=None, mode_perms=None, uid=None, gid=None,
                 fsize=None, sha=None, flag_assume_valid=None,
                 flag_stage=None, name=None):
      # The last time a file's metadata changed.  This is a pair
      # (timestamp in seconds, nanoseconds)
      # 文件元数据最后一次更改的时间。 这是一个对
      # (秒级时间戳,纳秒级时间戳)的元组
      self.ctime = ctime
      # The last time a file's data changed.  This is a pair
      # (timestamp in seconds, nanoseconds)
      # 文件数据最后一次更改的时间。 这是一个对
      # (秒级时间戳,纳秒级时间戳)的元组
      self.mtime = mtime
      # The ID of device containing this file
      # 包含此文件的设备 ID
      self.dev = dev
      # The file's inode number
      # 文件的 inode 编号
      self.ino = ino
      # The object type, either b1000 (regular), b1010 (symlink),
      # 对象类型,可以是 b1000(常规),b1010(符号链接),
      # b1110 (gitlink).
      # b1110(gitlink)。
      self.mode_type = mode_type
      # The object permissions, an integer.
      # 对象权限,整数值。
      self.mode_perms = mode_perms
      # User ID of owner
      # 拥有者的用户 ID
      self.uid = uid
      # Group ID of ownner
      # 拥有者的组 ID
      self.gid = gid
      # Size of this object, in bytes
      # 此对象的大小,以字节为单位
      self.fsize = fsize
      # The object's SHA
      # 对象的 SHA
      self.sha = sha
      self.flag_assume_valid = flag_assume_valid
      self.flag_stage = flag_stage
      # Name of the object (full path this time!)
      # 对象名称(这次是完整路径!)
      self.name = name

The index file is a binary file, likely for performance reasons. The format is reasonably simple, though. It begins with a header with the DIRC magic bytes, a version number and the total number of entries in that index file. We create the GitIndex class to hold them:

索引文件是一个二进制文件,可能出于性能原因。格式相对简单,它以一个包含 DIRC 魔术字节、版本号和索引文件中条目总数的头部开始。我们创建 GitIndex 类来保存这些信息:

python
class GitIndex (object):
    version = None
    entries = []
    # ext = None
    # sha = None

    def __init__(self, version=2, entries=None):
        if not entries:
            entries = list()

        self.version = version
        self.entries = entries

And a parser to read index files into those objects. After reading the 12-bytes header, we just parse entries in the order they appear. An entry begins with a set of fixed-length data, followed by a variable-length name.

接下来是一个解析器,将索引文件读入这些对象。在读取了 12 字节的头部后,我们按照出现的顺序解析条目。一个条目以一组固定长度的数据开始,后面跟着一个可变长度的名称。

The code is quite straightforward, but as it’s reading a binary format, it feels more messy than what we did so far. We use the int.from_bytes(bytes, endianness) a lot to read raw bytes into an integer, and just a few bitwise operations to separate data that share the same byte.

代码相当简单,但由于它在读取二进制格式,感觉比我们之前做的要复杂一些。我们大量使用 int.from_bytes(bytes, endianness) 来将原始字节读入整数,并使用少量的位操作来分离共享相同字节的数据。

(This format was probably designed so index files could just be mmapp()ed to memory, and read directly as C structs, with an index built in O(n) time in most cases. This kind of approach tends to produce more elegant code in C than in Python…)

这个格式可能是为了让索引文件能够直接通过 mmapp() 映射到内存,并作为 C 结构直接读取,从而在大多数情况下以 O(n) 时间构建索引。这种方法通常会在 C 语言中产生比在 Python 中更优雅的代码……

python
def index_read(repo):
    index_file = repo_file(repo, "index")

    # New repositories have no index!
    # 新仓库没有索引文件!
    if not os.path.exists(index_file):
        return GitIndex()

    with open(index_file, 'rb') as f:
        raw = f.read()

    header = raw[:12]
    signature = header[:4]
    assert signature == b"DIRC"  # Stands for "DirCache" # 代表 "DirCache"
    version = int.from_bytes(header[4:8], "big")
    assert version == 2, "wyag 仅支持索引文件版本 2"
    count = int.from_bytes(header[8:12], "big")

    entries = list()

    content = raw[12:]
    idx = 0
    for i in range(0, count):
        # Read creation time, as a unix timestamp (seconds since
        # 1970-01-01 00:00:00, the "epoch")
        # 读取创建时间,作为 UNIX 时间戳(自 1970-01-01 00:00:00 起的秒数)
        ctime_s = int.from_bytes(content[idx: idx+4], "big")
        # Read creation time, as nanoseconds after that timestamps,
        # for extra precision.
        # 读取创建时间,作为该时间戳后的纳秒数,以获得额外的精度
        ctime_ns = int.from_bytes(content[idx+4: idx+8], "big")
        # Same for modification time: first seconds from epoch.
        # 同样处理修改时间:先是从纪元起的秒数
        mtime_s = int.from_bytes(content[idx+8: idx+12], "big")
        # Then extra nanoseconds
        # 然后是额外的纳秒数
        mtime_ns = int.from_bytes(content[idx+12: idx+16], "big")
        # Device ID
        # 设备 ID
        dev = int.from_bytes(content[idx+16: idx+20], "big")
        # inode
        ino = int.from_bytes(content[idx+20: idx+24], "big")
        # Ignored.
        # 忽略的字段
        unused = int.from_bytes(content[idx+24: idx+26], "big")
        assert 0 == unused
        mode = int.from_bytes(content[idx+26: idx+28], "big")
        mode_type = mode >> 12
        assert mode_type in [0b1000, 0b1010, 0b1110]
        mode_perms = mode & 0b0000000111111111
        # User ID
        # 用户 ID
        uid = int.from_bytes(content[idx+28: idx+32], "big")
        # Group ID
        # 组 ID
        gid = int.from_bytes(content[idx+32: idx+36], "big")
        # Size
        # 大小
        fsize = int.from_bytes(content[idx+36: idx+40], "big")
        # SHA (object ID).  We'll store it as a lowercase hex string
        # for consistency.
        # SHA(对象 ID)。我们将其存储为小写的十六进制字符串,以保持一致性
        sha = format(int.from_bytes(content[idx+40: idx+60], "big"), "040x")
        # Flags we're going to ignore
        # 我们将忽略的标志
        flags = int.from_bytes(content[idx+60: idx+62], "big")
        # Parse flags
        # 解析标志
        flag_assume_valid = (flags & 0b1000000000000000) != 0
        flag_extended = (flags & 0b0100000000000000) != 0
        assert not flag_extended
        flag_stage = flags & 0b0011000000000000
        # Length of the name.  This is stored on 12 bits, some max
        # value is 0xFFF, 4095.  Since names can occasionally go
        # beyond that length, git treats 0xFFF as meaning at least
        # 0xFFF, and looks for the final 0x00 to find the end of the
        # name --- at a small, and probably very rare, performance
        # cost.
        # 名称的长度。这是以 12 位存储的,最大值为 0xFFF,4095。由于名称有时可能超过该长度,git 将 0xFFF 视为表示至少 0xFFF,并寻找最终的 0x00 以找到名称的结束——这会带来小而可能非常罕见的性能损失。
        name_length = flags & 0b0000111111111111

        # We've read 62 bytes so far.
        # 到目前为止我们已经读取了 62 字节。
        idx += 62

        if name_length < 0xFFF:
            assert content[idx + name_length] == 0x00
            raw_name = content[idx:idx+name_length]
            idx += name_length + 1
        else:
            print("注意:名称长度为 0x{:X} 字节。".format(name_length))
            # 这可能没有经过足够的测试。它适用于长度恰好为 0xFFF 字节的路径。任何额外字节可能会在 git、我的 shell 和我的文件系统之间造成问题。
            null_idx = content.find(b'\x00', idx + 0xFFF)
            raw_name = content[idx:null_idx]
            idx = null_idx + 1

        # Just parse the name as utf8.
        # 将名称解析为 UTF-8
        name = raw_name.decode("utf8")

        # Data is padded on multiples of eight bytes for pointer
        # alignment, so we skip as many bytes as we need for the next
        # read to start at the right position.

        # 数据按 8 字节的倍数填充以进行指针对齐,因此我们跳过需要的字节,以便下次读取从正确的位置开始。

        idx = 8 * ceil(idx / 8)

        # And we add this entry to our list.
        # 然后我们将此条目添加到我们的列表中。
        entries.append(GitIndexEntry(ctime=(ctime_s, ctime_ns),
                                     mtime=(mtime_s, mtime_ns),
                                     dev=dev,
                                     ino=ino,
                                     mode_type=mode_type,
                                     mode_perms=mode_perms,
                                     uid=uid,
                                     gid=gid,
                                     fsize=fsize,
                                     sha=sha,
                                     flag_assume_valid=flag_assume_valid,
                                     flag_stage=flag_stage,
                                     name=name))

    return GitIndex(version=version, entries=entries)

8.3. The ls-files command | ls-files 命令

git ls-files displays the names of files in the staging area, with, as usual, a ton of options. Our ls-files will be much simpler, but we’ll add a --verbose option that doesn’t exist in git, just so we can display every single bit of info in the index file.

git ls-files 显示暂存区中文件的名称,通常带有大量选项。我们的 ls-files 将简单得多,但我们会添加一个 --verbose 选项,这是 git 中不存在的,以便显示索引文件中的每一个信息。

python
argsp = argsubparsers.add_parser("ls-files", help = "List all the stage files")
argsp.add_argument("--verbose", action="store_true", help="Show everything.")

def cmd_ls_files(args):
  repo = repo_find()
  index = index_read(repo)
  if args.verbose:
    print("Index file format v{}, containing {} entries.".format(index.version, len(index.entries)))

  for e in index.entries:
    print(e.name)
    if args.verbose:
      print("  {} with perms: {:o}".format(
        { 0b1000: "regular file",
          0b1010: "symlink",
          0b1110: "git link" }[e.mode_type],
        e.mode_perms))
      print("  on blob: {}".format(e.sha))
      print("  created: {}.{}, modified: {}.{}".format(
        datetime.fromtimestamp(e.ctime[0])
        , e.ctime[1]
        , datetime.fromtimestamp(e.mtime[0])
        , e.mtime[1]))
      print("  device: {}, inode: {}".format(e.dev, e.ino))
      print("  user: {} ({})  group: {} ({})".format(
        pwd.getpwuid(e.uid).pw_name,
        e.uid,
        grp.getgrgid(e.gid).gr_name,
        e.gid))
      print("  flags: stage={} assume_valid={}".format(
        e.flag_stage,
        e.flag_assume_valid))
python
argsp = argsubparsers.add_parser("ls-files", help="列出所有暂存文件")
argsp.add_argument("--verbose", action="store_true", help="显示所有信息。")

def cmd_ls_files(args):
  repo = repo_find()
  index = index_read(repo)
  if args.verbose:
    print("索引文件格式 v{}, 包含 {} 条目。".format(index.version, len(index.entries)))

  for e in index.entries:
    print(e.name)
    if args.verbose:
      print("  {},权限:{:o}".format(
        { 0b1000: "常规文件",
          0b1010: "符号链接",
          0b1110: "git 链接" }[e.mode_type],
        e.mode_perms))
      print("  对应的 blob: {}".format(e.sha))
      print("  创建时间:{}.{}, 修改时间:{}.{}".format(
        datetime.fromtimestamp(e.ctime[0]),
        e.ctime[1],
        datetime.fromtimestamp(e.mtime[0]),
        e.mtime[1]))
      print("  设备:{}, inode: {}".format(e.dev, e.ino))
      print("  用户:{} ({})  组:{} ({})".format(
        pwd.getpwuid(e.uid).pw_name,
        e.uid,
        grp.getgrgid(e.gid).gr_name,
        e.gid))
      print("  标志:stage={} assume_valid={}".format(
        e.flag_stage,
        e.flag_assume_valid))

If you run ls-files, you’ll notice that on a “clean” worktree (an unmodified checkout of HEAD), it lists all files on HEAD. Again, the index is not a delta (a set of differences) from the HEAD commit, but starts as a copy of it, in a different format.

如果你运行 ls-files,你会注意到在“干净”的工作区(未修改的 HEAD 检出)中,它列出了 HEAD 上的所有文件。再次强调,索引并不是从 HEAD 提交的一个增量(一组差异),而是以不同的格式作为它的一个副本。

8.4. A detour: the check-ignore command 绕道:check-ignore 命令

We want to write status, but status needs to know about ignore rules, that are stored in the various .gitignore files. So we first need to add some rudimentary support for ignore files in wyag. We’ll expose this support as the check-ignore command, which takes a list of paths and outputs back those of those paths that should be ignored.

我们想要编写 status,但 status 需要了解忽略规则,这些规则存储在各种 .gitignore 文件中。因此,我们首先需要在 wyag 中添加一些基本的忽略文件支持。我们将以 check-ignore 命令的形式暴露这一支持,该命令接受一个路径列表,并输出那些应该被忽略的路径。

Again, the command parser is trivial:

命令解析器同样很简单:

python
argsp = argsubparsers.add_parser("check-ignore", help = "Check path(s) against ignore rules.")
argsp.add_argument("path", nargs="+", help="Paths to check")
python
argsp = argsubparsers.add_parser("check-ignore", help="检查路径是否符合忽略规则。")
argsp.add_argument("path", nargs="+", help="待检查的路径")

And the function is just as simple:

函数也同样简单:

python
def cmd_check_ignore(args):
  repo = repo_find()
  rules = gitignore_read(repo)
  for path in args.path:
      if check_ignore(rules, path):
        print(path)

But of course, most of the function we call don’t exist yet in wyag. We’ll begin by writing a reader for rules in ignore files, gitignore_read(). The syntax of those rules is quite simple: each line in an ignore file is an exclusion pattern, so files that match this pattern are ignored by status, add -A and so on. There are three special cases, though:

当然,我们调用的大多数函数在 wyag 中还不存在。我们将首先编写一个读取忽略文件规则的函数 gitignore_read()。这些规则的语法相当简单:每行都是一个排除模式,匹配该模式的文件将被 statusadd -A 等忽略。不过,有三个特殊情况:

  1. Lines that begin with an exclamation mark ! negate the pattern (files that match this pattern are included, even they were ignored by an earlier pattern)
  2. 以感叹号 ! 开头的行会 否定 模式(匹配该模式的文件会被 包含,即使它们之前被忽略)。
  3. Lines that begin with a dash # are comments, and are skipped.
  4. 以井号 # 开头的行是注释,会被跳过。
  5. A backslash \ at the beginning treats ! and # as literal characters.
  6. 行首的反斜杠 \!# 视为字面字符。

First, a parser for a single pattern. This parser returns a pair: the pattern itself, and a boolean to indicate if files matching the pattern should be excluded (True) or included (False). In other words, False if the pattern did start with !, True otherwise.

首先,单个模式的解析器。该解析器返回一对值:模式本身,以及一个布尔值,用于指示匹配该模式的文件是 应该 被排除 (True) 还是包含 (False)。换句话说,如果模式以 ! 开头,则返回 False,否则返回 True

python
def gitignore_parse1(raw):
    raw = raw.strip()  # Remove leading/trailing spaces # 去除前后空格

    if not raw or raw[0] == "#":
        return None
    elif raw[0] == "!":
        return (raw[1:], False)
    elif raw[0] == "\\":
        return (raw[1:], True)
    else:
        return (raw, True)

Parsing a file is just collecting all rules in that file. Notice this function doesn’t parse files, but just lists of lines: that’s because we’ll need to read rules from git blobs as well, and not just regular files.

解析文件的过程就是收集该文件中的所有规则。请注意,这个函数并不解析 文件,而只是解析行的列表:这是因为我们也需要从 git blobs 中读取规则,而不仅仅是常规文件。

python
def gitignore_parse(lines):
    ret = list()

    for line in lines:
        parsed = gitignore_parse1(line)
        if parsed:
            ret.append(parsed)

    return ret

Last thing we need to do is collect the various ignore files. They come in two kinds:

最后,我们需要做的就是收集各种忽略文件。这些文件分为两种:

  • Some of these files live in the index: they’re the various gitignore files. Emphasis on the plural; although there often is only one such file, at the root, there can be one in each directory, and it applies to this directory and its subdirectories. I’ll call those scoped, because they only apply to paths under their directory.

  • 一些文件位于索引中:它们是各种 gitignore 文件。强调一下复数形式;虽然通常只有一个这样的文件在根目录,但每个目录中也可以有一个,并且它适用于该目录及其子目录。我称这些为作用域文件,因为它们只适用于其目录下的路径。

  • The others live outside the index. They’re the global ignore file (usually in ~/.config/git/ignore) and the repository-specific .git/info/exclude. I call those absolute, because they apply everywhere, but at a lower priority.

  • 其他文件位于索引之外。它们是全局忽略文件(通常在 ~/.config/git/ignore)和特定于仓库的 .git/info/exclude。我称这些为绝对文件,因为它们适用于所有地方,但优先级较低。

Again, a class to hold that: a list of absolute rules, a dict (hashmap) of relative rules. The keys to this hashmap are directories, relative to the root of a worktree.

再次,我们定义一个类来持有这些信息:一个包含绝对规则的列表,以及一个包含相对规则的字典(哈希表)。这个哈希表的键是目录,相对于工作树的根目录。

python
class GitIgnore(object):
    absolute = None
    scoped = None

    def __init__(self, absolute, scoped):
        self.absolute = absolute
        self.scoped = scoped

And finally our function to collect all gitignore rules in a repository, and return a GitIgnore object. Notice how it reads scoped files from the index, and not the worktree: only staged .gitignore files matter (also remember: HEAD is already staged — the staging area is a copy, not a delta).

最后,我们的函数将收集仓库中的所有 gitignore 规则,并返回一个 GitIgnore 对象。请注意,它是从索引中读取作用域文件,而不是从工作树中读取:只有已暂存.gitignore 文件才重要(还要记住:HEAD 已经 被暂存——暂存区是一个副本,而不是增量)。

python
def gitignore_read(repo):
    ret = GitIgnore(absolute=list(), scoped=dict())

    # Read local configuration in .git/info/exclude
    # 读取 .git/info/exclude 中的本地配置
    repo_file = os.path.join(repo.gitdir, "info/exclude")
    if os.path.exists(repo_file):
        with open(repo_file, "r") as f:
            ret.absolute.append(gitignore_parse(f.readlines()))

    # Global configuration
    # 全局配置
    if "XDG_CONFIG_HOME" in os.environ:
        config_home = os.environ["XDG_CONFIG_HOME"]
    else:
        config_home = os.path.expanduser("~/.config")
    global_file = os.path.join(config_home, "git/ignore")

    if os.path.exists(global_file):
        with open(global_file, "r") as f:
            ret.absolute.append(gitignore_parse(f.readlines()))

    # .gitignore files in the index
    # 索引中的 .gitignore 文件
    index = index_read(repo)

    for entry in index.entries:
        if entry.name == ".gitignore" or entry.name.endswith("/.gitignore"):
            dir_name = os.path.dirname(entry.name)
            contents = object_read(repo, entry.sha)
            lines = contents.blobdata.decode("utf8").splitlines()
            ret.scoped[dir_name] = gitignore_parse(lines)
    return ret

We’re almost there. To tie everything together, we need the check_ignore function that matches a path, relative to the root of a worktree, against a set of rules. This is how this function will work:

我们快完成了。为了将所有内容结合在一起,我们需要 check_ignore 函数,该函数将路径(相对于工作树的根目录)与一组规则进行匹配。这个函数的工作原理如下:

  • It will first try to match this path against the scoped rules. It will do this from the deepest parent of the path to the farthest. That is, if the path is src/support/w32/legacy/sound.c~, it will first look for rules in src/support/w32/legacy/.gitignore, then src/support/w32/.gitignore, src/support/.gitignore, and so on up to simply .gitignore" at the root.

  • 它首先尝试将这个路径与作用域规则匹配。从路径的最深父级开始,向上查找。也就是说,如果路径是 src/support/w32/legacy/sound.c~,它将首先查找 src/support/w32/legacy/.gitignore 中的规则,然后是 src/support/w32/.gitignore,接着是 src/support/.gitignore,依此类推,直到根目录的 .gitignore

  • If nothing matches, it will continue with the absolute rules.

  • 如果没有匹配的规则,它将继续查找绝对规则。

We write three small support functions. One to match a path against a set of rules, and return the result of the last matching rule. Notice how it’s not a real boolean functions, since it has three possible return values: True, False but also None. It returns None if nothing matched, so the caller knows it should continue trying with more general ignore files (eg, go one directory level up).

我们写三个小支持函数。一个是将路径与一组规则进行匹配,并返回最后一个匹配规则的结果。请注意,这不是一个真实的布尔函数,因为它有种可能的返回值:TrueFalseNone。如果没有匹配,则返回 None,这样调用者就知道应该继续尝试更一般的忽略文件(例如,向上移动一级目录)。

python
def check_ignore1(rules, path):
    result = None
    for (pattern, value) in rules:
        if fnmatch(path, pattern):
            result = value
    return result

A function to match against the dictionary of scoped rules (the various .gitignore files). It just starts at the path’s directory then moves up to the parent directory, recursively, until it has tested root. Notice that this function (and the next two as well), never breaks inside a given .gitignore file. Even if a rule matches, they keep going through the file, because another rule there may negate reverse the effect (rules are processed in order, so if you want to exclude *.c but not generator.c, the general rule must come before the specific one). But as soon as at least one rule matched in a file, we drop the remaining files, because a more general file never cancels the effect of a more specific one (this is why check_ignore1 is ternary and not boolean)

另一个函数用于与作用域规则(各种 .gitignore 文件)的字典进行匹配。它从路径的目录开始,递归向上移动到父目录,直到测试到根目录。请注意,这个函数(以及接下来的两个函数)从不在给定的 .gitignore 文件内部中中断。即使某个规则匹配,它们仍会继续遍历该文件,因为另一个规则可能会否定之前的效果(规则按顺序处理,因此如果你想排除 *.c 但不想排除 generator.c,一般规则必须在特定规则之前)。但是,只要在一个文件中至少有一个规则匹配,我们就丢弃剩余的文件,因为更一般的文件永远不会取消更具体的文件的效果(这就是为什么 check_ignore1 是三元的而不是布尔的原因)。

python
def check_ignore_scoped(rules, path):
    parent = os.path.dirname(path)
    while True:
        if parent in rules:
            result = check_ignore1(rules[parent], path)
            if result is not None:
                return result
        if parent == "":
            break
        parent = os.path.dirname(parent)
    return None

A much simpler function to match against the list of absolute rules. Notice that the order we push those rules to the list matters (we did read the repository rules before the global ones!)

一个更简单的函数用于与绝对规则列表进行匹配。注意,我们将这些规则推送到列表中的顺序很重要(我们确实先读取了仓库规则,然后才是全局规则!)。

python
def check_ignore_absolute(rules, path):
    parent = os.path.dirname(path)
    for ruleset in rules:
        result = check_ignore1(ruleset, path)
        if result is not None:
            return result
    return False  # This is a reasonable default at this point. 这在此时是一个合理的默认值。

And finally, a function to bind them all.

最后,定义一个函数将它们绑定在一起。

python
def check_ignore(rules, path):
    if os.path.isabs(path):
        raise Exception("This function requires path to be relative to the repository's root")

    result = check_ignore_scoped(rules.scoped, path)
    if result != None:
        return result

    return check_ignore_absolute(rules.absolute, path)
python
def check_ignore(rules, path):
    if os.path.isabs(path):
        raise Exception("此函数要求路径相对于仓库的根目录")

    result = check_ignore_scoped(rules.scoped, path)
    if result is not None:
        return result

    return check_ignore_absolute(rules.absolute, path)

You can now call wyag check-ignore. On its own source tree:

现在你可以调用 wyag check-ignore。在它自己的源树中:

txt
$ wyag check-ignore hello.el hello.elc hello.html wyag.zip wyag.tar
hello.elc
hello.html
wyag.zip

警告

This is only an approximation

这只是一个近似实现

This isn’t a perfect reimplementation. In particular, excluding whole directories with a rule that’s only the directory name (eg __pycache__) won’t work, because fnmatch would want the pattern as __pycache__/**. If you really want to play with ignore rules, this may be a good starting point.

这并不是一个完美的重新实现。特别是,通过仅使用目录名称的规则(例如 __pycache__)来排除整个目录将不起作用,因为 fnmatch 需要模式为 __pycache__/**。如果你真的想玩弄忽略规则,这可能是一个不错的起点

8.5. The status command | status 命令

status is more complex than ls-files, because it needs to compare the index with both HEAD and the actual filesystem. You call git status to know which files were added, removed or modified since the last commit, and which of these changes are actually staged, and will make it to the next commit. So status actually compares the HEAD with the staging area, and the staging area with the worktree. This is what its output looks like:

statusls-files 更复杂,因为它需要将索引与 HEAD 和实际文件系统进行比较。你调用 git status 来知道自上一个提交以来哪些文件被添加、删除或修改,以及这些更改中哪些实际上是已暂存的,并将包含在下一个提交中。因此,status 实际上比较 HEAD 与暂存区,以及暂存区与工作树之间的差异。它的输出看起来像这样:

bash
On branch master

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
    modified:   write-yourself-a-git.org

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
    modified:   write-yourself-a-git.org

Untracked files:
  (use "git add <file>..." to include in what will be committed)
    org-html-themes/
    wl-copy
bash
在分支 master

待提交的更改:
  (使用 "git restore --staged <file>..." 来取消暂存)
    修改:   write-yourself-a-git.org

未暂存的更改:
  (使用 "git add <file>..." 来更新将要提交的内容)
  (使用 "git restore <file>..." 来放弃工作目录中的更改)
    修改:   write-yourself-a-git.org

未跟踪的文件:
  (使用 "git add <file>..." 将其包含在将要提交的内容中)
    org-html-themes/
    wl-copy

We’ll implement status in three parts: first the active branch or “detached HEAD”, then the difference between the index and the worktree (“Changes not staged for commit”), then the difference between HEAD and the index (“Changes to be committed” and “Untracked files”).

我们将 status 实现分为三个部分:首先是活动分支或“分离的 HEAD”,然后是索引与工作树之间的差异(“未暂存的更改”),最后是 HEAD 与索引之间的差异(“待提交的更改”和“未跟踪的文件”)。

The public interface is dead simple, our status will take no argument:

公共接口非常简单,我们的状态命令不接受任何参数:

python
argsp = argsubparsers.add_parser("status", help = "Show the working tree status.")
python
argsp = argsubparsers.add_parser("status", help = "显示工作树状态。")

And the bridge function just calls the three component functions in order:

桥接函数按顺序调用三个组件函数:

python
def cmd_status(_):
    repo = repo_find()
    index = index_read(repo)

    cmd_status_branch(repo)
    cmd_status_head_index(repo, index)
    print()
    cmd_status_index_worktree(repo, index)

8.5.1. Finding the active branch 查找活动分支

First we need to know if we’re on a branch, and if so which one. We do this by just looking at .git/HEAD. It should contain either an hexadecimal ID (a ref to a commit, in detached HEAD state), or an indirect reference to something in refs/heads/: the active branch. We either return its name, or False.

首先,我们需要知道我们是否在一个分支上,如果是的话是哪一个。我们通过查看 .git/HEAD 来实现。它应该包含一个十六进制 ID(指向一个提交,表示分离的 HEAD 状态),或者一个指向 refs/heads/ 中某个内容的间接引用:即活动分支。我们返回其名称或 False

python
def branch_get_active(repo):
    with open(repo_file(repo, "HEAD"), "r") as f:
        head = f.read()

    if head.startswith("ref: refs/heads/"):
        return(head[16:-1])
    else:
        return False

Based on this, we can write the first of the three cmd_status_* functions the bridge calls. This one prints the name of the active branch, or the hash of the detached HEAD:

基于此,我们可以编写桥接调用的三个 cmd_status_* 函数中的第一个。这个函数打印活动分支的名称,或者分离 HEAD 的哈希值:

python
def cmd_status_branch(repo):
    branch = branch_get_active(repo)
    if branch:
        print("On branch {}.".format(branch))
    else:
        print("HEAD detached at {}".format (object_find(repo, "HEAD")))
python
def cmd_status_branch(repo):
    branch = branch_get_active(repo)
    if branch:
        print("在分支 {} 上。".format(branch))
    else:
        print("HEAD 在 {} 上分离".format(object_find(repo, "HEAD")))

8.5.2. Finding changes between HEAD and index 查找 HEAD 和索引之间的变化

The second block of the status output is the “changes to be committed”, that is, how the staging area differs from HEAD. To do this, we’re going first to read the HEAD tree, and flatten it as a single dict (hashmap) with full paths as keys, so it’s closer to the (flat) index associating paths to blobs. Then we’ll just compare them and output their differences.

状态输出的第二部分是“待提交的更改”,即暂存区与 HEAD 的不同之处。为此,我们首先需要读取 HEAD 树,并将其展平为一个包含完整路径作为键的字典(哈希映射),这样它就更接近于将路径与 blob 关联的(扁平)索引。然后我们只需比较它们并输出它们的差异。

First, a function to convert a tree (recursive, remember) to a (flat) dict. And since trees are recursive, so the function itself is, again — sorry about that 😃

首先,编写一个将树(递归的,记住)转换为(扁平的)字典的函数。由于树是递归的,因此该函数本身也是递归的——对此表示歉意 😃

python
def tree_to_dict(repo, ref, prefix=""):
  ret = dict()
  tree_sha = object_find(repo, ref, fmt=b"tree")
  tree = object_read(repo, tree_sha)

  for leaf in tree.items:
      full_path = os.path.join(prefix, leaf.path)

      # We read the object to extract its type (this is uselessly
      # expensive: we could just open it as a file and read the
      # first few bytes)
      # 我们读取对象以提取其类型(这无谓地昂贵:我们可以直接将其作为文件打开并读取前几个字节)
      is_subtree = leaf.mode.startswith(b'04')
      # Depending on the type, we either store the path (if it's a
      # blob, so a regular file), or recurse (if it's another tree,
      # so a subdir)
      # 根据类型,我们要么存储路径(如果是 blob,表示常规文件),要么递归(如果是另一个树,表示子目录)
      if is_subtree:
        ret.update(tree_to_dict(repo, leaf.sha, full_path))
      else:
        ret[full_path] = leaf.sha

  return ret

And the command itself:

接下来是命令本身:

python
def cmd_status_head_index(repo, index):
    print("Changes to be committed:")

    head = tree_to_dict(repo, "HEAD")
    for entry in index.entries:
        if entry.name in head:
            if head[entry.name] != entry.sha:
                print("  modified:", entry.name)
            del head[entry.name] # Delete the key
        else:
            print("  added:   ", entry.name)

    # Keys still in HEAD are files that we haven't met in the index,
    # and thus have been deleted.
    for entry in head.keys():
        print("  deleted: ", entry)
python
def cmd_status_head_index(repo, index):
    print("待提交的更改:")

    head = tree_to_dict(repo, "HEAD")
    for entry in index.entries:
        if entry.name in head:
            if head[entry.name] != entry.sha:
                print("  修改了:", entry.name)
            del head[entry.name]  # 删除该键
        else:
            print("  添加了:", entry.name)

    # 仍在 HEAD 中的键是我们在索引中未遇到的文件,因此这些文件已被删除。
    for entry in head.keys():
        print("  已删除:", entry)

8.5.3. Finding changes between index and worktree | 查找索引与工作树之间的变化

python
def cmd_status_index_worktree(repo, index):
    print("Changes not staged for commit:")

    ignore = gitignore_read(repo)

    gitdir_prefix = repo.gitdir + os.path.sep

    all_files = list()

    # We begin by walking the filesystem
    for (root, _, files) in os.walk(repo.worktree, True):
        if root==repo.gitdir or root.startswith(gitdir_prefix):
            continue
        for f in files:
            full_path = os.path.join(root, f)
            rel_path = os.path.relpath(full_path, repo.worktree)
            all_files.append(rel_path)

    # We now traverse the index, and compare real files with the cached
    # versions.

    for entry in index.entries:
        full_path = os.path.join(repo.worktree, entry.name)

        # That file *name* is in the index

        if not os.path.exists(full_path):
            print("  deleted: ", entry.name)
        else:
            stat = os.stat(full_path)

            # Compare metadata
            ctime_ns = entry.ctime[0] * 10**9 + entry.ctime[1]
            mtime_ns = entry.mtime[0] * 10**9 + entry.mtime[1]
            if (stat.st_ctime_ns != ctime_ns) or (stat.st_mtime_ns != mtime_ns):
                # If different, deep compare.
                # @FIXME This *will* crash on symlinks to dir.
                with open(full_path, "rb") as fd:
                    new_sha = object_hash(fd, b"blob", None)
                    # If the hashes are the same, the files are actually the same.
                    same = entry.sha == new_sha

                    if not same:
                        print("  modified:", entry.name)

        if entry.name in all_files:
            all_files.remove(entry.name)

    print()
    print("Untracked files:")

    for f in all_files:
        # @TODO If a full directory is untracked, we should display
        # its name without its contents.
        if not check_ignore(ignore, f):
            print(" ", f)
python
def cmd_status_index_worktree(repo, index):
    print("未暂存的更改:")

    ignore = gitignore_read(repo)

    gitdir_prefix = repo.gitdir + os.path.sep

    all_files = list()

    # 我们首先遍历文件系统
    for (root, _, files) in os.walk(repo.worktree, True):
        if root == repo.gitdir or root.startswith(gitdir_prefix):
            continue
        for f in files:
            full_path = os.path.join(root, f)
            rel_path = os.path.relpath(full_path, repo.worktree)
            all_files.append(rel_path)

    # 现在我们遍历索引,并比较真实文件与缓存版本。

    for entry in index.entries:
        full_path = os.path.join(repo.worktree, entry.name)

        # 该文件名在索引中

        if not os.path.exists(full_path):
            print("  已删除:", entry.name)
        else:
            stat = os.stat(full_path)

            # 比较元数据
            ctime_ns = entry.ctime[0] * 10**9 + entry.ctime[1]
            mtime_ns = entry.mtime[0] * 10**9 + entry.mtime[1]
            if (stat.st_ctime_ns != ctime_ns) or (stat.st_mtime_ns != mtime_ns):
                # 如果不同,进行深度比较。
                # @FIXME 如果是指向目录的符号链接,这将崩溃。
                with open(full_path, "rb") as fd:
                    new_sha = object_hash(fd, b"blob", None)
                    # 如果哈希相同,文件实际上是相同的。
                    same = entry.sha == new_sha

                    if not same:
                        print("  修改了:", entry.name)

        if entry.name in all_files:
            all_files.remove(entry.name)

    print()
    print("未跟踪的文件:")

    for f in all_files:
        # @TODO 如果整个目录未跟踪,我们应该仅显示其名称而不包含内容。
        if not check_ignore(ignore, f):
            print(" ", f)

Our status function is done. It should output something like:

我们的状态函数完成了。它的输出应该类似于:

bash
$ wyag status
On branch main.
Changes to be committed:
  added:    src/main.c

Changes not staged for commit:
  modified: build.py
  deleted:  README.org

Untracked files:
  src/cli.c
bash
$ wyag status
在分支 main 上。
待提交的更改:
  添加了:src/main.c

未暂存的更改:
  修改了:build.py
  已删除:README.org

未跟踪的文件:
  src/cli.c

The real status is a lot smarter: it can detect renames, for example, where ours cannot. Another significant difference worth mentioning is that git status actually writes the index back if a file metadata was modified, but not its content. You can see it with our special ls-files:

真实的 status 更加智能:例如,它可以检测重命名,而我们的版本则无法。还有一个显著的区别值得提及的是,git status 实际上会在文件元数据被修改但内容未被修改时,写回 索引。您可以通过我们的特殊 ls-files 查看这一点:

bash
$ wyag ls-files --verbose
Index file format v2, containing 1 entries.
file
  regular file with perms: 644
  on blob: f2f279981ce01b095c42ee7162aadf60185c8f67
  created: 2023-07-18 18:26:15.771460869, modified: 2023-07-18 18:26:15.771460869
  ...
$ touch file
$ git status > /dev/null
$ wyag ls-files --verbose
Index file format v2, containing 1 entries.
file
  regular file with perms: 644
  on blob: f2f279981ce01b095c42ee7162aadf60185c8f67
  created: 2023-07-18 18:26:41.421743098, modified: 2023-07-18 18:26:41.421743098
  ...
bash
$ wyag ls-files --verbose
索引文件格式 v2,包含 1 个条目。
file
  普通文件,权限:644
  对应的 blob: f2f279981ce01b095c42ee7162aadf60185c8f67
  创建时间:2023-07-18 18:26:15.771460869,修改时间:2023-07-18 18:26:15.771460869
  ...
$ touch file
$ git status > /dev/null
$ wyag ls-files --verbose
索引文件格式 v2,包含 1 个条目。
file
  普通文件,权限:644
  对应的 blob: f2f279981ce01b095c42ee7162aadf60185c8f67
  创建时间:2023-07-18 18:26:41.421743098,修改时间:2023-07-18 18:26:41.421743098
  ...

Notice how both timestamps, from the index file, were updated by git status to reflect the changes in the real file’s metadata.

注意,索引文件中的两个时间戳都被 git status 更新,以反映真实文件元数据的变化。