Skip to content

6. Reading commit data: checkout 读取提交数据:检出

It’s all well that commits hold a lot more than files and directories in a given state, but that doesn’t make them really useful. It’s probably time to start implementing tree objects as well, so we’ll be able to checkout commits into the work tree.

虽然提交包含了比给定状态下的文件和目录更多的信息,但这并不使它们真正有用。现在可能是时候开始实现树对象了,这样我们就能将提交检出到工作树中。

6.1. What’s in a tree? 树中有什么?

Informally, a tree describes the content of the work tree, that it, it associates blobs to paths. It’s an array of three-element tuples made of a file mode, a path (relative to the worktree) and a SHA-1. A typical tree contents may look like this:

非正式地说,树描述了工作树的内容,也就是说,它将 blobs 关联到路径。它是由三个元素的元组组成的数组,每个元组包含一个文件模式、一个相对于工作树的路径和一个 SHA-1。一个典型的树内容可能看起来像这样:

Mode 文件模式SHA-1Path 路径
100644894a44cc066a027465cd26d634948d56d13af9af.gitignore
10064494a9ed024d3859793618152ea559a168bbcbb5e2LICENSE
100644bab489c4f4600a38ce6dbfd652b90383a4aa3e45README.md
1006446d208e47659a2a10f5f8640e0155d9276a2130a9src
040000e7445b03aea61ec801b20d6ab62f076208b7d097tests
040000d5ec863f17f3a2e92aa8f6b66ac18f7b09fd1b38main.c

Mode is just the file’s mode, path is its location. The SHA-1 refers to either a blob or another tree object. If a blob, the path is a file, if a tree, it’s directory. To instantiate this tree in the filesystem, we would begin by loading the object associated to the first path (.gitignore) and check its type. Since it’s a blob, we’ll just create a file called .gitignore with this blob’s contents; and same for LICENSE and README.md. But the object associated with src is not a blob, but another tree: we’ll create the directory src and repeat the same operation in that directory with the new tree.

模式只是文件的 模式,路径是它的位置。SHA-1 可能指向一个 blob 或另一个树对象。如果是 blob,路径就是文件;如果是树,则是目录。为了在文件系统中实例化这个树,我们将首先加载与第一个路径(.gitignore)相关联的对象,并检查它的类型。由于它是一个 blob,我们将创建一个名为 .gitignore 的文件,内容为这个 blob 的内容;对 LICENSEREADME.md 也是如此。但与 src 相关联的对象不是一个 blob,而是另一个树:我们将创建目录 src,并在该目录中用新的树重复相同的操作。

A path is a single filesystem entry

The path identifies exactly one file or directory. Not two, not three. If you have five levels of nested directories, even if four are empty save the next directory, you’re going to need five tree objects recursively referring to one another. You cannot take the shortcut of putting a full path in a single tree entry, like dir1/dir2/dir3/dir4/dir5.

警告

路径是单一的文件系统条目

路径精确地标识一个文件或目录。不是两个,也不是三个。如果你有五层嵌套的目录,即使四个目录是空的,只有下一个目录有内容,你也需要五个树对象递归地相互引用。你不能通过将完整路径放在单个树条目中来走捷径,例如 dir1/dir2/dir3/dir4/dir5

6.2. Parsing trees 解析树对象

Unlike tags and commits, tree objects are binary objects, but their format is actually quite simple. A tree is the concatenation of records of the format:

与标签和提交不同,树对象是二进制对象,但它们的格式实际上非常简单。一个树是格式记录的串联,格式如下:

txt
[mode] space [path] 0x00 [sha-1]
txt
[mode] 空格 [path] 0x00 [sha-1]
  • [mode] is up to six bytes and is an octal representation of a file mode, stored in ASCII. For example, 100644 is encoded with byte values 49 (ASCII “1”), 48 (ASCII “0”), 48, 54, 52, 52. The first two digits encode the file type (file, directory, symlink or submodule), the last four the permissions.
  • [mode] 是最多六个字节,表示文件 模式 的八进制表示,存储为 ASCII。例如,100644 被编码为字节值 49(ASCII “1”)、48(ASCII “0”)、48、54、52、52。前两位数字编码文件类型(文件、目录、符号链接或子模块),最后四位表示权限。
  • It’s followed by 0x20, an ASCII space;
  • 接下来是 0x20,一个 ASCII 空格
  • Followed by the null-terminated (0x00) path;
  • 然后是以空字符(0x00)终止的 路径
  • Followed by the object’s SHA-1 in binary encoding, on 20 bytes.
  • 最后是对象的 SHA-1 以二进制编码,长度为 20 字节。

The parser is going to be quite simple. First, create a tiny object wrapper for a single record (a leaf, a single path):

解析器将会非常简单。首先,为单个记录(一个叶子,一个路径)创建一个小的对象包装:

python
class GitTreeLeaf(object):
    def __init__(self, mode, path, sha):
        self.mode = mode
        self.path = path
        self.sha = sha

Because a tree object is just the repetition of the same fundamental data structure, we write the parser in two functions. First, a parser to extract a single record, which returns parsed data and the position it reached in input data:

由于树对象只是相同基本数据结构的重复,我们将解析器写成两个函数。首先是提取单个记录的解析器,它返回解析的数据和在输入数据中达到的位置:

python
def tree_parse_one(raw, start=0):
    # Find the space terminator of the mode
    x = raw.find(b' ', start)
    assert x-start == 5 or x-start==6

    # Read the mode
    mode = raw[start:x]
    if len(mode) == 5:
        # Normalize to six bytes.
        mode = b" " + mode

    # Find the NULL terminator of the path
    y = raw.find(b'\x00', x)
    # and read the path
    path = raw[x+1:y]

    # Read the SHA and convert to a hex string
    sha = format(int.from_bytes(raw[y+1:y+21], "big"), "040x")
    return y+21, GitTreeLeaf(mode, path.decode("utf8"), sha)
python
def tree_parse_one(raw, start=0):
    # 查找模式的空格终止符
    x = raw.find(b' ', start)
    assert x - start == 5 or x - start == 6

    # 读取模式
    mode = raw[start:x]
    if len(mode) == 5:
        # 标准化为六个字节。
        mode = b" " + mode

    # 查找路径的 NULL 终止符
    y = raw.find(b'\x00', x)
    # 读取路径
    path = raw[x + 1:y]

    # 读取 SHA 并转换为十六进制字符串
    sha = format(int.from_bytes(raw[y + 1:y + 21], "big"), "040x")
    return y + 21, GitTreeLeaf(mode, path.decode("utf8"), sha)

And the “real” parser which just calls the previous one in a loop, until input data is exhausted.

接下来是“真正”的解析器,它在循环中调用前一个解析器,直到输入数据被耗尽。

python
def tree_parse(raw):
    pos = 0
    max = len(raw)
    ret = list()
    while pos < max:
        pos, data = tree_parse_one(raw, pos)
        ret.append(data)

    return ret

We’ll finally need a serializer to write trees back. Because we may have added or modified entries, we need to sort them again. Consistently sorting matters, because we need to respect git’s identity rules, which says that no two equivalent object can have a different hash — but differently sorted trees with the same contents would be equivalent (describing the same directory structure), and still numerically distinct (different SHA-1 identifiers). Incorrectly sorted trees are invalid, but git doesn’t enforce that. I created some invalid trees by accident writing wyag, and all I got was weird bugs in git status (specifically, status would report an actually clean worktree as fully modified). We don’t want that.

我们最终需要一个序列化器来将树写回。因为我们可能已经添加或修改了条目,所以需要重新对它们进行排序。一致的排序很重要,因为我们需要遵循 Git 的 身份规则,即没有两个等效对象可以有不同的哈希——但同样内容的不同排序的树 是等效的(描述相同的目录结构),同时仍然是数值上不同的(不同的 SHA-1 标识符)。排序不正确的树是无效的,但 Git 并不强制执行这一点。在编写 wyag 时,我意外创建了一些无效树,结果在 git status 中遇到了奇怪的错误(具体来说,status 会报告实际干净的工作区为完全修改)。我们不希望发生这种情况。

The ordering function is quite simple, with an unexpected twist. are Entries sorted by name, alphabetically, but directories (that is, tree entries) are sorted with a final / added. It matters, because it means that if whatever names a regular file, it will sort beforewhatever.c, but if whatever is a dir, it will sort after, as whatever/. (I’m not sure why git does that. If you’re curious, see the function base_name_compare in tree.c in the git source)

排序函数非常简单,但有一个意外的变化。条目按名称字母顺序排序, 目录(即树条目)则添加了最终的 / 进行排序。这很重要,因为这意味着如果 whatever 是一个常规文件,它会在 whatever.c 之前排序,但如果 whatever 是一个目录,它会在之后排序,表现为 whatever/。(我不确定为什么 Git 这样做。如果你感兴趣,可以查看 Git 源代码中的 tree.c 文件中的 base_name_compare 函数。)

python
# Notice this isn't a comparison function, but a conversion function.
# Python's default sort doesn't accept a custom comparison function,
# like in most languages, but a `key` arguments that returns a new
# value, which is compared using the default rules.  So we just return
# the leaf name, with an extra / if it's a directory.
def tree_leaf_sort_key(leaf):
    if leaf.mode.startswith(b"10"):
        return leaf.path
    else:
        return leaf.path + "/"
python
# 注意这不是比较函数,而是转换函数。
# Python 的默认排序不接受自定义比较函数,
# 和大多数语言不同,而是接受返回新值的 `key` 参数,
# 该值使用默认规则进行比较。所以我们只是返回
# 叶子名称,如果是目录则多加一个 /。
def tree_leaf_sort_key(leaf):
    if leaf.mode.startswith(b"10"):
        return leaf.path
    else:
        return leaf.path + "/"

Then the serializer itself. This one is very simple: we sort the items using our newly created function as a transformer, then write them in order.

然后是序列化器本身。这个非常简单:我们使用新创建的函数作为转换器对条目进行排序,然后按顺序写入它们。

python
def tree_serialize(obj):
    obj.items.sort(key=tree_leaf_sort_key)
    ret = b''
    for i in obj.items:
        ret += i.mode
        ret += b' '
        ret += i.path.encode("utf8")
        ret += b'\x00'
        sha = int(i.sha, 16)
        ret += sha.to_bytes(20, byteorder="big")
    return ret

And now we just have to combine all that into a class:

现在我们只需将所有这些组合成一个类:

python
class GitTree(GitObject):
    fmt=b'tree'

    def deserialize(self, data):
        self.items = tree_parse(data)

    def serialize(self):
        return tree_serialize(self)

    def init(self):
        self.items = list()

6.3. Showing trees: ls-tree 显示树:ls-tree

While we’re at it, let’s add the ls-tree command to wyag. It’s so easy there’s no reason not to. git ls-tree [-r] TREE simply prints the contents of a tree, recursively with the -r flag. In recursive mode, it doesn’t show subtrees, just final objects with their full paths.

既然我们在这方面,不妨给 wyag 添加ls-tree命令。这非常简单,没有理由不这样做。git ls-tree [-r] TREE简单地打印树的内容,使用-r标志时递归显示。在递归模式下,它不显示子树,只显示最终对象及其完整路径。

python
argsp = argsubparsers.add_parser("ls-tree", help="Pretty-print a tree object.")
argsp.add_argument("-r",
                   dest="recursive",
                   action="store_true",
                   help="Recurse into sub-trees")

argsp.add_argument("tree",
                   help="A tree-ish object.")

def cmd_ls_tree(args):
    repo = repo_find()
    ls_tree(repo, args.tree, args.recursive)

def ls_tree(repo, ref, recursive=None, prefix=""):
    sha = object_find(repo, ref, fmt=b"tree")
    obj = object_read(repo, sha)
    for item in obj.items:
        if len(item.mode) == 5:
            type = item.mode[0:1]
        else:
            type = item.mode[0:2]

        match type: # Determine the type.
            case b'04': type = "tree"
            case b'10': type = "blob" # A regular file.
            case b'12': type = "blob" # A symlink. Blob contents is link target.
            case b'16': type = "commit" # A submodule
            case _: raise Exception("Weird tree leaf mode {}".format(item.mode))

        if not (recursive and type=='tree'): # This is a leaf
            print("{0} {1} {2}\t{3}".format(
                "0" * (6 - len(item.mode)) + item.mode.decode("ascii"),
                # Git's ls-tree displays the type
                # of the object pointed to.  We can do that too :)
                type,
                item.sha,
                os.path.join(prefix, item.path)))
        else: # This is a branch, recurse
            ls_tree(repo, item.sha, recursive, os.path.join(prefix, item.path))
python
argsp = argsubparsers.add_parser("ls-tree", help="美观地打印树对象。")
argsp.add_argument("-r",
                   dest="recursive",
                   action="store_true",
                   help="递归进入子树")

argsp.add_argument("tree",
                   help="一个树状对象。")

def cmd_ls_tree(args):
    repo = repo_find()
    ls_tree(repo, args.tree, args.recursive)

def ls_tree(repo, ref, recursive=None, prefix=""):
    sha = object_find(repo, ref, fmt=b"tree")
    obj = object_read(repo, sha)
    for item in obj.items:
        if len(item.mode) == 5:
            type = item.mode[0:1]
        else:
            type = item.mode[0:2]

        match type: # 确定类型。
            case b'04': type = "tree"
            case b'10': type = "blob" # 常规文件。
            case b'12': type = "blob" # 符号链接。Blob 内容是链接目标。
            case b'16': type = "commit" # 子模块
            case _: raise Exception("奇怪的树叶模式 {}".format(item.mode))

        if not (recursive and type=='tree'): # 这是一个叶子
            print("{0} {1} {2}\t{3}".format(
                "0" * (6 - len(item.mode)) + item.mode.decode("ascii"),
                # Git 的 ls-tree 显示指向对象的类型。
                # 我们也可以这样做 :)
                type,
                item.sha,
                os.path.join(prefix, item.path)))
        else: # 这是一个分支,递归
            ls_tree(repo, item.sha, recursive, os.path.join(prefix, item.path))

6.4. The checkout command | checkout 命令

git checkout simply instantiates a commit in the worktree. We’re going to oversimplify the actual git command to make our implementation clear and understandable. We’re also going to add a few safeguards. Here’s how our version of checkout will work:

git checkout 只是将一个提交实例化到工作区。我们将简化实际的 git 命令,以便让我们的实现更加清晰和易于理解。同时,我们将添加一些安全措施。以下是我们版本的 checkout 的工作方式:

  • It will take two arguments: a commit, and a directory. Git checkout only needs a commit.
  • 它将接受两个参数:一个提交和一个目录。Git checkout 只需要一个提交。
  • It will then instantiate the tree in the directory, if and only if the directory is empty. Git is full of safeguards to avoid deleting data, which would be too complicated and unsafe to try to reproduce in wyag. Since the point of wyag is to demonstrate git, not to produce a working implementation, this limitation is acceptable.
  • 然后它将在目录中实例化树,仅当目录为空时。Git 充满了避免删除数据的安全措施,而在 wyag 中重现这些措施太复杂且不安全。由于 wyag 的目的是演示 git,而不是生成一个实际的实现,这个限制是可以接受的。

Let’s get started. As usual, we need a subparser:

让我们开始吧。像往常一样,我们需要一个子解析器:

python
argsp = argsubparsers.add_parser("checkout", help="Checkout a commit inside of a directory.")

argsp.add_argument("commit",
                   help="The commit or tree to checkout.")

argsp.add_argument("path",
                   help="The EMPTY directory to checkout on.")
python
argsp = argsubparsers.add_parser("checkout", help="在一个目录中签出一个提交。")

argsp.add_argument("commit",
                   help="要签出的提交或树。")

argsp.add_argument("path",
                   help="要签出的空目录。")

A wrapper function:

包装函数:

python
def cmd_checkout(args):
    repo = repo_find()

    obj = object_read(repo, object_find(repo, args.commit))

    # If the object is a commit, we grab its tree
    if obj.fmt == b'commit':
        obj = object_read(repo, obj.kvlm[b'tree'].decode("ascii"))

    # Verify that path is an empty directory
    if os.path.exists(args.path):
        if not os.path.isdir(args.path):
            raise Exception("Not a directory {0}!".format(args.path))
        if os.listdir(args.path):
            raise Exception("Not empty {0}!".format(args.path))
    else:
        os.makedirs(args.path)

    tree_checkout(repo, obj, os.path.realpath(args.path))
python
def cmd_checkout(args):
    repo = repo_find()

    obj = object_read(repo, object_find(repo, args.commit))

    # 如果对象是一个提交,我们获取它的树
    if obj.fmt == b'commit':
        obj = object_read(repo, obj.kvlm[b'tree'].decode("ascii"))

    # 验证路径是否是一个空目录
    if os.path.exists(args.path):
        if not os.path.isdir(args.path):
            raise Exception("不是目录 {0}!".format(args.path))
        if os.listdir(args.path):
            raise Exception("不是空的 {0}!".format(args.path))
    else:
        os.makedirs(args.path)

    tree_checkout(repo, obj, os.path.realpath(args.path))

And a function to do the actual work:

实际工作的函数:

src-python
def tree_checkout(repo, tree, path):
    for item in tree.items:
        obj = object_read(repo, item.sha)
        dest = os.path.join(path, item.path)

        if obj.fmt == b'tree':
            os.mkdir(dest)
            tree_checkout(repo, obj, dest)
        elif obj.fmt == b'blob':
            # @TODO Support symlinks (identified by mode 12****)
            with open(dest, 'wb') as f:
                f.write(obj.blobdata)
python
def tree_checkout(repo, tree, path):
    for item in tree.items:
        obj = object_read(repo, item.sha)
        dest = os.path.join(path, item.path)

        if obj.fmt == b'tree':
            os.mkdir(dest)
            tree_checkout(repo, obj, dest)
        elif obj.fmt == b'blob':
            # @TODO 支持符号链接(通过模式 12**** 识别)
            with open(dest, 'wb') as f:
                f.write(obj.blobdata)