Skip to content

4. Reading and writing objects: hash-object and cat-file | 读取和写入对象:hash-object 和 cat-file

4.1. What are objects? | 什么是对象?

Now that we have repositories, putting things inside them is in order. Also, repositories are boring, and writing a Git implementation shouldn’t be just a matter of writing a bunch of mkdir. Let’s talk about objects, and let’s implement git hash-object and git cat-file.

现在我们已经有了仓库,接下来可以往里面放东西了。此外,仓库本身是比较无聊的,编写一个 Git 实现不应该只是简单地写一堆 mkdir。让我们来谈谈 对象,并实现 git hash-objectgit cat-file

Maybe you don’t know these two commands — they’re not exactly part of an everyday git toolbox, and they’re actually quite low-level (“plumbing”, in git parlance). What they do is actually very simple: hash-object converts an existing file into a git object, and cat-file prints an existing git object to the standard output.

也许你对这两个命令并不熟悉——它们并不是日常 Git 工具箱的一部分,实际上它们是相当底层的(在 Git 行话中称为“管道”)。它们的功能其实非常简单:hash-object 将一个现有文件转换为 Git 对象,而 cat-file 则将一个现有的 Git 对象打印到标准输出。

Now, what actually is a Git object? At its core, Git is a “content-addressed filesystem”. That means that unlike regular filesystems, where the name of a file is arbitrary and unrelated to that file’s contents, the names of files as stored by Git are mathematically derived from their contents. This has a very important implication: if a single byte of, say, a text file, changes, its internal name will change, too. To put it simply: you don’t modify a file in git, you create a new file in a different location. Objects are just that: files in the git repository, whose paths are determined by their contents.

那么,Git 对象到底是什么? 从本质上讲,Git 是一个“基于内容寻址的文件系统”。这意味着,与普通文件系统不同,普通文件系统中,文件的名称是任意的,与文件内容无关,而 Git 存储的文件名称是根据其内容数学推导而来的。这有一个非常重要的含义:如果某个文本文件的单个字节发生变化,它的内部名称也会随之改变。简单来说:你在 Git 中并不是 修改 文件,而是在不同的位置创建一个新文件。对象就是这样:在 Git 仓库中的文件,其路径由其内容决定

WARNING

Git is not (really) a key-value store

Some documentation, including the excellent Pro Git, call Git a “key-value store”. This is not incorrect, but may be misleading. Regular filesystems are actually closer to a key-value store than Git is. Because it computes keys from data, Git could rather be called a value-value store.

警告

Git 其实并不是一个真正的键值存储

一些文档,包括优秀的 Pro Git,将 Git 称为“键值存储”。这并不错误,但可能会误导人。普通的文件系统实际上更接近于键值存储,而不是 Git。由于 Git 是从数据计算键的,因此可以更准确地称其为 值值存储

Git uses objects to store quite a lot of things: first and foremost, the actual files it keeps in version control — source code, for example. Commit are objects, too, as well as tags. With a few notable exceptions (which we’ll see later!), almost everything, in Git, is stored as an object.

Git 使用对象来存储很多东西:首先也是最重要的,就是它在版本控制中保存的实际文件——例如源代码。提交(commit)也是对象,标签(tag)也是。除了少数显著的例外(稍后会看到!),几乎所有东西在 Git 中都以对象的形式存储。

The path where git stores a given object is computed by calculating the SHA-1hash of its contents. More precisely, Git renders the hash as a lowercase hexadecimal string, and splits it in two parts: the first two characters, and the rest. It uses the first part as a directory name, the rest as the file name (this is because most filesystems hate having too many files in a single directory and would slow down to a crawl. Git’s method creates 256 possible intermediate directories, hence dividing the average number of files per directory by 256)

Git 存储给定对象的路径是通过计算其内容的 SHA-1 哈希值 来确定的。更确切地说,Git 将哈希值表示为小写的十六进制字符串,并将其分为两部分:前两位字符和其余部分。它使用前两位作为目录名,其余部分作为文件名(这是因为大多数文件系统不喜欢在单个目录中有太多文件,这会导致性能下降。Git 的方法创建了 256 个可能的中间目录,从而将每个目录的平均文件数减少到 256 分之一)。

**What is a hash function?**

SHA-1 is what we call a “hash function”. Simply put, a hash function is a kind of unidirectional mathematical function: it is easy to compute the hash of a value, but there’s no way to compute back which value produced a hash.

A very simple example of a hash function is the classical len (or strlen) function, which returns the length of a string. It’s really easy to compute the length of a string, and the length of a given string will never change (unless the string itself changes, of course!) but it’s impossible to retrieve the original string, given only its length. Cryptographic hash functions are a much more complex version of the same, with the added property that computing an input meant to produce a given hash is hard enough to be practically impossible. (To produce an input i with strlen(i) == 12, you just type twelve random characters. With algorithms such as SHA-1. it would take much, much longer — long enough to be practically impossible[1]).

备注

什么是哈希函数?

SHA-1 被称为“哈希函数”。简单来说,哈希函数是一种单向数学函数:计算一个值的哈希值很简单,但无法反向计算出哪个值生成了该哈希。

哈希函数的一个非常简单的例子是经典的 len(或 strlen)函数,它返回字符串的长度。计算字符串的长度非常容易,而且给定字符串的长度永远不会改变(当然,除非字符串本身发生变化!),但仅凭长度是不可能恢复原始字符串的。密码学哈希函数是同类函数的复杂版本,增加了一个特性:计算出一个输入值以生成给定的哈希是相当困难的,几乎不可能。(要生成一个长度为 12 的输入 i,你只需输入 12 个随机字符。使用如 SHA-1 这样的算法,则需要更长的时间——长到几乎不可能的程度[1:1])。

Before we start implementing the object storage system, we must understand their exact storage format. An object starts with a header that specifies its type: blob, commit, tag or tree (more on that in a second). This header is followed by an ASCII space (0x20), then the size of the object in bytes as an ASCII number, then null (0x00) (the null byte), then the contents of the object. The first 48 bytes of a commit object in Wyag’s repo look like this:

在我们开始实现对象存储系统之前,必须了解它们的确切存储格式。一个对象以一个头部开始,头部指定其类型:blobcommittagtree(稍后会详细介绍)。这个头部后面跟着一个 ASCII 空格(0x20),然后是以 ASCII 数字表示的对象大小(以字节为单位),接着是一个空字节(0x00),最后是对象的内容。在 Wyag 的仓库中,一个提交对象的前 48 个字节如下所示:

txt
00000000  63 6f 6d 6d 69 74 20 31  30 38 36 00 74 72 65 65  |commit 1086.tree|
00000010  20 32 39 66 66 31 36 63  39 63 31 34 65 32 36 35  | 29ff16c9c14e265|
00000020  32 62 32 32 66 38 62 37  38 62 62 30 38 61 35 61  |2b22f8b78bb08a5a|

In the first line, we see the type header, a space (0x20), the size in ASCII (1086) and the null separator 0x00. The last four bytes on the first line are the beginning of that object’s contents, the word “tree” — we’ll discuss that further when we’ll talk about commits.

在第一行中,我们看到类型头部、一个空格(0x20)、以 ASCII 表示的大小(1086)和空分隔符 0x00。第一行的最后四个字节是该对象内容的开头,单词“tree”——当我们讨论提交时会进一步探讨这个。

The objects (headers and contents) are stored compressed with zlib.

对象(头部和内容)是使用 zlib 压缩存储的。

4.2. A generic object object | 通用对象

Objects can be of multiple types, but they all share the same storage/retrieval mechanism and the same general header format. Before we dive into the details of various types of objects, we need to abstract over these common features. The easiest way is to create a generic GitObject with two unimplemented methods: serialize() and deserialize(), and a default init() to create a new, empty object if needed (sorry pythonistas, this isn’t very nice design but it’s probably easier to read than superconstructors). Our __init__ either loads the object from the provided data, or calls the subclass-provided init() to create a new, empty object.

对象可以有多种类型,但它们都共享相同的存储/检索机制和相同的通用头格式。在深入各种对象类型的细节之前,我们需要抽象出这些共同特征。最简单的方法是创建一个通用的 GitObject,并实现两个未完成的方法:serialize()deserialize(),以及一个默认的 init(),用于在需要时创建一个新的空对象(抱歉,Python 爱好者,这样的设计不太优雅,但可能比超级构造函数更容易阅读)。我们的 __init__ 要么从提供的数据加载对象,要么调用子类提供的 init() 来创建一个新的空对象。

Later, we’ll subclass this generic class, actually implementing these functions for each object format.

稍后,我们将对这个通用类进行子类化,为每种对象格式实际实现这些函数。

python
class GitObject (object):

    def __init__(self, data=None):
        if data != None:
            self.deserialize(data)
        else:
            self.init()

    def serialize(self, repo):
        """This function MUST be implemented by subclasses.

It must read the object's contents from self.data, a byte string, and do
whatever it takes to convert it into a meaningful representation.  What exactly that means depend on each subclass."""
        raise Exception("Unimplemented!")

    def deserialize(self, data):
        raise Exception("Unimplemented!")

    def init(self):
        pass # Just do nothing. This is a reasonable default!
python
class GitObject (object):

    def __init__(self, data=None):
        if data is not None:
            self.deserialize(data)
        else:
            self.init()

    def serialize(self, repo):
        """这个函数必须由子类实现。

它必须从 self.data 读取对象的内容,这是一个字节字符串,并进行必要的转换以生成有意义的表示。具体意味着什么取决于每个子类。"""
        raise Exception("未实现!")

    def deserialize(self, data):
        raise Exception("未实现!")

    def init(self):
        pass  # 什么也不做。这是一个合理的默认值!

4.3. Reading objects | 读取对象

To read an object, we need to know its SHA-1 hash. We then compute its path from this hash (with the formula explained above: first two characters, then a directory delimiter /, then the remaining part) and look it up inside of the “objects” directory in the gitdir. That is, the path to e673d1b7eaa0aa01b5bc2442d570a765bdaae751 is .git/objects/e6/73d1b7eaa0aa01b5bc2442d570a765bdaae751.

要读取一个对象,我们需要知道它的 SHA-1 哈希值。然后,我们根据这个哈希计算它的路径(使用上面解释的公式:前两个字符,然后是目录分隔符 /,然后是剩余部分),并在 gitdir 的“objects”目录中查找它。也就是说,e673d1b7eaa0aa01b5bc2442d570a765bdaae751 的路径是 .git/objects/e6/73d1b7eaa0aa01b5bc2442d570a765bdaae751

We then read that file as a binary file, and decompress it using zlib.

接下来,我们将该文件作为二进制文件读取,并使用 zlib 进行解压缩。

From the decompressed data, we extract the two header components: the object type and its size. From the type, we determine the actual class to use. We convert the size to a Python integer, and check if it matches.

从解压缩的数据中,我们提取两个头部组件:对象类型和大小。根据类型,我们确定实际使用的类。我们将大小转换为 Python 整数,并检查其是否匹配。

When all is done, we just call the correct constructor for that object’s format.

完成所有操作后,我们只需调用该对象格式的正确构造函数。

python
def object_read(repo, sha):
    """Read object sha from Git repository repo.  Return a
    GitObject whose exact type depends on the object."""

    path = repo_file(repo, "objects", sha[0:2], sha[2:])

    if not os.path.isfile(path):
        return None

    with open (path, "rb") as f:
        raw = zlib.decompress(f.read())

        # Read object type
        x = raw.find(b' ')
        fmt = raw[0:x]

        # Read and validate object size
        y = raw.find(b'\x00', x)
        size = int(raw[x:y].decode("ascii"))
        if size != len(raw)-y-1:
            raise Exception("Malformed object {0}: bad length".format(sha))

        # Pick constructor
        match fmt:
            case b'commit' : c=GitCommit
            case b'tree'   : c=GitTree
            case b'tag'    : c=GitTag
            case b'blob'   : c=GitBlob
            case _:
                raise Exception("Unknown type {0} for object {1}".format(fmt.decode("ascii"), sha))

        # Call constructor and return object
        return c(raw[y+1:])
python
def object_read(repo, sha):
    """从 Git 仓库 repo 读取对象 sha。返回一个
    GitObject,其确切类型取决于对象。"""

    path = repo_file(repo, "objects", sha[0:2], sha[2:])

    if not os.path.isfile(path):
        return None

    with open(path, "rb") as f:
        raw = zlib.decompress(f.read())

        # 读取对象类型
        x = raw.find(b' ')
        fmt = raw[0:x]

        # 读取并验证对象大小
        y = raw.find(b'\x00', x)
        size = int(raw[x:y].decode("ascii"))
        if size != len(raw) - y - 1:
            raise Exception("格式错误的对象 {0}: 长度错误".format(sha))

        # 选择构造函数
        match fmt:
            case b'commit': c = GitCommit
            case b'tree': c = GitTree
            case b'tag': c = GitTag
            case b'blob': c = GitBlob
            case _:
                raise Exception("对象 {1} 的未知类型 {0}".format(fmt.decode("ascii"), sha))

        # 调用构造函数并返回对象
        return c(raw[y + 1:])

4.4. Writing objects 写入对象

Writing an object is reading it in reverse: we compute the hash, insert the header, zlib-compress everything and write the result in the correct location. This really shouldn’t require much explanation, just notice that the hash is computed after the header is added (so it’s the hash of the object itself, uncompressed, not just its contents)

写入对象实际上是读取它的反向过程:我们计算哈希值,插入头部,使用 zlib 进行压缩,然后将结果写入正确的位置。这实际上不需要太多解释,只需注意哈希是在添加头部之后计算的(因此它是对象本身的哈希值,而不是仅仅是其内容)。

python
def object_write(obj, repo=None):
    # Serialize object data
    data = obj.serialize()
    # Add header
    result = obj.fmt + b' ' + str(len(data)).encode() + b'\x00' + data
    # Compute hash
    sha = hashlib.sha1(result).hexdigest()

    if repo:
        # Compute path
        path=repo_file(repo, "objects", sha[0:2], sha[2:], mkdir=True)

        if not os.path.exists(path):
            with open(path, 'wb') as f:
                # Compress and write
                f.write(zlib.compress(result))
    return sha
python
def object_write(obj, repo=None):
    # 序列化对象数据
    data = obj.serialize()
    # 添加头部
    result = obj.fmt + b' ' + str(len(data)).encode() + b'\x00' + data
    # 计算哈希
    sha = hashlib.sha1(result).hexdigest()

    if repo:
        # 计算路径
        path = repo_file(repo, "objects", sha[0:2], sha[2:], mkdir=True)

        if not os.path.exists(path):
            with open(path, 'wb') as f:
                # 压缩并写入
                f.write(zlib.compress(result))
    return sha

4.5. Working with blobs 处理 Blob

We said earlier that the type header could be one of four: blob, commit, tag and tree — so git has four object types.

我们之前提到过,类型头可以是四种之一:blobcommittagtree——因此 Git 有四种对象类型。

Blobs are the simplest of those four types, because they have no actual format. Blobs are user data: the content of every file you put in git (main.c, logo.png, README.md) is stored as a blob. That makes them easy to manipulate, because they have no actual syntax or constraints beyond the basic object storage mechanism: they’re just unspecified data. Creating a GitBlob class is thus trivial, the serialize and deserialize functions just have to store and return their input unmodified.

Blob 是这四种类型中最简单的一种,因为它们没有实际的格式。Blob 是用户数据:您放入 Git 中的每个文件的内容(如 main.clogo.pngREADME.md)都作为 Blob 存储。这使得它们易于操作,因为它们除了基本的对象存储机制外没有实际的语法或约束:它们只是未指定的数据。因此,创建一个 GitBlob 类是微不足道的,serializedeserialize 函数只需存储和返回未修改的输入即可。

python
class GitBlob(GitObject):
    fmt = b'blob'

    def serialize(self):
        return self.blobdata

    def deserialize(self, data):
        self.blobdata = data

4.6. The cat-file command | cat-file 命令

We can now create wyag cat-file. git cat-file simply prints the raw contents of an object to stdout, uncompressed and without the git header. In a clone of wyag’s source repository, git cat-file blob e0695f14a412c29e252c998c81de1dde59658e4a will show a version of the README.

现在我们可以创建 wyag cat-file 了。git cat-file 只是将对象的原始内容打印到标准输出,不进行压缩并去掉 Git 头部。在 wyag 的源代码库 的克隆中,执行 git cat-file blob e0695f14a412c29e252c998c81de1dde59658e4a 将显示 README 的版本。

Our simplified version will just take those two positional arguments: a type and an object identifier:

我们的简化版本只需接受两个位置参数:类型和对象标识符:

txt
wyag cat-file TYPE OBJECT

The subparser is very simple:

子解析器非常简单:

python
argsp = argsubparsers.add_parser("cat-file",
                                 help="Provide content of repository objects")

argsp.add_argument("type",
                   metavar="type",
                   choices=["blob", "commit", "tag", "tree"],
                   help="Specify the type")

argsp.add_argument("object",
                   metavar="object",
                   help="The object to display")
python
argsp = argsubparsers.add_parser("cat-file",
                                 help="提供库对象的内容")

argsp.add_argument("type",
                   metavar="type",
                   choices=["blob", "commit", "tag", "tree"],
                   help="指定类型")

argsp.add_argument("object",
                   metavar="object",
                   help="要显示的对象")

And we can implement the functions, which just call into existing code we wrote earlier:

我们可以实现函数,调用之前编写的现有代码:

python
def cmd_cat_file(args):
    repo = repo_find()
    cat_file(repo, args.object, fmt=args.type.encode())

def cat_file(repo, obj, fmt=None):
    obj = object_read(repo, object_find(repo, obj, fmt=fmt))
    sys.stdout.buffer.write(obj.serialize())

This function calls an object_find function we haven’t introduced yet. For now, it’s just going to return one of its arguments unmodified, like this:

这个函数调用了一个我们尚未介绍的 object_find 函数。现在,它只是返回其参数中的一个未修改的值,如下所示:

python
def object_find(repo, name, fmt=None, follow=True):
    return name

The reason for this strange small function is that Git has a lot of ways to refer to objects: full hash, short hash, tags… object_find() will be our name resolution function. We’ll only implement it later, so this is just a temporary placeholder. This means that until we implement the real thing, the only way we can refer to an object will be by its full hash.

这个奇怪的小函数的原因在于 Git 有很多方式来引用对象:完整哈希、短哈希、标签……object_find() 将是我们的名称解析函数。我们只会在 稍后 实现它,所以这只是一个临时占位符。这意味着在我们实现真实功能之前,我们引用对象的唯一方式将是通过它的完整哈希。

4.7. The hash-object command | hash-object 命令

We will want to put our own data in our repositories, though. hash-object is basically the opposite of cat-file: it reads a file, computes its hash as an object, either storing it in the repository (if the -w flag is passed) or just printing its hash.

不过,我们确实想在我们的仓库中放入 自己的 数据。hash-object 基本上是 cat-file 的反向操作:它读取一个文件,计算其哈希作为一个对象,若传递了 -w 标志,则将其存储在仓库中,否则仅打印其哈希。

The syntax of wyag hash-object is a simplification of git hash-object:

wyag hash-object 的语法是 git hash-object 的简化版本:

txt
wyag hash-object [-w] [-t TYPE] FILE

Which converts to:

对应的解析如下:

python
argsp = argsubparsers.add_parser(
    "hash-object",
    help="Compute object ID and optionally creates a blob from a file")

argsp.add_argument("-t",
                   metavar="type",
                   dest="type",
                   choices=["blob", "commit", "tag", "tree"],
                   default="blob",
                   help="Specify the type")

argsp.add_argument("-w",
                   dest="write",
                   action="store_true",
                   help="Actually write the object into the database")

argsp.add_argument("path",
                   help="Read object from <file>")
python
argsp = argsubparsers.add_parser(
    "hash-object",
    help="计算对象 ID,并可选择从文件创建一个 blob")

argsp.add_argument("-t",
                   metavar="type",
                   dest="type",
                   choices=["blob", "commit", "tag", "tree"],
                   default="blob",
                   help="指定类型")

argsp.add_argument("-w",
                   dest="write",
                   action="store_true",
                   help="实际将对象写入数据库")

argsp.add_argument("path",
                   help="从 <file> 读取对象")

The actual implementation is very simple. As usual, we create a small bridge function:

实际的实现非常简单。和往常一样,我们创建一个小的桥接函数:

python
def cmd_hash_object(args):
    if args.write:
        repo = repo_find()
    else:
        repo = None

    with open(args.path, "rb") as fd:
        sha = object_hash(fd, args.type.encode(), repo)
        print(sha)

The actual implementation is also trivial. The repo argument is optional, and the object isn’t written if it is None (this is handled in object_write(), above):

实际的实现也很简单。repo 参数是可选的,如果为 None,对象将不会被写入(这在上面的 object_write() 中处理):

python
def object_hash(fd, fmt, repo=None):
    """ Hash object, writing it to repo if provided."""
    data = fd.read()

    # Choose constructor according to fmt argument
    match fmt:
        case b'commit' : obj=GitCommit(data)
        case b'tree'   : obj=GitTree(data)
        case b'tag'    : obj=GitTag(data)
        case b'blob'   : obj=GitBlob(data)
        case _: raise Exception("Unknown type %s!" % fmt)

    return object_write(obj, repo)
python
def object_hash(fd, fmt, repo=None):
    """ 哈希对象,如果提供了 repo,则将其写入。"""
    data = fd.read()

    # 根据 fmt 参数选择构造函数
    match fmt:
        case b'commit' : obj=GitCommit(data)
        case b'tree'   : obj=GitTree(data)
        case b'tag'    : obj=GitTag(data)
        case b'blob'   : obj=GitBlob(data)
        case _: raise Exception("未知类型 %s!" % fmt)

    return object_write(obj, repo)

4.8. Aside: what about packfiles? 旁白:那么,包文件呢?

What we’ve just implemented is called “loose objects”. Git has a second object storage mechanism called packfiles. Packfiles are much more efficient, but also much more complex, than loose objects. Simply put, a packfile is a compilation of loose objects (like a tar) but some are stored as deltas (as a transformation of another object). Packfiles are way too complex to be supported by wyag.

我们刚刚实现的被称为“松散对象”。Git 还有一种第二种对象存储机制,叫做包文件(packfiles)。包文件比松散对象更高效,但也复杂得多。简单来说,包文件是松散对象的编译(就像 tar),但其中一些以增量的形式存储(作为另一个对象的变换)。包文件复杂得多,无法被 wyag 支持。

The packfile is stored in .git/objects/pack/. It has a .pack extension, and is accompanied by an index file of the same name with the .idx extension. Should you want to convert a packfile to loose objects format (to play with wyag on an existing repo, for example), here’s the solution.

包文件存储在 .git/objects/pack/ 中,扩展名为 .pack,并伴随一个同名的索引文件,扩展名为 .idx。如果您想将包文件转换为松散对象格式(例如,在现有仓库上使用 wyag),以下是解决方案。

First, move the packfile outside the gitdir (just copying it won’t work).

首先,将包文件 移动 到 gitdir 之外(仅复制是无效的)。

shell
mv .git/objects/pack/pack-d9ef004d4ca729287f12aaaacf36fee39baa7c9d.pack .

You can ignore the .idx. Then, from the worktree, just cat it and pipe the result to git unpack-objects:

您可以忽略 .idx 文件。然后,从工作树中,只需 cat 它并将结果管道传递给 git unpack-objects

shell
cat pack-d9ef004d4ca729287f12aaaacf36fee39baa7c9d.pack | git unpack-objects

  1. 你可能知道 SHA-1 中已发现碰撞。实际上,Git 现在不再使用 SHA-1:它使用一种 加强版,该版本不是 SHA,但对每个已知输入应用相同的哈希,除了已知存在碰撞的两个 PDF 文件。 ↩︎ ↩︎