Project author: techgaun

Project description: An overview of git internals
Repository: git://github.com/techgaun/git-internals.git
Created: 2019-06-21T22:14:36Z
Project page: https://github.com/techgaun/git-internals

License: Apache License 2.0


git-internals

An overview of git internals

This repo contains the material for the talk given at PayLease’s Show and Tell on 06/21/2019.

At its core, Git is a content-addressable filesystem that behaves much like a key-value store.
You give some content to git and git gives you back a 40-character sha1 hash. You can then
use that sha1 hash in the future to talk with git about that content.
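
The key-value idea above can be sketched in a few lines of Python. This is a toy, not Git's storage code: the class name `ContentStore` is made up for illustration, and it hashes the raw bytes directly, whereas Git prepends a small header before hashing.

```python
import hashlib


class ContentStore:
    """Toy content-addressable store: the key is derived from the value."""

    def __init__(self):
        self._objects = {}

    def put(self, content: bytes) -> str:
        # Assumption for illustration: hash the raw bytes as-is.
        # Real Git hashes a header plus the content.
        key = hashlib.sha1(content).hexdigest()
        self._objects[key] = content
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]


store = ContentStore()
key = store.put(b"Hello World\n")
print(len(key))        # 40 hex characters
print(store.get(key))  # b'Hello World\n'
```

Storing the same content twice yields the same key, which is why identical files cost nothing extra in such a store.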

Slides - Use with mdp slides.md

Walkthrough

Git aliases/Shell aliases/Global gitignore

.git directory

We first create a git repository and look at the initial tree structure of .git.
A git repo is a directory with a .git sub-directory that holds the relevant metadata.

    $ git init git_demo
    Initialized empty Git repository in /tmp/git_demo/.git/
    $ cd git_demo
    $ tree .git
    .git
    ├── branches
    ├── config
    ├── description
    ├── HEAD
    ├── hooks
    │   ├── applypatch-msg.sample
    │   ├── commit-msg.sample
    │   ├── fsmonitor-watchman.sample
    │   ├── post-update.sample
    │   ├── pre-applypatch.sample
    │   ├── pre-commit.sample
    │   ├── prepare-commit-msg.sample
    │   ├── pre-push.sample
    │   ├── pre-rebase.sample
    │   ├── pre-receive.sample
    │   └── update.sample
    ├── info
    │   └── exclude
    ├── objects
    │   ├── info
    │   └── pack
    └── refs
        ├── heads
        └── tags

    9 directories, 15 files

  • .git/config holds the local git configuration that applies only to the repo we are in.
  • .git/description holds the description that is shown by gitweb.
  • .git/HEAD holds a pointer/reference to the branch/tag/commit id we are currently at.
  • .git/hooks holds sample hooks initially; you can create your own.
  • .git/info/exclude holds repo-level ignore patterns that are not committed, unlike the repo’s .gitignore.
  • .git/objects holds all kinds of objects git stores.
  • .git/refs holds all kinds of references git makes use of (branch/tag/stash, etc.).
  • .git/logs doesn’t exist initially but gets created as you move around in your git repo. It holds all the logs that
    show up in the git reflog subcommand.
  • .git/index doesn’t exist initially; once created, it holds information about the staging area.
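
As a small illustration of the HEAD entry above, the sketch below classifies the text found in .git/HEAD. The function name `parse_head` is hypothetical; the two input strings follow the shapes seen in this walkthrough (a `ref:` line for a branch, a bare hash otherwise).

```python
def parse_head(text: str):
    """Classify the contents of a .git/HEAD file."""
    value = text.strip()
    if value.startswith("ref: "):
        # HEAD names another reference, typically a branch
        return ("symbolic", value[len("ref: "):])
    # otherwise HEAD holds a commit id directly (detached HEAD)
    return ("detached", value)


print(parse_head("ref: refs/heads/master\n"))
print(parse_head("86aa1cb0eec333b600d5b8c23c9c95d4983d5e6d\n"))
```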

Git hooks

  • Scripts that execute before or after certain events such as commit, push, receive, etc.
  • pre-commit - could be used for something like running lint or unit tests on the changed files. Git exits without
    making the commit if the pre-commit hook returns a non-zero exit code.
  • post-receive - could be used for pushing code to production.
  • To enable a hook, overwrite or create one of the scripts in .git/hooks (named without the .sample suffix) and make
    it executable.
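
Hooks can be any executable, so the check itself could be written in Python. The sketch below shows only the decision logic of a hypothetical pre-commit check that rejects leftover merge-conflict markers; the function names and the `files` mapping (file name to text) are assumptions for illustration. A real hook would first collect the staged files and then pass the result to `sys.exit`.

```python
import sys

# Lines starting with these strings are treated as leftover conflict markers.
CONFLICT_MARKERS = ("<<<<<<< ", ">>>>>>> ")


def find_conflict_markers(files: dict) -> list:
    """Return the names of files with a line that starts with a conflict marker."""
    bad = []
    for name, text in files.items():
        if any(line.startswith(CONFLICT_MARKERS) for line in text.splitlines()):
            bad.append(name)
    return bad


def hook_exit_code(files: dict) -> int:
    """Non-zero means 'abort the commit', mirroring the pre-commit rule above."""
    bad = find_conflict_markers(files)
    for name in bad:
        print(f"conflict marker left in {name}", file=sys.stderr)
    return 1 if bad else 0


print(hook_exit_code({"readme.md": "Hello World\n"}))           # 0
print(hook_exit_code({"new.md": "<<<<<<< HEAD\nHello\n"}))      # 1
```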

Git plumbing vs porcelain commands

  • Most of the commands we use in our day-to-day interaction with git are porcelain commands, which are much simpler
    to use. Think of them as the frontend for git with a simplified interface.
  • There is another set of commands that are lower level and can be composed together to form the porcelain commands.
    These commands are called plumbing commands.
  • The working example further down uses several of the plumbing commands.
  • The plumbing commands we will look at are hash-object, update-index, write-tree, commit-tree, update-ref and
    cat-file.

Git objects

4 types of Objects:

  • blob (binary large object) - the data we want git to store and version.
  • tree - maps file names to blobs & to other trees. A git tree object creates the hierarchy between files and
    directories in a git repository.
  • commit - a reference to a tree together with some additional metadata (like author, committer, commit message,
    parent commits, etc.). It represents a snapshot of the state of the repository.
  • tag - used for annotated tags; it contains the hash of the tagged object (usually commits are tagged).
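
Every object type is hashed the same way. The working example later in this README gives the format as `<type_of_object> <size_of_object><nullbyte><content_of_object>`, and the sketch below applies it with Python's hashlib. The helper names `object_id` and `loose_object_path` are made up for illustration.

```python
import hashlib


def object_id(obj_type: str, payload: bytes) -> str:
    """sha1 over '<type> <size in bytes>\\0' followed by the payload."""
    header = f"{obj_type} {len(payload)}".encode() + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()


def loose_object_path(oid: str) -> str:
    """First two hex characters form the directory, the rest the file name."""
    return f".git/objects/{oid[:2]}/{oid[2:]}"


blob_id = object_id("blob", b"Hello World\n")
print(blob_id)                     # 557db03de997c86a4a028e1ebd3a1ceb225be238
print(loose_object_path(blob_id))  # .git/objects/55/7db03de997c86a4a028e1ebd3a1ceb225be238
```

The printed hash is the same one `git hash-object readme.md` reports in the walkthrough, because the file contains exactly `Hello World` plus a newline.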

Git references

  • Names that point to sha1 hashes.
  • Stored in directories inside .git/refs.
  • heads contains branch references.
  • tags contains tag references.
  • remotes contains remote-tracking references for the remotes that have been added.
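
A reference is ultimately just a small file containing a hash, or a `ref:` line naming another reference. The sketch below follows that chain inside a throwaway directory that mimics the relevant files. It is a simplification: `resolve_ref` is a hypothetical helper and it only reads loose ref files, so it would miss refs that have been moved into packed-refs.

```python
import os
import tempfile


def resolve_ref(git_dir: str, name: str = "HEAD", max_depth: int = 5) -> str:
    """Follow 'ref: ...' indirections until a hash is found (loose refs only)."""
    for _ in range(max_depth):
        with open(os.path.join(git_dir, name)) as fh:
            value = fh.read().strip()
        if not value.startswith("ref: "):
            return value
        name = value[len("ref: "):]
    raise ValueError("too many levels of symbolic refs")


# Throwaway directory standing in for .git
git_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(git_dir, "refs", "heads"))
with open(os.path.join(git_dir, "HEAD"), "w") as fh:
    fh.write("ref: refs/heads/master\n")
with open(os.path.join(git_dir, "refs", "heads", "master"), "w") as fh:
    fh.write("fed6ba87e445db5175c628cfecbbd0b83526a54a\n")

print(resolve_ref(git_dir))  # fed6ba87e445db5175c628cfecbbd0b83526a54a
```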

Git packfiles

  • Git initially stores objects on disk in the so-called loose object format.
  • It would be inefficient if git kept storing the entire content every time we make a change to a file.
  • Git occasionally packs several of these loose objects into a single binary file called a packfile to save space and
    be more efficient. This allows versions of objects to be stored in the form of deltas.
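
To get a feel for why keeping similar versions together saves space, the sketch below compresses two nearly identical contents separately and then as one stream. This is only an analogy using zlib: real packfiles use their own delta encoding, and the generated sample content is invented for the demonstration.

```python
import hashlib
import zlib

# Two versions of a file that differ by a single appended line.
lines = [f"line {i}: {hashlib.sha1(str(i).encode()).hexdigest()}\n" for i in range(200)]
v1 = "".join(lines).encode()
v2 = v1 + b"one more line\n"

# Stored independently, each version pays the full cost.
separate = len(zlib.compress(v1)) + len(zlib.compress(v2))
# Stored together, the second version is mostly back-references to the first.
together = len(zlib.compress(v1 + v2))

print(separate, together)
```

The second number comes out well below the first, since the repeated content only has to be encoded once.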

Git gc/reflog/fsck

gc

  • Performs cleanup and optimizes the repository.
  • Does several housekeeping tasks such as compressing file revisions, removing unreachable objects, packing refs and
    pruning reflogs & stale working trees.
  • As it relates to packfiles, it gathers up loose objects & places them in packfiles. It also consolidates packfiles
    into a single large packfile as necessary.
  • Auto gc runs once in a while when git deems it necessary, for example when you push to a remote.
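
One signal for "git deems it necessary" is the number of loose objects, which git compares against the gc.auto setting (6700 by default). The sketch below counts them in a throwaway directory laid out like .git/objects. `count_loose_objects` is a hypothetical helper doing a plain count, and git's own check may estimate rather than count everything.

```python
import os
import re
import tempfile

# Loose objects live in directories named after two hex characters.
FANOUT_DIR = re.compile(r"^[0-9a-f]{2}$")


def count_loose_objects(objects_dir: str) -> int:
    """Count files stored under the two-character fan-out directories."""
    total = 0
    for entry in os.listdir(objects_dir):
        path = os.path.join(objects_dir, entry)
        if FANOUT_DIR.match(entry) and os.path.isdir(path):
            total += len(os.listdir(path))
    return total


# Throwaway directory that mimics .git/objects with one loose object.
objects_dir = tempfile.mkdtemp()
for name in ("55", "info", "pack"):
    os.makedirs(os.path.join(objects_dir, name))
open(os.path.join(objects_dir, "55", "7db03de997c86a4a028e1ebd3a1ceb225be238"), "w").close()

print(count_loose_objects(objects_dir))  # 1
```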

reflog

  • Every time the repo’s HEAD changes, git records the new value. This record is what we call the reflog.
  • Stored in the .git/logs directory.
  • Can be useful to recover accidentally deleted branches.
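
The files under .git/logs are plain text with one entry per line: old hash, new hash, who made the change, a timestamp with timezone, and an optional message after a tab. The parser below is a sketch under that assumption. `parse_reflog_line` is a made-up name, and the sample line is hypothetical (its timestamp is invented), although the hashes are the ones from this walkthrough.

```python
import re

REFLOG_LINE = re.compile(
    r"^(?P<old>[0-9a-f]{40}) (?P<new>[0-9a-f]{40}) "
    r"(?P<who>[^\t]*) (?P<time>\d+) (?P<tz>[+-]\d{4})"
    r"(?:\t(?P<message>.*))?$"
)


def parse_reflog_line(line: str) -> dict:
    """Split one reflog entry into its fields."""
    match = REFLOG_LINE.match(line.rstrip("\n"))
    if match is None:
        raise ValueError(f"not a reflog line: {line!r}")
    return match.groupdict()


# Hypothetical entry: the timestamp is invented for the example.
sample = (
    "fed6ba87e445db5175c628cfecbbd0b83526a54a "
    "86aa1cb0eec333b600d5b8c23c9c95d4983d5e6d "
    "techgaun <coolsamar207@gmail.com> 1561258300 -0500"
    "\treset: moving to 86aa1cb\n"
)
entry = parse_reflog_line(sample)
print(entry["old"][:7], "->", entry["new"][:7], entry["message"])
```

The `old` field is what makes recovery possible: it still names the commit the branch pointed at before the change.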

fsck

  • Checks the integrity of the objects in the database.
  • Often tells us about dangling objects, i.e. objects that can no longer be reached.
  • Can be useful in cases when we don’t have reflogs.
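
The notion of a dangling object can be sketched as a graph walk: start from the tips that refs point at, mark everything reachable, and report unreachable objects that nothing else points to. This is a simplified model and not fsck's implementation. `find_dangling` is a hypothetical helper, and the graph uses the abbreviated hashes from this walkthrough.

```python
def find_dangling(objects: dict, ref_tips: list) -> list:
    """objects maps each object id to the ids it points at."""
    reachable = set()
    stack = list(ref_tips)
    while stack:
        oid = stack.pop()
        if oid in reachable:
            continue
        reachable.add(oid)
        stack.extend(objects.get(oid, []))

    unreachable = set(objects) - reachable
    # An unreachable object that another unreachable object points at
    # is not dangling; only the top of each unreachable chain is.
    referenced = {child for oid in unreachable for child in objects[oid]}
    return sorted(unreachable - referenced)


objects = {
    "fed6ba8": ["c4996cf", "86aa1cb"],  # second commit: tree + parent
    "86aa1cb": ["3a3aff7"],             # first commit: tree
    "c4996cf": ["c7fc1d8", "557db03"],  # second tree: two blobs
    "3a3aff7": ["557db03"],             # first tree: one blob
    "c7fc1d8": [],
    "557db03": [],
}

# master was reset to the first commit, so it is the only tip.
print(find_dangling(objects, ["86aa1cb"]))  # ['fed6ba8']
```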

Working Example

We will run a series of commands and see how things work under the hood, building on the information above.
We already created the git_demo repository earlier while looking at the tree structure of the .git directory.

    # let's create a simple text file
    $ echo "Hello World" > readme.md
    # now we can see what the sha1 hash of readme.md would be according to git
    # the way it works is, data in the format below is assembled and then its sha1 hash is computed
    # <type_of_object> <size_of_object><nullbyte><content_of_object> | sha1_hash_function
    $ git hash-object readme.md
    557db03de997c86a4a028e1ebd3a1ceb225be238
    # we can replicate what git did by doing something like below:
    # blob is the type of object in this case and the size is counted in bytes (wc -c)
    # as you can see below, the hash matches, simple ;)
    $ printf 'blob %d\0' "$(wc -c < readme.md)" | cat - readme.md | sha1sum
    557db03de997c86a4a028e1ebd3a1ceb225be238  -
    # we ran hash-object which is a git plumbing command
    # however that doesn't add our content to the object database until we instruct git to do so
    # next we will do that
    # open a new terminal (or tmux split) with the following command
    $ watch -n 0.5 tree .git
    # now we will hash the object and ask git to store it in the git object database as well
    # when we do so, git will create a new directory .git/objects/55
    # and create a file with the name `7db03de997c86a4a028e1ebd3a1ceb225be238`
    # the first two characters of the sha1 hash form the directory and the rest the filename
    # that's how git organizes objects in its object database.
    $ git hash-object -w readme.md
    557db03de997c86a4a028e1ebd3a1ceb225be238
    # next we will cat the content of the file
    $ cat .git/objects/55/7db03de997c86a4a028e1ebd3a1ceb225be238
    xK��OR04bH����/�IAI
    # above we see some unrecognizable text and that's not what we saved though
    # git saves our content with header + nullbyte + content as we saw earlier
    # we can use the git cat-file plumbing command to look at the object we just created.
    # -p means pretty print the content of that object
    $ git cat-file -p 557db03de997c86a4a028e1ebd3a1ceb225be238
    Hello World
    # and now let's look at the type of the object
    # -t means print the type of that object
    $ git cat-file -t 557db03de997c86a4a028e1ebd3a1ceb225be238
    blob
    # now the reason the raw content of the object is some sort of gibberish
    # is because it's stored after running zlib compression
    # let's see what it has in there (zlib-flate ships with the qpdf package)
    # as we see next, it stores the type of object (blob) and content length (12)
    # and the actual content, separated by a nullbyte character (invisible in the terminal).
    $ cat .git/objects/55/7db03de997c86a4a028e1ebd3a1ceb225be238 | zlib-flate -uncompress
    blob 12Hello World
    # now let's see an example of creating a tree out of the object we added
    # we have added the object to the git object database but git has no idea about
    # where and how it should exist in our repository
    # before we do that, let's look at our git status
    $ git status -s
    ?? readme.md
    # so there's an untracked file which we will add to git's staging area
    # we normally do that via git add readme.md for example
    # this time, we will use the git update-index plumbing command
    # which updates the .git/index file (the file that holds staging info)
    $ git update-index --add readme.md
    # if we run git status, it confirms that readme.md is now in the staging area
    # the ?? from the previous status has changed to A now :)
    $ git status -s
    A readme.md
    # now we can take a deeper look at our staging area
    # 100644 - 100 means regular file, i.e. a blob (trees use 040000) and 644 represents the permission
    # permissions in git look like unix permissions except they are much more limited
    $ git ls-files --stage
    100644 557db03de997c86a4a028e1ebd3a1ceb225be238 0 readme.md
    # now we can create a new tree with the current state of the index
    # note that the current state of the index has the readme.md file in the staging area
    # for this, we use the git write-tree plumbing command
    # and we get a hash back
    $ git write-tree
    3a3aff7fa9639da674465c43fac565c1291f952b
    # we can use cat-file to look into the content & type of the object that hash represents
    $ git cat-file -p 3a3aff7fa9639da674465c43fac565c1291f952b
    100644 blob 557db03de997c86a4a028e1ebd3a1ceb225be238 readme.md
    $ git cat-file -t 3a3aff7fa9639da674465c43fac565c1291f952b
    tree
    # now that we have a tree object that holds the blob object we want to commit
    # we can use the git commit-tree plumbing command to create a new commit object
    # with the tree object we just created
    $ echo "initial commit" | git commit-tree 3a3aff7fa9639da674465c43fac565c1291f952b
    86aa1cb0eec333b600d5b8c23c9c95d4983d5e6d
    # now let's look at the log of the current branch
    # we should see the new commit we just made
    # but for some reason, we don't :(
    $ git log --oneline
    fatal: your current branch 'master' does not have any commits yet
    # so where did that commit go then
    # if we search for that object in .git/objects, we do see .git/objects/86/aa1cb0eec333b600d5b8c23c9c95d4983d5e6d
    # then why didn't it show up in the git log?
    # let's see what data we have in that object
    $ cat .git/objects/86/aa1cb0eec333b600d5b8c23c9c95d4983d5e6d | zlib-flate -uncompress
    commit 181tree 3a3aff7fa9639da674465c43fac565c1291f952b
    author techgaun <coolsamar207@gmail.com> 1561256669 -0500
    committer techgaun <coolsamar207@gmail.com> 1561256669 -0500

    initial commit
    # so we have data such as the tree the commit object was created from,
    # the author, the committer and finally the commit message
    # now we come back to the same question we had
    # why did the git log not show that commit?
    # the reason is that this commit is not associated with the current branch
    # we only created the commit object so far
    # we can make that association using the git update-ref plumbing command
    # which updates the .git/refs/heads/master file among other things
    # we could have done: echo 86aa1cb0eec333b600d5b8c23c9c95d4983d5e6d > .git/refs/heads/master
    # but git does it in a safer way while handling other side effects as necessary
    $ git update-ref refs/heads/master 86aa1cb0eec333b600d5b8c23c9c95d4983d5e6d
    # now let's see what happens with git log
    # as you will see next, our commit is now part of the master branch. Voila!
    # we just made a commit to git without using the normal commands we are used to
    $ git log --oneline
    86aa1cb (HEAD -> master) initial commit
    # now let's create another commit with the earlier commit id as the parent
    # we repeat the same steps again, this time we create a file with much larger content
    $ printf 'Hello World.%.0s' {1..1000} > new.md
    # let's check the status real quick
    $ git status -s
    ?? new.md
    # now let's add that file to the staging area
    # note that we will skip hash-object this time
    # and the reason why it still works is because
    # update-index goes through the process of hashing all the objects
    # while adding them to the staging area
    $ git update-index --add new.md
    # and if we check the status, we see it's added to the staging area
    $ git status -s
    A new.md
    $ git write-tree
    c4996cfea245445e4bdb0561bf18e29436568e58
    # now let's inspect that tree
    # we see that this tree contains a complete snapshot of what we have in the git repo
    $ git cat-file -p c4996cfea245445e4bdb0561bf18e29436568e58
    100644 blob c7fc1d8f722cc984f6c90f4151de8b250eeb6343 new.md
    100644 blob 557db03de997c86a4a028e1ebd3a1ceb225be238 readme.md
    # and now let's make a commit object with our newly created tree
    # we pass the hash of the first commit we made as the parent (-p)
    # as you can see, we only passed the first 7 characters of the hash
    # as long as git can resolve the partial hash into a single object,
    # we can use such short prefixes of the sha1 hash
    $ echo "Added new file" | git commit-tree c4996cfea245445e4bdb0561bf18e29436568e58 -p 86aa1cb
    fed6ba87e445db5175c628cfecbbd0b83526a54a
    # we can do cat-file on the commit object as well
    # note the parent line in this case
    $ git cat-file -p fed6ba87e445db5175c628cfecbbd0b83526a54a
    tree c4996cfea245445e4bdb0561bf18e29436568e58
    parent 86aa1cb0eec333b600d5b8c23c9c95d4983d5e6d
    author techgaun <coolsamar207@gmail.com> 1561258195 -0500
    committer techgaun <coolsamar207@gmail.com> 1561258195 -0500

    Added new file
    # also, let's look at the type of the commit object with cat-file
    $ git cat-file -t fed6ba87e445db5175c628cfecbbd0b83526a54a
    commit
    # finally, let's update the master ref to this commit
    $ git update-ref refs/heads/master fed6ba87e445db5175c628cfecbbd0b83526a54a
    # and let's check the git log one more time
    # and we see things as expected
    $ git log --oneline
    fed6ba8 (HEAD -> master) Added new file
    86aa1cb initial commit
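
The object contents printed during the walkthrough can be rebuilt without git. The sketch below re-creates the blob, the first tree and the payload of the first commit with hashlib. It assumes the tree entry layout `<mode> <name>`, a null byte, then the 20 raw bytes of the child's hash, and it assumes the commit text is byte-for-byte what cat-file displayed. The helper names are made up for illustration.

```python
import hashlib
import zlib


def encode_object(obj_type: str, payload: bytes) -> bytes:
    """'<type> <size>\\0' header followed by the payload."""
    return f"{obj_type} {len(payload)}".encode() + b"\x00" + payload


def object_id(obj_type: str, payload: bytes) -> str:
    return hashlib.sha1(encode_object(obj_type, payload)).hexdigest()


blob_id = object_id("blob", b"Hello World\n")

# One tree entry: "<mode> <name>\0" plus the child's hash as 20 raw bytes.
tree_payload = b"100644 readme.md\x00" + bytes.fromhex(blob_id)
tree_id = object_id("tree", tree_payload)

# Commit text as shown by cat-file for the first commit.
commit_payload = (
    f"tree {tree_id}\n"
    "author techgaun <coolsamar207@gmail.com> 1561256669 -0500\n"
    "committer techgaun <coolsamar207@gmail.com> 1561256669 -0500\n"
    "\n"
    "initial commit\n"
).encode()

print(blob_id)
print(tree_id)
print(len(commit_payload))  # 181, the size in the "commit 181" header
print(object_id("commit", commit_payload))

# Loose objects on disk are the zlib-compressed form of the same bytes.
stored = zlib.compress(encode_object("blob", b"Hello World\n"))
print(zlib.decompress(stored))  # b'blob 12\x00Hello World\n'
```

If the commit payload matches what git wrote byte for byte, the last hash printed should be the first commit's id from the walkthrough. It depends on the exact author line and timestamp, which is why the same steps on another machine give a different commit hash.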

Other Examples

We will continue to work with the repository we created above.

gc and packfile

    # let's look at the size of .git/objects first
    # and as per the output below, we are at around 41K with our git objects
    $ du -b .git/objects/
    4096 .git/objects/pack
    4224 .git/objects/86
    4096 .git/objects/info
    4150 .git/objects/3a
    4177 .git/objects/c4
    4124 .git/objects/55
    4227 .git/objects/0a
    4255 .git/objects/fe
    4207 .git/objects/c7
    41652 .git/objects/
    # and now let's look at the tree of the .git directory after all the things we did
    # the contents of the hooks directory are not shown here to save space
    $ tree .git
    .git
    ├── branches
    ├── config
    ├── description
    ├── HEAD
    ├── hooks
    ├── index
    ├── info
    │   └── exclude
    ├── logs
    │   ├── HEAD
    │   └── refs
    │       └── heads
    │           └── master
    ├── objects
    │   ├── 0a
    │   │   └── 9c3e68d37858d478ad2692e01126e6851d1c93
    │   ├── 3a
    │   │   └── 3aff7fa9639da674465c43fac565c1291f952b
    │   ├── 55
    │   │   └── 7db03de997c86a4a028e1ebd3a1ceb225be238
    │   ├── 86
    │   │   └── aa1cb0eec333b600d5b8c23c9c95d4983d5e6d
    │   ├── c4
    │   │   └── 996cfea245445e4bdb0561bf18e29436568e58
    │   ├── c7
    │   │   └── fc1d8f722cc984f6c90f4151de8b250eeb6343
    │   ├── fe
    │   │   └── d6ba87e445db5175c628cfecbbd0b83526a54a
    │   ├── info
    │   └── pack
    └── refs
        ├── heads
        │   └── master
        └── tags

    19 directories, 26 files
    # now let's see if we can optimize our repo like git promises to by running gc
    $ git gc
    # and once again, let's see the size of .git/objects
    $ du -b .git/objects/
    5855 .git/objects/pack
    4150 .git/objects/info
    4227 .git/objects/0a
    18328 .git/objects/
    # many of the loose objects are gone as we see above
    # and our git object database is down to 18K
    # if we look at the tree of the .git directory, it will be different now
    $ tree .git
    .git
    ├── branches
    ├── config
    ├── description
    ├── HEAD
    ├── hooks
    ├── index
    ├── info
    │   ├── exclude
    │   └── refs
    ├── logs
    │   ├── HEAD
    │   └── refs
    │       └── heads
    │           └── master
    ├── objects
    │   ├── 0a
    │   │   └── 9c3e68d37858d478ad2692e01126e6851d1c93
    │   ├── info
    │   │   └── packs
    │   └── pack
    │       ├── pack-5dda0074f5c0745e99fad6c6d639ca69f009091e.idx
    │       └── pack-5dda0074f5c0745e99fad6c6d639ca69f009091e.pack
    ├── packed-refs
    └── refs
        ├── heads
        └── tags

    13 directories, 24 files
    # as we see above, we have .git/objects/pack with two files, .idx and .pack
    # git has optimized our repository and created a packfile like we said earlier
    # note that refs/heads/master is gone too: the branch now lives in packed-refs
    # there's a git show-index command to which you can pipe the .idx file
    # I leave it as homework for you to look into that and see what those files contain
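
As a starting point for that homework, a .pack file begins with a 12-byte header: the signature `PACK`, a version number and the number of objects, the last two as 4-byte big-endian integers. The sketch below decodes such a header. `parse_pack_header` is a hypothetical helper, and the sample bytes are constructed by hand instead of being read from the packfile above.

```python
import struct


def parse_pack_header(data: bytes) -> dict:
    """Decode the first 12 bytes of a packfile."""
    signature, version, count = struct.unpack(">4sII", data[:12])
    if signature != b"PACK":
        raise ValueError("not a packfile")
    return {"version": version, "objects": count}


# Hand-built header for illustration: version 2, 6 objects.
sample = b"PACK" + struct.pack(">II", 2, 6)
print(parse_pack_header(sample))  # {'version': 2, 'objects': 6}
```

To try it on a real repository, read the first 12 bytes of the .pack file in binary mode and pass them to the function.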

reflog and fsck

    # let's move the master branch back to the first commit
    $ git reset --hard 86aa1cb
    HEAD is now at 86aa1cb initial commit
    # now we don't have the top commit because none of the branches reaches that commit
    # if you come back at a later point in time, you will not remember the sha1 hash
    # which means we effectively lost that commit
    # imagine that it was not an intended action, how could we recover?
    # given that our master branch had that commit at some point in time,
    # the reflog has recorded it, meaning we can recover that commit
    $ git reflog
    86aa1cb (HEAD -> master) HEAD@{0}: reset: moving to 86aa1cb
    fed6ba8 HEAD@{1}: reset: moving to HEAD
    fed6ba8 HEAD@{2}:
    86aa1cb (HEAD -> master) HEAD@{3}:
    # here we see the sha1 hash of our commit fed6ba8
    # and now we can create a new branch from that commit, effectively saving us from disaster
    $ git branch master-recovered fed6ba8
    # let's check out the newly recovered branch
    $ git checkout master-recovered
    # now let's look at the log and we see that we have the lost commit back. Voila!
    $ git log --oneline
    fed6ba8 (HEAD -> master-recovered) Added new file
    86aa1cb (master) initial commit
    # now imagine that our reflogs were gone
    # is there a possibility of recovery in such a case?
    # maybe? fsck may help although it's not straightforward, esp. in large repos
    # to mimic the loss of the reflog, let's delete .git/logs
    $ rm -rf .git/logs
    # and if we check our reflog, it's empty
    $ git reflog
    # let's switch back to master and delete the master-recovered branch
    # (git refuses to delete the branch that is currently checked out)
    $ git checkout master
    $ git branch -D master-recovered
    Deleted branch master-recovered (was fed6ba8).
    # now let's run fsck with the --full argument
    # --full is the default in recent git versions which means you don't have to specify it anymore
    # this performs a full-fledged database verification
    $ git fsck --full
    Checking object directories: 100% (256/256), done.
    Checking objects: 100% (6/6), done.
    dangling commit fed6ba87e445db5175c628cfecbbd0b83526a54a
    # and if you look above, we see a dangling commit
    # which is the commit that got lost in oblivion
    # now that we know our dangling commit, we can recover that commit just like earlier
    $ git branch master-recovered-new fed6ba87e445db5175c628cfecbbd0b83526a54a

Author