GitHub's Data Model: How to Store Version History in a Database
Modelling version control in a database the way Git does: content-addressable blobs, trees and commits as a Merkle DAG, structural sharing of unchanged files, and the two-table schema that backs it.
How do you store the entire history of a codebase — every version of every file across thousands of commits — without the database exploding? The naive answer ("save a full copy of every file on every commit") would store the same unchanged README ten thousand times. Git's answer is one idea applied relentlessly: address every object by the hash of its own content. Identical content always lands at the same address, so it's stored exactly once — and an unchanged file is automatically shared across every commit that contains it. This is the data model behind Git, and behind GitHub at scale. The…
What’s inside
Read this one free
Sign in and your first premium article is on us — read GitHub's Data Model: How to Store Version History in a Database free.