'Not all those who wander are lost' - J.R.R. Tolkien

Merge across repositories - Git.

TL;DR - Skip straight to the last two sections.

We all know that Git is an excellent piece of software and it is a DVCS and the D stands for 'Distributed' and all of that is very good and so on and so forth, etc.

But honestly, how often do you use the 'D' part in there?

The 'D' in DVCS. A short detour!

Digressing a bit. Once upon a time when I was young, in programming or otherwise, I needed to handle distributed transactions. In the Microsoft world that meant DTC, the Distributed Transaction Coordinator, and I didn't know squat about it. So I walked up to the venerable senior architect in the organization and asked for help. His terse response: 'Change the system so that you don't need it!'. I was quite taken aback then, given that my naive faith in the power of tools had not yet been battle-tested.

Over the years, this recurred once, then again, and again, and by now I have learned, both the hard way and the soft way, that the 'D' basically means 'Don't go there. There be Dragons!!!'.

So back to the point - how do we use Git most often?

How do we use Git? Usually, mostly, as a better VCS.

In my experience, it works mostly as a drop-in replacement for a standard VCS.

Oh yes, it comes with value adds:

But mostly, in spite of all the distribu-thing-ummajigs in there, we still have a central source of truth.

Somewhere out there is a 'master' which is 'THE master' branch. And while we branch and fork and checkout and checkin and switch and 'pull request' all we like, sooner or later, all roads lead back to Rome - which in this case - is that particular 'THE master'.

Short detour again - introducing Robert Frost !

Do you know this Frost?

No? OK, let me introduce you. The chief characteristic of the esteemed Mr. Frost is his rather well-known penchant for taking the road less travelled by. (Murphy must have been quite a fan!)

Point being, sometimes, Mr. Frost steps into the fray, and as is his wont, takes the road less travelled by, and pulls you along into it.

Road less travelled - Using Git as a "DVCS".

So, after several years of using Git as a VCS with a first name ("D"), I finally have a situation where I have to deal with the "Distributed" nature of Git.

The context, and the problem.

Let me explain the situation. In this here and now, this is how the cookie crumbled.

In my current engagement, work occurs in two Git repositories, each with its own 'master' branch and feature branches and history and so on.

Why so? Well, because,

Moreover,

And the explorative work has way more volume than actual production work. So to cleanly enforce separation of code, there are two repositories -

But every now and then, a POC clears business validation.

Subsequently, a new branch is created in the POC repo, within which the concept-based code is upgraded and polished, and code quality reviews and all of the other practices which production readiness demands happen.

Once that is done, though, what we have is this nice, freshly minted, production quality sub-system residing in branch 'nice-production-quality-sub-system' in repository 'POC'.

What we need, however, is the above code sitting in branch 'checked-in-production-sub-system' in repository 'Prod' - along with its long history of development in the 'POC' repo.

Thus what I have to do is move the code across git repositories, preserving history.

Eh! WHAT?

The theory - Git merge across repositories.

What is the first thing to do when you have something like this? Well, you look for volunteers. There are none. OK, so someone must have done this before? Oops! Double or nothing, it's nobody again. Thus at this point, after successfully drawing two blanks, you go back to the drawing board and start doodling.

Which is when it hits you. Git is designed to do exactly that - it is, after all, a distributed VCS.

Version control systems generally have this concept of a 'local' copy and a 'remote repository'. The way you work is that you make all changes in your local copy, and when you are done you check them in to local. This local is something on your own physical machine, and at this point only your copy has your changes.

But usually you want to synchronize things, not just make endless changes to the same thing (repository). And synchronizing implicitly means you need two things to synchronize between: a 'source' and a 'sink'.

From wikipedia -

Distributed revision control systems (DVCS) takes a peer-to-peer approach to version control, as opposed to the client–server approach of centralized systems. Distributed revision control synchronizes repositories by exchanging patches from peer to peer. There is no single central version of the codebase; instead, each user has a working copy and the full change history.

And here is the trick that actually lets us do what we want.

In traditional or client-server based synchronization, you can only synchronize from the 'sink' to the 'source'. The 'sink' can keep changing its contents, and you can have many 'sink's, but every single 'sink' needs to pass all its changes/records to the 'source' for synchronization to be effective and available to all other 'sink's. And these synchronization steps are all sequential. The 'source' is the gatekeeper, and cannot be changed.

Git, however, eliminates the difference between 'source' and 'sink' - 'source' and 'sink' are just roles you play. Moreover, any repository can choose any other repository, and between those two, either party may play the role of 'source' or 'sink' at will.

Some terminology now. In Git, the 'sink' is called the 'local' repo and the 'source' is called the 'remote'. And in Git or other DVCSs it is possible to switch out the 'remote' for some other 'remote', or to have any number of 'remote's.

On that principle, if I fetch the 'prod-ready-poc' branch from the 'POC' remote into the 'POC' local repo, then switch (or actually, add) the 'Prod' remote as another 'source' (call it 'newremote'), it should theoretically be possible to push the now local 'prod-ready-poc' branch from 'POC' local into a new branch (call it 'new-in-prod'), on the 'Prod' remote (i.e. the 'newremote' added to 'POC local').

Git should seamlessly merge both. Or that's the theory anyway.
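As a rough sketch of that theory, here is what the sequence could look like. The assumptions are all mine: two throwaway bare repositories on the local file system stand in for the 'POC' and 'Prod' remotes, and the directory names (poc.git, prod.git, seed, poc-local) and the demo identity are made up for illustration.

```shell
#!/bin/sh
set -eu

# Throwaway sandbox; all paths and the demo identity are hypothetical.
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Two bare repositories play the 'POC' and 'Prod' remotes.
git init -q --bare poc.git
git init -q --bare prod.git

# Seed the 'POC' remote with a 'prod-ready-poc' branch holding one commit.
git init -q seed
cd seed
git checkout -q -b prod-ready-poc
echo "polished code" > subsystem.txt
git add subsystem.txt
git commit -q -m "Polish sub-system for production"
git push -q "$work/poc.git" prod-ready-poc
cd "$work"

# 'POC' local: cloning fetches the branch and checks it out.
git clone -q --branch prod-ready-poc "$work/poc.git" poc-local
cd poc-local

# Add the 'Prod' remote as a second remote, alongside 'origin'.
git remote add newremote "$work/prod.git"

# Push the local branch to a new branch name on the 'Prod' remote.
git push -q newremote prod-ready-poc:new-in-prod

# Compare the branch tips on both sides.
poc_tip=$(git rev-parse refs/heads/prod-ready-poc)
prod_tip=$(git --git-dir="$work/prod.git" rev-parse refs/heads/new-in-prod)
echo "POC tip:  $poc_tip"
echo "Prod tip: $prod_tip"
```

Note that the push only copies the branch and its history across. Nothing is actually merged until 'new-in-prod' is explicitly merged into another branch on 'Prod'.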

A complete discussion of why is beyond the scope of this post - but if you are interested you can get started with git basics. Another good source is here

Walking the talk - Git merge across repositories.

But theory and practice are very different things, so let's see what it takes to put theory into practice.

Arrange

For my purposes, everything is set up on the local machine. Since we need a total of 4 repositories (prod, prod local copy, condev, condev local copy), all 4 instances have been set up on the file system. After setup, it looks like this:

After setup, the history for branch 'pb' from 'locprod' is :

And branch 'cdb' history from 'loccd' is :

At this point we have two repositories with branches whose histories have diverged. The goal is to get all the contents and history of 'cdb' in 'loccd' into 'pb' in 'prod'. Which means, at the end, the 'pb' branch in 'prod' should contain all the four folders 'f1', 'f2', 'f3', 'f4' with all of their files, and the history of branch 'pb' should contain all of the individual commit/operation logs from 'prod'/'locprod' and 'cd'/'loccd'.

I am not showing the steps for the above because if you need help here, you are really not ready for the next bits.

Act

The goal here is to merge 'loccd'.'cdb' into, ultimately, 'prod'.'pb'. However, this is a multi-step process.
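One possible way to sequence those steps is sketched below. This is a reconstruction under assumptions, not necessarily the exact commands I ran. The split of folders ('f1', 'f2' in 'prod' and 'f3', 'f4' in 'cd'), the remote name 'prodremote', the helper function 'seed' and the demo identity are all hypothetical. The merge flag requires Git 2.9 or later.

```shell
#!/bin/sh
set -eu

# Throwaway sandbox; the demo identity is hypothetical.
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# seed <bare-repo> <local-copy> <branch> <folder>...
# Hypothetical helper: creates a bare repo plus a local repo whose <branch>
# has one commit per folder, then pushes that branch to the bare repo.
seed() {
    bare=$1; local_copy=$2; branch=$3; shift 3
    git init -q --bare "$work/$bare"
    git init -q "$work/$local_copy"
    cd "$work/$local_copy"
    git checkout -q -b "$branch"
    git remote add origin "$work/$bare"
    for folder in "$@"; do
        mkdir "$folder"
        echo "content of $folder" > "$folder/file.txt"
        git add "$folder"
        git commit -q -m "Add $folder"
    done
    git push -q origin "$branch"
    cd "$work"
}

seed prod locprod pb  f1 f2   # assumed contents of 'prod'
seed cd   loccd   cdb f3 f4   # assumed contents of 'cd'

cd "$work/loccd"

# Step 1: make 'prod' reachable from 'loccd' as a second remote and fetch it.
git remote add prodremote "$work/prod"
git fetch -q prodremote

# Step 2: create a local 'pb' that starts from prod's 'pb'.
git checkout -q -b pb prodremote/pb

# Step 3: merge 'cdb' into it. The two branches share no common ancestor,
# so Git refuses unless unrelated histories are explicitly allowed.
git merge -q --allow-unrelated-histories -m "Merge cdb from cd into pb" cdb

# Step 4: push the merged branch back to 'prod'. This is a fast-forward
# for prod's 'pb', since the merge commit sits on top of it.
git push -q prodremote pb:pb

# Inspect what 'prod' now holds on 'pb'.
prod_commits=$(git --git-dir="$work/prod" rev-list --count refs/heads/pb)
prod_folders=$(git --git-dir="$work/prod" ls-tree -d --name-only refs/heads/pb | xargs)
echo "commits on prod.pb: $prod_commits"
echo "folders on prod.pb: $prod_folders"
```

Doing the merge inside 'loccd' rather than 'locprod' is a design choice, not a requirement. The same steps work in the other direction by adding 'cd' as a remote of 'locprod' and merging its 'cdb' into 'pb' there.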

Assert

And you are done.

These are the final folders

This is the final log of history 'prod.pb'.

And the final git merge tree.
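If you would rather check the result with commands than by eyeballing it, something along these lines should do. This is a minimal sketch on a stand-in repository: it assumes a single throwaway repo in which two unrelated branches (named 'pb' and 'cdb' to match the above) have already been merged, and the folder split and demo identity are hypothetical.

```shell
#!/bin/sh
set -eu

# Throwaway sandbox; the demo identity is hypothetical.
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Stand-in for the merged result: two unrelated branches in one repository.
git init -q merged
cd merged
git checkout -q -b pb
for folder in f1 f2; do
    mkdir "$folder"
    echo "prod content" > "$folder/file.txt"
    git add "$folder"
    git commit -q -m "prod: add $folder"
done

# An orphan branch has no parent commit, so its history is unrelated to 'pb'.
git checkout -q --orphan cdb
git rm -rf -q .               # clear the files carried over from 'pb'
for folder in f3 f4; do
    mkdir "$folder"
    echo "cd content" > "$folder/file.txt"
    git add "$folder"
    git commit -q -m "cd: add $folder"
done

git checkout -q pb
git merge -q --allow-unrelated-histories -m "Merge cdb into pb" cdb

# Check 1: all folders are present at the tip of 'pb'.
folders=$(git ls-tree -d --name-only pb | xargs)

# Check 2: the history contains every individual commit plus the merge.
commits=$(git rev-list --count pb)
merges=$(git rev-list --count --merges pb)

# Check 3: two root commits means two originally separate histories.
roots=$(git rev-list --count --max-parents=0 pb)

echo "folders: $folders"
echo "commits: $commits, merges: $merges, roots: $roots"

# The merge tree itself, one line per commit.
git log --graph --oneline pb
```

The count of root commits is the most telling check: a branch that grew inside one repository normally has a single root, so two roots show that both histories survived the move.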

That's it. You are just one step away from having everything in 'prod'.'master' by merging 'prod'.'pb' into 'prod'.'master'. You can do that by whatever method works best for you: review, merge, push, pull request, whatever.

This is what worked for me. Thank you for reading. Have a nice day!