Code duplication between Amarok versions

For a student assignment we had to analyze code duplication between releases of an application. This gives insight into how much code stayed the same from one version to another.
Together with Remco Blewanus, I chose to analyze the best audio player out there: Amarok. Besides already being familiar with Amarok, we had a couple of reasons for this choice.
The code base is large enough to do useful things with it. As of version 2.3.0, Amarok has roughly 181 KLOC. Obviously, this is more interesting than studying a target consisting of only 1000 LOC.
Another reason to choose Amarok is that it has had quite a few releases, which should give a good impression of its evolution over a longer time scale.
The last reason is merely technical, just for our convenience: since the Amarok project switched to Git not too long ago, it is easy to obtain all releases and request a difference between two arbitrary versions.

To analyze the source code we used the CCFinder tool. Unfortunately we had to use the Windows version, despite the availability of a Linux version. There was no way I could get it running on my Linux system because the Linux version only has (K)Ubuntu in mind.
With CCFinder, you can look for code duplication within one release and point out bits of code that were copied from one section to another. You can also compare two source trees and determine how much overlap exists between them. Only code tokens are considered, so whitespace changes or comments do not affect the numbers.
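
To illustrate why a token-based comparison ignores whitespace and comments, here is a minimal Python sketch. It is not CCFinder's algorithm: the regular expressions, the `tokenize` and `token_similarity` names, and the use of `difflib.SequenceMatcher` are all assumptions made for this example, and string literals containing comment markers are not handled.

```python
import re
from difflib import SequenceMatcher

# Assumption: C/C++-style comments only; string literals are not special-cased.
COMMENT_RE = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)
# Identifiers, integer literals, or any single non-whitespace symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def tokenize(source):
    """Drop comments, then split the remaining text into code tokens."""
    return TOKEN_RE.findall(COMMENT_RE.sub(" ", source))

def token_similarity(a, b):
    """Return a ratio in [0, 1] describing how much two token streams overlap."""
    matcher = SequenceMatcher(None, tokenize(a), tokenize(b), autojunk=False)
    return matcher.ratio()

old = "int add(int a, int b) { return a + b; }"
new = """int add(int a,
        int b)
{
    // sum the arguments
    return a + b; /* done */
}"""

# Layout and comments differ, but the token streams are identical.
print(token_similarity(old, new))
```

Reformatting a file or adding comments leaves the score untouched, while renaming a variable or changing an expression lowers it.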

We compared 39 versions of Amarok. The earliest tag in the Amarok repository was version 0.8.2, the latest tag was version 2.3.0. We only considered official releases, no betas or release candidates. Curiously, version 1.3.5 was missing from the tags, so we didn't consider this release either.
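
Selecting only the official releases from a list of tags could be sketched like this. The tag names below are hypothetical (the real Amarok tags may be spelled differently), and `official_releases` is a helper invented for this example: it keeps tags that consist of dotted numbers only and sorts them by version.

```python
import re

# Assumption: an official release tag is an optional "v" followed by dotted numbers.
RELEASE_RE = re.compile(r"^v?(\d+(?:\.\d+)+)$")

def official_releases(tags):
    """Keep purely numeric version tags and sort them in version order."""
    releases = []
    for tag in tags:
        match = RELEASE_RE.match(tag)
        if match:
            version = tuple(int(part) for part in match.group(1).split("."))
            releases.append((version, tag))
    return [tag for _, tag in sorted(releases)]

# Hypothetical tag names, for illustration only.
tags = ["2.3.0", "2.0-beta1", "0.8.2", "1.4.10", "1.4.9", "2.0-rc1"]
print(official_releases(tags))
```

Sorting on the numeric tuple rather than on the string keeps 1.4.10 after 1.4.9.
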
After some hours of code crunching and processing the numbers we obtained this heatmap:

A dark color indicates little code overlap, a light color indicates a great deal of overlap (a completely white square indicates that two versions are completely equal).
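
A heatmap like this boils down to a square matrix with one overlap score per pair of versions. The sketch below is an assumption about how such a matrix could be assembled, not a description of our actual scripts: `similarity_matrix`, `to_gray` and the toy Jaccard measure are invented for this example, the overlap measure is assumed to be symmetric, and a plain gray scale stands in for the real color map.

```python
def similarity_matrix(versions, similarity):
    """Build a symmetric matrix of pairwise overlap scores in [0, 1]."""
    n = len(versions)
    # A version compared with itself overlaps completely.
    matrix = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            score = similarity(versions[i], versions[j])
            matrix[i][j] = matrix[j][i] = score
    return matrix

def to_gray(score):
    """Map overlap to an 8-bit gray level: 0 is black, 255 is white."""
    return round(255 * max(0.0, min(1.0, score)))

# Toy data: each "version" is just a set of made-up tokens.
toy = {"1.3": {"a", "b", "c"}, "1.4": {"a", "b", "d"}, "2.0": {"x", "y"}}
names = list(toy)

def jaccard(u, v):
    return len(toy[u] & toy[v]) / len(toy[u] | toy[v])

for row in similarity_matrix(names, jaccard):
    print([to_gray(score) for score in row])
```

With a symmetric measure, 39 versions need 39 × 38 / 2 = 741 comparisons rather than all 39 × 39 cells.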

From this picture you can observe some interesting events during the Amarok development.
Of course there's the large dark square at the bottom: this indicates the complete rewrite of the Amarok 2 series, breaking with its past. Almost no code from the 1.x series survived in this release. It should be considered that Amarok 2.0.x contains a large body of code from the Plasma project, which does not contribute at all to overlapping code.
You can see how each minor release shows a considerable drop in duplication (for instance, going from 1.3 to 1.4). This matches the release policy: such version bumps imply feature additions, changes and many bugfixes, in other words major code changes.

The 1.4.x series was an interesting one: it was the last stable Amarok branch before the 2.x series. It was maintained for about two years, keeping the community happy while the developers put most of their effort into Amarok 2. Over time, you see the yellow triangle covering the 1.4.x series becoming whiter. The last releases of this branch show only a few changes, just some (urgent) bugfixes.
In the map you can also observe that the transition from 1.4.4 to 1.4.5 shows a relatively large gap with respect to the 1.3.x series. This can be explained by SQLite, which takes up a significant portion of the code base, from ~66% in version 1.0 down to ~30% in version 1.3.9. It shouldn't be a surprise that a major SQLite update easily influences the statistics for the whole code base, which is what happened in version 1.4.5.
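
A bit of arithmetic shows how much a single large component can weigh on the overall figure. The numbers here are hypothetical, not measurements, and `overall_overlap` is a simplification made up for this example: it treats the overall overlap as a share-weighted average of the component's overlap and the overlap of the rest of the code.

```python
def overall_overlap(component_share, component_overlap, rest_overlap):
    """Share-weighted average of the overlap in one component and in the rest."""
    return (component_share * component_overlap
            + (1 - component_share) * rest_overlap)

# Hypothetical: a component is 30% of the code base and half of it changed,
# while everything else stayed exactly the same.
print(overall_overlap(0.30, 0.5, 1.0))

# Hypothetical: the same component is replaced entirely.
print(overall_overlap(0.30, 0.0, 1.0))
```

Under these assumptions the overall overlap drops to about 85% and 70% respectively, even though none of the player's own code changed.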

The 2.x series shows an unorthodox pattern: only version 2.2 shows a significant drop in code duplication compared to its predecessors. I have a possible explanation for why this effect is not visible for version 2.1: Plasma was kicked from the code base. Since Plasma became part of kdelibs, there was no need to ship it with Amarok any longer. The removed code compensates for the feature additions made in the meantime, which results in a relatively higher similarity with the previous versions. Why the 2.3 release shows little change compared with 2.2 is a bit of a mystery to me; I cannot come up with anything other than version 2.3 being a collection of small updates to 2.2.

It would have been interesting to see whether the transition to Git has an effect on the code duplication between Amarok versions. Unfortunately, not enough time has passed since the transition to observe a change in the pattern.

In the end we're quite satisfied with this visualization, based on such a lively project as Amarok. It is interesting to see how certain events during the development are reflected in a visualization of code duplication. For myself, I enjoyed doing this assignment. Nice software, statistics and pretty pictures: sweet candy for the average open source developer. :)

Attachment: hmap.svgz (11.53 KB)

similarity in the 2.x series

You're right that a lot of not-strictly-Amarok code got pulled out between the 2.0 releases and 2.1. If you re-make this chart two years from now, you'll probably see that there still isn't a huge change from 2.1 to 2.5, at least not as big as the major-version changes in the 1.x series. One reason is that we really have only one chunk of code that will probably get carved out of Amarok at some point (libpud). And the other reason is that with the 2.x rewrite, we took a lot of time to design the underlying program structure so that we could add on to it easily, without having to rewrite a large portion of the codebase with each major release. That's just good program engineering.

Wow.. a little more detail

Wow.. a little more detail and this would make a great technical paper for Akademy *hint hint* ;)