November 2017

Know your cloning

Is your dataset biased in selection?

duplicate rates

GitHub has been one of the most popular destinations for sharing projects, and it has also become popular in research as a source of datasets to mine for patterns of interest. When running experiments on software and making statistical claims, one would expect the conclusions to be drawn from a software corpus made up of projects that are randomly selected and independent. Independence is taken for granted in many studies; however, there are various ways in which one project can influence another, and program reuse is an important and common one. A dataset is biased if there is too much duplication among its projects. Is your dataset biased? This tool, Dejavu, assists in selecting projects from GitHub. We provide an index of project-level and file-level duplication for the entire GitHub repository collection(*) for four popular languages: Java, C++, Python, and JavaScript.
(* The repositories were downloaded from a GitHub mirror dated November 2016, so the newest projects may be missing from our database.)
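
To make the notion of a file-level duplicate rate concrete, here is a minimal sketch that treats two files as duplicates when their contents are byte-for-byte identical. This is an illustrative assumption, not necessarily how the index was built; the function names and the `(project, path, content)` input shape are hypothetical.

```python
import hashlib
from collections import defaultdict


def file_hash(content: bytes) -> str:
    """Return a SHA-256 digest identifying exact file contents."""
    return hashlib.sha256(content).hexdigest()


def duplicate_rate(files):
    """Fraction of files that are an extra copy of another file.

    `files` is an iterable of (project, path, content_bytes) tuples
    (a hypothetical input shape chosen for this sketch).
    """
    groups = defaultdict(list)
    for project, path, content in files:
        groups[file_hash(content)].append((project, path))

    total = sum(len(members) for members in groups.values())
    if total == 0:
        return 0.0

    # Each group contributes one "original"; every other member is a copy.
    duplicates = total - len(groups)
    return duplicates / total


if __name__ == "__main__":
    sample = [
        ("proj-a", "util.py", b"print(1)\n"),
        ("proj-b", "helpers.py", b"print(1)\n"),
        ("proj-c", "main.py", b"print(2)\n"),
    ]
    print(duplicate_rate(sample))
```

Exact hashing only catches identical copies; near-duplicates (for example, files that differ in whitespace or a renamed identifier) would need token-level or similarity-based comparison instead.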

# of projects

For four popular languages, we have downloaded all of their non-forked repositories on GitHub.

# of files

The inner circle still shows the proportion of downloaded projects so that it can be compared with the outer circle, which represents the number of downloaded files for our four languages.

Java

Intentional: similar functionality required in multiple contexts; school assignments.
Unintentional: boilerplate for Android applications.
Autogenerated: Apache Axis (web service framework); Android; Java.

C++

Intentional: slight adaptation of algorithms for different interfaces.
Unintentional: different library versions.
Autogenerated: Qt Meta-Object Compiler.

Python

Intentional: unit tests; database schemas.
Autogenerated: Django.

JavaScript

In JavaScript we found a greater variety among the identical copies. But we quickly realized that while these libraries are different, they are all Node.js packages of one kind or another. It almost seemed as if our analysis of JavaScript had only analyzed Node.js.
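
One way to separate Node.js package files from the rest of a JavaScript corpus is to filter on file paths. The sketch below assumes that package files live under a `node_modules` directory, which is the npm convention; the page does not say how the split was actually made, and the function names here are hypothetical.

```python
from pathlib import PurePosixPath


def is_node_package_file(path: str) -> bool:
    """True if any directory component of the path is `node_modules`."""
    return "node_modules" in PurePosixPath(path).parts


def without_node(paths):
    """Keep only the paths that are not inside a `node_modules` directory."""
    return [p for p in paths if not is_node_package_file(p)]


if __name__ == "__main__":
    paths = [
        "src/app.js",
        "node_modules/lodash/index.js",
        "lib/node_modules/express/index.js",
    ]
    print(without_node(paths))
```

Matching on whole path components, rather than on a substring, avoids discarding unrelated directories whose names merely contain `node_modules`.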

JavaScript w/o node

Intentional: tests that differ only in name.
Unintentional: different library versions (jQuery).
Autogenerated: angular.js with Yeoman; express.js; grunt.js.