Know your cloning
Is your dataset biased in selection?

GitHub is one of the most popular destinations for sharing projects, and it has also become a popular source of datasets for researchers mining software for patterns of interest. When running experiments on software and making statistical claims, one expects the conclusions to be drawn from a corpus of projects that are randomly selected and independent. Many studies take independence for granted, but there are various ways one project can influence another, and code reuse is an important and common one. A dataset is biased if there is too much duplication among its projects. Is your dataset biased? This tool, Dejavu, helps you select projects from GitHub. We provide an index of project-level and file-level duplication across all of GitHub(*) for four popular languages: Java, C++, Python, and JavaScript.
(* The repositories were downloaded from a GitHub mirror dated November 2016, so the newest projects may not be in our database.)
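
To make the notion of file-level duplication concrete, here is a minimal sketch that estimates how much of a small corpus consists of files copied byte-for-byte across projects. It assumes exact-copy detection by content hash only; the corpus layout and the names `file_hash` and `duplication_ratio` are hypothetical, and this is not necessarily how Dejavu computes its index.

```python
import hashlib
from collections import defaultdict


def file_hash(content: bytes) -> str:
    """Return a SHA-256 digest identifying byte-identical file contents."""
    return hashlib.sha256(content).hexdigest()


def duplication_ratio(projects: dict[str, dict[str, bytes]]) -> float:
    """Fraction of files whose exact content also appears in another project.

    `projects` maps a project name to {file path: file content}; this layout
    is a hypothetical stand-in for a real corpus.
    """
    owners = defaultdict(set)  # content hash -> names of projects containing it
    for name, files in projects.items():
        for content in files.values():
            owners[file_hash(content)].add(name)

    total = sum(len(files) for files in projects.values())
    if total == 0:
        return 0.0

    # Count every file whose content is shared by at least two projects.
    shared = sum(
        1
        for files in projects.values()
        for content in files.values()
        if len(owners[file_hash(content)]) > 1
    )
    return shared / total


# Toy corpus: one file is copied verbatim between the two projects.
corpus = {
    "proj_a": {"util.py": b"def f(): pass\n", "main.py": b"print('a')\n"},
    "proj_b": {"helpers.py": b"def f(): pass\n", "app.py": b"print('b')\n"},
}
ratio = duplication_ratio(corpus)
```

A high ratio suggests the projects in the sample are not independent. A sketch like this catches only exact copies; files that differ by whitespace, comments, or small edits would need a looser notion of similarity.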

For four popular languages, we have downloaded all of their non-forked repositories on GitHub.

The inner circle still shows the proportion of downloaded projects so that it can be judged against the outer circle, which represents the number of downloaded files for our four languages.

Intentional: similar functionality required in multiple contexts; school assignments.
Unintentional: boilerplate for Android applications.
Autogenerated: Apache Axis (web service framework); Android; Java.

Intentional: slight adaptation of algorithms for different interfaces.
Unintentional: different library versions.
Autogenerated: Qt Meta-Object Compiler.

Intentional: unit tests; database schemas.
Autogenerated: Django.

In JavaScript we found greater variety among the identical copies. But we quickly realized that while these libraries are different, they are all Node.js packages. It almost seemed as if our analysis of JavaScript had analyzed only Node.js.

Intentional: tests with only the name changed.
Unintentional: different library versions (jQuery).
Autogenerated: angular.js with Yeoman; express.js; grunt.js.