Duplicate file profiles

These tools generate a "profile" of a set of files in order to compare the file contents of two machines while giving away very little about the actual contents of the files.

Generating a profile

(These instructions are for Unix-like systems such as Linux or Mac OS X; enter the commands at a command line. They will probably also work under Windows with something like Cygwin installed. Either way, I'm assuming you have perl installed.)

The tool to generate a profile is dupsearch. After saving this, you may need to make it executable with "chmod +x dupsearch", or use "perl dupsearch" below.

To generate the profile of your home directory (i.e., your personal files), use:

$ cd
$ find . -xdev -type f -print | dupsearch > my_files-profile.txt

The generated profile looks like:

d92f8f 4096 1
312630 8192 1
a677e7 16384 1
0baadc 8192 1

The first column is 6 characters of the file's MD5 hash, the second is the file size rounded up to a multiple of 4096 bytes, and the third is the number of copies of that file found within the profile. Since the purpose of this exercise is to find duplicates between machines, all the local duplicates are counted as a single file for the purposes of comparison.
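
To make the format concrete, here is a rough Python sketch of how one profile line could be derived from file contents. It is not the dupsearch source: the function names are made up for illustration, and I'm assuming the 6 characters are the first 6 hex digits of the MD5 digest and that the third column counts identical (hash, size) pairs.

```python
import hashlib
from collections import Counter

BLOCK = 4096  # sizes are rounded up to a multiple of this many bytes


def profile_key(data):
    """Return (hash prefix, rounded size) for one file's contents."""
    digest = hashlib.md5(data).hexdigest()
    prefix = digest[:6]  # assumption: the *first* 6 hex characters are kept
    rounded = -(-len(data) // BLOCK) * BLOCK  # ceiling to a multiple of BLOCK
    return prefix, rounded


def build_profile(contents):
    """Collapse local duplicates and format one line per distinct key."""
    counts = Counter(profile_key(c) for c in contents)
    return ["%s %d %d" % (h, size, n) for (h, size), n in counts.items()]
```

Two identical files collapse into a single line with 2 in the third column, which is why local duplicates count only once when profiles are compared.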

To generate a full-machine profile, you need to make sure it doesn't try to profile any pseudo-filesystems like /proc. You can use the command:

# find / /home -xdev -type f -print | dupsearch > my_machine-profile.txt

Here "/ /home" is the list of filesystems you have mounted. The -xdev option stops find from descending into other filesystems, so each real filesystem has to be listed explicitly. Do this as root (using "su" or "sudo") if you want to include non-public files in the profile.

Comparing profiles

After generating a set of profiles for interesting sets of files, you can compare them to look for duplicates.

The tool to compare profiles is profilecmp. Again, after saving this, you may need to make it executable with "chmod +x profilecmp", or use "perl profilecmp".

$ profilecmp one-profile.txt another-profile.txt third-profile.txt
42019430400/263635070976 duplicate bytes, 15.9384827839636%
179728/2275961 duplicate files, 7.89679612260491%
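
For the curious, the kind of calculation behind those two lines can be sketched in Python as follows. This is only my guess at the logic, not the profilecmp source: I'm assuming an entry counts as a duplicate when the same (hash, size) pair appears in more than one profile, that totals are taken over distinct pairs, and that bytes are the rounded sizes. The function names are hypothetical.

```python
from collections import Counter


def parse_profile(lines):
    """Return the set of (hash prefix, size) keys; the local count is ignored."""
    keys = set()
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        keys.add((fields[0], int(fields[1])))
    return keys


def compare(profiles):
    """Return (dup_bytes, total_bytes, dup_files, total_files)."""
    seen = Counter()
    for keys in profiles:
        seen.update(keys)  # each profile contributes at most 1 per key
    dup_keys = [k for k, n in seen.items() if n > 1]
    total_bytes = sum(size for _, size in seen)
    dup_bytes = sum(size for _, size in dup_keys)
    return dup_bytes, total_bytes, len(dup_keys), len(seen)
```

The percentages are then just dup_bytes / total_bytes and dup_files / total_files, each multiplied by 100.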

Compare with me!

Here are some profiles from some of my machines:

They are compressed with gzip, so you may need to uncompress them with gunzip before feeding them into profilecmp. Please mail me the results of your comparisons.
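
If you would rather inspect a compressed profile from a script than run gunzip first, Python's standard gzip module can read it directly. A small sketch, where read_profile_gz is a made-up helper name and the path is whatever you saved the download as:

```python
import gzip


def read_profile_gz(path):
    """Yield the lines of a gzip-compressed profile as plain text."""
    with gzip.open(path, "rt", encoding="ascii") as fh:
        for line in fh:
            yield line.rstrip("\n")
```

profilecmp itself still needs the uncompressed file, so this is only a convenience for looking at the data.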

Tools

dupsearch

profilecmp