Simphile – text similarity and pattern detection


Simphile™, the first pattern recognition software available from Geneffects, derives its name from "the love of similarities." The pattern detection methods used in Simphile have been applied in areas ranging from language recognition to gene matching in bioinformatics. With Simphile, one could estimate the likelihood that Shakespeare wrote an anonymous sonnet, judge whether certain sound files came from the same source, or measure the similarity of source code, term papers, or email spam, to name a few applications.

Give Simphile a source file and a set of files to compare against it, and it will sort those files in order of similarity to the source file.

How It Works
An interesting fact is that many pattern recognition algorithms can be used as compression algorithms. As Simphile demonstrates, the converse also holds: a compression algorithm can serve as a pattern detector. Simphile uses the common compression algorithm gzip as its pattern detection engine. Suppose we are comparing file A and file B. We compress file A to see how small it gets, then compress file B to see how much it shrinks. Finally, we compress the concatenation A+B. If the size of gzip(A+B) is significantly less than the size of gzip(A) plus the size of gzip(B), then files A and B share patterns! (This method was inspired by Ming Li et al.)
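
The comparison described above can be sketched in a few lines of Java. This is only an illustration built on the standard java.util.zip classes, not code taken from the Simphile source: the class name GzipSimilarity and the methods gzipSize, sharedBytes, and rank are hypothetical, and the score used here (raw bytes saved by compressing the two inputs together) is an assumption, since this page does not say how Simphile turns the three sizes into a ranking.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration; not taken from the Simphile source code.
final class GzipSimilarity {

    /** Size in bytes of the gzip-compressed form of the given data. */
    static int gzipSize(byte[] data) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // Closing the stream flushes the remaining compressed bytes and the trailer.
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.size();
    }

    /** Bytes saved by compressing A+B together instead of A and B separately. */
    static int sharedBytes(byte[] a, byte[] b) {
        byte[] joined = new byte[a.length + b.length];
        System.arraycopy(a, 0, joined, 0, a.length);
        System.arraycopy(b, 0, joined, a.length, b.length);
        return gzipSize(a) + gzipSize(b) - gzipSize(joined);
    }

    /** Candidate names ordered from largest to smallest saving against the source. */
    static List<String> rank(byte[] source, Map<String, byte[]> candidates) {
        List<String> names = new ArrayList<>(candidates.keySet());
        names.sort(Comparator.comparingInt(
                (String name) -> sharedBytes(source, candidates.get(name))).reversed());
        return names;
    }

    public static void main(String[] args) {
        byte[] source = "the quick brown fox jumps over the lazy dog near the river bank"
                .getBytes(StandardCharsets.UTF_8);
        Map<String, byte[]> candidates = Map.of(
                "near-copy", "the quick brown fox jumps over the lazy cat near the river bank"
                        .getBytes(StandardCharsets.UTF_8),
                "unrelated", "0123456789 9876543210 4815162342 2718281828 3141592653"
                        .getBytes(StandardCharsets.UTF_8));
        // Print each candidate's saving, then the resulting order.
        candidates.forEach((name, bytes) ->
                System.out.println(name + ": " + sharedBytes(source, bytes) + " bytes saved"));
        System.out.println(rank(source, candidates));
    }
}
```

Two caveats apply to this sketch. Every gzip stream carries a fixed header and trailer, so even unrelated inputs show a small saving, which is why the text asks for a significant difference. The raw saving also grows with file size, so comparing files of very different lengths would probably call for some form of normalization.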

Free and Open Source
Simphile is free of cost to those using it for educational or personal applications.
The source code for Simphile is written in Java and is freely available. Please submit all code suggestions to . Please contact us if you would like to use Simphile, or portions of its source code, in any commercial application. If you discover any interesting uses for the program, please tell us! We are always eager to hear of innovative applications.

Simphile 1.0 – OSX – 103k
Simphile 1.0 – (Unix, Windows) – 44k
Source code – 73k

Geneffects Software