David Lillis: Expediting MRSH-v2 Approximate Matching with Hierarchical Bloom Filter Trees

David Lillis, Frank Breitinger and Mark Scanlon

In P. Matoušek and M. Schmiedecker, editors, Digital Forensics and Cyber Crime. ICDF2C 2017, volume 216 of Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, pages 144–157. Springer, Cham, 2018.


Perhaps the most common task encountered by digital forensic investigators consists of searching through a seized device for pertinent data. Frequently, an investigator will be in possession of a collection of "known-illegal" files (e.g., a collection of child pornographic images) and will seek to find whether copies of these are stored on the seized drive. Traditional hash matching techniques can efficiently find files that match precisely. However, these will fail in the case of merged files, embedded files, partial files, or if a file has been changed in any way. In recent years, approximate matching algorithms have shown significant promise in the detection of files that have a high bytewise similarity. This paper focuses on MRSH-v2. A number of experiments were conducted using Hierarchical Bloom Filter Trees to dramatically reduce the number of pairwise comparisons that must be made between known-illegal files and files on the seized disk. The experiments demonstrate substantial speed gains over the original MRSH-v2, while maintaining effectiveness.