LeakDB is a tool set designed to allow organizations to build and deploy their own internal plaintext “Have I Been Pwned”-like service. The LeakDB tool set can normalize, deduplicate, index, sort, and search leaked data sets on the multi-terabyte-scale, without the need to distribute large files to individual users. Once curated, LeakDB can search terabytes of data in less than a tenth of a second, and the LeakDB server exposes a simple JSON API that can be queried using the command line client or any http client. It can be deployed in a serverless configuration with a BigQuery backend (no indexes), or as an offline/traditional server with indexes.
LeakDB uses a configurable bloom filter to remove duplicate entires, sorts indexes using external parallel quicksort (i.e., memory constrained) with a k-way binary tree merge, and binary tree search to find entries in the index.