DOJ Document Release · 12 Datasets · 250+ GB · 3.5M+ Pages

Investigate the Epstein files

Work through the DOJ Epstein archive with AI-powered tools. Download datasets from government sources, process them with OCR and named entity recognition (NER), and query them with hybrid semantic search and evidence-grounded AI analysis.

macOS · Windows · Linux · Local & Private

See Librarius in action

A complete desktop application for investigating the Epstein files with AI-powered search, entity extraction, and multimodal analysis.

How It Works

Four simple steps to unlock millions of pages of evidence with AI-powered investigation.

1. Download

Download any of the 12 Epstein archive datasets from the DOJ, the Internet Archive, or via torrent. SHA256 verification confirms that each downloaded file matches its published hash. Over 250 GB of documents, audio, and images are available.
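A minimal sketch of what SHA256 verification of a downloaded archive can look like, using only the Python standard library. The function names are hypothetical and are not taken from the Librarius codebase; the expected hash would come from wherever the project publishes its verification hashes.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: Path, expected_hex: str) -> bool:
    """Compare the computed digest with a published one, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Reading in chunks matters here because individual archives are tens of gigabytes and should not be loaded into memory at once.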

2. Index

Our pipeline processes everything automatically: OCR for scanned documents (PaddleOCR), audio transcription (Whisper), image captioning (vision models), and named entity extraction for people, organizations, dates, and locations.
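One way to picture the indexing step is a router that sends each file to OCR, transcription, or captioning based on its type. The sketch below is illustrative only: the extension table, stage names, and function names are assumptions, not the actual Librarius pipeline.

```python
from pathlib import Path

# Hypothetical mapping from file extension to the first processing stage.
STAGE_BY_SUFFIX = {
    ".pdf": "ocr",
    ".tif": "ocr",
    ".tiff": "ocr",
    ".mp3": "transcription",
    ".wav": "transcription",
    ".m4a": "transcription",
    ".jpg": "captioning",
    ".jpeg": "captioning",
    ".png": "captioning",
}


def stage_for(path: str) -> str:
    """Pick a processing stage from the file extension; unknown types are skipped."""
    return STAGE_BY_SUFFIX.get(Path(path).suffix.lower(), "skip")


def plan(paths: list[str]) -> dict[str, list[str]]:
    """Group files by stage so each model can process its batch in one pass."""
    batches: dict[str, list[str]] = {}
    for path in paths:
        batches.setdefault(stage_for(path), []).append(path)
    return batches
```

In a layout like this, entity extraction would run afterwards on the text produced by any of the three stages.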

3. Explore

Browse documents with intelligent filtering, view images with AI-generated captions, play audio with synced transcripts, and navigate the network of extracted entities across the entire archive.

4. Investigate

Ask complex questions, answered through BGE-M3 hybrid retrieval (dense + sparse + BM25) with cross-encoder reranking. Every AI answer is grounded in source documents, with clickable citations.

Everything you need to investigate the released files

A complete toolkit for investigating the publicly released Jeffrey Epstein court documents: OCR processing, hybrid search, entity extraction, multimodal processing, and evidence-grounded AI analysis.

🔍

Hybrid Search

Find evidence across court filings, depositions, and correspondence. Results from BGE-M3 dense and sparse retrieval and from BM25 are merged with reciprocal rank fusion (RRF), then reordered by a cross-encoder reranker.

Verbatim Quotes · Sources · Rerank
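For readers unfamiliar with RRF fusion, here is a minimal sketch of how ranked lists from several retrievers can be merged. It assumes each retriever returns document IDs in best-first order; the function name is hypothetical and `k = 60` is a commonly used default, not a value confirmed for Librarius.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Merge best-first rankings with reciprocal rank fusion.

    Each document scores 1 / (k + rank) per list it appears in; scores are summed.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Illustrative IDs only: three retrievers that disagree about the ordering.
dense = ["d1", "d2", "d3"]
sparse = ["d2", "d1", "d4"]
bm25 = ["d2", "d3", "d1"]
fused = rrf_fuse([dense, sparse, bm25])
```

Because RRF uses ranks rather than raw scores, the dense, sparse, and BM25 scores never need to be normalized against one another. A cross-encoder would then rescore only the top of the fused list.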
👤

Entity Extraction

Automatically extract people, organizations, locations, and dates mentioned across the entire archive with spaCy NER.

857 People · 1.3K Orgs · 323 Locations
💬

RAG Chat

Ask questions about the archive. Get evidence-grounded answers with citations back to source documents. Multi-provider: Ollama, OpenAI, Claude.

Evidence-Based · Citations · Streaming
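Evidence-grounded answers usually depend on numbering the retrieved passages in the prompt and then mapping citation markers in the answer back to those passages. The sketch below shows that idea under stated assumptions: the `Chunk` fields, prompt wording, and function names are hypothetical, and the LLM call itself is left out.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str
    page: int
    text: str


def build_prompt(question: str, chunks: list[Chunk]) -> str:
    """Number each retrieved chunk so the model can cite it as [n]."""
    sources = "\n".join(
        f"[{i}] ({c.doc_id}, p. {c.page}) {c.text}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the numbered sources below. "
        "Cite each claim with its source number, e.g. [1]. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )


def cited_sources(answer: str, chunks: list[Chunk]) -> list[Chunk]:
    """Map [n] markers in an answer back to chunks, ignoring out-of-range numbers."""
    numbers = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return [chunks[n - 1] for n in sorted(numbers) if 1 <= n <= len(chunks)]
```

Dropping out-of-range markers is a cheap guard against a model citing a source that was never supplied.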
📄

Document OCR

Process scanned FBI documents with PaddleOCR, with PyMuPDF as a fallback. Documents are auto-classified as Depositions, Court Filings, Flight Logs, and more.

245 Docs · 7.9K Chunks
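A primary-plus-fallback extractor can be expressed as an ordered chain: try each extractor in turn and keep the first result that yields enough text. This is a sketch with injected callables standing in for PaddleOCR and PyMuPDF; the function name, the `min_chars` threshold, and the chain shape are assumptions rather than the project's actual logic.

```python
from typing import Callable, Optional

# An extractor takes a file path and returns the text it found.
Extractor = Callable[[str], str]


def extract_text(
    path: str,
    extractors: list[tuple[str, Extractor]],
    min_chars: int = 20,
) -> tuple[Optional[str], str]:
    """Try extractors in order; return (name, text) from the first usable result."""
    for name, extractor in extractors:
        try:
            text = extractor(path).strip()
        except Exception:
            # Any failure in one extractor just moves on to the next one.
            continue
        if len(text) >= min_chars:
            return name, text
    # Nothing produced enough text.
    return None, ""
```

Passing extractors in as callables keeps the chain testable without loading any OCR model.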
📷

Image Analysis

Browse 216+ images with AI-generated captions. Semantic search across descriptions and "Find Similar" visual similarity search.

216 Images · Captions · Similar
🎧

Audio Transcription

Browse audio files with waveform visualization. Whisper-based speech-to-text transcription with playback sync.

Whisper · Waveform · Sync
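Playback sync comes down to finding which transcript segment covers the current playback position. The sketch below assumes segments carry start and end times in seconds plus text, are sorted by start time, and do not overlap; the `Segment` class and function name are hypothetical.

```python
import bisect
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    start: float  # seconds from the beginning of the recording
    end: float
    text: str


def active_segment(segments: list[Segment], position: float) -> Optional[Segment]:
    """Return the segment covering `position`, or None during silence or out of range."""
    starts = [segment.start for segment in segments]
    # Index of the last segment whose start is <= position.
    index = bisect.bisect_right(starts, position) - 1
    if index < 0:
        return None
    candidate = segments[index]
    return candidate if position < candidate.end else None
```

Binary search keeps the lookup fast even for long recordings, which matters when the player polls the position many times per second. A real player would likely precompute the list of start times once rather than on every call.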

Model Management

Configure embeddings, reranking, OCR, and transcription models. Device selection: CPU, MPS (Apple Silicon), CUDA.

MPS/CUDA · Benchmark
🔒

Privacy-First

Everything runs locally. No cloud uploads, no tracking. Use local Ollama models for complete isolation.

Local · Private · Ollama
245 Documents · 7.9K Chunks · 216 Images · 857 People · 1.3K Organizations · 323 Locations · 610 Dates

Map every name in the files

Automatically extract named entities from all documents. Browse by entity type (People, Organizations, Locations, Dates), filter alphabetically, and see mention counts across the entire archive. Build a complete picture of who was involved, where, and when.

  • 857 people extracted from documents
  • 1,300+ organizations and agencies
  • 323 locations: properties, cities, addresses
  • 610 dates and time periods
Entity Explorer
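Mention counts per entity type can be sketched as a simple aggregation over (label, text) pairs, such as those an NER model emits. The example assumes spaCy-style labels like `PERSON` and `ORG`; the function names, normalization rule, and placeholder names are illustrative, not the project's implementation.

```python
from collections import Counter


def mention_counts(mentions: list[tuple[str, str]]) -> dict[str, Counter]:
    """Count mentions per entity type, normalizing case and whitespace."""
    counts: dict[str, Counter] = {}
    for label, text in mentions:
        key = " ".join(text.split()).casefold()
        counts.setdefault(label, Counter())[key] += 1
    return counts


def starting_with(counter: Counter, letter: str) -> list[tuple[str, int]]:
    """Alphabetical filter: entities beginning with `letter`, sorted by name."""
    prefix = letter.casefold()
    return sorted((name, n) for name, n in counter.items() if name.startswith(prefix))


# Placeholder mentions, not data from the archive.
sample = [
    ("PERSON", "Jane Doe"),
    ("PERSON", "jane  doe"),
    ("PERSON", "John Roe"),
    ("ORG", "Example Corp"),
]
counts = mention_counts(sample)
```

Normalizing before counting keeps spelling variants produced by OCR spacing or capitalization from being split into separate entities.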

Search visual evidence with AI

Browse 216+ images from the archive with AI-generated captions. Search across image descriptions with natural language queries, or use "Find Similar" to discover visually related images through vector similarity search.

  • AI-generated captions for every image
  • Semantic search across image descriptions
  • Find Similar: visual similarity search
  • Filter by knowledge base source
Image Gallery
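"Find Similar" style search can be sketched as ranking stored image vectors by cosine similarity to the selected image. The example below uses plain Python lists and a brute-force scan; the function names and tiny vectors are illustrative, and a real index over many images would likely use a vector library instead.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_similar(
    query_id: str, vectors: dict[str, list[float]], top_k: int = 5
) -> list[tuple[str, float]]:
    """Rank every other image by similarity to the query image, best first."""
    query = vectors[query_id]
    scored = [
        (other_id, cosine(query, vector))
        for other_id, vector in vectors.items()
        if other_id != query_id
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]
```

The same function works for caption embeddings, which is how semantic search over descriptions and visual similarity can share one code path.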

Transcribe and search audio files

Browse audio files from the archive with waveform visualization and playback controls. Use Whisper-based speech-to-text to transcribe recordings and make them searchable alongside your documents.

  • Waveform visualization with seek support
  • Playback controls: play, pause, skip, speed
  • Whisper speech-to-text transcription
  • Transcription status tracking
Audio Transcription

Configure your AI pipeline

Full control over the AI models powering your investigation. Select devices per model (CPU, MPS for Apple Silicon, CUDA for NVIDIA), choose model variants, benchmark performance, and manage downloads.

  • BGE-M3 embeddings with MPS/CUDA support
  • BGE-reranker-v2-m3 cross-encoder
  • PaddleOCR for document processing
  • Benchmark tools to compare performance
Model Management
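Device selection with a safe fallback might look like the sketch below. It assumes the models run on PyTorch, which this page does not state, and it falls back to CPU when PyTorch is not installed; the function name and preference order are hypothetical.

```python
def pick_device(preferred: str = "auto") -> str:
    """Return 'cuda', 'mps', or 'cpu', honoring `preferred` when it is available."""
    try:
        import torch  # assumption: the embedding and reranking models use PyTorch
    except ImportError:
        return "cpu"

    available = {"cpu"}
    if torch.cuda.is_available():
        available.add("cuda")
    # Older PyTorch builds have no MPS backend, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        available.add("mps")

    if preferred in available:
        return preferred
    # "auto" or an unavailable choice: take the fastest device that exists.
    for candidate in ("cuda", "mps", "cpu"):
        if candidate in available:
            return candidate
    return "cpu"
```

Calling this once per model allows per-model choices, for example embeddings on the GPU and OCR on the CPU.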

Built for serious document investigation

A production-ready platform for investigating the Jeffrey Epstein released files. Python/FastAPI backend, Flutter desktop UI, and enterprise-grade search technology.

FastAPI Backend

High-performance Python API with async support, automatic OpenAPI docs, and type-safe endpoints.

Flutter Desktop

Native desktop app for macOS, Windows, and Linux. Material 3 design with responsive layouts.

BGE-M3 Embeddings

State-of-the-art multilingual embeddings with dense and learned sparse representations.

SQLite + JSON

Lightweight storage with SQLite for entities and JSON for document cards. No external database required.
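As a rough illustration of this storage split, the sketch below keeps entity mentions in a SQLite table and writes one JSON file per document card. The table schema, column names, card fields, and function names are assumptions made for the example, not the actual Librarius layout.

```python
import json
import sqlite3
from pathlib import Path


def open_entity_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a SQLite database with a hypothetical mentions table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS entity_mentions (
               entity TEXT NOT NULL,
               label  TEXT NOT NULL,
               doc_id TEXT NOT NULL
           )"""
    )
    return conn


def add_mention(conn: sqlite3.Connection, entity: str, label: str, doc_id: str) -> None:
    """Record one mention of an entity in a document."""
    conn.execute("INSERT INTO entity_mentions VALUES (?, ?, ?)", (entity, label, doc_id))


def top_entities(conn: sqlite3.Connection, label: str, limit: int = 10) -> list[tuple[str, int]]:
    """Most-mentioned entities of one type, ties broken alphabetically."""
    rows = conn.execute(
        "SELECT entity, COUNT(*) AS n FROM entity_mentions "
        "WHERE label = ? GROUP BY entity ORDER BY n DESC, entity LIMIT ?",
        (label, limit),
    )
    return rows.fetchall()


def write_card(directory: Path, doc_id: str, card: dict) -> Path:
    """Store a document card as <doc_id>.json in `directory`."""
    target = directory / f"{doc_id}.json"
    target.write_text(json.dumps(card, indent=2), encoding="utf-8")
    return target
```

Both pieces ship with Python, which is what makes the "no external database" setup possible.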

Local & Private

All processing on your machine. No cloud required. Your documents never leave your control.

Multi-Provider LLM

Connect to Ollama for local inference, or use OpenAI and Claude APIs for cloud-powered analysis.

Start investigating the Epstein files

Analyze the publicly released Jeffrey Epstein court documents. Source-available, local-first, privacy-respecting.

macOS · Windows · Linux · Python · Flutter

The codebase is cross-platform, but we currently provide macOS binaries only.

License: Source code is licensed under the Business Source License 1.1 (BSL-1.1), and binary distributions are licensed under the Librarius Binary Distribution License. See LICENSE, BINARY-LICENSE.txt, and the website License page.

DOJ Epstein Dataset Downloads

Download the archive from the U.S. Department of Justice, with DataSets 9-11 available via torrent only. Over 250 GB of court documents, depositions, and evidence files.

Dataset       Size       Source          Download
DataSet 1     2.1 GB     DOJ             Download ZIP
DataSet 2     44.1 GB    DOJ             Download ZIP
DataSet 3     23.3 GB    DOJ             Download ZIP
DataSet 4     14.3 GB    DOJ             Download ZIP
DataSet 5     32.1 GB    DOJ             Download ZIP
DataSet 6     38.2 GB    DOJ             Download ZIP
DataSet 7     45.3 GB    DOJ             Download ZIP
DataSet 8     45.6 GB    DOJ             Download ZIP
DataSet 9     ~10 GB     Torrent Only    See GitHub
DataSet 10    ~10 GB     Torrent Only    See GitHub
DataSet 11    ~10 GB     Torrent Only    See GitHub
DataSet 12    5.3 GB     DOJ             Download ZIP

DataSets 9-11 were removed from DOJ servers and are only available via community torrents. See the GitHub repository for torrent links and verification hashes.
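To plan disk space, the listed sizes can be summed as below. The figures are the ones in the table above; the torrent-only sizes are approximate, so the combined total is an estimate, and the variable names are just for illustration.

```python
# Sizes in GB, keyed by dataset number, as listed in the table.
DOJ_HOSTED_GB = {1: 2.1, 2: 44.1, 3: 23.3, 4: 14.3, 5: 32.1, 6: 38.2, 7: 45.3, 8: 45.6, 12: 5.3}
TORRENT_ONLY_GB = {9: 10.0, 10: 10.0, 11: 10.0}  # approximate ("~10 GB" each)

# Round to one decimal place to hide floating-point noise from the addition.
doj_total = round(sum(DOJ_HOSTED_GB.values()), 1)
torrent_total = round(sum(TORRENT_ONLY_GB.values()), 1)
all_total = round(doj_total + torrent_total, 1)

print(f"DOJ-hosted: {doj_total} GB, torrent-only: ~{torrent_total} GB, all: ~{all_total} GB")
```

The DOJ-hosted sets alone come to about 250.3 GB, and all 12 datasets come to roughly 280 GB before extraction and indexing overhead.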