The complete Common Crawl web graph history, compiled into SQLite databases with instant key-value lookup. Includes Normalized PageRank and Harmonic Centrality scores on a 0–100 scale, raw values, rank positions, and host counts for every Common Crawl snapshot. Search any domain or hostname and get its full ranking history. No API limits, no rate restrictions — run unlimited queries on your own hardware.
Search This Database Using the Free Public API
Data sourced from the Common Crawl Web Graph.
Yearly Access
Updated with every new Common Crawl release — save ~50% vs. monthly
Domains Only
- 438 million+ domains
- Complete history from May 2017
- 32 sharded SQLite files
- Normalized PageRank & HC (0–100)
- Raw values, positions, host counts
- Every update for 12 months
Hostnames Only
- 4.7 billion+ hostnames
- History from February 2020 forward
- 32 sharded SQLite files
- Normalized PageRank & HC (0–100)
- Raw values and rank positions
- Every update for 12 months
Both
- Everything in both packages
- 64 total SQLite files (32 domain + 32 host)
- Full domain + hostname coverage
- All metrics, all snapshots
- Best value for comprehensive analysis
- Every update for 12 months
30-Day Access
One-time download — no recurring charges
What You Get
SQLite Format
Portable, fast, no server required. Open with any SQLite client, any programming language. Works on Linux, macOS, and Windows.
Instant Key-Value Lookup
Each database is indexed for sub-millisecond lookups. Query any hostname or domain and get its full history instantly.
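For example, a minimal Python lookup might look like the sketch below. The shard file, table, and column names here are illustrative assumptions; the actual schema and shard layout are in the database documentation.

```python
import sqlite3

# Hypothetical shard file and schema -- substitute the real names
# from the database documentation.
con = sqlite3.connect("domains_shard_00.db")
row = con.execute(
    "SELECT * FROM domains WHERE domain = ?",  # assumed table/column
    ("example.com",),
).fetchone()
print(row)
con.close()
```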
Normalized Scores (0–100)
Raw PageRank and Harmonic Centrality values vary wildly between datasets; they are wide-ranging floats that are hard to interpret. For each Common Crawl release, the normalization is recalculated from scratch (there is no single static formula), producing clean 0–100 integer scores you can graph, compare across time, and understand at a glance.
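As a rough illustration only (the actual per-release formula is not reproduced here), a rank-based rescaling computed fresh for each release might look like:

```python
import numpy as np

def normalize_to_0_100(raw_values):
    # Illustrative sketch: rescale one release's raw PageRank or
    # Harmonic Centrality values to 0-100 integers by rank position.
    # The product's actual normalization may differ.
    ranks = np.argsort(np.argsort(raw_values))  # 0 .. n-1
    denom = max(len(raw_values) - 1, 1)
    return np.rint(100.0 * ranks / denom).astype(int)
```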
Complete Metric History
Every crawl snapshot includes: PageRank (raw + normalized), Harmonic Centrality (raw + normalized), rank positions, and host counts (domain-level).
No Rate Limits
Run as many queries as you want, as fast as you want. The database runs on your hardware — no API calls, no daily caps, no throttling.
Monthly Updates
Common Crawl now publishes new web graph data approximately once per month. With a yearly plan, updated databases are available for download within 48–72 hours of each new release — every update included for 12 months.
Who Uses This Data
Built for teams and individuals who need domain authority data at scale — without per-query costs.
SEO & Marketing Teams
Build marketing databases that rank and prioritize domains and opportunities. At the scale most SEO projects operate, a paid API gets astronomically expensive; this database makes it a flat cost.
Data Scientists & Analysts
Run bulk analysis across hundreds of millions of domains. Build internal ranking tools, track authority trends over time, and feed normalized scores directly into your pipelines.
ML & LLM Training Pipelines
When building training datasets, use authority scores to find the most authoritative version of content across multiple sources, or to weight content from established sites more heavily. Use the scores to influence how an LLM, neural network, or classifier values source material.
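One hedged sketch of how that weighting might look in practice (the linear mapping and the floor value are modeling choices, not part of the dataset):

```python
def sample_weight(norm_score: int, floor: float = 0.1) -> float:
    # Map a 0-100 normalized authority score to a training-sample
    # weight in [floor, 1.0]. Linear mapping chosen for illustration.
    return floor + (1.0 - floor) * (norm_score / 100.0)

# Example: weight documents when assembling a training corpus.
docs = [("example.com", 87), ("obscure.example", 12)]
weights = [sample_weight(score) for _, score in docs]
```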
Competitive Intelligence
Track how domains and hostnames rise and fall in authority over time. Identify emerging competitors, monitor acquisitions, or evaluate link-building impact — all with full historical data.
Replace Your Existing Authority Data
If you currently pay for domain authority metrics from Ahrefs, Majestic, or Moz, you can replace that data with this dataset at a significantly lower cost. The vast majority of use cases are fully covered; the only exception is if you rely on something unique to a specific provider’s proprietary methodology.
Delivery & Technical Details
Direct Download Links
After purchase, you’ll enter the IP address you want to download from. Each link is validated against your IP. You’ll receive 32 download links per dataset (32 for domains, 32 for hosts). If you need to change your IP later, email support@customdatasets.com.
Zstandard Compression
Files are compressed with zstd at level 19 (highest before ultra). Significantly smaller than gzip, and decompresses very quickly. You’ll need the zstd tool to decompress. Compressed total for both datasets: ~400 GB.
Disk Space Requirements
Decompressed: domains ~272.5 GB, hosts ~847 GB, both ~1.12 TB. A 2 TB drive is comfortable for the full dataset. If space is tight, download and decompress one file at a time, deleting the compressed version before the next.
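If you script that workflow, here is a minimal sketch using the `zstandard` Python package (file names are illustrative; the zstd CLI's `zstd -d --rm` achieves the same result per file):

```python
import os
import zstandard

# Decompress each shard, then delete the compressed copy before
# moving to the next one, so peak disk usage stays low.
for name in sorted(os.listdir(".")):
    if not name.endswith(".zst"):
        continue
    with open(name, "rb") as src, open(name[:-4], "wb") as dst:
        zstandard.ZstdDecompressor().copy_stream(src, dst)
    os.remove(name)  # free the compressed file's space
```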
Compatibility
SQLite3 format — works on any OS (Linux, macOS, Windows) with any language that has SQLite bindings. Data is sharded into 32 files per dataset; see the database documentation for the DJB2 hash lookup reference.
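As a sketch of how shard selection typically works with DJB2 (the exact hash variant, key encoding, and modulus are defined in the database documentation, so verify against it rather than relying on this form):

```python
def djb2(key: str) -> int:
    # Classic DJB2: h = h * 33 + byte, kept in 32 bits.
    h = 5381
    for b in key.encode("utf-8"):
        h = (h * 33 + b) & 0xFFFFFFFF
    return h

def shard_index(key: str, shards: int = 32) -> int:
    return djb2(key) % shards

print(shard_index("example.com"))  # which of the 32 files to query
```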
Need help working with the database? Read the database documentation for schema details, query examples, and hash sharding reference.
Want to try before you buy? Use the free search tool or the free API to explore the data first.