WebGraph Database Docs

Overview

The CC WebGraph database is a set of SQLite files containing the complete historical ranking data from the Common Crawl Web Graph. There are two sets of databases: one for domain-level data and one for host-level data. Each set is sharded into 32 SQLite files.

The database is a key-value store. You look up a hostname (or domain) and get back a blob of tab-delimited, newline-separated text containing that entity’s full ranking history across all available Common Crawl snapshots.

Every metric that the free API and search tool expose comes directly from this database. Purchasing it gives you the same data with no rate limits, no API dependency, and the ability to run queries on your own hardware.

Data Coverage

Domain Rankings

Domain-level data is complete. It covers every Common Crawl WebGraph release from the May–June–July 2017 crawl all the way through the most recent snapshot.

Host Rankings

Host-level data begins with the February–March–May 2020 crawl and includes every release from that point forward. The earlier host-level runs (2017–2019) are intentionally excluded.

The reason: the early host-level web graph releases contained a very large number of dangling nodes — hostnames that appeared in the graph but had no meaningful inbound or outbound links. Including those first ~12 runs would roughly double the size of an already massive database (the complete dataset is over a terabyte uncompressed). The added storage cost and complexity isn’t worth it given the limited usefulness of the early host-level data.

From 2020 onward, Common Crawl significantly tightened the requirements for including URLs in the web graph, resulting in much cleaner, more relevant data.

Summary: Domain history is complete from May 2017 to present. Host history starts from February 2020 forward. Everything from both starting points onward is included. This same coverage applies to the free API and search tool.

Database Structure

The data is split into two independent sets of databases:

  • Domain databases: 32 files named domain.0.db through domain.31.db
  • Host databases: 32 files named host.0.db through host.31.db

Each hostname or domain is assigned to one of the 32 shards using a DJB2 hash of the reversed hostname (see Hash Sharding below). This means you need to know which shard a hostname belongs to before querying it.

If you purchased the Domains Only package, you have 32 files. If you purchased Hostnames Only, you have 32 files. If you purchased Both, you have 64 files total (32 domain + 32 host).

Package   Files   Naming Pattern                        Approx. Size
Domains   32      domain.{0-31}.db                      ~272.5 GB total
Hosts     32      host.{0-31}.db                        ~847 GB total
Both      64      domain.{0-31}.db + host.{0-31}.db     ~1.12 TB total

Download Format & Disk Space

Compression

All database files are delivered as Zstandard (zstd) compressed archives, compressed at level 19 (the highest level before “ultra”). Zstandard produces significantly smaller files than gzip and decompresses very quickly even at high compression levels.

You will need the zstd command-line tool to decompress. It is available on all major platforms:

# macOS
brew install zstd

# Ubuntu / Debian
apt install zstd

# Decompress a single file
zstd -d domain.0.db.zst

File Sizes

Package   Compressed (zstd)   Decompressed
Domains   ~100 GB             ~272.5 GB
Hosts     ~300 GB             ~847 GB
Both      ~400 GB             ~1.12 TB

Recommended Disk Space

A 2 TB drive is comfortable for the full dataset (both domains and hosts). You need enough space for both the compressed downloads and the decompressed files simultaneously. If disk space is tight, you can download and decompress one file at a time, deleting the compressed version before moving on to the next — start with the larger files first.

Download Process

After purchase, you’ll be taken to a page where you enter the IP address you plan to download from. Each download link is validated against that IP. You’ll receive 32 download links per dataset (32 for domains, 32 for hosts, depending on your purchase).

Important: Downloads are locked to a single IP address. Make sure you enter the correct IP during setup. If you need to change it later, email support@customdatasets.com.

Table Schema

Every database file (both domain and host) contains a single table called host_data with two columns:

  • host (TEXT) — The lookup key: the hostname in reversed dot notation. For example, google.com is stored as com.google, and blog.example.com is stored as com.example.blog.
  • data (TEXT) — The ranking history as a blob of tab-delimited, newline-separated text. Each line is one crawl snapshot. See TSV Data Format below.

The host column is indexed for fast lookups. Queries are sub-millisecond on warm cache.

Note: The table is named host_data in both domain and host databases. The table name does not change — only the data inside differs.

TSV Data Format

The data column contains a text blob with one row per crawl snapshot. Rows are separated by newlines (\n). Within each row, fields are separated by tabs (\t).

Host-Level Entries (7 fields)

Position   Field         Type      Description
0          year_month    string    Crawl date in YY-MM format (see Date Format)
1          hc_pos        integer   Rank position by Harmonic Centrality (1 = best)
2          hc_val        integer   Raw Harmonic Centrality value
3          pr_pos        integer   Rank position by PageRank (1 = best)
4          pr_val        integer   Raw PageRank value
5          hc_val_norm   integer   Harmonic Centrality normalized to 0–100
6          pr_val_norm   integer   PageRank normalized to 0–100

Domain-Level Entries (8 fields)

Domain-level entries have the same 7 fields, plus one additional field:

Position   Field     Type      Description
0–6        (same as host-level entries above)
7          n_hosts   integer   Count of distinct hostnames under this domain

Example Raw Data

Here is what the data column might look like for a domain entry (three snapshots, one per line):

25-01	2	987654321	1	123456789	99	100	4521
24-10	2	976543210	1	119876543	99	100	4480
24-07	3	965432109	1	115678901	98	100	4350

Each line is one crawl snapshot. The fields are tab-separated. In this example the domain ranked #1 by PageRank and #2–3 by Harmonic Centrality across three snapshots, with 4,350–4,521 hostnames under the domain.
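
As a sketch, the first example row above can be split into named fields like this (field names follow the position tables earlier in this section):

```python
# Parse one domain-level snapshot row (the first example line above).
line = "25-01\t2\t987654321\t1\t123456789\t99\t100\t4521"

fields = line.split("\t")
snapshot = {
    "year_month":  fields[0],        # "25-01" -> January 2025
    "hc_pos":      int(fields[1]),   # rank by Harmonic Centrality
    "hc_val":      int(fields[2]),
    "pr_pos":      int(fields[3]),   # rank by PageRank
    "pr_val":      int(fields[4]),
    "hc_val_norm": int(fields[5]),
    "pr_val_norm": int(fields[6]),
    "n_hosts":     int(fields[7]),   # present in domain entries only
}
print(snapshot["pr_pos"], snapshot["n_hosts"])  # 1 4521
```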

Date Format

The year_month field uses the format YY-MM where:

  • YY is a two-digit year relative to 2000. For example, 21 means 2021, 25 means 2025.
  • MM is a two-digit month (01–12).

Each Common Crawl web graph release spans roughly three months of crawl data. The YY-MM value represents the latest month of that three-month window — i.e., the most recent month included in that particular web graph release.

For example, 21-05 means the web graph release whose data window ended in May 2021. The actual crawl data in that release covers approximately March–April–May 2021.

To convert to a full year: add 2000 to the two-digit year. 17-07 = July 2017. 25-01 = January 2025.

Example   Full Date      Meaning
17-07     July 2017      First domain release (May–Jun–Jul 2017 crawl)
20-05     May 2020       First host release (Feb–Mar–May 2020 crawl)
25-01     January 2025   A recent release (Nov–Dec–Jan crawl)
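
The conversion is simple to code. A small helper (hypothetical name, not part of the dataset) might look like:

```python
def parse_year_month(ym: str) -> tuple:
    """Convert a YY-MM snapshot label to a (year, month) pair."""
    yy, mm = ym.split("-")
    return (2000 + int(yy), int(mm))

print(parse_year_month("17-07"))  # (2017, 7)
print(parse_year_month("25-01"))  # (2025, 1)
```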

Hash Sharding

To determine which of the 32 database files contains a given hostname or domain, you need to:

  1. Normalize the hostname: lowercase it, strip trailing dots, and reverse the dot notation. For example, www.Google.COM becomes com.google.www.
  2. Hash the reversed hostname using the DJB2 hash function.
  3. Mask the hash to get a shard number: shard = hash & 0x1F (bitwise AND with 31), giving a value from 0 to 31.

The shard number tells you which database file to open. For a domain lookup, open domain.{shard}.db. For a host lookup, open host.{shard}.db.

Important: The DJB2 hash implementation must match exactly across all languages. The hash is computed on the reversed hostname string (e.g., com.google), not the original hostname. The seed value is 5381 and all arithmetic is 32-bit unsigned.

DJB2 Hash Implementations

Below are reference implementations in C, PHP, and Python. All three produce identical hash values for identical inputs.

#include <stdint.h>

uint32_t djb2_hash(const char* str)
{
	uint32_t hash = 5381;
	unsigned char c;

	while ((c = *str++)) {
		hash = ((hash << 5) + hash) + c;
	}

	return hash;
}

/* Shard selection: */
/* uint32_t shard = djb2_hash(reversed_host) & 0x1F; */
function djb2_hash(string $str): int
{
	$hash = 5381;
	$len  = strlen($str);

	for ($i = 0; $i < $len; $i++) {
		$c    = ord($str[$i]);
		$hash = (($hash << 5) + $hash) + $c;
		$hash = $hash & 0xFFFFFFFF; // Keep as 32-bit unsigned
	}

	return $hash;
}

/* Shard selection: */
/* $shard = djb2_hash($reversed_host) & 0x1F; */
def djb2_hash(s: str) -> int:
    hash_val = 5381

    for c in s:
        hash_val = ((hash_val << 5) + hash_val) + ord(c)
        hash_val = hash_val & 0xFFFFFFFF  # Keep as 32-bit unsigned

    return hash_val

# Shard selection:
# shard = djb2_hash(reversed_host) & 0x1F

Test Vectors

Use these to verify your implementation produces the correct hashes:

Reversed Hostname   DJB2 Hash    Shard (hash & 0x1F)
com.google          2903648303   15
com.facebook        4230245740   12
org.wikipedia       1011139282   18
com.amazon          2665981784   24
com.github          2896713845   21
edu.mit             1411235451   27
gov.nasa            24078114     2
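
Using the Python implementation from the previous section, a few of these vectors can be checked directly:

```python
def djb2_hash(s: str) -> int:
    # Same 32-bit unsigned DJB2 as the reference implementations above.
    h = 5381
    for c in s:
        h = ((h << 5) + h) + ord(c)
        h &= 0xFFFFFFFF
    return h

# (reversed hostname, expected hash) pairs from the table above
vectors = [
    ("com.google",    2903648303),
    ("org.wikipedia", 1011139282),
    ("edu.mit",       1411235451),
    ("gov.nasa",      24078114),
]

for host, expected in vectors:
    h = djb2_hash(host)
    assert h == expected, f"{host}: got {h}, expected {expected}"
    print(host, h, h & 0x1F)  # e.g. com.google 2903648303 15
```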

Querying the Database

Once you know the shard number, open the corresponding SQLite file and run a simple SELECT:

SELECT data FROM host_data WHERE host = 'com.google' LIMIT 1;

The host value you query must be the reversed, lowercased hostname. The query returns either one row (if the hostname exists) or no rows.

Recommended SQLite PRAGMAs

For best read performance, set these PRAGMAs when opening the database:

PRAGMA cache_size = -65536;    -- 64 MB cache per connection
PRAGMA mmap_size = 268435456;  -- 256 MB memory-mapped I/O

Open the database in read-only mode since you only need to query it. This is both safer and faster.

Hostname Normalization

Before looking up a hostname in the database, normalize it to match the stored key format:

  1. Lowercase the entire string.
  2. Strip trailing dots (e.g., example.com. becomes example.com).
  3. Reverse the dot notation — split on ., reverse the parts, and rejoin with ..

Input                 After Normalization   Stored Key
google.com            com.google            com.google
www.Google.COM        com.google.www        com.google.www
blog.example.com.     com.example.blog      com.example.blog
MIT.EDU               edu.mit               edu.mit

Note: The www. prefix is not stripped. www.google.com and google.com are separate entries in the host-level database. In the domain-level database, both will resolve to the same domain (com.google).
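
A minimal Python sketch of the three normalization steps:

```python
def normalize(hostname: str) -> str:
    """Lowercase, strip trailing dots, and reverse the dot notation."""
    hostname = hostname.strip().lower().rstrip(".")
    return ".".join(reversed(hostname.split(".")))

print(normalize("www.Google.COM"))     # com.google.www
print(normalize("blog.example.com."))  # com.example.blog
```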

Full Examples

Here is a complete workflow for looking up google.com in the domain database:

import sqlite3

def djb2_hash(s: str) -> int:
    h = 5381
    for c in s:
        h = ((h << 5) + h) + ord(c)
        h = h & 0xFFFFFFFF
    return h

def normalize(hostname: str) -> str:
    hostname = hostname.lower().strip().rstrip('.')
    parts = hostname.split('.')
    parts.reverse()
    return '.'.join(parts)

def lookup(hostname: str, db_dir: str, db_type: str = 'domain'):
    reversed_host = normalize(hostname)
    shard = djb2_hash(reversed_host) & 0x1F
    db_path = f"{db_dir}/{db_type}.{shard}.db"

    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    conn.execute("PRAGMA cache_size = -65536")
    conn.execute("PRAGMA mmap_size = 268435456")

    cursor = conn.execute(
        "SELECT data FROM host_data WHERE host = ? LIMIT 1",
        (reversed_host,)
    )
    row = cursor.fetchone()
    conn.close()

    if row is None:
        return None

    entries = []
    for line in row[0].split('\n'):
        line = line.strip()
        if not line:
            continue
        fields = line.split('\t')
        if len(fields) < 7:
            continue
        entry = {
            'year_month':  fields[0],
            'hc_pos':      int(fields[1]),
            'hc_val':      int(fields[2]),
            'pr_pos':      int(fields[3]),
            'pr_val':      int(fields[4]),
            'hc_val_norm': int(fields[5]),
            'pr_val_norm': int(fields[6]),
        }
        if len(fields) >= 8:
            entry['n_hosts'] = int(fields[7])
        entries.append(entry)
    return entries

results = lookup('google.com', '/path/to/domain-dbs', 'domain')
if results:
    for entry in results:
        print(entry)

The equivalent lookup in PHP:

function djb2_hash(string $str): int {
    $hash = 5381;
    $len  = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        $c    = ord($str[$i]);
        $hash = (($hash << 5) + $hash) + $c;
        $hash = $hash & 0xFFFFFFFF;
    }
    return $hash;
}

function normalize(string $hostname): string {
    $hostname = strtolower(trim($hostname));
    $hostname = rtrim($hostname, '.');
    $parts = explode('.', $hostname);
    $parts = array_reverse($parts);
    return implode('.', $parts);
}

function lookup(string $hostname, string $db_dir, string $db_type = 'domain'): array {
    $reversed = normalize($hostname);
    $shard    = djb2_hash($reversed) & 0x1F;
    $db_path  = sprintf('%s/%s.%d.db', $db_dir, $db_type, $shard);

    $db = new SQLite3($db_path, SQLITE3_OPEN_READONLY);
    @$db->exec('PRAGMA cache_size = -65536;');
    @$db->exec('PRAGMA mmap_size = 268435456;');

    $stmt = $db->prepare('SELECT data FROM host_data WHERE host = :host LIMIT 1');
    $stmt->bindValue(':host', $reversed, SQLITE3_TEXT);
    $result = $stmt->execute();
    $row    = $result->fetchArray(SQLITE3_ASSOC);
    $stmt->close();
    $db->close();

    if ($row === false) return [];

    $entries = [];
    foreach (explode("\n", $row['data']) as $line) {
        $line = trim($line);
        if ($line === '') continue;
        $fields = explode("\t", $line);
        if (count($fields) < 7) continue;
        $entry = [
            'year_month'  => $fields[0],
            'hc_pos'      => (int)$fields[1],
            'hc_val'      => (int)$fields[2],
            'pr_pos'      => (int)$fields[3],
            'pr_val'      => (int)$fields[4],
            'hc_val_norm' => (int)$fields[5],
            'pr_val_norm' => (int)$fields[6],
        ];
        if (count($fields) >= 8) {
            $entry['n_hosts'] = (int)$fields[7];
        }
        $entries[] = $entry;
    }
    return $entries;
}

$results = lookup('google.com', '/path/to/domain-dbs', 'domain');
print_r($results);