Overview
The CC WebGraph database is a set of SQLite files containing the complete historical ranking data from the Common Crawl Web Graph. There are two sets of databases: one for domain-level data and one for host-level data. Each set is sharded into 32 SQLite files.
The database is a key-value store. You look up a hostname (or domain) and get back a blob of tab-delimited, newline-separated text containing that entity’s full ranking history across all available Common Crawl snapshots.
Every metric that the free API and search tool expose comes directly from this database. Purchasing it gives you the same data with no rate limits, no API dependency, and the ability to run queries on your own hardware.
Data Coverage
Domain Rankings
Domain-level data is complete. It covers every Common Crawl WebGraph release from the May–June–July 2017 crawl all the way through the most recent snapshot.
Host Rankings
Host-level data begins with the February–March–May 2020 crawl and includes every release from that point forward. The earlier host-level runs (2017–2019) are intentionally excluded.
The reason: the early host-level web graph releases contained a very large number of dangling nodes — hostnames that appeared in the graph but had no meaningful inbound or outbound links. Including those first ~12 runs would roughly double the size of an already massive database (the complete dataset is over a terabyte uncompressed). The added storage cost and complexity isn’t worth it given the limited usefulness of the early host-level data.
From 2020 onward, Common Crawl significantly tightened the requirements for including URLs in the web graph, resulting in much cleaner, more relevant data.
Database Structure
The data is split into two independent sets of databases:
- Domain databases: 32 files named `domain.0.db` through `domain.31.db`
- Host databases: 32 files named `host.0.db` through `host.31.db`
Each hostname or domain is assigned to one of the 32 shards using a DJB2 hash of the reversed hostname (see Hash Sharding below). This means you need to know which shard a hostname belongs to before querying it.
If you purchased the Domains Only package, you have 32 files. If you purchased Hostnames Only, you have 32 files. If you purchased Both, you have 64 files total (32 domain + 32 host).
| Package | Files | Naming Pattern | Approx. Size |
|---|---|---|---|
| Domains | 32 | `domain.{0-31}.db` | ~272.5 GB total |
| Hosts | 32 | `host.{0-31}.db` | ~847 GB total |
| Both | 64 | `domain.{0-31}.db` + `host.{0-31}.db` | ~1.12 TB total |
Download Format & Disk Space
Compression
All database files are delivered as Zstandard (zstd) compressed archives, compressed at level 19 (the highest level before “ultra”). Zstandard produces significantly smaller files than gzip and decompresses very quickly even at high compression levels.
You will need the zstd command-line tool to decompress. It is available on all major platforms:
```shell
# macOS
brew install zstd

# Ubuntu / Debian
apt install zstd

# Decompress a single file
zstd -d domain.0.db.zst
```
File Sizes
| Package | Compressed (zstd) | Decompressed |
|---|---|---|
| Domains | ~100 GB | ~272.5 GB |
| Hosts | ~300 GB | ~847 GB |
| Both | ~400 GB | ~1.12 TB |
Recommended Disk Space
A 2 TB drive is comfortable for the full dataset (both domains and hosts). You need enough space for both the compressed downloads and the decompressed files simultaneously. If disk space is tight, you can download and decompress one file at a time, deleting the compressed version before moving on to the next — start with the larger files first.
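The one-file-at-a-time approach can be sketched as a small shell loop (illustrative; it assumes the 32 domain shard archives sit in the current directory). The `--rm` flag tells `zstd` to delete each archive only after its decompression succeeds, which caps peak disk usage at one compressed shard plus the decompressed files:

```shell
# Decompress the 32 domain shards one at a time; --rm removes each
# .zst archive after a successful decompression.
for i in $(seq 0 31); do
    zstd -d --rm "domain.${i}.db.zst"
done
```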
Download Process
After purchase, you’ll access a page where you enter the IP address you want to download from. Each download link is validated against your IP. You’ll receive 32 download links per dataset (32 for domains, 32 for hosts, depending on your purchase).
Table Schema
Every database file (both domain and host) contains a single table called host_data with two columns:
| Column | Type | Description |
|---|---|---|
| `host` | TEXT | The lookup key: the hostname in reversed dot notation. For example, `google.com` is stored as `com.google`, and `blog.example.com` is stored as `com.example.blog`. |
| `data` | TEXT | The ranking history as a blob of tab-delimited, newline-separated text. Each line is one crawl snapshot. See TSV Data Format below. |
The host column is indexed for fast lookups. Queries are sub-millisecond on warm cache.
The table is named `host_data` in both domain and host databases. The table name does not change; only the data inside differs.
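Based on the description above, the schema is approximately the following. This is a sketch for orientation only; the shipped files may differ in details such as index naming, or may use a primary key instead of a separate index:

```sql
-- Assumed DDL (sketch), inferred from the documented two-column layout.
CREATE TABLE host_data (
    host TEXT,   -- reversed hostname, e.g. 'com.google'
    data TEXT    -- tab/newline-delimited ranking history
);
CREATE INDEX idx_host ON host_data (host);
```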
TSV Data Format
The data column contains a text blob with one row per crawl snapshot. Rows are separated by newlines (\n). Within each row, fields are separated by tabs (\t).
Host-Level Entries (7 fields)
| Position | Field | Type | Description |
|---|---|---|---|
| 0 | `year_month` | string | Crawl date in YY-MM format (see Date Format) |
| 1 | `hc_pos` | integer | Rank position by Harmonic Centrality (1 = best) |
| 2 | `hc_val` | integer | Raw Harmonic Centrality value |
| 3 | `pr_pos` | integer | Rank position by PageRank (1 = best) |
| 4 | `pr_val` | integer | Raw PageRank value |
| 5 | `hc_val_norm` | integer | Harmonic Centrality normalized to 0–100 |
| 6 | `pr_val_norm` | integer | PageRank normalized to 0–100 |
Domain-Level Entries (8 fields)
Domain-level entries have the same 7 fields, plus one additional field:
| Position | Field | Type | Description |
|---|---|---|---|
| 0–6 | Same as host-level entries above | | |
| 7 | `n_hosts` | integer | Count of distinct hostnames under this domain |
Example Raw Data
Here is what the `data` column might look like for a domain entry (tabs shown as aligned whitespace; each row is a single line):

```
25-01   2   987654321   1   123456789   99   100   4521
24-10   2   976543210   1   119876543   99   100   4480
24-07   3   965432109   1   115678901   98   100   4350
```
Each line is one crawl snapshot. The fields are tab-separated. In this example the domain ranked #1 by PageRank and #2–3 by Harmonic Centrality across three snapshots, with 4,350–4,521 hostnames under the domain.
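Parsing a blob reduces to splitting on newlines and tabs. A minimal Python sketch (the field names follow the tables above; `parse_history` is an illustrative helper, not part of the product):

```python
def parse_history(blob: str) -> list[dict]:
    """Split a data blob into one dict per crawl snapshot."""
    names = ["year_month", "hc_pos", "hc_val", "pr_pos",
             "pr_val", "hc_val_norm", "pr_val_norm", "n_hosts"]
    entries = []
    for line in blob.split("\n"):
        fields = line.strip().split("\t")
        if len(fields) < 7:
            continue  # skip blank or malformed lines
        entry = {"year_month": fields[0]}
        # Every field after year_month is an integer; domain entries
        # carry the extra n_hosts field, host entries do not.
        for name, value in zip(names[1:], fields[1:]):
            entry[name] = int(value)
        entries.append(entry)
    return entries

blob = "25-01\t2\t987654321\t1\t123456789\t99\t100\t4521"
print(parse_history(blob)[0]["n_hosts"])  # → 4521
```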
Date Format
The year_month field uses the format YY-MM where:
- `YY` is a two-digit year relative to 2000. For example, `21` means 2021, `25` means 2025.
- `MM` is a two-digit month (01–12).
Each Common Crawl web graph release spans roughly three months of crawl data. The YY-MM value represents the latest month of that three-month window — i.e., the most recent month included in that particular web graph release.
For example, 21-05 means the web graph release whose data window ended in May 2021. The actual crawl data in that release covers approximately March–April–May 2021.
17-07 = July 2017. 25-01 = January 2025.
| Example | Full Date | Meaning |
|---|---|---|
| `17-07` | July 2017 | First domain release (May–Jun–Jul 2017 crawl) |
| `20-05` | May 2020 | First host release (Feb–Mar–May 2020 crawl) |
| `25-01` | January 2025 | A recent release (Nov–Dec–Jan crawl) |
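Converting a `YY-MM` value to a calendar year and month is a two-line operation. A small sketch (`parse_year_month` is an illustrative helper):

```python
def parse_year_month(ym: str) -> tuple[int, int]:
    """Convert a 'YY-MM' crawl date to (year, month), e.g. '21-05' -> (2021, 5)."""
    yy, mm = ym.split("-")
    return 2000 + int(yy), int(mm)

print(parse_year_month("17-07"))  # → (2017, 7)
```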
Hash Sharding
To determine which of the 32 database files contains a given hostname or domain, you need to:
- Normalize the hostname: lowercase it, strip trailing dots, and reverse the dot notation. For example, `www.Google.COM` becomes `com.google.www`.
- Hash the reversed hostname using the DJB2 hash function.
- Mask the hash to get a shard number: `shard = hash & 0x1F` (bitwise AND with 31), giving a value from 0 to 31.
The shard number tells you which database file to open. For a domain lookup, open domain.{shard}.db. For a host lookup, open host.{shard}.db.
The hash is computed on the reversed hostname (e.g., `com.google`), not on the original hostname. The seed value is 5381 and all arithmetic is 32-bit unsigned.
DJB2 Hash Implementations
Below are reference implementations in C, PHP, and Python. All three produce identical hash values for identical inputs.
```c
#include <stdint.h>

uint32_t djb2_hash(const char* str)
{
    uint32_t hash = 5381;
    unsigned char c;
    while ((c = *str++)) {
        hash = ((hash << 5) + hash) + c;
    }
    return hash;
}

/* Shard selection: */
/* uint32_t shard = djb2_hash(reversed_host) & 0x1F; */
```
```php
function djb2_hash(string $str): int
{
    $hash = 5381;
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        $c = ord($str[$i]);
        $hash = (($hash << 5) + $hash) + $c;
        $hash = $hash & 0xFFFFFFFF; // Keep as 32-bit unsigned
    }
    return $hash;
}

// Shard selection:
// $shard = djb2_hash($reversed_host) & 0x1F;
```
```python
def djb2_hash(s: str) -> int:
    hash_val = 5381
    for c in s:
        hash_val = ((hash_val << 5) + hash_val) + ord(c)
        hash_val = hash_val & 0xFFFFFFFF  # Keep as 32-bit unsigned
    return hash_val

# Shard selection:
# shard = djb2_hash(reversed_host) & 0x1F
```
Test Vectors
Use these to verify your implementation produces the correct hashes:
| Reversed Hostname | DJB2 Hash | Shard (hash & 0x1F) |
|---|---|---|
| `com.google` | 2903648303 | 15 |
| `com.facebook` | 4230245740 | 12 |
| `org.wikipedia` | 1011139282 | 18 |
| `com.amazon` | 2665981784 | 24 |
| `com.github` | 2896713845 | 21 |
| `edu.mit` | 1411235451 | 27 |
| `gov.nasa` | 24078114 | 2 |
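A quick sanity check is to loop over the vectors and compare the computed hashes. This sketch reuses the Python `djb2_hash` shown earlier (repeated here so the snippet runs on its own) and prints each derived shard:

```python
def djb2_hash(s: str) -> int:
    h = 5381
    for c in s:
        h = ((h << 5) + h + ord(c)) & 0xFFFFFFFF
    return h

# (reversed hostname, expected 32-bit DJB2 hash)
vectors = [
    ("com.google", 2903648303),
    ("com.facebook", 4230245740),
    ("org.wikipedia", 1011139282),
    ("com.amazon", 2665981784),
    ("com.github", 2896713845),
    ("edu.mit", 1411235451),
    ("gov.nasa", 24078114),
]
for host, expected in vectors:
    h = djb2_hash(host)
    assert h == expected, f"{host}: got {h}, expected {expected}"
    print(f"{host}: hash={h} shard={h & 0x1F}")
```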
Querying the Database
Once you know the shard number, open the corresponding SQLite file and run a simple SELECT:
```sql
SELECT data FROM host_data WHERE host = 'com.google' LIMIT 1;
```
The host value you query must be the reversed, lowercased hostname. The query returns either one row (if the hostname exists) or no rows.
Recommended SQLite PRAGMAs
For best read performance, set these PRAGMAs when opening the database:
```sql
PRAGMA cache_size = -65536;   -- 64 MB cache per connection
PRAGMA mmap_size = 268435456; -- 256 MB memory-mapped I/O
```
Open the database in read-only mode since you only need to query it. This is both safer and faster.
Hostname Normalization
Before looking up a hostname in the database, normalize it to match the stored key format:
- Lowercase the entire string.
- Strip trailing dots (e.g., `example.com.` becomes `example.com`).
- Reverse the dot notation: split on `.`, reverse the parts, and rejoin with `.`.
| Input | After Normalization | Stored Key |
|---|---|---|
| `google.com` | `com.google` | `com.google` |
| `www.Google.COM` | `com.google.www` | `com.google.www` |
| `blog.example.com.` | `com.example.blog` | `com.example.blog` |
| `MIT.EDU` | `edu.mit` | `edu.mit` |
Note that the `www.` prefix is not stripped: `www.google.com` and `google.com` are separate entries in the host-level database. In the domain-level database, both resolve to the same domain (`com.google`).
Full Examples
Here is a complete workflow for looking up google.com in the domain database:
```python
import sqlite3

def djb2_hash(s: str) -> int:
    h = 5381
    for c in s:
        h = ((h << 5) + h) + ord(c)
        h = h & 0xFFFFFFFF
    return h

def normalize(hostname: str) -> str:
    hostname = hostname.lower().strip().rstrip('.')
    parts = hostname.split('.')
    parts.reverse()
    return '.'.join(parts)

def lookup(hostname: str, db_dir: str, db_type: str = 'domain'):
    reversed_host = normalize(hostname)
    shard = djb2_hash(reversed_host) & 0x1F
    db_path = f"{db_dir}/{db_type}.{shard}.db"
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    conn.execute("PRAGMA cache_size = -65536")
    conn.execute("PRAGMA mmap_size = 268435456")
    cursor = conn.execute(
        "SELECT data FROM host_data WHERE host = ? LIMIT 1",
        (reversed_host,)
    )
    row = cursor.fetchone()
    conn.close()
    if row is None:
        return None
    entries = []
    for line in row[0].split('\n'):
        line = line.strip()
        if not line:
            continue
        fields = line.split('\t')
        if len(fields) < 7:
            continue
        entry = {
            'year_month': fields[0],
            'hc_pos': int(fields[1]),
            'hc_val': int(fields[2]),
            'pr_pos': int(fields[3]),
            'pr_val': int(fields[4]),
            'hc_val_norm': int(fields[5]),
            'pr_val_norm': int(fields[6]),
        }
        if len(fields) >= 8:
            entry['n_hosts'] = int(fields[7])
        entries.append(entry)
    return entries

results = lookup('google.com', '/path/to/domain-dbs', 'domain')
for entry in results:
    print(entry)
```
The same workflow in PHP:

```php
function djb2_hash(string $str): int {
    $hash = 5381;
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        $c = ord($str[$i]);
        $hash = (($hash << 5) + $hash) + $c;
        $hash = $hash & 0xFFFFFFFF;
    }
    return $hash;
}

function normalize(string $hostname): string {
    $hostname = strtolower(trim($hostname));
    $hostname = rtrim($hostname, '.');
    $parts = explode('.', $hostname);
    $parts = array_reverse($parts);
    return implode('.', $parts);
}

function lookup(string $hostname, string $db_dir, string $db_type = 'domain'): array {
    $reversed = normalize($hostname);
    $shard = djb2_hash($reversed) & 0x1F;
    $db_path = sprintf('%s/%s.%d.db', $db_dir, $db_type, $shard);
    $db = new SQLite3($db_path, SQLITE3_OPEN_READONLY);
    @$db->exec('PRAGMA cache_size = -65536;');
    @$db->exec('PRAGMA mmap_size = 268435456;');
    $stmt = $db->prepare('SELECT data FROM host_data WHERE host = :host LIMIT 1');
    $stmt->bindValue(':host', $reversed, SQLITE3_TEXT);
    $result = $stmt->execute();
    $row = $result->fetchArray(SQLITE3_ASSOC);
    $stmt->close();
    $db->close();
    if ($row === false) return [];
    $entries = [];
    foreach (explode("\n", $row['data']) as $line) {
        $line = trim($line);
        if ($line === '') continue;
        $fields = explode("\t", $line);
        if (count($fields) < 7) continue;
        $entry = [
            'year_month' => $fields[0],
            'hc_pos' => (int)$fields[1],
            'hc_val' => (int)$fields[2],
            'pr_pos' => (int)$fields[3],
            'pr_val' => (int)$fields[4],
            'hc_val_norm' => (int)$fields[5],
            'pr_val_norm' => (int)$fields[6],
        ];
        if (count($fields) >= 8) {
            $entry['n_hosts'] = (int)$fields[7];
        }
        $entries[] = $entry;
    }
    return $entries;
}

$results = lookup('google.com', '/path/to/domain-dbs', 'domain');
print_r($results);
```