WebGraph API

Need the Full Database?

Skip the API limits. Download the complete CC WebGraph database with every host and domain ranking across all Common Crawl snapshots. Run unlimited queries locally on your own hardware.

Buy the WebGraph Database

Overview

The CC WebGraph API provides free access to historical rankings computed from the Common Crawl WebGraph dataset. For each hostname or domain you submit, the API returns time-series data spanning multiple Common Crawl snapshots, including PageRank, Harmonic Centrality, ranking positions, and host counts.

The API is a simple HTTP POST endpoint. You send plain-text hostnames (one per line) and get JSON back. No API key is required. No authentication. No SDK. Just curl or any HTTP client.

Be Nice

This API is public and open. No API key, no registration, no hoops. I really want to keep it that way.

I understand that a free, unauthenticated endpoint is an open invitation to hammer it. I get it. But the moment abuse becomes a problem, I have two options: require registration or shut it down entirely. I don’t want to deal with either of those, so I’m asking you to help me avoid it.

If you plan to query from multiple IP addresses or run any kind of automated pipeline against this, just be smart about it:

  • Spread your requests throughout the day. Don’t blast everything at once.
  • Run heavy workloads after hours (US Eastern) when traffic is lighter.
  • Leave room for other people. You’re sharing this with everyone.
  • If you need bulk access, buy the database and run unlimited queries on your own hardware.

These databases are well over a terabyte in size. There’s a real amount of work that goes into building them, keeping them fast, and making them available for free. I’ve put this on pretty solid hardware and put significant effort into making this as efficient as possible so as many people as possible can use it. Please don’t ruin that.

I will not hesitate to take this API down if it gets abused. I’d genuinely rather keep it open and free, but not at the cost of the service being degraded for everyone else and an increased maintenance headache for me. Be considerate, keep it available for the community, and we’re good.

Data Coverage

The API serves a nearly complete history of Common Crawl WebGraph rankings, but there are some important differences between domain-level and host-level data coverage.

Domain Rankings

Domain-level data is complete. It covers every Common Crawl WebGraph release from the May–June–July 2017 run all the way through the most recent snapshot.

Host Rankings

Host-level data begins with the February–March–May 2020 run and includes every release from that point forward. The earlier host-level runs (2017–2019) are intentionally excluded.

The reason: the early host-level web graph releases contained a very large number of dangling nodes, which inflated the data enormously. Including those first ~12 runs would roughly double the size of an already massive database (the complete dataset is over a terabyte uncompressed). The added complexity and storage cost aren’t worth it given the limited usefulness of the early host-level data.

Summary: Domain history is complete from the May–June–July 2017 run to the present. Host history starts with the February–March–May 2020 run. Every release from each starting point onward is included.

This same data coverage applies to both the free API and the purchasable database.

Date Format

Dates in the API response use the format YY-MM where YY is a two-digit year relative to 2000 (e.g. 21 = 2021) and MM is the month. Each Common Crawl web graph release spans roughly three months of crawl data. The YY-MM value represents the latest month of that three-month window. For example, 21-05 means the web graph release whose data ended in May 2021, covering approximately March–April–May 2021.
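
As a rough illustration of the format described above, here is a minimal Python sketch that turns a YY-MM label into a calendar date. The function name parse_year_month is made up for this example, and the returned date marks only the latest month of the release window, not the exact crawl dates.

```python
from datetime import date


def parse_year_month(value: str) -> date:
    """Convert a YY-MM snapshot label (e.g. "21-05") to the first day of that month."""
    yy, mm = value.split("-")
    if len(yy) != 2 or len(mm) != 2 or not (yy + mm).isdigit():
        raise ValueError(f"expected YY-MM, got {value!r}")
    # YY is relative to 2000; date() itself rejects months outside 1-12.
    return date(2000 + int(yy), int(mm), 1)


print(parse_year_month("21-05"))  # 2021-05-01
```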

Endpoint

URL: https://ccwg-api.customdatasets.com/
Method: POST
Content-Type: text/plain; charset=utf-8
CORS: Enabled (Access-Control-Allow-Origin: *)

Request Format

The POST body is plain text containing one hostname or domain per line. No JSON wrapping, no query parameters.

Rules

  • Minimum 1 hostname, maximum 10 hostnames per request.
  • Each line is one hostname or domain (e.g. google.com, blog.example.com).
  • Blank lines are ignored.
  • Hostnames are case-insensitive — Google.COM and google.com are equivalent.
  • You may include full URLs — the API strips http://, https://, and path components. However, sending clean hostnames is recommended.
  • Trailing dots are stripped (e.g. example.com. becomes example.com).
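
If you would rather send clean hostnames than rely on the API's cleanup, the rules above can be approximated client-side. This is only a sketch: normalize_host and build_body are hypothetical helper names, and the server's exact handling of edge cases (ports, query strings, credentials in URLs) is not documented here, so this may not match it in every case.

```python
def normalize_host(line: str) -> str:
    """Reduce a hostname or URL to a bare lowercase hostname (client-side sketch)."""
    host = line.strip().lower()
    for scheme in ("http://", "https://"):
        if host.startswith(scheme):
            host = host[len(scheme):]
            break
    host = host.split("/", 1)[0]  # drop any path component
    return host.rstrip(".")       # drop trailing dots


def build_body(lines: list[str]) -> str:
    """Build a plain-text POST body, enforcing the 1-10 hostname rule."""
    hosts = [h for h in map(normalize_host, lines) if h]  # skip blank lines
    if not 1 <= len(hosts) <= 10:
        raise ValueError(f"need 1-10 hostnames, got {len(hosts)}")
    return "\n".join(hosts)


print(build_body(["https://Google.COM/search", "", "example.com."]))
# google.com
# example.com
```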

Example POST Body

google.com
facebook.com
github.com

curl Examples

Single Hostname

curl -X POST https://ccwg-api.customdatasets.com/ \
  -H "Content-Type: text/plain; charset=utf-8" \
  -d "google.com"

Multiple Hostnames

curl -X POST https://ccwg-api.customdatasets.com/ \
  -H "Content-Type: text/plain; charset=utf-8" \
  -d "google.com
facebook.com
github.com"

Using a File

Put your hostnames in a text file (one per line) and POST it directly:

# hosts.txt contains one hostname per line
curl -X POST https://ccwg-api.customdatasets.com/ \
  -H "Content-Type: text/plain; charset=utf-8" \
  --data-binary @hosts.txt

Pretty-Print the JSON

curl -s -X POST https://ccwg-api.customdatasets.com/ \
  -H "Content-Type: text/plain; charset=utf-8" \
  -d "google.com" | python3 -m json.tool

Save Response to a File

curl -s -X POST https://ccwg-api.customdatasets.com/ \
  -H "Content-Type: text/plain; charset=utf-8" \
  -d "google.com
youtube.com" -o results.json
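
The same request can be made from any HTTP client. Below is a minimal Python sketch using only the standard library. The helper names build_request and lookup and the 30-second timeout are choices made for this example, not part of the API.

```python
import json
import urllib.request

API_URL = "https://ccwg-api.customdatasets.com/"


def build_request(hostnames: list[str]) -> urllib.request.Request:
    """Prepare the POST request: plain-text body, one hostname per line."""
    body = "\n".join(hostnames).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={"Content-Type": "text/plain; charset=utf-8"},
    )


def lookup(hostnames: list[str], timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response (performs network I/O)."""
    with urllib.request.urlopen(build_request(hostnames), timeout=timeout) as resp:
        return json.load(resp)
```

Calling lookup(["google.com", "youtube.com"]) would send the same body as the last curl example and return the parsed JSON as a dict. Note that urlopen raises urllib.error.HTTPError for non-2xx responses, so error statuses need to be caught separately.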

Response Format

Successful responses return application/json with HTTP status 200. The top-level JSON object has two keys:

Key | Type | Description
results | object | Lookup results keyed by each submitted hostname.
remaining | integer | Number of hostname lookups remaining in your daily rate limit.

Structure of results

Each key in results is the hostname you submitted. Its value is an object with two arrays:

Key | Type | Description
host | array | Historical ranking entries at the host level (e.g. www.google.com).
domain | array | Historical ranking entries at the domain level (e.g. google.com aggregating all subdomains).

If a hostname has no data in a given dataset (host or domain), the corresponding array will be empty ([]).

Entry Fields

Each entry in the host or domain arrays is an object with these fields:

Field | Type | Description
year_month | string | Crawl snapshot identifier in YY-MM format (e.g. "25-01" = January 2025).
pr_val_norm | integer | PageRank normalized to a 0–100 scale.
hc_val_norm | integer | Harmonic Centrality normalized to a 0–100 scale.
pr_val | integer | Raw PageRank value from the Common Crawl WebGraph computation.
hc_val | integer | Raw Harmonic Centrality value.
pr_pos | integer | Rank position by PageRank (1 = highest authority).
hc_pos | integer | Rank position by Harmonic Centrality (1 = closest to all nodes).
n_hosts | integer | Count of distinct hostnames under this domain. Domain-level entries only. Not present in host array entries.

Note: The n_hosts field is only present in domain array entries. Host-level entries have 7 fields; domain-level entries have 8 (the extra field being n_hosts).

Example Response

{
  "results": {
    "google.com": {
      "host": [
        {
          "year_month": "25-01",
          "hc_pos": 2,
          "hc_val": 987654321,
          "pr_pos": 1,
          "pr_val": 123456789,
          "hc_val_norm": 99,
          "pr_val_norm": 100
        },
        {
          "year_month": "24-10",
          "hc_pos": 2,
          "hc_val": 976543210,
          "pr_pos": 1,
          "pr_val": 119876543,
          "hc_val_norm": 99,
          "pr_val_norm": 100
        }
      ],
      "domain": [
        {
          "year_month": "25-01",
          "hc_pos": 1,
          "hc_val": 1234567890,
          "pr_pos": 1,
          "pr_val": 234567890,
          "hc_val_norm": 100,
          "pr_val_norm": 100,
          "n_hosts": 4521
        }
      ]
    }
  },
  "remaining": 97
}
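
To show how a response like this might be consumed, here is a small Python sketch that picks the most recent entry per level and copes with empty arrays. The sample is a trimmed copy of the example above. The "no-data.example" hostname and the latest helper are made up for illustration, and the code does not assume the API returns entries in any particular order.

```python
import json
from typing import Optional

sample = json.loads("""
{
  "results": {
    "google.com": {
      "host": [
        {"year_month": "24-10", "pr_pos": 1, "pr_val_norm": 100},
        {"year_month": "25-01", "pr_pos": 1, "pr_val_norm": 100}
      ],
      "domain": [
        {"year_month": "25-01", "pr_pos": 1, "pr_val_norm": 100, "n_hosts": 4521}
      ]
    },
    "no-data.example": {"host": [], "domain": []}
  },
  "remaining": 97
}
""")


def latest(entries: list[dict]) -> Optional[dict]:
    """Return the most recent entry, or None when the array is empty."""
    # YY-MM strings sort chronologically as plain text for years 2000-2099.
    return max(entries, key=lambda e: e["year_month"], default=None)


for name, levels in sample["results"].items():
    for level in ("host", "domain"):
        entry = latest(levels[level])
        label = entry["year_month"] if entry else "no data"
        print(f"{name} [{level}]: {label}")
```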

Metrics Reference

Every entry contains the following metrics. They appear in both host-level and domain-level data unless noted.

PageRank (Normalized 0–100) — pr_val_norm

PageRank normalized to a 0–100 scale. Higher values indicate stronger authority in the web graph. Useful for comparing sites on a simple, uniform scale.

Harmonic Centrality (Normalized 0–100) — hc_val_norm

Harmonic Centrality normalized to 0–100. Measures how close a node is to all other nodes in the web graph. Higher is better.

PageRank (Raw) — pr_val

Raw PageRank value from Common Crawl WebGraph computation. These are large integers whose magnitude varies across crawl snapshots, so cross-snapshot comparisons are most meaningful using the normalized version.

Harmonic Centrality (Raw) — hc_val

Raw Harmonic Centrality value from the web graph. Same caveat as raw PageRank regarding cross-snapshot comparisons.
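
Because raw values are not directly comparable across snapshots, a trend calculation would normally use the normalized fields. The sketch below shows one way to do that. The norm_change helper and the sample numbers are invented for this example.

```python
def norm_change(entries: list[dict], field: str = "pr_val_norm") -> int:
    """Change in a normalized metric between the oldest and newest snapshot."""
    ordered = sorted(entries, key=lambda e: e["year_month"])
    if len(ordered) < 2:
        return 0  # nothing to compare against
    return ordered[-1][field] - ordered[0][field]


# Hypothetical entries; the raw pr_val figures are ignored on purpose.
history = [
    {"year_month": "25-01", "pr_val_norm": 62, "pr_val": 123456789},
    {"year_month": "24-10", "pr_val_norm": 58, "pr_val": 98765432},
]
print(norm_change(history))  # 4
```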

PageRank Position — pr_pos

Rank position by PageRank. Lower numbers are better — rank 1 is the highest-authority site in that snapshot.

Harmonic Centrality Position — hc_pos

Rank position by Harmonic Centrality. Lower numbers are better.

Number of Hosts — n_hosts

Count of distinct hostnames under this domain. Only available in domain-level entries. Not present in host-level data. For example, google.com might report thousands of hosts (mail.google.com, maps.google.com, etc.).

Rate Limiting

Rate limiting is based on the total number of hostnames submitted, not the number of HTTP requests. Each hostname in your POST body counts as one lookup.

Daily Limit: 100 hostnames per IP address per day
Per Request: Maximum 10 hostnames per request
Reset Time: Daily at 4:00 AM Eastern Time

Every successful response includes a remaining field showing how many hostname lookups you have left for the day. Plan your requests accordingly.

Example: If you POST 3 hostnames in one request, that uses 3 of your 100 daily lookups. If you had made no other lookups that day, the response will show "remaining": 97.
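
Since the limit counts hostnames rather than requests, it can help to plan batches against the remaining value before sending anything. Here is a minimal Python sketch of that idea. The plan_batches helper and the site*.example hostnames are hypothetical, and the default numbers simply mirror the limits listed above.

```python
def plan_batches(hosts: list[str], remaining: int = 100, per_request: int = 10):
    """Split hosts into request-sized batches that fit the remaining daily budget.

    Returns (batches, deferred): batches to send today, hosts left for later.
    """
    budget = max(remaining, 0)
    today, deferred = hosts[:budget], hosts[budget:]
    batches = [today[i:i + per_request] for i in range(0, len(today), per_request)]
    return batches, deferred


hosts = [f"site{i}.example" for i in range(125)]
batches, deferred = plan_batches(hosts, remaining=97)
print(len(batches), len(batches[-1]), len(deferred))  # 10 7 28
```

Spacing the resulting batches out over the day, rather than sending them back to back, keeps the load light for everyone else.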

Error Responses

All error responses return JSON with an error field describing the problem.

400 Bad Request

Returned when the POST body is empty, contains no valid hostnames, or exceeds the 10-hostname limit.

{
  "error": "Request must contain 1-10 hostnames, one per line",
  "max_per_request": 10
}

405 Method Not Allowed

Returned for any HTTP method other than POST (e.g. GET, PUT, DELETE).

{
  "error": "Method not allowed. Use POST."
}

404 Not Found

Returned if you request any path other than /.

{
  "error": "Not found. Use POST /"
}

429 Too Many Requests

Returned when your IP has exceeded the daily hostname limit.

{
  "error": "Rate limit exceeded. Max 100 hostnames per IP per day (resets daily at 4 AM ET).",
  "retry_after": 28800,
  "daily_limit": 100
}

The retry_after field is the number of seconds until the next rate limit reset (4 AM ET).
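
A client can use retry_after to decide when to try again instead of polling. The sketch below is one way to do that, assuming the status code and parsed JSON body have already been obtained. The retry_time helper is a made-up name, and the payload is copied from the example above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def retry_time(status: int, payload: dict, now: datetime) -> Optional[datetime]:
    """For a 429 response, return when the limit resets; otherwise None."""
    if status != 429:
        return None
    return now + timedelta(seconds=payload.get("retry_after", 0))


payload = {
    "error": "Rate limit exceeded. Max 100 hostnames per IP per day (resets daily at 4 AM ET).",
    "retry_after": 28800,
    "daily_limit": 100,
}
now = datetime(2025, 1, 15, 1, 0, tzinfo=timezone.utc)
print(retry_time(429, payload, now))  # 2025-01-15 09:00:00+00:00
```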

Web Viewer & Exports

In addition to this API, there is a web-based history viewer with interactive charts for all metrics. The web viewer also supports exporting data as:

  • PDF: multi-page report with charts for each metric
  • CSV: flat file with all metrics and timestamps
  • XLSX: Excel workbook with one sheet per metric
  • ODS: OpenDocument spreadsheet

Note: Export functionality (PDF, CSV, XLSX, ODS) is only available through the web interface. The API returns JSON data only. To generate reports or spreadsheets, use the web viewer or build your own export pipeline from the JSON responses.
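
For anyone building their own export from the JSON, here is a minimal Python sketch that flattens a results object into CSV text. The column layout, the added hostname and level columns, and the results_to_csv name are choices made for this example. They are not the web viewer's actual export format.

```python
import csv
import io

FIELDS = ["hostname", "level", "year_month", "pr_val_norm", "hc_val_norm",
          "pr_val", "hc_val", "pr_pos", "hc_pos", "n_hosts"]


def results_to_csv(results: dict) -> str:
    """Flatten the API's results object into CSV text, one row per entry."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, restval="",
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for hostname, levels in results.items():
        for level in ("host", "domain"):
            for entry in levels.get(level, []):
                # n_hosts is absent from host entries; restval leaves it blank.
                writer.writerow({"hostname": hostname, "level": level, **entry})
    return buf.getvalue()


results = {
    "google.com": {
        "host": [],
        "domain": [{"year_month": "25-01", "hc_pos": 1, "hc_val": 1234567890,
                    "pr_pos": 1, "pr_val": 234567890, "hc_val_norm": 100,
                    "pr_val_norm": 100, "n_hosts": 4521}],
    }
}
print(results_to_csv(results))
```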

Go Beyond the API — Get the Full WebGraph Database

The free API is great for spot checks and small integrations. If you need bulk access to every host and domain ranking across all crawl snapshots, the downloadable database gives you unlimited local queries with zero rate limits.
