Custom Datasets for Busy Teams

When You Just Want Data, Not Another Project.

26+ years across marketing and engineering. I build custom, production-grade datasets for teams that are too busy to build them in-house. Common Crawl extraction, SERP analysis, competitive intelligence — whatever you need, delivered ready to use.

  • 26+ years of combined experience
  • 850B+ records processed
  • 1,400+ client accounts managed
  • Fortune 500 enterprise experience

Teams That Need Data, Not Another Project

You know the data exists. You know it can be extracted. You just don't have the time or the team to do it yourself.

In-House Marketing Teams

SEO, analytics, competitive intelligence

Your engineers are busy. Your analysts are buried. You need someone who understands both marketing and engineering — and can take a data project from concept to deliverable without hand-holding.

  • SEO teams needing competitive or authority data
  • Brand teams monitoring market presence
  • Analytics teams that need clean, structured datasets
  • Data teams without bandwidth for one-off projects

Agencies

SEO, marketing, digital, PR

Your client needs data you can't build in-house. I build it, you deliver it. Clean handoff, no drama. You set your own margin.

  • Client requests that don't justify a full-time hire
  • Complex data work outside your team's skill set
  • White-label delivery — your client, your relationship
  • Repeatable partnerships on ongoing client work

What I Build for Teams

Every project is different. Here are the kinds of work that come through the door most often.

Common Crawl Extraction

Parse and analyze pages from Common Crawl releases at scale. Extract structured data from HTML, URLs, tags, and page elements across billions of records.

SERP Analysis & Opportunity Mining

Run hundreds of thousands of search queries, then download and analyze every ranked URL. Contact info, partnership opportunities, competitive gaps — extracted and structured.

Competitive Intelligence

Track competitors across web properties, search results, and market signals. Build databases your team can search and act on immediately.

Brand & Reputation Monitoring

Monitor search results, news, RSS feeds, and web mentions for brand terms, product names, executives, and competitors. Near-real-time, highly thorough.

Lead & Partnership Databases

Build searchable databases of sponsorship opportunities, link prospects, local partnerships, and outreach targets — nationwide or by specific location.

Large-Scale Web Scraping

Phone numbers, emails, social accounts, named entities — extracted, validated, and delivered in your preferred format. CSV, JSON, SQLite, Excel.

Work With Ben Wills

13+ years in marketing. 13+ years in engineering. 26+ years of combined experience building datasets, managing teams, and solving hard data problems for companies of every size.

Who I Am

Most marketers can't engineer complex data systems. Most engineers don't understand marketing well enough to build the right thing. I've spent over a decade in each discipline, and that combination is rare.

On the marketing side, I've directed teams of 70–80 people responsible for over 1,400 SEO client accounts. I've led international SEO campaigns spanning 30–40 countries for clients including a major home improvement retailer. I was the weekly point of contact for Fortune 500 accounts. I grew a company from zero to $140K+/month in revenue in under a year as VP of Operations. I've run SEO and PPC campaigns for small businesses, managed agency teams, and built marketing strategy at every scale.

On the engineering side, I've built large-scale web scraping and indexing systems that process billions of records. I wrote a marketing SaaS platform from scratch in pure C that could download and parse up to 400 million URLs per day. I've built custom databases, data pipelines, sharded storage systems, firmware for embedded devices, cross-platform networking libraries, and the complete Common Crawl web graph history — over 850 billion records compiled into queryable SQLite databases.

When you describe a data problem to me, I don't just understand the technical requirements — I understand the marketing objective behind it. I know what you're trying to accomplish, why it matters, and how the data needs to be structured so your team can actually use it. That's what 26 years across both disciplines gives you.

How It Works

Every project starts with a conversation about what you're trying to accomplish — not a requirements document or a feature list. I want to understand the business objective. Once I understand the goal, I scope the work, define exactly what you'll receive, and quote a fixed price. No hourly billing surprises. No scope creep. You know what you're getting and what it costs before anything starts.

From there, I build. You get regular updates with working deliverables — not status reports, not slide decks. Actual data you can look at. If something needs to change mid-project, we talk about it and adjust. I'm direct about what I can and can't do. If I think something won't work, I'll tell you before I waste your time or money on it.

Deliverables are production-grade. You get the dataset in whatever format your team needs — CSV, JSON, Excel, SQLite — along with schema definitions, field-level documentation, notes on assumptions and edge cases, and QA methodology. Everything is clean, documented, and ready to plug into your workflows.

Revisions Are Built In

Here's something I've learned from years of doing this work: once people see their data for the first time, they almost always want something different from what they originally described. That's not a failure of scoping — it's how data projects actually work. You don't fully know what you need until you see what's possible.

Iterations and revisions are baked into every project I take on. The price I quote assumes we're going to go back and forth. I expect it. I'll refine the deliverable until it's right. If you ask for something that's outside the scope, I'll tell you — but within the scope of the project, I'm not going to nickel-and-dime you on changes. The goal is a deliverable your team actually uses, not one that technically meets a spec but sits in a folder.

Engineering Experience

  • Large-scale web scraping & indexing systems
  • Data pipeline architecture (billions of records)
  • Common Crawl parsing & extraction at scale
  • Custom database design (SQLite, key-value, sharded)
  • Built a marketing SaaS in pure C — 400M URLs/day
  • Firmware development (ESP32, PIC32, custom protocols)
  • API design & development

Marketing Experience

  • Directed 70–80 person team across 1,400+ SEO accounts
  • Led international SEO campaigns (30–40 countries)
  • Weekly point person for Fortune 500 accounts
  • VP of Operations — $0 to $140K+/month in 9 months
  • SEO & PPC for SMBs, agencies, and enterprise

Example Projects

  • Built partnership database across 750+ locations for a worldwide hotel chain — 750 custom spreadsheets delivered
  • Nationwide sponsored link opportunity database from 100K+ Google search queries for a national SEO firm
  • Compiled the complete Common Crawl web graph history — 850B+ records across 15+ years of crawl data
  • SERP monitoring & competitive analysis for a major SaaS company — hundreds of thousands of search queries analyzed
  • Built brand monitoring system tracking mentions across search results, news, RSS feeds, and web sources in near-real-time

Common Questions

What format do you deliver data in?

Whatever works for your team. CSV, JSON, Excel, SQLite — I deliver in your preferred format with field-level documentation and schema definitions. The web graph databases are SQLite.

How does pricing work for custom projects?

Almost always fixed price. I scope the project, define clear deliverables, and quote a number before work begins. Revisions and iterations are included — we go back and forth until it's right.

How long does a typical project take?

It depends entirely on the project: some take a few weeks, some run longer. I'll give you a realistic timeline during the scoping conversation.

What if I need changes after delivery?

That's expected. Once people see their data, they almost always want adjustments. Iterations are baked into every project. If you need something I can't provide, I'll tell you upfront.

What's the web graph database?

The complete Common Crawl web graph history, compiled into SQLite databases. Search any domain or hostname and get full historical metrics including custom normalized PageRank and Harmonic Centrality scores (0–100).

Is the API really free?

Yes. No API key, no registration, no credit card. 100 hostnames per day, up to 10 per request. Just hit the endpoint and get data back. Resets every 24 hours.
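
As a rough sketch, here's what a request could look like from Python. The endpoint URL, parameter name, and response shape below are assumptions for illustration, not the published API spec; check the API docs for the real values.

    # Minimal sketch of calling the free API from Python.
    # The endpoint URL, query parameter, and response shape are
    # hypothetical; consult the API documentation for the real ones.
    import json
    import urllib.request

    ENDPOINT = "https://example.com/api/v1/hostnames"  # hypothetical URL

    def lookup(hostnames):
        """Fetch web graph metrics for up to 10 hostnames in one request."""
        if len(hostnames) > 10:
            raise ValueError("free tier allows at most 10 hostnames per request")
        url = ENDPOINT + "?host=" + ",".join(hostnames)
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)

    for record in lookup(["example.com", "example.org"]):
        # Expect normalized PageRank and Harmonic Centrality on a
        # 0-100 scale, per the database description above.
        print(record)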

Schedule a Call

Tell me what you're working on. I'll let you know if I can help and what it would cost. No pitch, no pressure — just a direct conversation about your project.

Contact form coming soon. For now, send me an email with a brief description of your project.

The Complete Common Crawl Web Graph — Ready to Query

The raw Common Crawl web graph data is a collection of tab-separated value files you have to download and compile yourself. No database. No search. Just raw data and a lot of work.

I've done that work for you. Every release, every domain, every hostname — compiled into SQLite databases with instant key-value lookup. Search any domain or hostname and get its complete history across every metric, every crawl.
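
As a rough sketch, querying one of these databases from Python might look like this. The filename, table, and column names are assumptions for illustration; the delivered databases ship with schema definitions documenting the real ones.

    # Illustrative lookup against the hostname database. Table and
    # column names below are hypothetical; the shipped databases
    # include schema definitions with the actual names.
    import sqlite3

    conn = sqlite3.connect("hostnames.sqlite")  # hypothetical filename

    rows = conn.execute(
        """
        SELECT crawl, pagerank_norm, harmonic_norm, rank
        FROM hostnames
        WHERE hostname = ?
        ORDER BY crawl
        """,
        ("example.com",),
    ).fetchall()

    for crawl, pagerank, harmonic, rank in rows:
        # One row per Common Crawl release: the hostname's full history.
        print(crawl, pagerank, harmonic, rank)

    conn.close()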

What's Included

  • Full historical PageRank and Harmonic Centrality for every entity
  • Normalized PageRank score (0–100) — custom, far more useful for analysis
  • Normalized Harmonic Centrality score (0–100) — clean range for visualization
  • Rank position, n_hosts (domain-level), and all original metrics
  • SQLite format — portable, fast, no server required
  • Separate databases for domain queries and hostname queries

  • Domains (~300GB): $999, one-time
  • Hostnames (~850GB): $1,999, one-time
  • Both (~1.15TB): $2,499, one-time

Yearly subscriptions available — updated with each new Common Crawl release.

View Full Pricing

Recent Data Updates

I continuously process new Common Crawl releases and improve the dataset.

Feb 2026

CC-2026-09 WebGraph Added

Latest Common Crawl release processed and added to all database products. API updated.

Jan 2026

Improved Normalization

Refined 0–100 scoring algorithm for better distribution across the range.

Dec 2025

API Rate Limits Increased

All API plans now include 2x the previous rate limits at no additional cost.

Nov 2025

Hostname Database Optimization

Reduced file sizes by 15% through improved compression without data loss.