September 20, 2025
Mapping Internal Links via the WordPress REST API

Most people default to Screaming Frog for link audits. I did too, until the procurement process dragged on and I got impatient. The crawl files were massive, the exports slower than they needed to be, and most of the time all I really wanted was a clean view of how internal links flow through a WordPress site.

So I built one.

This setup hits the WordPress REST API directly, pulls all public posts, extracts internal-to-internal links, and outputs a ready-to-analyze CSV with inlink counts and anchor text summaries. It’s fast, minimal, and works anywhere Python runs.

Creating Interlinking Tools

Most teams I’ve worked with hit the same wall when they start managing internal links. Give someone a spreadsheet of every URL and anchor, and you’ll end up with the same five pillar pages overlinked while the rest of the content goes unseen. What I needed was a way to make interlinking smarter — something that could eventually balance inlinks and outlinks automatically, surface new articles that need attention, and show which supporting pages are being left out.

That long-term system starts with one thing: a map. A snapshot of how internal links actually move through the site right now. Traditionally, you’d crawl everything, export, pivot, and rebuild the table. But WordPress already gives you structured data through /wp-json/wp/v2/..., so instead of spinning up a crawler, I decided to pull straight from there. I’ll build the rest once I learn how to create UIs.
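
To get a feel for what the API hands back, here is a minimal sketch of a single request against the posts endpoint, using the same requests library the full script below relies on; the _fields parameter trims each response down to just what the link map needs, and the full script wraps this in pagination and error handling:

import requests

resp = requests.get(
    "https://dan.marketing/wp-json/wp/v2/posts",
    params={"per_page": 5, "_fields": "id,link,title,content.rendered"},
    timeout=30,
)
resp.raise_for_status()
for post in resp.json():
    # Each item carries its public URL plus the rendered HTML we can parse for links.
    print(post["id"], post["link"])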


Dependencies: requests, beautifulsoup4, pandas, tldextract

Install:

pip install requests beautifulsoup4 pandas tldextract

Environment variables:

export WP_BASE_URL="https://example.com"
export WP_INTERNAL_DOMAINS="blog.example.com,shop.example.com"
export WP_POST_TYPES="posts,pages"
export WP_PER_PAGE="100"
export WP_MAX_PAGES_PER_TYPE="400"
export WP_STRIP_QUERY_PARAMS="1"
export WP_DROP_FRAGMENT="1"
export WP_EXCLUDE_NAV_FOOTER="0"
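
The script reads all of these via os.getenv in its configuration block, which runs at import time, so if you would rather work from a notebook than a shell, a minimal sketch is to set the same values with os.environ before the script loads (wp_inlinks.py is just my assumption for where the script is saved):

import os

# Must run before the script's configuration block, which calls os.getenv at import time.
os.environ["WP_BASE_URL"] = "https://dan.marketing"
os.environ["WP_POST_TYPES"] = "posts,pages"
os.environ["WP_PER_PAGE"] = "100"

# Then execute the script, e.g. in Jupyter:
# %run wp_inlinks.py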

The Full Python Code

#!/usr/bin/env python3
"""
WordPress REST API inlinks builder (public content only)
- Fetches published posts/pages via /wp-json/wp/v2
- Extracts internal links only
- Aggregates inlinks per target
- Output file name includes domain root and current date
"""

import os
import re
import time
import datetime
from typing import List, Dict, Optional, Tuple, Set
from urllib.parse import urljoin, urlparse, urlunparse

import requests
from bs4 import BeautifulSoup
import pandas as pd
import tldextract

# ==============================
# Configuration
# ==============================

BASE_URL = os.getenv("WP_BASE_URL", "https://dan.marketing")
POST_TYPES = [p.strip() for p in os.getenv("WP_POST_TYPES", "posts,pages").split(",") if p.strip()]
POST_STATUS = "publish"
PER_PAGE = int(os.getenv("WP_PER_PAGE", "100"))
MAX_PAGES_PER_TYPE = int(os.getenv("WP_MAX_PAGES_PER_TYPE", "200"))
SLEEP_BETWEEN_REQUESTS = float(os.getenv("WP_RATE_SLEEP", "0.2"))

STRIP_QUERY_PARAMS = os.getenv("WP_STRIP_QUERY_PARAMS", "1") == "1"
DROP_FRAGMENT = os.getenv("WP_DROP_FRAGMENT", "1") == "1"
EXCLUDE_NAV_FOOTER = os.getenv("WP_EXCLUDE_NAV_FOOTER", "0") == "1"
INTERNAL_EXTRA = [d.strip().lower() for d in os.getenv("WP_INTERNAL_DOMAINS", "").split(",") if d.strip()]

# ==============================
# Helpers
# ==============================

def domain_root(url_or_host: str) -> str:
    ext = tldextract.extract(url_or_host)
    return ext.top_domain_under_public_suffix or ""

def resolve_internal_sets(base_url: str, extras: List[str]) -> Tuple[Set[str], Set[str]]:
    base_host = urlparse(base_url).netloc.lower()
    base_root = domain_root(base_host)
    roots, hosts = {base_root}, {base_host}
    for item in extras:
        parsed = urlparse(item if "://" in item else f"https://{item}")
        host = parsed.netloc.lower()
        root = domain_root(host)
        if root: roots.add(root)
        if host: hosts.add(host)
    return roots, hosts

INTERNAL_ROOTS, INTERNAL_HOSTS = resolve_internal_sets(BASE_URL, INTERNAL_EXTRA)

def is_internal(url: str) -> bool:
    if not url: return False
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if not host: return False
    if host in INTERNAL_HOSTS: return True
    return domain_root(host) in INTERNAL_ROOTS

def normalize_url(href: str, base: str) -> Optional[str]:
    if not href or href.startswith(("mailto:", "tel:", "javascript:", "#")):
        return None
    abs_url = urljoin(base if base.endswith("/") else base + "/", href)
    parsed = urlparse(abs_url)
    fragment = "" if DROP_FRAGMENT else parsed.fragment
    query = "" if STRIP_QUERY_PARAMS else parsed.query
    scheme = parsed.scheme.lower()
    netloc = parsed.netloc.lower()
    path = re.sub(r"/{2,}", "/", parsed.path)
    return urlunparse((scheme, netloc, path, parsed.params, query, fragment))

def fetch_wp_items(base_url: str, post_type: str, per_page: int, max_pages: int) -> List[dict]:
    items, session = [], requests.Session()
    fields = "id,link,slug,title,content.rendered,date,modified"
    for page in range(1, max_pages + 1):
        url = f"{base_url}/wp-json/wp/v2/{post_type}"
        params = {"status": POST_STATUS, "per_page": per_page, "page": page, "_fields": fields}
        r = session.get(url, params=params, timeout=30)
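        # WordPress returns a 400 with this error code once the page number runs past the last page of results.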
        if r.status_code == 400 and "rest_post_invalid_page_number" in r.text: break
        r.raise_for_status()
        data = r.json()
        if not data: break
        items.extend(data)
        if len(data) < per_page: break
        time.sleep(SLEEP_BETWEEN_REQUESTS)
    return items

def extract_internal_edges(html_content: str, source_url: str) -> List[Tuple[str, str, str]]:
    soup = BeautifulSoup(html_content or "", "html.parser")
    if EXCLUDE_NAV_FOOTER:
        for sel in ["nav", "header", "footer", ".site-header", ".site-footer", ".menu", ".breadcrumb", ".breadcrumbs"]:
            for tag in soup.select(sel): tag.decompose()
    edges = []
    for a in soup.find_all("a", href=True):
        href = a.get("href", "").strip()
        anchor = a.get_text(separator=" ", strip=True) or ""
        target_norm = normalize_url(href, source_url)
        if target_norm and is_internal(source_url) and is_internal(target_norm) and source_url != target_norm:
            edges.append((source_url, target_norm, anchor))
    return edges

# ==============================
# Main
# ==============================

def main():
    base_root = domain_root(BASE_URL)
    date_str = datetime.date.today().isoformat()
    default_name = f"inlinks_{base_root}_{date_str}.csv"
    INLINKS_CSV = os.getenv("WP_OUT_INLINKS", default_name)

    print(f"[i] Base: {BASE_URL}")
    print(f"[i] Internal roots: {sorted(INTERNAL_ROOTS)}")
    print(f"[i] Internal hosts: {sorted(INTERNAL_HOSTS)}")
    print(f"[i] Output file: {INLINKS_CSV}")

    all_docs = []
    for pt in POST_TYPES:
        print(f"[i] Fetching {pt} ...")
        docs = fetch_wp_items(BASE_URL, pt, PER_PAGE, MAX_PAGES_PER_TYPE)
        print(f"[i] Retrieved {len(docs)} {pt}")
        all_docs.extend(docs)

    edge_rows = []
    for d in all_docs:
        source = d.get("link")
        content = (d.get("content") or {}).get("rendered", "")
        if not source or not content or not is_internal(source):
            continue
        for _, tgt, anchor in extract_internal_edges(content, source):
            edge_rows.append({"source_url": source, "target_url": tgt, "anchor_text": anchor})

    if not edge_rows:
        print("[!] No internal edges found. Writing empty CSV.")
        pd.DataFrame(columns=["target_url","inlink_count","unique_anchor_count","sources","anchors_top3"]).to_csv(INLINKS_CSV, index=False)
        print(f"[i] Wrote {INLINKS_CSV}")
        return

    edges_df = pd.DataFrame(edge_rows)
    def top3(s: pd.Series) -> str:
        counts = s.value_counts().head(3)
        return "; ".join([f"{a} ×{c}" for a,c in counts.items()])

    inlinks = (
        edges_df.groupby("target_url")
        .agg(
            inlink_count=("source_url", "nunique"),
            unique_anchor_count=("anchor_text", "nunique"),
            sources=("source_url", lambda s: "; ".join(sorted(set(s)))),
            anchors_top3=("anchor_text", top3),
        )
        .reset_index()
        .sort_values(["inlink_count","target_url"], ascending=[False,True])
    )
    inlinks.to_csv(INLINKS_CSV, index=False)
    print(f"[i] Wrote {INLINKS_CSV} ({len(inlinks)} rows)")

if __name__ == "__main__":
    main()

The script walks the configured public post types, fetches published items via /wp-json/wp/v2/..., parses the rendered HTML, finds internal-to-internal links across the allowed subdomains and roots, and aggregates inlinks with top anchors.

It saves the file as inlinks_<root-domain>_<YYYY-MM-DD>.csv, for example inlinks_dan.marketing_2025-10-08.csv.

Sample Output:

[i] Base: https://dan.marketing
[i] Internal roots: ['dan.marketing', 'leapfinance.com', 'leapscholar.com']
[i] Output file: inlinks_dan.marketing_2025-10-08.csv
[i] Fetching posts ...
[i] Retrieved 120 posts
[i] Fetching pages ...
[i] Retrieved 34 pages
[i] Wrote inlinks_dan.marketing_2025-10-08.csv (367 rows)

Sample CSV:

target_url,inlink_count,unique_anchor_count,sources,anchors_top3
https://dan.marketing/blog/tipr-model,8,5,"https://dan.marketing/blog/jlon-command; https://dan.marketing/blog/content-ops","tipr ×3; link equity ×2; internal linking ×2"

Use Cases

For small teams or sandbox projects, this doesn’t replace a crawler; it gives you a quick, clean output to start with and a nice playground for extending the code into interlinking tools. You can run it in a notebook, analyze link equity flow, and even chain it into a daily job.
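
As a starting point in a notebook, a minimal sketch (assuming the dated filename from the example above) is to load the CSV and look at the least-linked targets, which are usually the first interlinking candidates:

import pandas as pd

df = pd.read_csv("inlinks_dan.marketing_2025-10-08.csv")

# Targets receiving the fewest internal links are the obvious first candidates.
# Note: pages with zero inlinks won't appear at all, since the CSV only lists linked targets.
print(df.nsmallest(10, "inlink_count")[["target_url", "inlink_count", "anchors_top3"]])

# Targets leaning on a single repeated anchor may need more varied anchor text.
print(df[df["unique_anchor_count"] == 1][["target_url", "inlink_count", "anchors_top3"]].head(10))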

The CSV structure mirrors how you’d normally pivot in Screaming Frog: one row per target, inlink counts, unique anchors, and source list. Once you have that, you can layer it into TIPR models, build internal link recommendation systems, or even measure content decay across time. I wrote a blog post on how these outputs can be plugged into GPT for automated interlinking suggestions.
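
One very rough sketch of that layering, assuming a hypothetical traffic.csv with url and sessions columns exported from your analytics tool: join it against the inlink counts and surface pages that earn traffic but receive few internal links.

import pandas as pd

inlinks = pd.read_csv("inlinks_dan.marketing_2025-10-08.csv")
traffic = pd.read_csv("traffic.csv")  # hypothetical export: columns url, sessions

merged = traffic.merge(inlinks, left_on="url", right_on="target_url", how="left")
merged["inlink_count"] = merged["inlink_count"].fillna(0)

# Crude priority score: high-traffic pages with few inlinks float to the top.
merged["link_gap_score"] = merged["sessions"] / (1 + merged["inlink_count"])
print(merged.nlargest(10, "link_gap_score")[["url", "sessions", "inlink_count"]])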

My personal to-do list

  • Interlinking Suggestion Engine: Run OpenAI embeddings on anchor text and recommend contextual placements.
  • TIPR Scoring: Combine link counts with crawl frequency or organic traffic.
  • Daily Deltas: Track link changes via cron and post them to a Google Sheet (see the sketch after this list).
  • Create a UI: The ultimate goal is to build a tool my team can use.
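
For the daily-deltas item, here is a minimal sketch of just the diff step, assuming two dated CSVs produced by consecutive runs (the 2025-10-07 filename is hypothetical, and pushing the result to a Google Sheet is left out):

import pandas as pd

old = pd.read_csv("inlinks_dan.marketing_2025-10-07.csv").set_index("target_url")
new = pd.read_csv("inlinks_dan.marketing_2025-10-08.csv").set_index("target_url")

# Targets that appeared or disappeared between runs.
gained_targets = new.index.difference(old.index)
lost_targets = old.index.difference(new.index)

# Inlink count changes for targets present in both snapshots.
both = new[["inlink_count"]].join(old[["inlink_count"]], lsuffix="_new", rsuffix="_old", how="inner")
changed = both[both["inlink_count_new"] != both["inlink_count_old"]]

print(f"New targets: {len(gained_targets)}, dropped targets: {len(lost_targets)}")
print(changed.sort_values("inlink_count_new", ascending=False).head(10))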
