Internal linking—it sounds easy, right? Just drop a link here, add an anchor there, and move on to the next article. Simple enough for a handful of posts. But when you’re managing hundreds or even thousands of articles, that “simple” task quickly turns into an SEO nightmare.
Like most digital marketers, I’ve wrestled with manually inserting internal links more times than I care to admit. You know the drill: find the keywords, find suitable content, add the link. It’s essential for SEO, crucial for user experience, and yet completely unscalable.
But automation isn’t easy either. Most off-the-shelf tools simply wrap a link around every keyword match they find, cluttering the page and harming readability. I needed something smarter: an approach that understands context, respects readability, and scales effortlessly. That’s exactly why I turned to GPT.
In this guide, I’ll share exactly how I built a GPT-powered internal linking pipeline that scales beautifully, maintains editorial quality, and avoids spammy link insertion.
You’ve probably experimented with automation scripts or plugins that “auto-insert” internal links. And you probably weren’t thrilled with the results. These solutions tend to blindly match keywords without context, creating a mess of unnatural links.
What if, instead of keyword-stuffing links, you had an AI-powered editor that could insert links as naturally and contextually as a skilled human editor would? That’s exactly what GPT offers.
This isn’t AI hype—it’s a very practical, effective application of generative AI.
Here’s exactly how I built this system, step-by-step, to help you replicate this for your own blog or clients.
The first script fetches existing blog posts via the WordPress REST API and systematically breaks each post into paragraph and table blocks. It then scans that content against a keyword list, looking for exact, in-context matches. The key here? Context, not just keywords.
Here’s the fully explained, production-ready code:
"""
================================================================================
🧩 Script: Contextual Link Opportunity Finder for WordPress Blog Content
--------------------------------------------------------------------------------
🎯 PURPOSE:
This Python script is designed to automate the identification of *contextually relevant*
internal linking opportunities within blog content hosted on a WordPress website (via REST API).
It assists in large-scale internal linking by scanning blog posts for paragraphs that mention
specific keywords, and then pairing those paragraphs with relevant target URLs.
🔍 HOW IT WORKS:
1. Loads a list of blog post URLs (`DONOR_FILE`) and a list of keywords with their respective
target URLs (`KEYWORD_FILE`).
2. Uses the WordPress REST API to fetch each post's HTML content based on its slug.
3. Parses out <p> (paragraph) and <table> blocks using BeautifulSoup.
4. For each keyword, checks whether it's present in any paragraph or in/around a table block.
5. If found, extracts the paragraph as a suitable anchor context for linking.
6. Enforces a max limit of contextual links per post (`MAX_LINKS_PER_POST`) to avoid spammy linking.
7. Results (blog URL, keyword, matched paragraph, target URL) are written to a CSV (`OUTPUT_FILE`).
⚙️ USE CASES:
• SEO teams running interlinking sprints at scale
• Automating content audits for contextual linking gaps
• Feeding this CSV into a generative link rewriter or publishing system
📁 INPUT FILES:
- `second-iteration.csv`: Contains blog post URLs (donors).
- `pages-to-link.csv`: Contains keywords and the target URLs to link to.
📁 OUTPUT FILE:
- `contextual_link_candidates.csv`: Each row suggests a keyword → paragraph match where
the keyword naturally appears in the post content.
🛡️ SAFEGUARDS:
- Skips paragraphs with fewer than 4 words to reduce noise.
- Avoids creating more than 3 links per post to prevent over-optimization.
- Uses regex word-boundary matching to avoid partial word collisions.
- Includes logic to fall back to nearby paragraphs if a keyword is found inside a <table>.
🛠️ TO EXTEND:
- Add GPT scoring for confidence-based ranking.
- Group keyword variants using fuzzy or semantic matching.
- Use append mode and logging for large-scale batch runs.
Author: Daniel Sylvester Antony
Date: 2025
================================================================================
"""
import pandas as pd
import requests
import time
import re
from bs4 import BeautifulSoup
from urllib.parse import urlparse
from base64 import b64encode
# === CONFIG ===
USERNAME = ''        # WordPress username (fill in before running)
APP_PASSWORD = ''    # WordPress application password for that user
WP_API_BASE = "https://website.com/blog/wp-json/wp/v2"
auth_str = f"{USERNAME}:{APP_PASSWORD}"
auth_header = {'Authorization': 'Basic ' + b64encode(auth_str.encode()).decode('utf-8')}
DONOR_FILE = 'second-iteration.csv'
KEYWORD_FILE = 'pages-to-link.csv'
OUTPUT_FILE = 'contextual_link_candidates.csv'
MAX_POSTS = 100
MAX_LINKS_PER_POST = 3
DELAY_SECONDS = 1.5
# === HELPERS ===
# Extracts the slug from a blog post URL
def extract_slug(url):
    parsed = urlparse(url)
    return parsed.path.strip('/').split('/')[-1]

# Pulls all paragraph and table blocks from HTML content
def extract_blocks(html):
    soup = BeautifulSoup(html, "html.parser")
    blocks = []
    for el in soup.find_all(['p', 'table']):
        tag_type = el.name
        content = el.get_text().strip().replace('\n', ' ')
        if content:
            blocks.append((tag_type, content))
    return blocks

# Checks if the exact keyword (word-boundary match) exists in a given text
def keyword_in_text(keyword, text):
    pattern = r'\b' + re.escape(keyword) + r'\b'
    return re.search(pattern, text, flags=re.IGNORECASE)
# === MAIN LOGIC ===
def main():
    interlink_df = pd.read_csv(DONOR_FILE)
    pages_df = pd.read_csv(KEYWORD_FILE)
    donor_urls = interlink_df['donor_url'].dropna().unique().tolist()[:MAX_POSTS]
    results = []

    for url in donor_urls:
        slug = extract_slug(url)
        print(f"🔍 Fetching: {slug}")
        try:
            response = requests.get(f"{WP_API_BASE}/posts?slug={slug}", headers=auth_header)
            time.sleep(DELAY_SECONDS)
            if response.status_code == 200 and response.json():
                post = response.json()[0]
                content_html = post['content']['rendered']
                blocks = extract_blocks(content_html)
                seen_pairs = set()
                post_link_count = 0

                for keyword_row in pages_df.itertuples():
                    keyword = str(keyword_row.Keywords).strip().lower()
                    target_url = keyword_row.URL

                    for idx, (block_type, text) in enumerate(blocks):
                        if len(text.split()) < 4:
                            continue  # Ignore short or empty paragraphs

                        pair_key = (url, keyword)
                        if pair_key in seen_pairs:
                            break  # Avoid repeating the same link suggestion for the same keyword

                        if keyword_in_text(keyword, text):
                            if block_type == 'table':
                                # Check surrounding paragraphs if the keyword appears inside a table
                                nearby_blocks = []
                                if idx > 0 and blocks[idx-1][0] == 'p':
                                    nearby_blocks.append(('p_near_table', blocks[idx-1][1]))
                                if idx < len(blocks)-1 and blocks[idx+1][0] == 'p':
                                    nearby_blocks.append(('p_near_table', blocks[idx+1][1]))
                                for para_type, para_text in nearby_blocks:
                                    results.append({
                                        'blog_url': url,
                                        'blog_title': post['title']['rendered'],
                                        'keyword': keyword,
                                        'link_to': target_url,
                                        'paragraph_type': para_type,
                                        'original_paragraph': para_text
                                    })
                                seen_pairs.add(pair_key)
                                post_link_count += 1
                                break
                            elif block_type == 'p':
                                results.append({
                                    'blog_url': url,
                                    'blog_title': post['title']['rendered'],
                                    'keyword': keyword,
                                    'link_to': target_url,
                                    'paragraph_type': 'body_paragraph',
                                    'original_paragraph': text
                                })
                                seen_pairs.add(pair_key)
                                post_link_count += 1
                                break

                    if post_link_count >= MAX_LINKS_PER_POST:
                        break  # Enforce max links per post
            else:
                print(f"⚠️ Failed to fetch: {slug}")
        except Exception as e:
            print(f"❌ Error fetching {slug}: {e}")

    # Output the results
    df_out = pd.DataFrame(results)
    df_out.to_csv(OUTPUT_FILE, index=False)
    print(f"\n✅ Done. Output saved to {OUTPUT_FILE} with {len(results)} rows.")

if __name__ == "__main__":
    main()
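One practical note before running it: the script reads specific column names — donor_url from second-iteration.csv, and Keywords plus URL from pages-to-link.csv. If your exports use different headers, rename them first. Here's a small sketch that writes placeholder files in exactly that shape (the URLs and keywords are dummy values; only the column headers matter):

# Minimal sketch: create example input CSVs in the shape the script expects.
# The URLs and keywords below are placeholders; only the column names matter.
import pandas as pd

# second-iteration.csv: one donor blog post URL per row, in a 'donor_url' column
pd.DataFrame({
    "donor_url": [
        "https://website.com/blog/how-to-audit-internal-links/",
        "https://website.com/blog/technical-seo-checklist/",
    ]
}).to_csv("second-iteration.csv", index=False)

# pages-to-link.csv: 'Keywords' and 'URL' columns mapping anchor phrases to target pages
pd.DataFrame({
    "Keywords": ["internal linking", "content audit"],
    "URL": [
        "https://website.com/internal-linking-guide/",
        "https://website.com/content-audit-service/",
    ],
}).to_csv("pages-to-link.csv", index=False)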
This is where GPT comes into play. Using the CSV from Step 1, each identified context and keyword is passed to GPT via OpenAI’s API. GPT acts as an editor, rewriting paragraphs slightly to insert links naturally and contextually.
Here’s the full implementation:
import pandas as pd
import openai
import time
import logging
import csv
openai.api_key = 'your-openai-api-key'  # note: this script uses the pre-1.0 openai SDK interface (openai.ChatCompletion)
INPUT_FILE = 'contextual_link_candidates.csv'
OUTPUT_FILE = 'link_inserted_output.csv'
MODEL = 'gpt-4o-mini'
DELAY = 1.2
RETRIES = 3
logging.basicConfig(filename='link_insertion.log', level=logging.INFO)
def build_prompt(keyword, link_to, paragraph, blog_title):
    return (f"You're an SEO editor. Naturally insert the link [{keyword}]({link_to}) "
            f"into the following paragraph from '{blog_title}' without adding new sentences:\n\n{paragraph}")

df = pd.read_csv(INPUT_FILE)

with open(OUTPUT_FILE, 'a', newline='', encoding='utf-8') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=[
        "blog_url", "blog_title", "keyword", "link_to", "original_paragraph", "revised_paragraph"
    ])
    if csvfile.tell() == 0:
        writer.writeheader()

    for row in df.itertuples():
        prompt = build_prompt(row.keyword, row.link_to, row.original_paragraph, row.blog_title)
        success, retries = False, 0
        while not success and retries < RETRIES:
            try:
                response = openai.ChatCompletion.create(
                    model=MODEL,
                    messages=[{"role": "user", "content": prompt}],
                    temperature=0.7, max_tokens=400
                )
                revised_paragraph = response.choices[0].message.content.strip()
                success = True
                writer.writerow({
                    "blog_url": row.blog_url,
                    "blog_title": row.blog_title,
                    "keyword": row.keyword,
                    "link_to": row.link_to,
                    "original_paragraph": row.original_paragraph,
                    "revised_paragraph": revised_paragraph
                })
                logging.info(f"Inserted link for {row.blog_url} using keyword '{row.keyword}'")
            except Exception as e:
                retries += 1
                logging.warning(f"Retry {retries} failed: {e}")
                time.sleep(DELAY * retries)
        time.sleep(DELAY)
print("GPT-powered link insertion complete.")
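One thing I'd recommend before publishing anything GPT returns: a quick sanity check. The snippet below isn't part of the pipeline above, just a minimal sketch of one way to flag suspicious rows for an editor (the rows_to_review.csv filename and the 1.3x length threshold are arbitrary choices of mine). It checks that each revised paragraph contains exactly one markdown link to the target URL and hasn't grown far beyond the original.

# Optional sanity check: flag GPT output that looks off before anyone hits publish.
import pandas as pd

df = pd.read_csv("link_inserted_output.csv")

def needs_review(row):
    revised = str(row["revised_paragraph"])
    original = str(row["original_paragraph"])
    # Expect exactly one markdown link pointing at the target URL
    if revised.count(f"]({row['link_to']})") != 1:
        return True
    # The prompt forbids new sentences, so the paragraph shouldn't grow much
    return len(revised) > 1.3 * len(original)

df["needs_review"] = df.apply(needs_review, axis=1)
df[df["needs_review"]].to_csv("rows_to_review.csv", index=False)
print(f"{int(df['needs_review'].sum())} of {len(df)} rows flagged for manual review.")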
This setup completely transformed our internal linking strategy. We can now scale internal linking effortlessly across hundreds of articles, dramatically boosting our SEO performance while maintaining great UX.
Give this a try on your blog or reach out if you need help adapting it. This isn’t just automation—it’s smart SEO scaling, done right.