Internal linking—it sounds easy, right? Just drop a link here, add an anchor there, and move on to the next article. Simple enough for a handful of posts. But when you’re managing hundreds or even thousands of articles, that “simple” task quickly turns into an SEO nightmare.
Like most digital marketers, I’ve wrestled with manually inserting internal links more times than I care to admit. You know the drill: find the keywords, find suitable content, add the link. It’s essential for SEO, crucial for user experience, and yet completely unscalable.
But automation isn’t easy either. Most off-the-shelf tools simply wrap a link around every keyword match they find, cluttering the page and harming readability. I needed something smarter: an approach that understands context, respects readability, and scales effortlessly. That’s exactly why I turned to GPT.
In this guide, I’ll share exactly how I built a GPT-powered internal linking pipeline that scales beautifully, maintains editorial quality, and avoids spammy link insertion.
You’ve probably experimented with automation scripts or plugins that “auto-insert” internal links. And you probably weren’t thrilled with the results. These solutions tend to blindly match keywords without context, creating a mess of unnatural links.
What if, instead of keyword-stuffing links, you had an AI-powered editor that could insert links as naturally and contextually as a skilled human editor would? That’s exactly what GPT offers.
This isn’t AI hype—it’s a very practical, effective application of generative AI.
Here’s exactly how I built this system, step-by-step, to help you replicate this for your own blog or clients.
The first script fetches existing blog posts via the WordPress REST API and systematically breaks each post into paragraph and table blocks. It then scans that content against a keyword list, looking for exact, in-context matches. The key here? Context, not just keywords.
Here’s the fully explained, production-ready code:
"""
================================================================================
🧩 Script: Contextual Link Opportunity Finder for WordPress Blog Content
--------------------------------------------------------------------------------
🎯 PURPOSE:
This Python script is designed to automate the identification of *contextually relevant*
internal linking opportunities within blog content hosted on a WordPress website (via REST API).
It assists in large-scale internal linking by scanning blog posts for paragraphs that mention
specific keywords, and then pairing those paragraphs with relevant target URLs.
🔍 HOW IT WORKS:
1. Loads a list of blog post URLs (`DONOR_FILE`) and a list of keywords with their respective
target URLs (`KEYWORD_FILE`).
2. Uses the WordPress REST API to fetch each post's HTML content based on its slug.
3. Parses out <p> (paragraph) and <table> blocks using BeautifulSoup.
4. For each keyword, checks whether it's present in any paragraph or in/around a table block.
5. If found, extracts the paragraph as a suitable anchor context for linking.
6. Enforces a max limit of contextual links per post (`MAX_LINKS_PER_POST`) to avoid spammy linking.
7. Results (blog URL, keyword, matched paragraph, target URL) are written to a CSV (`OUTPUT_FILE`).
⚙️ USE CASES:
• SEO teams running interlinking sprints at scale
• Automating content audits for contextual linking gaps
• Feeding this CSV into a generative link rewriter or publishing system
📁 INPUT FILES:
- `second-iteration.csv`: Contains blog post URLs (donors).
- `pages-to-link.csv`: Contains keywords and the target URLs to link to.
📁 OUTPUT FILE:
- `contextual_link_candidates.csv`: Each row suggests a keyword → paragraph match where
the keyword naturally appears in the post content.
🛡️ SAFEGUARDS:
- Skips paragraphs with fewer than 4 words to reduce noise.
- Avoids creating more than 3 links per post to prevent over-optimization.
- Uses regex word-boundary matching to avoid partial word collisions.
- Includes logic to fall back to nearby paragraphs if a keyword is found inside a <table>.
🛠️ TO EXTEND:
- Add GPT scoring for confidence-based ranking.
- Group keyword variants using fuzzy or semantic matching.
- Use append mode and logging for large-scale batch runs.
Author: Daniel Sylvester Antony
Date: 2025
================================================================================
"""
import pandas as pd
import requests
import time
import re
from bs4 import BeautifulSoup
from urllib.parse import urlparse
from base64 import b64encode
# === CONFIG ===
USERNAME = ''        # WordPress username (fill in before running)
APP_PASSWORD = ''    # WordPress application password for that user
WP_API_BASE = "https://website.com/blog/wp-json/wp/v2"
auth_str = f"{USERNAME}:{APP_PASSWORD}"
auth_header = {'Authorization': 'Basic ' + b64encode(auth_str.encode()).decode('utf-8')}
DONOR_FILE = 'second-iteration.csv'
KEYWORD_FILE = 'pages-to-link.csv'
OUTPUT_FILE = 'contextual_link_candidates.csv'
MAX_POSTS = 100
MAX_LINKS_PER_POST = 3
DELAY_SECONDS = 1.5
# === HELPERS ===
# Extracts the slug from a blog post URL
def extract_slug(url):
    parsed = urlparse(url)
    return parsed.path.strip('/').split('/')[-1]

# Pulls all paragraph and table blocks from HTML content
def extract_blocks(html):
    soup = BeautifulSoup(html, "html.parser")
    blocks = []
    for el in soup.find_all(['p', 'table']):
        tag_type = el.name
        content = el.get_text().strip().replace('\n', ' ')
        if content:
            blocks.append((tag_type, content))
    return blocks

# Checks if the exact keyword (word-boundary match) exists in a given text
def keyword_in_text(keyword, text):
    pattern = r'\b' + re.escape(keyword) + r'\b'
    return re.search(pattern, text, flags=re.IGNORECASE)
# === MAIN LOGIC ===
def main():
    interlink_df = pd.read_csv(DONOR_FILE)
    pages_df = pd.read_csv(KEYWORD_FILE)
    donor_urls = interlink_df['donor_url'].dropna().unique().tolist()[:MAX_POSTS]
    results = []

    for url in donor_urls:
        slug = extract_slug(url)
        print(f"🔍 Fetching: {slug}")
        try:
            response = requests.get(f"{WP_API_BASE}/posts?slug={slug}", headers=auth_header)
            time.sleep(DELAY_SECONDS)
            if response.status_code == 200 and response.json():
                post = response.json()[0]
                content_html = post['content']['rendered']
                blocks = extract_blocks(content_html)
                seen_pairs = set()
                post_link_count = 0

                for keyword_row in pages_df.itertuples():
                    keyword = str(keyword_row.Keywords).strip().lower()
                    target_url = keyword_row.URL

                    for idx, (block_type, text) in enumerate(blocks):
                        if len(text.split()) < 4:
                            continue  # Ignore short or empty paragraphs

                        pair_key = (url, keyword)
                        if pair_key in seen_pairs:
                            break  # Avoid repeating the same link suggestion for the same keyword

                        if keyword_in_text(keyword, text):
                            if block_type == 'table':
                                # Check surrounding paragraphs if the keyword appears inside a table
                                nearby_blocks = []
                                if idx > 0 and blocks[idx-1][0] == 'p':
                                    nearby_blocks.append(('p_near_table', blocks[idx-1][1]))
                                if idx < len(blocks)-1 and blocks[idx+1][0] == 'p':
                                    nearby_blocks.append(('p_near_table', blocks[idx+1][1]))
                                for para_type, para_text in nearby_blocks:
                                    results.append({
                                        'blog_url': url,
                                        'blog_title': post['title']['rendered'],
                                        'keyword': keyword,
                                        'link_to': target_url,
                                        'paragraph_type': para_type,
                                        'original_paragraph': para_text
                                    })
                                seen_pairs.add(pair_key)
                                post_link_count += 1
                                break
                            elif block_type == 'p':
                                results.append({
                                    'blog_url': url,
                                    'blog_title': post['title']['rendered'],
                                    'keyword': keyword,
                                    'link_to': target_url,
                                    'paragraph_type': 'body_paragraph',
                                    'original_paragraph': text
                                })
                                seen_pairs.add(pair_key)
                                post_link_count += 1
                                break

                    if post_link_count >= MAX_LINKS_PER_POST:
                        break  # Enforce max links per post
            else:
                print(f"⚠️ Failed to fetch: {slug}")
        except Exception as e:
            print(f"❌ Error fetching {slug}: {e}")

    # Output the results
    df_out = pd.DataFrame(results)
    df_out.to_csv(OUTPUT_FILE, index=False)
    print(f"\n✅ Done. Output saved to {OUTPUT_FILE} with {len(results)} rows.")

if __name__ == "__main__":
    main()
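One practical note before running it: the script reads specific column names — donor_url from second-iteration.csv, and Keywords plus URL from pages-to-link.csv. If your exports use different headers, rename them first. Here's a small sketch that writes placeholder files in exactly that shape (the URLs and keywords are dummy values; only the column headers matter):

# Minimal sketch: create example input CSVs in the shape the script expects.
# The URLs and keywords below are placeholders; only the column names matter.
import pandas as pd

# second-iteration.csv: one donor blog post URL per row, in a 'donor_url' column
pd.DataFrame({
    "donor_url": [
        "https://website.com/blog/how-to-audit-internal-links/",
        "https://website.com/blog/technical-seo-checklist/",
    ]
}).to_csv("second-iteration.csv", index=False)

# pages-to-link.csv: 'Keywords' and 'URL' columns mapping anchor phrases to target pages
pd.DataFrame({
    "Keywords": ["internal linking", "content audit"],
    "URL": [
        "https://website.com/internal-linking-guide/",
        "https://website.com/content-audit-service/",
    ],
}).to_csv("pages-to-link.csv", index=False)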
This is where GPT comes into play. Using the CSV from Step 1, each identified context and keyword is passed to GPT via OpenAI’s API. GPT acts as an editor, rewriting paragraphs slightly to insert links naturally and contextually.
Here’s the full implementation:
import pandas as pd
import openai
import time
import logging
import csv
openai.api_key = 'your-openai-api-key'  # note: this script uses the pre-1.0 openai SDK interface (openai.ChatCompletion)
INPUT_FILE = 'contextual_link_candidates.csv'
OUTPUT_FILE = 'link_inserted_output.csv'
MODEL = 'gpt-4o-mini'
DELAY = 1.2
RETRIES = 3
logging.basicConfig(filename='link_insertion.log', level=logging.INFO)
def build_prompt(keyword, link_to, paragraph, blog_title):
    return (f"You're an SEO editor. Naturally insert the link [{keyword}]({link_to}) "
            f"into the following paragraph from '{blog_title}' without adding new sentences:\n\n{paragraph}")

df = pd.read_csv(INPUT_FILE)

with open(OUTPUT_FILE, 'a', newline='', encoding='utf-8') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=[
        "blog_url", "blog_title", "keyword", "link_to", "original_paragraph", "revised_paragraph"
    ])
    if csvfile.tell() == 0:
        writer.writeheader()

    for row in df.itertuples():
        prompt = build_prompt(row.keyword, row.link_to, row.original_paragraph, row.blog_title)
        success, retries = False, 0
        while not success and retries < RETRIES:
            try:
                response = openai.ChatCompletion.create(
                    model=MODEL,
                    messages=[{"role": "user", "content": prompt}],
                    temperature=0.7, max_tokens=400
                )
                revised_paragraph = response.choices[0].message.content.strip()
                success = True
                writer.writerow({
                    "blog_url": row.blog_url,
                    "blog_title": row.blog_title,
                    "keyword": row.keyword,
                    "link_to": row.link_to,
                    "original_paragraph": row.original_paragraph,
                    "revised_paragraph": revised_paragraph
                })
                logging.info(f"Inserted link for {row.blog_url} using keyword '{row.keyword}'")
            except Exception as e:
                retries += 1
                logging.warning(f"Retry {retries} failed: {e}")
                time.sleep(DELAY * retries)
        time.sleep(DELAY)
print("GPT-powered link insertion complete.")
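One thing I'd recommend before publishing anything GPT returns: a quick sanity check. The snippet below isn't part of the pipeline above, just a minimal sketch of one way to flag suspicious rows for an editor (the rows_to_review.csv filename and the 1.3x length threshold are arbitrary choices of mine). It checks that each revised paragraph contains exactly one markdown link to the target URL and hasn't grown far beyond the original.

# Optional sanity check: flag GPT output that looks off before anyone hits publish.
import pandas as pd

df = pd.read_csv("link_inserted_output.csv")

def needs_review(row):
    revised = str(row["revised_paragraph"])
    original = str(row["original_paragraph"])
    # Expect exactly one markdown link pointing at the target URL
    if revised.count(f"]({row['link_to']})") != 1:
        return True
    # The prompt forbids new sentences, so the paragraph shouldn't grow much
    return len(revised) > 1.3 * len(original)

df["needs_review"] = df.apply(needs_review, axis=1)
df[df["needs_review"]].to_csv("rows_to_review.csv", index=False)
print(f"{int(df['needs_review'].sum())} of {len(df)} rows flagged for manual review.")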
This setup completely transformed our internal linking strategy. We can now scale internal linking effortlessly across hundreds of articles, dramatically boosting our SEO performance while maintaining great UX.
Give this a try on your blog or reach out if you need help adapting it. This isn’t just automation—it’s smart SEO scaling, done right.