enky’s essays

$1,000,000 homepage statistics

If you have read about the internet culture of 2005, you might know about the Million Dollar Homepage. Its Wikipedia article states that “by 2017, most links have suffered from link rot”. But 2017 was almost a decade ago, so I wanted to check how much worse it has gotten in 2026.

After scraping the Homepage’s HTML with a Python script and manually filtering out some garbage data, I arrived at 2804 actual links.

To be honest, instead of checking each individual link, I only checked its domain (or whatever urllib calls the netloc). I should also make clear that each page had only one second to respond.
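
In practice, “checking a domain” boils down to something like this (example.com is just a placeholder; the full script is at the end of the post):

from urllib.parse import urlparse
import requests

# only the domain (netloc) is kept, not the full path
netloc = urlparse("http://www.example.com/some/deep/page").netloc  # "www.example.com"

# one HEAD request per domain, with a one-second timeout
try:
	status = requests.head("https://" + netloc, timeout=1).status_code
except requests.exceptions.Timeout:
	status = 0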

What happened?                 Count      %   Area in pixels
Working pages (200 or 202)       438   15.6           124000
Redirects (3xx)                  417   14.9           155100
Not found (400–429)              471   16.8           174000
Server errors (5xx)               26    0.9             8100
Time-related errors              584   20.8           235000
Other errors on my end           866   30.9           260100
Other (read below)                 2    0.1              300
Total                           2804  100.0          ???????

The total area in pixels should add up to 1000² = 1000000, yet for some reason it doesn’t (it’s 2800 larger than that). I suspect that somewhere in the data some pixels are owned by two people simultaneously, but I cannot back this up yet.
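
Backing it up would mean comparing every pair of <area> rectangles from the same homepage.html and flagging the ones that intersect; an untested sketch of that check:

import bs4

with open("homepage.html", 'r', encoding="utf-8") as f:
	soup = bs4.BeautifulSoup(f, 'lxml')

rects = [tuple(map(int, a["coords"].split(","))) for a in soup.table.find_all("area")]

def overlaps(a, b):
	# axis-aligned rectangles (x1, y1, x2, y2) overlap
	# iff they intersect on both the x and the y axis
	return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

for i in range(len(rects)):
	for j in range(i + 1, len(rects)):
		if overlaps(rects[i], rects[j]):
			print("overlap:", rects[i], rects[j])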

There are also some regions of the Homepage not counted in the table above. Some pixels were bought yet never used, some were marked as “pending order”, and some were even suspended! The total area of pixels with one of these statuses is 46200.
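
Spelled out, the 2800 discrepancy is just the areas in the table plus the 46200 above, compared with 1000²:

# per-row areas from the table above, in pixels
checked = [124000, 155100, 174000, 8100, 235000, 260100, 300]
unchecked = 46200  # never used, pending order, suspended

total = sum(checked) + unchecked
print(total)                # 1002800
print(total - 1000 * 1000)  # 2800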

Interestingly, two domains returned unusual response codes: 456 and 466. MDN does not provide information about them, and they are, as you might’ve guessed, non-standard.

The code I used

The calculations were done in Microsoft Excel.
import bs4
from urllib.parse import urlparse, unquote

data = {}

# homepage.html is produced by pressing ⌃S (save page) on http://www.milliondollarhomepage.com/

with open("homepage.html", 'r', encoding="utf-8") as f:
	soup = bs4.BeautifulSoup(f, 'lxml')

table = soup.table
for area in table.find_all("area"):
	link = ""
	archived = False
	try:
		# some areas carry the original URL in a data-original-url attribute
		link = area["data-original-url"]
		archived = True
	except KeyError:
		# otherwise fall back to the plain href
		link = area["href"]
	original_link = link
	link = urlparse(link).netloc
	if not link.startswith("www"):
		link = unquote(link)

	coords = tuple(map(int, area["coords"].split(",")))
	size = (coords[2]-coords[0])*(coords[3]-coords[1])

	title = area["title"]

	if link not in data:
		data[link] = {}
		data[link]["size"] = size
		data[link]["titles"] = {title}
		data[link]["archived"] = archived
		#data[link]["coords"] = coords
		data[link]["links"] = {original_link}
	else:
		data[link]["size"] += size
		data[link]["links"].add(original_link)
		data[link]["titles"].add(title)

	if link.lower() == "pending order":
		data[link]["status"] = "pending order"
	if "Paid & Reserved" in link:
		data[link]["status"] = "wasted money"
	if link == "link suspended":
		data[link]["status"] = "suspended"

# import requests as r

# headers = {"User-Agent": "Mozilla/5.0"}

# for url in data:
# 	if "status" in data[url]:
# 		continue
# 	p = 1337
# 	try:
# 		response = r.head("https://" + url, headers=headers, timeout=1)
# 		p = response.status_code
# 	except r.exceptions.Timeout:
# 		p = 0
# 	except Exception as e:
# 		p = -1
# 	print(url, p)

# weird symbol just so I could split the text into columns in Excel
for url in data:
	print(url + "◊" + str(data[url]["size"]))

#code #old #python