$1,000,000 homepage statistics
If you have read about the internet culture of 2005, you might know about the Million Dollar Homepage. Its Wikipedia article states that “by 2017, most links have suffered from link rot”. But 2017 was almost a decade ago, so I wanted to check how much worse things have gotten by 2026.
After running a Python script to scrape the Homepage’s HTML and manually filtering out some garbage data, I arrived at 2804 actual links.
To be honest, instead of checking each individual link, I just checked its domain (or whatever urllib calls the netloc). I should also make clear that each page had only one second to give a response.
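To make that concrete, here is roughly what a single check looks like (example.com is just a placeholder, not an actual entry from the map):

```python
import requests
from urllib.parse import urlparse

url = "http://example.com/some/deep/page?x=1"  # placeholder, not a real entry
domain = urlparse(url).netloc                  # -> "example.com": only the domain gets checked
try:
    status = requests.head("https://" + domain, timeout=1).status_code
except requests.exceptions.Timeout:
    status = None                              # no response within the 1-second limit
print(domain, status)
```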
| What happened? | Count | % of links | Area in pixels |
|---|---|---|---|
| Working pages (200 or 202) | 438 | 15.6 | 124000 |
| Redirects (3xx) | 417 | 14.9 | 155100 |
| Client errors (400–429) | 471 | 16.8 | 174000 |
| Server errors (5xx) | 26 | 0.9 | 8100 |
| Timeouts (no response within 1 second) | 584 | 20.8 | 235000 |
| Other errors on my end | 866 | 30.9 | 260100 |
| Other (read below) | 2 | 0.1 | 300 |
| Total | 2804 | 100.0 | ??????? |
Counting every region in the data (including the ones described below), the total area should add up to 1000² = 1000000 pixels, yet for some reason it doesn’t: it comes out 2800 larger. I suspect that somewhere in the data some pixels are owned by two people simultaneously, but I cannot back this up yet (a possible way to check is sketched after the code below).
There are also some regions of the Homepage not counted in the table above: some pixels were bought but never used, some were marked as “pending order”, and some were even suspended! The total area of pixels in one of those states is 46200.
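For the record, here is the arithmetic behind that overshoot, using only the numbers from the table and the paragraph above:

```python
# areas from the table rows, top to bottom, plus the unused/pending/suspended pixels
table_areas = [124000, 155100, 174000, 8100, 235000, 260100, 300]
other_areas = 46200
print(sum(table_areas))                # 956600
print(sum(table_areas) + other_areas)  # 1002800, i.e. 2800 more than 1000000
```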
Interestingly, two domains returned unusual response codes: 456 and 466. MDN provides no information about them, and, as you might have guessed, they are non-standard.
The code I used
The calculations themselves were done in Microsoft Excel.

```python
import bs4
from urllib.parse import urlparse, unquote

data = {}

# homepage.html is produced by keying ⌃S on http://www.milliondollarhomepage.com/
with open("homepage.html", 'r', encoding="utf-8") as f:
    soup = bs4.BeautifulSoup(f, 'lxml')
    table = soup.table
    for area in table.find_all("area"):
        archived = False
        try:
            # archived entries keep the original URL in a data attribute
            link = area["data-original-url"]
            archived = True
        except KeyError:
            link = area["href"]
        original_link = link
        link = urlparse(link).netloc
        if not link.startswith("www"):
            # decode percent-escapes in entries that aren't ordinary www domains
            link = unquote(link)
        # coords are "x1,y1,x2,y2" of the rectangle on the 1000x1000 grid
        coords = tuple(map(int, area["coords"].split(",")))
        size = (coords[2] - coords[0]) * (coords[3] - coords[1])
        title = area["title"]
        if link not in data:
            data[link] = {}
            data[link]["size"] = size
            data[link]["titles"] = {title}
            data[link]["archived"] = archived
            # data[link]["coords"] = coords
            data[link]["links"] = {original_link}
        else:
            data[link]["size"] += size
            data[link]["links"].add(original_link)
            data[link]["titles"].add(title)
        if link.lower() == "pending order":
            data[link]["status"] = "pending order"
        if "Paid & Reserved" in link:
            data[link]["status"] = "wasted money"
        if link == "link suspended":
            data[link]["status"] = "suspended"

# the actual link check: a HEAD request to each domain with a 1-second timeout
# import requests as r
# headers = {"User-Agent": "Mozilla/5.0"}
# for url in data:
#     if "status" in data[url]:
#         continue
#     p = 1337
#     try:
#         response = r.head("https://" + url, headers=headers, timeout=1)
#         p = response.status_code
#     except r.exceptions.Timeout:
#         p = 0
#     except Exception as e:
#         p = -1
#     print(url, p)

# weird symbol just so I could split the text into columns in Excel
for url in data:
    print(url + "◊" + str(data[url]["size"]))
```
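And, since I mentioned the double-ownership suspicion earlier: here is a minimal sketch of how it could be checked, reusing `table` from the script above. I haven’t run this against the full data yet, so treat it as an idea rather than a result.

```python
# compare every pair of <area> rectangles and report the ones that intersect
rects = []
for area in table.find_all("area"):
    x1, y1, x2, y2 = map(int, area["coords"].split(","))
    rects.append((x1, y1, x2, y2, area.get("title", "")))

for i in range(len(rects)):
    for j in range(i + 1, len(rects)):
        a, b = rects[i], rects[j]
        # two rectangles overlap if their projections overlap on both axes
        if a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]:
            print("overlap:", a[4], "|", b[4])
```

Any pair it reports would mean some pixels are counted twice, which would explain at least part of that 2800-pixel surplus.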