Hidden in HTML: Parsing Page Types. 2.9B Web Page Analysis

Author: Ben Wills

Published: January 22, 2025 12:00 pm

Last Updated: January 22, 2025 12:57 pm

Author Bio: I am the author of WebParsing.io. I've been writing HTML, C, C++, and PHP for over 30 years, and parsing HTML for more than 20.
Contact me here to discuss working together.

Hidden in HTML Article Series:

Parsing Page Layouts
Parsing Page Types (You’re reading this now)
Parsing URL Patterns (Coming by 2025.01.29)
Parsing Structured Data (Coming by 2025.02.05)
Parsing HTTP Headers (Coming by 2025.02.12)
Parsing Article Meta (Coming by 2025.02.19)
Parsing Technology Used (Coming by 2025.02.26)
Writing an Article Parser (Coming by 2025.03.05)

Supplemental Posts

Downloading and Parsing at Scale (Coming by 2025.03.12)

Hidden in HTML: Parsing Page Types
Table of Contents

Series Overview
Post Overview
Checking Common Classes and IDs
URL Analysis Technique
A Real-World Use Case
Exploring the A.Href Attribute
Standard Website Pages
User Account Pages
E-Commerce/Product Shopping Pages
Blog and News Pages
Hired/SaaS Services Pages
Community Pages
Multimedia and Events Pages
Research Pages
Advertising Pages
Legal Pages
Available for Hire
All Tag.Attribute Value Data
Get the Free Database

Series Overview

The purpose of this series is to work out how to build better HTML parsers. Because HTML is inherently unstructured, we have to do more detailed work to accurately identify the structure of a page.

HTML5 added semantic tags, but in practice they still don’t give those of us who parse HTML enough to know exactly how to parse a document. So we have to look deeper, at things like:

What the URL looks like
What the HTTP headers are
What the HTML tag attribute values are (e.g., div.id, div.class, main.id)

We have to look closely at all of these to figure out how to write better parsers.

Post Overview

The purpose of this particular post is to learn how to identify what type of page we’re looking at as quickly as possible.

If we’re reading a blog, we may need to initiate a different kind of parser than if we’re reading, for example, an e-commerce page. Even within a single e-commerce site, we need to look at a category listing differently than we would look at a product. What we want is to figure out, as quickly and efficiently as possible, what kind of content we’re looking at.

This ends up being a slightly more complicated problem than we’d like because it’s never going to be exact. We still have to guess somewhat, but what we want to do is figure out how we can guess more accurately.

What you’ll see in this post is that parsing the HTML first, which is the common approach, is probably not the best first step. Running a full parser uses a lot of resources, and we can figure out a lot about what kind of page we’re looking at just from the URL.

The URL can get us pretty far pretty quickly. This post shows that attributes like body.class, body.id, or article.id aren’t necessarily as reliable as the URL of the page you’re parsing. To figure out the most common URL patterns, we look at the anchor href attributes.

We are going to walk through exploring this process, and then we’re going to identify the 10 main categories of pages that we find in this particular data set.

Remember that you can download this data set; the link is at the bottom of this post. Note that it is much smaller than the full data set. If you’re interested in purchasing the full data set, reach out and let me know, so I can gauge how much demand there is and how much work to put into cleaning up the database and making it available.

Checking Common Classes and IDs

The first approach with this dataset is to check the body.id and body.class attribute values to see if there are any indicators of what type of page it might be. While we do see a few patterns here, we’ll find more useful indicators elsewhere.

We looked at several HTML elements including:

The body tag’s class attribute
The body tag’s id attribute
Single class values on various other elements
Specific body id values (like “phpbb”, which indicates a phpBB forum)

body.class
body.id
main.id
section.id
div.id

URL Analysis Technique

By analyzing URL substrings, you can get a decent sense of what’s on the page before parsing the HTML. I’ve used this technique on many projects where I have a large URL list: I parse each URL, take the URL path, and tokenize it by directory into n-grams.

A 1-gram would be just one part of the directory, a 2-gram would be two directories (like /legal/terms), a 3-gram would be three parts, and so on. This helps you see what’s common in that set of URLs and maybe what you can skip parsing.
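
As a rough illustration of the n-gram idea, here is a minimal Python sketch. It is not the code used for this analysis; the function names path_ngrams and count_ngrams and the sample URLs are made up for the example, and it assumes that "directory" means each slash-separated segment of the URL path.

```python
from collections import Counter
from urllib.parse import urlsplit


def path_ngrams(url, n):
    """Return every run of n consecutive path segments in a URL."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return ["/" + "/".join(segments[i:i + n])
            for i in range(len(segments) - n + 1)]


def count_ngrams(urls, max_n=3):
    """Count 1-grams through max_n-grams across a list of URLs."""
    counts = Counter()
    for url in urls:
        for n in range(1, max_n + 1):
            counts.update(path_ngrams(url, n))
    return counts


# Hypothetical sample URLs.
urls = [
    "https://example.com/legal/terms",
    "https://example.org/legal/privacy",
    "https://example.net/blog/2025/01/post",
]
counts = count_ngrams(urls)
print(counts.most_common(1))  # [('/legal', 2)]
```

Sorting the counts from most to least common is what surfaces the shared directories, which are the candidates for page-type patterns or for paths you may decide to skip.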

A Real-World Use Case

For example, I extracted about 10 million RSS and Atom feed URLs from the Common Crawl dataset. I wanted to analyze RSS feeds but didn’t want to download 10 million pages. I could handle a million, so I looked at the URL paths to determine which ones to download.

I applied filters like:

Removing paths longer than three subdirectories (most sites’ main RSS feeds are close to the root directory)
Removing patterns for forum feeds, forum thread feeds, and forum discussion feeds
Removing URL patterns for comments (e.g., “comments” or “discussion” in the URL)
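
The three filters above could be sketched roughly like this in Python. This is an illustration only, not the actual filtering code: the pattern lists, the MAX_SEGMENTS name, and the choice to count every path segment (including the file name) toward the limit of three are all assumptions made for the example.

```python
from urllib.parse import urlsplit

# Assumed pattern lists for illustration; the real analysis used many more rules.
FORUM_PATTERNS = ("forum", "thread")
COMMENT_PATTERNS = ("comments", "discussion")
MAX_SEGMENTS = 3  # assumption: count every path segment, file name included


def keep_feed_url(url):
    """Return True if a feed URL survives the three filters."""
    path = urlsplit(url).path.lower()
    segments = [s for s in path.split("/") if s]
    if len(segments) > MAX_SEGMENTS:
        return False  # too deep to be a site's main feed
    if any(p in path for p in FORUM_PATTERNS + COMMENT_PATTERNS):
        return False  # forum or comment feed
    return True


# Hypothetical candidate URLs.
candidates = [
    "https://example.com/feed",
    "https://example.com/forum/feed",
    "https://example.com/a/b/c/d/feed",
    "https://example.com/post/comments/feed",
]
print([u for u in candidates if keep_feed_url(u)])
```

Because each rule only looks at the URL string, a filter like this can run over millions of URLs without downloading anything.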

After analyzing URLs in about 100-200 different ways, I filtered that 10 million URL list down to about 911,000 RSS feeds, which I downloaded for a clean dataset.

This is actually the main way to identify a page type. While you can also look at body tags, class attributes, and id attributes, looking at the URL is going to be the quickest way to tell what kind of page you’re on.

Exploring the A.Href Attribute

Interestingly, we didn’t find strong indicators in main.id, section.id, or div.id that would help determine the page type. The most helpful source turned out to be the a.href attribute, which contains links to other pages where we can identify patterns.
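
To make the a.href idea concrete, here is a small Python sketch that collects href values from a page and counts how many contain each pattern. It uses only the standard library’s html.parser. The HrefCollector class, the helper names, and the sample HTML are invented for the example, and this is not the pipeline that produced the dataset.

```python
from collections import Counter
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def collect_hrefs(html):
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


def count_patterns(hrefs, patterns):
    """Count how many hrefs contain each pattern (case-insensitive)."""
    counts = Counter()
    for href in hrefs:
        lowered = href.lower()
        for pattern in patterns:
            if pattern in lowered:
                counts[pattern] += 1
    return counts


# Hypothetical page fragment.
html = ('<nav><a href="/about">About</a><a href="/Contact-Us">Contact</a>'
        '<a href="/blog/about-rss">Feeds</a><a>No link</a></nav>')
hrefs = collect_hrefs(html)
print(count_patterns(hrefs, ("about", "contact", "blog")))
```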

For the rest of this post, we’ll explore ten different URL patterns found on websites that indicate page types:

Standard Website Pages
User Account Pages
E-Commerce/Product Shopping Pages
Blog and News Pages
Hired/SaaS Services Pages
Community Pages
Multimedia and Events Pages
Research Pages
Advertising Pages
Legal Pages

The post will examine each of these sections in detail, looking at different link patterns found in the Common Crawl dataset. Through this analysis, we’ll identify approximately 80-100 different page types.

Let me walk you through the different sections of standard web pages that you’d find across websites. Looking at the a.href, we have about 15 or 16 common patterns:

Standard Website Pages

This set of URL components reflects web pages that you’d find across virtually any site on the internet. Every site is going to have a home page, and many sites will have contact pages, about pages, feedback pages, FAQs, search pages, and similar standard pages.

This section identifies common pages just by looking at the anchor URLs. You can use these patterns to write parsers that look for specific pages, or to identify pages you want to exclude from parsing.

a.href = search
a.href = contact
a.href = about
a.href = faq
a.href = sitemap
a.href = home
a.href = help
a.href = feedback
a.href = jobs
a.href = career
a.href = newsletter
a.href = subscr
a.href = support
a.href = accessibility
a.href = press
a.href = media
a.href = locations
a.href = testimonial
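
One way to use the patterns above to exclude pages from parsing is sketched below in Python. The function names are made up for the example. Matching is done against the URL path only, which is an assumption, so that a hostname such as media.example.com does not trigger a match.

```python
from urllib.parse import urlsplit

# The a.href patterns listed above.
STANDARD_PATTERNS = (
    "search", "contact", "about", "faq", "sitemap", "home", "help",
    "feedback", "jobs", "career", "newsletter", "subscr", "support",
    "accessibility", "press", "media", "locations", "testimonial",
)


def standard_page_type(url):
    """Return the first standard-page pattern found in the path, or None."""
    path = urlsplit(url).path.lower()
    for pattern in STANDARD_PATTERNS:
        if pattern in path:
            return pattern
    return None


def exclude_standard_pages(urls):
    """Keep only URLs that do not look like standard site pages."""
    return [u for u in urls if standard_page_type(u) is None]


print(standard_page_type("https://example.com/about-us"))  # about
print(exclude_standard_pages([
    "https://example.com/faq",
    "https://example.com/products/widget",
]))
```

Plain substring matching is fast but over-matches. For example, “press” also matches a path like /express-shipping, so it is worth reviewing the matches before relying on them.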

User Account Pages

Sites with user accounts typically have a set of pages for signing up, logging in, password recovery, and adjusting user profiles. When you find these links, particularly in the top or bottom navigation (see the previous post on parsing page layouts), it can tell you a lot about the site.

For example, finding a login link on the same hostname or domain is a good indicator that the site has authentication. More importantly, finding a sign-up or register link indicates users can create accounts. This might seem like an unimportant distinction, but a website where users can register is very different from one where they can’t.

Many blogs don’t have register/sign-up links, but sites that sell services, take payments, or host communities typically do. So this section helps identify not just specific pages, but gives hints about the type of site you’re analyzing.

a.href = login
a.href = sign
a.href = register
a.href = account
a.href = user
a.href = profile
a.href = join
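
The same-hostname check described above might look something like this in Python. It is a sketch under several assumptions. The split into login and sign-up pattern lists is mine, since the list above uses the single stem “sign” for both. The account_signals name is hypothetical. Only an exact hostname match is checked, not the broader “same domain” case.

```python
from urllib.parse import urljoin, urlsplit

# Assumed split of the patterns into login vs. sign-up variants.
LOGIN_PATTERNS = ("login", "signin", "sign-in", "account")
SIGNUP_PATTERNS = ("register", "join", "signup", "sign-up")


def account_signals(page_url, hrefs):
    """Report whether a page links to login or sign-up pages on its own host."""
    host = urlsplit(page_url).hostname
    signals = {"has_login": False, "has_signup": False}
    for href in hrefs:
        parts = urlsplit(urljoin(page_url, href))  # resolve relative links
        if parts.hostname != host:
            continue  # skip links that point at other sites
        path = parts.path.lower()
        if any(p in path for p in SIGNUP_PATTERNS):
            signals["has_signup"] = True
        elif any(p in path for p in LOGIN_PATTERNS):
            signals["has_login"] = True
    return signals


# Hypothetical page URL and links.
print(account_signals(
    "https://shop.example.com/",
    ["/login", "https://other.example.net/register", "/blog"],
))
```

In this example the register link lives on a different hostname, so only the login signal is set.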

E-Commerce/Product Shopping Pages

This section focuses specifically on e-commerce sites selling products (we’ll cover services later in the article). The most common indicator you’ll find is “cart” in the URL – this is how most sites name their shopping cart pages. You’ll also find checkout links, order links, returns pages, and similar e-commerce patterns.

These URL patterns help identify both specific types of pages and the overall nature of the site. For example, if you find a cart in the top navigation, or order/delivery/returns links in the footer, there’s a good chance the site is selling products.

Other common e-commerce URL patterns include:

Collections (often used as a subdirectory, like collections/bags for handbags)
Wishlist/favorites (for saving items for later)
Delivery pages (containing shipping policies, delivery times, etc.)
Return pages (with return and refund policies)
Browse pages (similar to catalog views)

a.href = cart
a.href = basket
a.href = collection
a.href = catalog
a.href = review
a.href = product
a.href = wish
a.href = checkout
a.href = ship
a.href = sale
a.href = brand
a.href = favorite
a.href = order
a.href = deliver
a.href = return
a.href = browse
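
To turn these patterns into a rough site-level signal, one option is to count how many distinct e-commerce patterns show up across a page’s links. The Python sketch below does that. The function names and the threshold of three distinct patterns are arbitrary assumptions for illustration, not values taken from the dataset.

```python
# The a.href patterns listed above.
ECOMMERCE_PATTERNS = (
    "cart", "basket", "collection", "catalog", "review", "product",
    "wish", "checkout", "ship", "sale", "brand", "favorite", "order",
    "deliver", "return", "browse",
)


def ecommerce_patterns_found(hrefs):
    """Return the set of e-commerce patterns present in any href."""
    found = set()
    for href in hrefs:
        lowered = href.lower()
        found.update(p for p in ECOMMERCE_PATTERNS if p in lowered)
    return found


def looks_like_store(hrefs, min_patterns=3):
    """Guess that a site sells products if enough distinct patterns appear."""
    return len(ecommerce_patterns_found(hrefs)) >= min_patterns


# Hypothetical navigation and footer links.
links = ["/cart", "/collections/bags", "/pages/shipping-returns"]
print(sorted(ecommerce_patterns_found(links)))
print(looks_like_store(links))
```

Counting distinct patterns rather than total matches keeps a single repeated link, such as a cart icon on every product tile, from dominating the score.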

Blog and News Pages

Some sites are purely blogs or news sites, while others simply have a blog or news section. These patterns could indicate either case – they tell us more about the specific content than about the overall site purpose.

Unsurprisingly, having “blog” in the URL (like domain.com/blog) is an obvious indicator of blog content. Similarly, “/news” could indicate blog posts or news/press releases. Many of these sites also have “/rss” or “/feed” URLs.

For blog-specific patterns, WordPress commonly uses “/archive” for archived posts, along with tags and categories. Author pages are another common blog feature.

When searching for blog content, look for “categ” in URLs to match both “category” and “categories” pages. You’ll also commonly find RSS or feed URLs, which while not HTML pages themselves, are standard features of blog and news sites.

a.href = news
a.href = blog
a.href = tag
a.href = article
a.href = categ
a.href = archive
a.href = author
a.href = rss
a.href = feed
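
Short stems like “tag” and “categ” are convenient, but as plain substrings they also match unrelated words. For example, “tag” appears inside “vintage” and “stage”. One possible refinement, sketched below in Python, is to split the path into words and match each stem only at the start of a word. The blog_stems_in_url name and the choice of separators are assumptions made for the example.

```python
import re
from urllib.parse import urlsplit

# The a.href patterns listed above, treated as word prefixes.
BLOG_STEMS = ("news", "blog", "tag", "article", "categ",
              "archive", "author", "rss", "feed")


def blog_stems_in_url(url):
    """Return the stems that begin some word in the URL path."""
    path = urlsplit(url).path.lower()
    # Split on slashes and common separators so "blog-archive" yields two words.
    words = [w for w in re.split(r"[/\-_.]+", path) if w]
    return sorted({stem for stem in BLOG_STEMS
                   for word in words if word.startswith(stem)})


print(blog_stems_in_url("https://example.com/category/recipes"))  # ['categ']
print(blog_stems_in_url("https://example.com/categories/"))       # ['categ']
print(blog_stems_in_url("https://example.com/vintage/stage"))     # []
```

Prefix matching still over-matches in some cases, since “feed” also begins “feedback” and “news” begins “newsletter”, so the results remain hints rather than certainties.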

Hired/SaaS Services Pages

This section is similar to product shopping pages – finding a services page or pricing page typically indicates a service is being sold. If you only see a pricing page, it might be for a single product or a specific product with variations.

Typically, pricing and services pages indicate you’re looking at a SaaS app, though they could also be for contracting, consulting, or local services like HVAC companies.

a.href = service
a.href = pricing

Community Pages

These patterns indicate some form of user-to-user communication capability. This includes forums, groups (like Facebook groups), and other community features. The most obvious indicator is “forum” in the URL. As mentioned earlier in the article, finding “phpbb” in the body ID attribute is another strong indicator of forum pages.

a.href = forum
a.href = group
a.href = communit

Multimedia and Events Pages

I’ve combined these categories as they’re somewhat similar, falling under the broader category of entertainment. This includes videos and podcasts, as well as “/live”, “/events”, or “/calendar” URLs where there’s content to watch or attend.

If you’re specifically looking to aggregate podcasts or videos that aren’t on YouTube, these are some of the URL patterns to look for while crawling.

Common multimedia features include image galleries and calendar pages. Calendar URLs are particularly relevant for event listings and schedules.

a.href = live
a.href = event
a.href = video
a.href = podcast
a.href = gallery
a.href = calendar

Research Pages

These pages typically contain white papers, research data, or industry publications that companies want to share. You’ll often see both “research” and “publications” in URLs for this type of content.

These URL patterns can help you find more refined and well-researched information. If you’re feeding content to an LLM, you might want to prioritize pages linked from these research or publications sections.

a.href = research
a.href = publication

Advertising Pages

These pages are for promotion, donations, or partnerships. While I’m not necessarily recommending this, it’s worth noting that some use these patterns for SEO purposes – finding sites where you can donate, advertise, or partner to potentially get backlinks.

If you want to monitor new advertising, partnership, or donation opportunities in your industry, you might set up Google Alerts with your main keywords plus “inurl:advertise” or “inurl:donate”. When parsing, you can use these patterns to identify and extract these types of pages.

a.href = advertise
a.href = donate
a.href = partners

Legal Pages

These patterns help find specific legal pages rather than identify site characteristics. If you need to check the legality of scraping a site or similar policies, look for URLs containing “polic” (for policies/policy), “priv” (privacy), or variations of “terms” (terms of service, terms and conditions, etc).

Key legal page patterns include:

Terms pages (terms of service, terms of use, etc.)
Cookie policies (required in many jurisdictions)
Refund policies (relevant for both product and service businesses)

a.href = polic
a.href = priv
a.href = term
a.href = legal
a.href = cookie
a.href = disclaimer
a.href = refund
a.href = warranty
a.href = agreement

Available for Hire

If you have parsing needs that have historically been challenging, or if you have data you want to extract from the CommonCrawl datasets like I’ve done here below…

All Tag.Attribute Value Data

To explore more of the dataset than the sections above cover, you can use the form below. It has access to every tag and attribute value in the database.

Get the Free Database

Throughout this post, you can select attribute values for different tag and attribute combinations, just like you did above.

But the forms on this page only contain the top 250 results.

I’ve also made a larger SQLite database available to you for free, with up to 10,000 values per tag.attribute combination.
