Hidden in HTML: Parsing Page Layouts (A 2.9B Web Page Analysis)

Hidden in HTML Article Series:

  1. Parsing Page Layouts (You’re reading this now)
  2. Parsing Page Types
  3. Parsing URL Patterns (Coming by 2025.01.29)
  4. Parsing Structured Data (Coming by 2025.02.05)
  5. Parsing HTTP Headers (Coming by 2025.02.12)
  6. Parsing Article Meta (Coming by 2025.02.19)
  7. Parsing Technology Used (Coming by 2025.02.26)
  8. Writing an Article Parser (Coming by 2025.03.05)

    Supplemental Posts
  • Downloading and Parsing at Scale (Coming by 2025.03.12)

Hidden in HTML: Parsing Page Layouts

Overview

The purpose of this series is to learn how to more precisely and accurately parse HTML. While HTML is an inherently unstructured language, the way that humans use language absolutely has patterns and structure.

This series builds a foundational understanding of how we write semi-semantic HTML, in order to better build things like generic article parsers, product parsers, news parsers, etc, thereby reducing the amount of site-specific parsing we have to do.

Another benefit of this is to better feed AI/ML/LLMs. If you’re not removing navigation text, for example, your LLM learns to associate whatever text tokens are in the header, side, and footer navigation with the main article text (what you actually want feeding your models), which can significantly reduce the output quality of your model.

Throughout this article series, we’ll look at an analysis of how HTML tags, attributes, and their values are used across 2.9 billion web pages from the November 2024 Common Crawl dataset.

For example, the most common <a>.class attribute value, nav-link, occurs 5,272,736,667 times in the dataset. Elsewhere, the <ul>.class value of nav occurs 1,172,570,814 times, while the <nav>.class value of nav occurs 1/14 as often, at only 82,965,969 times.

By analyzing the frequency of attribute values for specific tags, we can begin to get a sense of how people organize their page layouts, what information might be able to be discarded, what technology is used on the page, what type of page is it (blog post, product listing, forum thread…), etc.

Searching Tag.Attribute Value Data

To get a sense of what I mean by HTML being semi/implied-semantic, I’ll start by showing you what this tag.attribute.value data looks like.

This is the first of several of these forms throughout this post, and it uses all of the tags and attributes in the database. Forms later in the post may be limited to certain tags or attributes, depending on how we want to look at the data more specifically.

Or you can use the select boxes to choose the tag and attribute combinations you’re most curious about.



A 1994 Guide to HTML

All of this started for me over 30 years ago when I discovered View -> Source (or whatever was the equivalent operation at the time) on AOL, Windows 3.1, and our 16 MHz Intel 386 with 3 MB of RAM and a 40 MB hard drive. The Packard Bell 386sx-16. It looked a whole lot like this.

(A couple of years ago, I was contracted to spend a year writing complete ESP32-S3 firmware and a custom networking protocol to interact with a PIC microcontroller, in pure C. It was a very tangible reminder of just how far we’ve come, given that the ESP32 was more powerful than my first computers in every way…but at the size of a postage stamp vs. a large briefcase…)

Completely captivated by the idea that I could so easily see the magic behind the scenes of what made the internet The Internet, I quickly began printing out web page source code and their frontend designs on our dot matrix printer. After nearly filling a 2″ 3-ring binder, I began dissecting patterns between the HTML and what was displayed on the page.

Back then, web pages were simple and HTML was easy. CSS wouldn’t exist for two more years. JavaScript was another year away. The <marquee> tag was magic. Using <marquee> and <blink>, together, probably got you into Mensa. And the <frame> tag completely blew your mind. (“…how is all of *this* displayed with one HTML tag???”)

So at 12/13 years old, I was able to write my own 8-page HTML guide in WordPad. I printed out that HTML guide, stapled it together, and carried that thing around with me for a long time.

30+ years later, and I’m still digging around in HTML source code looking for new patterns I haven’t yet discovered.

But HTML and the web are very different now than they were back then…

HTML…Not Quite Structured or Semantic

HTML, like any other language, has a language specification. If you’re bored, can’t sleep, or are that kind of nerd, here you go: HTML Living Standard

However…

Within that specification, for better and for worse, there are minimal semantic or structural limitations. Especially with CSS, you can make most elements display however you’d like…including replacing all of your <div> tags with <i> tags and declaring <i> as display:block. (Do not do this.) And with both CSS and JavaScript, you can make basically any HTML tag function like any other HTML tag…or at least appear to.

When it comes to parsing HTML, parsing the structure to the HTML spec is easy. It’s basically just a less-strict form of XML. (I say that, ignoring things like block/inline-level elements, etc, and just parsing the XML/HTML text format.)

But, deriving the layout of what you are parsing…is anything but easy.

HTML5 attempted to improve this by adding semantic tags such as <article>, <aside>, <footer>, <header>, <main>, <nav>, <section>, etc. But those tags aren’t necessary, nor are they enforced by the spec. They’re also very easily replaced by <div> tags with the same basic functionality.

From a parser’s perspective, how do I know I’m looking at the article? Or am I looking at a product? Is this #comments ID for blog comments or product reviews? Or are they business reviews? Or am I looking at a forum thread’s replies?

This is what we’re setting out to figure out here: How do we parse and interpret more meaning from HTML, giving us insight into the structure of a web page, without requiring a specific parser for that site/page?

Why Parse HTML?

If you’re building a news aggregator, you may want to extract things like the main article content, author, published date, modified date, etc.

If you’re building something like a shopping or real estate aggregator, you might extract product names, prices, categories, ratings, descriptions, etc.

And now, LLMs are feeding on decades of web pages to build their collective knowledge. To keep the quality of an LLM high, and to avoid watering down the meaning of different tokens, separating navigation from the main content, separating out comments/reviews, and associating tags/categories with a page all have to be considered.

So, what happens when you want to properly parse web pages from across many sites, each with completely different ways of declaring a header, footer, main content, etc? If it were just one site you wanted to parse, creating that parser would be easy. Maybe even if it was 5 or 10 or 15 sites. Depending on what you’re extracting, it might be done in a day or so.

But what if it’s hundreds of sites you want to parse? Or thousands? Or millions?

What if you want to parse every single page on the internet?

To get a sense of the naming patterns people use, we’ll perform a simple analysis of billions of HTML documents, counting the combinations of tags + attributes + attribute_values that we come across.
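To make the counting pass concrete, here’s a minimal sketch using Python’s standard-library HTML parser. The handling of multi-value attributes like class and rel follows the counting approach described in the normalization section; everything else here is illustrative, not the actual pipeline used for the dataset.

```python
from collections import Counter
from html.parser import HTMLParser

class TagAttrCounter(HTMLParser):
    """Count (tag, attribute, value) combinations seen in an HTML document."""
    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value is None:  # bare attributes like <input disabled>
                continue
            # multi-valued attributes: count each space-separated value on its own
            if name in ('class', 'rel'):
                for v in value.split():
                    self.counts[(tag, name, v.lower())] += 1
            else:
                self.counts[(tag, name, value.lower())] += 1

parser = TagAttrCounter()
parser.feed('<ul class="nav"><li><a class="nav-link" rel="nofollow ugc" href="/x">x</a></li></ul>')
```

After feeding a page, `parser.counts` holds entries like `('a', 'class', 'nav-link') -> 1`, which is exactly the shape of data this series is built on, just at a scale of billions of pages.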

As this is an initial exploration of web parsing, we’re going to keep it more on the surface of discovering where naming patterns may exist in HTML, and will begin touching on how to intuit their semantic meaning.

In future posts, we’ll go more into the details of parsing for specific use cases. In addition to going deeper into HTML, we’ll also expand beyond HTML into HTTP headers, JSON-LD, other forms of Microdata, feeds, calendars, and more.

Using this Data for SEO

Given that I’ve been involved in SEO since 2001, I’m not sure I could do an analysis like this without considering the implications for SEO.

The idea here is that you can use the most common layout-defining tags, classes, and IDs to make it easier for search engines to more accurately parse the layout of your pages.

I do want to say that making changes like the ones I outline below isn’t going to get you from the tenth page of results to the first. There are only two reasons to make these kinds of changes to your HTML: 1, you’re starting from scratch, or 2, you’re obsessive.

  • <body>.(single|post|page|article|product|etc) to declare your page type
  • <header>#header for your page header
  • <ul>.nav or <ul>.sub-menu for groups of navigation links
  • <header>.entry-header to contain your article/post title
  • <div>.entry-content to contain your main article/post/page content
  • <div>.entry-meta or <div>.post-meta to contain metadata (author, published/modified dates, etc) for your main article/post/page content
  • <div>#comments to contain comments on your main article/post/page content
  • <aside>#sidebar for your page sidebar
  • <div>.widget to contain individual widgets/functional blocks you might have in your header, sidebar, or footer
  • <footer>#footer for your page footer
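Pulled together, a minimal page skeleton using those conventions might look like the following. The class/id names come from the list above; the surrounding structure and nesting are illustrative, not prescriptive.

```html
<body class="single post">
  <header id="header">
    <ul class="nav">…</ul>
  </header>
  <main>
    <article>
      <header class="entry-header">…</header>
      <div class="entry-content">…</div>
      <div class="entry-meta">…</div>
      <div id="comments">…</div>
    </article>
    <aside id="sidebar">
      <div class="widget">…</div>
    </aside>
  </main>
  <footer id="footer">…</footer>
</body>
```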

Anyone familiar with WordPress themes will recognize a number of those id/class names. Given that it’s the most used CMS on the internet, it’s expected that those naming conventions would be the most common. And given that they’re the most common, it’s worth considering using the same/similar naming conventions to add semantic meaning to your page structures.

And I think this goes without saying, but just to cover our bases here… Where I would not apply these naming conventions is where you see WordPress plugin names (or any other CMS-specific naming) as part of the class/id naming. eg: don’t use <div>.elementor-widget even though it has almost twice as many occurrences as <div>.widget.

About the Data: Nov 2024 CommonCrawl

The 2.9 billion web pages come from the November 2024 Common Crawl WARC dataset. All 2.9 billion pages were parsed and analyzed for this post.

If you’re not familiar with the Common Crawl organization, definitely check them out. They’ve been doing great work for almost 2 decades now, downloading large swaths of the internet and making them publicly available.

If you want to better understand how WARC files are stored, how they’re compressed, and how those compressed files are combined, check out the code I submitted as a proposal to change the compression format to ZSTD with dictionary compression for smaller file sizes. To realize the full performance benefits, the code is written in C, but it can still help with understanding the specific details of how the data is structured. You can go ahead and ignore my preference at the time for short variable names…

Normalizing Data

Semantically, do these things seem to represent the same fundamental type of element?

  • menuItem
  • menu-item
  • MENU_ITEM
  • menu__item--7
  • menu__item3--7
  • menu_123_item
  • menu_123_item--__--a__

In my experience, these very often represent the same fundamental element, so I have normalized them into the same value for this dataset. Therefore, every one of those attribute values above gets normalized into menu-item.

Here’s the token normalization process:

  • Convert camelCase to kebab-case.
  • lower-case.
  • Replace all non-alphanumeric character sequences with a single hyphen.
  • Trim leading and trailing hyphens.
  • Tokenize based on hyphens.
    • For each subtoken:
      • Trim non-alphabetical characters. This includes trimming numeric prefixes and suffixes.
      • If the subtoken was a number, it will be reduced to a 0-length string. Discard it and any other 0-length subtokens.
      • If the subtoken is only a single character, discard it.
  • Combine the remaining subtokens, separating them by a hyphen.
  • This is your final, normalized value.

The token normalization occurs only on the class, id, name, and role attributes.
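The steps above can be sketched in Python. This is a straightforward implementation of the listed rules; the regexes are my own phrasing of them, not the code used to build the dataset.

```python
import re

def normalize_token(value: str) -> str:
    """Normalize a class/id/name/role attribute value per the steps above."""
    # camelCase -> kebab-case: insert a hyphen before each uppercase run boundary
    value = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '-', value)
    value = value.lower()
    # collapse every run of non-alphanumeric characters into a single hyphen
    value = re.sub(r'[^a-z0-9]+', '-', value)
    # trim leading and trailing hyphens, then tokenize on hyphens
    subtokens = []
    for sub in value.strip('-').split('-'):
        # trim numeric prefixes and suffixes
        sub = sub.strip('0123456789')
        # discard 0-length (purely numeric) and single-character subtokens
        if len(sub) > 1:
            subtokens.append(sub)
    return '-'.join(subtokens)
```

All seven of the example values above (menuItem, menu-item, MENU_ITEM, and so on) come out of this function as the same normalized value, menu-item.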

For attributes that may contain multiple values, like rel, they are lowercased, and each individual value is stored as its own count. eg: an anchor link with rel="nofollow ugc" will increment the counts of both a.rel=nofollow and a.rel=ugc.

For those familiar with parsing HTML, you may see a potential issue here when things like multiple comments are displayed. In those situations, you might have #comment-1, #comment-2, #comment-3. But in this analysis, that’s simply counted as three #comment values. For this analysis, I’m fine with those cases being lost/merged. In other analyses I’ve done, I’m not okay with that and handle the normalization differently.

For normalizing URLs:

  • Lowercase the entire URL string.
  • Parse the URL into scheme, host, path, query, and fragment.
  • Remove the scheme.
  • Re-combine the parsed URL into a final URL.

The URL normalization occurs only on attributes that are known to contain URLs: the action, code, data, formaction, href, and src attributes.
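That minimal URL normalization can be sketched with Python’s urllib (the function name is mine):

```python
from urllib.parse import urlsplit

def normalize_url(url: str) -> str:
    """Minimal URL normalization: lowercase the string, then drop the scheme."""
    url = url.lower()
    parts = urlsplit(url)
    # re-combine everything except the scheme
    normalized = parts.netloc + parts.path
    if parts.query:
        normalized += '?' + parts.query
    if parts.fragment:
        normalized += '#' + parts.fragment
    return normalized
```

Relative URLs pass through unchanged apart from lowercasing, since they have no scheme or host to begin with.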

There’s a lot more that can be done for normalizing URLs, which I’ll walk through in future posts. Some of that includes checking the URL to see if it’s a relative or absolute URL, based on the host of the page it appears on…then deciding if you want to store it as an absolute or relative URL. You can also remove query vars like the utm_* tracking variables used by Google Analytics, sort the query variable names alphabetically, conditionally add trailing slashes to paths, etc.

So just know that the URLs in this dataset are fairly minimally normalized.

All other attributes are simply lowercased.

Parsing HTML Page Layouts

Given that this is the first post of the series, I had to go through a lot of the above information. Expect further posts in this series to be less verbose.

The rest of this post goes through the main visible sections of a web page: Header, Main content, Comments/Reviews, Sidebar, Footer, and other common UI elements.
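As a preview of where this is heading, the frequency data lets you sketch a layout-section classifier. The token-to-section mapping below is a small, hand-picked hypothetical subset; in practice you would derive it from the most common values in the dataset.

```python
# Hypothetical mapping from normalized class/id tokens to layout sections.
SECTION_HINTS = {
    'header': 'header', 'site-header': 'header', 'nav': 'header',
    'entry-content': 'main', 'content': 'main', 'main': 'main',
    'comments': 'comments', 'comment-list': 'comments',
    'sidebar': 'sidebar', 'widget-area': 'sidebar',
    'footer': 'footer', 'site-footer': 'footer',
}

def guess_section(tag, attrs):
    """Guess which layout section an element belongs to.

    attrs is a list of (name, value) pairs, as produced by html.parser.
    Semantic HTML5 tags win outright; otherwise fall back to class/id tokens.
    """
    if tag in ('header', 'nav', 'main', 'aside', 'footer'):
        return {'nav': 'header', 'aside': 'sidebar'}.get(tag, tag)
    for name, value in attrs:
        if name in ('class', 'id') and value:
            for token in value.split():
                hint = SECTION_HINTS.get(token.lower())
                if hint:
                    return hint
    return None
```

So `<div class="entry-content">` classifies as main content, `<aside>` as the sidebar, and an unrecognized element returns None rather than a guess.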

Parsing Header Navigation

The header of a web page will most often have things like the logo, top navigation, and maybe a search bar, login/profile link, etc.

In the form below, we’re looking for tag+attribute+value combinations that might indicate a wrapper for the entire header element.


Parsing the Main Content

The main content of a web page will depend on the type of page it is, whether it’s an article, product, forum, about page, etc.

In the form below, we’re looking for the most commonly used tag+attribute+value combinations that might indicate a wrapper for the entire main content.



Parsing Comments / User-Generated Content

Comments and other user-generated content (UGC) aren’t a part of most pages online. But if you’re parsing content with comments, reviews, or replies (eg: forum threads), then you’ll want to parse this into or out of your dataset.

In the form below, we’re looking for the most commonly used tag+attribute+value combinations that might indicate a wrapper for the main types of user-generated content / comments.



Parsing Side Navigation

The side navigation of a web page isn’t used on all web pages, but it’s common enough that it’s absolutely something to look out for. In many CMSes, this may contain links to related content, calendars, tag clouds, search forms, and other ‘widgets’.

In the form below, we’re looking for tag+attribute+value combinations that might indicate a wrapper for the entire sidebar element.



Parsing Footer Navigation

The footer of a web page will most often have things like copyright dates, links to privacy policies and other legal docs, site-wide social media account links, and other important site links that are most often displayed directly and not collapsed into a dropdown menu.

In the form below, we’re looking for tag+attribute+value combinations that might indicate a wrapper for the entire footer element.

Parsing Other Common UI Elements

In addition to the main structural components of a web page, there are tons of common UI elements that you may want to parse into or out of your datasets. Things like modals, search boxes, image galleries, etc, can very often be identified by an attribute’s values.

In the form below, we’re looking for tag+attribute+value combinations that might indicate a wrapper for various UI elements.



Closing

When we start parsing HTML, looking at the source code of page after page across different sites, it’s easy to think there are no patterns here and that the only solution is to throw it all at an LLM or neural network to get what we want.

Hopefully this data has begun to show you otherwise.

And as we go through this series, we’ll look even more deeply into figuring out how to identify the type of page we’re looking at, and various URL patterns in links, scripts, CSS, and media; we’ll get into parsing (un)clearly-defined structured data, HTTP headers, article meta information, and the technologies used on a page.

All of this will lay the foundational work to then develop an article parser/extractor. And whether you’re building a parser for commerce, news, forums, blogs, or anything else, going through the process of building that article parser will lay the foundation for writing any other parser from there.

Available for Hire

If you have parsing needs that have historically been challenging, or if you have data you want extracted from the Common Crawl datasets like I’ve done here…

Get the Free Database

Throughout this post, you’ll have the ability to select attribute values for different tag and attribute combinations, just like you did above.

But the forms on this page only contain the top 250 results.

I’ve also made a larger SQLite database available to you for free, with up to 10,000 values per tag.attribute combination.
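Querying that database looks something like the following. The schema here is a hypothetical single table of (tag, attribute, value, count) rows; the actual database’s table and column names may differ. The counts are the ones quoted earlier in this post.

```python
import sqlite3

# Hypothetical schema -- the real database's table/column names may differ.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE tag_attr_values (tag TEXT, attribute TEXT, value TEXT, count INTEGER)')
conn.executemany(
    'INSERT INTO tag_attr_values VALUES (?, ?, ?, ?)',
    [('a', 'class', 'nav-link', 5272736667),   # counts quoted in the Overview
     ('ul', 'class', 'nav', 1172570814),
     ('nav', 'class', 'nav', 82965969)],
)

# Top attribute values for a given tag + attribute combination
rows = conn.execute(
    'SELECT value, count FROM tag_attr_values '
    'WHERE tag = ? AND attribute = ? ORDER BY count DESC LIMIT 10',
    ('a', 'class'),
).fetchall()
```

The same query shape, pointed at the full database, gives you up to 10,000 values per tag.attribute combination.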