Dataset: CommonCrawl: 2024.11: HTML Tags, Attributes, and Attribute Values by Count. Top 1,000 Dataset

This is a SQLite database file containing one table per tag and attribute pair. For example, the table a_href covers the href attribute of the a tag.

Each table has two columns, key and count. Each table holds up to the top 1,000 tag+attribute+value combinations, ranked by the total number of occurrences found across the entire November 2024 CommonCrawl WARC dataset.
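As a minimal sketch of reading one of these tables with Python's built-in sqlite3 module: the function below returns the highest-count rows from a single table. The column names key and count come from the description above; the helper names and the database file name in the usage comment are hypothetical, and the sketch assumes key holds the attribute value.

```python
import sqlite3


def quote_identifier(name):
    """Quote a table name for SQLite, since identifiers cannot be bound as parameters."""
    return '"' + name.replace('"', '""') + '"'


def top_values(db_path, table, limit=10):
    """Return up to `limit` (key, count) rows from `table`, highest count first."""
    query = (
        'SELECT "key", "count" FROM '
        + quote_identifier(table)
        + ' ORDER BY "count" DESC LIMIT ?'
    )
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(query, (limit,)).fetchall()
    finally:
        conn.close()


# Usage (the file name is an assumption, substitute the name of your download):
# for key, count in top_values("tags_attrs_values.sqlite", "a_href"):
#     print(count, key)
```

The table name is quoted rather than interpolated raw because tag and attribute names may contain characters, such as hyphens, that are not valid in an unquoted SQLite identifier.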

There are 2,957 of these database tables.
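To see which tag and attribute pairs are available, the table names can be read from SQLite's built-in sqlite_master catalog. This is a rough sketch; the function name and the database file name in the usage comment are hypothetical.

```python
import sqlite3


def list_tables(db_path):
    """Return the names of all user tables in the database, sorted alphabetically."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    # Skip SQLite's internal bookkeeping tables, whose names start with "sqlite_".
    return [name for (name,) in rows if not name.startswith("sqlite_")]


# Usage (the file name is an assumption):
# tables = list_tables("tags_attrs_values.sqlite")
# print(len(tables))  # should match the table count stated above
```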

The gzip-compressed database for download is 33.6MB. Decompressed, it is 142.3MB.
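If you prefer to decompress the download from a script rather than with a command-line tool, here is a small sketch using Python's standard gzip and shutil modules. The function name and both file names in the usage comment are hypothetical.

```python
import gzip
import shutil


def decompress_gzip(src_path, dst_path):
    """Stream-decompress a .gz file to dst_path without loading it all into memory."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return dst_path


# Usage (both file names are assumptions):
# decompress_gzip("tags_attrs_values.sqlite.gz", "tags_attrs_values.sqlite")
```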

If you would like a dataset with more than the top 1,000 results (e.g., up to 10,000,000 per tag+attr+val), contact me for pricing. I’m not sure of the demand for this, but if there’s enough, I’ll make purchasing datasets much easier.

You will need to enter a name and email address for downloading this file. However…

Your email address will not be verified before you can download this file. This form is used to keep bots from downloading the file, so if you wish to use an invalid email address, please use something random that does not belong to someone else.

If you want to subscribe to the newsletter, which is emailed each time I publish a new blog post, dataset, or dataset analysis, you will need to check the box. You will need a valid email address, which you will then confirm via email.

This dataset has been referenced in the following blog posts:

You may download this dataset by clicking the download link below:
Download: CommonCrawl: 2024.11: HTML Tags, Attributes, and Attribute Values by Count. Top 1,000 Dataset