Brands are building out content at breakneck speed – SEO pieces, landing pages, blog posts and more. It is pretty hard to keep track of what your competitor is doing.

We can discover part of what a competitor is up to by looking at which web pages they have created. Gone are the days of two-to-five-page websites. Websites are now complicated beasts with thousands of pages, and manually clicking through them is an insane proposition. Another way to find out what content exists is to use a competitor's sitemaps.

What can we find out using Sitemaps?

Using AWS as an example, the following is a sample of what we were able to find out from the URLs in AWS’s marketing sitemap:

  • We found 10,072 unique links listed on our competitor’s sitemap, indicating a ferocious capacity to create content (versus roughly 5,000 we found for Microsoft Azure).
  • About 50% of the links were “What’s New” updates, and AWS is accelerating its content update efforts rapidly, with many more “What’s New” posts over time.
  • We found 98 event links with a few core areas of focus: Developers (28% of links), Innovation (8%), Start-ups (6%) and Government (11%). Event links targeting developers skewed heavily towards India.
  • AWS has very creative events, such as Zombie Outbreak Simulation events for coders to play with.
  • Beyond this, we see a section of the site focused on the developer community through an AWS Hero Programme (over 99 links).

The above is just a sample of the interesting points we could find simply by looking at the list of URLs in the sitemaps.

So how does one go about doing this?

What are Sitemaps?

Sitemaps give search engines hints about which web pages exist on a website, so they are exactly the hints we want to use to discover our competitor’s content.

Below is the formal definition:

sitemaps.org explains:
Sitemaps are an easy way for webmasters to inform search engines about pages on their sites that are available for crawling. In its simplest form, a Sitemap is an XML file that lists URLs for a site along with additional metadata about each URL.

Web crawlers usually discover pages from links within the site and from other sites. Sitemaps supplement this data to allow crawlers that support Sitemaps to pick up all URLs in the Sitemap and learn about those URLs using the associated metadata. Using the Sitemap protocol does not guarantee that web pages are included in search engines, but provides hints for web crawlers to do a better job of crawling your site.

As an example, let’s assume we are a cloud startup with AWS as our competitor.

Where do I find Sitemaps?

There is no guarantee, but typically sitemaps can be found at the following location:
example.com/sitemap.xml

For our AWS example, aws.amazon.com/sitemap.xml returns a 404.
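
If you want to check a few of the common locations programmatically, a short loop with requests is enough. Below is a minimal sketch; the candidate paths are just common conventions, not guaranteed to exist on any given site.

import requests

# Common places a sitemap tends to live; none of these are guaranteed to exist
candidates = [
    "https://aws.amazon.com/sitemap.xml",
    "https://aws.amazon.com/sitemap_index.xml",
]

for url in candidates:
    status = requests.get(url, timeout=10).status_code
    print(url, status)  # 200 means we found something, 404 means keep looking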

Another approach is to look for the robots.txt file.

Search engines automatically browse websites to find and categorize their content. robots.txt helps search engine bots by detailing the files and directories the bots should not browse.
At times, the sitemap is also listed in this file.

In our case we hit the jackpot: aws.amazon.com/robots.txt
In this file we could find an index of AWS’s sitemaps.
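
Pulling the Sitemap: entries out of robots.txt takes only a few lines. This is a rough sketch; the parsing is deliberately naive and assumes each entry sits on its own line, as the robots.txt convention specifies.

import requests

robots = requests.get("https://aws.amazon.com/robots.txt", timeout=10).text

# robots.txt can declare sitemaps with lines of the form "Sitemap: <url>"
sitemap_urls = [
    line.split(":", 1)[1].strip()
    for line in robots.splitlines()
    if line.lower().startswith("sitemap:")
]
print(sitemap_urls)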

And in this index we could find the following:
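
A short sketch to list the index’s child sitemaps (it continues from the robots.txt snippet above, so sitemap_urls is assumed to exist):

import requests
from bs4 import BeautifulSoup

# Take the sitemap index found in robots.txt and list its child sitemaps
index_xml = requests.get(sitemap_urls[0], timeout=10).text
index_soup = BeautifulSoup(index_xml, "xml")  # the "xml" parser needs lxml installed

# A sitemap index wraps each child sitemap URL in a <loc> element;
# aws.amazon.com/sitemaps/sitemap_marketing/ shows up in this list
for loc in index_soup.find_all("loc"):
    print(loc.get_text())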

BINGO! Exactly what we want: the sitemap of our competitor’s marketing efforts!

However, sometimes we won’t be able to find the sitemap this way. In those cases, we can try Google’s advanced search operators, site:example.com filetype:xml, which restrict Google’s results to XML files on the domain we are interested in.

What do I do next?

Now that we have found the sitemap, the volume of links is impossible to go through by hand. Here is where a few simple lines of Python can help tidy things up.
We will use BeautifulSoup to parse the sitemap, clean up the tags, and then place the URLs into a DataFrame sorted alphabetically.

import re

import requests
import pandas as pd
from bs4 import BeautifulSoup

# Fetch the marketing sitemap we found via robots.txt
r = requests.get("https://aws.amazon.com/sitemaps/sitemap_marketing/")
xml = r.text

# Parse the XML and pull out every <loc> element (one per URL)
soup = BeautifulSoup(xml, "xml")  # the "xml" parser needs lxml; "html.parser" also works here
sitemapTags = soup.find_all("loc", text=True)

# Strip the surrounding <loc>...</loc> tags so only the bare URLs remain
regex = re.compile(r"</?loc>", re.IGNORECASE)
cleanedsitemapTags = [regex.sub("", str(i)) for i in sitemapTags]

# Put the URLs into a DataFrame and sort them alphabetically
df = pd.DataFrame({'pages': cleanedsitemapTags})
df = df.sort_values(['pages'], ascending=True)

With the list of URLs generated, we can glean insights into our competitor’s content strategy, as demonstrated in our AWS example.
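
As a closing sketch of how counts like the ones at the top of this post can be produced: the snippet below reuses the df built above and tallies URLs containing a few keywords. The path fragments ('whats-new', 'events', 'heroes') are assumptions about how AWS structures its URLs, so inspect your own DataFrame first and adapt them.

# Rough keyword counts over the DataFrame built above.
# The path fragments here are assumptions; adjust them to the patterns you actually see.
keywords = ["whats-new", "events", "heroes"]

total = len(df)
for keyword in keywords:
    matches = df['pages'].str.contains(keyword, case=False).sum()
    print(f"{keyword}: {matches} links ({matches / total:.0%} of {total})")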