Reading the title, you’re probably rolling your eyes, thinking this is another clickbait post where someone tells you about the intitle: command, how to collect lists of blogs from other people’s top list posts, how to use the power of XPath extraction, or how to download a competitor’s backlink profile. If that’s what you presumed, you are mistaken.
Instead, I’m going to show you how you can use your own custom crawler. Think of it as your own personal link butler, doing all the hard work of finding those websites. Here is the basic flow of the extraction program:
- Crawl all the links from a start URL
- Add the start domain to a blacklist
- Crawl all the internal and external links found on the start URL, repeating step 1 for each of those URLs
- Add each crawled domain to the blacklist and continue to crawl through the queued URLs
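To make that flow concrete, here is a minimal sketch of the loop in Python. It is not the code from the repo. The `fetch_links` argument is a hypothetical callable that returns the links found on a page, and for simplicity this version visits only one page per domain before blacklisting it.

```python
from collections import deque
from urllib.parse import urlparse


def discover_domains(start_url, fetch_links, limit=100):
    """Breadth-first crawl that records each newly seen domain once.

    fetch_links is a hypothetical callable: URL -> list of absolute URLs.
    """
    blacklist = set()            # domains that have already been crawled
    queue = deque([start_url])   # URLs waiting to be crawled
    found = []                   # domains in the order they were discovered

    while queue and len(found) < limit:
        url = queue.popleft()
        domain = urlparse(url).netloc.lower()
        if not domain or domain in blacklist:
            continue  # skip malformed URLs and domains we already have
        blacklist.add(domain)
        found.append(domain)
        for link in fetch_links(url):
            # Only queue links that point at domains not yet blacklisted.
            if urlparse(link).netloc.lower() not in blacklist:
                queue.append(link)
    return found
```

In a real run, `fetch_links` would download the page and pull out its anchor tags. Keeping it as a parameter means the loop itself can be tried against a small hand-written link graph first.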
A very simple process. As the crawler works through the queue, it captures some pretty basic information from each page:
- Meta Title
- Meta Description
- Meta Generator
The meta generator is an interesting one because it is the tag that content management systems such as WordPress use to label the sites they power.
<meta name="generator" content="WordPress 4.8" />
I’m going to use this information to start to classify the sites that are surfaced.
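As a rough illustration of how those three fields can be pulled out of a page, here is a sketch that uses only Python's standard library `html.parser`. The `MetaExtractor` class and `extract_meta` function are names I made up for this example, and the repo may well do it differently.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collects the <title> text plus the description and generator meta tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in ("description", "generator"):
                self.meta[name] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_meta(html):
    """Return a dict with the title, description and generator of a page."""
    parser = MetaExtractor()
    parser.feed(html)
    return {
        "title": parser.title.strip(),
        "description": parser.meta.get("description", ""),
        "generator": parser.meta.get("generator", ""),
    }
```

Pages without a generator tag simply come back with an empty string, which makes them easy to separate from the WordPress sites later on.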
So without further ado, let’s get into it.
Disclaimer: You’ll need to understand Python to be able to run this script.
Disclaimer 2: This code is a little old and not my proudest work, but I don’t have the time to refactor it and it still does the job I need it to!
One thing I wanted when building this crawler was the ability to stop and resume crawls. That’s why it’s built on top of a SQLite database. Again, not ideal.
This crawler will crawl up to a couple of million URLs if left to run long enough (around a week).
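To show why a database makes stop and resume possible, here is a minimal sketch of a URL queue stored in SQLite. The `queue` table, its columns and the function names are assumptions for illustration, not the schema the repo actually uses.

```python
import sqlite3


def open_queue(path):
    """Open (or create) the crawl database and make sure the queue table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS queue ("
        "url TEXT PRIMARY KEY, crawled INTEGER NOT NULL DEFAULT 0)"
    )
    conn.commit()
    return conn


def enqueue(conn, url):
    # INSERT OR IGNORE leaves an already-known URL and its crawled flag untouched.
    conn.execute("INSERT OR IGNORE INTO queue (url) VALUES (?)", (url,))
    conn.commit()


def next_url(conn):
    """Return the oldest URL that has not been crawled yet, or None."""
    row = conn.execute(
        "SELECT url FROM queue WHERE crawled = 0 ORDER BY rowid LIMIT 1"
    ).fetchone()
    return row[0] if row else None


def mark_crawled(conn, url):
    conn.execute("UPDATE queue SET crawled = 1 WHERE url = ?", (url,))
    conn.commit()
```

Because the crawled flag lives on disk, killing the script and calling `open_queue` on the same file again picks up at the first URL that was never marked as crawled.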
Step 1) Clone this repo
Clone the repo onto your machine.
Step 2) Configure the variables
- Project # the name used for the database
- Root_url # the start URL you want to crawl from
- Crawl_type # always leave this as web
- Threads # how many threads you want to run; if you have a fast connection you can increase this
- Limit # the maximum number of URLs you want to collect
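As an example, a filled-in configuration might look something like the sketch below. Every value here is made up, and the lowercase spelling of the variable names is my assumption, so check the script itself for the exact names.

```python
# Hypothetical example values; adjust them to your own project.
project = "travel_prospects"        # name used for the SQLite database
root_url = "https://example.com/"   # start URL to crawl from
crawl_type = "web"                  # always leave this as "web"
threads = 10                        # raise this on a fast connection
limit = 100000                      # maximum number of URLs to collect
```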
Step 3) Start the script
The crawler will start collecting all the linked sites that have a relationship to your start URL.
Step 4) Process the results
This script creates a SQLite database, and to view it you need a SQLite browser. It’s open source, available for Mac and Windows, and lets you export tables to a CSV file.
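If you would rather skip the GUI, the same export can be sketched in a few lines of Python with the standard `sqlite3` and `csv` modules. The function name `table_to_csv` is mine, and you will need to look up the real table name in your own database.

```python
import csv
import io
import sqlite3


def table_to_csv(conn, table):
    """Return every row of `table` as CSV text, with a header row first."""
    # Table names cannot be bound as query parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    cursor = conn.execute(f"SELECT * FROM {quoted}")
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)
    return buffer.getvalue()
```

To use it, open the database with `sqlite3.connect(...)`, call `table_to_csv(conn, ...)` with your table name, and write the returned text to a file opened with `newline=""`.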
For the purpose of writing this article, I let the crawler run for around 24 hours, and in that time it collected 100,000 websites.
Now, using the data, I can start to understand the trends and see where I can find linking opportunities.
If you didn’t already know, WordPress is the most used CMS out there. Why? Pretty much anyone can set it up, and it has a lot of functionality. WordPress is typically used for blogs (this blog runs on it), so a WordPress generator tag is a good indicator of a potential linking opportunity.
The first thing I look for in this data set is relevancy, so I search through the titles and descriptions for keywords. In this example I chose “adventure” and “travel”, as I’m looking for some travel sites to engage with. This returned around 3,500 URLs. Next I filter by the WordPress generator tag and combine the results into a master list.
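Here is a small sketch of that filtering in Python, assuming each site has been loaded as a dict with `title`, `description` and `generator` keys. Those key names and the function names are assumptions for the example, and I have read the WordPress step as keeping only the sites whose generator tag starts with "WordPress".

```python
def is_relevant(site, keywords):
    """True if any keyword appears in the site's title or description."""
    text = f"{site.get('title') or ''} {site.get('description') or ''}".lower()
    return any(keyword.lower() in text for keyword in keywords)


def is_wordpress(site):
    """True if the generator tag identifies the site as WordPress."""
    return (site.get("generator") or "").lower().startswith("wordpress")


def shortlist(sites, keywords):
    """Keep the sites that match a keyword and run on WordPress."""
    return [s for s in sites if is_relevant(s, keywords) and is_wordpress(s)]
```

Calling `shortlist(sites, ["adventure", "travel"])` would then give you the master list to feed into the next step.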
Next up is qualifying the domains.
URL Profiler is an amazing tool that is essentially a connector to all your SEO services. I’m going to keep it simple in this example. The metrics I want to understand are:
- Domain Authority
- Trust Flow
- Referring Domains
- SEMrush rank
With these 4 metrics you can get a very good idea of how strong a website is from an SEO perspective.
URL Profiler comes with a 14-day free trial.
I’ve assumed you already have access to the services behind these metrics. All of them require paid subscriptions in order to access the data via their APIs.
To get the same data I’m using, select the options that correspond to the four metrics listed above.
Click the accounts tab in the top left to set up all the API keys required.
Once that’s all set up, click on run profiler and let the profile run.
Step 5) Process the profiled results
If you’re an Excel lover, you can handle this yourself, as the output is pure CSV.
If you’re not, I’ve built a Data Studio report that should process the information for you.
It has simple filters for excluding the high and low Domain Authority sites. The really high DA sites (70+) are out of scope for this link building effort, as sending them an email is very rarely going to be enough of a proposition to acquire a link. The low DA sites (<15) are probably not worth the time for the value that link is going to drive to your site in the long term.
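If you are working from the CSV instead of the report, the same DA band can be sketched in Python like this. The `domain_authority` key is an assumed column name, so match it to whatever header your export actually uses.

```python
def in_outreach_band(domain_authority, low=15, high=70):
    """True for sites strong enough to matter but not too big to reply."""
    return low <= domain_authority < high


def filter_prospects(rows, low=15, high=70):
    """Keep rows whose DA is known and falls inside the outreach band."""
    return [
        row
        for row in rows
        if row.get("domain_authority") is not None
        and in_outreach_band(row["domain_authority"], low, high)
    ]
```

The defaults mirror the 15 and 70 cut-offs above, and both can be overridden if you want a tighter or looser band.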
You will notice a table that lists domains with email addresses. This is the lazy man’s table: the data you download from it includes email addresses, around 80% of which are qualified contacts for the domain owner, and every domain falls within the range you select in the DA and Trust Flow filters.
Step 6) Outreach
You’ve made it! You now have a large list of prospects, and you can start emailing those websites.
I’ve kept this tutorial short and sweet, but it’s still a very powerful one.
If you’ve got any questions, let me know!