Posted by Jeff_Baker
In January of 2018 Brafton began a massive organic keyword targeting campaign, amounting to over 90,000 words of blog content being published.
Did it work?
Well, yeah. We doubled the number of total keywords we rank for in less than six months. By using our advanced keyword research and topic writing process published earlier this year we also increased our organic traffic by 45% and the number of keywords ranking in the top ten results by 130%.
But we got a whole lot more than just traffic.
From planning to execution and performance tracking, we meticulously logged every aspect of the project. I’m talking blog word count, MarketMuse performance scores, on-page SEO scores, days indexed on Google. You name it, we recorded it.
As a byproduct of this nerdery, we were able to draw juicy correlations between our target keyword rankings and variables that can affect and predict those rankings. But specifically for this piece...
How well keyword research tools can predict where you will rank.
A little background
We created a list of keywords we wanted to target in blogs based on optimal combinations of search volume, organic keyword difficulty scores, SERP crowding, and searcher intent.
We then wrote a blog post targeting each individual keyword. We intended for each new piece of blog content to rank for the target keyword on its own.
With our keyword list in hand, my colleague and I manually created content briefs explaining how we would like each blog post written to maximize the likelihood of ranking for the target keyword. Here’s an example of a typical brief we would give to a writer:
Between mid-January and late May, we ended up writing 55 blog posts each targeting 55 unique keywords. 50 of those blog posts ended up ranking in the top 100 of Google results.
We then paused and took a snapshot of each URL’s Google ranking position for its target keyword and its corresponding organic difficulty scores from Moz, SEMrush, Ahrefs, SpyFu, and KW Finder. We also took the PPC competition scores from the Keyword Planner Tool.
Our intention was to draw statistical correlations between between our keyword rankings and each tool’s organic difficulty score. With this data, we were able to report on how accurately each tool predicted where we would rank.
This study is uniquely scientific, in that each blog had one specific keyword target. We optimized the blog content specifically for that keyword. Therefore every post was created in a similar fashion.
Do keyword research tools actually work?
We use them every day, on faith. But has anyone ever actually asked, or better yet, measured how well keyword research tools report on the organic difficulty of a given keyword?
Today, we are doing just that. So let’s cut through the chit-chat and get to the results...
While Moz wins top-performing keyword research tool, note that any keyword research tool with organic difficulty functionality will give you an advantage over flipping a coin (or using Google Keyword Planner Tool).
As you will see in the following paragraphs, we have run each tool through a battery of statistical tests to ensure that we painted a fair and accurate representation of its performance. I’ll even provide the raw data for you to inspect for yourself.
Let’s dig in!
The Pearson Correlation Coefficient
Yes, statistics! For those of you currently feeling panicked and lobbing obscenities at your screen, don’t worry — we’re going to walk through this together.
In order to understand the relationship between two variables, our first step is to create a scatter plot chart.
Below is the scatter plot for our 50 keyword rankings compared to their corresponding Moz organic difficulty scores.
We start with a visual inspection of the data to determine if there is a linear relationship between the two variables. Ideally for each tool, you would expect to see the X variable (keyword ranking) increase proportionately with the Y variable (organic difficulty). Put simply, if the tool is working, the higher the keyword difficulty, the less likely you will rank in a top position, and vice-versa.
This chart is all fine and dandy, however, it’s not very scientific. This is where the Pearson Correlation Coefficient (PCC) comes into play.
Phew. Still with me?
So each of these scatter plots will have a corresponding PCC score that will tell us how well each tool predicted where we would rank, based on its keyword difficulty score.
We will use the following table from statisticshowto.com to interpret the PCC score for each tool:
In order to visually understand what some of these relationships would look like on a scatter plot, check out these sample charts from Laerd Statistics.
And here are some examples of charts with their correlating PCC scores (r):
The closer the numbers cluster towards the regression line in either a positive or negative slope, the stronger the relationship.
That was the tough part - you still with me? Great, now let’s look at each tool’s results.
Test 1: The Pearson Correlation Coefficient
Now that we've all had our statistics refresher course, we will take a look at the results, in order of performance. We will evaluate each tool’s PCC score, the statistical significance of the data (P-val), the strength of the relationship, and the percentage of keywords the tool was able to find and report keyword difficulty values for.
In order of performance:
Revisiting Moz’s scatter plot, we observe a tight grouping of results relative to the regression line with few moderate outliers.
Moz came in first with the highest PCC of .412. As an added bonus, Moz grabs data on keyword difficulty in real time, rather than from a fixed database. This means that you can get any keyword difficulty score for any keyword.
In other words, Moz was able to generate keyword difficulty scores for 100% of the 50 keywords studied.
Visually, SpyFu shows a fairly tight clustering amongst low difficulty keywords, and a couple moderate outliers amongst the higher difficulty keywords.
SpyFu came in right under Moz with 1.7% weaker PCC (.405). However, the tool ran into the largest issue with keyword matching, with only 40 of 50 keywords producing keyword difficulty scores.
SEMrush would certainly benefit from a couple mulligans (a second chance to perform an action). The Correlation Coefficient is very sensitive to outliers, which pushed SEMrush’s score down to third (.364).
Further complicating the research process, only 46 of 50 keywords had keyword difficulty scores associated with them, and many of those had to be found through SEMrush’s “phrase match” feature individually, rather than through the difficulty tool.
The process was more laborious to dig around for data.
#4: KW Finder
KW Finder definitely could have benefitted from more than a few mulligans with numerous strong outliers, coming in right behind SEMrush with a score of .360.
Fortunately, the KW Finder tool had a 100% match rate without any trouble digging around for the data.
Ahrefs comes in fifth by a large margin at .316, barely passing the “weak relationship” threshold.
On a positive note, the tool seems to be very reliable with low difficulty scores (notice the tight clustering for low difficulty scores), and matched all 50 keywords.
#6: Google Keyword Planner Tool
Before you ask, yes, SEO companies still use the paid competition figures from Google’s Keyword Planner Tool (and other tools) to assess organic ranking potential. As you can see from the scatter plot, there is in fact no linear relationship between the two variables.
SEO agencies still using KPT for organic research (you know who you are!) — let this serve as a warning: You need to evolve.
Test 1 summary
For scoring, we will use a ten-point scale and score every tool relative to the highest-scoring competitor. For example, if the second highest score is 98% of the highest score, the tool will receive a 9.8. As a reminder, here are the results from the PCC test:
And the resulting scores are as follows:
Moz takes the top position for the first test, followed closely by SpyFu (with an 80% match rate caveat).
Test 2: Adjusted Pearson Correlation Coefficient
Let’s call this the “Mulligan Round.” In this round, assuming sometimes things just go haywire and a tool just flat-out misses, we will remove the three most egregious outliers to each tool’s score.
Here are the adjusted results for the handicap round:
As noted in the original PCC test, some of these tools really took a big hit with major outliers. Specifically, Ahrefs and SEMrush benefitted the most from their outliers being removed, gaining .162 and .150 respectively to their scores, while Moz benefited the least from the adjustments.
For those of you crying out, “But this is real life, you don’t get mulligans with SEO!”, never fear, we will make adjustments for reliability at the end.
Here are the updated scores at the end of round two:
SpyFu takes the lead! Now let’s jump into the final round of statistical tests.
Test 3: Resampling
Being that there has never been a study performed on keyword research tools at this scale, we wanted to ensure that we explored multiple ways of looking at the data.
Big thanks to Russ Jones, who put together an entirely different model that answers the question: "What is the likelihood that the keyword difficulty of two randomly selected keywords will correctly predict the relative position of rankings?"
He randomly selected 2 keywords from the list and their associated difficulty scores.
Let’s assume one tool says that the difficulties are 30 and 60, respectively. What is the likelihood that the article written for a score of 30 ranks higher than the article written on 60? Then, he performed the same test 1,000 times.
He also threw out examples where the two randomly selected keywords shared the same rankings, or data points were missing. Here was the outcome:
As you can see, this tool was particularly critical on each of the tools. As we are starting to see, no one tool is a silver bullet, so it is our job to see how much each tool helps make more educated decisions than guessing.
Most tools stayed pretty consistent with their levels of performance from the previous tests, except SpyFu, which struggled mightily with this test.
In order to score this test, we need to use 50% as the baseline (equivalent of a coin flip, or zero points), and scale each tool relative to how much better it performed over a coin flip, with the top scorer receiving ten points.
For example, Ahrefs scored 11.2% better than flipping a coin, which is 8.2% less than Moz which scored 12.2% better than flipping a coin, giving AHREFs a score of 9.2.
The updated scores are as follows:
So after the last statistical accuracy test, we have Moz consistently performing alone in the top tier. SEMrush, Ahrefs, and KW Finder all turn in respectable scores in the second tier, followed by the unique case of SpyFu, which performed outstanding in the first two tests (albeit, only returning results on 80% of the tested keywords), then falling flat on the final test.
Finally, we need to make some usability adjustments.
Usability Adjustment 1: Keyword Matching
A keyword research tool doesn’t do you much good if it can’t provide results for the keywords you are researching. Plain and simple, we can’t treat two tools as equals if they don’t have the same level of practical functionality.
To explain in practical terms, if a tool doesn’t have data on a particular keyword, one of two things will happen:
- You have to use another tool to get the data, which devalues the entire point of using the original tool.
- You miss an opportunity to rank for a high-value keyword.
Neither scenario is good, therefore we developed a penalty system. For each 10% match rate under 100%, we deducted a single point from the final score, with a maximum deduction of 5 points. For example, if a tool matched 92% of the keywords, we would deduct .8 points from the final score.
One may argue that this penalty is actually too lenient considering the significance of the two unideal scenarios outlined above.
The penalties are as follows:
Please note we gave SEMrush a lot of leniency, in that technically, many of the keywords evaluated were not found in its keyword difficulty tool, but rather through manually digging through the phrase match tool. We will give them a pass, but with a stern warning!
Usability Adjustment 2: Reliability
I told you we would come back to this! Revisiting the second test in which we threw away the three strongest outliers that negatively impacted each tool’s score, we will now make adjustments.
In real life, there are no mulligans. In real life, each of those three blog posts that were thrown out represented a significant monetary and time investment. Therefore, when a tool has a major blunder, the result can be a total waste of time and resources.
For that reason, we will impose a slight penalty on those tools that benefited the most from their handicap.
We will use the level of PCC improvement to evaluate how much a tool benefitted from removing their outliers. In doing so, we will be rewarding the tools that were the most consistently reliable. As a reminder, the amounts each tool benefitted were as follows:
In calculating the penalty, we scored each of the tools relative to the top performer, giving the top performer zero penalty and imposing penalties based on how much additional benefit the tools received over the most reliable tool, on a scale of 0–100%, with a maximum deduction of 5 points.
So if a tool received twice the benefit of the top performing tool, it would have had a 100% benefit, receiving the maximum deduction of 5 points. If another tool received a 20% benefit over of the most reliable tool, it would get a 1-point deduction. And so on.
All told, our penalties were fairly mild, with a slight shuffling in the middle tier. The final scores are as follows:
Using any organic keyword difficulty tool will give you an advantage over not doing so. While none of the tools are a crystal ball, providing perfect predictability, they will certainly give you an edge. Further, if you record enough data on your own blogs’ performance, you will get a clearer picture of the keyword difficulty scores you should target in order to rank on the first page.
For example, we know the following about how we should target keywords with each tool:
This is pretty powerful information! It’s either first page or bust, so we now know the threshold for each tool that we should set when selecting keywords.
Stay tuned, because we made a lot more correlations between word count, days live, total keywords ranking, and all kinds of other juicy stuff. Tune in again in early September for updates!
We hope you found this test useful, and feel free to reach out with any questions on our math!
Disclaimer: These results are estimates based on 50 ranking keywords from 50 blog posts and keyword research data pulled from a single moment in time. Search is a shifting landscape, and these results have certainly changed since the data was pulled. In other words, this is about as accurate as we can get from analyzing a moving target.
Sign up for The Moz Top 10, a semimonthly mailer updating you on the top ten hottest pieces of SEO news, tips, and rad links uncovered by the Moz team. Think of it as your exclusive digest of stuff you don't have time to hunt down but want to read!