🔎 [Part 2] How To Generate SEO Data For Testing


Use Data Or Be Used By Data!


The June 5 issue of Seotistics is here for you!

Welcome back to another issue of Seotistics!

This time we will expand on what we said last week about generating synthetic SEO data to test your ideas.

You'll get to see more technical topics to level up your game.

Be prepared, we are going to brush up on some Statistics concepts (in an easy way).

Please move this email to your Primary inbox. This is to prevent Seotistics from going into spam by accident. Gmail users can read this tutorial to do it.

🔑 Key Statistics Concepts

Let's explain some definitions first, with clarity as our first goal:

  • Function: a map that assigns to each element of X exactly one element of Y. So you have 2 sets, input and output. A function associates inputs with outputs, that's it.
  • Sample space: a set of all the possible outcomes of an experiment. It's a list of every possible scenario that can happen. If flipping a coin, heads and tails would be your outcomes.
  • Event space: a set of events, where each event is itself a set of outcomes. For instance, "getting one head and one tail when flipping a coin twice" is an event; Heads and Tails are the individual outcomes of each flip.
  • Random variable: a function that maps outcomes to real values.
  • Probability distribution: a function that maps outcomes to probabilities.
  • Parameters: what defines a distribution, the levers you use to modify its shape.

It's all about mapping then!? Yes, most of what you've read above comes down to conversions.

I know, textbooks and college professors may make it harder and more boring... but it's actually that simple.

For example, a function is simply a magical mapper. You give x to get f(x).

If in the plot below I have x=0, then f(x) will be 0 as well.

The function squares values of x, so it transforms our input to give us an output (the squared number).
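To make the "magical mapper" idea concrete, here is a tiny R sketch of a squaring function. The name f and the sample inputs are just illustrative choices.

```r
# f maps each input x to exactly one output: its square
f <- function(x) x^2

# A handful of example inputs
x <- c(-2, -1, 0, 1, 2)

# One output per input; note that x = 0 gives f(x) = 0
f(x)
```

Two different inputs (like -2 and 2) can share the same output, but each input always gets exactly one output. That's all a function needs.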

Advanced Reading Ahead

This issue isn't for the faint of heart. It's one of the few with extremely technical details.

If you have doubts, don't hesitate to send me an email. I will be happy to answer you.

🧮 Actionable SEO Tip - Generating Data Once Again

Last time we sampled data and called it a day. That's fine but we want to be more accurate when generating data.

Here are some clues:

  • Clicks and Impressions can't be negative but can be 0.
  • Clicks and Impressions are whole numbers
  • Many websites have a prevalence of a few pages getting more traffic
  • Not everything in nature is harmonious

A naive model would assume that every page on your website can get the same traffic...

but this is completely false in reality.

Supporting pages will most likely get less traffic than the pillar page.

PageRank isn't equally distributed across your pages and it's quite natural to have some getting more.

🔗 Colab Link with R code [Part 2]

💡 Using Statistics To Be Credible

The solution is to rely on probability distributions to create realistic values for our metrics.

Most of the phenomena we observe can be traced back to some pattern, meaning that we can generate similar data.

Last time, we simply sampled data without much thought but what if we want to be more accurate?

We start testing until the output is good!

In this example, I'll once again use R code because it's the best language for doing Statistics.

Don't worry if you don't understand the code, focus on the concepts.

This time we want to add pages, so we need to include queries, dates and the respective pages.

This poses the first problem: how many rows do we even need?

Let's say I want:

  • 30 pages
  • 365 days
  • 5 unique queries per page (150 queries in total)

That makes 54,750 rows (30 x 365 x 5).

Great! Now we have an idea of how big our dataset will be.

We can then proceed to generate fictional pages with the following code:
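The exact snippet lives in the Colab, so what follows is only a minimal sketch of how the pages could be generated. The example.com URL pattern and the variable names are my assumptions, not necessarily what the notebook uses.

```r
# Dataset dimensions from the plan above
n_pages   <- 30
n_days    <- 365
n_queries <- 5   # unique queries per page

# Hypothetical URL pattern: page-1, page-2, ..., page-30
pages <- paste0("https://example.com/page-", seq_len(n_pages))

# Sanity check on the expected number of rows
n_rows <- n_pages * n_days * n_queries
n_rows

head(pages, 3)
```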

You could also do something much more complex with terms and substitutions but for now, we don't care.

Next, we have to use the rep() function to fill the cells in our empty dataframe. Note that the argument each is used to repeat pages 365 x 5 times, as we need to create the right aggregation.
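Here is a small sketch of what each does in rep(), using hypothetical page URLs. With each, every page is repeated in one contiguous block before moving to the next one, which is what we need to line pages up with their queries and dates.

```r
# Toy example: each repeats element by element, times repeats the whole vector
rep(c("a", "b"), each = 2)    # "a" "a" "b" "b"
rep(c("a", "b"), times = 2)   # "a" "b" "a" "b"

# Hypothetical pages, each repeated 365 days x 5 queries = 1825 times
pages    <- paste0("https://example.com/page-", 1:30)
page_col <- rep(pages, each = 365 * 5)

length(page_col)
```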

Next, we have queries and I've decided to use some terms to combine once again.

The same goes for date, we need to generate every single day in a year and then repeat them.

It was quite tricky to code because, as you can see, it involves a lot of logic! Be sure to test carefully and check your results.
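As a rough sketch of that logic (not the Colab code), this is one way to line up pages, queries and dates. The terms used to build queries, the year 2023 and the column names are all assumptions made for illustration.

```r
n_pages   <- 30
n_days    <- 365
n_queries <- 5

# Hypothetical pages
pages <- paste0("https://example.com/page-", seq_len(n_pages))

# Hypothetical terms combined into 150 unique queries (5 per page)
topics    <- paste0("topic", seq_len(n_pages))
modifiers <- c("guide", "tips", "examples", "tools", "checklist")
queries   <- paste(rep(topics, each = n_queries), modifiers)

# Every single day of a (non-leap) year
dates <- seq(as.Date("2023-01-01"), by = "day", length.out = n_days)

# Pages repeat in the biggest blocks, queries in smaller ones,
# and the full run of dates cycles once per page-query pair
df <- data.frame(
  page  = rep(pages,   each  = n_days * n_queries),
  query = rep(queries, each  = n_days),
  date  = rep(dates,   times = n_pages * n_queries)
)

nrow(df)
head(df, 3)
```

A quick check worth running on your own version: every page should have exactly 5 queries, and no page-query-date combination should appear twice.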

Now here comes the fun part, we have to generate clicks and impressions.

❗️ We know that clicks can't be negative and it's reasonable to think some rows will have 0.

This is where probability distributions come in handy... we have to find a good approximation of what could generate clicks in a realistic way.

I don't want to go heavy on Statistics, so this is what I will do:

  • Clicks are usually skewed. Some observations get much much more, the majority tends to be lower. This isn't symmetric at all.
  • For this reason, I try using the Chi-Squared distribution, which is skewed to the right. Again, this may not be the best approximation, it's my wild guess!
  • I want values to be within a specific range.
  • I also want 20% of clicks to be zero.

An example of a Chi-Squared distribution follows:

The plot is skewed to the right, so the probability of extreme x values is low.

The df (degrees of freedom) parameter changes the shape of the curve.

Warning!!!

The Chi-Squared distribution is continuous, meaning you shouldn't use it for clicks, since they are discrete (whole numbers).

Unless... you round down the values to make them integers.
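A minimal sketch of that idea follows: draw from a Chi-Squared distribution, squeeze the values into a range, then round down. The df value of 3, the 0 to 500 range and the rescaling method are assumptions I picked for the example, not the only way to do it.

```r
set.seed(42)   # for reproducibility

n_pages <- 30

# Right-skewed continuous values; df = 3 is an arbitrary choice
raw <- rchisq(n_pages, df = 3)

# Rescale into a hypothetical range of 0 to 500 clicks,
# then round down so that clicks are whole numbers
max_clicks <- 500
clicks <- floor(raw / max(raw) * max_clicks)

summary(clicks)
```

Dividing by the maximum is a blunt way to force a range: the largest draw always lands exactly on max_clicks. If that bothers you, clipping or a different scaling function are reasonable alternatives.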

The last point (20% of clicks being zero) requires another distribution, the Binomial distribution, to generate many 0s and 1s.

N.B. n means how many trials you do, p is the probability of success.

Being successful 5 times has a probability of around 25%, according to the plot below.

💡 In plain English, I generate many 0s and 1s at random and they'll later be used to turn some clicks into 0.

Consider it as a sort of scale to fix things.
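Here is a small sketch of that 0/1 "scale", using made-up click values. One thing to watch in R: in rbinom(n, size, prob), n is the number of draws, while size is the number of trials (the n of the note above). The n = 10, p = 0.5 setting in the first line is only my guess at what the plot showed.

```r
set.seed(42)   # for reproducibility

# Probability of exactly 5 successes, assuming n = 10 trials and p = 0.5
p5 <- dbinom(5, size = 10, prob = 0.5)
p5   # about 0.246

# Made-up click values for ten rows
clicks <- c(120, 45, 300, 8, 76, 15, 210, 33, 5, 90)

# One trial per row: 1 with probability 0.8 (keep), 0 with probability 0.2 (zero out)
keep <- rbinom(length(clicks), size = 1, prob = 0.8)

# Multiplying by the mask turns some clicks into 0
clicks_with_zeros <- clicks * keep
clicks_with_zeros
```

The share of zeros is about 20% on average, not exactly 20% in every sample. On a small vector like this one it can be noticeably off.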

We then assign clicks to each page for simplicity. However, an actual dataset will feature fluctuations, so this approach is only demonstrative.

If you want to take it to the next level, you can do that for page and query combinations and add some noise.

Over a year, clicks will naturally fluctuate and go down or up, right? 📈

Probability distributions can also help you with that!

Now for impressions... I have done something smart.

Impressions can't be lower than Clicks and they are usually much much higher. Not always, of course.

So I created a vector with some multipliers and another one with their probabilities.

So, the chance of clicks and impressions being the same (multiplier = 1) is 1% (0.01), which is very low.

We use sampling and the clicks values we generated before to get impressions.

Clicks get scaled by those multipliers to generate impressions. ⚖️
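A rough sketch of the multiplier trick is below. Only the 1% chance for a multiplier of 1 comes from the text above; every other multiplier, probability and click value is made up for the example.

```r
set.seed(42)   # for reproducibility

# Made-up click values
clicks <- c(0, 12, 45, 3, 150, 0, 27, 8)

# Hypothetical multipliers and their probabilities (they must sum to 1)
multipliers <- c(1,    5,    10,   20,   50,   100)
probs       <- c(0.01, 0.14, 0.30, 0.30, 0.15, 0.10)

# Draw one multiplier per row, with replacement, following those probabilities
m <- sample(multipliers, size = length(clicks), replace = TRUE, prob = probs)

# Impressions are never lower than clicks because every multiplier is >= 1
impressions <- clicks * m
impressions
```

One limitation of this sketch: a row with 0 clicks also ends up with 0 impressions, while real data often has impressions without any clicks. Adding a separate random baseline for those rows would be a sensible next step.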

With that knowledge, we can now proceed to get our beloved CTR. As we know, it's given by the ratio of clicks to impressions.
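The ratio is a one-liner, but rows with 0 impressions need some care because 0 / 0 is NaN in R. In this sketch (made-up numbers) I set the CTR to 0 for those rows, which is a choice on my part. Using NA would be just as defensible.

```r
# Made-up values
clicks      <- c(0, 12, 45, 3, 0)
impressions <- c(0, 240, 450, 300, 80)

# CTR = clicks / impressions, with 0 where there are no impressions
ctr <- ifelse(impressions > 0, clicks / impressions, 0)

round(ctr, 3)   # 0.00 0.05 0.10 0.01 0.00
```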

But we are missing one metric that we skipped last time... yes, it's the position!

This is the hardest one to generate because you can't simply plug in random numbers... it's tied to the CTR (and therefore to clicks and impressions too).

Warning!!!

The position metric is actually an average and that's why it's a real number.

Remember that it's not a reliable metric and can be extremely hard to estimate, since the average can get dramatically worse if you rank for a lot of queries.

I took some CTR thresholds from a website and used them as my baseline.

Then, I created some conditions to assign a random position within a certain range given a CTR value.

All of this makes our metric more realistic and credible since we add some noise in the mix.

The Uniform distribution has the same probability for every outcome... it's quite balanced for this use case.
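To give an idea of how this could look, here is a sketch that maps CTR bands to position ranges and then draws from a Uniform distribution inside each range. The thresholds and ranges below are invented for the example. They are not the ones I took from the website.

```r
set.seed(42)   # for reproducibility

# Made-up CTR values
ctr <- c(0.32, 0.12, 0.04, 0.015, 0.002, 0)

# Hypothetical CTR bands (lower edges) and the position range for each band
breaks <- c(0,  0.01, 0.03, 0.10, 0.25)
lower  <- c(11, 8,    4,    2,    1)
upper  <- c(50, 11,   8,    4,    2)

# findInterval() returns the band index (1 to 5) each CTR falls into
band <- findInterval(ctr, breaks)

# Uniform draw inside the band's range; one decimal, since position is an average
position <- round(runif(length(ctr), min = lower[band], max = upper[band]), 1)

data.frame(ctr, position)
```

Higher CTRs land in better (lower) position ranges, and the Uniform draw adds the noise inside each range.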

🆙 What Can Be Improved

Many things... if you want to get to the next level, here's what you can do:

  • Add more noise when generating variables.
  • Test different probability distributions for clicks
  • Use different functions to account for skewness
  • Explore the relationship between variables (what is inversely proportional?)
  • Improve how the position is simulated

The data you want to generate is largely influenced by how you will use it.

Some assumptions won't harm your goals, others will.

For this reason, it's crucial to know what you can sacrifice and what must be as realistic as possible.

If your goal is simply convincing someone or building a prototype, then you can be extremely lax.

But if you work in an agency and you have to replicate data as accurately as you can... then you must spend some time on it!

⭐ What's Next?

Data generation applies to anything. You can generate the number of internal links, users, sessions, page views and so on.

It's all a matter of scaling, finding the right functions and probability distributions.

Instead of waiting for someone to give you access, start creating your own dataset.

For sure, it's annoying to create queries and pages but it's still better than waiting.

💡 The SEO Insights

Last time we talked about generating data with some simple R code and sampling.

Now, you should be equipped to generate more accurate data grounded in probability distributions.

This skillset is transferable to any other industry you happen to work in.

Generating synthetic data in SEO is a valuable skill when you have to deal with private data or lack access.

🧵 My Selection Of Twitter Threads

A quick recap for those who haven't read them all or need a refresher:

👥 Community Launched!

We have launched our Discord community and I will contact every person who hasn't joined yet.

Our goal is to encourage actual SEO testing and building meaningful connections.

🔎 Analytics For SEO Ebook (v2)

This ebook is aimed at SEOs or Business Owners who want to explore the combination of SEO and Analytics.

It will teach you or your employees to:

👉 Avoid common pitfalls that cost you money 💸

👉 Create meaningful analyses that add value 💯

👉 Shorten the learning time of Analytics ⏳

This comes with monthly updates because I want to create the Ultimate Guide out there.

The April update includes the following new information:

✅ Categorize Pages

✅ More on Content Audits

✅ Handling Large Files

v3, (finally) coming out this week, will feature:

  • Quick And Simple Way Of Detecting Keyword Cannibalization
  • Statistical Inference And Statistics (Update)
  • Update For Use Cases 2 and 5
  • Going Deeper With Analysis (Google Analytics, Screaming Frog, etc.)
  • R Approach To Some Problems

📚 Recommended Reads

Other useful resources for understanding the topic better...

The R book covers a lot of simulations and is a must-buy for anyone who is half-serious about the topic.

I also recommend a Stats book because I couldn't cover an entire course in 1 newsletter!

Of course, I don't want to sugarcoat the topic as it poses multiple ethical challenges in terms of bias.

❗️ Feedback and Recommendations

If you have ideas/recommendations for the next issues of Seotistics, you can simply reply to this email.

Marco Giordano
SEO Specialist & Data Analyst
