Use Data Or Be Used By Data!

The June 5 issue of Seotistics is here for you!

Welcome back to another issue of Seotistics!

This time we will expand on what we said last week about generating synthetic SEO data to test your ideas.

You'll get to see more technical topics to level up your game.

Be prepared, we are going to brush up on some Statistics concepts (in an easy way).


🔑 Key Statistics Concepts

Let's explain some definitions first, with clarity as our first goal:

Function: a map that assigns to each element of X exactly one element of Y. So you have 2 sets, input and output. A function associates inputs with outputs, that's it.
Sample space: a set of all the possible outcomes of an experiment. It's a list of every possible scenario that can happen. If flipping a coin, heads and tails would be your outcomes.
Event space: a set of events, where each event is itself a set of outcomes. For instance, "getting one heads and one tails when flipping a coin twice" is an event made up of two outcomes: heads then tails, and tails then heads.
Random variable: a function that maps outcomes to real values.
Probability distribution: a function that maps outcomes to probabilities.
Parameters: what defines a distribution, the levers you use to modify its shape.
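
To make these definitions concrete, here is a minimal R sketch for flipping a coin twice. It assumes a fair coin, and the variable names are only illustrative.

```r
# Sample space: every possible outcome of flipping a coin twice
sample_space <- expand.grid(flip1 = c("H", "T"), flip2 = c("H", "T"),
                            stringsAsFactors = FALSE)

# Event: a set of outcomes, here "exactly one heads"
one_heads <- sample_space[rowSums(sample_space == "H") == 1, ]

# Random variable: maps each outcome to a real value (the number of heads)
n_heads <- rowSums(sample_space == "H")

# Probability distribution: with a fair coin (assumption), each outcome has probability 1/4
distribution <- table(n_heads) / nrow(sample_space)
print(distribution)
```

The sample space has 4 outcomes, the event contains 2 of them, and the distribution tells you how likely each number of heads is.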

So it's all about mapping, then? Yes, most of what you've read above consists of conversions.

I know, textbooks and college professors may make it harder and more boring... but it's actually that simple.

For example, a function is simply a magical mapper. You give x to get f(x).

If in the plot below I have x=0, then f(x) will be 0 as well.

The function squares values of x, so it transforms our input to give us an output (the squared number).

A parabola function. You give x and you get a transformed output, namely f(x).
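
The same mapping can be written in a couple of lines of R. This is just a sketch of the squaring function described above; the name f and the range of x values are my own choice.

```r
# The squaring function: give it x, get f(x) = x^2 back
f <- function(x) x^2

# A few inputs and their transformed outputs
x <- seq(-3, 3, by = 1)
mapping <- data.frame(x = x, fx = f(x))
print(mapping)

# The input 0 maps to the output 0
f(0)
```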

Advanced Reading Ahead

This issue isn't for the faint of heart. It's one of the few with extremely technical details.

If you have doubts, don't hesitate to send me an email. I will be happy to answer you.

🧮 Actionable SEO Tip - Generating Data Once Again

Last time we sampled data and called it a day. That's fine but we want to be more accurate when generating data.

Here are some clues:

Clicks and Impressions can't be negative but can be 0.
Clicks and Impressions are whole numbers
Many websites have a prevalence of a few pages getting more traffic
Not everything in nature is harmonious

A naive model would assume that every page on your website can get the same traffic...

but this is completely false in reality.

Supporting pages will most likely get less traffic than the pillar page.

PageRank isn't equally distributed across your pages and it's quite natural to have some getting more.

🔗 Colab Link with R code [Part 2]

💡 Using Statistics To Be Credible

The solution is relying on the so-called probability distributions to create realistic values for our metrics.

Most of the phenomena we observe can be traced back to some pattern, meaning that we can generate similar data.

Last time, we simply sampled data without much thought but what if we want to be more accurate?

We keep testing until the output looks good!

In this example, I'll once again use R code because it's the best language for doing Statistics.

Don't worry if you don't understand the code, focus on the concepts.

This time we want to add pages, so we need to include queries, dates and the respective pages.

This poses the first problem: how many rows do we even need?

Let's say I want:

30 pages
365 days
5 unique queries per page (150 queries in total)

54750 rows (30 x 365 x 5)

Great! Now we have an idea of how big our dataset will be.

We can then proceed to generate fictional pages with the following code:

You could also do something much more complex with terms and substitutions but for now, we don't care.
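
A minimal sketch of this step, assuming a placeholder domain (example.com) and a simple "page-N" naming scheme. Both are illustrative choices, not necessarily what the Colab uses.

```r
n_pages <- 30
n_days <- 365
queries_per_page <- 5

# Total rows: one row per page, day and query combination
n_rows <- n_pages * n_days * queries_per_page
n_rows

# Fictional URLs; the domain and the path pattern are placeholders
pages <- paste0("https://example.com/page-", seq_len(n_pages))
head(pages, 3)
```

Computing n_rows up front gives you a number to check every column against later.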

Next, we have to use the rep() function to fill the cells in our empty dataframe. Note that the argument each is used to repeat pages 365 x 5 times, as we need to create the right aggregation.

The rep() function in R is a must have for generating data.
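
Here is a small sketch of how rep() behaves with each versus times, using hypothetical page URLs. The each argument keeps all rows of the same page next to each other, which is the aggregation we need.

```r
# Hypothetical page URLs (placeholder domain)
pages <- paste0("https://example.com/page-", 1:30)

# each = 365 * 5 repeats every page once per day and query
page_col <- rep(pages, each = 365 * 5)

length(page_col)               # one entry per row of the final dataset
sum(page_col == pages[1])      # rows that belong to the first page

# Contrast: times repeats the whole vector, each repeats element by element
rep(c("a", "b"), times = 2)    # "a" "b" "a" "b"
rep(c("a", "b"), each = 2)     # "a" "a" "b" "b"
```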

Next, we have queries and I've decided to use some terms to combine once again.

Generate queries and dates with the same procedure as before.

The same goes for date, we need to generate every single day in a year and then repeat them.

It's been quite tricky to code it because as you see it involves a lot of logic! Be sure to test carefully and check your results.
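
A sketch of how the three columns can line up. The modifiers, the "product N" topics, the start date and the example.com domain are all assumptions made for illustration; the important part is how each and times are combined.

```r
n_pages <- 30
n_days <- 365
queries_per_page <- 5

# Combine 5 modifiers with 30 topics to get 150 unique queries (terms are made up)
modifiers <- c("best", "cheap", "buy", "how to use", "review")
topics <- paste("product", seq_len(n_pages))
queries <- paste(rep(modifiers, times = n_pages),
                 rep(topics, each = queries_per_page))

# Every single day in a year (the start date is an arbitrary choice)
dates <- seq(as.Date("2022-01-01"), by = "day", length.out = n_days)

df <- data.frame(
  page  = rep(paste0("https://example.com/page-", seq_len(n_pages)),
              each = n_days * queries_per_page),
  query = rep(queries, each = n_days),
  date  = rep(dates, times = n_pages * queries_per_page)
)

# Sanity checks: right size, no duplicated rows, 5 queries per page
nrow(df)
anyDuplicated(df)
table(tapply(df$query, df$page, function(q) length(unique(q))))
```

If any of the checks is off, the each and times arguments are the first place to look.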

Now here comes the fun part, we have to generate clicks and impressions.

❗️We know that clicks can't be negative and it's reasonable to think some rows will have 0.

This is where probability distributions come in handy... we have to find a good approximation of what could generate clicks in a realistic way.

I don't want to go heavy on Statistics, so this is what I will do:

Clicks are usually skewed. Some observations get much much more, the majority tends to be lower. This isn't symmetric at all.
For this reason, I try using the Chi-Squared distribution, which is skewed to the right. Again, this may not be the best approximation, it's my wild guess!
I want values to be within a specific range.
I also want 20% of clicks to be zero.

An example of a Chi-Squared distribution follows:

The plot is skewed to the right, so the probability of extreme x values is low.

The df (degrees of freedom) parameter changes the shape of the curve.

Warning!!!

The Chi-Squared distribution is continuous, meaning you shouldn't use it for clicks, since they are discrete (whole numbers).

Unless... you round down the values to make them integers.
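
A minimal sketch of that idea. The degrees of freedom (3), the scale factor (10) and the cap (150) are arbitrary values I picked for illustration; tune them until the output looks like your own data.

```r
set.seed(42)  # make the example reproducible
n_rows <- 54750

# Continuous, right-skewed draws from a Chi-Squared distribution
raw_clicks <- rchisq(n_rows, df = 3)

# Scale up, round down to whole numbers, and cap to keep values in a range
clicks <- pmin(floor(raw_clicks * 10), 150)

summary(clicks)

# Right skew: the mean is pulled above the median by a few large values
mean(clicks) > median(clicks)
```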

And the last point requires us to use another distribution to generate many 0s and 1s, called the Binomial distribution.

N.B. n means how many trials you do, p is the probability of success.

Being successful 5 times has a probability of around 25%, according to the plot below.

💡 In plain English, I generate many 0s and 1s at random and they'll be later used to turn some clicks to 0.

Consider it as a sort of scale to fix things.
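
In code, the idea could look like the sketch below. The click values reuse an assumed Chi-Squared setup (df = 3, scaled by 10), and the 25% figure is checked with n = 10 and p = 0.5, which is one parameter choice that produces it, not necessarily the one behind the plot.

```r
set.seed(42)
n_rows <- 54750

# Assumed click generation: skewed, whole numbers
clicks <- floor(rchisq(n_rows, df = 3) * 10)

# One trial per row (size = 1): 1 with probability 0.8, 0 with probability 0.2
keep <- rbinom(n_rows, size = 1, prob = 0.8)

# Multiplying by the 0s turns roughly 20% of the rows into 0 clicks
clicks <- clicks * keep
mean(clicks == 0)

# Probability of exactly 5 successes in 10 trials with p = 0.5 (about 0.246)
dbinom(5, size = 10, prob = 0.5)
```

The share of zeros ends up slightly above 20% because a few rows were already 0 after rounding down.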

We then assign clicks to each page for simplicity. However, an actual dataset will feature fluctuations, so this approach is only demonstrative.

If you want to take it to the next level, you can do that for page and query combinations and add some noise.

Over a year, clicks will naturally fluctuate and go down or up, right? 📈

Probability distributions can also help you with that!

Google Search Console plot showing clicks decay and query expansion.

Now for impressions... I have done something smart.

Impressions can't be lower than Clicks and they are usually much much higher. Not always, of course.

So I created a vector with some multipliers and another one with their probabilities.

We create multipliers and their respective probabilities to simulate impressions based on clicks.

So, the chance of clicks and impressions being the same (multiplier = 1) is 1% (0.01), which is very low.

We use sampling and the clicks values we generated before to get impressions.

Clicks get scaled by those multipliers to generate impressions. ⚖️
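
A sketch of the multiplier approach. Only the 1% chance for a multiplier of 1 comes from the text above; the other multipliers and probabilities are made-up values for illustration.

```r
set.seed(42)

# Assumed click values (skewed, whole numbers)
clicks <- floor(rchisq(1000, df = 3) * 10)

# Multipliers and their probabilities (must sum to 1)
multipliers <- c(1, 5, 10, 20, 50, 100)
probs       <- c(0.01, 0.14, 0.25, 0.30, 0.20, 0.10)

# Draw one multiplier per row and scale the clicks
drawn <- sample(multipliers, size = length(clicks), replace = TRUE, prob = probs)
impressions <- clicks * drawn

# Impressions are never lower than clicks
all(impressions >= clicks)
```

One limitation of pure scaling: rows with 0 clicks also get 0 impressions, so you may want to handle those rows separately.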

With that knowledge, we can now proceed to get our beloved CTR. As we know, it's given by the ratio of clicks to impressions.
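
A small sketch with made-up numbers. Setting the CTR to 0 when there are no impressions is my own choice to avoid dividing 0 by 0.

```r
# Made-up example values
clicks      <- c(0, 3, 12, 40)
impressions <- c(0, 60, 120, 40)

# CTR = clicks / impressions, with 0 where there are no impressions
ctr <- ifelse(impressions > 0, clicks / impressions, 0)
ctr
```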

But we miss one metric that we skipped last time... yes, it's the position!

This is the hardest one to generate because you can't simply plug in random numbers... it's tied to the CTR (and therefore to clicks and impressions too).

Warning!!!

The position metric is actually an average and that's why it's a real number.

Please recall that it's not a reliable metric and can be extremely hard to estimate since it can dramatically decrease if you rank for a lot of queries.

I took some CTR thresholds from a website and used them as my baseline.

Then, I created some conditions to assign a random position within a certain range given a CTR value.

Based on the CTR, we simulate what the average position would look like.

All of this makes our metric more realistic and credible since we add some noise in the mix.

The Uniform distribution has the same probability for every outcome... it's quite balanced for this use case.
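
A sketch of the logic, with hypothetical CTR thresholds and position ranges. These are not the baseline values mentioned above, so replace them with your own.

```r
set.seed(42)

# Made-up CTR values
ctr <- c(0.35, 0.12, 0.04, 0.005, 0)

# Hypothetical thresholds: the higher the CTR, the better the position range
thresholds <- c(0.02, 0.10, 0.25)
lower <- c(10, 4, 2, 1)   # lower bound of each position range
upper <- c(50, 10, 4, 2)  # upper bound of each position range

# Find which band each CTR falls into (1 = lowest CTR, 4 = highest CTR)
band <- findInterval(ctr, thresholds) + 1

# Draw a uniform random position inside the band, rounded like an average position
position <- round(runif(length(ctr), min = lower[band], max = upper[band]), 1)

data.frame(ctr, band, position)
```

Because runif() accepts vectors for min and max, each row gets its own range without writing a loop.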

🆙 What Can Be Improved

Many things... if you want to get to the next level, here's what you can do:

Add more noise when generating variables.
Test different prob. distributions for clicks
Use different functions to account for skewness
Explore the relationship between variables (what is inversely proportional?)
Improve how the position is simulated

The data you want to generate is largely influenced by how you will use it.

Some assumptions won't harm your goals, others will.

For this reason, it's crucial to know what you can sacrifice and what must be as realistic as possible.

This sample dataset shows some observations with 0 clicks. Here we could add more noise!

If your goal is simply to convince someone or build a prototype, then you can be extremely lax.

But if you work in an agency and you have to replicate data as accurately as you can... then you must spend some time on it!

⏭ What's Next?

Data generation applies to anything. You can generate the number of internal links, users, sessions, page views and so on.

It's all a matter of scaling, finding the right functions and probability distributions.

Instead of waiting for someone to give you access, start creating your own dataset.

For sure, it's annoying to create queries and pages but it's still better than waiting.

💡 The SEO Insights

Last time we talked about generating data with some simple R code and sampling.

Now, you should be equipped to generate more accurate data grounded in probability distributions.

This skillset is transferable to any other industry you happen to work in.

Generating synthetic data in SEO is a valuable skill when you have to deal with private data or lack access.


👥 Community Launched!

We have launched our Discord community and I will contact every person who hasn't joined yet.

Our goal is to encourage actual SEO testing and building meaningful connections.

Join our Discord Community!

🔎 Analytics For SEO Ebook (v2)

This ebook is aimed at SEOs or Business Owners who want to explore the combination of SEO and Analytics.

It will teach you or your employees to:

👉 Avoid common pitfalls that cost you money 💸

👉 Create meaningful analyses that add value 💯

👉 Shorten the learning time of Analytics ⏳

This comes with monthly updates because I want to create the Ultimate Guide out there.

The April update includes the following new information:

✅ Categorize Pages

✅ More on Content Audits

✅ Handling Large Files

Unlock Powerful Insights Now!

v3, (finally) coming out this week, will feature:

Quick And Simple Way Of Detecting Keyword Cannibalization
Statistical Inference And Statistics (Update)
Update For Use Cases 2 and 5
Going Deeper With Analysis (Google Analytics, Screaming Frog, etc.)
R Approach To Some Problems

📚 Recommended Reads

Other useful resources for understanding the topic better...

The R book covers a lot of simulations and is a must buy for anyone who is half-serious about the topic.

I also recommend a Stats book because I couldn't cover an entire course in 1 newsletter!

Of course, I don't want to sugarcoat the topic as it poses multiple ethical challenges in terms of bias.

Combinatories & Permutations (Even if you hate Math)
Synthetic Data Generation + Python Libraries
R For Marketing Research and Analytics (Must buy for the topic)
Learning Statistics with R (Probability Chapter)
Ethical Challenges of Synthetic Data in ML

❗️ Feedback and Recommendations

If you have ideas/recommendations for the next issues of Seotistics, you can simply reply to this email.

Marco Giordano
SEO Specialist & Data Analyst


Seotistics - Analytics & SEO

🔎 [Part 2] How To Generate SEO Data For Testing