What is probabilistic data?

Probabilistic data: Information about people derived from mathematical models

Probabilistic audience data is usually based on behavioural data like web-logs that are aggregated and analysed in order to determine the probability that a user belongs to a certain demographic category or class. Advanced algorithms try to identify distinct behavioural patterns like certain travel and browsing behaviours in order to determine the probability of the user being male or female, young or old, etc. Many behaviouristic models are in fact searching for distinct patterns of known human behaviour. Patterns that usually emerge due to humans being creatures of habit.

  • Some audiences are more likely to consume sports- and motor-news
  • Some audiences are more likely to be online at certain times of the day/week
  • Some audiences own and use certain types of devices

All these habits create distinct behavioural patterns that often can be identified algorithmically in anonymised log files. The advantage of using probabilistic modelling is the ability to scale your models since you no longer have to rely on first party interactions and people providing you with their profile information as well as login information like usernames and e-mail addresses. As long as we ensure that the correct permissions are obtained, a user does not need to login and provide you with personal data before online behaviour can be observed, logged and algorithmically matched to a specific demographic target group.

While the true strength of the probabilistic approach lies in its ability to scale, it’s inherent weakness is often a lack of deterministic data to actually validate the accuracy of the model with. The question is: How do we know that our model is right? The answer is: We can validate predicted profiles if we have “ground truth” for a sufficient subset of them. For this reason, deterministic and probabilistic data are complementary.

Probabilistic modelling does not operate in absolutes, but provide classification with a degree of certainty. Validation is in other words needed in order to document the effectiveness of any probabilistic derived audience.

This is why AudienceProject has chosen to deploy a combined approach where behaviouristic modelling is used to classify anonymous users into demographic classes, while the deterministic data is used for testing the accuracy and precision of the models and to improve our behaviouristic models iteratively.

This approach gives us the benefit of high accuracy levels combined with massive scale.

How did we do?

What is deterministic data?