NACE Classification Series: Introduction and Data Collection

by Vindhya Nagaraj
18 Mar 2025

Introduction to the NACE Classification Series

In this blog series, we’ll guide you through a project we undertook to classify Irish domains based on their web content using the NACE classification system. This system provides a standardised way to categorise economic activities, helping us better understand the online business landscape in Ireland.

The series will cover key phases of our project, providing insights into the methodology and models we developed along the way:

  • Blog Post 1: Introduction and Data Collection | In the first post, we’ll introduce the NACE classification system and explain its importance in understanding economic activities on the internet. We’ll also discuss how we gathered and prepared the data for classification, laying the groundwork for the project.
  • Blog Post 2: Data Annotation, Sample Selection, and Feature Engineering | The second post will delve into the annotation process and how we carefully selected domain samples for classification. We’ll also explore how WebCat (developed by DNS Belgium) helped streamline this step, and dive into the feature engineering that significantly improved the classification process.
  • Blog Post 3: Developing the Version 1 Model, Iterations, and Final Version 2 Model | In the third blog, we’ll discuss the development of our initial V1 classification model, the improvements we made along the way, and how we arrived at the final V2 model. Additionally, we’ll touch on the transition toward productionalisation, ensuring the model’s readiness for real-world application.
  • Blog Post 4: Results and Key Insights | In the final blog, we will share valuable insights and findings from applying the NACE classification to our .ie registry data. Our analysis will dive into registrar-level metrics, including domain age, registrant type, and other key factors, all of which were examined and visualised using Power BI dashboards.

Let’s get started with the first post in the series, which covers the introduction, background, and scope of the project.

Understanding NACE

NACE (Nomenclature statistique des activités économiques dans la Communauté européenne) is the standard European classification system for categorising economic activities.

It plays a crucial role in analysing business trends and economic activities, particularly in the digital landscape.

The NACE system follows a hierarchical structure:

  • Level 1: 21 broad sectors, identified by letters A to U

  • Level 2: 88 divisions, identified by two-digit numerical codes (01 to 99)

  • Level 3: 272 groups, identified by three-digit numerical codes (01.1 to 99.0)

  • Level 4: 615 classes, identified by four-digit numerical codes (01.11 to 99.00)

For this project, we focused on classifying Irish domains into Level 1 categories, which provide a high-level view of the economic sectors represented by these domains.
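
To make the hierarchy concrete, here is a minimal Python sketch that splits a four-digit class code into its four levels. It assumes the NACE Rev. 2 layout, whose section and division counts match the figures listed above. The function name `split_nace_code` and the lookup table are illustrative only and are not part of the project's code.

```python
# Division ranges for each Level 1 section, assuming NACE Rev. 2.
# Each value is an inclusive (first, last) range of two-digit division codes.
SECTION_RANGES = {
    "A": (1, 3), "B": (5, 9), "C": (10, 33), "D": (35, 35), "E": (36, 39),
    "F": (41, 43), "G": (45, 47), "H": (49, 53), "I": (55, 56), "J": (58, 63),
    "K": (64, 66), "L": (68, 68), "M": (69, 75), "N": (77, 82), "O": (84, 84),
    "P": (85, 85), "Q": (86, 88), "R": (90, 93), "S": (94, 96), "T": (97, 98),
    "U": (99, 99),
}


def split_nace_code(code: str) -> dict:
    """Split a class code such as '62.01' into section, division, group and class."""
    division, _, rest = code.partition(".")
    if len(division) != 2 or len(rest) != 2 or not (division + rest).isdigit():
        raise ValueError(f"expected a code shaped like 'NN.NN', got {code!r}")

    number = int(division)
    # Find the section whose division range contains this division.
    section = next(
        (letter for letter, (first, last) in SECTION_RANGES.items()
         if first <= number <= last),
        None,
    )
    if section is None:
        raise ValueError(f"division {division} does not belong to any section")

    return {
        "section": section,                   # Level 1
        "division": division,                 # Level 2
        "group": f"{division}.{rest[0]}",     # Level 3
        "class": code,                        # Level 4
    }


levels = split_nace_code("62.01")
print(levels)
```

Since the project targets Level 1 only, the `section` letter is the label that matters here; the lower levels are shown just to illustrate how the codes nest.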

Why NACE Classification Matters

In today’s digital landscape, where businesses and services flourish online, it’s essential to understand the economic activities happening on the internet. Our project focuses on classifying Irish domains into the Level 1 NACE categories. By doing this, we can gain insights into the digital landscape of Irish businesses. Ireland’s CSO (Central Statistics Office) already tracks business statistics for each category in this classification, and our work aims to compare and understand how these online businesses fit within the broader economic trends in Ireland. In particular, we want to explore questions such as:

  • Under-represented sectors: Are there industry sectors that are under-represented in the digital world compared to registered businesses in Ireland?

  • Growth trends: Are there industries growing faster in the real world than online, or vice versa?

  • Industry evolution: How do industries evolve in the domain world?

  • Targeting sectors: Can we identify sectors that could benefit from adopting .ie domains?

  • Domain activity and renewal: Which industries are more likely to keep their domains active or renew them?

  • Security stance: Can we correlate industry sectors with their online security practices?

This classification project allows us to bridge the gap between real-world statistics recorded by the CSO and the digital activities present today, providing valuable insights into the evolving online economy.

Our Solution: Building a Predictive Model for Classification

Problem Statement:
We decided to build a predictive model that could classify domains into NACE industry codes based on data collected from websites. Spoiler alert: We successfully achieved this, but the journey was full of experimentation, challenges, and iterative improvements.

Let’s begin with our initial approach to developing the model and the first challenges we encountered.

Initial Approach:

We started by crawling Irish domains with our custom-built crawler, designed to gather a wide range of data from websites, including text, titles, descriptions, and domain names. Our initial focus was on a dataset of approximately 4,440 high-content websites, which we manually labelled according to NACE categories. Using this labelled data, with features created from webpage metadata descriptions, we trained the first version of our model. This model achieved an overall classification accuracy of 87%, though some NACE categories didn’t perform as well as expected.
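
An overall accuracy figure can hide weak categories, which is why it helps to break accuracy down per NACE section. The sketch below shows one simple way to do that in Python. The function name `accuracy_by_category` and the sample labels are made up for illustration; they are not the project's evaluation code or data.

```python
from collections import defaultdict


def accuracy_by_category(true_labels, predicted_labels):
    """Return (overall accuracy, {category: accuracy}) for two parallel label lists."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")

    totals = defaultdict(int)    # number of samples per true category
    correct = defaultdict(int)   # number of correct predictions per true category
    for truth, predicted in zip(true_labels, predicted_labels):
        totals[truth] += 1
        if truth == predicted:
            correct[truth] += 1

    overall = sum(correct.values()) / len(true_labels)
    per_category = {label: correct[label] / totals[label] for label in totals}
    return overall, per_category


# Made-up Level 1 labels, for illustration only.
truth = ["G", "G", "G", "G", "J", "J", "Q", "Q"]
preds = ["G", "G", "G", "G", "J", "J", "G", "Q"]

overall, per_category = accuracy_by_category(truth, preds)
print(f"overall: {overall:.3f}")
for label, accuracy in sorted(per_category.items()):
    print(f"{label}: {accuracy:.2f}")
```

In this toy example the overall figure looks healthy while section Q is right only half the time, which is the kind of gap a per-category view exposes.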

Additionally, we collaborated with other European ccTLDs via CENTR on the WebCat project, developed by DNS Belgium. We tested the WebCat model on our dataset but observed a lower accuracy of 67%.

Existing Data and Scope

Key Challenges

As with any project, we faced challenges early on. Here are some of the most critical hurdles we encountered:

  • Data Quality: Our first major challenge was determining if the webpage metadata description was sufficient for feature generation, and if not, which combinations of data fields from our crawler data were most effective for generating features.

  • Training Samples: We realised that we didn’t have enough labelled samples for some NACE categories. This limitation hindered the accuracy of our model for certain sectors. To overcome this, we explored a few strategies to boost the quantity and quality of our training data.
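
A quick way to spot the training-sample problem is to count labelled samples per category and flag those that fall below a chosen minimum. The Python sketch below illustrates the idea. The function name `under_sampled_categories`, the `min_samples` threshold, and the sample labels are assumptions for illustration, not values from the project.

```python
from collections import Counter


def under_sampled_categories(labels, all_categories, min_samples):
    """Return {category: count} for categories with fewer than min_samples labels."""
    counts = Counter(labels)
    # Counter returns 0 for categories that never appear in the labelled data.
    return {
        category: counts[category]
        for category in all_categories
        if counts[category] < min_samples
    }


# The 21 Level 1 sections, A to U.
SECTIONS = list("ABCDEFGHIJKLMNOPQRSTU")

# Made-up labels, for illustration only.
labels = ["G"] * 5 + ["J"] * 3 + ["A"]

# min_samples=3 is an arbitrary threshold chosen for this example.
sparse = under_sampled_categories(labels, SECTIONS, min_samples=3)
print(sorted(sparse.items()))
```

Passing the full list of sections matters: a category with no labelled samples at all would otherwise never show up in the counts.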

Overcoming the Challenges

Our initial exploration and analysis showed that we needed more labelled data, in both quantity and quality, for feature development.

To build an effective predictive model, we pursued two main strategies:

  • Improving Data Quality:

    • We analysed various data fields collected by the crawler to determine which combinations of data led to the best classification performance.

    • After multiple iterations, we found that the following combination of text-based data fields was the most effective:

      • Domain Name: The unique string used to identify and access a website.

      • Text: The raw text extracted from the home page of the website after crawling.

      • Description: A brief summary of the website found in its metadata.

      • Title: Another important metadata field that indicates the title of the webpage. Both the description and title are typically included by website creators to enhance Search Engine Optimisation (SEO).

  • Boosting Training Sample Size:

    • We explored third-party APIs for domain labels but found they didn’t offer significant value towards our labelling efforts.

    • The predictions of the WebCat model from the Belgian registry, mentioned earlier, proved to be a good tool for identifying more samples, and the predictions themselves could also be used to assist annotation (more on this in blog post 2).

    • The most effective methodology was leveraging manual annotations performed by Team Xavier here at .ie. The Prodigy tool helped us label many training samples efficiently, and these went on to be used for the model.

All these methodologies and findings combined laid the foundation for building a robust model that significantly improved our classification accuracy, leading to enhanced performance across all NACE categories.
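
As a rough illustration of how the four text fields listed above might be merged into a single input for a text classifier, here is a minimal Python sketch. The record keys (`domain_name`, `title`, `description`, `text`), the helper names, and the `max_text_chars` cut-off are hypothetical. The post does not describe how the fields were actually combined, so this is one plausible approach and not the project's implementation.

```python
import re


def domain_tokens(domain_name):
    """Split a domain name into word-like tokens, dropping the final label (the TLD)."""
    name = domain_name.lower().rsplit(".", 1)[0]
    # Split on anything that is not a letter or digit (hyphens, dots, underscores).
    return [token for token in re.split(r"[\W_]+", name) if token]


def build_feature_text(record, max_text_chars=2000):
    """Concatenate domain tokens, title, description and home page text into one string."""
    parts = [
        " ".join(domain_tokens(record.get("domain_name") or "")),
        record.get("title") or "",
        record.get("description") or "",
        # Truncate long home page text so it does not drown out the metadata.
        (record.get("text") or "")[:max_text_chars],
    ]
    # Normalise whitespace within each part and skip missing or empty fields.
    return " ".join(" ".join(part.split()) for part in parts if part.strip())


# Made-up crawler record, for illustration only.
record = {
    "domain_name": "dublin-bike-repairs.ie",
    "title": "Dublin Bike Repairs",
    "description": None,
    "text": "  We fix\nbikes.  ",
}
print(build_feature_text(record))
```

Missing fields are simply skipped, so a site with no metadata description still yields usable input from its domain name, title, and page text.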

What’s Next in the Blog Series

In the next blog post, we will dive deeper into the data annotation process, sample selection, and feature engineering. We will share the challenges we faced in picking effective samples and how these steps played a crucial role in refining our predictive model. Stay tuned to discover the tools and techniques that helped us move closer to our final model.

Final Thoughts

This post sets the stage for an exciting journey into NACE classification based on the analysis of web data. By classifying domains into industry categories, we will uncover valuable insights into Ireland’s online business landscape. The following posts will further explore the technical aspects and development of the predictive model.