Data labelling, also called data tagging, is the process of attaching descriptive information to data points so that machine learning (ML) algorithms can better understand their meaning. It’s used to train machine learning models so that computing systems can output accurate information for use in analytics and business decision-making.
Differences Between Data Labelling and Data Annotation
Data labelling and data annotation are similar but serve different purposes. Both terms are used interchangeably in some circumstances but are not the same process. Feeding a machine-learning model data is not enough for a computer to understand how to analyse and process it. Annotations and labelling describe data so that these algorithms can decipher it.
Annotations in machine learning are metadata used to describe the data. Machine learning uses large quantities of unstructured data to output meaningful information, and annotations provide every element of input information used by computing processes. For example, a picture with various elements uses annotations to define identifiable objects in the picture so that algorithms can understand and identify the same elements in future input.
Labelling is similar, but it’s used to define data types. Input into an algorithm could be text or a picture, but a computing system doesn’t know the difference between input types unless you tell it. Data labelling tags both input types so that algorithms can distinguish between the two and use them to establish patterns. In a picture, it tells algorithms what type of data is present, such as a human or an animal. Labelling data is critical in Natural Language Processing (NLP) to help algorithms identify aspects of human communication, including words spoken, accents, and dialects.
What Is Data Labelling Used For?
Data labelling is used in several computer applications. It’s necessary for NLP, computer vision, and speech recognition. Although it’s primarily used in these three applications, data labelling also is used in smaller proprietary applications built for corporate analytics and consumer products.
In computer vision, data labelling helps algorithms identify items within a picture. Users enter text to describe an image search, and data labelling helps algorithms identify elements of an image to return relevant results. Computer vision uses labelling and annotations to pinpoint items in images.
Tagging elements of phrases or words in NLP helps algorithms identify nuances in the way humans communicate. Labels assigned to text allow NLP algorithms to recognise special characters and use the same colloquialisms and phrases as humans with specific dialects or accents. Corporations use labelled text for spam detection, chatbots, and virtual assistants.
Speech recognition is required for products that take speech input and either convert it to text or perform a specific action. Transcription applications use labels to turn the speech in video input into text, and home automation systems use them to interpret a spoken command and perform the matching action.
How Does Data Labelling Work?
Machine learning uses supervised or unsupervised models. Data labelling is a component of supervised machine learning, the most-used method currently. In supervised models, input is labelled and mapped to an output. Humans define labels that apply to data, so supervised models require human input.
Labelled data is fed to algorithms, and the output is reviewed. If the output is not as expected, the labels are reviewed and potentially changed, and the revised data is fed to the algorithms again to get different output. Machine learning depends heavily on the accuracy of human-applied data labels for analysis and accurate output.
For example, a machine learning application might be asked to identify cars in thousands of pictures. Humans go through each picture and identify which ones have cars in them and label them as such. Machine learning algorithms are fed the same images with the labels attached and identify patterns to recognise cars in future picture input. The accuracy of the output depends on the accuracy of the data labelling process.
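The car example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two numeric features per picture are hypothetical stand-ins for whatever a real system would extract from the pixels, and the nearest-centroid method is one simple supervised technique chosen for brevity, not the method any particular product uses.

```python
from collections import defaultdict
from math import dist


def train_centroids(examples):
    """Average the feature vectors of each label to get one centroid per label."""
    sums = {}
    counts = defaultdict(int)
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {
        label: tuple(value / counts[label] for value in total)
        for label, total in sums.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest to the new feature vector."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Hypothetical hand-made features per picture:
# (number of wheel-like shapes, number of window-like shapes).
# The second element of each pair is the human-applied label.
labelled_pictures = [
    ((4.0, 6.0), "car"),
    ((4.0, 4.0), "car"),
    ((0.0, 1.0), "no_car"),
    ((0.0, 3.0), "no_car"),
]

model = train_centroids(labelled_pictures)
print(predict(model, (3.0, 5.0)))  # a new, unlabelled picture
```

If the human labels were wrong (for instance, the first two pictures tagged "no_car"), the centroids would shift and the prediction for new pictures would change with them, which is why label accuracy matters so much.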
Methods of Data Labelling
Data scientists approach the data labelling process in five different ways. Each approach has its benefits and drawbacks, but most data scientists use a preferred method to label data based on the application. Most machine learning applications use thousands (potentially millions) of data points, so the process of data labelling is tedious and time-consuming. It must be done correctly for output to be accurate, so it’s also essential for the labelling process to be thorough.
The type of approach to data labelling is determined by project complexity and size. The five methods used in data labelling are:
- In-house: for organisations with a team of data scientists that label the data. This approach is also the cheapest because it requires people already on the payroll and familiar with labelling. In-house data scientists are more efficient and can work directly with the labelling process to improve functionality. However, not every business has a budget for a full in-house data science team.
- Outsourcing: for organisations that don’t have a dedicated data science team, outsourcing to a third party is an option. Data scientists working as independent contractors help your organisation label data and facilitate a process that would not be possible in-house. Organisations can build a temporary team of independent contractors, so there’s no long-term commitment. The disadvantage of this method is that your temporary team will need training and help to integrate with your internal procedures.
- Crowdsourcing: if you’ve ever identified images in a CAPTCHA to verify that you’re a human, you’ve experienced crowdsourced data labelling. Using a system that gathers potentially thousands of people, an organisation can leverage the internet to label data for machine learning models. The disadvantage of crowdsourcing is quality control. Platforms offer a solution for finding crowdsourced individuals, but the quality of participants varies widely, and mistakes are almost guaranteed.
- Synthetic: data scientists generate computer-made “fake” data that has the attributes needed for labelling and use it to stand in for “real” data. Generative adversarial networks (GANs) use neural networks that “compete”: one creates fake data, another compares it to real data, and the results are used to determine the correct data labels. Labels are created from pre-existing datasets, which makes this method more efficient in certain projects. The downside is that it takes large amounts of computing power, which can make it a more expensive option.
- Programmed: labelling data using custom scripts, typically created by data scientists for accuracy and efficiency. Scripts are more efficient than human labellers and can be more accurate than crowdsourcing, but they still require a quality assurance system to ensure no mistakes are made.
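As a rough illustration of the programmed method, the sketch below labels text messages with a keyword rule and then draws a random sample for a human to check. The keyword list, label names, and function names are all hypothetical and exist only for this example; a real labelling script would use rules suited to its own data.

```python
import math
import random

# Hypothetical rule set; real projects would tune this to their data.
SPAM_KEYWORDS = ("free money", "click here", "winner")


def label_message(text):
    """Apply a simple keyword rule to assign a label to one message."""
    lowered = text.lower()
    if any(keyword in lowered for keyword in SPAM_KEYWORDS):
        return "spam"
    return "not_spam"


def label_dataset(messages):
    """Label every message and return (message, label) pairs."""
    return [(message, label_message(message)) for message in messages]


def sample_for_review(labelled, fraction=0.1, seed=0):
    """Pick a reproducible random subset for human quality assurance."""
    rng = random.Random(seed)
    size = max(1, math.ceil(len(labelled) * fraction))
    return rng.sample(labelled, size)


messages = [
    "You are a WINNER, claim your prize",
    "Meeting moved to 3pm",
    "Click here for free money",
    "Invoice attached for last month",
]
labelled = label_dataset(messages)
review_batch = sample_for_review(labelled, fraction=0.5)
```

The script labels thousands of items in the time a person labels a handful, but the review sample is still needed because a rule such as this one will mislabel any legitimate message that happens to contain a listed phrase.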
Importance of Data Labelling
Computers are only as smart as humans program them to be. Without data labels, they would be unable to perform machine learning and artificial intelligence necessary for modern applications. Data labelling is a component of supervised learning, so it’s common for data scientists to label their data as a part of machine learning development.
The most time-consuming part of a machine-learning project is data preparation. The efficiency and accuracy of the preparation process determine the accuracy of the results. Understandably, data labelling is one of the most critical components in machine learning because mistakes or poor labelling can lead to unusable applications. In severe situations, mistakes can have catastrophic consequences that affect business continuity and revenue.
Types of Data Labelling
While data labelling methods determine how your organisation performs the function, there are three different types of data labelling scientists can choose from. The type chosen depends on the project, so it’s important to choose the right one to get accurate results from machine-learning applications.
Three types of data labelling are:
- Computer vision: machine learning is used to identify objects in pictures, but algorithms need data labels to find these objects. Data labels define the type of image (e.g., travel or personal) or can be used to identify objects within the picture. A picture could contain a dozen different objects, so data labels are boxes surrounding a specific object with text to describe it. Every object has a bounding box with a label to define it. After labelling images, machine learning takes the model and uses it to automatically categorise images or identify objects within images.
- Natural language processing (NLP): text applications use NLP labels to identify words or phrases to work with human written communication. NLP can also be used with computer vision to identify text in an image. Machine learning uses NLP to categorise text, identify languages, transcribe videos, or determine intent. For example, customer service applications use chat boxes to answer common customer service questions on ecommerce websites.
- Audio processing: audio labelling transcribes voice content to text or labels sounds in audio content so that machine-learning algorithms can recognise them. Tagged sounds are often used in speech applications or in applications that react to particular sounds or decibel levels (e.g., alarms that use the sound of breaking glass to identify a break-in). The tags identifying sounds are used as the dataset for training machine-learning algorithms.
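To make the three types more concrete, here is one possible way to represent each kind of label as a small Python record. The class names, field names, and example values are assumptions made for illustration; labelling tools each define their own formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBoxLabel:
    """A labelled rectangle in an image; (x, y) is the top-left pixel."""
    label: str
    x: int
    y: int
    width: int
    height: int

    def contains(self, px: int, py: int) -> bool:
        """Return True if pixel (px, py) falls inside the box."""
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


@dataclass(frozen=True)
class TextSpanLabel:
    """A labelled range of characters in a piece of text."""
    label: str
    start: int
    end: int

    def text(self, source: str) -> str:
        """Return the labelled characters from the source text."""
        return source[self.start:self.end]


@dataclass(frozen=True)
class AudioSegmentLabel:
    """A labelled time range in an audio recording."""
    label: str
    start_seconds: float
    end_seconds: float

    @property
    def duration(self) -> float:
        return self.end_seconds - self.start_seconds


# Hypothetical example values for each label type.
car_box = BoundingBoxLabel("car", x=10, y=20, width=100, height=50)
order_span = TextSpanLabel("product_reference", start=12, end=17)
glass_sound = AudioSegmentLabel("glass_breaking", 1.5, 2.25)

print(car_box.contains(50, 40))
print(order_span.text("Where is my order?"))
print(glass_sound.duration)
```

In all three cases the label is the same idea: a human-readable tag attached to a region of the input, whether that region is measured in pixels, characters, or seconds.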
Benefits of Data Labelling
With data labelling, you’re in control of the output. Accurate data labelling means accurate data output. For organisations that need people to perform data labelling, having a good process is critical to the success of your machine-learning project.
A few benefits of data labelling include:
- Data accuracy: the method used to label data directly impacts the accuracy of results.
- Quality: data labelling enhances the quality of your machine-learning applications.
- Better results: better results mean application users are more effective at their jobs.
- Uncover business opportunities: accurate data labelling with analytics helps businesses define revenue-generating opportunities.
Challenges of Data Labelling
As with any data project, labelling data has its challenges. Businesses must be able to overcome these challenges to build effective applications with accurate results.
A few challenges with data labelling projects:
- Managing a labelling workforce: especially in crowdsourcing and outsourcing, businesses must manage human labellers, train them, and hire quality assurance people to oversee results.
- Keeping consistent quality: the datasets used to build models must have quality data to produce accurate results. Data scientists must take time to review datasets to ensure that they have the correct data to build the target application.
- Financial costs: several methods are cost-effective, but data scientists and analytics are expensive, especially if an organisation uses synthetic or programmatically-generated data.
- Data privacy: models should not be built with private data protected by compliance regulations. Also, data must not introduce bias and should stay objective in results.
- Tooling: some data science tools are expensive, and machine-learning algorithms can also be costly.
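One common way to handle the workforce and quality challenges is to have several people label the same item and accept a label only when enough of them agree. The sketch below shows that idea with a simple majority vote; the function name and the 0.6 agreement threshold are assumptions chosen for illustration.

```python
from collections import Counter


def aggregate_votes(votes, min_agreement=0.6):
    """Combine several labellers' votes for one item.

    Returns (label, agreement). The label is None when the share of
    labellers backing the top answer is below min_agreement, which
    signals that the item should go to a reviewer.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes)
    label, top_count = counts.most_common(1)[0]
    agreement = top_count / len(votes)
    if agreement < min_agreement:
        return None, agreement
    return label, agreement


# Three labellers mostly agree, so the majority label is accepted.
accepted_label, accepted_share = aggregate_votes(["car", "car", "truck"])

# Two labellers split evenly, so the item is flagged for review.
flagged_label, flagged_share = aggregate_votes(["car", "truck"])
```

Items that come back with no label are good candidates for the quality assurance staff mentioned above, since disagreement often points to an ambiguous item or unclear labelling instructions.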
Best Practices with Data Labelling
To get the best results, your data scientists should follow best practices in data labelling. Here are a few ways that organisations can improve the quality of results and the accuracy of data models:
- Determine if machine learning is viable: not every project should use machine learning, so make sure your project is best suited for machine learning.
- Use at least 5000 data points: good results require thousands of data points to build a model, and experts recommend at least 5000. Accuracy improves with more data points.
- Store all representative data: collect and store as many data points as possible so that you can return to them should you need to make changes or improve labels.
- Store tangentially related data: perhaps you want to scale applications to cover related analytics, so storing this data will make it easier to scale.
- Keep backups: system failure can ruin a project, but having backups will make recovery faster and easier.
- Think in scale: as the organisation grows, more analytics might be needed, or changes to data models might be necessary. Store and use data for future purposes.
- Audit: occasionally audit data and labels for quality assurance.
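An audit can be as simple as having a reviewer relabel a sample of items and comparing the result with what is stored. The following sketch assumes labels are kept in dictionaries keyed by an item identifier; the function name and the sample data are hypothetical.

```python
def audit_labels(stored, reviewed):
    """Compare stored labels with a reviewer's labels for an audited sample.

    Both arguments map item id -> label. Returns the agreement rate and
    a dict of disagreements as item id -> (stored label, reviewer label).
    """
    if not reviewed:
        raise ValueError("the audited sample must not be empty")
    disagreements = {
        item_id: (stored.get(item_id), reviewer_label)
        for item_id, reviewer_label in reviewed.items()
        if stored.get(item_id) != reviewer_label
    }
    agreement_rate = 1 - len(disagreements) / len(reviewed)
    return agreement_rate, disagreements


# Hypothetical stored labels and a reviewer's second opinion.
stored_labels = {"img1": "car", "img2": "car", "img3": "no_car", "img4": "car"}
reviewer_labels = {"img1": "car", "img2": "no_car", "img3": "no_car", "img4": "car"}

rate, mismatches = audit_labels(stored_labels, reviewer_labels)
print(rate)
print(mismatches)
```

Tracking the agreement rate across audits gives a rough signal of whether label quality is holding steady or drifting as the dataset grows.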
Data Labelling and Cybersecurity
Protecting data should be a priority for any organisation, and machine learning is used in the cybersecurity industry. Both play a part in safe and effective machine-learning analytics. Cybersecurity applications that leverage machine learning often use data labels to help identify viruses and malware, determine suspicious traffic patterns, trigger alerts during user account anomalies, and analyse traffic for suspicious egress and ingress data transfers.
Data labelling helps consumers choose the right IoT devices and works with IoT to build physical security in homes and businesses. Security cameras, for example, can detect people in real-time videos to identify if an organisation is currently experiencing a break-in.
For the security of data itself, it’s essential to use access controls when outsourcing and crowdsourcing projects. It can be difficult to ensure security with a large, outsourced workforce, but it’s also critical to protect data from theft and stay compliant. Any tooling in the cloud should also be compliant and protected from data theft.
Use Cases
Data labelling is necessary for supervised machine-learning projects, but not every machine-learning project is supervised. However, supervised machine learning benefits some applications.
A couple of use cases include:
- Computer vision: data labelling is used for deep learning models for cloud and edge computing, enabling them to work with several industries. For example, manufacturing uses images and machine learning to identify issues with production, eliminate errors, and determine when machines could be damaged.
- Natural Language Processing (NLP): speech recognition and understanding text can only be accomplished with good data labelling. For example, companies that provide speech recognition for home automation use NLP to understand accents and human speech to control various appliances and IoT.