Data labelling is central to every machine learning project, especially in production settings. In research and academia, machine learning engineers often deal with research data prepared by others and mainly focus on finding new techniques to outperform existing ones.
Production settings are very different. Production data are real, may come from several different data distributions, and are prone to significant variance. In most cases, annotated data is not readily available, and that was the case for us. Our current production damage assessment engine was trained on images we carefully annotated in-house, a painful process that took over two months and taught us several lessons we want to share in this article.
How labelling fits into the machine learning lifecycle
Most descriptions of the machine learning lifecycle do not include data annotation as a core component of the process. At Curacel, we have modified the lifecycle to show how annotation fits into our machine learning pipelines.
We’ve written a script that downloads the annotation files from S3 and updates the training pipeline.
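The script itself is not reproduced here, but a minimal sketch of such a step might look like the following. It assumes the annotations come from a Ground Truth bounding box job, whose output manifest is a JSON Lines file. The function names, the `damage` label attribute, and the bucket and key in the usage note are hypothetical, and boto3 is imported lazily so the parsing helper can be tried without AWS credentials.

```python
import json


def download_manifest(bucket, key, local_path):
    """Download one annotation file from S3 (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the parser below works without boto3 installed

    boto3.client("s3").download_file(bucket, key, local_path)
    return local_path


def parse_bounding_boxes(manifest_text, label_attribute):
    """Convert an output manifest (JSON Lines) into {image_uri: [box, ...]}.

    Assumes each record holds the boxes under `label_attribute` and the
    class names under `<label_attribute>-metadata`.
    """
    boxes_by_image = {}
    for line in manifest_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        labels = record.get(label_attribute, {})
        metadata = record.get(f"{label_attribute}-metadata", {})
        class_map = metadata.get("class-map", {})
        boxes = []
        for ann in labels.get("annotations", []):
            class_id = str(ann["class_id"])
            boxes.append({
                # fall back to the numeric id when no class name is recorded
                "label": class_map.get(class_id, class_id),
                "left": ann["left"],
                "top": ann["top"],
                "width": ann["width"],
                "height": ann["height"],
            })
        boxes_by_image[record["source-ref"]] = boxes
    return boxes_by_image
```

In use, one would call `download_manifest("example-bucket", "path/to/output.manifest", "output.manifest")`, read the file, and pass its text to `parse_bounding_boxes(text, "damage")` before handing the result to the training pipeline.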
Where to find data?
Case in point: image data. Data is becoming proprietary IP for most AI companies. The idea is that the accuracy of a model can be limited by the quality and the amount of data available. We see this in the stark improvement between GPT-2 and GPT-3. In this article, we will focus on setting up an image data labelling job on AWS SageMaker.
How to use AWS Ground Truth
The first step in the process is to log in to your AWS account. If you do not have an account, you can sign up here.
While logged in, your web screen should look similar to this:
If you are a frequent user of AWS SageMaker, it should appear in your recently visited section. If this is your first time, search for “Amazon SageMaker” and click on it. It should take you to the page below.
- Creating the Labeling workforce:
Annotating a thousand images on your own naturally takes a long time. To help with this, AWS gives you the option of adding colleagues or friends to assist with the labelling job. To get them in on the fun, click on Ground Truth on the left side of the screen, then select Labeling workforces. This should take you to the screen below.
On this page, you will find three sections: Amazon Mechanical Turk, Private, and Vendor. Click on Private, and you should see a screen like the one below.
Clicking the Create private team button takes you to the page below. On this page, enter the desired team name and, in the Add worker section, keep the default “Create a new Amazon Cognito user group”. Scroll to the bottom of the page and click the Create private team button.
Once the team has been created, you will be redirected to the Labeling workforces page. This time, the name of the new team should appear in the private team section, similar to what we have below.
Click on the name of the newly created team.
Scroll down, then click on Workers. This lets you add your own email address and those of your colleagues or friends to the team. Each person added should receive an email with a link; follow it, change the password, and log in.
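If you would rather script this step than click through the console, a rough sketch with boto3 could look like the one below. Unlike the console flow, it assumes an Amazon Cognito user pool, user group, and app client already exist; the function names and every identifier passed in are hypothetical placeholders, not values from our setup.

```python
def build_workteam_request(team_name, user_pool_id, user_group, client_id,
                           description="Private image labelling team"):
    """Build the keyword arguments for SageMaker's create_workteam call."""
    return {
        "WorkteamName": team_name,
        "Description": description,
        "MemberDefinitions": [
            {
                "CognitoMemberDefinition": {
                    "UserPool": user_pool_id,   # existing Cognito user pool id
                    "UserGroup": user_group,    # existing group in that pool
                    "ClientId": client_id,      # app client id for the pool
                }
            }
        ],
    }


def create_private_team(request):
    """Send the request to SageMaker (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder can be tested offline

    response = boto3.client("sagemaker").create_workteam(**request)
    return response["WorkteamArn"]
```

Keeping the request builder separate from the API call makes it easy to inspect what will be sent before anything is created in your account.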
Note that since no job has been assigned yet, there will be no tasks to work on. This leads us to creating a labelling job.
- Create the Labeling Job:
Now that we have the workforce, we can create a job for it. On the left side of the page, as before, click on Ground Truth, then select Labeling Jobs.
The page you see should be similar to the one above. Click on Create labeling job.
Enter a name for the job. In the input data setup section, I personally leave the option as Automated data setup.
When you scroll down, the screen should be very similar to what we have above. Since this is an image annotation job, click on the Browse S3 button to go to Amazon S3. I am going to assume that you already have the images stored in an S3 bucket, so select the bucket with the desired images. If you used a script to upload your images to S3 like we did, the images will most likely be in a folder inside that bucket. Click on the bucket name to view its contents, select the folder with the images, then click on the Choose button at the bottom of the page.
You will then see the S3 location you selected. Next, specify where you want the annotations to be saved. As seen in the picture above, you can save the annotations in the same location as the input dataset or specify a new location. I prefer to specify a new location within the same bucket. For example, if we have
As the S3 input location, I will specify the output as
Note: You don’t need to create that output folder in the S3 bucket. AWS will automatically do that for you.
With that handled, select the data type, which in our case is Images, and then choose an IAM role. Click the Complete data setup button to finish setting up the input data.
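Behind the scenes, the automated data setup produces an input manifest: a JSON Lines file with one `source-ref` entry per image. If you ever need to build one yourself, a minimal sketch is shown below. The helper name, the bucket name, and the list of accepted extensions are assumptions made for illustration.

```python
import json

# Assumed set of image extensions to include; adjust to your own data.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def build_input_manifest(bucket, keys):
    """Return manifest text with one {"source-ref": ...} JSON object per line."""
    lines = []
    for key in sorted(keys):
        # skip anything that is not an image, e.g. notes or folder markers
        if key.lower().endswith(IMAGE_EXTENSIONS):
            lines.append(json.dumps({"source-ref": f"s3://{bucket}/{key}"}))
    # end with a newline only when there is at least one record
    return "\n".join(lines) + ("\n" if lines else "")
```

The resulting text can be written to a file and uploaded to S3, then selected with the manual data setup option instead of the automated one.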
We are very close to completing this process. On the same page, scroll to the Task type section, select Image as the task category, and in the task selection section choose the task you want to do. For us, that is Bounding box.
Click on the Next button. This should take you to a page where you can select the private team you created earlier.
Once this has been done, scroll down, add the desired labels (assuming you chose a bounding box task), and give your team some good and bad examples of a labelled image.
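The labels you type into the console correspond to what Ground Truth calls a label category configuration, a small JSON file that is needed if you create the job through the API instead. A minimal sketch of generating one follows; the helper name and the example labels in the usage note are hypothetical and not our actual label set.

```python
import json


def build_label_category_config(labels):
    """Return the label category configuration as a JSON string."""
    if not labels:
        raise ValueError("at least one label is required")
    if len(set(labels)) != len(labels):
        raise ValueError("labels must be unique")
    config = {
        "document-version": "2018-11-28",
        "labels": [{"label": name} for name in labels],
    }
    return json.dumps(config, indent=2)
```

For example, `build_label_category_config(["dent", "scratch"])` returns a document with two label entries that can be saved to S3 and referenced when creating the job.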
Then click on the create button. That’s it. You have successfully created a data labelling job using AWS.
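Once the job is running, you may want to track how far your team has got without opening the console. The sketch below summarises the response of SageMaker's `describe_labeling_job` call. The function names and the job name in the usage note are hypothetical, and treating the total as labelled plus unlabelled plus failed objects is our own assumption.

```python
def summarise_progress(response):
    """Summarise a describe_labeling_job response as a small dict."""
    counters = response.get("LabelCounters", {})
    labelled = counters.get("TotalLabeled", 0)
    unlabelled = counters.get("Unlabeled", 0)
    failed = counters.get("FailedNonRetryableError", 0)
    total = labelled + unlabelled + failed  # assumed definition of "total"
    return {
        "status": response.get("LabelingJobStatus", "Unknown"),
        "labelled": labelled,
        "total": total,
        # avoid dividing by zero before any objects have been counted
        "fraction_done": labelled / total if total else 0.0,
    }


def fetch_progress(job_name):
    """Query SageMaker for the job (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so summarise_progress can be tested offline

    client = boto3.client("sagemaker")
    return summarise_progress(client.describe_labeling_job(LabelingJobName=job_name))
```

Calling `fetch_progress("example-labelling-job")` periodically gives a quick view of how many images are still waiting for annotation.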
Kenechi Ojukwu is a data scientist working on end to end machine learning projects at Curacel.