Tutorial: Estimating the Accuracy of MTurk Batches

Amazon Mechanical Turk (MTurk) is a powerful way to get fast, high-quality annotations from Workers, but how do you know whether the results you’re getting back are accurate? Our previous tutorials, on finding Workers who will be good at your task and on reconciling multiple Worker responses, discussed techniques for ensuring good results. In this tutorial, we’ll explain how to combine the concepts from those tutorials to estimate the accuracy of a batch.

Much of this tutorial relies on Microsoft Excel or Google Sheets to manage the results of your tasks. To view the sample spreadsheet we will be working from, you can download it here.

Including known answers

In our tutorial on finding Workers who will be good at your task, we started by building a set of known answers. We will now use that same set of known answers to estimate the accuracy of our results.

In our next batch of one thousand images, we’re going to add the fifty images we labeled earlier. After publishing the batch and downloading the results, we apply the same process we used in the reconciliation tutorial to produce final answers for this batch. Note that this batch contains images 4001 to 5000 plus our fifty known images, whose IDs fall between 123 and 3029.
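As a rough sketch of how the known images might be mixed into a batch outside the spreadsheet, the Python below combines the two sets of image IDs. The function name `build_batch`, the shuffling step, and the four known IDs shown are illustrative assumptions; the tutorial itself only says that the fifty known images are added to the batch.

```python
import random


def build_batch(new_ids, known_ids, seed=0):
    """Combine new and known image IDs into one list.

    Shuffling is an assumption made for this sketch, so that known
    items are not grouped together when the batch is published.
    """
    batch = list(new_ids) + list(known_ids)
    random.Random(seed).shuffle(batch)
    return batch


new_ids = range(4001, 5001)          # images 4001 to 5000
known_ids = [123, 137, 2050, 3029]   # illustrative subset of the fifty known images
batch = build_batch(new_ids, known_ids)
print(len(batch))
```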

In this case, we didn’t get an agreed answer for image 2050 from our known answer set so we’ll exclude it from our analysis.

Checking Results

Now we can check our batch against our set of known answers. To do this, go to your KnownAnswers sheet and add a column called Batch Answer, using a formula similar to the following to look up each image’s answer for this batch in your FinalAnswers sheet.

=VLOOKUP(A2,FinalAnswers!A:B,2,FALSE)
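If you prefer to work outside a spreadsheet, the same exact-match lookup can be sketched in Python with dictionaries keyed by image ID. The function name `lookup_batch_answers` and the labels below are hypothetical; where VLOOKUP would show #N/A for a missing image, this sketch returns None.

```python
def lookup_batch_answers(known_answers, final_answers):
    """Exact-match lookup on image ID, like VLOOKUP(..., FALSE).

    Returns None for any known image that has no final answer.
    """
    return {image_id: final_answers.get(image_id) for image_id in known_answers}


# Hypothetical labels, for illustration only.
known_answers = {123: "cat", 137: "dog", 2050: "cat"}
final_answers = {123: "cat", 137: "cat", 4001: "dog"}  # no agreed answer for 2050

batch_answers = lookup_batch_answers(known_answers, final_answers)
print(batch_answers)
```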

Most of the answers were correct; the exception is image 137.

In fact, if we go back to our Results sheet, we see that for image 137, Worker A2WWWWWWWW, who previously showed as being in disagreement, was the only one whose answer matched our known answer.

Since we’re still early in our testing of this task, we’ll want to take another look at this image. Either our known answer is wrong, or the image is open to more than one reasonable interpretation and our instructions need to be updated to provide more clarity.

To check whether each of our results is correct, we’ll add another column called Correct to our KnownAnswers sheet with the following formula.

=B2=C2

This will simply display TRUE if the values match and FALSE if they do not.

Remember, in our example we didn’t get an agreed answer for image 2050, so we’ll exclude it from our analysis. That leaves 49 responses, 48 of which are correct. Dividing 48 by 49 yields an estimated batch accuracy of about 98%.
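The same arithmetic can be sketched in Python. The function name `batch_accuracy`, the item IDs, and the "yes"/"no" labels are hypothetical; the data is constructed only to reproduce the counts above (fifty known items, one without an agreed answer, one incorrect).

```python
def batch_accuracy(known_answers, batch_answers):
    """Return (correct, scored, accuracy), skipping items with no batch answer."""
    scored = [k for k in known_answers if batch_answers.get(k) is not None]
    if not scored:
        raise ValueError("no known items received a batch answer")
    correct = sum(1 for k in scored if batch_answers[k] == known_answers[k])
    return correct, len(scored), correct / len(scored)


# Hypothetical data: 50 known items, all labeled "yes".
known_answers = {item_id: "yes" for item_id in range(50)}
batch_answers = dict(known_answers)
del batch_answers[0]        # no agreed answer, so this item is excluded
batch_answers[1] = "no"     # the one incorrect answer

correct, scored, accuracy = batch_accuracy(known_answers, batch_answers)
print(f"{correct}/{scored} = {accuracy:.0%}")  # prints "48/49 = 98%"
```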

Next steps

The more known answers you include in your batch, the more confidence you can have in your accuracy score. Over time it’s also good to expand your known answer set to include new items so that it accurately reflects the state of the data you are working with.
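One way to make "more confidence" concrete is to put an interval around the accuracy estimate. The sketch below uses the Wilson score interval, a standard statistical technique that this tutorial does not itself prescribe; the function name `wilson_interval` and the default z = 1.96 (roughly 95% confidence) are assumptions made for illustration.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# 48 correct out of 49 known answers, as in the example above.
low, high = wilson_interval(48, 49)
print(f"estimated accuracy between {low:.1%} and {high:.1%}")

# The same accuracy measured on ten times as many known answers
# gives a narrower interval.
low_big, high_big = wilson_interval(480, 490)
print(f"with 490 known answers: {low_big:.1%} to {high_big:.1%}")
```

With the same observed accuracy, the interval narrows as the number of known answers grows, which is the effect described above.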

We hope you found this to be a helpful introduction to measuring batch accuracy. If you have any questions, please post them to our MTurk forums. To become a Requester, sign up here. Want to contribute as a Worker customer? Get started here.
