Tutorial: How to verify crowdsourced training data using a Known Answer Review Policy

Amazon Mechanical Turk
Happenings at MTurk
Jun 5, 2017


“Known Answers” (also called “Golden Answers”) are a common mechanism customers use to track the quality of crowdsourced tasks when building ML training data sets. Known Answers are tasks you present to Workers but for which you already know the correct answers. When you publish work on MTurk, you mix them in with the tasks whose answers are unknown. By comparing an Annotator’s (Worker’s) input with the Known Answer, you can measure performance and get a sense of the quality of their work overall.

MTurk offers an API feature called Review Policies to make this process more convenient. It allows you, the Requester, to automate the use of Known Answers when reviewing Worker results. With a Review Policy, you can:

  • Automatically approve or reject tasks submitted by Workers based on their Known Answer score. For example, if a Worker gets 4 of the 5 Known Answers in a task wrong, you may want to reject their Assignment.
  • Automatically get input from another Worker if the current Worker does not answer your Known Answer questions correctly (this is called “extending” your task).

Review Policies can save you a lot of time if you want to use the Known Answers technique. This tutorial will walk through an example of how to set it up using Python (though these concepts apply equally well to all supported languages and SDKs). Let’s get started!

Prerequisites
In order to follow along with this tutorial, you will need a Python environment with the AWS Boto3 SDK, an MTurk Requester Sandbox account, and a linked AWS account with an IAM user configured for MTurk access. This beginner’s tutorial will help you get all of this configured if you’re new to MTurk and will also give you some basic familiarity with how the API generally works.

How Review Policies Work
First, each task (HIT) that you post on MTurk can be completed by more than one Worker, with each Worker’s submission recorded as a separate Assignment. Why would you ask more than one Worker to complete the same task? Because you can then compare the results from multiple people and improve the confidence and quality of your training data set.

When creating a new task (HIT) through the MTurk API, you can specify a Review Policy as one of the parameters in the API call. MTurk then uses the specified Review Policy to automatically process the Assignments submitted by Workers for your HIT.

There are two kinds of Review Policies that can be attached to a new HIT:

Assignment-level review policy: lets you use Known Answers to vet the individual task results submitted by Workers (i.e., it lets you act on a single Assignment at a time). This is the policy we will walk through in this tutorial.

HIT-level review policy: lets you compare multiple Assignments from Workers who all complete the same task. You can specify various actions for MTurk to take automatically depending on whether the answers from multiple Workers agree with each other.
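We won’t use it today, but for reference a HIT-level policy is attached through a separate HITReviewPolicy parameter on the same create_hit call. Here is a minimal sketch, assuming the SimplePlurality/2011-09-01 policy and the parameter names described in the MTurk Review Policies docs:

# A minimal sketch of a HIT-level policy (not used in this tutorial),
# assuming the SimplePlurality/2011-09-01 policy from the MTurk docs.
hit_level_policy = {
    'PolicyName': 'SimplePlurality/2011-09-01',
    'Parameters': [
        # Extend the HIT with another Assignment while Workers disagree...
        {'Key': 'ExtendIfHITAgreementScoreIsLessThan', 'Values': ['100']},
        # ...up to a maximum of 3 Assignments in total
        {'Key': 'ExtendMaximumAssignments', 'Values': ['3']},
        # Approve Workers whose answers match the plurality answer
        {'Key': 'ApproveIfWorkerAgreementScoreIsAtLeast', 'Values': ['100']}
    ]
}
# Attached with create_hit(..., HITReviewPolicy=hit_level_policy)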

Today we will use an Assignment-level review policy. We will create a HIT with 1 Assignment, asking 1 Worker to annotate 3 images of animals with a yes/no answer about whether the animal shown is a bird. One of these images will have a Known Answer. If the Worker gets that image wrong, we will have MTurk automatically extend the HIT to one more Worker.

Let’s start by setting up our code. In your working folder, create a new file called “create.py” and connect to the MTurk API:

import boto3

MTURK_SANDBOX = 'https://mturk-requester-sandbox.us-east-1.amazonaws.com'

mturk = boto3.client('mturk',
    aws_access_key_id="PASTE_YOUR_IAM_USER_ACCESS_KEY",
    aws_secret_access_key="PASTE_YOUR_IAM_USER_SECRET_KEY",
    region_name='us-east-1',
    endpoint_url=MTURK_SANDBOX
)

print("I have $" + mturk.get_account_balance()['AvailableBalance'] + " in my Sandbox account")

Run this file and if everything is set up correctly, you should get back:

$ python create.py
I have $10000 in my Sandbox account

Note that we pass your IAM user access key and secret key directly to the boto3.client() call here. This authenticates your calls to MTurk, but it is not the recommended way to deploy your code in production. The best practice is to store your credentials in a separate file on your local machine, so that they don’t get inadvertently shared with others. Embedding keys directly is a quick way to test things, but once you have it working, check out these guidelines on how best to manage credentials.
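For example, here is a minimal sketch of the shared-credentials approach: if you store your keys in ~/.aws/credentials (a file boto3 reads by default), you can drop the key arguments from the client call entirely:

# Contents of ~/.aws/credentials (kept out of your source code):
#
# [default]
# aws_access_key_id = PASTE_YOUR_IAM_USER_ACCESS_KEY
# aws_secret_access_key = PASTE_YOUR_IAM_USER_SECRET_KEY

import boto3

MTURK_SANDBOX = 'https://mturk-requester-sandbox.us-east-1.amazonaws.com'

# With no explicit keys, boto3 falls back to the credentials file above
# (or to environment variables / an IAM role, if configured).
mturk = boto3.client('mturk',
    region_name='us-east-1',
    endpoint_url=MTURK_SANDBOX
)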

Now, let’s define our task with the 3 multiple choice questions. Just like in our Beginner’s Tutorial, we will define the question in HTML, wrap it in a bit of XML, and save it as a file called “questions.xml”. Note that in this example we also use Twitter Bootstrap to improve the visual presentation of the HIT for Workers.

<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
<HTMLContent><![CDATA[
<!-- YOUR HTML BEGINS -->
<!DOCTYPE html>
<html><head>
<meta http-equiv='Content-Type' content='text/html; charset=UTF-8'/>
<script type='text/javascript' src='https://s3.amazonaws.com/mturk-public/externalHIT_v1.js'></script>
<link href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" rel="stylesheet" integrity="sha384-BVYiiSIFeK1dGmJRAkycuHAHRg32OmUcww7on3RYdg4Va+PmSTsz/K68vbdEjh4u" crossorigin="anonymous">
</head>
<body>
<div class="container">
<div>
Do any of the following images have one or more birds in them?
</div><br>
<form name='mturk_form' method='post' id='mturk_form' action='https://www.mturk.com/mturk/externalSubmit'>
<input type='hidden' value='' name='assignmentId' id='assignmentId'/>
<div class="panel panel-default">
<div class="panel-body">
<img src="https://s3.amazonaws.com/my-image-repo/mturk-images/parrot.jpg" height="300px"/>
</div>
<div class="panel-body">
<input type="radio" name="question_1" value="yes"> Yes<br>
<input type="radio" name="question_1" value="no"> No<br>
</div>
</div>
<hr>
<div class="panel panel-default">
<div class="panel-body">
<img src="https://s3.amazonaws.com/my-image-repo/mturk-images/flamingo.jpg" height="300px"/>
</div>
<div class="panel-body">
<input type="radio" name="question_2" value="yes"> Yes<br>
<input type="radio" name="question_2" value="no"> No<br>
</div>
</div>
<hr>
<div class="panel panel-default">
<div class="panel-body">
<img src="https://s3.amazonaws.com/my-image-repo/mturk-images/dog.jpg" height="300px"/>
</div>
<div class="panel-body">
<input type="radio" name="question_3" value="yes"> Yes<br>
<input type="radio" name="question_3" value="no"> No<br>
</div>
</div>
<p><input type='submit' id='submitButton' value='Submit' /></p>
</form>
<script language='Javascript'>turkSetAssignmentID();</script>
</div>
</body>
</html>
<!-- YOUR HTML ENDS -->
]]>
</HTMLContent>
<FrameHeight>600</FrameHeight>
</HTMLQuestion>

As you can see, we have added 3 images (a Parrot, a Flamingo, and a Dog, hosted using AWS S3 as we described in this tutorial) and paired each one with yes/no radio buttons.

Let’s assume our Known Answer is the Flamingo — we know beforehand that the answer for “question_2” should be Yes.

Now, going back to our create.py file, let’s create a HIT using the questions.xml task file and add an Assignment-level review policy:

import boto3

MTURK_SANDBOX = 'https://mturk-requester-sandbox.us-east-1.amazonaws.com'

mturk = boto3.client('mturk',
    aws_access_key_id="PASTE_YOUR_IAM_USER_ACCESS_KEY",
    aws_secret_access_key="PASTE_YOUR_IAM_USER_SECRET_KEY",
    region_name='us-east-1',
    endpoint_url=MTURK_SANDBOX
)

# Read in the questions.xml file saved in the same directory
question = open('questions.xml', 'r').read()

new_hit = mturk.create_hit(
    Title='Do these images contain birds in them?',
    Description='Please review these 3 images and mark whether any of them contain a picture of a bird',
    Keywords='images, quick, labeling',
    Reward='0.15',
    MaxAssignments=1,
    LifetimeInSeconds=172800,
    AssignmentDurationInSeconds=600,
    AutoApprovalDelayInSeconds=14400,
    Question=question,
    AssignmentReviewPolicy={
        'PolicyName': 'ScoreMyKnownAnswers/2011-09-01',
        'Parameters': [
            {'Key': 'AnswerKey', 'MapEntries': [{'Key': 'question_2', 'Values': ['yes']}]},
            {'Key': 'ApproveIfKnownAnswerScoreIsAtLeast', 'Values': ['1']},
            {'Key': 'RejectIfKnownAnswerScoreIsLessThan', 'Values': ['1']},
            {'Key': 'RejectReason', 'Values': ['Sorry, we could not approve your submission as you did not correctly identify the image containing the Flamingo.']},
            {'Key': 'ExtendIfKnownAnswerScoreIsLessThan', 'Values': ['1']}
        ]
    }
)

print("A new HIT has been created. You can preview it here:")
print("https://workersandbox.mturk.com/mturk/preview?groupId=" + new_hit['HIT']['HITGroupId'])
# Remember to modify the URL above when you're publishing
# HITs to the live marketplace.
# Use: https://worker.mturk.com/mturk/preview?groupId=

As you can see, this looks like a standard call to create a HIT, except there is now an extra “AssignmentReviewPolicy” parameter that takes a dictionary. Let’s take a look at this parameter in more detail:

{
    'PolicyName': 'ScoreMyKnownAnswers/2011-09-01',
    'Parameters': [
        {'Key': 'AnswerKey', 'MapEntries': [{'Key': 'question_2', 'Values': ['yes']}]},
        {'Key': 'ApproveIfKnownAnswerScoreIsAtLeast', 'Values': ['1']},
        {'Key': 'RejectIfKnownAnswerScoreIsLessThan', 'Values': ['1']},
        {'Key': 'RejectReason', 'Values': ['Sorry, we could not approve your submission as you did not correctly identify the image containing the Flamingo.']},
        {'Key': 'ExtendIfKnownAnswerScoreIsLessThan', 'Values': ['1']}
    ]
}

It looks a little messy, but it really just breaks down into a few simple pieces. First, the dictionary has two main keys: PolicyName and Parameters. PolicyName is always 'ScoreMyKnownAnswers/2011-09-01'.

Parameters is where it gets more interesting. It consists of an array of dictionaries, each containing specific keys. The main one is the “AnswerKey”. This is where you specify the Known Answers to MTurk:

{'Key': 'AnswerKey', 'MapEntries': [
    {'Key': 'question_2', 'Values': ['yes']}
]}

The key for each Known Answer matches the “name” attribute set in the form controls of your questions.xml file. Remember, our Flamingo had the following radio buttons attached to it:

<img src="https://s3.amazonaws.com/my-image-repo/mturk-images/flamingo.jpg" height="300px"/><...>
<input type="radio" name="question_2" value="yes"> Yes<br>
<input type="radio" name="question_2" value="no"> No<br>
<...>

Our AnswerKey thus has just one entry, which is the correct answer for the “question_2” radio button group.
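If your task had more than one Known Answer, you would simply add more MapEntries. Here is a hypothetical sketch that also treats the Parrot image (question_1, which is indeed a bird) as a Known Answer; with two Known Answers in play you would also want to revisit the threshold values used in the action parameters below, per the scoring rules in the Review Policies docs:

# Hypothetical AnswerKey with two Known Answers: the Parrot (question_1)
# and the Flamingo (question_2) are both birds, so both answers are 'yes'.
{'Key': 'AnswerKey', 'MapEntries': [
    {'Key': 'question_1', 'Values': ['yes']},
    {'Key': 'question_2', 'Values': ['yes']}
]}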

Once you’ve specified the AnswerKey, you can tell MTurk what actions to take after evaluating the results. Let’s look at the 3 actions we included in our example:

{'Key': 'ApproveIfKnownAnswerScoreIsAtLeast', 'Values': ['1']},
{'Key': 'RejectIfKnownAnswerScoreIsLessThan', 'Values': ['1']},
{'Key': 'RejectReason', 'Values': ['Sorry, we could not approve your submission as you did not correctly identify the image containing the Flamingo.']},
{'Key': 'ExtendIfKnownAnswerScoreIsLessThan', 'Values': ['1']}

The Keys of these dictionaries are largely self-explanatory:

  • ApproveIfKnownAnswerScoreIsAtLeast lets MTurk automatically approve the Worker’s assignment if they get at least one Known Answer correct (i.e. if they get the Flamingo right).
  • RejectIfKnownAnswerScoreIsLessThan conversely lets MTurk automatically reject the Worker’s assignment if they get less than one Known Answer correct (i.e. if they get the Flamingo wrong). If you include this key, you must also include the RejectReason key with an explanation that is shared with the Worker.
    Note: Instead of automatically rejecting assignments in this manner, you can also leave these keys out and manually review any assignments that don’t get automatically approved, to understand what may have happened (see the sketch after this list). Generally speaking, tasks submitted by Workers should be rejected only rarely, in cases where a Worker is clearly submitting malicious results (such as leaving everything blank, or typing in the same text over and over again).
  • Finally, ExtendIfKnownAnswerScoreIsLessThan lets MTurk automatically extend the HIT to one more Worker, and pay for one more set of results, if this Worker got the one Known Answer wrong.
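If you go the manual-review route mentioned in the note above, a minimal sketch might look like the following. It assumes the mturk client from earlier; the hit_id variable is hypothetical and could be saved from the create_hit response as new_hit['HIT']['HITId']:

# Manual review sketch: list Assignments awaiting review and approve them.
assignments = mturk.list_assignments_for_hit(
    HITId=hit_id,
    AssignmentStatuses=['Submitted']  # only Assignments not yet reviewed
)
for assignment in assignments['Assignments']:
    # The Worker's answers come back as an XML document you can inspect
    print(assignment['AssignmentId'], assignment['Answer'])
    # Approve after review; reserve rejection for clearly malicious work
    mturk.approve_assignment(
        AssignmentId=assignment['AssignmentId'],
        RequesterFeedback='Thank you for your work!'
    )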

There are a few other parameters that are supported, and you can read more about them in the docs here.

And that’s it. Once you create the HIT and Workers start submitting Assignments, the Review Policy will kick in automatically, using your Known Answers to vet results.

If you wish, you can monitor exactly what actions MTurk took on each HIT by using the list_review_policy_results_for_hit API call. This comes in handy for advanced workflows, like dynamically assigning Qualifications to Workers based on their Known Answer scores.
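Here is a minimal sketch of what that might look like, again assuming the mturk client from earlier and a saved hit_id (the response fields shown are documented with the API; treat this as illustrative):

# Sketch of inspecting Review Policy results for a HIT. Assumes the
# `mturk` client from earlier; `hit_id` is hypothetical, saved from
# the create_hit response as new_hit['HIT']['HITId'].
report = mturk.list_review_policy_results_for_hit(
    HITId=hit_id,
    PolicyLevels=['Assignment'],  # we attached an Assignment-level policy
    RetrieveResults=True,         # the Known Answer scores MTurk computed
    RetrieveActions=True          # the approve/reject/extend actions taken
)
for result in report['AssignmentReviewReport']['ReviewResults']:
    print(result['QuestionId'], result['Key'], result['Value'])
for action in report['AssignmentReviewReport']['ReviewActions']:
    print(action['ActionName'], action['Status'])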

We hope you found this tutorial helpful. If you have any questions, please post them to our MTurk forums. To become a Requester, sign up here. Want to contribute as a Worker? Get started here.
