Tutorial: Validating Data Set Matches with MTurk

Amazon Mechanical Turk
Happenings at MTurk
Jul 17, 2017

This post steps through a solution for matching data sets with MTurk. It’s not uncommon to have to align multiple data sets, whether they are sales leads or account records. There are many techniques to make a soft match between records, but how do you make sure that the matches you’ve made across hundreds of records are correct?

For example, is this the same company?

Designing a one-to-one matching task

Let’s start with an easy example like the one above. We already have a good idea that these are the same company based on matches between a number of attributes of the records. To apply some human knowledge to this comparison, we’re going to create an MTurk task that will let Workers review each item and decide if the match is right.

To build a task, start with the Item Equality template. In this project we ask Workers to categorize a match as “Same” or “Different”.

Give the new Project a name as well as a Title and Description that will be shown to Workers:

To improve the accuracy of our results, we’ll ask three Workers to answer each match. We’ll give Workers 10 minutes to complete each task, even though we expect it to take only about 30 seconds. Since we need the data tomorrow, we’ll let the task live on the MTurk marketplace for only 1 day; after that, any work that hasn’t been completed will be removed from the site. Finally, we’ll automatically approve work after seven days.
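Asking three Workers per match means the three answers have to be combined into one decision once the work comes back. The snippet below is a minimal sketch of one common approach, a simple majority vote, and not something the MTurk site does for you. The pair IDs and answers are hypothetical placeholders.

```python
from collections import Counter


def majority_label(answers):
    """Return (label, agreement) for one match.

    `agreement` is the share of Workers who chose the winning label.
    With three Workers and two labels there is always a clear majority.
    """
    counts = Counter(answers)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(answers)


# Hypothetical answers from three Workers for two candidate matches.
votes_by_pair = {
    "pair-001": ["Same", "Same", "Same"],
    "pair-002": ["Same", "Different", "Same"],
}

for pair_id, answers in votes_by_pair.items():
    label, agreement = majority_label(answers)
    # Flag anything short of unanimous so a person can take a second look.
    flag = "" if agreement == 1.0 else " (review: Workers disagreed)"
    print(f"{pair_id}: {label}{flag}")
```

Using an odd number of Workers with two categories avoids ties, which is one reason three is a convenient choice here.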

To design our layout, we’ll first define the categories that the Workers will have to select from. You will need to edit the source code to look as follows:

To add the company information to the layout, we’ll delete the existing images and replace them with a table. If you’re not familiar with HTML, don’t fret: we’re just adding a simple table. To do so, delete the highlighted part in the image below.

In its place we’re going to paste in the following table:

<table class="table table-condensed table-striped table-responsive">
<colgroup>
<col class="col-xs-1 col-md-1" />
<col class="col-xs-3 col-md-3" />
<col class="col-xs-3 col-md-3" />
</colgroup>
<tbody>
<tr>
<th>Field</th>
<th>Company A</th>
<th>Company B</th>
</tr>
<tr>
<td>Name</td>
<td>${name_a}</td>
<td>${name_b}</td>
</tr>
<tr>
<td>Address Line 1</td>
<td>${address_line1_a}</td>
<td>${address_line1_b}</td>
</tr>
<tr>
<td>City</td>
<td>${city_a}</td>
<td>${city_b}</td>
</tr>
<tr>
<td>State</td>
<td>${state_a}</td>
<td>${state_b}</td>
</tr>
<tr>
<td>Zipcode</td>
<td>${zipcode_a}</td>
<td>${zipcode_b}</td>
</tr>
<tr>
<td>Phone</td>
<td>${phone_a}</td>
<td>${phone_b}</td>
</tr>
</tbody>
</table>

Next you will update the instructions for the Workers. You want to be detailed enough that Workers know exactly what you want them to do, but not so detailed that the instructions cause confusion. Within the code there is a section for short instructions and a section for long instructions. The short instructions are a one-sentence description of the task. The long instructions are where you provide the details of the task.

The example below shows where you would edit the information in the highlighted areas. You can put anything you want in the instructions. They don’t have to match the example.

When you click the Preview button, you should see the following. Each value wrapped in ${} is a field that you will provide in the data file you give MTurk to create your task. We’ll talk more about those later.
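To get a feel for how the ${} fields are filled in from a row of data, here is a small local sketch. It is only an analogy: Python’s string.Template happens to use the same ${name} syntax, and this is not a description of how MTurk renders tasks internally. The company names are made-up placeholders.

```python
import re
from string import Template

# One row of the layout table, copied from the HTML above.
layout_row = "<tr><td>Name</td><td>${name_a}</td><td>${name_b}</td></tr>"


def placeholder_names(layout):
    """List the distinct ${...} field names used in a layout."""
    return sorted(set(re.findall(r"\$\{(\w+)\}", layout)))


def render(layout, row):
    """Fill every ${...} field from `row`; raises KeyError if one is missing."""
    return Template(layout).substitute(row)


# Hypothetical data for one candidate match.
row = {"name_a": "Example Co", "name_b": "Example Company, Inc."}

print(placeholder_names(layout_row))
print(render(layout_row, row))
```

Because `substitute` raises an error for any field without a value, a quick local run like this can surface a missing column before the file is uploaded.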

As a final step we’ll edit the categories that a Worker can select so that the only options are Same and Different.

Preparing your data

Now we need to prepare our data for submitting to MTurk. We’ll need a data file whose column names match the fields we specified earlier in our layout. If you don’t remember what they were, you can click on your Project name and see a list of the parameters it requires.
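A quick local check can confirm that the file’s header row lines up with the layout before uploading. The sketch below assumes the twelve field names used in the table earlier; the sample header, with its misspelled column, is a hypothetical example.

```python
import csv
import io

# Field names taken from the layout table: six attributes, two companies.
FIELDS = ["name", "address_line1", "city", "state", "zipcode", "phone"]
REQUIRED = [f"{field}_{side}" for field in FIELDS for side in ("a", "b")]


def check_header(csv_text):
    """Compare the first row of a CSV with the fields the layout expects.

    Returns (missing, unexpected): required columns that are absent, and
    columns that do not correspond to any field in the layout.
    """
    header = next(csv.reader(io.StringIO(csv_text)), [])
    missing = [name for name in REQUIRED if name not in header]
    unexpected = [name for name in header if name not in REQUIRED]
    return missing, unexpected


# Hypothetical header with one misspelled column ("zip_a").
sample = (
    "name_a,name_b,address_line1_a,address_line1_b,city_a,city_b,"
    "state_a,state_b,zip_a,zipcode_b,phone_a,phone_b\n"
)
missing, unexpected = check_header(sample)
print("missing:", missing)
print("unexpected:", unexpected)
```

To check a file on disk, read it with `open(path, newline="")` and pass the text to `check_header`. An unexpected column is worth a look because it is often a misspelled field name.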

In our Excel file, each row holds all of the details for the two company records we are comparing.

Now select File->Save As in Excel and save it as a CSV file.
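If your records already live in a script or database instead of a spreadsheet, the same CSV can be written directly. This is a minimal sketch using Python’s built-in csv module; the record shown is a made-up placeholder, and the column names are the ones used in the layout above. Keeping values as text also preserves details such as leading zeros in zip codes, which spreadsheets may drop when they treat the value as a number.

```python
import csv
import io

# Column names from the layout: six attributes for each of two companies.
FIELDS = ["name", "address_line1", "city", "state", "zipcode", "phone"]
COLUMNS = [f"{field}_{side}" for field in FIELDS for side in ("a", "b")]


def to_csv(rows):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical candidate match; every value is kept as a string.
rows = [
    {
        "name_a": "Example Co",
        "name_b": "Example Company, Inc.",
        "address_line1_a": "1 Main St",
        "address_line1_b": "1 Main Street",
        "city_a": "Springfield",
        "city_b": "Springfield",
        "state_a": "MA",
        "state_b": "MA",
        "zipcode_a": "01101",
        "zipcode_b": "01101",
        "phone_a": "555-0100",
        "phone_b": "(555) 0100",
    }
]

print(to_csv(rows))
```

The csv module quotes values that contain commas, such as the second company name, so they stay in a single column. To save the result, write the returned text to a file opened with `newline=""`.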

Publish task

You’re now all set to publish your task. Click Publish batch and upload the CSV file you just created.

After your file has been processed, you will be able to Preview your task. Double-check to make sure that the information displays clearly.

When you’re satisfied with your task, you can publish it to Workers to complete. As a best practice, we recommend publishing a small portion of your file first to make sure you get the results you expect.
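One way to carve out that small first portion is to pull a random sample of rows into a pilot file and keep the rest for later. The sketch below assumes the data rows have already been read into a list (header excluded); the function name and pilot size are illustrative choices, not MTurk settings.

```python
import random


def split_pilot(rows, pilot_size, seed=0):
    """Split data rows into a random pilot batch and the remainder.

    A fixed seed makes the split repeatable. Original row order is kept
    within each part. The header row is not included here and should be
    written at the top of both output files.
    """
    rng = random.Random(seed)
    size = min(pilot_size, len(rows))
    chosen = set(rng.sample(range(len(rows)), size))
    pilot = [row for i, row in enumerate(rows) if i in chosen]
    rest = [row for i, row in enumerate(rows) if i not in chosen]
    return pilot, rest


# Stand-in for 100 data rows; real rows would be the lines of the CSV file.
rows = [f"row-{i}" for i in range(100)]
pilot, rest = split_pilot(rows, pilot_size=10)
print(len(pilot), len(rest))
```

Sampling at random, as opposed to taking the first few rows, gives a pilot that is more likely to reflect the variety in the full file.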

If you have any questions, please post them to our MTurk forums. To become a Requester, sign up here. Want to contribute as a Worker customer? Get started here.
