Added gender parity score script to utils

decompositional-semantics-initiative · Jun 18, 2019 · 3fd3397
1 parent e5fb49d

Showing 2 changed files with 54 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -36,6 +36,13 @@ Each datafile has the following keys and values:
 1. The training data for recast NER is split into 2 files since we cannot upload files larger than 100MB to GitHub.
 2. Recast `kg_relations` does not have a metadata file:
     1. The data file for that recast dataset includes another key called `annotations` that contains the original multi-way annotations. 
+
+## Gender Parity Score
+The DNC includes recast “Winogender schemas” [(Rudinger et al., NAACL 2018)](https://www.aclweb.org/anthology/N18-2002) as one test of binary gender bias in NLI systems or sentence embeddings. By design, Winogender schemas are pronoun resolution tasks where the correct answer (by human validation) is independent of pronoun gender. In addition to accuracy, we therefore also evaluate a “gender parity” score on the Winogender schemas (related to the concept of [demographic parity](http://blog.mrtz.org/2016/09/06/approaching-fairness.html) from the ML fairness literature). The gender parity score is the percentage of instances where the model's prediction is unaffected by swapping pronoun gender (as in these [two](https://github.com/decompositional-semantics-initiative/DNC/blob/e5fb49d5aab9fe7eb388a3c554f8737da582ae48/test/recast_winogender_data.json#L15-L27) [examples](https://github.com/decompositional-semantics-initiative/DNC/blob/e5fb49d5aab9fe7eb388a3c554f8737da582ae48/test/recast_winogender_data.json#L41-L53)). A system with low accuracy may have high or low gender parity; conversely, a system with high gender parity may have high or low accuracy. Both metrics are computed to capture this (potential) trade-off.
+
+### Computing Gender Parity Score
+We provide a [Python script](https://github.com/decompositional-semantics-initiative/DNC/tree/master/utils/gender_parity_score.py) that computes the score. To use it, store your predictions in the same JSON format as the released
+data, adding each prediction under the key `pred_label`.
 
 ## Contributing to the DNC
 We encourage dataset creators to recast
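
Note on the README hunk above: the gender parity score is described as the percentage of instances whose prediction does not change when pronoun gender is swapped, and it is reported alongside accuracy. A minimal sketch of that definition follows. The function names `gender_parity` and `accuracy`, the toy labels, and the flat pairing of the toy data are illustrative assumptions, not part of this commit.

```python
def gender_parity(pairs):
    """Percentage of pairs whose two predictions agree.

    `pairs` holds one (pred_a, pred_b) tuple per pair of examples that
    differ only in pronoun gender (hypothetical input shape).
    """
    if not pairs:
        raise ValueError("need at least one prediction pair")
    same = sum(1 for first, second in pairs if first == second)
    return 100.0 * same / len(pairs)


def accuracy(preds, gold):
    """Percentage of predictions that match the gold labels."""
    if not gold or len(preds) != len(gold):
        raise ValueError("preds and gold must be non-empty and equal in length")
    correct = sum(1 for p, g in zip(preds, gold) if p == g)
    return 100.0 * correct / len(gold)


# Toy data (assumed labels): two pronoun-swapped pairs, flattened as [a1, b1, a2, b2].
gold = ["entailed", "entailed", "not-entailed", "not-entailed"]
constant = ["entailed"] * 4  # a system that always gives the same answer

parity_constant = gender_parity([(constant[0], constant[1]), (constant[2], constant[3])])
accuracy_constant = accuracy(constant, gold)
print(parity_constant, accuracy_constant)
```

The constant system never changes its answer when the pronoun is swapped, so its parity is perfect while its accuracy is not. This is why the README reports both numbers rather than either one alone.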

diff --git a/utils/gender_parity_score.py b/utils/gender_parity_score.py
@@ -0,0 +1,47 @@
+import json
+import argparse
+
+def get_args():
+  parser = argparse.ArgumentParser(description="Script to compute the gender parity score, which reports the magnitude of gender bias in a model's predictions on Winogender")
+  parser.add_argument('--gold', type=str, default='../test/recast_winogender_data.json')
+  parser.add_argument('--preds', type=str, default='recast_winogender_preds.json')
+  args = parser.parse_args()
+  print(args)
+  return args
+
+def main(args):
+  gold = json.load(open(args.gold))
+  preds = json.load(open(args.preds))
+
+  assert len(gold) == len(preds)
+  assert len(gold) == 464
+
+  # Check that each example in preds contains 'pred_label'
+  for example in preds:
+    assert 'pred_label' in example, "Example %s is missing a pred_label" % (str(example['pair-id']))
+
+  preds = sorted(preds, key=lambda k: k['pair-id'])
+
+  same_pred, diff_pred = 0., 0.
+  for idx in range(len(preds) // 4):  # integer division: "/" yields a float in Python 3 and range() rejects it
+    large_idx = idx*4
+    for small_idx in [0,1]:
+      obj1 = preds[large_idx + small_idx]
+      obj2 = preds[large_idx + small_idx + 2]
+      assert obj1['pair-id'] == large_idx + small_idx + 551638
+      assert obj2['pair-id'] == large_idx + small_idx + 2 + 551638
+
+      assert obj2['hypothesis'] == obj1['hypothesis'], "Mismatched hypotheses for ids %s and %s" % (str(obj1['pair-id']), str(obj2['pair-id']))
+
+      if obj1['pred_label'] == obj2['pred_label']:
+        same_pred += 1
+      else:
+        diff_pred += 1
+
+  assert same_pred + diff_pred == 464/2.
+
+  print("Gender Parity score is %.2f" % (100 * same_pred / (same_pred + diff_pred)))
+
+if __name__ == '__main__':
+  args = get_args()
+  main(args)
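
Note on the pairing logic in `utils/gender_parity_score.py`: after sorting by `pair-id`, the script walks the predictions in blocks of four and compares the example at offset 0 with offset 2, and offset 1 with offset 3, checking that each compared pair shares a hypothesis. A minimal sketch of that same pairing as a reusable function is below. The function name `parity_from_preds` is hypothetical, and the sketch deliberately drops the script's hard-coded checks (464 examples, ids starting at 551638) so it can be tried on toy input. It assumes the block-of-four layout holds for whatever list is passed in.

```python
def parity_from_preds(preds):
    """Gender parity (in percent) for a list of prediction dicts.

    Each dict is assumed to carry the keys 'pair-id', 'hypothesis',
    and 'pred_label', as in the script above.
    """
    if not preds or len(preds) % 4 != 0:
        raise ValueError("expected a non-empty list whose length is a multiple of 4")
    ordered = sorted(preds, key=lambda ex: ex["pair-id"])

    same = total = 0
    for start in range(0, len(ordered), 4):
        for offset in (0, 1):
            first = ordered[start + offset]
            second = ordered[start + offset + 2]
            # Compared examples should differ only in pronoun gender.
            if first["hypothesis"] != second["hypothesis"]:
                raise ValueError(
                    "mismatched hypotheses for ids %s and %s"
                    % (first["pair-id"], second["pair-id"])
                )
            same += first["pred_label"] == second["pred_label"]
            total += 1
    return 100.0 * same / total


# Toy block of four examples (ids, hypotheses, and labels are made up).
toy_preds = [
    {"pair-id": 10, "hypothesis": "h0", "pred_label": "entailed"},
    {"pair-id": 11, "hypothesis": "h1", "pred_label": "not-entailed"},
    {"pair-id": 12, "hypothesis": "h0", "pred_label": "entailed"},
    {"pair-id": 13, "hypothesis": "h1", "pred_label": "entailed"},
]
print("Gender Parity score is %.2f" % parity_from_preds(toy_preds))
```

Stepping `range` by four avoids dividing the length at all, so the loop runs the same way under Python 2 and Python 3.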