Part 1: Document Description
|
Citation |
|
---|---|
Title: |
Replication materials for: Measuring Distances in High Dimensional Spaces Why Average Group Vector Comparisons Exhibit Bias, And What to Do About it |
Identification Number: |
doi:10.7910/DVN/YDNVSN |
Distributor: |
Harvard Dataverse |
Date of Distribution: |
2024-10-30 |
Version: |
1 |
Bibliographic Citation: |
Green, Breanna; Hobbs, William; Avila, Sofia; Rodriguez, Pedro; Spirling, Arthur; Stewart, Brandon, 2024, "Replication materials for: Measuring Distances in High Dimensional Spaces Why Average Group Vector Comparisons Exhibit Bias, And What to Do About it", https://doi.org/10.7910/DVN/YDNVSN, Harvard Dataverse, V1 |
Citation |
|
Title: |
Replication materials for: Measuring Distances in High Dimensional Spaces Why Average Group Vector Comparisons Exhibit Bias, And What to Do About it |
Identification Number: |
doi:10.7910/DVN/YDNVSN |
Authoring Entity: |
Green, Breanna (Cornell University) |
Hobbs, William (Cornell University) |
Avila, Sofia (Princeton University) |
Rodriguez, Pedro (New York University) |
Spirling, Arthur (Princeton University) |
Stewart, Brandon (Princeton University) |
Distributor: |
Harvard Dataverse |
Access Authority: |
Hobbs, Will |
Depositor: |
Code Ocean |
Holdings Information: |
https://doi.org/10.7910/DVN/YDNVSN |
Study Scope |
|
Keywords: |
Social Sciences |
Abstract: |
Analysts often seek to compare representations in high-dimensional space, e.g., embedding vectors of the same word across groups. We show that the distance measures calculated in such cases can exhibit considerable statistical bias that stems from uncertainty in the estimation of the elements of those vectors. This problem applies to Euclidean distance, cosine similarity, and other similar measures. After illustrating the severity of this problem for text-as-data applications, we provide and validate a bias correction for the squared Euclidean distance. This same correction also substantially reduces bias in ordinary Euclidean distance and cosine similarity estimates, but corrections for these measures are not quite unbiased and are (non-intuitively) bimodal when distances are close to zero. The estimators require obtaining the variance of the latent positions. We (will) implement the estimator in free software, and we offer recommendations for related work. |
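The bias described in the abstract can be illustrated with a small simulation. The sketch below is not taken from the replication capsule; it is a minimal illustration, assuming independent Gaussian noise in every dimension and two groups whose true mean vectors are identical (true squared distance of zero). The function names (`variance_of_mean`, `naive_and_corrected`, `simulate`) and all parameter values are hypothetical. The correction shown subtracts the summed estimated sampling variances of the two group means from the naive squared Euclidean distance, which follows from the standard identity that the expected naive squared distance equals the true squared distance plus those variances; the authors' own estimator may differ in detail.

```python
import random


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def variance_of_mean(rows):
    """Sum over dimensions of the estimated sampling variance of each mean element."""
    n = len(rows)
    means = mean_vector(rows)
    total = 0.0
    for j, col in enumerate(zip(*rows)):
        sample_var = sum((x - means[j]) ** 2 for x in col) / (n - 1)
        total += sample_var / n
    return total


def naive_and_corrected(group_a, group_b):
    """Naive squared distance between group means, and a variance-corrected version."""
    naive = squared_distance(mean_vector(group_a), mean_vector(group_b))
    corrected = naive - variance_of_mean(group_a) - variance_of_mean(group_b)
    return naive, corrected


def simulate(dim=50, n=20, reps=300, seed=0):
    """Average both estimators over repeated draws (hypothetical settings)."""
    rng = random.Random(seed)
    naive_sum = corrected_sum = 0.0
    for _ in range(reps):
        # Both groups share the same true mean (zero), so the true distance is 0.
        a = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
        b = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
        naive, corrected = naive_and_corrected(a, b)
        naive_sum += naive
        corrected_sum += corrected
    return naive_sum / reps, corrected_sum / reps


avg_naive, avg_corrected = simulate()
print(f"average naive squared distance:     {avg_naive:.3f}")
print(f"average corrected squared distance: {avg_corrected:.3f}")
```

Under these assumptions the naive estimator has expectation `dim * (2 / n)`, which is 5 for the settings above, even though the true distance is zero, while the corrected estimator should average close to zero. Individual corrected estimates can be negative, which is one reason applying a square root afterwards (for ordinary Euclidean distance) is less straightforward than the squared case.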
Methodology and Processing |
|
Sources Statement |
|
Data Access |
|
Other Study Description Materials |
|
Label: |
capsule-0912480.zip |
Notes: |
application/zip |
Label: |
result-5667186b-24f1-475d-8bdf-48f4cdd9f5da.zip |
Notes: |
application/zip |