Replication materials for: Measuring Distances in High Dimensional Spaces Why Average Group Vector Comparisons Exhibit Bias, And What to Do About it (doi:10.7910/DVN/YDNVSN)

Document Description

Citation

Title:

Replication materials for: Measuring Distances in High Dimensional Spaces Why Average Group Vector Comparisons Exhibit Bias, And What to Do About it

Identification Number:

doi:10.7910/DVN/YDNVSN

Distributor:

Harvard Dataverse

Date of Distribution:

2024-10-30

Version:

1

Bibliographic Citation:

Green, Breanna; Hobbs, William; Avila, Sofia; Rodriguez, Pedro; Spirling, Arthur; Stewart, Brandon, 2024, "Replication materials for: Measuring Distances in High Dimensional Spaces Why Average Group Vector Comparisons Exhibit Bias, And What to Do About it", https://doi.org/10.7910/DVN/YDNVSN, Harvard Dataverse, V1

Study Description

Citation

Title:

Replication materials for: Measuring Distances in High Dimensional Spaces Why Average Group Vector Comparisons Exhibit Bias, And What to Do About it

Identification Number:

doi:10.7910/DVN/YDNVSN

Authoring Entity:

Green, Breanna (Cornell University)

Hobbs, William (Cornell University)

Avila, Sofia (Princeton University)

Rodriguez, Pedro (New York University)

Spirling, Arthur (Princeton University)

Stewart, Brandon (Princeton University)

Distributor:

Harvard Dataverse

Access Authority:

Hobbs, Will

Depositor:

Code Ocean

Holdings Information:

https://doi.org/10.7910/DVN/YDNVSN

Study Scope

Keywords:

Social Sciences

Abstract:

Analysts often seek to compare representations in high-dimensional space, e.g., embedding vectors of the same word across groups. We show that the distance measures calculated in such cases can exhibit considerable statistical bias that stems from uncertainty in the estimation of the elements of those vectors. This problem applies to Euclidean distance, cosine similarity, and other similar measures. After illustrating the severity of this problem for text-as-data applications, we provide and validate a bias correction for the squared Euclidean distance. This same correction also substantially reduces bias in ordinary Euclidean distance and cosine similarity estimates, but the corrected estimates for these measures are not quite unbiased and are (non-intuitively) bimodal when distances are close to zero. The estimators require obtaining the variance of the latent positions. We (will) implement the estimator in free software, and we offer recommendations for related work.
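
The bias described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' code from the replication capsule. It relies on a standard identity for independent samples: the expected squared Euclidean distance between two estimated mean vectors equals the true squared distance plus the summed sampling variances of both means. The function names, the Gaussian data, and the settings (300 dimensions, 50 observations per group, unit variance) are assumptions chosen for illustration only.

```python
import numpy as np


def naive_sq_distance(a, b):
    """Squared Euclidean distance between the two group mean vectors."""
    diff = a.mean(axis=0) - b.mean(axis=0)
    return float(diff @ diff)


def corrected_sq_distance(a, b):
    """Naive estimate minus the estimated sampling variance of both means."""
    # Sampling variance of a group mean, summed over dimensions:
    # (sum of per-dimension sample variances) / (group size).
    var_a = a.var(axis=0, ddof=1).sum() / a.shape[0]
    var_b = b.var(axis=0, ddof=1).sum() / b.shape[0]
    return naive_sq_distance(a, b) - var_a - var_b


def simulate(true_sq_dist=0.0, dim=300, n=50, reps=500, seed=0):
    """Average the naive and corrected estimates over repeated samples."""
    rng = np.random.default_rng(seed)
    mu_a = np.zeros(dim)
    mu_b = np.zeros(dim)
    mu_b[0] = np.sqrt(true_sq_dist)  # groups differ only in the first dimension
    naive, corrected = [], []
    for _ in range(reps):
        a = rng.normal(mu_a, 1.0, size=(n, dim))
        b = rng.normal(mu_b, 1.0, size=(n, dim))
        naive.append(naive_sq_distance(a, b))
        corrected.append(corrected_sq_distance(a, b))
    return float(np.mean(naive)), float(np.mean(corrected))


if __name__ == "__main__":
    naive_mean, corrected_mean = simulate(true_sq_dist=0.0)
    print(f"true squared distance: 0.0")
    print(f"mean naive estimate:     {naive_mean:.3f}")
    print(f"mean corrected estimate: {corrected_mean:.3f}")
```

Under these assumed settings the two groups have identical true means, yet the identity above implies the naive estimate has expectation 2 * dim / n = 12, while the corrected estimate is centered on zero. The corrected estimate can be negative in individual samples, which is one reason taking its square root to recover an ordinary Euclidean distance is not straightforward.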

Other Study-Related Materials

Label:

capsule-0912480.zip

Notes:

application/zip

Other Study-Related Materials

Label:

result-5667186b-24f1-475d-8bdf-48f4cdd9f5da.zip

Notes:

application/zip