Replication data for: Topic-partitioned multinetwork embeddings (doi:10.7910/DVN/GGHMFT)

Document Description

Citation

Title:

Replication data for: Topic-partitioned multinetwork embeddings

Identification Number:

doi:10.7910/DVN/GGHMFT

Distributor:

Harvard Dataverse

Date of Distribution:

2012-12-13

Version:

2

Bibliographic Citation:

Krafft, Peter; Moore, Juston; Desmarais, Bruce; Wallach, Hanna, 2012, "Replication data for: Topic-partitioned multinetwork embeddings", https://doi.org/10.7910/DVN/GGHMFT, Harvard Dataverse, V2

Study Description

Citation

Title:

Replication data for: Topic-partitioned multinetwork embeddings

Identification Number:

doi:10.7910/DVN/GGHMFT

Authoring Entity:

Krafft, Peter (Massachusetts Institute of Technology)

Moore, Juston (University of Massachusetts Amherst)

Desmarais, Bruce (University of Massachusetts Amherst)

Wallach, Hanna (University of Massachusetts Amherst)

Producer:

Bruce Desmarais

Date of Production:

2012

Distributor:

Harvard Dataverse

Distributor:

Bruce Desmarais

Access Authority:

Bruce Desmarais

Date of Deposit:

2012-12-13

Date of Distribution:

2012

Holdings Information:

https://doi.org/10.7910/DVN/GGHMFT

Study Scope

Keywords:

network analysis, topic modeling, machine learning, political science, latent space

Topic Classification:

network analysis, topic modeling, machine learning, political science

Abstract:

We introduce a joint model of network content and context designed for exploratory analysis of email networks via visualization of topic-specific communication patterns. Our model is an admixture model for text and network attributes which uses multinomial distributions over words as mixture components for explaining text and latent Euclidean positions of actors as mixture components for explaining network attributes. We validate the appropriateness of our model by achieving state-of-the-art performance on a link prediction task and by achieving semantic coherence equivalent to that of latent Dirichlet allocation. We demonstrate the capability of our model for descriptive, explanatory, and exploratory analysis by investigating the inferred topic-specific communication patterns of a new government email dataset, the New Hanover County email corpus. This work was supported in part by the Center for Intelligent Information Retrieval and in part by the NSF GRFP under grant #1122374. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsors.

Time Period:

2011-02 to 2011-02

Date of Collection:

2011-03 to 2011-06

Country:

United States

Geographic Coverage:

New Hanover County, North Carolina

Geographic Unit(s):

County

Universe:

communication networks

Kind of Data:

government email archive

Methodology and Processing

Sources Statement

Data Access

Notes:

CC0 1.0 (http://creativecommons.org/publicdomain/zero/1.0)

Other Study Description Materials

Related Materials

N/A

Related Studies

N/A

Related Publications

Citation

Title:

Peter Krafft, Juston Moore, Bruce Desmarais, Hanna Wallach. Topic-partitioned multinetwork embeddings. Advances in Neural Information Processing Systems 25. 2012.

Bibliographic Citation:

Peter Krafft, Juston Moore, Bruce Desmarais, Hanna Wallach. Topic-partitioned multinetwork embeddings. Advances in Neural Information Processing Systems 25. 2012.

Other Study-Related Materials

Label:

authors.txt

Text:

- each line gives the email address of an author
- the order of the authors corresponds to the author index and recipient columns in edge-matrix.csv

Notes:

text/plain; charset=US-ASCII

Other Study-Related Materials

Label:

edge-matrix.csv

Text:

- each line represents a document
- columns are separated by commas
- the first column gives the name of the original document location (this can also be an empty column)
- the second column gives an index between zero and the number of actors in the email network minus one (inclusive) indicating the author of that email
- there is one additional column for each actor in the email network. Each column should contain either a one (indicating that the actor is a recipient of that row's email) or a zero (indicating that the actor is not a recipient of that row's email). The order of these columns should correspond to the indices used to indicate the authors of the emails. The column for the email's author should be 0.

Notes:

text/plain; charset=US-ASCII
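
The edge-matrix.csv layout described above could be read along the following lines. This is only a minimal sketch, not code shipped with the dataset: the function name `read_edge_matrix`, the sample rows, and the assumption that the file has no header row are all illustrative.

```python
import csv
import io


def read_edge_matrix(lines, num_actors):
    """Parse edge-matrix rows into (doc_name, author_index, recipient_indices).

    Assumes (not confirmed by the codebook) that there is no header row.
    """
    emails = []
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        doc_name = row[0]            # may be an empty string
        author = int(row[1])         # index into authors.txt, zero-based
        flags = [int(x) for x in row[2:]]
        if len(flags) != num_actors:
            raise ValueError("expected one recipient column per actor")
        if not 0 <= author < num_actors:
            raise ValueError("author index out of range")
        if flags[author] != 0:
            raise ValueError("the author's own column should be 0")
        recipients = [i for i, flag in enumerate(flags) if flag == 1]
        emails.append((doc_name, author, recipients))
    return emails


# Hypothetical three-actor sample; the real file has one column per actor.
sample = io.StringIO("mail/0001.txt,0,0,1,1\n,2,1,0,0\n")
emails = read_edge_matrix(sample, num_actors=3)
```

For the real data, `num_actors` would be the number of lines in authors.txt, and an open file handle for edge-matrix.csv can be passed in place of the `StringIO` sample.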

Other Study-Related Materials

Label:

README

Text:

description of the data files

Notes:

text/plain; charset=US-ASCII

Other Study-Related Materials

Label:

vocab.txt

Text:

- each line represents a word type in the vocabulary
- the order of the words must correspond to the order of the columns in word-matrix.csv

Notes:

text/plain; charset=US-ASCII

Other Study-Related Materials

Label:

word-matrix.csv

Text:

- each line represents a document
- columns are separated by commas
- the first column gives the name of the original document location (this can also be an empty column)
- each subsequent column should contain a nonnegative number indicating the number of times the word type associated with that column occurs in that document (i.e., a vector of word counts corresponding to the word types given in vocab.txt)

Notes:

text/plain; charset=US-ASCII
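
As a rough illustration of how the word-count rows line up with the vocabulary, the sketch below pairs each count column with the word on the matching line of the vocabulary file. The function name `read_word_matrix` and the sample data are hypothetical, and the sketch assumes that counts are integers and that neither file has a header row.

```python
import csv
import io


def read_word_matrix(matrix_lines, vocab_lines):
    """Return a list of (doc_name, {word: count}) keeping only nonzero counts."""
    # One word type per line; line order matches the count columns.
    vocab = [line.strip() for line in vocab_lines if line.strip()]
    docs = []
    for row in csv.reader(matrix_lines):
        if not row:
            continue  # skip blank lines
        doc_name = row[0]                      # may be an empty string
        counts = [int(x) for x in row[1:]]     # assumption: integer counts
        if len(counts) != len(vocab):
            raise ValueError("expected one count column per vocabulary word")
        if any(c < 0 for c in counts):
            raise ValueError("word counts must be nonnegative")
        bag = {word: c for word, c in zip(vocab, counts) if c > 0}
        docs.append((doc_name, bag))
    return docs


# Hypothetical three-word vocabulary and two documents.
vocab_sample = io.StringIO("budget\nmeeting\ncounty\n")
matrix_sample = io.StringIO("mail/0001.txt,2,0,1\n,0,3,0\n")
docs = read_word_matrix(matrix_sample, vocab_sample)
```

Because each line represents a document in both matrix files, the n-th entry returned here would be expected to describe the same email as the n-th row of the edge matrix, though that alignment should be checked against the README.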