Part 1: Document Description
|
Citation |
|
---|---|
Title: |
Replication data for: Topic-partitioned multinetwork embeddings |
Identification Number: |
doi:10.7910/DVN/GGHMFT |
Distributor: |
Harvard Dataverse |
Date of Distribution: |
2012-12-13 |
Version: |
2 |
Bibliographic Citation: |
Krafft, Peter; Moore, Juston; Desmarais, Bruce; Wallach, Hanna, 2012, "Replication data for: Topic-partitioned multinetwork embeddings", https://doi.org/10.7910/DVN/GGHMFT, Harvard Dataverse, V2 |
Citation |
|
Title: |
Replication data for: Topic-partitioned multinetwork embeddings |
Identification Number: |
doi:10.7910/DVN/GGHMFT |
Authoring Entity: |
Krafft, Peter (Massachusetts Institute of Technology) |
Moore, Juston (University of Massachusetts Amherst) |
|
Desmarais, Bruce (University of Massachusetts Amherst) |
|
Wallach, Hanna (University of Massachusetts Amherst) |
|
Producer: |
Bruce Desmarais |
Date of Production: |
2012 |
Distributor: |
Harvard Dataverse |
Distributor: |
Bruce Desmarais |
Access Authority: |
Bruce Desmarais |
Date of Deposit: |
2012-12-13 |
Date of Distribution: |
2012 |
Holdings Information: |
https://doi.org/10.7910/DVN/GGHMFT |
Study Scope |
|
Keywords: |
network analysis, topic modeling, machine learning, political science, latent space |
Topic Classification: |
network analysis, topic modeling, machine learning, political science |
Abstract: |
We introduce a joint model of network content and context designed for exploratory analysis of email networks via visualization of topic-specific communication patterns. Our model is an admixture model for text and network attributes which uses multinomial distributions over words as mixture components for explaining text and latent Euclidean positions of actors as mixture components for explaining network attributes. We validate the appropriateness of our model by achieving state-of-the-art performance on a link prediction task and by achieving semantic coherence equivalent to that of latent Dirichlet allocation. We demonstrate the capability of our model for descriptive, explanatory, and exploratory analysis by investigating the inferred topic-specific communication patterns of a new government email dataset, the New Hanover County email corpus. This work was supported in part by the Center for Intelligent Information Retrieval and in part by the NSF GRFP under grant #1122374. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsors. |
Time Period: |
2011-02 to 2011-02 |
Date of Collection: |
2011-03 to 2011-06 |
Country: |
United States |
Geographic Coverage: |
New Hanover County, North Carolina |
Geographic Unit(s): |
County |
Universe: |
communication networks |
Kind of Data: |
government email archive |
Methodology and Processing |
|
Sources Statement |
|
Data Access |
|
Notes: |
CC0 1.0 (http://creativecommons.org/publicdomain/zero/1.0) |
Other Study Description Materials |
|
Related Materials |
|
N/A |
|
Related Studies |
|
N/A |
|
Related Publications |
|
Citation |
|
Title: |
Peter Krafft, Juston Moore, Bruce Desmarais, Hanna Wallach. Topic-partitioned multinetwork embeddings. Advances in Neural Information Processing Systems 25. 2012. |
Bibliographic Citation: |
Peter Krafft, Juston Moore, Bruce Desmarais, Hanna Wallach. Topic-partitioned multinetwork embeddings. Advances in Neural Information Processing Systems 25. 2012. |
Label: |
authors.txt |
Text: |
- each line gives the email address of an author
- the order of the authors corresponds to the author index and recipient columns in edge-matrix.csv |
Notes: |
text/plain; charset=US-ASCII |
Label: |
edge-matrix.csv |
Text: |
- each line represents a document
- columns are separated by commas
- the first column gives the name of the original document location (this column may be empty)
- the second column gives an index between zero and the number of actors in the email network minus one (inclusive), indicating the author of that email
- there is one additional column for each actor in the email network. Each column should contain either a one (indicating that the actor is a recipient of that row's email) or a zero (indicating that the actor is not a recipient of that row's email). The order of these columns should correspond to the indices used to indicate the authors of the emails. The column for the email's author should be 0. |
Notes: |
text/plain; charset=US-ASCII |
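The edge-matrix layout described above can be read with a few lines of Python. This is a minimal sketch rather than code shipped with the dataset: the function name `parse_edge_matrix`, the `num_actors` parameter, and the two sample rows are invented for illustration, and the checks simply mirror the column rules stated in the description.

```python
import csv
import io


def parse_edge_matrix(text, num_actors):
    """Parse edge-matrix rows into (doc_name, author_index, recipient_indices).

    Hypothetical helper: assumes the layout described for edge-matrix.csv,
    i.e. document name, author index, then one 0/1 flag per actor.
    """
    rows = []
    for line_no, fields in enumerate(csv.reader(io.StringIO(text)), start=1):
        if not fields:
            continue  # skip blank lines
        if len(fields) != num_actors + 2:
            raise ValueError(
                f"line {line_no}: expected {num_actors + 2} columns, got {len(fields)}"
            )
        doc_name = fields[0]  # may be an empty string
        author = int(fields[1])
        flags = [int(value) for value in fields[2:]]
        if not 0 <= author < num_actors:
            raise ValueError(f"line {line_no}: author index {author} out of range")
        if any(flag not in (0, 1) for flag in flags):
            raise ValueError(f"line {line_no}: recipient flags must be 0 or 1")
        if flags[author] != 0:
            raise ValueError(f"line {line_no}: author's own column must be 0")
        recipients = [index for index, flag in enumerate(flags) if flag == 1]
        rows.append((doc_name, author, recipients))
    return rows


# Invented sample with three actors; not taken from the actual corpus.
sample = "msg-001.txt,0,0,1,1\n,2,1,0,0\n"
edges = parse_edge_matrix(sample, num_actors=3)
```

The recipient indices returned here are positions in authors.txt, so an index can be mapped back to an email address by reading that file into a list.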
Label: |
README |
Text: |
description of the data files |
Notes: |
text/plain; charset=US-ASCII |
Label: |
vocab.txt |
Text: |
- each line represents a word type in the vocabulary
- the order of the words must correspond to the order of the columns in the word matrix file |
Notes: |
text/plain; charset=US-ASCII |
Label: |
word-matrix.csv |
Text: |
- each line represents a document
- columns are separated by commas
- the first column gives the name of the original document location (this column may be empty)
- each subsequent column should contain a nonnegative number indicating the number of times the word type associated with that column occurs in that document (i.e., a vector of word counts corresponding to the word types given in vocab.txt) |
Notes: |
text/plain; charset=US-ASCII |
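As a rough illustration of how vocab.txt and word-matrix.csv fit together, the sketch below pairs each count column with its word type. It is not part of the deposited files: `parse_word_matrix` and the sample contents are hypothetical, and it assumes the counts are whole numbers, which the description ("nonnegative number") does not strictly guarantee.

```python
import csv
import io


def parse_word_matrix(matrix_text, vocab_text):
    """Return (vocab, docs) where docs is a list of (doc_name, {word: count}).

    Hypothetical helper: assumes one word type per line in the vocabulary and
    one count column per word type, in the same order, after the name column.
    """
    vocab = [word for word in vocab_text.splitlines() if word]
    docs = []
    for line_no, fields in enumerate(csv.reader(io.StringIO(matrix_text)), start=1):
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(vocab) + 1:
            raise ValueError(
                f"line {line_no}: expected {len(vocab) + 1} columns, got {len(fields)}"
            )
        counts = [int(value) for value in fields[1:]]  # assumes integer counts
        if any(count < 0 for count in counts):
            raise ValueError(f"line {line_no}: counts must be nonnegative")
        # Keep only the words that actually occur in the document.
        nonzero = {word: count for word, count in zip(vocab, counts) if count > 0}
        docs.append((fields[0], nonzero))
    return vocab, docs


# Invented sample data; not taken from the actual corpus.
vocab_sample = "budget\nmeeting\npermit\n"
matrix_sample = "msg-001.txt,2,0,1\n,0,3,0\n"
vocab, docs = parse_word_matrix(matrix_sample, vocab_sample)
```

Storing only the nonzero counts keeps each document small, which tends to matter for email text where most vocabulary entries do not occur in a given message.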