Replication data for: Topic-partitioned multinetwork embeddings (doi:10.7910/DVN/GGHMFT)

Document Description
Citation
Title:	Replication data for: Topic-partitioned multinetwork embeddings
Identification Number:	doi:10.7910/DVN/GGHMFT
Distributor:	Harvard Dataverse
Date of Distribution:	2012-12-13
Version:	2
Bibliographic Citation:	Krafft, Peter; Moore, Juston; Desmarais, Bruce; Wallach, Hanna, 2012, "Replication data for: Topic-partitioned multinetwork embeddings", https://doi.org/10.7910/DVN/GGHMFT, Harvard Dataverse, V2
Study Description
Citation
Title:	Replication data for: Topic-partitioned multinetwork embeddings
Identification Number:	doi:10.7910/DVN/GGHMFT
Authoring Entity:	Krafft, Peter (Massachusetts Institute of Technology)
	Moore, Juston (University of Massachusetts Amherst)
	Desmarais, Bruce (University of Massachusetts Amherst)
	Wallach, Hanna (University of Massachusetts Amherst)
Producer:	Bruce Desmarais
Date of Production:	2012
Distributor:	Harvard Dataverse
Distributor:	Bruce Desmarais
Access Authority:	Bruce Desmarais
Date of Deposit:	2012-12-13
Date of Distribution:	2012
Holdings Information:	https://doi.org/10.7910/DVN/GGHMFT
Study Scope
Keywords:	network analysis, topic modeling, machine learning, political science, latent space
Topic Classification:	network analysis, topic modeling, machine learning, political science
Abstract:	We introduce a joint model of network content and context designed for exploratory analysis of email networks via visualization of topic-specific communication patterns. Our model is an admixture model for text and network attributes which uses multinomial distributions over words as mixture components for explaining text and latent Euclidean positions of actors as mixture components for explaining network attributes. We validate the appropriateness of our model by achieving state-of-the-art performance on a link prediction task and by achieving semantic coherence equivalent to that of latent Dirichlet allocation. We demonstrate the capability of our model for descriptive, explanatory, and exploratory analysis by investigating the inferred topic-specific communication patterns of a new government email dataset, the New Hanover County email corpus. This work was supported in part by the Center for Intelligent Information Retrieval and in part by the NSF GRFP under grant #1122374. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsors.
Time Period:	2011-02 to 2011-02
Date of Collection:	2011-03 to 2011-06
Country:	United States
Geographic Coverage:	New Hanover County, North Carolina
Geographic Unit(s):	County
Universe:	communication networks
Kind of Data:	government email archive
Methodology and Processing
Sources Statement
Data Access
Notes:	CC0 1.0 (http://creativecommons.org/publicdomain/zero/1.0)
Other Study Description Materials
Related Materials
	N/A
Related Studies
	N/A
Related Publications
Citation
Title:	Peter Krafft, Juston Moore, Bruce Desmarais, Hanna Wallach. Topic-partitioned multinetwork embeddings. Advances in Neural Information Processing Systems 25. 2012.
Bibliographic Citation:	Peter Krafft, Juston Moore, Bruce Desmarais, Hanna Wallach. Topic-partitioned multinetwork embeddings. Advances in Neural Information Processing Systems 25. 2012.
Other Study-Related Materials
Label:	authors.txt
Text:	- each line gives the email address of an author
	- the order of the authors corresponds to the author index and the recipient columns in edge-matrix.csv
Notes:	text/plain; charset=US-ASCII
Other Study-Related Materials
Label:	edge-matrix.csv
Text:	- each line represents a document
	- columns are separated by commas
	- the first column gives the name of the original document location (this can also be an empty column)
	- the second column gives an index between zero and the number of actors in the email network minus one (inclusive) indicating the author of that email
	- there is one additional column for each actor in the email network. Each column should contain either a one (indicating that the actor is a recipient of that row's email) or a zero (indicating that the actor is not a recipient of that row's email)
	- the order of these columns should correspond to the indices used to indicate the authors of the emails
	- the column for the email's author should be 0
Notes:	text/plain; charset=US-ASCII
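The edge-matrix.csv layout described above can be read with a few lines of Python. The following is a minimal sketch, not code distributed with the dataset: the function name parse_edge_matrix, the num_actors argument, and the three-actor sample rows are all hypothetical, and it assumes the file has no header row. In practice num_actors would be the number of lines in authors.txt.

```python
import csv
import io


def parse_edge_matrix(handle, num_actors):
    """Return a list of (document name, author index, recipient indices).

    Expected row layout, per the file description:
    document name, author index, then one 0/1 flag per actor.
    """
    emails = []
    for row_number, row in enumerate(csv.reader(handle), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != num_actors + 2:
            raise ValueError(
                f"row {row_number}: expected {num_actors + 2} columns, got {len(row)}"
            )
        doc_name = row[0]  # may be an empty string
        author = int(row[1])
        flags = [int(cell) for cell in row[2:]]
        if not 0 <= author < num_actors:
            raise ValueError(f"row {row_number}: author index {author} out of range")
        if any(flag not in (0, 1) for flag in flags):
            raise ValueError(f"row {row_number}: recipient flags must be 0 or 1")
        if flags[author] != 0:
            raise ValueError(f"row {row_number}: author is marked as a recipient")
        recipients = [index for index, flag in enumerate(flags) if flag == 1]
        emails.append((doc_name, author, recipients))
    return emails


# Hypothetical sample with three actors; the second row has an empty name column.
sample = "msg1.txt,0,0,1,1\n,2,1,0,0\n"
emails = parse_edge_matrix(io.StringIO(sample), num_actors=3)
for doc_name, author, recipients in emails:
    print(repr(doc_name), author, recipients)
```

To read the real file, the same function could be called on open("edge-matrix.csv", newline=""). The checks raise an error on any row that departs from the description, for example a row whose author column is not 0.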
Other Study-Related Materials
Label:	README
Text:	description of the data files
Notes:	text/plain; charset=US-ASCII
Other Study-Related Materials
Label:	vocab.txt
Text:	- each line represents a word type in the vocabulary
	- the order of the words must correspond to the order of the columns in word-matrix.csv
Notes:	text/plain; charset=US-ASCII
Other Study-Related Materials
Label:	word-matrix.csv
Text:	- each line represents a document
	- columns are separated by commas
	- the first column gives the name of the original document location (this can also be an empty column)
	- each subsequent column should contain a nonnegative number indicating the number of times the word type associated with that column occurs in that document (i.e., each row is a vector of word counts aligned with the word types listed in vocab.txt)
Notes:	text/plain; charset=US-ASCII
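The pairing of word-matrix.csv columns with vocab.txt lines can be illustrated as follows. This is a sketch under stated assumptions rather than code shipped with the dataset: the function name parse_word_matrix, the three-word vocabulary, and the sample rows are hypothetical, the counts are assumed to be written as integers, and the file is assumed to have no header row.

```python
import csv
import io


def parse_word_matrix(handle, vocab):
    """Return a list of (document name, {word type: count}) pairs.

    Expected row layout, per the file description:
    document name, then one nonnegative count per vocabulary entry.
    Only nonzero counts are kept in each dictionary.
    """
    documents = []
    for row_number, row in enumerate(csv.reader(handle), start=1):
        if not row:
            continue  # skip blank lines
        doc_name = row[0]  # may be an empty string
        counts = [int(cell) for cell in row[1:]]  # assumes integer counts
        if len(counts) != len(vocab):
            raise ValueError(
                f"row {row_number}: expected {len(vocab)} counts, got {len(counts)}"
            )
        if any(count < 0 for count in counts):
            raise ValueError(f"row {row_number}: counts must be nonnegative")
        bag = {word: count for word, count in zip(vocab, counts) if count > 0}
        documents.append((doc_name, bag))
    return documents


# Hypothetical vocabulary, standing in for the lines of vocab.txt in file order.
vocab = ["budget", "meeting", "park"]
sample = "msg1.txt,2,0,1\n,0,3,0\n"
documents = parse_word_matrix(io.StringIO(sample), vocab)
for doc_name, bag in documents:
    print(repr(doc_name), bag)
```

Because both files have one row per document, row i of word-matrix.csv presumably describes the same email as row i of edge-matrix.csv; the first column can serve as a check wherever it is not empty.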