Replication Data for: "A Topic-based Segmentation Model for Identifying Segment-Level Drivers of Star Ratings from Unstructured Text Reviews"

Version 1.0

Kim, Sunghoon; Lee, Sanghak; McCulloch, Robert, 2024, "Replication Data for: "A Topic-based Segmentation Model for Identifying Segment-Level Drivers of Star Ratings from Unstructured Text Reviews"", https://doi.org/10.7910/DVN/EE3DE2, Harvard Dataverse, V1, UNF:6:nkFFI1aNdeZLqrOOqW+UhQ== [fileUNF]

Dataset Metrics

278 Downloads

Description	We provide instructions, code, and datasets for replicating the article by Kim, Lee and McCulloch (2024), "A Topic-based Segmentation Model for Identifying Segment-Level Drivers of Star Ratings from Unstructured Text Reviews." The repository also provides a user-friendly R package that lets researchers and practitioners apply the topic-based segmentation model for unstructured text (latent class regression with group variable selection) to their own datasets.

First, we provide R code to replicate the illustrative simulation study: see file 1. Second, we provide the user-friendly R package with a very simple example script to help apply the model to real-world datasets: see file 2, Package_MixtureRegression_GroupVariableSelection.R and Dendrogram.R. Third, we provide code and instructions to replicate the empirical studies of customer-level segmentation and restaurant-level segmentation with Yelp review data: see files 3-a, 3-b, 4-a, 4-b. Note that, because of Yelp's dataset terms of use and the restriction on data size, we provide a link to download the same Yelp datasets instead of the data themselves (https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset/versions/6). Fourth, we provide code and datasets to replicate the empirical study with professor ratings review data: see file 5. More details are given in the description text and comments of each file.

[A guide on how to use the code to reproduce each study in the paper]

1. Full codes for replicating Illustrative simulation study.txt -- [see Table 2 and Figure 2 in main text]: R source code to replicate the illustrative simulation study. Please run it from beginning to end in R. In addition to estimated coefficients (posterior means of coefficients), indicators of variable selection, and segment memberships, you will get the dendrograms of selected groups of variables shown in Figure 2. Computing time is approximately 20 to 30 minutes.

3-a. Preprocessing raw Yelp Reviews for Customer-level Segmentation.txt: Code for preprocessing the downloaded unstructured Yelp review data and preparing the DV and IV matrices for the customer-level segmentation study.

3-b. Instruction for replicating Customer-level Segmentation analysis.txt -- [see Table 10 in main text; Tables F-1, F-2, and F-3 and Figure F-1 in Web Appendix]: Code for replicating the customer-level segmentation study with Yelp data. You will get estimated coefficients (posterior means of coefficients), indicators of variable selection, and segment memberships. Computing time is approximately 3 to 4 hours.

4-a. Preprocessing raw Yelp reviews_Restaruant Segmentation (1).txt: R code for preprocessing the downloaded unstructured Yelp data and preparing the DV and IV matrices for the restaurant-level segmentation study.

4-b. Instructions for replicating restaurant-level segmentation analysis.txt -- [see Tables 5, 6 and 7 in main text; Tables E-4 and E-5 and Figure H-1 in Web Appendix]: Code for replicating the restaurant-level segmentation study with Yelp data. You will get estimated coefficients (posterior means of coefficients), indicators of variable selection, and segment memberships. Computing time is approximately 10 to 12 hours.

[Guidelines for running Benchmark models in Table 6]

Unsupervised topic model: 'topicmodels' package in R. After determining the number of topics (e.g., with the 'ldatuning' R package), run the 'LDA' function in the 'topicmodels' package. Then compute topic probabilities per restaurant (with the 'posterior' function in the package), use them as predictors, and conduct prediction with regression.

Hierarchical topic model (HDP): 'gensimr' R package. Use the 'model_hdp' function in the package to identify topics (see https://radimrehurek.com/gensim/models/hdpmodel.html or https://gensimr.news-r.org/).

Supervised topic model: 'lda' R package. Use the 'slda.em' function for training and 'slda.predict' for prediction.

Aggregate regression: the default 'lm' function in R.

Latent class regression without variable selection: 'flexmix' function in the 'flexmix' R package. Run flexmix with a fixed number of segments (e.g., 3 segments in this study). Then, with the estimated coefficients and memberships, predict the dependent variable for each segment.

Latent class regression with variable selection: 'Unconstraind_Bayes_Mixture' function in the package of Kim, Fong and DeSarbo (2012). Run the Kim et al. (2012) model with a fixed number of segments (e.g., 3 segments in this study). Then, with the estimated coefficients and memberships, predict the dependent variable for each segment. The R package ('KimFongDeSarbo2012.zip') can be downloaded at: https://sites.google.com/scarletmail.rutgers.edu/r-code-packages/home

5. Instructions for replicating Professor ratings review study.txt -- [see Tables G-1, G-2, G-4 and G-5, and Figures G-1 and H-2 in Web Appendix]: Code to replicate the professor ratings review study. Computing time is approximately 10 hours.

[A list of the versions of R, packages, and computer specification used in the paper]

R version: 4.3.1 (2023-06-16 ucrt)
Package versions: MCMCpack_1.6-3 / coda_0.19-4 / mvtnorm_1.2-3 / MASS_7.3-60 / tm_0.7-11 / hunspell_3.0.3 / cluster_2.1.4
Processor: 11th Gen Intel(R) Core(TM) i7-11700 @ 2.50GHz
Installed RAM: 32.0 GB (31.6 GB usable)

[Disclaimer]

We developed these R packages and code to be as user-friendly as possible. However, basic familiarity with the R language and a good understanding of statistics are recommended for properly importing the data, running the model, and interpreting the results. There is no built-in protection against misuse.

[Technical Help or Problem Report]

We expect the R packages and code to be easy to use and to work well. Please send any questions or bug reports to Sunghoon Kim (E-mail: Sunghoon.Kim@rutgers.edu).

Note: another repository is available on GitHub: https://github.com/Sunghoon-cloud/Topic-based-Clustering (2024-03-03)
Subject	Business and Management; Computer and Information Science; Mathematical Sciences; Social Sciences
Keyword	Group Variable Selection, Topic-based Segmentation, Unstructured Text Analysis
Related Publication	Kim, Sunghoon, Sanghak Lee, and Robert McCulloch (2024) “Topic-Based Segmentation for Identifying Segment-Level Grouped Variables from Unstructured Text Reviews,” Journal of Marketing Research (published online April 1, 2024), https://doi.org/10.7910/DVN/EE3DE2
License/Data Use Agreement	CC BY-NC 4.0
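
The benchmark guidelines in the description mention an aggregate regression fitted with 'lm' and segment-level prediction "with estimated coefficients and memberships". The following is only a minimal sketch of that comparison on simulated data, not the authors' replication code. The helper names (simulate_segments, predict_by_segment, rmse) are hypothetical, and the true simulated memberships stand in for memberships that a latent class model would estimate.

```r
# Minimal sketch on simulated data; all helper names are hypothetical.
set.seed(1)

simulate_segments <- function(n = 300, p = 4, K = 3) {
  X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
  z <- sample.int(K, n, replace = TRUE)               # segment memberships
  B <- matrix(rnorm((p + 1) * K, sd = 2), p + 1, K)   # intercept + slopes per segment
  y <- rowSums(cbind(1, X) * t(B[, z])) + rnorm(n, sd = 0.5)
  list(X = X, y = y, z = z, B = B)
}

# Predict y_i = x_i' beta_{z_i}: each observation uses its own segment's coefficients.
predict_by_segment <- function(X, B, z) {
  rowSums(cbind(1, X) * t(B[, z, drop = FALSE]))
}

rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))

d <- simulate_segments()
train <- 1:200
test <- 201:300
df <- data.frame(y = d$y, d$X)

# Aggregate regression benchmark: one coefficient vector for all observations.
fit_agg <- lm(y ~ ., data = df[train, ])
rmse_agg <- rmse(df$y[test], predict(fit_agg, newdata = df[test, ]))

# Segment-level prediction: one regression per segment, fitted on training rows.
# The true memberships d$z are an assumption standing in for estimated ones.
B_hat <- sapply(1:3, function(k) {
  coef(lm(y ~ ., data = df[train, ][d$z[train] == k, ]))
})
rmse_seg <- rmse(df$y[test], predict_by_segment(d$X[test, ], B_hat, d$z[test]))

print(c(aggregate = rmse_agg, segment_level = rmse_seg))
```

With real data, B_hat and the membership vector would come from the fitted latent class model rather than from per-segment 'lm' fits on known segments.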

Files (1 to 10 of 12 listed)
1. Full codes for replicating Illustrative simulation study.txt
Plain Text, 9.4 KB; published May 7, 2024; 30 downloads; MD5: 2349f2e99e692743a03e1e0e996f0b71
R source code to replicate the illustrative simulation study in Kim et al. (2024). Please run it from beginning to end in R. [see Table 2 and Figure 2 in main text]

2. Instruction script to run Topic-based Segmentation Packages.txt
Plain Text, 2.9 KB; published May 7, 2024; 22 downloads; MD5: da719456a67c457e02ecf64f455bed23
Instructions for running the two packages "Package_Topic-based_Segmentation.R" and "Package_Dendrogram_Dimensions.R" on simple sample data, to help users apply the model to their own datasets.

3-a. Preprocessing raw Yelp Reviews for Customer-level Segmentation.txt
Plain Text, 2.8 KB; published May 7, 2024; 24 downloads; MD5: b5269db1295a250d29a002eca226b2be
R code for preprocessing the downloaded Yelp data and building the DV and IV matrices for the customer-level segmentation study.

3-b. Instruction for replicating Customer-level Segmentation analysis.txt
Plain Text, 1.5 KB; published May 7, 2024; 22 downloads; MD5: 484ac766c74c30e75d53479d111dc08b
Instruction script for replicating the customer-level segmentation study with Yelp data. [see Table 10 in main text; Tables F-1, F-2, and F-3 and Figure F-1 in Web Appendix]

4-a. Preprocessing raw Yelp reviews_Restaruant Segmentation (1).txt
Plain Text, 6.0 KB; published May 7, 2024; 20 downloads; MD5: cab73f15dca503e73cf8f12572e1c5e4
R code for preprocessing the downloaded Yelp data and building the DV and IV matrices for the restaurant-level segmentation study.

4-b. Instruction for replicating restaurant-level segmentation analysis.txt
Plain Text, 1.6 KB; published May 7, 2024; 21 downloads; MD5: a05f6d813f60ef8d1326408c8550de9e
Instruction script for replicating the restaurant-level segmentation study with Yelp data. [see Tables 5, 6 and 7 in main text; Tables E-4 and E-5 and Figure H-1 in Web Appendix]

5. Instructions for replicating Professor ratings review study.txt
Plain Text, 1.3 KB; published May 7, 2024; 27 downloads; MD5: 832368230d315dace3b8dd01e9c9d38a
Instruction script to replicate the professor ratings review study. [see Tables G-1, G-2, G-4 and G-5, and Figures G-1 and H-2 in Web Appendix]

Dendrogram.R
R Syntax, 2.1 KB; published May 7, 2024; 24 downloads; MD5: b7c201f9a8f772f3c838e177202aa4f4
Package for generating dendrograms used to identify topics. [see Figure 3 in main text; Figures F-1 and G-1 in Web Appendices]

Package sample_data.tab
Tabular Data, 15.3 KB; published May 7, 2024; 29 downloads; 21 variables, 300 observations; UNF:6:nkFFI1aNdeZLqrOOqW+UhQ==
Sample data for running the model package. The true parameter values for this sample data are given in "2. Instruction script to run Topic-based Segmentation Packages.txt". Available as comma-separated values (original file format), tab-delimited, or RData.

Package_MixtureRegression_GroupVariableSelection.R
R Syntax, 7.0 KB; published May 7, 2024; 23 downloads; MD5: 296880d85040dd034500af0a571b4fd7
Package for topic-based segmentation (i.e., latent class regression with group variable selection). [see Tables 5, 6, 7, and 10 in main text; Tables E-4, E-5, F-1, F-2, F-3, G-1, G-2, G-4 and G-5 in Web Appendices]
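
Each file entry above lists an MD5 checksum, which can be used to confirm that a download is intact. The sketch below uses 'md5sum' from base R's 'tools' package; the helper name verify_md5 is hypothetical, and the commented usage line assumes the file was saved under its listed name in the working directory.

```r
# Hypothetical helper: compare a file's MD5 checksum with an expected value.
verify_md5 <- function(path, expected) {
  stopifnot(file.exists(path))
  actual <- unname(tools::md5sum(path))
  identical(tolower(actual), tolower(expected))
}

# Self-contained demonstration on a temporary file.
tmp <- tempfile(fileext = ".txt")
writeBin(charToRaw("hello\n"), tmp)
demo_hash <- unname(tools::md5sum(tmp))
print(verify_md5(tmp, demo_hash))   # the file matches its own checksum

# Usage with a downloaded file and the MD5 listed on this page (path is an assumption):
# verify_md5("Dendrogram.R", "b7c201f9a8f772f3c838e177202aa4f4")
```

A FALSE result would suggest a truncated or altered download, or a file saved in a different format than the one the checksum was computed on.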

Citation Metadata

Persistent Identifier	doi:10.7910/DVN/EE3DE2
Publication Date	2024-05-07
Title	Replication Data for: "A Topic-based Segmentation Model for Identifying Segment-Level Drivers of Star Ratings from Unstructured Text Reviews"
Alternative URL	https://doi.org/10.7910/DVN/EE3DE2
Other Identifier	ScholarOne: JMR
Author	Kim, Sunghoon (Rutgers, The State University of New Jersey; ORCID 0009-0005-0280-9094); Lee, Sanghak (Arizona State University); McCulloch, Robert (Arizona State University)
Point of Contact	Kim, Sunghoon (Rutgers, The State University of New Jersey)
Depositor	Kim, Sunghoon
Deposit Date	2024-02-15
Software	R, Version: 4.3.1 or higher
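
Before running the replication scripts, it may help to compare the local environment with the R and package versions listed in the description. This is a minimal sketch using only base R; the helper name check_versions is hypothetical, and a package that is not installed is simply reported as NA rather than installed automatically.

```r
# Package versions as listed in the dataset description.
listed <- c(MCMCpack = "1.6-3", coda = "0.19-4", mvtnorm = "1.2-3",
            MASS = "7.3-60", tm = "0.7-11", hunspell = "3.0.3",
            cluster = "2.1.4")

# Hypothetical helper: report installed versions next to the listed ones.
check_versions <- function(listed) {
  installed <- vapply(names(listed), function(pkg) {
    if (nzchar(system.file(package = pkg))) {
      as.character(utils::packageVersion(pkg))
    } else {
      NA_character_                      # package not installed
    }
  }, character(1))
  ok <- vapply(seq_along(listed), function(i) {
    !is.na(installed[[i]]) &&
      utils::compareVersion(installed[[i]], listed[[i]]) >= 0
  }, logical(1))
  data.frame(package = names(listed), listed = unname(listed),
             installed = unname(installed), at_least_listed = ok,
             row.names = NULL)
}

# The metadata asks for R 4.3.1 or higher.
r_ok <- getRversion() >= "4.3.1"
versions <- check_versions(listed)
print(r_ok)
print(versions)
```

Newer package versions will usually work, but exact numerical agreement with the published tables is most likely under the listed versions.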

Dataset Terms

License/Data Use Agreement

Our Community Norms as well as good scientific practices expect that proper credit is given via citation. Please use the data citation shown on the dataset page.

CC BY-NC 4.0
