Overview
This repository contains two datasets that were collected and processed as part of a study on public perception of environmental issues and climate change in Ukraine. The datasets are derived from Ukrainian Telegram news channels and include metadata, raw text, and user reactions to posts related to climate events and environmental topics. These datasets are intended to support academic research on the relationship between public discourse, user sentiment, and climate indicators.
The datasets are located in the data folder, organized into subfolders by file extension: csv and parquet. If you read climate_text_data_final in CSV format, set the encoding to utf-16.
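A minimal sketch of loading either format with pandas. The exact layout (data/csv and data/parquet, with file names matching the dataset names) is an assumption inferred from the description above, and load_dataset is a hypothetical helper, not part of the repository.

```python
from pathlib import Path

import pandas as pd


def load_dataset(name: str, fmt: str = "parquet", data_dir: str = "data") -> pd.DataFrame:
    """Load <data_dir>/<fmt>/<name>.<fmt> (assumed repository layout)."""
    path = Path(data_dir) / fmt / f"{name}.{fmt}"
    if fmt == "parquet":
        # Needs a Parquet engine such as pyarrow installed alongside pandas.
        return pd.read_parquet(path)
    if fmt == "csv":
        # The CSV version of climate_text_data_final is UTF-16 encoded.
        # Assumption: any other CSV file uses the pandas default (UTF-8).
        encoding = "utf-16" if name == "climate_text_data_final" else "utf-8"
        return pd.read_csv(path, encoding=encoding)
    raise ValueError(f"Unsupported format: {fmt!r}")
```

For example, load_dataset("climate_text_data_final", fmt="csv") would read the text dataset with the required encoding, provided the assumed paths match the repository.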
Datasets
- climate_text_data_final
This dataset contains raw text data from Telegram posts, along with additional metadata. It provides a comprehensive view of the content and context of climate-related discussions. The dataset can be joined with final_reactions_data on the channel_name and message_id columns.
Please ensure the encoding is set to utf-16 when reading the CSV format of the dataset.
Key Features:
- Post ID: Unique identifier for each Telegram post.
- Channel Name: The name of the Telegram channel where the post was published.
- Text: The raw text of the Telegram post.
- Metadata: Includes timestamp, number of views, and number of forwards.
Purpose: This dataset is designed to support natural language processing (NLP) tasks, such as topic modeling, named entity recognition, and sentiment analysis. It provides a foundation for understanding the themes and narratives surrounding climate change and environmental issues in the Ukrainian online information space.
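The join with final_reactions_data on channel_name and message_id can be sketched with pandas as follows. The helper name join_text_and_reactions is hypothetical, and the one-to-one validation encodes an assumption (each post appears once per dataset) that should be checked against the actual files.

```python
import pandas as pd

# Both keys are used together for the join, as described above.
JOIN_KEYS = ["channel_name", "message_id"]


def join_text_and_reactions(
    texts: pd.DataFrame, reactions: pd.DataFrame, how: str = "left"
) -> pd.DataFrame:
    """Attach reaction columns to each post in the text dataset."""
    # validate="one_to_one" raises pandas.errors.MergeError if a key pair
    # is duplicated on either side (assumption: one row per post).
    return texts.merge(reactions, on=JOIN_KEYS, how=how, validate="one_to_one")
```

With how="left", posts that have no row in the reactions dataset are kept and their reaction columns are filled with NaN.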
- final_reactions_data
This dataset contains user reactions to Telegram posts, represented as emoji counts. It provides a detailed view of how users engage with climate-related content.
Key Features:
- Post ID: Unique identifier for each Telegram post.
- Channel Name: The name of the Telegram channel where the post was published.
- Emoji Reactions: Columns representing counts of various emojis used to react to the post.
- Is NA: A boolean flag that distinguishes posts whose emoji reaction columns are all NaN from posts with at least one non-NA value.
Purpose: This dataset enables researchers to analyze user sentiment and engagement with climate-related content. It can be used to identify patterns in public reactions to environmental issues and to assess the emotional tone of the discourse. To reduce dimensionality, the emojis can be grouped into categories and analyzed as a combined representation, and statistics can then be computed for each emoji category to characterize user engagement patterns.
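One way to group emoji columns into categories is sketched below. The category names and the emojis assigned to them are purely illustrative assumptions, as is the premise that each emoji column is named by the emoji itself; the actual column names and a suitable taxonomy should be taken from the data.

```python
import pandas as pd

# Hypothetical taxonomy; not defined by the dataset.
EMOJI_CATEGORIES = {
    "positive": ["👍", "🔥"],
    "negative": ["👎", "😢"],
}


def aggregate_emoji_categories(
    reactions: pd.DataFrame, categories: dict = EMOJI_CATEGORIES
) -> pd.DataFrame:
    """Sum per-emoji counts into one column per category."""
    out = pd.DataFrame(index=reactions.index)
    for category, emojis in categories.items():
        # Ignore emojis that have no column in this DataFrame.
        present = [e for e in emojis if e in reactions.columns]
        # min_count=1 keeps NaN when every emoji in the category is missing,
        # so "no data" is not confused with a count of zero.
        out[category] = reactions[present].sum(axis=1, min_count=1)
    return out
```

Per-category statistics can then be obtained with standard calls such as aggregate_emoji_categories(reactions).describe().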
Research Context
The datasets were collected as part of a study aimed at understanding public attitudes toward environmental issues and exploring the relationship between public perception and climate indicators, especially in the period of the full-scale Russian aggression against Ukraine. The study focused on Telegram channels due to their popularity and influence in Ukraine. The research objectives included:
- Developing a methodology for automated data collection from Ukrainian Telegram channels on climate-related topics.
- Conducting a comprehensive analysis of the collected data using natural language processing and statistical methods to identify key topics, trends, and patterns.
- Investigating the relationship between message characteristics and user reactions to determine factors influencing public perception of environmental issues.
The study analyzed content from seven influential Telegram news channels: DW Ukraine, BBC Ukrainian, Ukrayinska Pravda, Voice of America, Radio Liberty, Babel, and ZN.UA. These channels were selected based on their audience size, credibility, and regularity of coverage of environmental issues. The data collection period spanned just over five years (01.01.2020 to 14.01.2025), allowing for an analysis of trends over time, including the impact of the Russian war in Ukraine on public discourse.
Ethical Considerations
The datasets do not contain any personally identifiable information (PII). However, we acknowledge that the dataset may contain sensitive content due to the nature of the data. Some records may describe war-related activities, destruction, harm, or other sensitive topics. We have made every effort to remain unbiased in collecting data from the selected channels and have not censored any content.
The dataset will undergo ethical clearance at Lviv Polytechnic National University to ensure compliance with ethical standards and guidelines for data collection, processing, and usage. This process aims to address potential concerns related to sensitive content and ensure the responsible use of the dataset in academic research.
Recommendations for Ethical Use:
- Fairness and Bias: Evaluate results with fairness metrics to ensure that analyses are not biased or discriminatory.
- Transparency: Use tools for interpretability and explainability to ensure transparency in machine learning models and analyses.
- Monitoring: Implement machine learning monitoring to improve observability and awareness of system performance.
- Ethical Awareness: Be mindful of the potential for sensitive, distorted, or unfair content, particularly when analyzing topics related to war or conflict.
Data Collection Methodology
To identify relevant messages, we used an approach based on the Aho-Corasick algorithm, which searches for many patterns in a single pass over the text, in time linear in the text length plus the number of matches once the automaton is built. This was critical for processing large volumes of information. A thematic dictionary was developed, containing key terms structured into five categories:
- Climate terms
- Environmental issues
- Natural resources
- Climate events
- Environmental initiatives
The algorithm was implemented in Python using the telethon library for collecting messages and the pyahocorasick library for building a finite state machine for parallel pattern search. As a result, 5,732 relevant messages related to climate change and environmental issues were identified and selected.
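To illustrate the matching step, here is a minimal pure-Python sketch of an Aho-Corasick automaton. It is not the pyahocorasick API and not the study's actual code; the class name KeywordMatcher, the lowercasing of terms and text, and the example terms are assumptions made for illustration.

```python
from collections import deque


class KeywordMatcher:
    """Minimal Aho-Corasick automaton mapping keywords to categories."""

    def __init__(self, dictionary: dict[str, list[str]]):
        # Trie stored as parallel lists; node 0 is the root.
        self.goto = [{}]   # per-node outgoing edges: char -> node id
        self.fail = [0]    # per-node failure link
        self.out = [[]]    # per-node list of (term, category) ending here
        for category, terms in dictionary.items():
            for term in terms:
                self._add(term.lower(), category)
        self._build_failure_links()

    def _add(self, term: str, category: str) -> None:
        node = 0
        for ch in term:
            if ch not in self.goto[node]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[node][ch] = len(self.goto) - 1
            node = self.goto[node][ch]
        self.out[node].append((term, category))

    def _build_failure_links(self) -> None:
        # Breadth-first traversal; depth-1 nodes keep the root as failure link.
        queue = deque(self.goto[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                # Inherit terms that end at the failure target (proper suffixes).
                self.out[child] = self.out[child] + self.out[self.fail[child]]
                queue.append(child)

    def find(self, text: str) -> list[tuple[int, str, str]]:
        """Return (start_index, term, category) for every match.

        Indices refer to the lowercased text.
        """
        node = 0
        matches = []
        for i, ch in enumerate(text.lower()):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for term, category in self.out[node]:
                matches.append((i - len(term) + 1, term, category))
        return matches
```

Under these assumptions, a message could be treated as relevant when matcher.find(text) returns at least one match, and the categories of the matched terms indicate which parts of the thematic dictionary were hit.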
(2025-06-18)